What is statistical power?
Power is conventionally defined as the probability of rejecting the null hypothesis when the null hypothesis is false and the alternative hypothesis is true.
For example, say we have two populations whose parametric means are different.
 We sample the two populations and obtain sample means and variances.
 We carry out a statistical test to see if the means are significantly different.
 We repeat the sampling and testing many times.
In principle therefore, the power of the test is the proportion of those tests that correctly indicate that the two population means are significantly different.
In practice, as we note elsewhere, a better (and more general) definition of power is simply the probability that the test will class a specified treatment effect as significant.
Provided the statistic being tested has a 'known' distribution (e.g. normal), that test's power can be found as follows:
 Imagine that d_{A} is the distribution of your test statistic (e.g. Z) under the alternate hypothesis, H_{A}.
 Then the power of your test is simply the proportion of d_{A} which falls outside the lower and/or upper 'critical values' of your test, these being quantiles of d_{0}, which is that test statistic's distribution under the null hypothesis, H_{0}.
This works perfectly happily irrespective of the size of your treatment effect (it could be zero), but it does assume your treatment effect is fixed (cannot vary), and that the only difference between d_{0} and d_{A} is their location (so the treatment effect, δ, is μ_{A} − μ_{0}).
Of course, if d_{0} and d_{A} differ in other ways, or their distributions are unknown (or cannot be readily calculated), the only way to find the power may be empirical, in other words by simulation. In that case you repeatedly sample a defined population, apply your test to each sample, and find what proportion of results are 'significant'.
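As a concrete illustration of that simulation approach, the sketch below estimates the power of an upper one-tailed Z-test empirically. The population values used (μ_{0} = 100, μ_{1} = 105, σ = 10, n = 25) are illustrative assumptions, not figures from the text.

```python
# Empirical power by simulation: repeatedly sample the 'treatment'
# population, apply an upper one-tailed Z-test against mu0, and record
# the proportion of significant results. All numbers are illustrative.
import math
import random
from statistics import NormalDist

def simulated_power(mu0, mu1, sigma, n, alpha=0.05, n_sims=10_000, seed=42):
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)      # upper critical value of d_0
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(mu1, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
        if z > z_crit:                            # a 'significant' result
            rejections += 1
    return rejections / n_sims                    # empirical power

print(simulated_power(100, 105, 10, 25))          # close to 0.80
```

With these illustrative values the algebraic power (given later in this page) is about 0.80, so the simulated proportion should agree to within sampling error.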
Lastly, notice that if the test statistic's distribution is not continuous (smooth) but strongly discrete (stepped), employing conventional critical values can reduce the attainable power to the point of uselessness. In that situation mid-P values behave better 'on average', provided you accept your test could be either conservative or liberal.
Clearly we want the power of our statistical test to be as high as possible. So we need to know which other factors determine the power of a test:
Power will tend to be greater when:
 the effect size is large,
 the sample size is large,
 the variances of the populations being sampled are small,
 the significance level (α) is high (for example 5% compared to 1%),
 a one-tailed rather than two-tailed test is used.
Note that power can only be estimated reliably if all the assumptions for the statistical test are met.

For any particular statistical test there is a mathematical relationship between power, the level of significance, various population parameters, and the sample size. For some of the more important statistical tests we will provide the formulae for this relationship. But before we introduce the first of these (for the Z-test), we need to consider exactly what we are going to calculate using the relationship.
Estimating statistical power
There are two reasons for estimating the power of a test:
 To produce a power curve to predict how much information needs to be gathered to be reasonably sure (say 95%) you will obtain a significant result. This is a reasonable and productive exercise.
In practice one normally calculates the required sample size directly for a given desired power, rather than producing a power curve. Nevertheless it can be very useful to examine a power curve, because it can help one make a more rational decision on experimental design. Such a priori power predictions are worthwhile, although they may be criticised if they are based upon insufficient prior information (from too small a pilot study), or when too approximate (or inappropriate) a model is used to predict how the statistic to be tested is liable to vary. Somewhat perversely, referees tend to be very much more concerned about the precise mathematical model employed than about the information to which it is applied, possibly because theoretical mathematical shortcomings are easier to solve, and their refinement provides interesting career prospects for mathematical statisticians.
 To obtain additional information about data that has already been gathered and tested. Such post-hoc power predictions are controversial, and are generally not recommended, for two reasons:
 You will always find that there is not enough power to demonstrate a non-significant treatment effect. This is because the estimated power is directly related to the observed P-value; in other words, it cannot tell you any more than a precise P-value.
Despite this objection, a number of standard textbooks (such as Zar (1996) and Thrusfield (2005)) recommend that power should be calculated if a difference turns out to be non-significant, as an aid to 'interpreting that difference'. If a test has insufficient power to detect that level of difference, they suggest the result should be classed as 'inconclusive'.
Unfortunately, post-hoc power determinations have no theoretical justification and are not recommended. Power is a pre-trial concept. We should not apply a pre-experiment probability, of a hypothetical group of results, to the one result that is observed. This has been compared to trying to convince someone that buying a lottery ticket was foolish (the before-study viewpoint) after they hit a lottery jackpot (the after-study viewpoint).
 Calculating the power to demonstrate your observed treatment effect locks you into the significant / non-significant mindset, with a rigid 0.05 significance level. Once you have the data, it is better to use the precise P-value to judge the weight of evidence, and to calculate a confidence interval around the estimated effect size as a measure of the reliability of that estimate.
These points accepted, there is one form of after-the-event power calculation that can be very informative: the empirical power curve, or its equivalent P-value plot or P-value function, which is equivalent to every possible confidence interval about the observed effect size. Whatever it is called, this function estimates the relationship between the probability of rejecting the null hypothesis and the effect size, given the data at hand. For simpler models this relationship can be predicted algebraically. Alternatively, and more illuminatingly, the relationship can be estimated by 'test inversion'. Since test inversion exploits the underlying link between tests and confidence intervals, we explore this method in Unit 6.
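For the simple case of a Z statistic, the algebraic form of such a P-value function is easy to sketch. In the sketch below the observed difference d and its standard error se are hypothetical values, not figures from the text:

```python
# P-value function: for each hypothesised effect size delta0, the
# two-sided Z-test p-value given the observed difference d and its
# standard error se. The values of d and se used here are hypothetical.
from statistics import NormalDist

def p_value_function(d, se, delta0):
    z = abs(d - delta0) / se
    return 2 * (1 - NormalDist().cdf(z))

# The 95% confidence interval is the set of delta0 for which
# p_value_function(d, se, delta0) >= 0.05, i.e. d +/- 1.96 * se.
print(p_value_function(5.0, 2.0, 0.0))   # p-value for testing delta = 0
```

Plotting this function over a range of delta0 values gives the P-value plot described above; reading off where it crosses a chosen significance level recovers the corresponding confidence limits.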
Estimating required sample size for a given power
Predicting the sample size required for any particular statistical test requires values for the statistical power, the significance level, the effect size and various population parameters. You also need to specify whether the test is one-tailed or two-tailed. We will consider each of these components.
The values chosen for the statistical power and the significance level depend on the study. Conventionally, power should be no lower than 0.8 and preferably around 0.9. The commonest value used for the significance level (α) is 0.05. However, there may be good reasons to diverge from these conventional values. If it is more important to avoid a Type I error (that is, a false positive result), then one may decrease the significance level to 0.01. If it is more important to avoid a Type II error (that is, a false negative result), then one may increase the power to 0.95.
The relevant population parameters depend on the type of statistical test. If you are comparing means, you need to specify the population standard deviation. If you are comparing proportions, you need to specify the baseline or control group proportion, which in turn allows one to estimate the standard deviation. These parameters can usually be estimated from the literature or, failing that, from a pilot study. Sometimes it is necessary to re-evaluate these parameters part way through a study, although this is generally strongly disapproved of by statisticians on the grounds that it can introduce bias into the process.
The effect size (the smallest difference between the means or proportions that you consider it worthwhile to detect) is probably the most difficult parameter you have to determine because it is to some extent subjective. If one is comparing a new malarial treatment with the standard, how big an improvement is worthwhile? In deciding this one should take into account the frequency and severity of side effects, the relative cost of the new treatment, and the relative ease of administration. If the new drug is cheaper than the current one with fewer side effects, then even a small improvement in the cure rate (say 5%) is worthwhile. If it is much more expensive with similar side effects, one might consider that only a larger improvement (say 20%) would be worthwhile.
Do not just choose the effect size that gives you a convenient sample size!
Considerations about the choice of effect size should always be made explicit, a point which is not sufficiently stressed in the literature! All too often researchers do what is popularly known as the 'sample size samba', which is to modify the effect size simply to give a convenient sample size. This is very foolish because, if one then finds a smaller effect size, one is committed to saying it is not worthwhile, even if it is!

Lastly, one has to decide whether to choose a one-tailed or two-tailed test. Sometimes a one-tailed test is chosen simply as a means to reduce the required sample size, a practice strongly discouraged by statisticians. Nowadays the convention is that one should always estimate sample size for a two-tailed test, even if a one-sided test is subsequently used for the analysis.
There is one last important point!
Estimating the required sample size is never a precise science. It is always approximate, because you have to estimate (sometimes just guess) the variances of the populations involved. Hence the actual power you achieve may be well below what you intend.
It is therefore a good idea to use a somewhat larger sample size than that indicated by your power analysis.

Estimating power and sample size for the Z-test
Hypotheses and tails
We now consider how to estimate the statistical power of the Z-test for comparing a value, Q, randomly selected from a test population with true mean μ_{1}, with a known reference population mean (μ_{0}) and known standard error (σ_{d}). This standard error is assumed to be the same under both the null and the alternate hypothesis, and d = Q − μ_{0}.
 For a one-tailed test of the upper tail:
 The Null Hypothesis (H_{0}) is μ_{1} = μ_{0}
So δ = [μ_{1} − μ_{0}] = 0.
 The Alternative Hypothesis (H_{1}) is μ_{1} > μ_{0}
So δ = [μ_{1} − μ_{0}] > 0.
In other words, δ is the true difference between the null and alternate population means, and d is the difference we observe, which is an estimate of δ. We will only reject H_{0} if we observe a d lying within the upper tail of our null population. The bigger δ is compared to σ_{d}, the higher is that probability.
 For a one-tailed test of the lower tail:
 H_{0} is the same, δ = 0.
 But H_{1} is μ_{1} < μ_{0}, so δ < 0.
Here we can only reject H_{0} if d is observed in the lower tail of our null population.
 For a two-tailed test:
 H_{0} is the same.
 Under H_{1}, δ ≠ 0.
H_{0} can be rejected if d is observed in either tail of our null population.
Z Notation
To reduce the amount of computation, these comparisons are commonly performed using standardised values. Unfortunately this usually introduces some extra notation, which we would be wise to explain before proceeding.
In Unit 3 we used Z to refer to a normal probability density from a standard normal distribution. Confusingly, Z may also be used to denote randomly selected locations within that distribution. Predefined values (usually quantiles) within that distribution are indicated using a lower-case z with a subscript.
z_{α} (or +z_{α}) is the location of the critical value for α, above which lie 100α% of the null population.
 For an ordinary one-tailed significance test of the upper tail, α = 0.05 and +z_{α} = +1.645.
 Because this distribution is symmetrical, for the lower tail −z_{α} = −1.645.
 For a two-tailed comparison, assuming a probability of α/2 in each tail and α = 0.05, then −z_{α/2} = −1.960 and +z_{α/2} = +1.960.
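These quantiles can be reproduced with any normal inverse-CDF routine; Python's statistics.NormalDist is used below purely as one convenient illustration:

```python
# Standard normal quantiles used as critical values.
from statistics import NormalDist

z = NormalDist()                      # standard normal: mean 0, sd 1
alpha = 0.05
z_alpha = z.inv_cdf(1 - alpha)        # one-tailed critical value
z_alpha2 = z.inv_cdf(1 - alpha / 2)   # two-tailed critical value
print(round(z_alpha, 3), round(z_alpha2, 3))   # 1.645 1.96
```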
Accordingly, if we standardise the difference between means by dividing by the population standard error of d (σ_{d}), then z_{δ} = δ/σ_{d} = [μ_{1} − μ_{0}]/σ_{d}.
Power formulae
For the three tests listed above the probability of correctly rejecting the null hypothesis, with a predefined α, is as follows:
Algebraically speaking:
a. For a one-tailed test, using the upper tail (treatment effect positive):
Power (1 − β) = P[Z > ( +z_{α} − z_{δ} )]
For example, if z_{δ} = +z_{α}, then half of all randomly selected results will exceed +z_{α}, causing H_{0} to be rejected, so the power (1 − β) will be 0.5.
b. For a one-tailed test, using the lower tail (treatment effect negative):
Power (1 − β) = P[Z < ( −z_{α} − z_{δ} )] = 1 − P[Z > ( −z_{α} − z_{δ} )]
Similarly, if z_{δ} = −z_{α}, then half of all randomly selected results will fall below −z_{α}, so the power (1 − β) will be 0.5.
c. For a two-tailed test, using both tails:
Power (1 − β) = P[Z > ( +z_{α/2} − z_{δ} )] + 1 − P[Z > ( −z_{α/2} − z_{δ} )]
Given which, if z_{δ} = −z_{α/2} or z_{δ} = +z_{α/2}, then a little more than half of all randomly selected values will cause H_{0} to be rejected. If α/2 is larger, or z_{δ} is smaller, the difference in power compared with the one-tailed formula is rather greater.
In all three cases, if δ = 0, then (1 − β) = α, which is the probability of a Type I error when H_{0} is true.
Where:
 P is a probability, determined from the cumulative normal distribution as the proportion of the standard normal distribution greater than, or less than, Z. This can be obtained from the probability calculator in your statistical package. If you are using tables, some give the proportion of the distribution that is less than Z, whilst others give the proportion that is greater than Z. Another variant gives the probability from zero to Z, in which case you have to add 0.5 to get the correct value.
 Z is the standardised normal deviate,
 z_{α} is the location of the critical value for α, above which lie 100α% of the null population  and is obtained from your probability calculator or tables, given that P(Z < z_{α}) = 1α and α is the significance level.
 z_{δ} = δ/σ_{d} = [μ_{1} − μ_{0}]/σ_{d}
 μ_{0} is the reference population mean (under H_{0}),
 μ_{1} is the test population mean
 σ_{d} is the population standard error of d. For a Z-test, σ_{d} = σ/√n: the standard error of the reference population mean, which is usually calculated as the standard deviation of the reference population observations (σ) divided by the square root of the number of observations in the sample (n).
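The three formulae above can be transcribed directly into code. The sketch below does so; the values of μ_{0}, μ_{1}, σ and n used in the example call are illustrative assumptions, not figures from the text:

```python
# Power of the Z-test from formulae (a)-(c) above. Inputs are illustrative.
import math
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05, tails=2):
    nd = NormalDist()
    z_delta = (mu1 - mu0) / (sigma / math.sqrt(n))    # standardised effect
    if tails == 1:
        z_a = nd.inv_cdf(1 - alpha)
        if mu1 >= mu0:                                # (a) upper tail
            return 1 - nd.cdf(z_a - z_delta)
        return nd.cdf(-z_a - z_delta)                 # (b) lower tail
    z_a2 = nd.inv_cdf(1 - alpha / 2)                  # (c) both tails
    return (1 - nd.cdf(z_a2 - z_delta)) + nd.cdf(-z_a2 - z_delta)

print(round(z_test_power(100, 105, 10, 25, tails=1), 3))   # 0.804
```

Note that with a zero treatment effect (mu1 = mu0) the function returns α, as stated above.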

Estimating sample size
We rearrange the formula for power to give us the number of samples required to obtain a given power.
Algebraically speaking:
For a one-tailed test:

n = (z_{α} + z_{β})^{2} σ^{2} / (μ_{1} − μ_{0})^{2}
where
 z_{α} is obtained from your probability calculator or tables, given that P(Z < z_{α}) = 1 − α and α is the significance level.
 z_{β} is obtained from your probability calculator or tables, given that P(Z < z_{β}) = 1 − β and 1 − β is the power.
 μ_{0} is the known population mean,
 μ_{1} is the test population mean
 σ is the known population standard deviation of the observations
For a two-tailed test, we use an approximation: z_{α/2} in place of z_{α}. This ignores the possibility of a Type III error but, for large treatment effects, will not usually introduce any serious error.
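The rearranged formula is easily computed. In the sketch below the population values (μ_{0} = 100, μ_{1} = 105, σ = 10) are hypothetical illustrations:

```python
# Required sample size from n = (z_alpha + z_beta)^2 sigma^2 / (mu1 - mu0)^2.
# All input values in the example calls are hypothetical.
import math
from statistics import NormalDist

def z_test_sample_size(mu0, mu1, sigma, alpha=0.05, power=0.80, tails=1):
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - (alpha / 2 if tails == 2 else alpha))
    z_b = nd.inv_cdf(power)
    n = (z_a + z_b) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
    return math.ceil(n)            # round up to the next whole observation

print(z_test_sample_size(100, 105, 10))            # 25 (one-tailed, 80% power)
print(z_test_sample_size(100, 105, 10, tails=2))   # 32 (two-tailed, 80% power)
```

Since n must be a whole number, the result is always rounded up rather than to the nearest integer, so the achieved power is never below the target.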

The following values of z_{α} and z_{β} are those most frequently used in sample size calculations:

Significance level     One-tailed (z_{α})     Two-tailed (z_{α/2})
5%                     1.6449                 1.9600
1%                     2.3263                 2.5758

Power                  z_{β}
80%                    0.8416
90%                    1.2816
95%                    1.6449
Assumptions
You are making a number of assumptions when you estimate power and required sample size. The first group of assumptions applies to all significance tests, namely:
 Samples are taken randomly, or individuals are allocated randomly to treatment groups.
 Observations are independent of each other.
The second set of assumptions applies specifically to the Z-test:
 The response variable approximates to a normal distribution.
 The true mean and standard deviation of the population are known and not estimated from a sample.
Related topics:
 Efficiency of tests