"It has long been an axiom of mine that the little things are infinitely the most important"
Statistical power and sample size
What is statistical power?
Power is conventionally defined as the probability of rejecting the null hypothesis when the null hypothesis is false and the alternative hypothesis is true.
For example, say we have two populations whose parametric means are different, and we repeatedly take a random sample from each and test whether the sample means differ.
In principle, therefore, the power of the test is the proportion of those tests that correctly indicate that the two population means are significantly different.
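The 'proportion of tests' view of power can be sketched by simulation. The example below is an illustration only. It assumes normally distributed populations with a common, known standard deviation, equal sample sizes, and a two-tailed two-sample Z-test at α = 0.05. None of these choices come from the text above, and the function name and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulated_power(mu_a, mu_b, sigma, n, alpha=0.05, n_trials=2000, seed=1):
    """Estimate power as the proportion of simulated tests that reject H0.

    Assumes two normal populations with the same known standard deviation
    (sigma), n observations from each, and a two-tailed two-sample Z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    se_diff = sigma * (2 / n) ** 0.5               # SE of the difference in means
    rejections = 0
    for _ in range(n_trials):
        sample_a = [rng.gauss(mu_a, sigma) for _ in range(n)]
        sample_b = [rng.gauss(mu_b, sigma) for _ in range(n)]
        z = (mean(sample_a) - mean(sample_b)) / se_diff
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    # Means differ by one standard deviation: most tests should reject H0.
    print(simulated_power(mu_a=0.0, mu_b=1.0, sigma=1.0, n=20))
    # Means are equal: the rejection rate should be close to alpha.
    print(simulated_power(mu_a=0.0, mu_b=0.0, sigma=1.0, n=20))
```

When the two means are equal, the proportion of rejections estimates the Type I error rate rather than power, which is why it should settle near α.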
Provided the statistic being tested has a 'known' distribution (e.g. normal) that test's power is as follows:
This works perfectly happily irrespective of the size of your treatment effect (it could be zero) but it does assume your treatment effect is fixed (cannot vary), and that the only difference between d0 and dA is their location (so the treatment effect, δ, is
Lastly, notice that if the test statistic's distribution is not continuous (smooth), but is strongly discrete (stepped), employing conventional critical values can reduce the attainable power to the point of uselessness. In that situation, mid-P values behave better 'on average' - provided you accept your test could be either conservative or
Clearly we want the power of our statistical test to be as high as possible. So we need to know which other factors determine the power of a test:
For any particular statistical test there is a mathematical relationship between power, the level of significance, various population parameters, and the sample size. For some of the more important statistical tests we will provide the formulae for this relationship. But before we introduce the first of these (for the Z-test), we need to consider exactly what we are going to calculate using the relationship.
Estimating statistical power
There are two reasons for estimating the power of a test:
In practice one normally calculates required sample size directly for a given desired power, rather than producing a power curve. Nevertheless it can be very useful to examine a power curve because it can help with making a more rational decision on experimental design. Such a priori power predictions are worthwhile, although they may be criticised if they are based either upon insufficient prior information (from too small a pilot study), or upon too approximate (or inappropriate) a model of how the statistic to be tested is liable to vary. Somewhat perversely, referees tend to be very much more concerned about the precise mathematical model employed than the information to which it is applied - possibly because theoretical mathematical shortcomings are easier to solve, and their refinement provides interesting career prospects for mathematical statisticians.
You will always find that there is not enough power to demonstrate a nonsignificant treatment effect. This is because the estimated power is directly related to the observed P-value. In other words, it cannot tell you any more than a precise P-value.
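The link between 'observed' power and the P-value can be made concrete for a simple case. The sketch below assumes a two-tailed Z-test, with the observed effect plugged in as if it were the true effect; the function name is hypothetical. Because the only input is the P-value, the result carries no information beyond it.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power_from_p(p_value, alpha=0.05):
    """'Observed' power of a two-tailed Z-test, computed only from its P-value.

    Assumes the observed effect is treated as the true effect.
    """
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the P-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)    # two-tailed critical value
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05, 0.2, 0.5):
        print(p, round(observed_power_from_p(p), 3))
```

Under these assumptions a result exactly at P = 0.05 has observed power of about one half, and any larger P-value gives less, so a nonsignificant result always appears 'underpowered'.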
Unfortunately, post-hoc power determinations have no theoretical justification and are not recommended. Power is a pretrial concept. We should not apply a pre-experiment probability, of a hypothetical group of results, to the one result that is observed. This has been compared to trying to convince someone that buying a lottery ticket was foolish (the before-study viewpoint) after they hit a lottery jackpot (the after-study viewpoint).
Calculating the power to demonstrate your observed treatment effect locks you into the significant / non-significant mindset with a rigid 0.05 significance level. Once you have the data, it is better to use the precise P-value to judge the weight of evidence, and to calculate a confidence interval around the estimated effect size as a measure of the reliability of that estimate.
These points accepted, there is one form of after-the-event power calculation that can be very informative - the empirical power curve, or its equivalent P-value plot, or P-value function - which is equivalent to every possible confidence interval about the observed effect size. Whatever it is called, this function estimates the relationship between the probability of rejecting the null hypothesis and the effect size - given the data at hand. For simpler models this relationship can be predicted algebraically. Alternatively, and more illuminatingly, the relationship can be estimated by 'test inversion'. Since test inversion exploits the underlying link between tests and confidence intervals, we explore this method in
Estimating required sample size for a given power
Predicting the sample size required for any particular statistical test requires values for the statistical power, the significance level, the effect size and various population parameters. You also need to specify whether the test is one-tailed or two-tailed. We will consider each of these components.
The values chosen for the statistical power and the significance level depend on the study. Conventionally, power should be no lower than 0.8 and preferably around 0.9. The commonest value used for significance level (α) is 0.05. However, there may be good reasons to diverge from these conventional values. If it is more important to avoid a Type I error (that is, a false positive result), then one may decrease the significance level to 0.01. If it is more important to avoid a Type II error (that is, a false negative result), then one may increase the power to 0.95.
The relevant population parameters depend on the type of statistical test. If you are comparing means, you need to specify the population standard deviation. If you are comparing proportions you need to specify the baseline or control group proportion, which in turn allows one to estimate the standard deviation. Estimation of these parameters can usually be done from the literature, or failing that from a pilot study. Sometimes it is necessary to re-evaluate these parameters part way through a study - although this is generally strongly disapproved of by statisticians on the grounds that it can introduce bias into the process.
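As a brief illustration of how a baseline proportion yields a standard deviation, the sketch below uses the usual binomial result: a yes/no outcome with proportion p has standard deviation √(p(1 − p)), and a proportion estimated from n independent observations has standard error √(p(1 − p)/n). The function names and example values are hypothetical.

```python
from math import sqrt


def binary_sd(p):
    """Standard deviation of a single yes/no outcome with proportion p."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return sqrt(p * (1 - p))


def proportion_se(p, n):
    """Standard error of a proportion estimated from n independent observations."""
    return binary_sd(p) / sqrt(n)


if __name__ == "__main__":
    # Hypothetical control-group cure rate of 70%, with 100 patients.
    print(binary_sd(0.7), proportion_se(0.7, 100))
```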
The effect size (the smallest difference between the means or proportions that you consider it worthwhile to detect) is probably the most difficult parameter you have to determine because it is to some extent subjective. If one is comparing a new malarial treatment with the standard, how big an improvement is worthwhile? In deciding this one should take into account the frequency and severity of side effects, the relative cost of the new treatment, and the relative ease of administration. If the new drug is cheaper than the current one with fewer side effects, then even a small improvement in the cure rate (say 5%) is worthwhile. If it is much more expensive with similar side effects, one might consider that only a larger improvement (say 20%) would be worthwhile.
Lastly one has to decide whether to choose a one-tailed or two-tailed test. Sometimes a one-tailed test is chosen simply as a means to reduce the required sample size, a practice strongly discouraged by statisticians. Nowadays the convention is that one should always estimate sample size for a two-tailed test, even if a one-sided test is subsequently used for the analysis.
Estimating power and sample size for the Z-test
Hypotheses and tails
We now consider how to estimate the statistical power of the Z-test for comparing a value, Q, randomly selected from a test population with true mean (μ1), with a known reference population mean (μ0) and known standard error (σd). This standard error is assumed to be the same under both the null and the alternative hypothesis - and
To reduce the amount of computation, these comparisons are commonly performed using standardised values. Unfortunately this usually introduces some extra notation, which we would be wise to explain before proceeding.
For the three tests listed above the probability of correctly rejecting the null hypothesis, with a predefined α, is as follows:
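The calculation can be sketched as follows, taking the three tests to be the lower one-tailed, upper one-tailed and two-tailed Z-tests (an assumption), and writing δ = μ1 − μ0 for the true difference. The function name, argument names and tail labels are hypothetical; the formulae are the standard ones for a normally distributed statistic with known standard error σd.

```python
from statistics import NormalDist


def z_test_power(delta, se, alpha=0.05, tail="two"):
    """Power of a Z-test when the true difference is delta = mu1 - mu0.

    se is the known standard error (sigma_d), assumed to be the same under
    the null and the alternative hypothesis.
    """
    std = NormalDist()
    shift = delta / se                      # standardised true difference
    if tail == "upper":                     # alternative: mu1 > mu0
        return 1 - std.cdf(std.inv_cdf(1 - alpha) - shift)
    if tail == "lower":                     # alternative: mu1 < mu0
        return std.cdf(-std.inv_cdf(1 - alpha) - shift)
    if tail == "two":                       # alternative: mu1 != mu0
        z_crit = std.inv_cdf(1 - alpha / 2)
        return std.cdf(-z_crit - shift) + 1 - std.cdf(z_crit - shift)
    raise ValueError("tail must be 'upper', 'lower' or 'two'")


if __name__ == "__main__":
    for tail in ("lower", "upper", "two"):
        print(tail, round(z_test_power(delta=2.8, se=1.0, tail=tail), 4))
```

When δ is zero each version returns α, since the only rejections are then Type I errors.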
Estimating sample size
We rearrange the formula for power to give us the number of samples required to obtain a given power.
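One common form of this rearrangement is sketched below. It assumes the standard error arises from a sample mean, so that σd = σ/√n, which gives n = ((zα + zβ)σ/δ)². That assumption, the function name and the example values are all illustrative. For a two-tailed test the small chance of rejecting in the 'wrong' tail is ignored, as is usual.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(delta, sigma, alpha=0.05, power=0.8, two_tailed=True):
    """Smallest n giving at least the requested power for a Z-test on a mean.

    Assumes the standard error is sigma / sqrt(n), where sigma is the known
    population standard deviation and delta is the difference to detect.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2) if two_tailed else std.inv_cdf(1 - alpha)
    z_beta = std.inv_cdf(power)             # beta = 1 - power
    return ceil(((z_alpha + z_beta) * sigma / abs(delta)) ** 2)


if __name__ == "__main__":
    # Hypothetical values: detect a difference of 5 units when sigma is 10.
    print(required_sample_size(delta=5, sigma=10, power=0.8))
    print(required_sample_size(delta=5, sigma=10, power=0.9))
```

The result is rounded up because n must be a whole number and rounding down would give slightly less than the requested power.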
The following values of zα and zβ are those most frequently used in sample size calculations:
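Such values are quantiles of the standard normal distribution, so they can be computed directly. The sketch below does so for significance levels and powers mentioned earlier (α = 0.05 and 0.01; power = 0.8, 0.9 and 0.95); the selection of values and the function names are assumptions made for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_alpha(alpha, two_tailed=True):
    """Critical value of the standard normal for significance level alpha."""
    tail_area = alpha / 2 if two_tailed else alpha
    return _STD_NORMAL.inv_cdf(1 - tail_area)


def z_beta(power):
    """Standard normal quantile for the requested power (beta = 1 - power)."""
    return _STD_NORMAL.inv_cdf(power)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print("alpha", alpha,
              "two-tailed", round(z_alpha(alpha), 3),
              "one-tailed", round(z_alpha(alpha, two_tailed=False), 3))
    for power in (0.8, 0.9, 0.95):
        print("power", power, "z_beta", round(z_beta(power), 3))
```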
Assumptions
You are making a number of assumptions when you estimate power and required sample size. The first group of assumptions applies to all significance tests, namely:
The second set of assumptions applies specifically to the Z-test: