Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important"
Statistical power and sample size: Use & misuse
(minimum detectable difference, sample size, cluster trials, post-hoc power analysis, effect size, reliability)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Power is conventionally defined as the probability of rejecting the null hypothesis when the null hypothesis is false and the alternative hypothesis is true. For example, say we have two populations whose parametric means are different. We sample the two populations and obtain sample means and variances. We carry out a statistical test to see if the means are significantly different. We repeat the sampling and testing many times. In principle, the power of the test is the proportion of those tests that correctly indicate that the two population means are significantly different. In practice, a better (and more general) definition is that power is simply the probability that a test will class a specified treatment effect as significant.
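The 'repeat the sampling and testing many times' definition can be tried out directly by simulation. The following is a minimal sketch, not a recommended analysis: it assumes normally distributed data, equal group sizes, and a two-sided z-test with a known common standard deviation (chosen only so that the standard library suffices). The function names and the example values are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(delta, sigma, n, alpha=0.05, reps=2000, seed=1):
    """Proportion of repeated two-sample z-tests that reach significance.

    delta: true difference between the two population means
    sigma: common standard deviation, assumed known (a simplification)
    n:     sample size per group
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n) ** 0.5  # standard error of the difference in means
    significant = 0
    for _ in range(reps):
        mean_a = fmean(rng.gauss(0.0, sigma) for _ in range(n))
        mean_b = fmean(rng.gauss(delta, sigma) for _ in range(n))
        if abs(mean_b - mean_a) / se > z_crit:
            significant += 1
    return significant / reps


def analytic_power(delta, sigma, n, alpha=0.05):
    """Power of the same two-sided z-test from the normal distribution."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n) ** 0.5)
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Illustrative values only: a true difference of one standard deviation,
# 16 units per group.
print(simulated_power(delta=1.0, sigma=1.0, n=16))
print(analytic_power(delta=1.0, sigma=1.0, n=16))
```

The two figures should agree to within simulation error. Setting delta to zero shows the other side of the coin: the proportion of 'significant' tests then estimates the Type I error rate rather than power.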
There are substantial differences between disciplines in the extent to which power is taken into account in the study design. In medical research, a power analysis precedes nearly all clinical trials in order to set sample size. It is also used prior to most observational studies. The only drawback to this is that it assumes you know the relevant population parameters, when in fact you only have estimates, and sometimes wildly inaccurate 'guesstimates'. Hence your estimates of required sample size will be similarly dodgy. Power analyses are much less common in veterinary research, although things are changing, at least for clinical trials. In ecological and wildlife research, the issue is still generally ignored - although with notable exceptions. In these disciplines, post-hoc power analyses are sometimes performed at the end of the study, supposedly to aid interpretation of non-significant differences. Such analyses are not recommended, with the confidence interval being a more useful measure of the reliability of an observed effect.
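To see how strongly a sample size estimate depends on the guessed parameters, consider the usual normal-approximation formula for comparing two means, n = 2(z(1-alpha/2) + z(power))^2 sigma^2 / delta^2 per group. This is a sketch under stated assumptions (equal group sizes, two-sided test, normal approximation in place of the t-distribution); the function name and the numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of means.

    delta: minimum detectable difference
    sigma: guessed common standard deviation
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Same minimum detectable difference, two different guesses at sigma.
print(n_per_group(delta=5, sigma=10))  # 63
print(n_per_group(delta=5, sigma=15))  # 142
```

Because the required n scales with the square of sigma, a guess that is out by 50% more than doubles the sample size, which is why the rationale for each parameter needs to be stated up front.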
In many disciplines the commonest misuse is lack of any consideration of power at all, leading to woefully inadequate sample sizes. Sample sizes of 1-5 per treatment group are not uncommon in veterinary and wildlife studies. Such studies are unethical, simply because they cannot give decisive results. Many experiments lose power by trying to compare too many treatments - if there are fewer treatments, you get more power for the same total number of experimental units. Underpowered studies suffer especially from the problem of misinterpreting 'no significant difference' to mean 'no difference'. Such an interpretation is never valid, even if power is adequate, because the minimum detectable difference is never set at zero! It is better to focus on the effect size, with an estimate of its reliability, rather than debating whether a non-significant difference is 'real'.
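The cost of spreading a fixed number of experimental units over more treatments can be sketched numerically. The example below assumes a two-sided z-test with known standard deviation for a single pairwise comparison, a true difference of one standard deviation, and 48 units in total; all of these are illustrative assumptions. It also ignores any correction for multiple comparisons, which would reduce power further.

```python
from statistics import NormalDist


def pairwise_power(delta, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided z-test comparing two group means."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n_per_group) ** 0.5)
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


TOTAL_UNITS = 48  # hypothetical total number of experimental units

for n_treatments in (2, 3, 4, 6, 8):
    n = TOTAL_UNITS // n_treatments  # units available per treatment
    power = pairwise_power(delta=1.0, sigma=1.0, n_per_group=n)
    print(f"{n_treatments} treatments, {n} per group: power = {power:.2f}")
```

Under these assumptions, power for any one comparison falls steadily as treatments are added, from well above the conventional 0.8 with two treatments to below 0.5 with eight.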
There are a number of mistakes in determining sample size requirements. Probably the main one is to set the minimum detectable difference, and/or the power, simply to get a convenient sample size. The rationale for choice of parameters, especially the minimum detectable difference, should always be made explicit at the start of the study. Changing the parameters for power calculations during the course of an experiment is generally unwise, although it may be justified if it is done to correct estimates of the baseline level or the variances. In cluster trials it is important to get the balance right between the number of clusters (to maximise n) and the number of individuals in each cluster (to minimise random variation between clusters). A common misuse is to estimate required sample size without taking into consideration the numbers likely to drop out of the trial, or not respond to a questionnaire. A fortunately rare misuse is to estimate power using different variable(s) from those being used in the study - for example estimating power from absolute population estimates, and then using relative population estimates in the study.
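Two of these adjustments are simple enough to sketch. For cluster trials, a widely used approximation inflates the individually randomised sample size by the design effect 1 + (m - 1) x ICC, where m is the cluster size and ICC the intracluster correlation; for drop-outs, the required number is divided by the expected proportion retained. Both are assumptions brought in for illustration (equal cluster sizes, a guessed ICC, a guessed drop-out rate), and the function names and figures are hypothetical.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from sampling whole clusters of equal size."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_individual, cluster_size, icc):
    """Return (individuals, clusters) needed per arm in a cluster trial.

    n_individual: sample size per arm if individuals were randomised
    """
    n_total = ceil(n_individual * design_effect(cluster_size, icc))
    n_clusters = ceil(n_total / cluster_size)
    return n_total, n_clusters


def inflate_for_dropout(n_required, dropout_rate):
    """Number to recruit so that n_required are expected to remain."""
    return ceil(n_required / (1 - dropout_rate))


# Hypothetical inputs: 63 per arm under individual randomisation, ICC = 0.05.
print(clustered_sample_size(63, cluster_size=5, icc=0.05))
print(clustered_sample_size(63, cluster_size=20, icc=0.05))
print(inflate_for_dropout(63, dropout_rate=0.15))
```

With these inputs, larger clusters need fewer clusters but many more individuals in total, which is the balance the design has to strike.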
What the statisticians say
Bart et al. (1998) give an excellent introduction for ecologists to sample size requirements and power in Chapter 3. They also explain why it is unwise to carry out post-hoc power calculations. Zar (1996) gives only a very brief introduction to statistical power in Chapter 6. Thrusfield (2005) gives the principles for sample size selection and statistical power in Chapter 14, followed by the formulae for the basic study types in subsequent chapters. Cohen (1988) is the 'standard' text on power analysis and provides a convenient source of formulae and tables. More recent texts include Murphy & Myors (2003) and Bausell & Li (2002).
Schulz & Grimes (2005) provide an up-to-date review of power and sample size in clinical trials. Most of the review is excellent, although their justification of the 'sample samba' is flawed. Wittes (2002) gives a comprehensive review of methods for the choice of sample size and statistical power for randomized controlled trials. This is fairly advanced, but provides a good reference work.
Hoenig & Heisey (2001) and Thomas (1997) tackle the thorny issue of post-hoc power calculations. Goodman & Berlin (1994) provide a fairly readable account of why post-hoc power analysis is inappropriate, and why confidence intervals should be used instead. Assmann et al. (2000) and Pocock et al. (2002) both look at the problems of carrying out subgroup analyses in randomised trials when such comparisons commonly lack statistical power. Altman & Bland (1995) provide an excellent summary of the problem of inadequate sample size and consequent lack of power in many clinical trials, together with some good examples.
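One reason post-hoc power adds nothing can be shown in a few lines: if 'observed power' is computed by treating the observed effect as the true effect, it is a fixed function of the P-value. The sketch below assumes a two-sided z-test, and the function name is hypothetical; it is an illustration of the general point, not a reproduction of any calculation in the papers cited.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc 'observed power' of a two-sided z-test, from the P-value alone.

    p_value must lie in (0, 1]; the observed effect is taken as the true one.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_obs = nd.inv_cdf(1 - p_value / 2)  # |z| implied by the P-value
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"P = {p:.2f}: observed power = {observed_power(p):.2f}")
```

Under these assumptions a result with P exactly at the significance level always has observed power of about 0.5, and every non-significant result has lower observed power still, so the calculation cannot help to interpret a non-significant difference.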
Gerrodette (1995), Steidl et al. (1997) and Hayes & Steidl (1997) all look at statistical power analysis in wildlife research, and Thomas & Krebs (1997) review available software for statistical power analysis. Jennions & Moller (2003) carry out a survey of statistical power in articles from behavioural journals.
Wikipedia provides sections on statistical power, Type I and Type II errors, sample size, and effect size. The NIST/SEMATECH e-Handbook of Statistical Methods gives a brief account of interval estimation and hypothesis tests. Hun Myoung Park, the Web Center for Social Research Methods and Jeremy Miles all look at statistical power.