What is statistical power?

Power is conventionally defined as the probability of rejecting the null hypothesis when the null hypothesis is false and the alternative hypothesis is true.

For example, say we have two populations whose parametric means are different.

  1. We sample the two populations and obtain sample means and variances.
  2. We carry out a statistical test to see if the means are significantly different.
  3. We repeat the sampling and testing many times.

In principle, therefore, the power of the test is the proportion of those tests that correctly indicate that the two population means are significantly different.

In practice, as we note elsewhere, a better (and more general) definition of power is simply the probability that the test will class a specified treatment effect as significant.

Provided the statistic being tested has a 'known' distribution (e.g. normal), that test's power can be found as follows:

  • Imagine that dA is the distribution of your test statistic (e.g. Z) under the alternate hypothesis, HA.
  • Then the power of your test is simply the proportion of dA which falls outside the lower and/or upper 'critical values' of your test - these being quantiles of d0, which is that test statistic's distribution under the null hypothesis, H0.

This works perfectly happily irrespective of the size of your treatment effect (it could be zero), but it does assume your treatment effect is fixed (cannot vary), and that the only difference between d0 and dA is their location (so the treatment effect is δ = μA − μ0).

    Of course, if d0 and dA differ in other ways, or their distributions are unknown (or cannot be readily calculated) the only way to find the power may be empirically (in other words by simulation). In that case you repeatedly sample a defined population, apply your test to each sample, and find what proportion of results are 'significant'.
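As a minimal sketch of this simulation approach - assuming two normal populations compared with a two-sample t-test, and purely illustrative parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(mu0=10.0, mu1=12.0, sd=4.0, n=20, alpha=0.05, reps=10_000):
    """Empirical power: the proportion of repeated experiments in which
    the two-sample test declares the difference significant."""
    significant = 0
    for _ in range(reps):
        a = rng.normal(mu0, sd, n)  # sample from the reference population
        b = rng.normal(mu1, sd, n)  # sample from the 'treated' population
        if stats.ttest_ind(a, b).pvalue < alpha:
            significant += 1
    return significant / reps

print(simulated_power())  # around 0.33 for these illustrative settings
```

The same loop works for any test and any population model - which is precisely why simulation is useful when the test statistic's distribution cannot be readily calculated.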

Lastly, notice that if the test statistic's distribution is not continuous (smooth), but is strongly discrete (stepped), employing conventional critical values can reduce the attainable power to the point of uselessness. In that situation, mid-P values behave better 'on average' - provided you accept your test could be either conservative or liberal.

 

Clearly we want the power of our statistical test to be as high as possible. So we need to know which other factors determine the power of a test:

Power will tend to be greater when:
  1. the effect size is large,
  2. the sample size is large,
  3. the variances of the populations being sampled are small,
  4. the significance level (α) is high (for example 5% compared to 1%),
  5. a one-tailed rather than two-tailed test is used.
Note that power can only be estimated reliably if all the assumptions for the statistical test are met.
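Each of these effects can be checked numerically. A sketch using statsmodels' NormalIndPower, a two-sample Z-test power calculator (all values below are purely illustrative; note that its effect_size argument is the difference scaled by the standard deviation, so it combines points 1 and 3):

```python
from statsmodels.stats.power import NormalIndPower

zpower = NormalIndPower()  # power calculator for the two-sample Z-test

base = dict(effect_size=0.4, nobs1=25, alpha=0.05, alternative='two-sided')
print(zpower.power(**base))                               # baseline power

# 1 & 3: a larger difference, or smaller variances, mean a larger
# standardised effect size - and higher power
print(zpower.power(**{**base, 'effect_size': 0.8}))       # higher
# 2: a larger sample size gives higher power
print(zpower.power(**{**base, 'nobs1': 100}))             # higher
# 4: a lower significance level (1% rather than 5%) gives lower power
print(zpower.power(**{**base, 'alpha': 0.01}))            # lower
# 5: a one-tailed test (in the correct direction) gives higher power
print(zpower.power(**{**base, 'alternative': 'larger'}))  # higher
```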

 

For any particular statistical test there is a mathematical relationship between power, the level of significance, various population parameters, and the sample size. For some of the more important statistical tests we will provide the formulae for this relationship. But before we introduce the first of these (for the Z-test), we need to consider exactly what we are going to calculate using the relationship.

 

 

Estimating statistical power

There are two reasons for estimating the power of a test:

  1.  To produce a power curve to predict how much information needs to be gathered to be reasonably sure (say 95%) you will obtain a significant result. This is a reasonable and productive exercise.

    In practice one normally calculates the required sample size directly for a given desired power, rather than producing a power curve. Nevertheless it can be very useful to examine a power curve, because it can help with making a more rational decision on experimental design. Such a priori power predictions are worthwhile, although they may be criticised if they are based upon insufficient prior information (from too small a pilot study), or if too approximate (or inappropriate) a model is used to predict how the statistic to be tested is liable to vary. Somewhat perversely, referees tend to be very much more concerned about the precise mathematical model employed than the information to which it is applied - possibly because theoretical mathematical shortcomings are easier to solve, and their refinement provides interesting career prospects for mathematical statisticians.

     

  2.  To obtain additional information about data that have already been gathered and tested. Such post-hoc power predictions are controversial, and generally not recommended, for two reasons:

    1. You will always find that there was not enough power to demonstrate a nonsignificant treatment effect. This is because the estimated power is directly related to the observed P-value. In other words, it cannot tell you any more than a precise P-value does.

        Despite this objection, a number of standard textbooks (such as Zar (1996) and Thrusfield (2005)) recommend that power should be calculated if a difference turns out to be non-significant, as an aid to 'interpreting that difference'. If a test has insufficient power to detect that level of difference, they suggest the result should be classed as 'inconclusive'.

      Unfortunately, post-hoc power determinations have no theoretical justification and are not recommended. Power is a pretrial concept. We should not apply a pre-experiment probability, of a hypothetical group of results, to the one result that is observed. This has been compared to trying to convince someone that buying a lottery ticket was foolish (the before-study viewpoint) after they hit a lottery jackpot (the after-study viewpoint).

    2. Calculating the power to demonstrate your observed treatment effect locks you into the significant / non-significant mindset with a rigid 0.05 significance level. Once you have the data, it is better to use the precise P-value to judge the weight of evidence, and to calculate a confidence interval around estimated effect size as a measure of reliability of that estimate.

    These points accepted, there is one form of after-the-event power calculation that can be very informative - the empirical power curve, or its equivalent P-value plot, or P-value function - which is equivalent to every possible confidence interval about the observed effect size. Whatever it is called, this function estimates the relationship between the probability of rejecting the null hypothesis and the effect size - given the data at hand (as sketched below). For simpler models this relationship can be predicted algebraically. Alternatively, and more illuminatingly, the relationship can be estimated by 'test inversion'. Since test inversion exploits the underlying link between tests and confidence intervals, we explore this method in Unit 6.
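A minimal sketch of such a P-value function, by test inversion of a simple Z-statistic (the observed effect size and its standard error below are illustrative):

```python
import numpy as np
from scipy import stats

d_obs, se = 1.2, 0.75  # illustrative observed effect and its standard error

# Test inversion: the two-sided P-value for testing H0: delta = delta0,
# for a range of hypothesised effect sizes delta0, given the data at hand
delta0 = np.linspace(-1.5, 4.0, 12)
pvals = 2 * stats.norm.sf(np.abs(d_obs - delta0) / se)

for d0, p in zip(delta0, pvals):
    print(f"delta0 = {d0:5.2f}   P = {p:.3f}")

# Every delta0 with P > 0.05 lies inside the 95% confidence interval, so
# this one curve summarises every possible confidence interval at once.
```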

 

 

Estimating required sample size for a given power

Predicting the sample size required for any particular statistical test requires values for the statistical power, the significance level, the effect size and various population parameters. You also need to specify whether the test is one-tailed or two-tailed. We will consider each of these components.

The values chosen for the statistical power and the significance level depend on the study. Conventionally, power should be no lower than 0.8 and preferably around 0.9. The commonest value used for the significance level (α) is 0.05. However, there may be good reasons to diverge from these conventional values. If it is more important to avoid a Type I error (that is, a false positive result), then one may decrease the significance level to 0.01. If it is more important to avoid a Type II error (that is, a false negative result), then one may increase the power to 0.95.

The relevant population parameters depend on the type of statistical test. If you are comparing means, you need to specify the population standard deviation. If you are comparing proportions you need to specify the baseline or control group proportion, which in turn allows one to estimate the standard deviation. Estimation of these parameters can usually be done from the literature, or failing that from a pilot study. Sometimes it is necessary to re-evaluate these parameters part way through a study - although this is generally strongly disapproved of by statisticians on the grounds that it can introduce bias into the process.

 

The effect size (the smallest difference between the means or proportions that you consider it worthwhile to detect) is probably the most difficult parameter you have to determine because it is to some extent subjective. If one is comparing a new malarial treatment with the standard, how big an improvement is worthwhile? In deciding this one should take into account the frequency and severity of side effects, the relative cost of the new treatment, and the relative ease of administration. If the new drug is cheaper than the current one with fewer side effects, then even a small improvement in the cure rate (say 5%) is worthwhile. If it is much more expensive with similar side effects, one might consider that only a larger improvement (say 20%) would be worthwhile.

Do not just choose the effect size
that gives you a convenient sample size!

Considerations about the choice of effect size should always be made explicit - a point which is not sufficiently stressed in the literature! All too often researchers do what is popularly known as the 'sample size samba' - which is to modify the effect size simply to give a convenient sample size. This is very foolish, because if one then finds a smaller effect size, one is committed to saying it is not worthwhile - even if it is!!

 

Lastly one has to decide whether to choose a one-tailed or two-tailed test. Sometimes a one-tailed test is chosen simply as a means to reduce the required sample size, a practice strongly discouraged by statisticians. Nowadays the convention is that one should always estimate sample size for a two-tailed test, even if a one-sided test is subsequently used for the analysis.

There is one last important point!

Estimating the required sample size is never a precise science. It is always approximate, because you have to estimate (sometimes just guess) the variances of the populations involved. Hence the actual power you achieve may be well below what you intend.

It is therefore a good idea to use a somewhat larger sample size than that indicated by your power analysis.

 

 

Estimating power and sample size for the Z-test

Hypotheses and tails

We now consider how to estimate the statistical power of the Z-test for comparing a value, Q, randomly selected from a test population with true mean (μ1) - with a known reference population mean (μ0) and known standard error (σd). This standard error is assumed to be the same under both the null and the alternate hypothesis - and the observed difference is d = Q − μ0.

  1. For a one-tailed test, of the upper tail:
    • The Null Hypothesis (H0) is μ1 = μ0
      So δ = [μ1 − μ0] = 0
    • The Alternative Hypothesis (H1) is μ1 > μ0
      So δ = [μ1 − μ0] > 0
    In other words, δ is the true difference between the null and alternate population means, and d is the difference we observe - which is an estimate of δ. We will only reject H0 if we observe a d lying within the upper tail of our null population. The bigger δ is compared to σd, the higher that probability is.

  2. For a one-tailed test, of the lower tail:
    • H0 is the same, δ = 0.
    • But H1 is μ1 < μ0, so δ < 0.
    Here we can only reject H0 if d is observed in the lower tail of our null population.

  3. For a two-tailed test:
    • H0 is the same.
    • Under H1, δ ≠ 0.
    H0 can be rejected if d is observed in either tail of our null population.

 

Z Notation

To reduce the amount of computation, these comparisons are commonly performed using standardised values. Unfortunately this usually introduces some extra notation, which we would be wise to explain before proceeding.

  • In Unit 3 we used Z to refer to a normal probability density - that of the standard normal distribution. Confusingly, Z may also be used to denote randomly selected locations within that distribution. Predefined values (usually quantiles) within that distribution are indicated using a small z, with a subscript.

  • zα (or +zα) is the location of the critical value for α, above which lie 100α% of the null population.
    1. For an ordinary 1-tailed significance test, of the upper tail, α = 0.05 and +zα = +1.645.
    2. Because this distribution is symmetrical, for the lower tail −zα = −1.645.
    3. For a 2-tailed comparison, assuming a probability of α/2 in each tail and α=0.05, then −zα/2 = −1.960 and +zα/2 = +1.960.

  • Accordingly, if we standardise the difference between means by dividing by the population standard error of d (σd), then zδ = δ/σd, or [μ1 − μ0]/σd.
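These quantiles, and zδ, are easily obtained from a probability calculator; for example (with illustrative values of δ and σd):

```python
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(1 - alpha))      # +z_alpha   = 1.6449 (one-tailed)
print(norm.ppf(1 - alpha / 2))  # +z_alpha/2 = 1.9600 (two-tailed)

delta, se_d = 1.0, 0.5          # illustrative effect and standard error of d
z_delta = delta / se_d          # standardised treatment effect
print(z_delta)                  # 2.0
```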

     

    Power formulae

    For the three tests listed above the probability of correctly rejecting the null hypothesis, with a predefined α, is as follows:

     

    Algebraically speaking -

    a.   For a one-tailed test, using the upper tail (treatment effect positive):

    Power (1-β)   =  P[Z > ( +zα − zδ )]

    For example, if zδ = +zα, then half of all randomly selected results will exceed +zα, causing H0 to be rejected - so the power (1-β) will be 0.5

    b.   For a one-tailed test, using the lower tail (treatment effect negative):

    Power (1-β)   =   P[Z < ( −zα − zδ )]   =  1 − P[Z > ( −zα − zδ )]

    Similarly, if zδ = −zα, then half of all randomly selected results will fall below −zα - so the power (1-β) will be 0.5

    c.   For a two-tailed test, using both tails:

    Power (1-β)   =  P[Z > ( +zα/2 − zδ )]   +  1 − P[Z > ( −zα/2 − zδ )]

    Given which, if zδ = −zα/2 or zδ = +zα/2, then a little more than half of all randomly selected values will cause H0 to be rejected. If α/2 is larger, or zδ is smaller, the difference in power compared to the 1-tailed formula is rather greater. In all three cases, if δ = 0, then (1 − β) = α, which is the proportion of type I errors where H0 is true.

    Where:

    • P is probability, determined from the cumulative normal distribution, as the proportion of the standard normal distribution greater than or less than Z. This can be obtained from the probability calculator on your computer statistical package. If you are using tables, some give the proportion of the distribution that is less than Z, whilst others give the proportion of the distribution that is greater than Z. Another variant is where the probability given in the table is from zero to Z, so you have to add 0.5 to get the correct value.
    • Z is the standardised normal deviate,
    • zα is the location of the critical value for α, above which lie 100α% of the null population - and is obtained from your probability calculator or tables, given that P(Z < zα) = 1-α and α is the significance level.
    • zδ = δ/σd, or [μ1 − μ0]/σd
    • μ0 is the reference population mean (under H0),
    • μ1 is the test population mean
    • σd = population standard error of d. For a Z-test σd = σ/√n, the standard error of the reference population mean, which is usually calculated as the standard deviation of the reference population observations (σ) divided by the square root of the number of observations in the sample (n).
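These three formulae translate directly into code. A minimal sketch using the notation above (norm.sf gives the upper-tail probability P[Z > x], and norm.cdf the lower-tail probability P[Z < x]):

```python
from scipy.stats import norm

def z_power(delta, sigma_d, alpha=0.05, tail='two'):
    """Power of the Z-test, from formulae a, b and c above."""
    z_delta = delta / sigma_d
    if tail == 'upper':                   # (a) one-tailed, upper tail
        return norm.sf(norm.ppf(1 - alpha) - z_delta)
    if tail == 'lower':                   # (b) one-tailed, lower tail
        return norm.cdf(-norm.ppf(1 - alpha) - z_delta)
    z_crit = norm.ppf(1 - alpha / 2)      # (c) two-tailed
    return norm.sf(z_crit - z_delta) + norm.cdf(-z_crit - z_delta)

print(z_power(1.6449, 1))                 # two-tailed, z_delta = +z_alpha
print(z_power(1.6449, 1, tail='upper'))   # = 0.5, as noted under (a)
print(z_power(0, 1, tail='upper'))        # delta = 0 gives power = alpha
```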

     

    Estimating sample size

    We rearrange the formula for power to give the sample size (number of observations) required to obtain a given power.

    Algebraically speaking:

    For a one-tailed test:

    n   =   (zα + zβ)² σ² / (μ1 − μ0)²
    where

    • zα is obtained from your probability calculator or tables, given that P(Z < zα) = 1 − α and α is the significance level.
    • zβ is obtained from your probability calculator or tables, given that P(Z < zβ) = 1 − β and 1 − β is the power.
    • μ0 is the known population mean,
    • μ1 is the test population mean
    • σ is the known population standard deviation of the observations

    For a two-tailed test, we use an approximation and use zα/2 in place of zα. This ignores the possibility of a type III error, but for large treatment effects, will not usually introduce any serious error.

    The following values of zα and zβ are those most frequently used in sample size calculations:

    Significance level:      5%        1%
      One-tailed (zα)        1.6449    2.3263
      Two-tailed (zα/2)      1.9600    2.5758

    Power:                   80%       90%       95%
      zβ                     0.8416    1.2816    1.6449
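As a check, these quantiles - and the resulting sample size - can be reproduced directly (the means and standard deviation below are illustrative):

```python
from scipy.stats import norm
import math

alpha, power = 0.05, 0.90
z_a = norm.ppf(1 - alpha / 2)  # 1.9600, two-tailed critical value
z_b = norm.ppf(power)          # 1.2816

mu0, mu1, sigma = 10.0, 12.0, 4.0  # illustrative population parameters
n = (z_a + z_b) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
print(math.ceil(n))            # n is rounded up, here to 43 observations
```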

     

     

    Assumptions

    You are making a number of assumptions when you estimate power and required sample size. The first group of assumptions applies to all significance tests, namely:

    1. Samples are taken randomly, or individuals are allocated randomly to treatment groups.
    2. Observations are independent of each other.

    The second set of assumptions applies specifically to the Z-test:

    1. The response variable approximates to a normal distribution.
    2. The true mean and standard deviation of the population are known and not estimated from a sample.

    Related topics:

    Efficiency of tests
