Approach to multiple comparisons

Multiple comparisons remains a controversial issue. It is complicated by there being two main situations in which adjustments are commonly made:

  (a) comparison of the means of multiple treatment levels following an analysis of variance (the situation we are concerned with here);
  (b) use of multiple outcome measures in an experiment or observational study (for example, some clinical trials have only one primary response variable but up to 20 secondary response variables, and an ecological observational study may test the same hypothesis on many different species).

As we point out in the references, many statisticians (especially medical statisticians) are, sometimes vehemently, opposed to any adjustment to control the per-experiment error rate. Others accept adjustments for post-ANOVA comparisons, but not for multiple outcome measures. Still others argue for rigorous adjustments to maintain a 5% Type I error rate.

If we focus on the post-ANOVA situation, several points come out strongly.

  1. It is always better to design a study with the aim of testing specific (alternative) hypotheses - rather than deciding what you wish to test after having done the study. With a well-designed study, you can preplan a limited number of independent (or orthogonal) contrasts to test your hypotheses. An important advantage of this approach is that you do not need to make any adjustments to control the per-experiment error rate when comparing such means, which increases the power of the tests.
  2. Do not restrict yourself to pairwise comparisons. Very often combined mean comparisons can be much more interesting (for example, comparing response to a control with the mean of responses to two different, but related treatments). This principle is taken much further with factorial designs, but can often be applied after a one-way ANOVA.
  3. There is nothing wrong with making (a few) unplanned comparisons based on anything unexpected that emerges in the results - that is, after all, what research is all about! But in that situation you should use a test which provides adequate protection against Type I error - and you should be able to defend your particular choice of test.

 

 

A.   Planned orthogonal comparisons

  1. Partitioning treatment sums of squares

    This method is recommended for carrying out a set of linear contrasts (both pairwise and combined mean comparisons) which are orthogonal (statistically independent). If the study has been well designed with clear hypotheses to test, this set of orthogonal contrasts may well include all the comparisons of interest.

      For example - a treatment (sample 3) is compared with two controls - a negative control (sample 1) and a procedural control (sample 2). A logical and meaningful set of contrasts would be to first compare the mean of the negative control with the mean of the procedural control, and then compare the combined mean of the two control samples with the treatment mean.

      $\bar{x}_1$ versus $\bar{x}_2$    and    $(\bar{x}_1 + \bar{x}_2)/2$ versus $\bar{x}_3$

      These comparisons (known as linear contrasts) are orthogonal because they are statistically independent: the cross-products of their coefficients sum to zero, so no comparison uses information already used by another. For k samples, there is a maximum of k − 1 mutually orthogonal contrasts - in this case two.

    The significance of each of these linear contrasts is assessed by partitioning the treatment sums of squares obtained in a standard analysis of variance. In order to do this you need to determine the contrast coefficients for each of the two contrasts. These are the coefficients in the linear equation describing a contrast (C), namely:

    $C = c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3$

    Values of coefficients are determined thus:

    1. If there are the same number of sample means in each comparison group, assign coefficients of +1 to the members of one group and -1 to the other group.
    2. If there are a different number of sample means in each group, assign to the first group coefficients equal to the number of means in the second group and to the second group coefficients of the opposite sign equal to the number of means in the first group.
    3. Reduce the coefficients to the smallest possible integers by dividing through by any common factor, and then scale them so that the positive coefficients sum to one and the negative coefficients sum to minus one - so the overall sum of the coefficients is zero.

    We can now rewrite the orthogonal set of contrasts above with their coefficients:

    Contrast    Null hypothesis             Coefficients
    C1          H0: μ1 = μ2                 +1, −1, 0
    C2          H0: (μ1 + μ2)/2 = μ3        +1/2, +1/2, −1

    Note that some authorities prefer to express the coefficients as integers. A little practice should enable you to readily draw up an orthogonal set - but you should always carry out a formal check on whether the set is orthogonal. The procedure for doing this is detailed in the related topic on checking orthogonality.

    It is then simply a matter of partitioning the treatment sums of squares. The sums of squares for each contrast are given by:

    Algebraically speaking -

    $SS_{C_i} = \frac{n \left( \sum c_i \bar{x}_i \right)^2}{\sum c_i^2}$

    where:
    • $C_i$ are the contrasts,
    • $c_i$ are the contrast coefficients,
    • $\bar{x}_i$ are the treatment means,
    • n is the number of replicates per group.

    For an orthogonal set of contrasts, each contrast will have one degree of freedom. Hence the sums of squares are equal to the mean squares for each contrast. The contrast mean squares are then each divided by the mean square error to obtain F-ratios, which are tested in the usual way.
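    As a minimal sketch of how this partitioning can be carried out in R (the data, group names, and sample sizes below are invented purely for illustration):

        # Hypothetical example: three groups of n = 10 replicates
        # (negative control, procedural control, treatment)
        set.seed(1)
        resp <- c(rnorm(10, 10), rnorm(10, 10.2), rnorm(10, 12))
        grp  <- factor(rep(c("negCtrl", "procCtrl", "treat"), each = 10))

        # Attach the orthogonal contrast coefficients to the factor
        contrasts(grp) <- cbind(C1 = c(1, -1, 0),      # control 1 vs control 2
                                C2 = c(1/2, 1/2, -1))  # controls vs treatment

        fit <- aov(resp ~ grp)
        # Partition the treatment sums of squares: one line (1 df) per contrast
        summary(fit, split = list(grp = list("C1: ctrl1 vs ctrl2" = 1,
                                             "C2: ctrls vs treat" = 2)))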

     

  2. Fisher's protected least significant difference (LSD)

    This is the most widely (mis)used multiple comparison test. The protection referred to derives from the test only being used after a significant treatment effect has been found in an ANOVA. However, this does not control the per-experiment error rate, so the test should only be used under very specific circumstances - namely for comparisons that are both preplanned and orthogonal. Some authorities also insist it should only be used when there are fewer than four treatment means - but since, under the conditions specified above, the test is exactly equivalent to partitioning the treatment sums of squares, this extra requirement seems unnecessary. The least significant difference is calculated as below:

    Algebraically speaking -

    $LSD = t_{\alpha(df)} \sqrt{\frac{2\,MS_{error}}{n}}$

    where
    • $t_{\alpha}$ is a quantile from the t-distribution for the chosen Type I error rate (α), with the same number of degrees of freedom as MSerror,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size, assuming the same number of replicates in each group.

    Any preplanned difference between means that exceeds the least significant difference is accepted as significant at the chosen level of α.
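    As a sketch of the calculation in R, reusing the fit, resp, and grp objects from the example above (two-sided α = 0.05, n = 10 replicates per group):

        # Fisher's LSD following a significant one-way ANOVA
        MSerr <- deviance(fit) / df.residual(fit)    # mean square error
        n     <- 10                                  # replicates per group
        LSD   <- qt(0.975, df.residual(fit)) * sqrt(2 * MSerr / n)
        LSD
        # Compare against the observed differences between group means
        grp.means <- tapply(resp, grp, mean)
        outer(grp.means, grp.means, "-")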

     

  3. Bonferroni & Dunn-Sidak

    These methods protect against Type I errors by controlling the per-experiment error rate. Strictly speaking they are only appropriate for planned orthogonal contrasts, but unlike the LSD test they are often recommended for multiple (>3) planned orthogonal comparisons. They are also often recommended for planned but non-orthogonal contrasts, although in that situation they are conservative. They should not be used for all possible pairwise comparisons.

    The Bonferroni correction can be applied using a modified least significant difference, namely:

    Algebraically speaking -

    $LSD_{Bonferroni} = t_{b(df)} \sqrt{\frac{2\,MS_{error}}{n}}$

    where
    • $t_b$ is a quantile from the t-distribution at the adjusted significance level b = α/r, where r is the number of comparisons, with the same number of degrees of freedom as MSerror,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size, assuming the same number of replicates in each group.

    The Dunn-Sidak correction can be applied in the same way, namely:

    Algebraically speaking -

    $LSD_{Dunn\text{-}Sidak} = t_{d(df)} \sqrt{\frac{2\,MS_{error}}{n}}$

    where
    • $t_d$ is a quantile from the t-distribution at the adjusted significance level d = 1 − (1 − α)^{1/r}, where r is the number of comparisons, with the same number of degrees of freedom as MSerror,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size, assuming the same number of replicates in each group.
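    A minimal sketch of both corrections in R, assuming r = 2 planned comparisons and reusing the fit, MSerr, and n objects defined earlier:

        alpha <- 0.05; r <- 2
        b <- alpha / r                    # Bonferroni-adjusted level
        d <- 1 - (1 - alpha)^(1 / r)      # Dunn-Sidak-adjusted level
        LSD.bonf  <- qt(1 - b / 2, df.residual(fit)) * sqrt(2 * MSerr / n)
        LSD.sidak <- qt(1 - d / 2, df.residual(fit)) * sqrt(2 * MSerr / n)
        c(Bonferroni = LSD.bonf, DunnSidak = LSD.sidak)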

 

 

B.   Planned non-orthogonal comparisons

  1. Comparing treatments with a control - Dunnett's test

    Dunnett's test can be applied in the same way as the tests above, but using critical values tabulated by Dunnett in place of quantiles from the t-distribution, namely:

    Algebraically speaking -

    $LSD_{Dunnett} = d_{\alpha(df)} \sqrt{\frac{2\,MS_{error}}{n}}$

    where
    • $d_{\alpha}$ is the critical value for Dunnett's test at the chosen Type I error rate (α), with the same number of degrees of freedom as MSerror; critical values can be obtained from tables on the web or from software,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size.
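    In R, Dunnett's test is most easily run through an add-on package; the sketch below assumes the multcomp package is installed and compares each treatment with the first factor level, taken here as the control:

        library(multcomp)
        # Each treatment versus the control (the first level of grp)
        dun <- glht(fit, linfct = mcp(grp = "Dunnett"))
        summary(dun)    # adjusted p-values
        confint(dun)    # simultaneous confidence intervals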

 

 

C.   Unplanned pairwise comparisons

  1. Tukey's Honestly Significant Difference

    Tukey's test is a simultaneous inference method, and is the most widely used method for making all possible pairwise comparisons amongst a group of means. In its original form, sample sizes were assumed to be equal, and a single value of the studentized range statistic gives the same shortest significant range for all comparisons. Kramer modified the method so it could be used with unequal group sizes, in effect using the harmonic mean of the sample sizes of the two groups being compared. The first formulation below is for equal sample sizes, whilst the second is for unequal group sizes:

    Algebraically speaking -

    $HSD_{Tukey} = Q_{\alpha(k,df)} \sqrt{\frac{MS_{error}}{n}}$

    $HSD_{Tukey\text{-}Kramer} = Q_{\alpha(k,df)} \sqrt{\frac{MS_{error}}{2} \left( \frac{1}{n_A} + \frac{1}{n_B} \right)}$

    where
    • $Q_{\alpha(k,df)}$ is the value of the studentized range statistic for the total number of treatments (k) at the chosen Type I error rate (α), with the same degrees of freedom as MSerror; values of Q can be obtained using R or on the web,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the common sample size, and nA and nB are the sizes of the two groups being compared.
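    Tukey's test is built into base R; a minimal sketch using the fit object from earlier (TukeyHSD applies the Tukey-Kramer adjustment automatically when group sizes differ):

        TukeyHSD(fit, conf.level = 0.95)    # all pairwise comparisons
        # The underlying studentized range quantile for k = 3 groups:
        qtukey(0.95, nmeans = 3, df = df.residual(fit))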

    Tukey's HSD is well accepted in the literature, and its use is recommended. It is, however, conservative, and one of the multiple-stage tests may be preferred if the desire is to maximize power. Several other methods are available for unequal numbers of replicates, including Spjøtvoll & Stoline's T' method and Hochberg's GT2 method; however, both of these tend to be more conservative than the Tukey-Kramer method. Full details can be found in Sokal & Rohlf (1995) if required.

     

  2. Student-Newman-Keuls Test

    This is described variously as a stepwise or multiple-stage test. The range statistic varies for each pairwise comparison as a function of the number of means spanned by the two being compared, so a different shortest significant range is computed for each pairwise comparison of means.

    Means are first ordered by rank, and the largest and smallest means are tested. If there is no significant difference between them, testing stops there and it is concluded that none of the means differ significantly. Otherwise, the pairs with the next greatest difference are tested using a different shortest significant range, and testing continues until no further significant differences are found.

    Algebraically speaking -

    $SSR = Q_{\alpha(m,df)} \sqrt{\frac{MS_{error}}{n}}$

    where
    • $Q_{\alpha(m,df)}$ is the value of the studentized range statistic for the number of means spanned in the particular comparison (m) at the chosen Type I error rate (α), with the same degrees of freedom as MSerror; values of Q can be obtained using R or on the web,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size.
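    As a sketch, the shortest significant ranges for each span m = 2..k can be computed in R with qtukey (equal group sizes assumed, reusing the MSerr, n, and fit objects from earlier):

        k <- 3
        m <- 2:k
        Q <- sapply(m, function(mm) qtukey(0.95, nmeans = mm,
                                           df = df.residual(fit)))
        SSR <- Q * sqrt(MSerr / n)    # one shortest significant range per span
        setNames(SSR, paste0("m=", m))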

    Such tests are valid only when group sample sizes are equal. If sample sizes are unequal, the results can be non-intuitive - for example, A may differ significantly from B, and B from C, yet A may not differ significantly from C. The Student-Newman-Keuls (SNK) test is more powerful than Tukey's method, so it will detect real differences more frequently.

    However, in some situations the Student-Newman-Keuls test offers poor protection against a type I error. This is especially the case when treatment means fall into groups which are themselves widely spaced apart. Differences between means within groups will be significant more often than they should be at the specified level of α.

     

    The Student-Newman-Keuls test is not as bad in this respect as another widely used test - Duncan's multiple range test. This is a modification of the Student-Newman-Keuls test that uses increasing α-levels to calculate critical values at each step of the above procedure. The test is implemented using tables prepared by Duncan which give the appropriate Q value for a given number of treatments (k). When k=2 the two procedures have identical values; for values of k larger than 2, the Duncan procedure has the smaller critical value.

    This means that the Duncan test is more liberal in detecting differences, a point defended by Duncan on the basis that the global null hypothesis is often (nearly always?) false, and hence most statisticians tend to overprotect it against Type I errors. However, few statisticians support him in this, mainly because the test fails to control the familywise error rate at the nominal α-level. In addition, many journals will not accept it, so the 'struggling' research scientist has little choice but to avoid the test.

     

  3. Ryan's Q Test

    Ryan's Q test follows the same stepwise procedure as the Student-Newman-Keuls test, but uses an adjusted significance level for each comparison so that the overall Type I error rate is controlled.

    Algebraically speaking -

    $SSR_{Ryan} = Q_{b(m,df)} \sqrt{\frac{MS_{error}}{n}}$

    where
    • $Q_{b(m,df)}$ is the value of the studentized range statistic for the number of means spanned in the particular comparison (m) at the adjusted significance level b = 1 − (1 − α)^{m/k}, with the same degrees of freedom as MSerror; values of Q for these adjusted (non-standard) probabilities are most readily obtained using R,
    • MSerror is the mean square error from the ANOVA table (for a one-way ANOVA the error df = N − k, where k is the number of treatments and N is the total number of observations),
    • n is the sample size.
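    A sketch of the adjusted quantiles in R (α = 0.05, k = 3, reusing the MSerr, n, and fit objects from earlier):

        alpha <- 0.05; k <- 3
        m <- 2:k
        b <- 1 - (1 - alpha)^(m / k)    # Ryan-adjusted level for each span
        Q <- sapply(seq_along(m),
                    function(i) qtukey(1 - b[i], nmeans = m[i],
                                       df = df.residual(fit)))
        SSR.ryan <- Q * sqrt(MSerr / n)
        setNames(SSR.ryan, paste0("m=", m))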

 

 

D.   Unplanned pairwise and combined mean comparisons

  1. Scheffé's method

    Scheffé's method is a simultaneous inference method that can be applied to all possible contrasts among the means, not just the pairwise differences. Each contrast of interest is set up such that the sum of the coefficients is equal to zero, and then estimated as follows:

    Algebraically speaking -

    $C = \sum c_i \bar{x}_i$

    where
    • C is the contrast,
    • $c_i$ is the coefficient for the ith treatment mean, $\bar{x}_i$.

    The simultaneous 100(1 − α)% confidence limits for a contrast are given by:

    Algebraically speaking -

    $CL = C \pm \sqrt{(k - 1)\, F_{\alpha,\, k-1,\, N-k}}\; s_C$

    where
    • C is the specified contrast,
    • k is the number of treatments or groups,
    • $F_{\alpha,\,k-1,\,N-k}$ is a quantile from the F-distribution with k − 1 and N − k degrees of freedom,
    • N is the total number of observations,
    • $s_C$ is the standard error of the contrast, equal to $\sqrt{MS_{error} \sum (c_i^2 / n_i)}$.
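    A minimal sketch in R for the combined-mean contrast from section A (controls versus treatment), reusing the resp, grp, and MSerr objects defined earlier:

        # Scheffe 95% simultaneous confidence limits for one contrast
        cvec  <- c(1/2, 1/2, -1)                  # contrast coefficients
        means <- tapply(resp, grp, mean)
        Chat  <- sum(cvec * means)                # estimated contrast
        k <- nlevels(grp); N <- length(resp)
        sC   <- sqrt(MSerr * sum(cvec^2 / table(grp)))    # SE of the contrast
        crit <- sqrt((k - 1) * qf(0.95, k - 1, N - k))    # Scheffe critical value
        Chat + c(lower = -1, upper = 1) * crit * sC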

Note that the R function pairwise.t.test computes all possible two-group comparisons, making adjustments for multiple comparisons if required.

e.g. pairwise.t.test(con, trt, p.adj = "bonferroni") - if p.adj is omitted, the default adjustment is Holm's method.

 

 

Assumptions

All the multiple comparison tests discussed thus far have the same assumptions as ANOVA itself - data within each treatment group are normally distributed, and each treatment group has equal variance. Violations of these assumptions will result in a loss of power to detect differences which are actually present.

Related topics: Checking orthogonality