Multiple comparison tests after parametric ANOVA

Contents: Approach to multiple comparisons; Planned orthogonal comparisons; Fisher's protected least significant difference (LSD); Bonferroni & Dunn-Sidak; Planned non-orthogonal comparisons; Unplanned pairwise comparisons; Tukey's Honestly Significant Difference; Student-Newman-Keuls Test; Ryan's Q Test; Unplanned pairwise & combined mean comparisons; Assumptions
Approach to multiple comparisons
The issue of multiple comparisons remains controversial. It is complicated by there being two main situations in which adjustments are commonly made. These are (a) comparison of means of multiple treatment levels following an analysis of variance (what we are concerned with here); (b) use of multiple outcome measures in an experiment or observational study (for example, some clinical trials may have only one primary response variable but up to 20 secondary response variables; an ecological observational study may test the same hypothesis on many different species). As we point out in the references, many statisticians (especially medical statisticians) are (sometimes vehemently) opposed to any adjustments to control the per-experiment error rate. Others accept adjustments for post-ANOVA comparisons, but not for multiple outcome measures. Others again argue for rigorous adjustments to maintain a 5% familywise type I error rate.
If we focus on the post-ANOVA situation, several points come out strongly.
A. Planned orthogonal comparisons
Partitioning treatment sums of squares
This method is recommended for carrying out a set of linear contrasts (both pairwise and combined mean comparisons) which are orthogonal (statistically independent). If the study has been well designed with clear hypotheses to test, this set of orthogonal contrasts may well include all the comparisons of interest.
For example - a treatment (sample 3) is compared with two controls - a negative control (sample 1) and a procedural control (sample 2). Two comparisons are then of interest: the negative control versus the procedural control, and the treatment versus the combined mean of the two controls.

These comparisons (known as linear contrasts) are orthogonal because neither comparison overlaps with the other - formally, two contrasts are orthogonal if the sum of the products of their corresponding coefficients is zero.

The significance of each of these linear contrasts is assessed by partitioning the treatment sums of squares obtained in a standard analysis of variance. In order to do this you need to determine the contrast coefficients for each of the two contrasts. These are the coefficients in the linear equation describing a contrast - that is, the weights ci in the weighted sum of treatment means L = Σ ci ȳi, where the ci sum to zero.
Values of coefficients are determined thus: means excluded from a comparison are given a coefficient of zero; each mean in one group of the comparison receives a coefficient equal to the number of means in the other group, and vice versa, with the two groups taking opposite signs so that the coefficients sum to zero.

We can now rewrite the orthogonal set of contrasts above with their coefficients:

Contrast 1 (negative control vs procedural control): +1, -1, 0
Contrast 2 (both controls vs treatment): -1, -1, +2

Note that some authorities prefer to express the coefficients as fractions, so that the coefficients on each side of a comparison sum to ±1 (for example -1/2, -1/2, +1 for the second contrast); since contrast sums of squares are unaffected by rescaling the coefficients, this makes no difference to the resulting F-ratios.
It is then simply a matter of partitioning the treatment sums of squares. The sums of squares for each contrast are given by:

SScontrast = (Σ ci ȳi)² / Σ (ci² / ni)

where ȳi is the mean and ni the number of replicates of treatment level i. With equal sample sizes (n) this reduces to n(Σ ci ȳi)² / Σ ci².
For an orthogonal set of contrasts, each contrast will have one degree of freedom. Hence the sums of squares are equal to the mean squares for each contrast. The contrast mean squares are then each divided by the mean square error to obtain F-ratios, which are tested in the usual way.
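By way of illustration, the partitioning can be sketched in Python. The data, group labels and contrast set below are hypothetical, and SciPy and NumPy are assumed to be available:

```python
# Illustrative partition of the treatment sums of squares into orthogonal
# contrasts; data and contrast set are hypothetical, SciPy/NumPy assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5                                    # replicates per treatment level
data = [rng.normal(10, 2, n),            # sample 1: negative control
        rng.normal(11, 2, n),            # sample 2: procedural control
        rng.normal(14, 2, n)]            # sample 3: treatment
k = len(data)
means = np.array([d.mean() for d in data])
grand_mean = np.concatenate(data).mean()

# Error (within-group) mean square from the one-way ANOVA
ss_error = sum(((d - d.mean()) ** 2).sum() for d in data)
df_error = n * k - k
ms_error = ss_error / df_error

# Treatment sums of squares (equal n)
ss_treat = n * ((means - grand_mean) ** 2).sum()

# Orthogonal contrasts: controls vs each other; treatment vs both controls
contrasts = [np.array([1, -1, 0]), np.array([-1, -1, 2])]
ss_contrast = []
for c in contrasts:
    L = (c * means).sum()
    ss = n * L ** 2 / (c ** 2).sum()     # SS = n L^2 / sum(c_i^2); 1 df each
    F = ss / ms_error                    # contrast MS = contrast SS (1 df)
    p = stats.f.sf(F, 1, df_error)
    ss_contrast.append(ss)
    print(f"coefficients {c}: SS = {ss:.3f}, F = {F:.2f}, p = {p:.4f}")
```

For a complete orthogonal set the contrast sums of squares add up exactly to the treatment sums of squares, which makes a useful arithmetic check.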
Fisher's protected least significant difference (LSD)
This is the most widely (mis)used multiple comparison test. The protection referred to derives from the test only being used after finding a significant treatment effect in an analysis of variance. The least significant difference for comparing two means is given by:

LSD = t(α, dferror) × √(2 MSerror / n)

where t(α, dferror) is the two-tailed critical value of Student's t at significance level α for the error degrees of freedom, MSerror is the error mean square from the ANOVA, and n is the number of replicates per group.

Any of the preplanned contrasts greater than that difference is accepted as significant at the chosen level of α.
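As a minimal sketch (the ANOVA quantities below are hypothetical, and SciPy is assumed), the protected LSD might be computed as:

```python
# Fisher's protected LSD: only applied after a significant ANOVA F-test.
# ms_error, df_error and n are hypothetical values from such an ANOVA.
from math import sqrt
from scipy import stats

ms_error, df_error, n = 4.0, 12, 5       # error mean square, error df, replicates
alpha = 0.05

# LSD = t(alpha, df_error) * sqrt(2 * MS_error / n), two-tailed t
lsd = stats.t.ppf(1 - alpha / 2, df_error) * sqrt(2 * ms_error / n)
print(f"LSD = {lsd:.3f}")
```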
Bonferroni & Dunn-Sidak
These methods protect against type I errors by controlling the per-experiment error rate. Strictly speaking they are only appropriate for planned orthogonal contrasts. But unlike the LSD test, they are often recommended for multiple (>3) planned orthogonal comparisons. They are also often recommended for planned but non-orthogonal contrasts, but in this latter situation they are conservative. They should not be used for all possible pairwise comparisons.
The Bonferroni correction can be applied using a modified least significant difference, in which α is replaced by α/k, where k is the number of planned comparisons:

LSD(Bonferroni) = t(α/k, dferror) × √(2 MSerror / n)
The Dunn-Sidak correction can be applied in the same way, but replacing α/k with α' = 1 - (1 - α)^(1/k):

LSD(Dunn-Sidak) = t(α', dferror) × √(2 MSerror / n)

Since α' is always slightly larger than α/k, the Dunn-Sidak correction is very slightly less conservative than the Bonferroni correction.
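As an illustration (hypothetical ANOVA quantities; SciPy assumed), the two corrections compare as follows:

```python
# Bonferroni and Dunn-Sidak adjusted least significant differences for
# k planned comparisons; ANOVA quantities are hypothetical, SciPy assumed.
from math import sqrt
from scipy import stats

ms_error, df_error, n = 4.0, 12, 5
alpha, k = 0.05, 3                        # k = number of planned comparisons

alpha_bonf = alpha / k                    # Bonferroni adjusted alpha
alpha_sidak = 1 - (1 - alpha) ** (1 / k)  # Dunn-Sidak adjusted alpha

se = sqrt(2 * ms_error / n)
lsd_bonf = stats.t.ppf(1 - alpha_bonf / 2, df_error) * se
lsd_sidak = stats.t.ppf(1 - alpha_sidak / 2, df_error) * se
print(f"Bonferroni LSD = {lsd_bonf:.3f}, Dunn-Sidak LSD = {lsd_sidak:.3f}")
```

The Dunn-Sidak adjusted α is always very slightly larger than α/k, so its significant difference is marginally smaller.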
B. Planned non-orthogonal comparisons
Comparing treatments with a control - Dunnett's test
Dunnett's test can be applied in the same way as the tests above, but using critical values tabulated by Dunnett in place of quantiles from the t-distribution, namely:

Difference = d(α, k, dferror) × √(2 MSerror / n)

where d(α, k, dferror) is Dunnett's critical value for k treatment groups (excluding the control) and the error degrees of freedom.
C. Unplanned pairwise comparisons
Tukey's Honestly Significant Difference (HSD)

Tukey's test is a simultaneous inference method. If sample sizes are equal, it uses one range value to calculate the same shortest significant range for all comparisons. It is the most widely used method for making all possible pairwise comparisons amongst a group of means. In its original form, sample sizes were assumed to be equal. Kramer modified the method so it could be used for unequal group sample sizes, using the harmonic mean of the sample sizes of the groups being compared. The first formulation below is for equal sample sizes, whilst the second is for unequal group sizes:

HSD = q(α, k, dferror) × √(MSerror / n)

HSD = q(α, k, dferror) × √((MSerror / 2)(1/ni + 1/nj))

where q(α, k, dferror) is the critical value of the studentized range for k means and the error degrees of freedom.
Tukey's HSD is well accepted in the literature, and its use is recommended. It is, however, conservative, and one of the multiple-stage tests may be preferred if the desire is to maximize power. Several other methods are available for uneven numbers of replicates, including Spjøtvoll & Stoline's T' method and Hochberg's GT2 method. However, both of these tend to be even more conservative than the Tukey-Kramer method. Full details can be found in Sokal & Rohlf (1995) if required.
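The two formulations can be sketched using the studentized range distribution in SciPy (the ANOVA quantities below are hypothetical):

```python
# Tukey's HSD (equal n) and the Tukey-Kramer version (unequal n), using
# scipy.stats.studentized_range; ANOVA quantities are hypothetical.
from math import sqrt
from scipy import stats

ms_error, df_error, k = 4.0, 12, 3       # error MS, error df, number of groups
alpha = 0.05
q = stats.studentized_range.ppf(1 - alpha, k, df_error)

n = 5                                    # equal group sizes
hsd = q * sqrt(ms_error / n)             # HSD = q * sqrt(MS_error / n)

n_i, n_j = 4, 6                          # unequal group sizes (Tukey-Kramer)
hsd_tk = q * sqrt((ms_error / 2) * (1 / n_i + 1 / n_j))
print(f"HSD = {hsd:.3f}, Tukey-Kramer = {hsd_tk:.3f}")
```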
Student-Newman-Keuls Test

This is described variously as a stepwise or multiple-stage test. The range statistic varies for each pairwise comparison as a function of the number of ordered group means spanned by the two being compared. A different shortest significant range is computed for each pairwise comparison of means.
Means are first ordered by rank, and the largest and smallest means are tested. If there is no significant difference, testing stops there and it is concluded that none of the means is significantly different. Otherwise, the pair of means with the next largest difference is tested using a different shortest significant range. Testing is continued until no further significant differences are found.
Such tests are valid only when group sample sizes are equal. With unequal sample sizes the results can be non-intuitive: for example, A may be significantly greater than B, and B than C, yet A may not be significantly different from C. The Student-Newman-Keuls (SNK) test is more powerful than Tukey's method, so it will detect real differences more frequently.
However, in some situations the Student-Newman-Keuls test offers poor protection against a type I error. This is especially the case when treatment means fall into groups which are themselves widely spaced apart. Differences between means within groups will be significant more often than they should be at the specified level of α.
The Student-Newman-Keuls test is not as bad in this respect as another widely used test - Duncan's multiple range test. This is a modification of the Student-Newman-Keuls test that uses increasing α-levels to calculate critical values at each step of the above procedure. The test is implemented using tables prepared by Duncan which give the appropriate Q value for a given number of treatments (k). When k=2 the two procedures have identical values; for values of k larger than 2, the Duncan procedure has the smaller critical value.
This means that the Duncan test is more liberal in detecting differences, a point defended by Duncan on the basis that the global null hypothesis is often (nearly always?) false, and hence that most statisticians overprotect against type I errors. However, few statisticians support him in this, mainly because the test fails to control the familywise error rate at the nominal α-level. In addition, many journals will not accept it, so the 'struggling' research scientist has little choice but to avoid the test.
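The stepwise logic described above can be sketched by computing a shortest significant range for each span p of ordered means (hypothetical ANOVA quantities; SciPy assumed):

```python
# Student-Newman-Keuls shortest significant ranges: one range for each
# number p of ordered means spanned by a comparison; values hypothetical.
from math import sqrt
from scipy import stats

ms_error, df_error, n, k = 4.0, 12, 5, 4
alpha = 0.05

# SSR_p = q(alpha, p, df_error) * sqrt(MS_error / n) for p = 2 .. k
ssr = {p: stats.studentized_range.ppf(1 - alpha, p, df_error) * sqrt(ms_error / n)
       for p in range(2, k + 1)}
for p, r in sorted(ssr.items()):
    print(f"p = {p} means spanned: shortest significant range = {r:.3f}")
```

Note how the range widens as more means are spanned; adjacent means in the ranked list face the smallest critical range.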
Ryan's Q Test

Ryan's Q test (also known as the Ryan-Einot-Gabriel-Welsch or REGWQ procedure) is a multiple-stage test similar in operation to the Student-Newman-Keuls test, but it uses adjusted significance levels (of the form αp = 1 - (1 - α)^(p/k) for a span of p out of k means) at each stage. This keeps the familywise error rate at the nominal α-level, so it avoids the poor type I error control of the Student-Newman-Keuls and Duncan tests whilst remaining more powerful than Tukey's HSD. Like the other stepwise tests, it requires equal sample sizes.
D. Unplanned pairwise and combined mean comparisons
Scheffé's method is a simultaneous inference method that can be applied to all possible contrasts among the means, not just the pairwise differences. Each contrast of interest is set up such that the sum of the coefficients is equal to zero, and then estimated as follows:

L = Σ ci ȳi   (with Σ ci = 0)
The simultaneous 100(1-α)% confidence limits for a contrast are given by:

L ± √((k - 1) F(α; k - 1, dferror)) × sL,   where sL = √(MSerror Σ(ci² / ni))

Any contrast whose interval excludes zero is declared significant at level α.
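A sketch of these limits for a single contrast, using hypothetical means and ANOVA quantities (SciPy assumed):

```python
# Scheffe's simultaneous confidence limits for an arbitrary contrast among
# k means; the means and ANOVA quantities below are hypothetical.
from math import sqrt
import numpy as np
from scipy import stats

means = np.array([10.2, 11.1, 14.3])     # hypothetical group means
n = np.array([5, 5, 5])                  # group sample sizes
ms_error, df_error, k = 4.0, 12, 3
alpha = 0.05

c = np.array([-0.5, -0.5, 1.0])          # treatment vs mean of the two controls
assert np.isclose(c.sum(), 0.0)          # coefficients must sum to zero

L = (c * means).sum()                    # estimated contrast
se = sqrt(ms_error * ((c ** 2) / n).sum())
S = sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, df_error))
lo, hi = L - S * se, L + S * se
print(f"contrast = {L:.2f}, simultaneous 95% limits = ({lo:.2f}, {hi:.2f})")
```

The Scheffé multiplier S always exceeds the corresponding t critical value, which is the price paid for covering every possible contrast simultaneously.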
Note that an R function called pairwise.t.test computes all possible two-group comparisons, making adjustments for multiple comparisons if required, e.g.:

pairwise.t.test(con, trt, p.adjust.method = "bonferroni")   # the default adjustment is Holm's method
Assumptions

All the multiple comparison tests discussed thus far have the same assumptions as ANOVA: data within each treatment group are normally distributed, and each treatment group has equal variance. Violations of these assumptions will result in a loss of power to detect differences which are actually present.
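These assumptions can be checked informally before applying any of the tests above; a sketch using invented data (SciPy assumed), with the Shapiro-Wilk test for within-group normality and Levene's test for equality of variances:

```python
# Informal checks of the ANOVA/MCT assumptions: Shapiro-Wilk within each
# group, Levene's test across groups. The data below are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(10, 2, 8), rng.normal(12, 2, 8), rng.normal(14, 2, 8)]

for i, g in enumerate(groups, 1):
    w, p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk p = {p:.3f}")   # small p: doubt normality

stat, p_levene = stats.levene(*groups)
print(f"Levene's test p = {p_levene:.3f}")          # small p: doubt equal variances
```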