"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Multiple comparison tests after ANOVA: Use & misuse

Statistics courses, especially for biologists, assume that formulae equal understanding. They teach how to do statistics, but largely ignore what those procedures assume and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

Multiple comparison tests are little used by medical researchers - partly because multiple treatment experiments are less common, and partly because some medical statisticians are vehemently opposed to any adjustments for multiple comparisons. But they are used extensively (some would say far too extensively) by most other applied biologists. A great deal has been written in recent years about 'which test to use', but a review of the literature reveals continuing confusion.

Most texts make it clear that it is not justifiable to do multiple comparisons of means if the treatment effect in the ANOVA is not significant. Nevertheless this misuse can still be found in the literature. But a much more common problem is the use of a multiple comparison test for comparing ordered means - in such situations a test for trend or regression analysis is much more informative (and powerful). There is also confusion about which measure of location to report, especially following a transformation. In general, detransformed means should be presented rather than the original untransformed means. A further issue is that most researchers only use pairwise tests. Orthogonal contrasts, which include combined mean comparisons, are often more appropriate. We give examples (such as comparing shrimps in two habitat types) where the latter approach would have been both more informative and more powerful.
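To make the idea of orthogonal contrasts concrete, here is a minimal sketch in Python. The data, site names and contrast coefficients are all invented for illustration (loosely echoing a two-habitat comparison) and are not taken from any study cited here. It assumes a balanced design, where two contrasts are orthogonal if the products of their coefficients sum to zero, and where a full set of orthogonal contrasts partitions the treatment sum of squares.

```python
from itertools import combinations
from statistics import mean

# Hypothetical catches at four sites, two in each habitat type (balanced: n = 3).
catches = {
    "habitat 1, site A": [4.0, 5.0, 6.0],
    "habitat 1, site B": [6.0, 7.0, 8.0],
    "habitat 2, site C": [9.0, 10.0, 11.0],
    "habitat 2, site D": [13.0, 14.0, 15.0],
}
n = 3  # replicates per site
site_means = [mean(values) for values in catches.values()]

# Each contrast's coefficients sum to zero. The first is a combined mean
# comparison (habitat 1 vs habitat 2); the others compare sites within a habitat.
contrasts = {
    "habitat 1 vs habitat 2": [1, 1, -1, -1],
    "site A vs site B": [1, -1, 0, 0],
    "site C vs site D": [0, 0, 1, -1],
}


def are_orthogonal(c1, c2):
    """Balanced-design check: the coefficient products sum to zero."""
    return sum(a * b for a, b in zip(c1, c2)) == 0


def contrast_ss(coeffs, means, n):
    """Sum of squares (1 df) for one contrast with n replicates per group."""
    estimate = sum(c * m for c, m in zip(coeffs, means))
    return estimate ** 2 / (sum(c * c for c in coeffs) / n)


# With equal n the grand mean is simply the mean of the site means.
grand_mean = mean(site_means)
treatment_ss = n * sum((m - grand_mean) ** 2 for m in site_means)

all_orthogonal = all(
    are_orthogonal(a, b) for a, b in combinations(contrasts.values(), 2)
)
ss_by_contrast = {
    name: contrast_ss(coeffs, site_means, n) for name, coeffs in contrasts.items()
}

print("All pairs orthogonal:", all_orthogonal)
for name, ss in ss_by_contrast.items():
    print(f"{name}: SS = {ss:.1f}")
print(f"Sum of contrast SS = {sum(ss_by_contrast.values()):.1f}")
print(f"Treatment SS       = {treatment_ss:.1f}")
```

With these made-up numbers the three contrast sums of squares (108, 6 and 24) add up to the treatment sum of squares of 138, and most of the variation is captured by the single habitat contrast. Each contrast would then be tested against the error mean square from the ANOVA, which is not computed here.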

The other misuses of multiple comparison tests are to use excessively liberal tests or excessively conservative tests. In the first category comes Fisher's protected least significant difference. This test gives no real protection against an excessive type I error rate, and should only be used for pre-planned orthogonal comparisons. Also (generally) too liberal is Duncan's multiple range test - statisticians have taken a special (and not unjustified) dislike to this test and using it will almost certainly draw negative comments from a reviewer. The opposite problem arises if excessively conservative tests such as the Scheffé method are used for a small number of pairwise comparisons. In such a situation, the Tukey-Kramer HSD test is more appropriate.
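The reason unadjusted pairwise testing is too liberal can be illustrated with a short calculation. The sketch below is only a rough guide: it assumes the comparisons are independent, which pairwise comparisons among the same set of means are not, so the figures indicate the scale of the problem rather than exact error rates. The function names are ours, chosen for illustration.

```python
from math import comb


def n_pairwise(k):
    """Number of pairwise comparisons among k treatment means."""
    return comb(k, 2)


def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one type I error in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for k in (3, 5, 10):
    m = n_pairwise(k)
    rate = familywise_error_rate(m)
    print(f"{k} treatments: {m} pairwise comparisons, familywise rate about {rate:.2f}")
```

Under this independence assumption, three treatments already give a familywise rate of about 0.14 and five treatments about 0.40, each comparison being made at a nominal 0.05.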

Multiple comparison tests make the same assumptions as the original analysis of variance - and in fact are somewhat less robust to those assumptions being flouted. Non-independence of replicates is (as always) a common problem. Convenience rather than random sampling is common (see for example the research on species richness in Cameroon forests). In the experimental situation, repeated observations over time cannot be used as independent replicates, nor can animals kept together in groups, whether chickens in cages or fish in tanks. Just 'considering' replicates to be independent (a phrase commonly used) does not, as Jean-Luc Picard would have it, 'make it so'! As with ANOVA, multiple comparison tests also assume variances are homogeneous and errors are normally distributed. Fewer than half of the examples we give here specify whether variances were homogeneous and errors normally distributed. Where some measure of dispersion is given, there is often evidence of heteroscedasticity. We note that if variances are not homogeneous, there is a disturbing tendency for researchers to resort to the non-parametric Kruskal-Wallis test. Unfortunately Kruskal-Wallis still requires variances to be homogeneous - it only frees one from the normality assumption.
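As a rough screen for the heteroscedasticity mentioned above, one can compare the largest and smallest sample variances among the groups. The sketch below uses invented counts and a hypothetical helper, variance_ratio; it is a descriptive check only (this ratio is the statistic underlying Hartley's F-max test), not a formal test, and no critical value is applied.

```python
from statistics import variance

# Hypothetical counts in which the spread increases with the mean,
# a pattern often seen in count data.
counts = {
    "control": [2, 3, 4, 3, 3],
    "treated": [20, 28, 35, 25, 32],
}


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance among the groups."""
    variances = [variance(values) for values in groups.values()]
    smallest = min(variances)
    if smallest == 0:
        return float("inf")  # a group with no spread at all
    return max(variances) / smallest


for name, values in counts.items():
    print(f"{name}: variance = {variance(values):.1f}")
print(f"Largest / smallest variance = {variance_ratio(counts):.1f}")
```

A ratio far above 1, as with these made-up counts, suggests the homogeneity assumption deserves a closer look (and perhaps a transformation) before any multiple comparison test is applied.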


What the statisticians say

Underwood (1997) provides a fairly comprehensive account of the topic in Chapter 8, emphasizing how one needs to balance the risk of a Type I error with the need to maintain power. Sokal & Rohlf (1995), Zar (1999), Steel & Torrie (1960) and Winer et al. (1991) all have extensive coverage of multiple comparison tests.

Many medical statisticians reject any adjustments for multiple comparisons, for example Nelder (1971), Rothman (1990) and Anonymous (2007). Also critical are Perneger (1995) and Bacchetti (2002). Others such as Feise (2002) put the views of both sides but still seem hesitant about corrections for multiple outcome measures. Bland & Altman (1995) accept the case for the Bonferroni adjustment in certain situations. Lowry reviewed the use and abuse of multiple comparisons in animal experiments.

For many years ecologists were much more at home with multiple comparison tests, with strong advocacy by Rice (1989) and Day & Quinn (1989). However, Perry (1986), Gill (1990), Stewart-Oaten (1995) and especially Crawley (2005) have all advocated restricting testing to orthogonal contrasts. In recent years opponents of adjustments have included Moran (2003), Garcia (2004) and Benjamini & Hochberg (1995).

The multiple comparison methods presented here are described in Fisher (1935) (protected LSD method), Ury (1976) (Dunn-Sidak method), Dunnett (1955) (Dunnett's method), Tukey (1953) (Tukey's HSD), Newman (1939) and Keuls (1952) (Student-Newman-Keuls method) and Einot & Gabriel (1975) (Ryan's Q test).

Recommended resources are two excellent lectures provided by California State University Northridge (1) (2) and a discussion on the topic by Gerard E. Dallal. Also available is Rajinder Parsad's account of multiple comparison procedures. Other online resources include the Handbook of Biological Statistics, and sections in the NIST/SEMATECH e-Handbook of Statistics on multiple comparisons, orthogonal contrasts, Tukey's method, Scheffé's method and Bonferroni's method.

Wikipedia provides sections on multiple comparisons, the Tukey-Kramer method, Scheffé's method, Student-Newman-Keuls method, and Duncan's multiple range test.