Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important"
Pearson's chi square test of independence: Use & misuse
(independence of outcome, paired samples, cluster sampling, pooling from multiple tables, Yates' continuity correction)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Pearson's chi square test vies with the t-test as the most frequently used statistical test across virtually all disciplines. Various authors (in particular Sokal & Rohlf) have argued for its replacement by the G likelihood ratio test. Such pleas have largely fallen on deaf ears (at least for analysing 2 × 2 tables), although we have included a couple of examples of its use. Perhaps because Pearson's chi square test is so extensively used, it is also one of the most misapplied of all statistical procedures. When applied to results from cluster randomized trials it can produce some of the most wildly misleading inferences to be found in the literature.
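To make the comparison between the two statistics concrete, here is a minimal sketch that computes both Pearson's chi square and the G likelihood ratio statistic for a single 2 × 2 table. The counts are hypothetical, the function names are our own, and the P-value shortcut applies only to one degree of freedom (that is, to 2 × 2 tables). It assumes every observation is independent, which is precisely the assumption the rest of this page questions.

```python
import math

def expected_counts(table):
    """Expected frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson's statistic: sum over cells of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, exp)
               for o, e in zip(row_o, row_e))

def g_statistic(table):
    """G statistic: 2 * sum of observed * ln(observed / expected); empty cells add 0."""
    exp = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for row_o, row_e in zip(table, exp)
                   for o, e in zip(row_o, row_e) if o > 0)

def p_value_1df(stat):
    """Upper-tail probability of a chi square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: rows are two groups, columns are outcome present / absent.
table = [[30, 10], [20, 20]]
x2 = pearson_x2(table)
g = g_statistic(table)
print(f"Pearson X2 = {x2:.3f}, P = {p_value_1df(x2):.4f}")
print(f"G          = {g:.3f}, P = {p_value_1df(g):.4f}")
```

With moderate counts such as these the two statistics are close, which may be one reason the choice between them attracts less attention than the assumptions they share.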
Lack of independence of outcome is the commonest factor invalidating the test. This can arise in a number of ways. If samples are paired (either in before-after studies or matched samples), then Pearson's chi square is not the appropriate test. We give examples of such misuse in a before-after study on malaria incidence, and presence/absence of two frog species in matched pairs of natural and artificial ponds. Another cause of non-independence of outcome is the use of cluster sampling. We give examples on sampling pigs from farms and sampling periwinkles using quadrats. In experiments the frequencies in the contingency table should refer to the units that were randomly allocated; it is not valid to change the experimental unit after randomization. We give examples where this was done in trials of a cholera vaccine and of insecticide impregnated mosquito nets. The same problem arises if mothers are randomly allocated, but the test is applied to their offspring, whether of pigs or parasitoids.
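The paired-sample point can be illustrated with a short sketch. We swap in McNemar's test here as one commonly used alternative for paired binary outcomes; the page above does not say which test its examples use. The counts are hypothetical (50 matched pairs of ponds, a species recorded as present or absent in each), and the function names are our own.

```python
import math

def p_value_1df(stat):
    """Upper-tail probability of a chi square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def pearson_2x2(a, b, c, d):
    """Pearson's chi square for a 2 x 2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def mcnemar(only_first, only_second):
    """McNemar's statistic, which uses only the discordant pairs."""
    return (only_first - only_second) ** 2 / (only_first + only_second)

# Hypothetical paired data: each pair is one natural and one artificial pond.
both_present = 20      # species found in both ponds of the pair
natural_only = 15      # found in the natural pond only
artificial_only = 5    # found in the artificial pond only
neither = 10           # found in neither pond

# Misuse: treat the 50 natural and 50 artificial ponds as independent samples.
natural_present = both_present + natural_only
artificial_present = both_present + artificial_only
x2_wrong = pearson_2x2(natural_present, artificial_only + neither,
                       artificial_present, natural_only + neither)

# Paired analysis: only pairs in which the two ponds disagree carry information.
x2_paired = mcnemar(natural_only, artificial_only)

print(f"Pearson (ignores pairing): X2 = {x2_wrong:.3f}, P = {p_value_1df(x2_wrong):.4f}")
print(f"McNemar (respects pairing): X2 = {x2_paired:.3f}, P = {p_value_1df(x2_paired):.4f}")
```

The two analyses need not agree, and the direction of the discrepancy depends on how strongly the members of each pair resemble one another.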
A related problem is that of pooling frequencies from multiple 2 × 2 tables. This can be very misleading if samples are not homogeneous, and in any case throws away information on variability between replicates. We give several examples including infection of pigs with tapeworms, and infection of squirrels with virus. Multiple tables should be analysed using Mantel-Haenszel methods. Another type of pooling is between categories of a contingency table, for example collapsing a 3 × 2 table to three 2 × 2 tables. Great care must be taken in doing this to ensure that nonsense categories are not created in the process.
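The danger of pooling heterogeneous tables can be seen in a small sketch. The two strata below are hypothetical (think of two farms with very different prevalence and very different proportions of exposed animals), and the function names are our own. The sketch computes only the Mantel-Haenszel pooled odds ratio, not the associated significance test.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical strata, each given as
# (exposed infected, exposed uninfected, unexposed infected, unexposed uninfected).
strata = [
    (20, 180, 3, 47),    # low-prevalence stratum, mostly exposed animals
    (30, 20, 90, 110),   # high-prevalence stratum, mostly unexposed animals
]

# Naive pooling: add the tables cell by cell and analyse the single summed table.
pooled = tuple(sum(cells) for cells in zip(*strata))

for i, table in enumerate(strata, start=1):
    print(f"stratum {i}: odds ratio = {odds_ratio(*table):.2f}")
print(f"naively pooled table: odds ratio = {odds_ratio(*pooled):.2f}")
print(f"Mantel-Haenszel:      odds ratio = {mantel_haenszel_or(strata):.2f}")
```

In this constructed example exposure raises the odds of infection within each stratum, yet the naively pooled table suggests the opposite, because exposure and prevalence are confounded across strata. The stratified estimate keeps the within-stratum direction.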
There is a real problem over how to deal with very small expected frequencies. Yates' continuity correction makes the test too conservative. Until recently the conventional wisdom was to use Fisher's exact test even though the model (fixed marginal totals) is wrong. We show in the core text that a Monte Carlo exact test is greatly preferable. The differing opinions amongst statisticians introduce bias because one can pick and choose which test to use. Examples quoted include whether a parasitoid feeds preferentially on parasitized hosts, and factors affecting infection with equine influenza. Lastly, because Pearson's chi square test is often used as a test of association, one should stress that association can never prove causation, however strong that association may be. We give an example which questions whether poor Vitamin D status makes one more susceptible to tuberculosis, or whether tuberculosis is responsible for the observed Vitamin D deficiency.
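The small-sample issue raised at the start of this paragraph can be sketched as follows. The table is hypothetical and the function names are our own. The Monte Carlo routine is only one possible approach and is not the procedure from the core text: it assumes two independent binomial samples (row totals fixed, column totals free to vary) and plugs in the pooled proportion as the common probability, so it is an approximation rather than an exact test.

```python
import math
import random

def x2_2x2(a, b, c, d, yates=False):
    """Pearson's chi square for [[a, b], [c, d]], optionally with Yates' correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)  # shrink |ad - bc| by n/2, never below zero
    return n * diff ** 2 / denom

def p_value_1df(stat):
    """Upper-tail probability of a chi square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def monte_carlo_p(a, b, c, d, n_sim=5000, seed=1):
    """Simulated P-value under an ASSUMED two-binomial model with a common
    probability estimated from the pooled data (an illustrative choice)."""
    rng = random.Random(seed)
    n1, n2 = a + b, c + d
    p_hat = (a + c) / (n1 + n2)
    observed = x2_2x2(a, b, c, d)
    hits = 0
    for _ in range(n_sim):
        s1 = sum(rng.random() < p_hat for _ in range(n1))
        s2 = sum(rng.random() < p_hat for _ in range(n2))
        if x2_2x2(s1, n1 - s1, s2, n2 - s2) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # count the observed table as one simulation

# Hypothetical small table: two groups of 10, with 7 and 2 positives.
cells = (7, 3, 2, 8)
plain = x2_2x2(*cells)
corrected = x2_2x2(*cells, yates=True)
print(f"uncorrected:     X2 = {plain:.3f}, P = {p_value_1df(plain):.4f}")
print(f"Yates corrected: X2 = {corrected:.3f}, P = {p_value_1df(corrected):.4f}")
print(f"Monte Carlo P (assumed binomial model) = {monte_carlo_p(*cells):.4f}")
```

The corrected statistic can never exceed the uncorrected one, so in a small table the choice of procedure can move a result from one side of a conventional significance threshold to the other. That is exactly the room for picking and choosing described above.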
What the statisticians say
Agresti (2002) provides in-depth coverage of Pearson's chi square test in Chapter 3. Armitage & Berry (2002) cover the material given here in Chapters 4 and 15 whilst Woodward (2004) covers the same topic in Chapter 3. Fleiss et al. (2003) look at the wide range of statistical methods available for rates and proportions. Conover (1999) looks at the analysis of contingency tables in Chapter 4 whilst Everitt (1992) devotes the entire text to the analysis of contingency tables. Sokal & Rohlf (1995) and Zar (1999) both cover much (but not all) of the same material.
Ludbrook (2008) tackles the issue of matching test to experimental design for a 2 × 2 table. Campbell (2007) favours the 'n - 1' chi-square test except when any expected frequency is less than one. Agresti (2001) looked at the continuing controversies in the use of exact inference for categorical data. Reed (2004) looks at the use of adjusted chi-square statistics for analyzing clustered binary data. Yates (1984) reviews the (never ending) controversy over how to analyze 2 × 2 tables up to 1984. Plackett (1983) explores the history of Karl Pearson's development of the chi-square test. Cochran (1954) and Armitage (1955) proposed the chi square test for trend. Barnard (1947) first identified the different models which generate 2 × 2 tables.
Kraemer et al. (2004) advise against the use of weighted kappa for assessing agreement. Greenland & Robins (1985) demonstrate that the Mantel-Haenszel risk ratio is unbiased even when some categories have small frequencies. Peto (1978) criticizes a misleading survival analysis where only the distribution of times of death of those who died was considered.
Wikipedia has sections on Pearson's chi-square test, Yates' correction for continuity, the G-test, the chi-square test for trend and Cochran-Mantel-Haenszel methods (not well covered). The Handbook of Biological Statistics also describes Pearson's chi square test and the G-test of independence. The NIST/SEMATECH e-Handbook of Statistics has a useful general section on likelihood ratio tests.
David C. Howell provides a simple introduction to the use of Pearson's chi-square test for analysis of contingency tables. Ian Campbell gives the background information concerning choice of tests for two-by-two tables. Bruce Weaver summarizes some of the material produced by Ian Campbell. A staff member of the University of Alberta gives the R code for log-likelihood tests of independence & goodness of fit with or without Williams' and Yates' corrections (although it is also provided in the Deducer package in R).