Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Goodness-of-fit tests for categorized data: Use & misuse

(Pearson's chi square, likelihood ratio G-test, independence of observations, sample size, exact tests)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

Applied ecologists and conservation biologists seem to make most use of these goodness of fit tests, mainly for testing sex and phenotype ratios, results of choice experiments, and availability versus use studies. Researchers in all disciplines use these tests for testing goodness of fit of discrete data to various discrete probability distributions, especially the Poisson and negative binomial. Neither Pearson's chi square test nor the G-test is usually appropriate for testing goodness-of-fit to a normal distribution, although examples of such use can still be found in the literature.
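As a minimal sketch of the simplest use mentioned above, a test of a sex ratio, the following computes both Pearson's chi square and the likelihood ratio G statistic. The counts (60 males, 40 females) and the 1:1 null ratio are hypothetical, and the function names are ours. With two categories there is one degree of freedom, so the P-value can be obtained from the complementary error function without any statistics library.

```python
from math import erfc, log, sqrt


def pearson_x2(observed, expected):
    """Pearson's chi square: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_statistic(observed, expected):
    """Likelihood ratio G: 2 * sum of O * ln(O / E); empty cells add 0."""
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


def chi2_sf_df1(x):
    """Upper tail of the chi square distribution, valid for 1 df only."""
    return erfc(sqrt(x / 2))


# Hypothetical data: 60 males and 40 females, tested against a 1:1 ratio.
observed = [60, 40]
n = sum(observed)
expected = [n * 0.5, n * 0.5]

x2 = pearson_x2(observed, expected)
g = g_statistic(observed, expected)
p_x2 = chi2_sf_df1(x2)
p_g = chi2_sf_df1(g)

print(f"Pearson X2 = {x2:.3f}, P = {p_x2:.4f}")
print(f"G          = {g:.3f}, P = {p_g:.4f}")
```

The two statistics are close here, as they usually are when expected frequencies are large. Both P-values rest on the assumption that the 100 observations are independent.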

The key requirement flouted in many goodness-of-fit analyses is that observations are independent. We look at one example from medical research where the sample was a convenience sample, and another where a cluster (= school) sample design was used. It seems likely that the behaviour of children as regards drinking, smoking and drugs is more similar within a school than between schools. We see the same potential problem in a choice experiment on butterflies where observations were carried out on two pairs at a time. This issue becomes critical in wildlife studies, which often use multiple non-independent sightings of the same individuals, leading to pseudoreplication. The resulting analyses tend to have (misleadingly) high values of the chi square statistic. We also have to assume that each animal is behaving independently - which is highly improbable in gregarious species.
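The inflation caused by pseudoreplication can be illustrated with a deliberately extreme sketch. We assume 20 hypothetical animals, each always found in the same one of two habitats, and each sighted 5 times. Treating the 100 sightings as independent observations multiplies every count by 5, and the chi square statistic by the same factor, although no new information about habitat use has been gained. All numbers and names here are illustrative assumptions.

```python
def pearson_x2(observed, expected):
    """Pearson's chi square: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def x2_equal_use(counts):
    """Chi square against the null of equal use of every category."""
    n = sum(counts)
    expected = [n / len(counts)] * len(counts)
    return pearson_x2(counts, expected)


# Hypothetical: 12 animals use habitat A and 8 use habitat B.
animals = [12, 8]

# Assumption: every animal is sighted 5 times, always in the same habitat,
# so sightings of one animal are completely dependent on each other.
sightings_per_animal = 5
sightings = [count * sightings_per_animal for count in animals]

x2_animals = x2_equal_use(animals)      # the animal is the unit of analysis
x2_sightings = x2_equal_use(sightings)  # the sighting is (wrongly) the unit

print(f"X2 using animals:   {x2_animals:.2f}")
print(f"X2 using sightings: {x2_sightings:.2f}")
```

Real data rarely show complete dependence, so the inflation is usually smaller than the number of sightings per animal, but it acts in the same direction.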

Excessively small sample sizes often jeopardize the validity of tests, especially in choice tests and resource availability studies. Where categories are pooled, the choice of which ones to pool may not be very logical. More often no pooling is carried out, leaving a number of very small expected frequencies. The confusion amongst statisticians on lowest permissible expected frequencies is well reflected in the literature! Exact tests are increasingly used for small samples, but conventional exact P-values are very conservative, and mid-P-values would be preferable. In wildlife resource availability studies, it is important that the theoretical distribution is known without error - if that distribution is only estimated (as is often the case), then the test is no longer valid.
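For the two-category case, the difference between a conventional exact P-value and a mid-P-value can be sketched as follows. The data are hypothetical (8 of 10 animals choosing option A, null probability 0.5). The two-sided P-value is taken here as the total probability of all outcomes no more probable than the one observed, and the mid-P version counts only half the probability of outcomes exactly as probable as the observed one. Other two-sided definitions exist, so treat this as one convention among several.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_and_mid_p(x, n, p0=0.5):
    """Two-sided exact and mid-P values for an observed count x out of n."""
    p_obs = binom_pmf(x, n, p0)
    less_likely = 0.0     # outcomes strictly less probable than observed
    equally_likely = 0.0  # outcomes as probable as observed (includes x)
    for k in range(n + 1):
        p_k = binom_pmf(k, n, p0)
        if abs(p_k - p_obs) <= 1e-9 * p_obs:
            equally_likely += p_k
        elif p_k < p_obs:
            less_likely += p_k
    exact_p = less_likely + equally_likely
    mid_p = less_likely + 0.5 * equally_likely
    return exact_p, mid_p


# Hypothetical choice test: 8 of 10 animals choose option A.
exact_p, mid_p = exact_and_mid_p(8, 10, p0=0.5)
print(f"exact P = {exact_p:.4f}, mid-P = {mid_p:.4f}")
```

The mid-P-value is always smaller than the conventional exact P-value, which is why it is less conservative. With more than two categories the same idea applies to the exact multinomial test, but enumerating the outcomes takes more work.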

As we also note for the Kolmogorov-Smirnov test, goodness of fit tests feature strongly in cases of a more profound misunderstanding of the nature of significance testing. It is not possible to demonstrate that a 'fit is significant'; one can only test whether observed frequencies deviate significantly from the expected frequencies. Moreover, even if observed frequencies do not deviate significantly from expected, it does not prove that the model under test is 'correct', or even that it provides a good fit to the data. A small sample size will usually guarantee a 'good fit', simply because the test then has little power to detect any deviation! Another rather fundamental error we encountered is to analyze a table of observed and expected values as if it were a 2 x 2 contingency table.
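The last error can be made concrete with a small sketch using hypothetical counts (60 and 40 observed, 50 and 50 expected). A contingency table analysis treats the expected frequencies as a second sample subject to its own sampling error, so the statistic it produces is much smaller than the goodness-of-fit statistic and does not test the intended hypothesis. The function names are ours, and no continuity correction is applied.

```python
def pearson_x2(observed, expected):
    """Goodness-of-fit chi square: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def contingency_x2(table):
    """Chi square for an r x c contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            fitted = row_totals[i] * col_totals[j] / n
            x2 += (cell - fitted) ** 2 / fitted
    return x2


# Hypothetical data: observed counts and the frequencies expected under H0.
observed = [60, 40]
expected = [50, 50]

x2_correct = pearson_x2(observed, expected)
# The error: stacking observed and expected as the two rows of a 2 x 2 table.
x2_wrong = contingency_x2([observed, expected])

print(f"goodness-of-fit X2:   {x2_correct:.3f}")
print(f"2 x 2 table X2 (wrong): {x2_wrong:.3f}")
```

In this example the misapplied analysis roughly halves the statistic, so a deviation that the correct test would flag could easily be missed.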


What the statisticians say

Conover (1999) covers the chi square goodness of fit test in Chapter 4. He provides a good discussion of the approximation involved in calculating degrees of freedom by subtracting the number of estimated parameters. Sokal & Rohlf (1995) advocate use of the G likelihood ratio test in place of Pearson's chi square test for testing goodness of fit. Sprent (1998) looks at both the chi square and Kolmogorov tests in relation to the exact multinomial permutation test. Siegel (1956) makes the important point that pooling adjacent categories should only be done if the combination is meaningful.

Magnussen et al. (2006) warn against the use of Wald's test-statistic for simple goodness-of-fit tests under one-stage cluster sampling. Rao & Scott (1981, 1992) provide details of a simple adjustment to make to chi square for analysis of cluster data. Holt (1980) looks at the effects of using standard Pearson's chi square tests on complex survey data where observations are seldom independent. The effects of correlations are severe for tests of goodness of fit. Chernoff & Lehmann (1954) note that where the parameters of a distribution are estimated by maximum likelihood from the ungrouped data, subtracting the number of estimated parameters from the degrees of freedom may provide too severe a correction.
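The degrees of freedom issue raised by Conover and by Chernoff & Lehmann can be sketched for a fit to a Poisson distribution whose mean is estimated from the data. The counts below are hypothetical, the pooling of the upper tail into a '3 or more' class is our own choice, and the helper chi2_sf is a small stand-in for a statistics library, valid only for whole-number degrees of freedom. Because the mean is estimated from the ungrouped counts, the sketch reports the P-value both with and without subtracting one degree of freedom for the estimated parameter, on the reasoning that the appropriate value lies somewhere between the two.

```python
from math import erfc, exp, factorial, gamma, sqrt


def chi2_sf(x, df):
    """Upper tail of the chi square distribution for integer df >= 1."""
    half = x / 2
    if df % 2:                       # odd df: start from the 1 df tail
        tail, nu = erfc(sqrt(half)), 1
    else:                            # even df: start from the 2 df tail
        tail, nu = exp(-half), 2
    while nu < df:                   # step up two degrees of freedom at a time
        tail += half ** (nu / 2) * exp(-half) / gamma(nu / 2 + 1)
        nu += 2
    return tail


# Hypothetical data: number of sampling units holding 0, 1, 2, 3, 4 animals.
counts_by_value = {0: 38, 1: 30, 2: 17, 3: 10, 4: 5}
n = sum(counts_by_value.values())

# Maximum likelihood estimate of the Poisson mean from the ungrouped counts.
mean = sum(value * count for value, count in counts_by_value.items()) / n

# Classes 0, 1, 2 and a pooled '3 or more' class.
probs = [exp(-mean) * mean ** k / factorial(k) for k in range(3)]
probs.append(1 - sum(probs))
observed = [counts_by_value[0], counts_by_value[1], counts_by_value[2],
            counts_by_value[3] + counts_by_value[4]]
expected = [n * p for p in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

k = len(observed)
df_subtracted = k - 1 - 1   # one parameter (the mean) subtracted
df_unadjusted = k - 1       # nothing subtracted
p_lower = chi2_sf(x2, df_subtracted)
p_upper = chi2_sf(x2, df_unadjusted)

print(f"X2 = {x2:.3f}")
print(f"P with {df_subtracted} df = {p_lower:.4f}; "
      f"P with {df_unadjusted} df = {p_upper:.4f}")
```

If both P-values lead to the same conclusion, the choice of degrees of freedom does not matter for that data set. If they disagree, the result should be reported as uncertain rather than settled by convention.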

Wikipedia provides a comprehensive account of the test with a useful section on its relation to other tests.