 InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Goodness-of-fit tests for categorized data: Pearson's chi square and likelihood ratio G-test

In our first worked example we revisit the study by Meyer et al. (1995) on the occurrence of portosystemic shunt in Irish wolfhounds. The authors described the distribution of ammonia concentrations in the venous blood as 'essentially normal', but performed no statistical (or graphical) assessment.

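| Ammonia (class upper bound) | Observed $f_i$ | Expected $\hat{f}_i$ |
|---|---|---|
| <25 | 7 | 6.2 |
| 30 | 6 | 6.3 |
| 35 | 15 | 11.2 |
| 40 | 26 | 18.8 |
| 45 | 38 | 29.3 |
| 50 | 43 | 42.9 |
| 55 | 32 | 58.8 |
| 60 | 68 | 75.4 |
| 65 | 69 | 90.4 |
| 70 | 105 | 101.6 |
| 75 | 127 | 106.8 |
| 80 | 121 | 105.1 |
| 85 | 94 | 96.9 |
| 90 | 97 | 83.5 |
| 95 | 85 | 67.4 |
| 100 | 34 | 51.0 |
| 105 | 25 | 36.1 |
| 110 | 21 | 23.9 |
| 115 | 13 | 14.8 |
| 120 | 1 | 8.6 |
| >120 | 17 | 9.0 |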

If we had the raw data available, then one of the cumulative rank methods or the Shapiro-Wilk test would have been the preferred method of testing normality. Since we only have access to the categorized distribution, we will use Pearson's chi square test - but with the reservation that we are losing information by categorizing a continuous variable. The first figure below shows observed frequencies with the lowest and highest frequencies pooled in order to eliminate small expected frequencies.

The second figure shows a normal distribution with the same mean (73.8), standard deviation (19.4), interval width (5) and number of observations (1044). Expected frequencies were obtained using R by taking the cumulative probability up to each upper bound, subtracting the cumulative probability up to the upper bound below it, and multiplying by the sample size.
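The same steps can be sketched outside R. The following is a minimal Python sketch, not the code used for this page: it assumes the class boundaries run from 25 to 120 in steps of 5 with open-ended classes at each end, and it uses the rounded mean and standard deviation quoted above, so the results are only approximate.

```python
from statistics import NormalDist

# Summary values quoted in the text (rounded, so results are approximate).
mean, sd, n = 73.8, 19.4, 1044

# Assumed class upper bounds: 25, 30, ..., 120; the top class (>120) is open-ended.
upper_bounds = list(range(25, 125, 5))

dist = NormalDist(mu=mean, sigma=sd)

# Cumulative probability up to each upper bound; 1.0 closes the open top class.
cum = [dist.cdf(b) for b in upper_bounds] + [1.0]

# Class probability = cumulative probability minus the one below it
# (0 below the first class), then scaled by the sample size.
probs = [c - prev for prev, c in zip([0.0] + cum[:-1], cum)]
expected = [n * p for p in probs]

labels = ["<25"] + [f"{b - 5}-{b}" for b in upper_bounds[1:]] + [">120"]
for label, e in zip(labels, expected):
    print(f"{label:>8}  {e:7.1f}")
```

The printed values should agree with the expected column of the table to within rounding. In R the cumulative probabilities would come from `pnorm`.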

Pearson's X2 can then be calculated using the general formula:

$$X^2 = \sum_i \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} = \frac{(7 - 6.205)^2}{6.205} + \dots + \frac{(17 - 9.002)^2}{9.002} = 61.28$$

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated], which in this case is [21 - 1 - 2] = 18, the two estimated parameters being the mean and the standard deviation. Hence P = 0.00000127. We can therefore reject the null hypothesis and conclude that a normal distribution does not provide a good fit to the data.
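As a check on the arithmetic, here is a minimal Python sketch of the whole calculation. The observed counts are our reading of the ammonia table (the last two classes taken as 1 and 17, in line with the final term of the formula), the expected frequencies are recomputed from the rounded mean and standard deviation, and the helper `chi2_upper_tail_even` is a name introduced here for illustration; it relies on a closed form that is valid only for an even number of degrees of freedom.

```python
import math
from statistics import NormalDist

# Observed frequencies as read from the ammonia table (21 classes).
observed = [7, 6, 15, 26, 38, 43, 32, 68, 69, 105, 127,
            121, 94, 97, 85, 34, 25, 21, 13, 1, 17]
n = sum(observed)

# Expected frequencies under a normal distribution with the quoted mean and SD.
dist = NormalDist(mu=73.8, sigma=19.4)
cum = [dist.cdf(b) for b in range(25, 125, 5)] + [1.0]
expected = [n * (c - prev) for prev, c in zip([0.0] + cum[:-1], cum)]

# Pearson's X2: squared deviations scaled by the expected frequency.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# classes - 1 - parameters estimated (mean and SD).
df = len(observed) - 1 - 2


def chi2_upper_tail_even(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


p_value = chi2_upper_tail_even(x2, df)
print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3g}")
```

Because the mean and standard deviation are rounded, X2 should land close to, but not exactly on, the 61.28 quoted above. The R equivalent of the last step is `pchisq(x2, df, lower.tail=FALSE)`.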

If preferred, a likelihood ratio G-test could be used instead:

$$G = 2 \sum_i f_i \ln\frac{f_i}{\hat{f}_i} = 2 \times \left[ 7 \ln\frac{7}{6.205} + \dots + 17 \ln\frac{17}{9.002} \right] = 66.6$$

The number of degrees of freedom is again given by [number of classes - 1 - number of parameters estimated] = 18. Hence P = 0.000000166. Again we reject the null hypothesis and conclude that the observed distribution deviates significantly from a normal distribution.
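The G statistic can be sketched in the same way. This is an illustrative Python version under the same assumptions as before: observed counts as we read them from the ammonia table, expected frequencies recomputed from the rounded mean and standard deviation, and an upper-tail formula that only applies because 18 is even.

```python
import math
from statistics import NormalDist

# Observed frequencies as read from the ammonia table (21 classes).
observed = [7, 6, 15, 26, 38, 43, 32, 68, 69, 105, 127,
            121, 94, 97, 85, 34, 25, 21, 13, 1, 17]
n = sum(observed)

# Expected frequencies under the fitted normal distribution.
dist = NormalDist(mu=73.8, sigma=19.4)
cum = [dist.cdf(b) for b in range(25, 125, 5)] + [1.0]
expected = [n * (c - prev) for prev, c in zip([0.0] + cum[:-1], cum)]

# G = 2 * sum of f * ln(f / f_hat); an empty class would contribute zero.
g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

df = len(observed) - 1 - 2

# Upper-tail chi-square probability (closed form, even df only).
half = g / 2.0
p_value = math.exp(-half) * sum(half ** k / math.factorial(k)
                                for k in range(df // 2))

print(f"G = {g:.2f}, df = {df}, P = {p_value:.3g}")
```

G should come out near the 66.6 quoted above, a little larger than X2, as is common when a few classes deviate strongly from expectation.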

Our second worked example is from a study by Greenwood & Yule (1920) that we first looked at in Unit 4. The table gives the observed frequency distribution of accidents per individual, together with expected distributions assuming either a Poisson or negative binomial distribution.

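Accidents experienced by 414 machinists over 3 months

| No. accidents | Observed $f_i$ | Expected $\hat{f}_i$ (Poisson) | Expected $\hat{f}_i$ (negative binomial) |
|---|---|---|---|
| 0 | 296 | 256 | 299 |
| 1 | 74 | 122 | 69 |
| 2 | 26 | 30 | 26 |
| 3 | 8 | 5 | 11 |
| 4 | 4 | 1 | 5 |
| 5 | 4 | 0 | 2 |
| 6 | 1 | 0 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 1 | 0 | 0 |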

Number of accidents is a discrete variable, so either Pearson's chi square test or the G likelihood ratio test would be appropriate to assess goodness of fit. To avoid expected frequencies less than 5, we pool the higher categories as appropriate.

Poisson distribution

Pearson's X2 can then be calculated using the general formula:

$$X^2 = \frac{(296 - 256)^2}{256} + \dots + \frac{(18 - 6)^2}{6} = 49.67$$

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated]. After pooling there are four classes (0, 1, 2, and 3 or more accidents) and a Poisson distribution has just one parameter, so in this case we have [4 - 1 - 1] = 2. Using R's chi-squared distribution function, pchisq(49.67, 2, lower.tail=FALSE), gives an upper tail P-value of about 1.6e-11. We can therefore conclude that the observed distribution deviates significantly from a Poisson distribution.
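The pooling and the test can be sketched as follows. This is an illustrative Python version, not the authors' code: `pool_upper_tail` is a hypothetical helper that merges classes from the top down until the pooled expected frequency reaches 5, which is one reasonable reading of 'pool the higher categories as appropriate'. Counting the classes left after pooling gives four (0, 1, 2, and 3 or more), so the sketch uses 4 - 1 - 1 = 2 degrees of freedom.

```python
import math

# Observed and Poisson-expected frequencies for 0..8 accidents, from the table.
observed = [296, 74, 26, 8, 4, 4, 1, 0, 1]
expected = [256, 122, 30, 5, 1, 0, 0, 0, 0]


def pool_upper_tail(obs, exp, minimum=5):
    """Merge the top classes until the last expected frequency is >= minimum."""
    obs, exp = list(obs), list(exp)
    while len(exp) > 1 and exp[-1] < minimum:
        last_exp = exp.pop()
        last_obs = obs.pop()
        exp[-1] += last_exp
        obs[-1] += last_obs
    return obs, exp


pooled_obs, pooled_exp = pool_upper_tail(observed, expected)

x2 = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))

# classes - 1 - parameters estimated (the Poisson mean).
df = len(pooled_obs) - 1 - 1

# For 2 degrees of freedom the chi-square upper tail is exactly exp(-x / 2).
p_value = math.exp(-x2 / 2) if df == 2 else float("nan")

print(pooled_obs, pooled_exp)
print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3g}")
```

The pooled last class holds 18 observed against 6 expected, matching the final term of the formula above.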

Negative binomial distribution

Pearson's X2 can then be calculated using the general formula:

$$X^2 = \frac{(296 - 299)^2}{299} + \dots + \frac{(10 - 9)^2}{9} = 1.32$$

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated] which, since the negative binomial requires TWO parameters to be estimated (m & k, or p & q), in this case is [5 - 1 - 2] = 2. Referring this value to R gives a P-value of 0.516. We can therefore conclude that the observed distribution does not deviate significantly from a negative binomial distribution. Note the phrasing here - we have not proved that the negative binomial is a 'significantly good fit' to the data, as we cannot prove the null hypothesis.
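A short Python sketch of this calculation, for readers who want to check the figures. It assumes the five classes are 0, 1, 2, 3, and 4 or more accidents, which is the pooling implied by the last term of the formula (10 observed against 9 expected).

```python
import math

# Classes 0, 1, 2, 3 and "4 or more"; the top class pools 4+4+1+0+1 observed
# and 5+2+1+1+0 expected from the negative binomial column of the table.
observed = [296, 74, 26, 8, 10]
expected = [299, 69, 26, 11, 9]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# classes - 1 - parameters estimated (two for the negative binomial).
df = len(observed) - 1 - 2

# With 2 degrees of freedom the chi-square upper tail is exactly exp(-x / 2).
p_value = math.exp(-x2 / 2)

print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3f}")
```

Most of the X2 value comes from the class with 3 accidents (8 observed against 11 expected); no class deviates enough to cast doubt on the fit.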

Our third worked example is from a study on the striped skunk by Larivière & Messier (1998) that we looked at in Unit 6. The observed distribution of natal dens among habitat types is compared with the expected distribution if selection were in accordance with availability. Three of the six expected frequencies are less than 5, so some categories should be pooled. However, the authors did not pool categories, so we will first calculate Pearson's X2 without pooling.

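Observed and expected frequencies of 47 natal den sites in different habitats

| Habitat | Observed $f_i$ | Expected $\hat{f}_i$ |
|---|---|---|
| Wetlands | 13 | 11.40 |
| Farmsteads | 18 | 1.02 |
| Nesting areas | 4 | 12.50 |
| Right-of-ways | 3 | 4.10 |
| Woodland | 8 | 15.68 |
| Misc | 1 | 2.30 |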

We will use R for all the calculations here, as there is nothing new to learn from doing them by hand. R gives an X2-value of 293.46, similar to that obtained by the authors. Since the theoretical distribution is dictated by the availability of the different habitats, which is known without error, we have (k - 1) = 5 df and P < 0.00001. Note however that R gives a warning that the chi-squared approximation may be incorrect.
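For readers without R to hand, the unpooled calculation can be sketched in Python. This is an illustration only: `chi2_upper_tail` is a helper written for this sketch using closed forms for integer degrees of freedom, and the list of small expected frequencies is printed simply to show which cells lie behind the warning.

```python
import math

habitats = ["Wetlands", "Farmsteads", "Nesting areas",
            "Right-of-ways", "Woodland", "Misc"]
observed = [13, 18, 4, 3, 8, 1]
expected = [11.40, 1.02, 12.50, 4.10, 15.68, 2.30]


def chi2_upper_tail(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    half = x / 2.0
    if df % 2 == 0:
        terms = (half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * sum(terms)
    terms = (half ** (k - 0.5) / math.gamma(k + 0.5)
             for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


# Contribution of each habitat to X2, to show where the lack of fit lies.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(contributions)

# No parameters are estimated from the data, so df = classes - 1.
df = len(observed) - 1
p_value = chi2_upper_tail(x2, df)

# Expected frequencies below 5 are what make the approximation doubtful.
small = [h for h, e in zip(habitats, expected) if e < 5]

for h, c in zip(habitats, contributions):
    print(f"{h:>14}  {c:8.2f}")
print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3g}")
print("Expected < 5:", small)
```

Nearly all of X2 comes from farmsteads, where 18 dens were observed against 1.02 expected.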

We must then consider which classes to pool to provide a more valid test. One might be tempted to pool all three classes with small expected frequencies, but this would not be wise because farmsteads is one of the categories we are most interested in - and is also the category in which there is the widest divergence between observed and expected. We will therefore just pool the right-of-ways and miscellaneous categories - which at least seems to make sense.

This gives almost exactly the same (highly significant) X2-value, albeit with one fewer degree of freedom. R still warns that the chi-squared approximation may be incorrect, even though we have now met the condition that no more than 20% of the expected frequencies (here 1 out of 5) are less than 5. Given that it would be quite illogical to pool farmsteads with any other category, we could either present the analysis as is or (better) use Monte Carlo simulation to estimate the null distribution of X2.
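The pooled version can be sketched in the same way. In this illustrative Python snippet the right-of-ways and miscellaneous classes are simply added together, and the share of expected frequencies below 5 is computed to check the 20% condition. As far as we are aware, R's `chisq.test` warns whenever any expected frequency is below 5, which would explain why the warning persists.

```python
import math

habitats = ["Wetlands", "Farmsteads", "Nesting areas", "Woodland",
            "Right-of-ways + Misc"]
observed = [13, 18, 4, 8, 3 + 1]
expected = [11.40, 1.02, 12.50, 15.68, 4.10 + 2.30]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# For 4 degrees of freedom the upper tail is exp(-x/2) * (1 + x/2).
half = x2 / 2.0
p_value = math.exp(-half) * (1 + half)

# Which expected frequencies are still below 5, and what share of classes?
small = [h for h, e in zip(habitats, expected) if e < 5]
share_small = len(small) / len(expected)

print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3g}")
print(f"Expected < 5: {small} ({share_small:.0%} of classes)")
```

Only farmsteads remains below 5, so exactly one class in five falls short.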

This gives a P-value of 0.0005, indicating that the choice of natal den sites differs significantly from that expected on the basis of availability.
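The Monte Carlo approach can be sketched as follows. This is a minimal Python illustration rather than the R code used here: it assumes the pooled five-class table, draws 2000 samples of 47 dens under the null hypothesis (2000 being the default number of replicates in R's `chisq.test` when `simulate.p.value=TRUE`), and counts how often the simulated X2 reaches the observed value. The seed is arbitrary and fixed only for reproducibility.

```python
import random

# Pooled table: wetlands, farmsteads, nesting areas, woodland,
# right-of-ways + misc.
observed = [13, 18, 4, 8, 4]
expected = [11.40, 1.02, 12.50, 15.68, 6.40]
n = sum(observed)

# Null-hypothesis probabilities: habitat availability.
probs = [e / n for e in expected]


def pearson_x2(obs, exp):
    """Pearson's X2 for one set of observed counts."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


x2_obs = pearson_x2(observed, expected)

rng = random.Random(1)      # arbitrary seed
replicates = 2000
categories = range(len(expected))

exceed = 0
for _ in range(replicates):
    # One simulated sample of n dens allocated according to availability.
    draws = rng.choices(categories, weights=probs, k=n)
    counts = [0] * len(expected)
    for c in draws:
        counts[c] += 1
    if pearson_x2(counts, expected) >= x2_obs:
        exceed += 1

# Count the observed table as one of the replicates.
p_value = (exceed + 1) / (replicates + 1)
print(f"Observed X2 = {x2_obs:.2f}, Monte Carlo P = {p_value:.4f}")
```

With an observed X2 this extreme, essentially no simulated sample matches it, so the estimate sits at its floor of 1/2001, which rounds to the 0.0005 reported above. A smaller P-value could only be resolved with more replicates.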

 Except where otherwise specified, all text and images on this page are copyright InfluentialPoints, all rights reserved. Images not copyright InfluentialPoints credit their source on web-pages attached via hypertext links from those images.