InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

Goodness-of-fit tests for categorized data: Pearson's chi square and likelihood ratio G-test

In our first worked example we revisit the study by Meyer et al. (1995) on the occurrence of portosystemic shunt in Irish wolfhounds. The authors described the distribution of ammonia concentrations in the venous blood as 'essentially normal', but performed no statistical (or graphical) assessments.

If we had the raw data available, then one of the cumulative rank methods or the Shapiro-Wilk test would have been the preferred method of testing normality. Since we only have access to the categorized distribution, we will use Pearson's chi square test - but with the reservation that we are losing information by categorizing a continuous variable. The first figure below shows observed frequencies with the lowest and highest classes pooled in order to eliminate small expected frequencies.

Ammonia   Observed (fi)   Expected (f̂i)
<25             7              6.2
30              6              6.3
35             15             11.2
40             26             18.8
45             38             29.3
50             43             42.9
55             32             58.8
60             68             75.4
65             69             90.4
70            105            101.6
75            127            106.8
80            121            105.1
85             94             96.9
90             97             83.5
95             85             67.4
100            34             51.0
105            25             36.1
110            21             23.9
115            13             14.8
120             1              8.6
>120           17              9.0

{Fig. 1}

The second figure shows a normal distribution with the same mean (73.8), standard deviation (19.4), interval width (5) and number of observations (1044). Expected frequencies were obtained using R by taking the cumulative probability up to each upper bound, subtracting the cumulative probability up to the upper bound below it, and multiplying the difference by the sample size.
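The calculation just described can be sketched in a few lines. The source used R; the sketch below uses Python's standard library instead, and it assumes that the class labels in the table (25, 30, ... 120) are the upper bounds of each interval, with the final class open-ended. Because the mean and standard deviation are the rounded values quoted in the text, the results should agree with the tabulated expected frequencies only to about one decimal place.

```python
from statistics import NormalDist

# Values quoted in the text
mean, sd, n = 73.8, 19.4, 1044

# Assumption: table labels are class upper bounds (25, 30, ..., 120);
# the last class (>120) has no upper bound.
upper_bounds = list(range(25, 125, 5))

normal = NormalDist(mu=mean, sigma=sd)
# Cumulative probability up to each upper bound; 1.0 closes the open-ended class
cumulative = [normal.cdf(u) for u in upper_bounds] + [1.0]

expected = []
previous = 0.0
for c in cumulative:
    # Probability of falling in the class, scaled to the sample size
    expected.append(n * (c - previous))
    previous = c

labels = [f"<={u}" for u in upper_bounds] + [">120"]
for label, e in zip(labels, expected):
    print(f"{label:>6}  {e:6.1f}")
```

The expected frequencies necessarily sum to the sample size, which is a quick check that no class has been dropped.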

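Pearson's X2 can then be calculated using the general formula, where $f_i$ is the observed and $\hat{f}_i$ the expected frequency in class $i$:

$$X^2 = \sum_i \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} = \frac{(7 - 6.205)^2}{6.205} + \dots + \frac{(17 - 9.002)^2}{9.002} = 61.28$$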

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated]. Two parameters (the mean and the standard deviation) were estimated from the data, so in this case it is [21 - 1 - 2] = 18. Hence P = 0.00000127. We can therefore reject the null hypothesis and conclude that a normal distribution does not provide a good fit to the data.
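As a rough check on the arithmetic, here is a minimal Python sketch of the same test. It uses the expected frequencies rounded to one decimal place, as tabulated, so X2 comes out close to (but not exactly) the 61.28 quoted above. The helper chi2_upper_tail is our own illustrative function, not a library call; it plays the role of R's upper-tail pchisq and is built from the standard recurrence for the chi-squared tail probability.

```python
import math

observed = [7, 6, 15, 26, 38, 43, 32, 68, 69, 105, 127,
            121, 94, 97, 85, 34, 25, 21, 13, 1, 17]
expected = [6.2, 6.3, 11.2, 18.8, 29.3, 42.9, 58.8, 75.4, 90.4, 101.6, 106.8,
            105.1, 96.9, 83.5, 67.4, 51.0, 36.1, 23.9, 14.8, 8.6, 9.0]

def pearson_x2(obs, exp):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-squared variate (x > 0, integer df >= 1).

    Illustrative helper: starts from the exact tail for df = 2 (even df)
    or df = 1 (odd df) and adds one term for each further 2 df.
    """
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1))
        a += 1
    return q

x2 = pearson_x2(observed, expected)
df = len(observed) - 1 - 2          # two parameters estimated: mean and SD
p_value = chi2_upper_tail(x2, df)
print(f"X2 = {x2:.2f}, df = {df}, P = {p_value:.3g}")
```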

If preferred, a likelihood ratio G-test could be used instead.

$$G = 2\sum_i f_i \ln\frac{f_i}{\hat{f}_i} = 2\left[\,7\ln\frac{7}{6.205} + \dots + 17\ln\frac{17}{9.002}\right] = 66.6$$

The number of degrees of freedom is again given by [number of classes - 1 - number of parameters estimated] = 18. Hence P = 0.000000166. Again we reject the null hypothesis and conclude that the observed distribution deviates significantly from a normal distribution.
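The G statistic is equally simple to compute. The Python sketch below again uses the tabulated (rounded) expected frequencies, so it should land near, rather than exactly on, the value of 66.6 given above. Skipping classes with an observed count of zero is a conventional choice we have added; it is not needed for these data.

```python
import math

observed = [7, 6, 15, 26, 38, 43, 32, 68, 69, 105, 127,
            121, 94, 97, 85, 34, 25, 21, 13, 1, 17]
expected = [6.2, 6.3, 11.2, 18.8, 29.3, 42.9, 58.8, 75.4, 90.4, 101.6, 106.8,
            105.1, 96.9, 83.5, 67.4, 51.0, 36.1, 23.9, 14.8, 8.6, 9.0]

def g_statistic(obs, exp):
    """Likelihood ratio G = 2 * sum(f * ln(f / f_hat)).

    Classes with an observed count of zero contribute nothing
    (0 * ln 0 is taken as 0), so they are skipped.
    """
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

g = g_statistic(observed, expected)
df = len(observed) - 1 - 2          # same degrees of freedom as for X2
print(f"G = {g:.1f} on {df} df")
```

G is referred to the same chi-squared distribution as X2, here with 18 degrees of freedom.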

Accidents experienced by 414 machinists over 3 months

No. accidents   Observed (fi)   Poisson (f̂i)   Negative binomial (f̂i)
0                   296             256               299
1                    74             122                69
2                    26              30                26
3                     8               5                11
4                     4               1                 5
5                     4               0                 2
6                     1               0                 1
7                     0               0                 1
8                     1               0                 0

Our second worked example is from a study by Greenwood & Yule (1920) that we first looked at in Unit 4. The table gives the observed frequency distribution of accidents per individual, together with expected distributions assuming either a Poisson or negative binomial distribution.

Number of accidents is a discrete variable, so either Pearson's chi square test or the G likelihood ratio test would be appropriate to assess goodness of fit. To avoid expected frequencies less than 5, we pool the higher categories as appropriate.

Poisson distribution

Pearson's X2 can then be calculated using the general formula:

$$X^2 = \sum_i \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} = \frac{(296 - 256)^2}{256} + \dots + \frac{(18 - 6)^2}{6} = 49.67$$

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated]. Pooling the classes for 3 or more accidents (observed 18, expected 6) leaves four classes, and a Poisson distribution has just one parameter, so in this case it is [4-1-1] = 2. Using R's chi-squared distribution function, pchisq(49.67, 2, lower.tail=FALSE), gives an upper tail P-value of 1.64e-11. We can therefore conclude that the observed distribution deviates significantly from a Poisson distribution.

Negative binomial distribution

Pearson's X2 can then be calculated using the general formula:

$$X^2 = \sum_i \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} = \frac{(296 - 299)^2}{299} + \dots + \frac{(10 - 9)^2}{9} = 1.32$$

The number of degrees of freedom is given by [number of classes - 1 - number of parameters estimated] which, since the negative binomial requires TWO parameters to be estimated (m & k, or p & q), in this case is [5-1-2] = 2. Referring this value to the chi-squared distribution in R gives a P-value of 0.516. We can therefore conclude that the observed distribution does not deviate significantly from a negative binomial distribution. Note the phrasing here - we have not proved that the negative binomial is a 'significantly good fit' to the data, as we cannot prove the null hypothesis.
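Both fits can be checked with one short Python sketch. The pooling rule used here (merge the highest class into the one below until the pooled expected frequency reaches 5) is our assumption about how the classes were combined. It does reproduce the final terms of the two sums above: 18 observed against 6 expected for the Poisson, and 10 against 9 for the negative binomial. The helper names are illustrative, and chi2_upper_tail stands in for R's upper-tail pchisq.

```python
import math

accidents = [296, 74, 26, 8, 4, 4, 1, 0, 1]        # observed, 0 to 8 accidents
poisson = [256, 122, 30, 5, 1, 0, 0, 0, 0]         # expected, Poisson
neg_binomial = [299, 69, 26, 11, 5, 2, 1, 1, 0]    # expected, negative binomial

def pool_upper_tail(observed, expected, minimum=5):
    """Merge the top class into the one below until its expected value >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs
        exp[-1] += last_exp
    return obs, exp

def pearson_x2(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def chi2_upper_tail(x, df):
    """Upper-tail chi-squared probability (x > 0, integer df >= 1); illustrative helper."""
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1))
        a += 1
    return q

def goodness_of_fit(observed, expected, n_parameters):
    obs, exp = pool_upper_tail(observed, expected)
    x2 = pearson_x2(obs, exp)
    df = len(obs) - 1 - n_parameters
    return obs, exp, x2, df, chi2_upper_tail(x2, df)

for name, fitted, n_parameters in [("Poisson", poisson, 1),
                                   ("Negative binomial", neg_binomial, 2)]:
    obs, exp, x2, df, p = goodness_of_fit(accidents, fitted, n_parameters)
    print(f"{name}: pooled observed {obs}, X2 = {x2:.2f}, df = {df}, P = {p:.3g}")
```

With these data the pooling leaves four classes for the Poisson fit (0, 1, 2, and 3 or more) and five for the negative binomial (0 to 3, and 4 or more), so the two tests do not share the same number of classes even though both end up with 2 degrees of freedom.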

Observed and expected frequencies of 47 natal den sites in different habitats

Habitat          Observed (fi)   Expected (f̂i)
Wetlands              13             11.40
Farmsteads            18              1.02
Nesting areas          4             12.50
Right-of-ways          3              4.10
Woodland               8             15.68
Misc                   1              2.30

Our third worked example is from a study on the striped skunk by Larivière & Messier (1998) that we looked at in Unit 6. The observed distribution of natal dens between habitat types is compared with the expected distribution if selection were in accordance with availability. Three of the six expected frequencies are less than 5, so some categories should be pooled. However, the authors did not pool categories, so we will first calculate Pearson's X2 without pooling.

We will use R for all the calculations here, as there is nothing new to learn from working through them by hand. R gives an X2-value of 293.46, similar to that obtained by the authors. Since the theoretical distribution is dictated by the availability of the different habitats, which is known without error, no parameters are estimated and we have (k-1) = 5 df and P < 0.00001. Note however that R gives a warning that the chi-squared approximation may be incorrect.
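For readers who want to see where the X2-value comes from, the following Python sketch (the source used R) lists each habitat's contribution to the statistic and flags the expected frequencies below 5. Variable names are illustrative.

```python
habitats = ["Wetlands", "Farmsteads", "Nesting areas",
            "Right-of-ways", "Woodland", "Misc"]
observed = [13, 18, 4, 3, 8, 1]
expected = [11.40, 1.02, 12.50, 4.10, 15.68, 2.30]

# Contribution of each habitat to Pearson's X2
contributions = {h: (o - e) ** 2 / e
                 for h, o, e in zip(habitats, observed, expected)}
x2 = sum(contributions.values())
df = len(observed) - 1              # no parameters estimated from the data

# Classes whose expected frequency is below the usual threshold of 5
small = [h for h, e in zip(habitats, expected) if e < 5]

for h, c in contributions.items():
    print(f"{h:<14} {c:8.2f}")
print(f"X2 = {x2:.2f} on {df} df; expected < 5 in: {', '.join(small)}")
```

Almost all of the statistic comes from farmsteads, the class with the smallest expected frequency, which is why the chi-squared approximation deserves some caution here.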

We must then consider which classes to pool to provide a more valid test. One might be tempted to pool all three classes with small expected frequencies, but this would not be wise because farmsteads is one of the categories we are most interested in - and is also the category in which there is the widest divergence between observed and expected. We will therefore pool only the right-of-ways and miscellaneous categories - which at least seems to make sense.

This gives almost exactly the same (highly significant) X2-value, albeit with one fewer degree of freedom. R still complains that the chi-squared approximation may be incorrect, because it issues this warning whenever any expected frequency is below 5, even though we have now met the condition that no more than 20% of the expected frequencies (that is 1 out of 5) be less than 5. Given it would be quite illogical to pool farmsteads with any other category, we could either present the analysis as is or (better) use Monte Carlo simulation to estimate the null distribution of X2.

This gives a P-value of 0.0005 indicating that the choice of natal den sites differs significantly from that expected on the basis of availability.
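A Monte Carlo test of this kind can be sketched as follows. This is a Python illustration under stated assumptions, not the R code behind the result above. We assume the five pooled classes were used, that samples of 47 dens are drawn with probabilities proportional to the expected frequencies, and that there are 2000 replicates (the default in R's chisq.test when simulate.p.value=TRUE). The P-value is calculated as (number of simulated X2-values at least as large as the observed one + 1) / (replicates + 1).

```python
import random

# Pooled classes: wetlands, farmsteads, nesting areas, woodland,
# right-of-ways + miscellaneous
observed = [13, 18, 4, 8, 4]
expected = [11.40, 1.02, 12.50, 15.68, 6.40]

def pearson_x2(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def monte_carlo_p(observed, expected, replicates=2000, seed=1):
    """Estimate P by simulating samples under the null hypothesis."""
    rng = random.Random(seed)        # seeded only to make the sketch repeatable
    n = sum(observed)
    classes = range(len(expected))
    x2_observed = pearson_x2(observed, expected)
    at_least_as_extreme = 0
    for _ in range(replicates):
        # Draw n dens, each falling in a class with probability
        # proportional to that class's expected frequency
        counts = [0] * len(expected)
        for c in rng.choices(classes, weights=expected, k=n):
            counts[c] += 1
        if pearson_x2(counts, expected) >= x2_observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (replicates + 1)

p_value = monte_carlo_p(observed, expected)
print(f"X2 = {pearson_x2(observed, expected):.2f}, Monte Carlo P = {p_value:.4f}")
```

When no simulated value reaches the observed X2, as is to be expected with a statistic this large, the estimate is 1/2001, or about 0.0005. This is the smallest P-value that 2000 replicates can return, so it is best read as an upper limit on P.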