


Goodness-of-fit tests for categorized data: Pearson's chi square and likelihood ratio G-test



These goodness of fit tests are designed to test the null hypothesis that an observed frequency distribution is consistent with a hypothesized or theoretical distribution. Pearson's chi square test is the oldest and most frequently used goodness of fit test. The likelihood ratio G-test is an alternative method which has been strongly advocated in recent years. The tests are primarily intended for categorical or discrete variables. They can be used for measurement variables (for example, testing the fit of data to a normal distribution), but this loses information, and can give badly biased results, because the data must first be collapsed into class intervals (categories).

Although the test statistics are calculated differently for the two tests, in both cases the statistic is asymptotically distributed as χ². The number of degrees of freedom depends on whether the expected distribution is completely specified (for example, an expected 9:3:3:1 ratio of phenotype frequencies based on genetic theory) or whether parameters are estimated from the sample (for example, an omnibus test of normality).

If the theoretical distribution is completely specified, the number of degrees of freedom is given by:

df = [number of classes − 1]
If the known theoretical distribution is derived from the proportionate availability of resources, then those proportions must be known without error.

If parameters of the theoretical distribution are estimated from the sample, the number of degrees of freedom is (approximately) given by:

df = [number of classes − 1 − number of parameters estimated]

Note that this latter expression is only an approximation, and tends to reduce the number of degrees of freedom excessively, making the test too conservative. The true distribution of the statistic lies somewhere between two chi square distributions: one with [number of classes − 1 − number of parameters estimated] degrees of freedom and one with [number of classes − 1] degrees of freedom.
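As a rough illustration of how the two degrees-of-freedom expressions bracket the P-value, the following Python sketch evaluates the upper tail of the χ² distribution for both. The function names, the eight classes, the two estimated parameters and the statistic of 9.2 are all hypothetical values chosen for illustration. The tail probability is computed from the regularized incomplete gamma function using only the standard library; it is a sketch, not a substitute for a tested statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma function.
        term = 1.0 / a
        total = term
        n = a
        for _ in range(1000):
            n += 1.0
            term *= z / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper function.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_prefactor)

def p_value_bracket(statistic, n_classes, n_estimated):
    """P-values with and without subtracting the estimated parameters."""
    df_reduced = n_classes - 1 - n_estimated
    df_full = n_classes - 1
    if df_reduced < 1:
        raise ValueError("too few classes for the number of estimated parameters")
    return chi2_sf(statistic, df_reduced), chi2_sf(statistic, df_full)

# Hypothetical example: 8 class intervals, mean and standard deviation estimated.
p_reduced, p_full = p_value_bracket(9.2, n_classes=8, n_estimated=2)
print(f"P lies between {p_reduced:.3f} (df = 5) and {p_full:.3f} (df = 7)")
```

If both P-values fall on the same side of the chosen significance level, the uncertainty over the degrees of freedom does not affect the conclusion.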


Pearson's chi square goodness of fit test

The test statistic, X², can be calculated from the following general formula:

Algebraically speaking:

$$X^2 = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

  • $X^2$ is Pearson's chi squared statistic, which under the null hypothesis is approximately distributed as $\chi^2$ with (k − 1 − number of parameters estimated) degrees of freedom, k being the number of classes,
  • $f_i$ is the observed frequency in the ith category,
  • $\hat{f}_i$ is the expected frequency in the ith category.
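The calculation can be sketched in a few lines of Python for a completely specified distribution such as the 9:3:3:1 ratio mentioned earlier. The function name `pearson_x2` and the observed counts are illustrative assumptions, not values from this page.

```python
def pearson_x2(observed, expected_ratio):
    """Pearson's X2: sum over classes of (observed - expected)^2 / expected.

    Expected frequencies are the sample size shared out in `expected_ratio`.
    """
    if len(observed) != len(expected_ratio):
        raise ValueError("observed and expected_ratio must have equal length")
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    expected = [n * r / ratio_total for r in expected_ratio]
    x2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return x2, expected

# Illustrative counts for four phenotypic classes (n = 556).
observed = [315, 108, 101, 32]
x2, expected = pearson_x2(observed, [9, 3, 3, 1])
df = len(observed) - 1  # distribution completely specified, so df = k - 1
print(f"X2 = {x2:.3f} on {df} degrees of freedom")  # X2 = 0.470 on 3 degrees of freedom
```

The statistic would then be referred to the χ² distribution with 3 degrees of freedom.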


For more than two classes no continuity correction is required. For the special case of two classes, some statisticians feel that a correction for continuity should be applied if the overall number of observations (n) lies between 25 and 200. Other statisticians argue that such a correction makes the test excessively conservative. Yates' correction to the general formula is made by subtracting 0.5 from the absolute value (modulus) of each difference between observed and expected values before squaring. If there are only two classes, and n is less than 25, the exact binomial test should be used instead.
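A minimal Python sketch of the two-class case, with and without Yates' correction, is given here. The function name, the 1:1 default ratio and the counts are hypothetical. Flooring the corrected difference at zero is a common safeguard added here as an assumption, so that the correction cannot overshoot when the observed and expected values differ by less than 0.5.

```python
def x2_two_classes(observed, expected_ratio=(1, 1), yates=False):
    """Pearson's X2 for exactly two classes, optionally with Yates' correction."""
    if len(observed) != 2 or len(expected_ratio) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    x2 = 0.0
    for f, r in zip(observed, expected_ratio):
        e = n * r / ratio_total
        diff = abs(f - e)
        if yates:
            # Subtract 0.5 from the absolute difference (floored at zero).
            diff = max(diff - 0.5, 0.0)
        x2 += diff ** 2 / e
    return x2

# Hypothetical counts: 30 and 20 individuals, tested against a 1:1 ratio.
plain = x2_two_classes([30, 20])
corrected = x2_two_classes([30, 20], yates=True)
print(f"uncorrected X2 = {plain:.2f}, corrected X2 = {corrected:.2f}")
# uncorrected X2 = 2.00, corrected X2 = 1.62
```

The corrected statistic is always the smaller of the two, which is why the correction makes the test more conservative.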


Likelihood ratio G-test

The likelihood ratio G-statistic provides an asymptotically equivalent alternative to Pearson's X². It can be obtained from the following:

Algebraically speaking:

$$G = 2 \sum_{i=1}^{k} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right)$$

  • $G$ is the likelihood ratio statistic, approximating to $\chi^2$ for large samples,
  • $f_i$ and $\hat{f}_i$ are the observed and expected frequencies for the ith class.
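The G-statistic can be computed in much the same way as Pearson's X². In the Python sketch below, the function name `g_statistic` and the counts are illustrative assumptions. Classes with an observed frequency of zero contribute nothing to the sum, since f ln(f) tends to zero as f tends to zero.

```python
import math

def g_statistic(observed, expected_ratio):
    """Likelihood ratio G: 2 * sum of observed * ln(observed / expected)."""
    if len(observed) != len(expected_ratio):
        raise ValueError("observed and expected_ratio must have equal length")
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    total = 0.0
    for f, r in zip(observed, expected_ratio):
        e = n * r / ratio_total
        if f > 0:  # a zero observed count contributes zero to the sum
            total += f * math.log(f / e)
    return 2.0 * total

# Illustrative counts for four classes tested against a 9:3:3:1 ratio.
g = g_statistic([315, 108, 101, 32], [9, 3, 3, 1])
print(f"G = {g:.3f} on 3 degrees of freedom")
```

For these counts G comes out close to the value of Pearson's X² computed from the same data, as expected when all expected frequencies are large.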

For more than two classes no continuity correction is required. For the special case of two classes, some statisticians suggest you use Williams' "continuity-correction", the 2x2 version of which is given in Unit 9.
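For orientation, a sketch of Williams' correction for a goodness-of-fit G-test is shown below. It assumes the commonly quoted form for a completely specified distribution with k classes and n observations, q = 1 + (k² − 1) / (6n(k − 1)), with the adjusted statistic being G / q. This formula is not stated on this page, so treat it as an assumption and check it against your own reference before relying on it. The function names and counts are hypothetical.

```python
import math

def g_statistic(observed, expected_ratio):
    """Likelihood ratio G: 2 * sum of observed * ln(observed / expected)."""
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    total = 0.0
    for f, r in zip(observed, expected_ratio):
        e = n * r / ratio_total
        if f > 0:
            total += f * math.log(f / e)
    return 2.0 * total

def williams_q(n, k):
    """Assumed form of Williams' divisor: 1 + (k^2 - 1) / (6 * n * (k - 1))."""
    if k < 2 or n < 1:
        raise ValueError("need at least two classes and one observation")
    return 1.0 + (k ** 2 - 1) / (6.0 * n * (k - 1))

# Hypothetical two-class example tested against a 1:1 ratio.
observed = [30, 20]
g = g_statistic(observed, [1, 1])
q = williams_q(sum(observed), len(observed))
g_adjusted = g / q
print(f"G = {g:.3f}, q = {q:.3f}, adjusted G = {g_adjusted:.3f}")
```

Because q is always greater than 1, the adjusted statistic is slightly smaller than G, and the adjustment shrinks as n grows.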

In some circumstances the P-value obtained using the G-statistic is somewhat closer to that obtained using the exact multinomial test.




Assumptions

  1. Sampling is random, so observations are independent.
    This assumption is not met if samples are obtained from clusters, nor if observations are derived from pooled samples.
  2. Mutual exclusivity
    Each observation is classified into one of several different mutually exclusive categories.
  3. Errors are normally distributed
    Cell values will be distributed more-or-less normally about their expected values providing expected frequencies are sufficiently large. Authorities differ over which test statistic performs better with small expected frequencies and over how large is 'sufficiently large' for each of the tests. The specifications given here were made for Pearson's chi square test; requirements for the likelihood ratio test are (probably) more stringent.
    • For only two categories, no expected cell frequency should be less than 5. For smaller expected frequencies the binomial test should be used.
    • For more than two categories, Cochran (1954) specifies that no more than 20% of the expected frequencies should be less than 5, and no expected frequency should be less than 1. Conover (1999) recommends that all expected values should be at least 0.5 and most larger than 1.0.
    Adjacent categories can be combined in order to meet these assumptions providing such combinations are meaningful.
      N.B. It is also assumed that categories are not set up so as to best match the expected proportions, or vice versa, and that they are not repeatedly altered until the desired test outcome is obtained!
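The rules of thumb for expected frequencies listed above can be checked mechanically before running either test. The Python sketch below applies the two-category rule and Cochran's (1954) rule as stated in the list. The function name and the example frequencies are hypothetical, and the check says nothing about the other assumptions (random sampling, mutual exclusivity).

```python
def check_expected_frequencies(expected):
    """Return (ok, message) for the expected-frequency rules of thumb."""
    k = len(expected)
    if k < 2:
        raise ValueError("need at least two categories")
    smallest = min(expected)
    if k == 2:
        # Two categories: no expected frequency below 5.
        if smallest < 5:
            return False, "expected frequency below 5: use the binomial test"
        return True, "two-category rule satisfied"
    # More than two categories: Cochran's rule.
    share_below_5 = sum(1 for e in expected if e < 5) / k
    if smallest < 1:
        return False, "an expected frequency is below 1: combine categories"
    if share_below_5 > 0.20:
        return False, "more than 20% of expected frequencies are below 5"
    return True, "Cochran's rule satisfied"

# Hypothetical expected frequencies.
print(check_expected_frequencies([312.75, 104.25, 104.25, 34.75]))
print(check_expected_frequencies([12.0, 6.0, 4.0, 2.5, 0.5]))
print(check_expected_frequencies([4.0, 16.0]))
```

When the check fails for more than two categories, combining adjacent categories is only appropriate if the merged category is still meaningful.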
