Negative binomial distribution: Use & misuse
(contagious distribution, overdispersion parameter, truncation, log series, log normal distribution)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
The negative binomial distribution, like the normal distribution, is described by a mathematical formula. The negative binomial distribution is commonly used to describe the distribution of count data, such as the numbers of parasites in blood specimens, where that distribution is aggregated or contagious. Also like the normal distribution, it can be completely defined by just two parameters - its mean (m) and shape parameter (k). However, unlike the normal distribution, the negative binomial does not naturally result from the use of large samples - nor does it arise from a single causal model.
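As a minimal sketch of the two-parameter description above, the Python below writes the probability of a count x in terms of the mean m and shape k. It assumes the parameterization common in ecology, in which the variance is m + m²/k. The function names and the values m = 2, k = 0.5 are illustrative only and do not come from any data set discussed here.

```python
import math

def nb_pmf(x, m, k):
    """P(X = x) for a negative binomial with mean m > 0 and shape k > 0."""
    # log of Gamma(k + x) / (x! * Gamma(k)), computed on the log scale
    # so that large counts do not overflow
    log_coef = math.lgamma(k + x) - math.lgamma(x + 1) - math.lgamma(k)
    log_p = (log_coef
             + k * math.log(k / (k + m))
             + x * math.log(m / (k + m)))
    return math.exp(log_p)

def nb_variance(m, k):
    """Variance implied by mean m and shape k: m + m^2 / k."""
    return m + m * m / k

# Illustrative (hypothetical) parameter values: a strongly aggregated count
m, k = 2.0, 0.5
probs = [nb_pmf(x, m, k) for x in range(2000)]
total = sum(probs)                                # should be very close to 1
mean = sum(x * p for x, p in enumerate(probs))    # should be very close to m

print("P(0) =", round(probs[0], 4))
print("sum of probabilities =", round(total, 6))
print("mean =", round(mean, 6), " variance =", nb_variance(m, k))
```

Under this parameterization a small k means strong aggregation (variance far above the mean), while as k becomes large the variance approaches m and the distribution approaches the Poisson.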
Much like the binomial and Poisson distributions, the negative binomial distribution is increasingly being used in all disciplines to describe the error distributions when modelling count data. We consider this in the medical context by looking at its use for modelling the number of alcoholic drinks taken over a period of time, and in the veterinary context for modelling the incidence rate of mastitis in cattle. In the other examples we concentrate on the use of the negative binomial parameter k as an index of aggregation. We also look at other indices of aggregation (including b from Taylor's power law), and at the use of the log series for species abundance curves.
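Taylor's power law relates the variance to the mean across a set of samples as s² = a·m^b, so b is usually estimated as the slope of log variance on log mean. The sketch below does this by ordinary least squares. The function name and the means and variances are made up for illustration, and other regression methods may be more appropriate for real data.

```python
import math

def fit_taylor(means, variances):
    """Fit log10(variance) = log10(a) + b * log10(mean) by least squares.

    Returns (a, b). All means and variances must be positive.
    """
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in variances]
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope: the index of aggregation
    a = 10 ** (ybar - b * xbar)        # back-transformed intercept
    return a, b

# Hypothetical sample means and variances from five sets of counts
example_means = [0.5, 1.2, 3.0, 7.5, 15.0]
example_variances = [0.9, 3.1, 11.8, 52.0, 160.0]
a_hat, b_hat = fit_taylor(example_means, example_variances)
print("a =", round(a_hat, 3), " b =", round(b_hat, 3))
```

A slope near 1 (with a near 1) is what a Poisson pattern would give; slopes above 1 indicate variance rising faster than the mean.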
There are a number of common misuses. First, as usual, data are often obtained by convenience sampling, which can be highly biased. Sampling only part of an animal for an ectoparasite can also lead to bias. Second, there is often no attempt to establish whether the data actually fit a negative binomial, either with a statistical test or graphically with Q-Q plots. Third, sample sizes are often far too small to give a reliable estimate of the overdispersion parameter, and information on sample size or standard errors is frequently not given.
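To make the estimation step concrete, here is a minimal sketch of the simple moment estimate of the shape parameter, k = m² / (s² − m), where m is the sample mean and s² the sample variance. This is only one of several estimators (maximum likelihood is usually recommended), and the counts are hypothetical. With only ten observations, as here, the estimate would be far too imprecise to rely on, which is the sample-size problem described above.

```python
from statistics import mean, variance

def moment_k(counts):
    """Moment estimate of the negative binomial shape parameter k.

    Returns None when the sample variance does not exceed the mean,
    because the estimate is then undefined or negative (no overdispersion).
    """
    m = mean(counts)
    s2 = variance(counts)          # sample variance, n - 1 denominator
    if s2 <= m:
        return None
    return m * m / (s2 - m)

# Hypothetical parasite counts from ten hosts
counts = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
k_hat = moment_k(counts)
print("n =", len(counts), " mean =", mean(counts), " k estimate =", k_hat)
```

Reporting the sample size alongside the estimate, as the print statement does, is the minimum a reader needs to judge how far to trust it.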
Unless a 'natural' unit is being sampled, such as an individual animal, the problem of scale should be addressed since the pattern of dispersion is critically dependent on this. The issue of truncation sometimes causes confusion - truncation can take place by eliminating the zero category, or by progressively eliminating higher categories if it is suspected that they are under-represented in the sample. When more than one index of aggregation is used, the problems of the negative binomial parameter k often become more apparent, since the different indices may all give different pictures of the degree of aggregation. Sometimes a wholly different approach is more applicable - such as mapping each individual using a regular pattern of traps or quadrats and using GIS interpolation routines.
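The point about different indices can be illustrated with a short sketch that computes three widely used ones from the same sample: the variance-to-mean ratio, Lloyd's index of patchiness (mean crowding divided by the mean), and Morisita's index. The formulas are the standard textbook forms as far as we are aware. The function name and the counts are hypothetical.

```python
from statistics import mean, variance

def aggregation_indices(counts):
    """Return three common indices of aggregation for a list of counts."""
    n = len(counts)
    m = mean(counts)
    s2 = variance(counts)                  # sample variance
    total = sum(counts)
    sum_sq = sum(x * x for x in counts)

    vmr = s2 / m                           # variance-to-mean ratio
    mean_crowding = m + vmr - 1            # Lloyd's mean crowding
    return {
        "variance_mean_ratio": vmr,
        "lloyd_patchiness": mean_crowding / m,
        "morisita": n * (sum_sq - total) / (total * total - total),
    }

# Hypothetical counts per sampling unit
counts = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
indices = aggregation_indices(counts)
for name, value in indices.items():
    print(name, "=", round(value, 3))
```

All three equal 1 for a random (Poisson) pattern in expectation, but they sit on different scales and respond differently to the mean and to sample size, so their values should not be compared with one another directly.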
Use of the log series and lognormal distributions for species abundance curves also has its drawbacks. This is mainly because the two distributions tend to be indistinguishable statistically unless you have a very large sample (the sample size problem again). When it comes to Whittaker curves, assessing the shape of such curves is highly subjective, and it is often hard to decide where a line goes from being linear to S-shaped.
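For readers unfamiliar with how a log series is fitted, the sketch below assumes the usual form in which the expected number of species with n individuals is α·xⁿ/n, with x = N/(N + α), so that the total number of species is S = α·ln(1 + N/α). It solves that last equation for α by bisection. The function names and the values of S and N are hypothetical.

```python
import math

def fisher_alpha(S, N):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection.

    S is the number of species, N the number of individuals (0 < S < N).
    """
    if not 0 < S < N:
        raise ValueError("need 0 < S < N")
    lo, hi = 1e-12, 1.0
    # The right-hand side increases with alpha, so widen the bracket
    # until it exceeds S, then bisect.
    while hi * math.log1p(N / hi) < S:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log1p(N / mid) < S:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_species(n, alpha, N):
    """Expected number of species represented by exactly n individuals."""
    x = N / (N + alpha)
    return alpha * x ** n / n

# Hypothetical sample: 46 species among 1000 individuals
S_obs, N_obs = 46, 1000
alpha_hat = fisher_alpha(S_obs, N_obs)
print("alpha =", round(alpha_hat, 3))
print("expected singletons =", round(expected_species(1, alpha_hat, N_obs), 2))
```

Comparing these expected frequencies with the observed ones is the usual starting point for judging fit, subject to the caveat above that a lognormal may fit a modest sample about equally well.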
What the statisticians say
The negative binomial is rather poorly covered in most general medical statistics texts, but Remington et al. (1985) give a reasonable treatment, and it is mentioned in others including Armitage & Berry (2002). Hilbe (2007) is an advanced text devoted to the negative binomial model and its many variations. Johnson et al. (2005) cover the negative binomial distribution in Chapter 5. Agresti (2002) deals with its use in generalized linear models. Gotelli & Ellison (2004) cover the negative binomial in Chapter 2; Krebs (1999) and Young (1998) both describe various measures of spatial distribution in ecology, including how to fit the negative binomial shape parameter.
Lloyd-Smith (2007) looks at the maximum likelihood estimation of the negative binomial dispersion parameter with respect to infectious diseases. Brooker et al. (2006) and Alexander et al. (2000) use the negative binomial distribution for spatial modelling of parasite counts. Glynn & Buring (1996) discuss use of the negative binomial distribution for event rates. White & Bennet (1996) review the use of the negative binomial for modelling count data in ecology. Crofton (1971) argues that the negative binomial distribution can be regarded as a fundamental as well as an empirical model of parasitism. This is one of the classic papers of parasitology and is well worth reading! Anscombe (1950) and Bliss & Fisher (1953) give the mathematical background to the negative binomial, and Waters (1959) proposes it as a quantitative measure of aggregation in insects.
Kendal (2004) reviews the debate surrounding Taylor's law and discusses its causes. Hurlbert (1990) provides an excellent critique of indices of aggregation. Routledge & Swartz (1992) argue that the family of power curves has no advantage over the family of quadratic curves, which is disputed by Perry and Woiwood. Taylor et al. (1979) point out a number of shortcomings of the negative binomial as a measure of aggregation, and suggest the log variance-log mean relationship as an alternative measure. Gotelli & Colwell (2001) review the procedures and pitfalls in the measurement and comparison of species richness. Hill & Hamer (1998) describe the problems of differentiating the log series and log normal distributions. May (1975) takes a penetrating mathematical look at measures of species abundance and diversity. Williams (1944) considers various distributions that can be used to represent the frequency of the number of individuals per species, including the log series.