"It has long been an axiom of mine that the little things are infinitely the most important"
Binomial and Poisson distributions: Use & misuse
(standard error of proportion and rate, assumptions, cluster sampling, pattern of dispersion, test of randomness)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
The mathematics of the binomial distribution provides a short-cut method to estimate the variance of a proportion derived from a simple random sample, given the values of p (the proportion), q (= 1 - p) and n (the sample size). This approach is heavily used in medical statistics to estimate the standard error (and from this the confidence interval) of disease prevalence and of the proportion cured. However, as with estimating any standard error, the sample must be random to ensure independence of observations. If one is dealing with a convenience sample, then estimating the standard error is meaningless - yet we give several examples where this has been done. Similarly, if clusters rather than individuals are sampled randomly, one cannot use the simple binomial formulae to work out the standard error, as this will underestimate the variability. Another use of the binomial distribution is to test whether choice between two alternatives is random, by comparing observed frequencies with those expected under the binomial distribution. This is not commonly done, but we did find two examples in quite different disciplines.
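The cluster-sampling point can be made concrete with a small sketch. The village counts below are hypothetical, invented purely for illustration, and the function names are our own. The cluster-based estimate assumes equal-sized clusters and treats each cluster as the sampling unit, which is only one of several ways to allow for clustering.

```python
import math
import statistics


def binomial_se(p, n):
    """Standard error of a proportion, assuming a simple random sample."""
    return math.sqrt(p * (1 - p) / n)


def cluster_se(cluster_proportions):
    """Standard error of the mean of equal-sized cluster proportions.

    Treats the cluster, not the individual, as the sampling unit.
    """
    k = len(cluster_proportions)
    return statistics.stdev(cluster_proportions) / math.sqrt(k)


# Hypothetical data: 5 randomly sampled villages, 20 people examined in each.
positives = [2, 18, 4, 16, 10]
cluster_size = 20

n = cluster_size * len(positives)
p = sum(positives) / n

# Naive approach: pretend all 100 individuals were sampled independently.
naive = binomial_se(p, n)

# Cluster approach: use the variation between village proportions.
clustered = cluster_se([x / cluster_size for x in positives])

# Design effect: ratio of the two variances.
design_effect = (clustered / naive) ** 2

print(f"p = {p:.2f}")
print(f"binomial SE (individuals assumed independent) = {naive:.3f}")
print(f"cluster-based SE = {clustered:.3f}")
print(f"design effect = {design_effect:.1f}")
```

With these made-up numbers the cluster-based standard error is roughly three times the binomial one, because prevalence varies far more between villages than the binomial model allows.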
The Poisson distribution is also used to estimate standard errors, in this case of frequencies and (especially) rates. Again it is essential that events are independent. A common misuse is to use the Poisson to attach a standard error to a rate derived from pooling events from different clusters, whether villages or herds. The Poisson is also much used to test randomness over space or time by comparing observed frequencies with those expected under the Poisson distribution. We give several examples of this, most of which demonstrate some of the pitfalls in this approach. One such pitfall is that statistical assessment of goodness of fit is very dependent on sample size. Failing to demonstrate a significant difference between observed and expected frequencies is not the same as demonstrating a 'good fit'. Another pitfall is that the result depends critically on scale - for example, one will get quite different distributions if one looks at numbers per leaf, per branch or per tree.
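The sample-size pitfall can be sketched as follows. The frequencies are hypothetical (say, numbers of insects per leaf) and the function names are our own. The sketch uses a Pearson chi-square comparison with the mean estimated from the data, and for simplicity counts the open-ended top class as exactly 3 when estimating the mean.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_chi_square(observed):
    """Pearson chi-square comparing observed frequencies with Poisson ones.

    observed[k] is the number of sampling units containing k events;
    the last class is treated as 'k or more'.
    Returns (chi_square, degrees_of_freedom).
    """
    n_units = sum(observed)
    # Simplifying assumption: the top class is counted as exactly `last`.
    mean = sum(k * f for k, f in enumerate(observed)) / n_units
    last = len(observed) - 1
    probs = [poisson_pmf(k, mean) for k in range(last)]
    probs.append(1 - sum(probs))  # upper tail: `last` or more events
    expected = [n_units * p for p in probs]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 2  # classes - 1, less one estimated parameter
    return chi_sq, df


# Hypothetical frequencies of leaves with 0, 1, 2 and 3+ insects.
small_sample = [38, 32, 18, 12]                # 100 leaves
large_sample = [10 * f for f in small_sample]  # same proportions, 1000 leaves

chi_small, df = poisson_chi_square(small_sample)
chi_large, _ = poisson_chi_square(large_sample)

CRITICAL_5_PERCENT_DF2 = 5.991  # chi-square critical value, 2 df, P = 0.05

print(f"100 leaves:  chi-square = {chi_small:.2f} on {df} df")
print(f"1000 leaves: chi-square = {chi_large:.2f} on {df} df")
print(f"5% critical value = {CRITICAL_5_PERCENT_DF2}")
```

The proportions in each class are identical in the two samples, so the departure from the Poisson is the same, yet only the larger sample gives a significant result. The non-significant result from 100 leaves reflects low power, not a good fit.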
Both the binomial and Poisson distributions are also very important in modelling relationships between response and explanatory variables, where in certain situations they describe the error structure much better than the normal distribution does. We would be getting ahead of ourselves if we gave this much attention here, although in one veterinary example (cases of mastitis) the interest in the distribution resulted from just such a desire.
What the statisticians say
Armitage & Berry (2002) and Bland (2000) both provide good introductions to the binomial and Poisson distributions. Kotz et al. (1992) provide an in-depth treatment of the theory, derivation and application of probability distributions for count data. Sokal & Rohlf (1995) give a fairly detailed account of both the binomial and Poisson distributions in Chapter 5 - along with a very brief mention of other discrete probability distributions. Krebs (1999) and Young (1999) both warn of the inadequacies of the variance-mean ratio as an index of aggregation.
Griffiths (2006) and Griffiths & Haining (2006) describe how powerful Poisson-based modelling tools allowing for spatial autocorrelation have been developed for geographical analysis of count data. Glynn & Buring (1996) discuss the inappropriateness of the Poisson distribution for event rates in many situations because real data are usually over-dispersed. However, Westermeier & Michaelis (1995) find no reason to use anything other than a Poisson distribution to model cases of cancer in children in Germany. Kaplan et al. (1986) provide an interesting example of the use of the binomial distribution in establishing an association between high risk donors and transfusion-associated AIDS. Hurlbert (1990) debunks many of the myths surrounding the variance-mean ratio as an index of aggregation, and proposes an alternative index of departure from the Poisson.
Wikipedia provides sections on discrete probability distributions, the binomial distribution, the Poisson distribution, the Bernoulli distribution and the hypergeometric distribution. The NIST/SEMATECH e-Handbook of Statistical Methods gives details of the binomial distribution and the Poisson distribution. Bret Larget (2003) and Anthony Tanbakuchi (2009) both provide help on using R to calculate probabilities associated with common distributions, as well as to graph probability functions. Sam Wang (2008) uses the binomial distribution to model vote counts in American elections. However, Andrew Gelman (2008) argues that the binomial distribution is inappropriate for this because votes are not independent.