Poisson regression is used to model counts, whether counts of a rare mammal in forest quadrats or counts of disease or deaths expressed as rates (numbers per person-year).

Poisson regression assumes that the mean of the Poisson random variable is a function of explanatory variables:

E(Y) = μ = λ = exp(α + β_{1}X_{1} + ... + β_{k}X_{k})

Hence:

ln E(Y) = α + Σ(β_{j}X_{j}),

where the observed count Y (with its random variation) is Poisson distributed about E(Y) with a variance equal to E(Y).

Because the log of the mean is a linear combination of the predictors, this model is said to have a log link function. The exponential of a Poisson regression coefficient is a rate ratio corresponding to a one-unit difference in the explanatory variable.
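As a minimal numerical sketch of the rate-ratio interpretation (in Python, using made-up values for `alpha` and `beta` that are assumptions for illustration, not estimates from any data), the ratio of expected counts for a one-unit difference in X equals exp(β) wherever on the X scale the comparison is made:

```python
import math

# Hypothetical coefficients, chosen only for illustration.
alpha = 0.5   # intercept on the log scale
beta = 0.3    # slope on the log scale


def expected_count(x):
    """Mean of the Poisson response under a log link: exp(alpha + beta * x)."""
    return math.exp(alpha + beta * x)


# Ratio of expected counts for a one-unit difference in x, at several x values.
ratios = [expected_count(x + 1) / expected_count(x) for x in (0, 2, 10)]
rate_ratio = math.exp(beta)

for r in ratios:
    print(round(r, 6), round(rate_ratio, 6))
```

Each printed pair should agree, because the additive effect of β on the log scale becomes a multiplicative effect on the count scale.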

If (for example) we are considering cases of disease, we may find that each individual is observed for a different period of time. In this case one wants to model rates (counts per unit of time). We therefore write:

ln E(Y) = ln(time) + α + Σ(β_{j}X_{j}),

where time is the number of person-years.

Then:

ln(E(Y)/time) = α + Σ(β_{j}X_{j}),

where E(Y)/time is the expected rate.

The term ln(time) is known as the offset: an explanatory variable whose regression coefficient is known rather than estimated, in this case fixed at 1. It is the amount that must be added to the linear predictor to estimate the count Y, rather than the rate, for any given X.
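The role of the offset can be sketched numerically as follows. This is only an illustration in Python: the coefficient values and the person-years figures are hypothetical, and X is taken to be a single 0/1 exposure indicator.

```python
import math

# Hypothetical values, for illustration only.
alpha = -4.0   # log of the baseline rate per person-year
beta = 0.7     # log rate ratio for exposed (x = 1) versus unexposed (x = 0)


def expected_cases(x, person_years):
    """Expected count with a log(person_years) offset (coefficient fixed at 1)."""
    offset = math.log(person_years)
    return math.exp(offset + alpha + beta * x)


def expected_rate(x):
    """Expected rate per person-year; it does not depend on follow-up time."""
    return math.exp(alpha + beta * x)


# Two exposed groups observed for different amounts of person-time.
short_followup = expected_cases(x=1, person_years=100.0)
long_followup = expected_cases(x=1, person_years=2500.0)

# Dividing each expected count by its person-years recovers the same rate.
print(short_followup / 100.0, long_followup / 2500.0, expected_rate(1))
```

The expected counts differ by the ratio of the person-years, while the rate is the same for both groups, which is why the offset enters with a coefficient of exactly 1.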

#### Assumptions of Poisson regression

These include:

- The logarithm of the frequency or rate changes linearly with equal-increment changes in the explanatory variable.
- Changes in the rate from combined effects of different explanatory variables are multiplicative.
- At each level of the covariates the number of cases has variance equal to the mean (as in the Poisson distribution).
- Errors are independent of each other.

#### Parameter estimation and significance testing

Parameter estimation is carried out using a generalized linear model with a log link and Poisson errors. The log link ensures that all the fitted values are positive, while the Poisson errors take account of the fact that the data are non-negative integers whose variances are assumed equal to their means.
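To show what such a fit involves, here is a minimal sketch in plain Python of maximum-likelihood estimation for a single explanatory variable, using Newton-Raphson steps on the Poisson log-likelihood. The function name `fit_poisson` and the data are hypothetical, invented for illustration; in practice one would use a GLM routine (such as `glm` in R) rather than hand-written code.

```python
import math


def fit_poisson(x, y, n_iter=25):
    """Fit log(E[y]) = a + b*x by Newton-Raphson on the Poisson log-likelihood."""
    a = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b = 0.0
    for _ in range(n_iter):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g_a = sum(yi - mi for yi, mi in zip(y, mu))
        g_b = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h_aa = sum(mu)
        h_ab = sum(mi * xi for xi, mi in zip(x, mu))
        h_bb = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system explicitly.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Made-up counts that increase with x.
x = [0, 1, 2, 3, 4, 5]
y = [2, 3, 6, 7, 8, 14]

a_hat, b_hat = fit_poisson(x, y)
fitted = [math.exp(a_hat + b_hat * xi) for xi in x]
print(a_hat, b_hat)
print(fitted)
```

Because the fitted values are exponentials of the linear predictor, they are positive whatever the coefficients turn out to be. At convergence the fitted values also sum to the observed total, which is a convenient check on the arithmetic.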

#### Assessing explained variation ('goodness of fit')

Procedures here are similar to those for logistic regression. Global goodness of fit tests can be made using the Pearson chi-squared and deviance test statistics. Large values of these statistics, and correspondingly small *P*-values, suggest that the model does not fit the observed data.

A simple Poisson regression model only allows for simple random variation and, as we pointed out in Unit 4, any variation unexplained by that simple model will cause overdispersion. The dispersion is usually estimated as the residual deviance divided by its degrees of freedom. If there is no overdispersion, the ratio will be close to 1; if it is appreciably greater than 1 there is overdispersion. One way to deal with this in R is to use 'family = quasipoisson' rather than 'family = poisson'. One then selects an F test rather than a chi-squared test for deletion of terms. The F test uses the empirical dispersion parameter as an estimate equivalent to the error variance (for full details see Crawley).

As a word of caution, Gerard E. Dallal notes that the dispersion measure estimated from the residual deviance underestimates the excess variability. A dispersion estimate based on the Pearson chi-squared statistic (the sum of squared Pearson residuals) better captures the excess variability. Possibly the best approach is to switch to negative binomial regression, which is more appropriate for contagious distributions.
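The two dispersion estimates discussed above can be computed side by side. The following is a sketch in plain Python under stated assumptions: the observed counts and fitted means are made up for illustration (in a real analysis the fitted means come from the model), and the function names are hypothetical.

```python
import math


def deviance(y, mu):
    """Poisson residual deviance: 2 * sum(y*log(y/mu) - (y - mu))."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The y*log(y/mu) term is taken as 0 when y is 0.
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total


def pearson_chi2(y, mu):
    """Pearson chi-squared: sum of squared Pearson residuals (y - mu)^2 / mu."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))


def dispersion(stat, n_obs, n_params):
    """Statistic divided by its residual degrees of freedom."""
    return stat / (n_obs - n_params)


# Made-up observed counts and fitted means from a hypothetical 2-parameter model.
y = [0, 1, 9, 2, 12, 3]
mu = [3.0, 4.0, 5.0, 4.0, 6.0, 5.0]

disp_deviance = dispersion(deviance(y, mu), n_obs=len(y), n_params=2)
disp_pearson = dispersion(pearson_chi2(y, mu), n_obs=len(y), n_params=2)
print(disp_deviance, disp_pearson)
```

Values of either ratio well above 1 would point to overdispersion. Comparing the two shows how far the deviance-based and Pearson-based estimates differ for a given data set.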