 InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Logistic regression: Use & misuse

## (logit transformation, model misspecification, stepwise selection, manual backward selection, interaction)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

### Use and Misuse

Logistic regression (more strictly, binary logistic regression) assumes that the response variable is binary, with each individual assigned to one of two classes (say infected or uninfected). A positive response (say infected) is coded as Y=1 (known as a success) and a negative response (say uninfected) as Y=0 (known as a failure). The mean of this response variable equals the probability of success p (that is, the proportion infected). We then construct a regression-type model such as p = a + b₁X₁ + b₂X₂, where X₁ and X₂ are explanatory variables and b₁ and b₂ are coefficients.

However, the response variable is bounded (p must fall between 0 and 1) and the typical response is not linear but sigmoidal. Hence we need an appropriate transformation. This is achieved using the logit transformation, logit(p) = ln[p/(1 − p)]. The logit of p is not bounded, and its relationship with the explanatory variables is linearized. Each regression coefficient then represents the change in log odds per unit change in its explanatory variable; coefficients are more interpretable in exponent form (exp(b) or e^b), which converts them to odds ratios. Logistic regression is a generalized linear model, with the parameters of the best-fit model estimated using maximum likelihood rather than least squares. The overall significance of a logistic regression is assessed with a likelihood ratio test.
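The steps just described (logit link, maximum likelihood fitting, odds ratio, likelihood ratio test) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a substitute for a proper GLM routine: it handles a single explanatory variable, the infection counts are hypothetical, and the function names (`fit_logistic`, `log_likelihood` and so on) are ours.

```python
import math

def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood of the model logit(p) = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, iterations=25):
    """Maximum likelihood estimates of (a, b) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to a
            g1 += (y - p) * x    # gradient with respect to b
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: 10 of 50 unexposed and 30 of 50 exposed animals infected.
xs = [0] * 50 + [1] * 50
ys = [1] * 10 + [0] * 40 + [1] * 30 + [0] * 20

a, b = fit_logistic(xs, ys)
odds_ratio = math.exp(b)  # coefficient converted to an odds ratio

# Likelihood ratio test against the intercept-only (null) model.
p_bar = sum(ys) / len(ys)
ll_null = log_likelihood(logit(p_bar), 0.0, xs, ys)
ll_model = log_likelihood(a, b, xs, ys)
g_stat = 2.0 * (ll_model - ll_null)
# Upper tail of chi-square with 1 df: P(X > g) = erfc(sqrt(g / 2)).
p_value = math.erfc(math.sqrt(g_stat / 2.0))

print(f"odds ratio = {odds_ratio:.3f}, G = {g_stat:.2f}, p = {p_value:.2g}")
```

Because the only explanatory variable here is binary, the fitted odds ratio should agree with the cross-product ratio of the 2×2 table, (30 × 40)/(20 × 10) = 6, which gives a quick check on the fitting routine.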

Our review of the literature suggests that model misspecification is a common problem. One needs informative explanatory variables, not vague indices and/or proxy variables. Sometimes one wonders whether the variable has been selected by chance. In one wildlife paper on wolves, year was clearly a factor - yet it was not included in the logistic regression model even though one would have expected the observed changing movement patterns of the wolves to show up in a year × corridor-use interaction. In other papers one often wonders whether transformations and/or polynomial functions were considered before the whole lot was dumped in a stepwise regression program!

The model-building process is unfortunately still dominated by stepwise selection methods. Whilst this is not quite as hazardous when the Akaike Information Criterion is used as the criterion for variable/model selection, stepwise selection is still not recommended. Fortunately many ecologists now tend to use manual backward selection which is a much better defence against the inclusion of spurious variables and incorrect confidence intervals. But probably the biggest problem in the literature is that there is very rarely any indication of the overall fit of models, let alone any attempt at model validation. AIC values give no information on (absolute) model fit and one does need some indication of whether one is explaining 0.6% or 6% or 60% of the variability.
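To illustrate why AIC alone says nothing about absolute fit, the sketch below computes AIC alongside McFadden's pseudo R-squared, which is only one of several measures of explained variation and is chosen here for its simplicity. The log-likelihood values are hypothetical, and the function names are ours.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: smaller is better, relative scale only."""
    return 2 * n_params - 2 * log_lik

def mcfadden_r2(log_lik_model, log_lik_null):
    """McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null)."""
    return 1.0 - log_lik_model / log_lik_null

# Hypothetical maximized log-likelihoods (not taken from any real study).
ll_null = -200.0   # intercept-only model
ll_a = -198.0      # candidate model A, 2 parameters
ll_b = -195.0      # candidate model B, 3 parameters

aic_a = aic(ll_a, 2)
aic_b = aic(ll_b, 3)
r2_a = mcfadden_r2(ll_a, ll_null)
r2_b = mcfadden_r2(ll_b, ll_null)

print(f"AIC: A = {aic_a:.1f}, B = {aic_b:.1f}")
print(f"McFadden R2: A = {r2_a:.3f}, B = {r2_b:.3f}")
```

In this made-up case model B is preferred on AIC, yet its pseudo R-squared shows that even the 'best' model accounts for very little of the variability, which is exactly the information a reader needs and AIC cannot supply.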

Some authors still seem to have problems in interpreting odds ratios, especially if they are obtained from case-control designs. One still finds authors assuring us that the 'rare disease assumption is met' even in situations where it need not be met. If, for example, controls are sampled from the entire base population, and if the cohort is dynamic rather than fixed, then the odds ratio directly estimates the incidence rate ratio whether the disease is rare or common. What matters is whether we can assume a stable population - but there is rarely any discussion of this assumption.
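The arithmetic behind this point can be checked with expected counts. The sketch below assumes a hypothetical stable population with a deliberately common disease, constant incidence rates, and controls drawn in proportion to person-time at risk; all numbers and names are illustrative. It contrasts that design with a fixed cohort in which controls are drawn from those still disease-free at the end of follow-up.

```python
import math

def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Exposure odds ratio from a case-control 2x2 table."""
    return (exp_cases / unexp_cases) / (exp_controls / unexp_controls)

# Hypothetical stable population observed for one year.
rate_exposed, rate_unexposed = 0.30, 0.10      # cases per person-year (common disease)
time_exposed, time_unexposed = 1000.0, 3000.0  # person-years at risk
rate_ratio = rate_exposed / rate_unexposed

# Expected cases arising from the person-time at risk.
cases_exposed = rate_exposed * time_exposed
cases_unexposed = rate_unexposed * time_unexposed

# Design 1: controls sampled from the base population in proportion to person-time.
n_controls = 600.0
share_exposed = time_exposed / (time_exposed + time_unexposed)
or_density = odds_ratio(cases_exposed, cases_unexposed,
                        n_controls * share_exposed,
                        n_controls * (1.0 - share_exposed))

# Design 2: fixed cohort (1000 exposed, 3000 unexposed) with constant hazard;
# controls are those still disease-free after one year.
risk_exposed = 1.0 - math.exp(-rate_exposed)
risk_unexposed = 1.0 - math.exp(-rate_unexposed)
or_cumulative = odds_ratio(1000.0 * risk_exposed, 3000.0 * risk_unexposed,
                           1000.0 * (1.0 - risk_exposed),
                           3000.0 * (1.0 - risk_unexposed))

print(f"rate ratio = {rate_ratio:.3f}")
print(f"OR, controls from base population = {or_density:.3f}")
print(f"OR, controls from non-cases = {or_cumulative:.3f}")
```

Under the first design the odds ratio reproduces the rate ratio even though the disease is far from rare; under the second it overstates it, and only there would a rare disease assumption be doing any work.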

As in other types of multiple regression, interaction still seems to cause problems. A common approach is to only use 'simple main effects' models - in other words pretend there is no interaction between variables and analyze it accordingly. If one has sufficient replication, one should always check for interaction and if necessary include it in the model.
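For the simplest case of two binary explanatory variables, checking for interaction needs no special software: in the saturated logistic model the interaction coefficient equals the difference between the log odds ratios in the two strata. The sketch below uses hypothetical counts and a Wald-type standard error; the names are ours, and with continuous variables or sparse cells one would fit the model and use a likelihood ratio test instead.

```python
import math

def log_odds_ratio(inf_exposed, uninf_exposed, inf_unexposed, uninf_unexposed):
    """Log odds ratio for infection, exposed versus unexposed."""
    return math.log((inf_exposed / uninf_exposed) /
                    (inf_unexposed / uninf_unexposed))

# Hypothetical counts by exposure X1 within each level of a second factor X2:
# (infected exposed, uninfected exposed, infected unexposed, uninfected unexposed)
stratum0 = (20, 80, 10, 90)   # X2 = 0
stratum1 = (60, 40, 20, 80)   # X2 = 1

# Interaction coefficient of the saturated model on the logit scale.
b_interaction = log_odds_ratio(*stratum1) - log_odds_ratio(*stratum0)
ratio_of_odds_ratios = math.exp(b_interaction)

# Wald standard error: square root of the summed reciprocals of all eight cells.
se = math.sqrt(sum(1.0 / n for n in stratum0 + stratum1))
z = b_interaction / se

print(f"ratio of odds ratios = {ratio_of_odds_ratios:.3f}, z = {z:.2f}")
```

A ratio of odds ratios well away from 1 means the effect of exposure differs between the levels of the second factor, so a main-effects-only model would misrepresent the data.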

### What the statisticians say

Hilbe (2009) presents an overview of the full range of logistic models as applied in medical and social sciences. Logistic regression is also well covered by Agresti (2002) and Hosmer & Lemeshow (2000). Logistic regression is covered for ecologists in Zuur et al. (2007) and Quinn & Keough (2002). Logistic regression using R is covered by Logan (2010) and Crawley (2007), (2005). Nemes et al. (2009) warn that logistic regression overestimates odds ratios with small to moderate sample sizes, a serious problem if results from several small studies are pooled. Abreu et al. (2008) looked at the application of ordinal logistic regression models in quality of life studies. Biesheuvel et al. (2008) advocate greater use of polytomous logistic regression analysis in diagnostic research. King (2003) looks at alternatives to stepwise methods for running logistic regression models. Mittlböck & Schemper (2002), (1996) review measures of explained variation for logistic regression.

Steyerberg et al. (1999) highlight the danger of bias in stepwise selection in small data sets in logistic regression analysis. Bender & Grouven (1998) discuss using binary logistic regression models for ordinal data with non-proportional odds. Bender & Grouven (1996) review the poor presentation of logistic regression models in the medical research literature. Begg & Lagakos (1990) report on the consequences of model misspecification in logistic regression.

Nagy et al. (2010) consider tree-based methods as an alternative to logistic regression in revealing risk factors of crib-biting in horses. Rutherford et al. (2007) demonstrate the use of ordinal and multinomial logistic regression for modelling complex land cover changes. Boyce (2006) and Pearce & Boyce (2006) consider use of the case-control design to evaluate resource selection functions. Keating & Cherry (2004) discuss the use of logistic regression to analyze wildlife studies employing case-control sampling designs. Trexler & Travis (1993) provide an earlier review of the use of logistic regression for ecologists in their overview of 'non-traditional' regression analysis.

Wikipedia provides a (not very good) section on logistic regression, as well as short sections on polytomous logistic regression, the ordered logit model and the Hosmer-Lemeshow test. Laura Thompson (2009) provides a detailed R (and S-PLUS) manual to accompany Agresti's book on categorical data analysis which provides extensive coverage of logistic regression. Alan Agresti provides the data (and SAS code) for examples in his book. Other resources for doing logistic regression in R include Christopher Manning, Brian Everitt & Torsten Hothorn and Rossiter & Loza. Newsom (2010a) (2010b) gives a clear and concise introduction to logistic regression. See also Pia Veldt Larsen. UCLA Academic Technology Services provide a useful FAQ on 'What are pseudo R-squareds?' Gerard E. Dallal has sections on logistic regression and Poisson regression. 