Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Multiple linear regression: Use & misuse

(choice of explanatory variables, diagnostics, multicollinearity, variable selection, model simplification, model validation)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

Until recently, any review of the literature on multiple linear regression would have focused on inadequate checking of diagnostics, because for years linear regression was applied to data that were really not suitable for it. The advent of generalized linear modelling has reduced such inappropriate use, although we do give an example where Poisson regression would have been much more appropriate than the combined use of logistic regression for presence/absence data and linear regression for number of cases.

A key issue seldom considered in depth is the choice of explanatory variables. In the articles we reviewed we found several examples of fairly silly proxy variables - for example using habitat variables to 'describe' badger densities. Sometimes, if the data do not exist, it might be better to actually gather some - in the badger case the number of road kills would have been a much better measure. In a study on factors affecting unfriendliness/aggression in pet dogs, the fact that the chosen explanatory variables explained a mere 7% of the variability should have prompted the authors to consider other variables, such as the behavioural characteristics of the owners.

As regards assumptions, multicollinearity between explanatory variables should always be checked using variance inflation factors and/or correlation matrix plots. Whilst it may not be a problem if one is (genuinely) only interested in a predictive equation, it is crucial if one is trying to understand mechanisms. Independence of observations is another very important assumption. Whilst it is true that non-independence can now be modelled using (for example) a random factor in a mixed effects model, it still cannot be ignored. We give a good example of this, where an analysis of the relationship between lung surfactant protein and swimming pool use treated all 226 observations as independent when in fact they were clustered by school.
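
For readers who have not met variance inflation factors, the sketch below shows one way they can be computed. It relies on the standard result that the VIF of predictor j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The data are made up purely for illustration (x2 is built as a near-copy of x1, x3 is unrelated), the function names are our own, and only the Python standard library is used; in practice one would use the routine supplied by a statistics package.

```python
import math
import random

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centred = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear predictors: VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Made-up illustration data (an assumption, not from any study reviewed here):
# x2 is nearly a copy of x1, x3 is unrelated to both.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(200)]

vifs = variance_inflation_factors([x1, x2, x3])
for name, vif in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

The two near-duplicate predictors should show very large factors while the unrelated one stays close to 1. Commonly quoted cut-offs (5 or 10) are rules of thumb, not tests.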

But perhaps the most important issue to consider is that of variable selection and model simplification. Despite the fact that automated stepwise procedures for fitting multiple regression were discredited years ago, they are still widely used and continue to produce overfitted models containing various spurious variables. As with collinearity, this is less important if one is only interested in a predictive model - but even when researchers say they are only interested in prediction, we find they are usually just as interested in the relative importance of the different explanatory variables. We give several examples where some of the explanatory variables selected were contradictory or just very unconvincing.
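
Why automated selection picks up spurious variables can be illustrated with a small simulation. The sketch below is not a full stepwise procedure: it mimics only the first step of forward selection, in which the candidate most strongly correlated with the response enters if it passes a p < 0.05 test. All the settings (50 observations, 20 candidate predictors, 500 simulated studies) are arbitrary assumptions, and every candidate is pure noise, so any variable that enters is spurious by construction. The critical correlation of roughly 0.2787 is the value of |r| corresponding to two-sided p = 0.05 with 48 degrees of freedom.

```python
import math
import random

N_OBS = 50            # observations per simulated study (assumed)
N_CANDIDATES = 20     # candidate predictors, all pure noise (assumed)
N_STUDIES = 500       # number of simulated studies (assumed)
R_CRITICAL = 0.2787   # approx. |r| giving two-sided p = 0.05 with 48 d.f.

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def noise_variable_enters(rng):
    """True if the best of the noise predictors passes the p < 0.05 entry test."""
    y = [rng.gauss(0, 1) for _ in range(N_OBS)]
    best = max(abs(pearson_r([rng.gauss(0, 1) for _ in range(N_OBS)], y))
               for _ in range(N_CANDIDATES))
    return best > R_CRITICAL

rng = random.Random(42)
hits = sum(noise_variable_enters(rng) for _ in range(N_STUDIES))
observed = hits / N_STUDIES
expected = 1 - 0.95 ** N_CANDIDATES   # about 0.64 if the 20 tests are independent

print(f"studies where a noise variable entered: {observed:.2f}")
print(f"expected under independence:           {expected:.2f}")
```

In other words, with twenty irrelevant candidates, roughly two studies in three would 'find' at least one predictor at the first step alone, before any later steps add their own opportunities for chance findings.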

We should be able to end this section with a review of the issues arising from model validation procedures - but unfortunately such validation is very rarely done and virtually never on an entirely new set of data. One should commend the few authors we found who did use some form of model validation procedure.
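
Where no new data set is available, cross-validation is one (weaker) substitute. The sketch below, which uses only the Python standard library and made-up data, fits a ten-predictor regression to 40 observations in which only the first predictor has any real effect, and then compares the usual in-sample R-squared with an R-squared computed from five-fold held-out predictions. The function names, sample size, effect size and number of folds are all illustrative assumptions; the held-out R-squared is measured against the overall mean of the response, and it can be negative.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (aug[i][k] - sum(aug[i][j] * x[j] for j in range(i + 1, k))) / aug[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, rows):
    """Fitted values for each row of predictors."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]

def r_squared(y, fitted):
    """1 - SS_residual / SS_total, with SS_total taken about the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(rows, y, folds=5, seed=0):
    """R-squared from predictions made only on observations left out of the fit."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    held_out = [None] * len(y)
    for f in range(folds):
        test = idx[f::folds]
        in_test = set(test)
        train = [i for i in idx if i not in in_test]
        coefs = fit_ols([rows[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, predict(coefs, [rows[i] for i in test])):
            held_out[i] = p
    return r_squared(y, held_out)

# Made-up data: 40 observations, 10 predictors, only the first has a real effect.
rng = random.Random(7)
rows = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(40)]
y = [0.5 * r[0] + rng.gauss(0, 1) for r in rows]

in_sample = r_squared(y, predict(fit_ols(rows, y), rows))
cv = cross_validated_r2(rows, y)
print(f"in-sample R-squared:       {in_sample:.2f}")
print(f"cross-validated R-squared: {cv:.2f}")
```

The gap between the two figures is a rough measure of how much of the apparent fit is overfitting. Note that cross-validation re-uses the same sample, so it cannot reveal problems (such as a biased sampling scheme) that only an entirely new data set would expose.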


What the statisticians say

Kirkwood & Sterne (2003) cover multiple regression and diagnostics for medical researchers in Chapters 11 and 12. Armitage & Berry (2002) cover regression models in Chapters 11 and 12. Bland (2000) introduces multiple regression in Chapter 18. Logan (2010) and Crawley (2007), (2005) both cover multiple regression for ecologists using R. Quinn & Keough (2002) also give extensive coverage of multiple linear regression. Rousseeuw & Leroy (2003) give a review of high-breakdown methods. Burnham & Anderson (2002) is an important text on model selection and multimodel inference. Draper & Smith (1998) provide a great introduction to the fundamentals of regression analysis.

Slinker & Glantz (2008) write a useful tutorial on multiple regression for medical researchers. Royston et al. (2006) point out that dichotomizing continuous predictors in multiple regression is a bad idea. Austin & Tu (2004) note that automated variable selection methods result in models that are unstable and not reproducible. Harrell et al. (1996) review multivariate prognostic models in medical research. Richards (2005) looks at the use of the Akaike Information Criterion (AIC) for testing ecological theory. MacNally (2000), (2002) looks at regression and model-building in conservation biology, and attempts to reconcile 'predictive' and 'explanatory' models.

Grömping (2006), (2007) gives an update on measures of relative importance in multiple regression, along with an account of the package available in R. Johnson & Lebreton (2004) review the history and use of relative importance indices in multiple regression. Bring et al. (1994) discuss how regression coefficients should be standardized. Various measures of relative importance which involve averaging sequential sums of squares over all orderings of regressors are given by Lindeman et al. (1980) and Chevan & Sutherland (1991). Beyene et al. (2009) look at recent developments in methods to develop and validate regression models. Whittingham et al. (2006) ask why (on earth) we still use stepwise modelling in ecology and behaviour. Johnson & Omland (2004) review model selection in ecology and evolution. Breiman (1992) notes that (still current) practices in model selection have long been a quiet scandal in the statistical community. Henderson & Velleman (1981) stress the dangers of automated model fitting.

Wikipedia provides sections on regression analysis, stepwise regression, Akaike information criterion, Mallows' Cp, and cross-validation. Julian Faraway provides a comprehensive guide to multiple linear regression in R including resistant and robust regression. Robert I. Kabacoff gives a useful Quick-R guide to multiple linear regression. Peter Filzmoser (2008) gives an excellent tutorial on basic and advanced regression methods. Brian Ripley provides a useful account of robust regression. Gerard E. Dallal has extensive coverage of multiple linear regression under a number of headings in his "Little Handbook of Statistical Practice". Friendly & Kwan (2009) give a useful guide to visualizing collinearity diagnostics. Ishiguro et al. (1997) look at bootstrapping log likelihood and EIC, an extension of AIC. Rousseeuw (1991) gives a tutorial on robust statistics.