Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important"
Simple linear regression: Use & misuse
(linearity, independence of errors, bias, regression to the mean, errors in variables)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Simple linear regression provides a means to model a straight line relationship between two variables. In classical or asymmetric regression, one variable (Y) is called the response or dependent variable, and the other (X) is called the explanatory or independent variable. Values of the X variable are assumed to be fixed by the experimenter. Errors on the response variable are assumed to be independent, identically distributed and normally distributed. The model is still valid if X is a sample of available values (as is often the case in ecological work), but only if X is measured with minimal error. If there is substantial measurement error on X, and the values of the estimated parameters are of interest, then errors-in-variables regression should be used.
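As a minimal sketch of the classical model described above, the following fits a straight line by ordinary least squares to simulated data. The helper name `fit_line`, the parameter values (intercept 2.0, slope 0.5, error standard deviation 1.0) and the simulated data are illustrative assumptions, not taken from any study discussed here.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    # Slope = sum of cross-products / sum of squares of X
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    return slope, intercept, residuals

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)                            # X values fixed by the experimenter
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)      # independent normal errors on Y only

slope, intercept, residuals = fit_line(x, y)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The residuals are returned because they, rather than the raw Y values, are what one would examine when checking linearity, independence and homogeneity of variance.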
We start though by considering what is probably the most fundamental assumption if one is fitting a straight line relationship - linearity. Fortunately most (but not all) of the examples we found met this basic requirement. Sometimes a transformation was used to linearize the relationship - this is fine providing the resulting model makes sense. Another approach to non-linearity was to use piecewise linear regression, where different linear relationships are fitted on either side of a breakpoint - but in neither of the cases where this was done are we convinced that it was a sensible approach.
More widespread problems arose when we considered independence of errors. Some researchers now take this issue more seriously by testing for autocorrelation, but we found examples where there may be spatial autocorrelation (clustering of villages) or temporal autocorrelation (measurements in a time series). There is another possible cause of non-independence which is seldom considered - bias in the selection of sites in spatial studies. We found an example where convenience selection of sites leads one to question the outcome of a study purporting to show that anthropogenically modified habitats are 'good' for biodiversity.
Another important assumption (at least for parametric testing) is homogeneity of variances over the range of X values, and again most of our examples met this assumption. But in one case - again the study on species richness - the variance of the response variable was strongly dependent on the explanatory variable.
Lastly there is the issue of measurement error on the explanatory variable. Whilst in some studies this was not an issue, in other cases the attenuation of the slope (its bias towards zero) may have led to erroneous inference.
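The attenuation caused by measurement error on X can be illustrated with a small simulation. This is only a sketch under assumed conditions: the true slope (2.0), the standard deviations and the helper `ols_slope` are invented for illustration. Under the classical measurement-error model the fitted slope is expected to shrink by roughly var(true X) / (var(true X) + var(error on X)), which is 0.5 with the values assumed here.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
true_slope = 2.0                                   # assumed value for illustration

x_true = rng.normal(0.0, 1.0, n)                   # error-free explanatory variable
y = 1.0 + true_slope * x_true + rng.normal(0.0, 0.5, n)
x_observed = x_true + rng.normal(0.0, 1.0, n)      # X recorded with measurement error

def ols_slope(x, y):
    """Least-squares slope of y on x (hypothetical helper)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

slope_clean = ols_slope(x_true, y)                 # regression on the true X
slope_noisy = ols_slope(x_observed, y)             # regression on the error-prone X

# Expected shrinkage factor: var(x_true) / (var(x_true) + var(measurement error))
expected_factor = 1.0 / (1.0 + 1.0)
print(slope_clean, slope_noisy, true_slope * expected_factor)
```

The slope fitted to the error-prone X should sit well below the slope fitted to the true X, which is why a non-significant or shallow slope cannot be taken at face value when X is poorly measured.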
Sometimes there is clearly an overwhelming desire to either prove or disprove the null hypothesis - at the expense of rationality. In a study looking at the relationship between bacterial resistance and general practice prescribing of antibiotics, there was clearly far more variability within countries than between countries, yet authors and commentators ignored this to focus on a relationship that 'should' be there (but unfortunately was not). In another study (on the relationship between small mammal trap catches and traffic volume) the authors appeared far too ready to accept the null hypothesis of no association, despite there being a high level of measurement error which would have greatly reduced the power of the test.
Lastly we found examples of two specific problems in linear regression. The first is that of spurious relationships caused by regression to the mean. If one plots population change against population size for a time series, the Y and X variables are sharing a large measurement error term. This makes such regressions prone to spuriously detecting density dependence. Despite being widely discussed in the literature, this problem was not addressed in the study on possible density dependence in Ethiopian wolf populations. The second issue is the use of simple linear regression for comparison of methods. We are not just interested in whether measurements are correlated - we are interested in whether they are the same. Moreover both variables are subject to error. Hence the Bland-Altman approach or errors-in-variables regression would have been much more appropriate - and more informative.
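The spurious density dependence described above can be sketched by simulation. Everything here is an assumption made for illustration (it is not a reanalysis of the Ethiopian wolf data): the true log population follows a random walk with no density dependence, counts are observed with independent measurement error, and the function names, series length and standard deviations are invented. Because the observed population size and the observed change share the same measurement error term with opposite signs, the slope of change on size is pushed in the negative direction.

```python
import numpy as np

rng = np.random.default_rng(7)

def change_vs_size_slope(log_n):
    """Slope of the regression of population change on population size (log scale)."""
    x = log_n[:-1]                 # size in year t
    y = np.diff(log_n)             # change from year t to t + 1
    x_dev = x - x.mean()
    return np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)

def simulate_slopes(n_series=500, n_years=30, process_sd=0.1, obs_sd=0.3):
    """Mean slope for error-free and error-prone versions of the same series."""
    clean, noisy = [], []
    for _ in range(n_series):
        # Density-independent random walk in true log abundance
        true_log_n = 5.0 + np.cumsum(rng.normal(0.0, process_sd, n_years))
        # Observed log abundance = truth + independent measurement error
        observed = true_log_n + rng.normal(0.0, obs_sd, n_years)
        clean.append(change_vs_size_slope(true_log_n))
        noisy.append(change_vs_size_slope(observed))
    return np.mean(clean), np.mean(noisy)

mean_clean, mean_noisy = simulate_slopes()
print(f"mean slope without measurement error: {mean_clean:.2f}")
print(f"mean slope with measurement error:    {mean_noisy:.2f}")
```

Comparing the two means shows how much of an apparently negative relationship can arise from measurement error alone, even though no density dependence was built into the simulated populations.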
What the statisticians say
Sokal & Rohlf (1995), Zar (1999), Armitage & Berry (2002) and Snedecor & Cochran (1989) all cover simple linear regression in greater or lesser detail.
Pearson et al. (1897) first identified the problem of spurious correlations where variables shared a common term. This view was reinforced by Atchley & Anderson (1978) but Prairie & Bird (1989) reignited the issue by asserting that such correlations were not necessarily invalid. This was forcefully rebutted by Jackson & Somers (1991), and Keeney (1991), and (rather less forcefully) by Berges (1997). Brett (2004) later revisited the topic.
Kuo (2002) looks at the prevalence of unjustified extrapolation in scatterplots in the recent medical literature. Freckleton et al. (2006) review the situation for detection of density dependence from time series data. Gill (1987) looks at the biases in regression when prediction is inverse to causation.
Wikipedia provides sections on linear regression, simple linear regression, least squares estimation of linear regression coefficients, coefficient of determination, nonlinear regression and autocorrelation. Julian Faraway is essential reading for carrying out regression using R. Gerard Dallal gives a useful review of techniques for regression diagnostics.