"It has long been an axiom of mine that the little things are infinitely the most important"
Pearson's correlation coefficient: Use & misuse
(scatterplot, bivariate normality, homogeneity of variances, linearity, causality, association versus agreement)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Pearson's correlation coefficient is very widely used in all disciplines. It is commonly presented along with a scatterplot of the data - which at least allows some assessment of the validity of the analysis. If a parametric test of the correlation coefficient is being used, assumptions of bivariate normality and homogeneity of variances must be met. We give several examples where these assumptions are clearly not met - for example variables such as mortality rates and body condition scores are very unlikely to resemble a normal distribution. Moreover, body condition score is an ordinal rather than a measurement variable - so on that ground too a non-parametric correlation coefficient would have been more appropriate.
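As a rough sketch of what the non-parametric alternative involves: Spearman's rank coefficient is simply Pearson's coefficient calculated on the ranks of the data, with tied values sharing their mean rank. The function names and the data below are invented for illustration (they are not taken from any of the studies discussed), and only the Python standard library is assumed.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: an ordinal body condition score (1 to 5)
# against a strongly skewed measurement variable.
condition_score = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
parasite_load = [950, 400, 520, 210, 180, 260, 90, 75, 20, 12]

print("Pearson r:   ", round(pearson_r(condition_score, parasite_load), 3))
print("Spearman rho:", round(spearman_rho(condition_score, parasite_load), 3))
```

Because only the ordering of the observations enters the calculation, the rank coefficient makes no use of the (arbitrary) spacing between ordinal scores and is not dominated by a few extreme values.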
Perhaps more important is whether the relationship really is linear. We give several examples where the relationship is instead curvilinear - for example the relationship of the percentage of late stage mature female fish to rainfall, and the relationship of the number of eagles to the number of fish. In another example an 'outlier' is omitted from the analysis in order to make the relationship linear. Non-linear but monotonic relationships should instead be analyzed using a non-parametric correlation coefficient.
Another very important issue is whether the bivariate observations really are independent. Ecologists and epidemiologists commonly use the correlation coefficient to assess spatial or temporal relationships, and in such studies observations may be either spatially or temporally autocorrelated. We look at several examples of this, including a study relating solar radiation in a state to the incidence of colon cancer, a study relating abundance of the small blue butterfly in a habitat to abundance of its foodplant, and a study relating reproductive traits of fish over time to environmental characters. In all these cases (and several others) the coefficient is likely to be biased towards unity, giving a spurious correlation.
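The effect of autocorrelation is easy to demonstrate by simulation. The sketch below is only an illustration of one mechanism (serial dependence in time), not a reanalysis of any study mentioned here: it generates pairs of series that are unrelated by construction, either as white noise or as random walks, and compares the typical size of the coefficient. The function names, series length, number of replicates and seed are all arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def white_noise(n, rng):
    """n independent standard normal values (no autocorrelation)."""
    return [rng.gauss(0, 1) for _ in range(n)]

def random_walk(n, rng):
    """Cumulative sum of white noise: each value depends on the one before."""
    total, walk = 0.0, []
    for step in white_noise(n, rng):
        total += step
        walk.append(total)
    return walk

def mean_abs_r(make_series, n=50, replicates=500, seed=1):
    """Average |r| between pairs of series generated independently of each other."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += abs(pearson_r(make_series(n, rng), make_series(n, rng)))
    return total / replicates

noise_r = mean_abs_r(white_noise)
walk_r = mean_abs_r(random_walk)
print(f"mean |r|, unrelated white-noise pairs: {noise_r:.2f}")
print(f"mean |r|, unrelated random-walk pairs: {walk_r:.2f}")
```

Every pair is unrelated, yet the serially dependent pairs produce much larger coefficients on average, because fifty autocorrelated values carry far less information than fifty independent ones.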
Although most authors freely admit that their observed correlation cannot prove causality, one still gets the feeling that many feel it damn well should prove it, and that it is only the cussedness of statisticians that prevents them from claiming causality. Yet (ignoring random variation) many apparent relationships could easily result from confounding factors - for example in the study relating the incidence of inflammatory bowel disease to a proxy variable for poverty. The apparent inverse relationship could just result from people in wealthier countries being more informed about the disease and being more ready to report symptoms, especially for a disease which does not automatically hospitalize or kill. Remarkably few authors rigorously apply the generally accepted criteria for causality to the matter at hand. In one example we even see evidence of bias in selection of observations in an attempt to demonstrate a relationship which may simply not exist - the relationship between homicide rates and suicide rates is wildly unconvincing to all but the author.
Lastly there are occasions when one wishes to test for agreement between two variables rather than just association. Correlation is not an appropriate way to assess agreement between two measures, yet it is still widely used with this in mind. A scatterplot with a line of equality is a much better first step in such a situation, followed by a Bland-Altman plot.
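To see why association is not agreement, consider a minimal sketch with made-up readings from two hypothetical measurement methods, where the second always reads about 10% high plus a fixed offset. The quantities computed are those behind a Bland-Altman plot: the mean difference (bias) and the 95% limits of agreement, taken as the bias plus or minus 1.96 standard deviations of the differences. The function names are ours.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical readings: method B is a perfect linear function of method A,
# but it does not agree with it.
method_a = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0]
method_b = [1.1 * v + 2.0 for v in method_a]

r = pearson_r(method_a, method_b)
bias, lower, upper = bland_altman(method_a, method_b)
print(f"r = {r:.3f}")
print(f"bias (A - B) = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

The correlation is perfect because the two methods are linearly related, yet method B is consistently several units higher. Only the differences reveal that.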
What the statisticians say
Sokal & Rohlf (1995), Zar (1999), and Snedecor & Cochran (1989) all provide extensive coverage of Pearson's correlation coefficient for biologists. Armitage & Berry (2002), Woodward (1999) and Thrusfield (2005) provide similar material for medical and veterinary researchers. Chalmer & Whitmore (1986) provide a reasonably detailed account of how to test correlation coefficients and how to attach a confidence interval to the coefficient.
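For orientation, one widely used way to attach a confidence interval is via the Fisher transformation: transform r to z = atanh(r), treat z as approximately normal with standard error 1/sqrt(n - 3), and back-transform the limits. The sketch below assumes bivariate normality and independent observations. It is not necessarily the procedure given in any of the texts above, and the function name and its `level` parameter are our own.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # Fisher transformation
    se = 1 / math.sqrt(n - 3)         # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = fisher_ci(0.6, 30)
print(f"r = 0.60, n = 30: 95% CI = ({low:.2f}, {high:.2f})")
```

The interval is asymmetric about r (except when r is zero) and narrows as n increases, which is a useful reminder of how imprecise a coefficient from a few dozen observations can be.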
Rodgers & Nicewander (1988) provide thirteen ways to look at the correlation coefficient. Kraemer (2006) provides a useful review of measures of effect size indicating strength of correlation. Learner & Goodman (1996) look at qualitative expressions used to describe the strength of a correlation. Bland & Altman (1996) point out why the correlation coefficient should not be used to assess measurement error, nor in method comparison studies. See also Altman & Bland (1983) and Bland & Altman (1986).
Adolph & Hardin (2007) explain how to correct for attenuation of the correlation coefficient caused by measurement error. Mudelsee (2003) describes how to calculate bootstrap confidence intervals when using the correlation coefficient for serially dependent time series.
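The idea behind correcting for attenuation can be sketched with the classical (Spearman) formula, in which the observed coefficient is divided by the square root of the product of the two variables' reliabilities (repeatabilities). This is offered as a general illustration under the assumption that the reliabilities are known. Adolph & Hardin (2007) should be consulted for their own estimators, and the function and parameter names below are ours.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction of a correlation for measurement error."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical values: observed r of 0.40, with repeatabilities of 0.8 and 0.5
corrected = disattenuate(0.40, 0.8, 0.5)
print(f"corrected r = {corrected:.3f}")
```

Because the reliabilities are themselves estimates, a corrected coefficient can exceed 1 through sampling error, so it needs to be interpreted with care.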
Wikipedia provides sections on correlation, the Pearson product-moment correlation coefficient, the Fisher transformation, and correlation does not imply causation. The Handbook of biological statistics covers correlation and regression together, arguing that they are often two parts of the same analysis. David C. Howell provides a useful account of randomization (permutation) tests for the correlation coefficient. R.J. Rummel looks at correlation from the viewpoint of a political scientist. Mark C. Chu-Carroll also looks at the correlation-causation issue.
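For readers unfamiliar with randomization tests, the general idea can be sketched in a few lines: shuffle one variable many times, recompute the coefficient each time, and see how often the shuffled value is at least as extreme as the observed one. This is our own minimal illustration rather than a transcription of any of the accounts above. The data, the number of permutations and the seed are arbitrary, and the test still assumes that observations are exchangeable (so it does not rescue autocorrelated data).

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation P-value for Pearson's r."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    # Count the observed arrangement as one of the possible permutations
    return (extreme + 1) / (n_perm + 1)

# Hypothetical data with a strong linear relationship
x = list(range(1, 13))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3, 13.8, 16.1, 18.2, 19.7, 22.0, 24.1]
p = permutation_p_value(x, y)
print(f"permutation P-value = {p:.4f}")
```

No distributional assumption is needed for the P-value itself, which is why the approach is attractive when bivariate normality is doubtful.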