Biology, images, analysis, design...

"It has long been an axiom of mine that the little things are infinitely the most important"

Pearson's correlation coefficient: Use & misuse
(scatterplot, bivariate normality, homogeneity of variances, linearity, causality, association versus agreement)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do

Use and Misuse
Pearson's correlation coefficient is very widely used in all disciplines. It is commonly presented along with a scatterplot of the data, which at least allows some assessment of the validity of the analysis. If a parametric test of the correlation coefficient is being used, the assumptions of bivariate normality and homogeneity of variances should be met.

Perhaps more important is whether the relationship really is linear. We give several examples where the relationship is instead curvilinear - for example the relationship of the percentage of late stage mature female fish to rainfall, and the relationship of the number of eagles to the number of fish. In another example an 'outlier' is omitted from the analysis in order to make the relationship linear. Non-linear but monotonic relationships should instead be analyzed using a non-parametric correlation coefficient.

Another very important issue is whether the bivariate observations really are independent. Ecologists and epidemiologists commonly use the correlation coefficient to assess spatial or temporal relationships, and in such studies observations may be either spatially or temporally autocorrelated. We look at several examples of this, including a study relating solar radiation in a state to the incidence of colon cancer, a study relating abundance of the small blue butterfly in a habitat to abundance of its foodplant, and a study relating reproductive traits of fish over time to environmental characters. In all these cases (and several others) the coefficient is likely to be biased towards unity, giving a spurious correlation.

Although most authors freely admit that their observed correlation cannot prove causality, one still gets the feeling that many feel it damn well should prove it, and that it is only the cussedness of statisticians that prevents them from claiming causality.
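As an aside on the linearity point above, the following minimal sketch compares Pearson's coefficient with a rank-based (Spearman) coefficient on a relationship that is perfectly monotonic but strongly curved. The data are invented purely for illustration (an exponential curve, not any of the fish or eagle datasets discussed here), and the helper names `pearson_r`, `average_ranks` and `spearman_rho` are our own.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank the values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x throughout, but along a curve.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print("Pearson r    =", round(pearson_r(x, y), 3))
print("Spearman rho =", round(spearman_rho(x, y), 3))
```

Because the ranks of `x` and `y` are identical, the rank coefficient is 1, whereas Pearson's coefficient falls well short of 1 because a straight line describes these points poorly. A scatterplot would still be the first check in practice.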
Yet (ignoring random variation) many apparent relationships could easily result from confounding factors - for example in the study relating the incidence of inflammatory bowel disease to a proxy variable for poverty. The apparent inverse relationship could just result from people in wealthier countries being more informed about the disease and being more ready to report symptoms, especially for a disease which does not automatically hospitalize or kill. Remarkably few authors rigorously apply the generally accepted criteria for causality to the matter at hand. In one example we even see evidence of bias in selection of observations in an attempt to demonstrate a relationship which may simply not exist - the relationship between homicide rates and suicide rates is wildly unconvincing to all but the author.

Lastly, there are occasions when one wishes to test for agreement between two variables rather than just association. Correlation is not an appropriate way to assess agreement between two measures, yet it is still widely used with this in mind. A scatterplot with a line of equality is a much better first step in such a situation, followed by a Bland-Altman plot.

What the statisticians say

Sokal & Rohlf (1995), Rodgers & Nicewander (1988), Adolph & Hardin (2007). Wikipedia provides sections on correlation,
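To illustrate the distinction drawn above between association and agreement, here is a minimal sketch of the quantities behind a Bland-Altman plot: the mean difference (bias) between two methods and approximate 95% limits of agreement, taken as bias ± 1.96 standard deviations of the differences. The two sets of measurements are hypothetical, constructed so that the methods are perfectly correlated yet clearly disagree; the function names are our own.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bland_altman(a, b):
    """Return per-pair means and differences, the bias, and the
    approximate 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [p - q for p, q in zip(a, b)]
    means = [(p + q) / 2 for p, q in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Hypothetical measurements: method B always reads 2 * A + 1, so the
# two methods are perfectly associated but far from the line of equality.
method_a = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0]
method_b = [2 * v + 1 for v in method_a]

r = pearson_r(method_a, method_b)
means, diffs, bias, (lower, upper) = bland_altman(method_a, method_b)

print("Pearson r =", round(r, 3))
print("Bias (A - B) =", round(bias, 2))
print("Limits of agreement =", round(lower, 2), "to", round(upper, 2))
```

Plotting `diffs` against `means`, with horizontal lines at the bias and the two limits, gives the Bland-Altman plot itself. In this invented example the correlation is perfect even though method B reads far higher than method A, and the difference grows with the size of the measurement, which is the kind of disagreement a correlation coefficient cannot reveal.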