Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important"
Measures of relationship between variables: Use and misuse
(Risk ratio, odds ratio, rate ratio, scatterplots, correlation, regression)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
We first consider the use of summary measures to describe association between nominal variables. Odds ratios and risk ratios are very heavily used in medical and veterinary applications, although less so by other applied biologists. Statisticians differ in their attitudes to odds ratios: some consider them the best thing since sliced bread, others think they should only be used if there is absolutely no alternative. For case-control studies there is no alternative to their use, but for prevalence and cohort studies the risk ratio is preferable.
We give several examples where odds ratios are used unnecessarily, leading to overestimation of the effect size when they are read as risk ratios. There is also too great a readiness to collapse measurement variables into binary variables so that the association can be expressed as a simple ratio. This inevitably loses information and is seldom justified. Note that assessing whether associations are 'significant' or not will come later in the course.
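To make the overestimation concrete, here is a minimal sketch comparing the two ratios on the same 2x2 table. The function names and all counts are invented for illustration; they do not come from any of the studies discussed.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a = exposed with outcome,   b = exposed without outcome
    c = unexposed with outcome, d = unexposed without outcome
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds ratio (cross-product ratio) from the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical counts, invented for illustration only.
common_outcome = (60, 40, 30, 70)   # risks 0.60 vs 0.30
rare_outcome = (6, 994, 3, 997)     # risks 0.006 vs 0.003

for label, table in [("common", common_outcome), ("rare", rare_outcome)]:
    print(label, round(risk_ratio(*table), 3), round(odds_ratio(*table), 3))
```

In both tables the risk in the exposed group is twice that in the unexposed group. When the outcome is rare the odds ratio is close to 2 as well, but when the outcome is common the odds ratio is 3.5, which would overstate the effect if it were interpreted as a risk ratio.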
Scatterplots are widely used to display the association between measurement variables, with further analysis done using regression or correlation. These analyses make certain assumptions about the distribution of each variable, and also assume that the relationship between them is truly linear. We give several examples where these assumptions are not met, and where the analyses have been applied to non-linear relationships. Examples of 'influential' points in scatterplots abound in the literature, arguably more so than outliers, and we have several examples of this. Other common problems are extending regression lines beyond the limits of the observations, and giving only the regression line without the data points. Correlation and regression are sometimes wrongly used to assess agreement between variables rather than association, although the example we give does this correctly using a line of equality.
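The effect of a single influential point can be sketched with a few lines of code. The data below are hypothetical, and `fit` is a helper written for this illustration rather than a library function.

```python
from math import sqrt


def fit(xs, ys):
    """Return (Pearson r, least-squares slope, intercept) for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, my - slope * mx


# Hypothetical data: a cluster with little or no linear trend...
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 3, 1, 3, 1, 2]
r_without, slope_without, _ = fit(xs, ys)

# ...plus one observation far from the rest in both x and y.
r_with, slope_with, _ = fit(xs + [20], ys + [12])

print("without extra point: r =", round(r_without, 2), "slope =", round(slope_without, 2))
print("with extra point:    r =", round(r_with, 2), "slope =", round(slope_with, 2))
```

With these invented numbers the cluster alone gives a weak negative correlation, whereas adding the single distant point produces a strong positive one. A regression line reported without the data points would hide this completely.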
Irrespective of the measure of association, the commonest misuse in analysing relationships is to assume that a close association between two variables proves that changes in one variable cause changes in the other. Unfortunately association alone can never prove causation, because there are many ways in which a spurious association can arise, including chance alone, even under simple random selection. For each example we look at whether sufficient consideration has been given to the possibility of bias (often caused by non-random sampling) or the presence of confounding variables. The converse error is to assume there is no relationship just because the association is not significant: there may really be no relationship, but the sample size may equally well be too small to detect one. We also give examples where it is not clear whether X is causing Y, or vice versa.
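As an illustration of how a confounding variable can create a spurious association, the sketch below compares the crude risk ratio with the risk ratios within each level of a confounder. All counts, group labels and the helper `rr_from_counts` are hypothetical.

```python
def rr_from_counts(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk ratio from case counts and group totals."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# Hypothetical counts per stratum of a confounder:
# (exposed cases, exposed total, unexposed cases, unexposed total)
strata = {
    "low-risk group": (2, 20, 8, 80),     # risk 0.10 in both exposure groups
    "high-risk group": (40, 80, 10, 20),  # risk 0.50 in both exposure groups
}

# Pooling the strata ignores the confounder.
crude = tuple(sum(column) for column in zip(*strata.values()))

for name, counts in strata.items():
    print(name, "risk ratio =", round(rr_from_counts(*counts), 2))
print("crude risk ratio =", round(rr_from_counts(*crude), 2))
```

Within each group exposure makes no difference to risk, yet the pooled table suggests that exposure more than doubles it. This happens simply because, in this invented data set, most exposed individuals belong to the high-risk group.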
What the statisticians say
Woodward (1999) introduces risk ratios and odds ratios for medical epidemiologists in Chapter 3, whilst Thrusfield (2005) does the same for veterinary epidemiologists in Chapter 15. Sokal & Rohlf (1995) and Zar (1999) introduce regression and correlation in Chapters 14 and 17 respectively. Jacoby (1997) covers the display of bivariate data in the second part of his book, whilst Griffiths et al. (1998) give an excellent account of exploratory data analysis of bivariate relationships using scatterplots, including use of the median trace.
Sistrom & Garvan (2004) provide a useful introduction to proportions, odds, and risk. Spitalnic (2006) and Bland & Altman (2000) are two of many available introductions to the odds ratio. Davies et al. (1998) note that odds ratios may mislead if they are interpreted as though they were relative risks. Schwartz et al. (1999) highlight a case of such misinterpretation in a report by Schulman et al. (1999). Flegal et al. (2006) and Rockhill (1998) describe the use and misuse of the population attributable risk proportion. Lee (1994) argues that the odds ratio should only be used for a case-control study, and is not suitable for a cross-sectional study.
Kuo (2002) looks at the prevalence of unjustified extrapolation in scatterplots in the recent medical literature, whilst Learner & Goodman (1996) provide an interesting note on the qualitative expressions used to describe the strength of a correlation - although they do seem to fall into the same trap that they are warning others about... Bland & Altman (1986) look at statistical methods for assessing agreement between two methods of measurement. They point out that the correlation coefficient is not appropriate for this and suggest that differences should be plotted against their mean.
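The difference between association and agreement can be sketched as follows. The paired measurements are hypothetical, and the limits of agreement (mean difference plus or minus 1.96 standard deviations) are a commonly used convention that we assume here; they are not described in the text above.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired measurements of the same items by two methods.
method_a = [10, 20, 30, 40, 50]
method_b = [14, 26, 35, 47, 58]


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Quantities for a difference-against-mean plot.
diffs = [b - a for a, b in zip(method_a, method_b)]        # y-axis values
means = [(a + b) / 2 for a, b in zip(method_a, method_b)]  # x-axis values

bias = mean(diffs)   # average disagreement between the methods
sd = stdev(diffs)    # sample standard deviation of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd  # assumed convention

print("correlation:", round(pearson_r(method_a, method_b), 3))
print("bias:", bias, "limits of agreement:", round(lower, 2), "to", round(upper, 2))
```

With these invented values the correlation is very close to 1, yet method B reads about 6 units higher than method A on average. A high correlation therefore says nothing about whether the two methods could be used interchangeably.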
Wikipedia provides sections on response and explanatory variables, contingency tables, risk ratio, odds ratio, scatterplots, correlation, linear regression, and association and causation. Bandolier provides another take on the joys of odds ratios and risk ratios, whilst Cochrane-net give a useful account of summary statistics for dichotomous data. Various university courses such as Yale provide introductions to linear regression.