Measures of validity for measurement variables: Use and misuse
(Bland Altman plot, calibration, regression, correlation)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse

Data validation aims to assess measurement bias and ensure that a variable measures what it is supposed to measure. We deal elsewhere with the measures of validity used for nominal variables. Here we consider measurement variables.
For measurement variables, data validation consists of comparing two measurements made on each subject. The first measurement is of the variable you are able to measure in practice - sometimes called the practical variable. The second measurement is of the true value of the variable, or at least as close as you can get to the true value. This variable is sometimes called the criterion variable, or gold standard. With questionnaire data, the response of the interviewee would be the practical variable - for example, if you ask a farmer how many cattle he or she owns. The researcher may then carry out a ground-truthing operation and count the cattle directly - that count would be the criterion variable.
Most commonly each variable is measured in the same units, so we are looking to see whether values of the two variables are in agreement. An exception to this is if the practical variable is a proxy or surrogate variable, with quite different units. Here we are interested in whether there is a close and consistent relationship between the proxy and true values, so that we can predict one from the other.
Although validation studies of nominal variables frequently appear in the literature, validation studies of measurement variables are far less common. The most widely used analytical techniques for validating measurement variables are correlation and regression. Unfortunately the correlation coefficient is generally not helpful because it only indicates the degree of association - not of agreement. Simple linear regression can be used, provided the relationship is linear and the criterion variable is measured without error. Agreement is indicated if the slope does not differ from one, and the intercept does not differ from zero. In the medical and veterinary literature, the publications of Bland and Altman promoting the so-called Bland-Altman plot as an alternative to regression have had a major impact. In this approach the first step is to draw a scatterplot of the practical variable against the criterion variable, together with a line of equality. Any bias is then assessed by plotting, for each pair of observations, the difference between the two methods against the mean of the two methods.
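To make the distinction between association and agreement concrete, here is a minimal Python sketch. The data are invented for illustration, and the function names `pearson_r` and `bland_altman` are our own, not taken from any package. The limits of agreement are computed as the mean difference plus or minus 1.96 standard deviations of the differences, which assumes the differences are roughly normal with no trend in bias or spread.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient: measures association, not agreement."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bland_altman(practical, criterion):
    """Return the quantities shown in a difference-against-mean plot."""
    diffs = [p - c for p, c in zip(practical, criterion)]
    means = [(p + c) / 2 for p, c in zip(practical, criterion)]
    bias = mean(diffs)   # mean difference between the two methods
    sd = stdev(diffs)    # sample SD of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Hypothetical paired measurements (same units), invented for illustration.
criterion = [10.0, 20.0, 30.0, 40.0, 50.0]
practical = [16.0, 24.0, 36.0, 44.0, 55.0]

r = pearson_r(practical, criterion)
means, diffs, bias, limits = bland_altman(practical, criterion)

print(f"correlation r = {r:.3f}")
print(f"mean bias     = {bias:.2f}")
print(f"limits of agreement = {limits[0]:.2f} to {limits[1]:.2f}")
```

In this made-up example the correlation is very close to one, yet the practical method reads about five units too high on average. Plotting `diffs` against `means` (with any plotting tool) gives the difference-against-mean plot itself.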
We give several examples where the Bland-Altman approach is used. Sometimes it is done correctly, but often no initial scatterplot is provided, or it is only plotted for part of the measured range. In two examples no scatterplot is given, and only a correlation coefficient is provided. This is a misuse of the correlation coefficient. One trend in the medical and veterinary literature is to give both a regression line and a Bland-Altman plot. The commonest problem where regression is used is that both variables are subject to error, which attenuates the slope of the relationship below the expected value of one. For the Bland-Altman plot, the limits of agreement are sometimes estimated when their assumptions are not met - for example, when there is a clear trend in mean bias or when variability increases with the mean. We should note that the Bland-Altman plot also has its critics - so the last word on this has yet to be spoken.
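The attenuation of the slope when both variables carry error can be seen in a small simulation. This is only a sketch under assumed conditions: true values spread uniformly from 0 to 100, an unbiased practical method, and independent normal errors with a standard deviation of 20. The helper `ols_slope_intercept` is our own name for an ordinary least squares fit.

```python
import random
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 2000

# Assumed true values, never observed directly in a real study.
true_values = [rng.uniform(0, 100) for _ in range(n)]

# Practical method: unbiased, but with random error (SD 20).
practical = [t + rng.gauss(0, 20) for t in true_values]

# Criterion measured with error of the same size (SD 20).
noisy_criterion = [t + rng.gauss(0, 20) for t in true_values]

slope_exact, _ = ols_slope_intercept(true_values, practical)
slope_noisy, _ = ols_slope_intercept(noisy_criterion, practical)

print(f"slope, criterion without error: {slope_exact:.2f}")
print(f"slope, criterion with error:    {slope_noisy:.2f}")
```

Under these assumptions the first slope should come out close to one, whilst the second falls well below one even though the practical method is unbiased. A slope below one is therefore not, on its own, evidence of disagreement when the criterion itself is measured with error.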
Use of the Bland-Altman plot seems not to have penetrated the ecological and wildlife literature, where it is hard to find any examples of data validation! All too often it is simply assumed that proxy variables such as trap catches or volunteer counts are meaningful measures of what they are supposed to be measuring - and this assumption goes unchecked. Where validation studies are done, linear regression is by far the most commonly used method of analysis. This is perfectly acceptable if the two variables are measured in different units (for example trap catches versus population size), and the regression is being done to enable prediction of the criterion variable from a proxy variable. But where the two variables are measured in the same units there is clearly scope for more extensive use of the Bland-Altman method.
Validation studies suffer from some general problems not related to the specific statistical technique used. For example, validation studies using questionnaires often only validate some minor point (such as herd size) which can be measured easily, and then assume that this 'validity' is somehow proven for all other aspects (such as disease diagnosis). Excessively small sample size is another common problem, prevalent in wildlife studies that relate different methods of estimating population size.
What the statisticians say

Bland (2000) covers the use of Bland-Altman plots for comparing two methods of measurement in Chapter 15. Chambers et al. (1989) and Cleveland (1985) cover the Tukey mean difference plot. Glantz (2005) and Woodward (1999) both have short sections on measuring agreement between quantitative variables. Brown (1993) provides a (much) more advanced treatment of regression and calibration.
Bland & Altman (2007) and Myles & Cui (2007) consider statistical methods for assessing agreement between two methods of clinical measurement that can be used for repeated measures data. Halligan (2002) stresses the inadequacy of the correlation coefficient for assessing agreement. Altman & Bland (1983) and Bland & Altman (1986) are the (now) classic papers on how to assess agreement between measurement variables. Hopkins (2004) and Batterham (2004) put the opposing point of view.
Bland & Altman (2002) looks at methods of validating scales and indexes whilst Muldoon et al. (1998) argue that measures of quality of life should be validated against measures of directly observed behavioural performance. Lin (1989) provides a concordance correlation coefficient that does evaluate agreement rather than just the degree of correlation.
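As a rough illustration of how a concordance coefficient differs from an ordinary correlation, the sketch below computes 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²). We assume the form of the estimator that uses variances and covariance with divisor n. The function name `concordance_ccc` and the data are our own, invented for illustration.

```python
from statistics import mean


def concordance_ccc(x, y):
    """Concordance correlation coefficient, using divisor-n (co)variances."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # The squared difference in means penalises any systematic bias.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical data: the second method reads exactly 5 units too high.
criterion = [10.0, 20.0, 30.0, 40.0, 50.0]
shifted = [c + 5.0 for c in criterion]

ccc_same = concordance_ccc(criterion, criterion)
ccc_shifted = concordance_ccc(criterion, shifted)

print(f"identical measurements: {ccc_same:.3f}")     # 1.000
print(f"constant bias of 5:     {ccc_shifted:.3f}")  # 0.941
```

The Pearson correlation between `criterion` and `shifted` is exactly one, because a constant offset does not affect association. The concordance coefficient falls below one because the points no longer lie on the line of equality.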
Wikipedia provides sections on calibration and on the Bland-Altman plot. Burke provides a good review of regression and calibration, along with methods to calculate standard errors and confidence intervals for values of X predicted from Y, whilst Davies & Fearn provide a useful note on calibration statistics. R statistics provides the code for doing a Bland-Altman plot in R.