"It has long been an axiom of mine that the little things are infinitely the most important"
Measures of validity for binary and nominal variables: Use and misuse
(sensitivity, specificity, overall accuracy, likelihood ratios, area under the curve, spectrum bias, gold standard, measurement bias)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Data validation assesses the bias component of measurement error. In other words, a variable is valid if it provides an accurate measure of what it is supposed to measure. The random error component of measurement error is assessed by repeatability. The measures of validity used for binary variables are sensitivity, specificity, overall accuracy, likelihood ratios and AUC (area under the curve). In the past these measures have mainly been used in the field of diagnostic tests for medical and veterinary applications. However, most have a potentially wider utility, especially in relation to assessing the accuracy of mathematical models.
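These measures all derive from a 2x2 table of test results against the gold standard. A minimal sketch of how they are calculated (the counts below are hypothetical, chosen only for illustration):

```python
# Validity measures from a 2x2 table: tp/fp/fn/tn are counts of
# true positives, false positives, false negatives and true negatives
# relative to the gold standard. Counts here are hypothetical.
def validity_measures(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)                # proportion of true positives detected
    specificity = tn / (tn + fp)                # proportion of true negatives detected
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall proportion correct
    lr_pos = sensitivity / (1 - specificity)    # LR+ : odds boost for a positive result
    lr_neg = (1 - sensitivity) / specificity    # LR- : odds boost for a negative result
    return sensitivity, specificity, accuracy, lr_pos, lr_neg

se, sp, acc, lrp, lrn = validity_measures(tp=90, fp=10, fn=10, tn=90)
print(round(se, 3), round(sp, 3), round(acc, 3), round(lrp, 3), round(lrn, 3))
# 0.9 0.9 0.9 9.0 0.111
```

Note that, unlike sensitivity and specificity, overall accuracy depends on the prevalence of the condition in the sample, which is one reason the other measures are usually preferred.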
One key assumption for evaluation of a diagnostic test (or model) is that the study population is a representative (i.e. random) sample of the target population to which the test (or model) is to be applied. Unfortunately, in the literature we reviewed, it was rare for the test population to be a representative sample of the target population. In several examples the sample comprised all individuals in a defined population, but more often the criteria for including individuals in the sample were not even specified. Failure to use representative samples invariably results in spectrum bias when the test is applied to other populations. This aspect is routinely forgotten when ecologists use ROC curves for distribution models. We found just one example of 'best practice', where the model was developed with 'training sites' and its accuracy was then evaluated using a new set of randomly selected sites.
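The 'best practice' approach above can be sketched as follows. The AUC is computed only on scores from independent evaluation sites, never on the training sites, since an AUC computed on the data used to fit the model is optimistically biased. The scores and labels below are invented for illustration; the AUC is computed here as the Mann-Whitney probability that a randomly chosen positive site outscores a randomly chosen negative one:

```python
# AUC as the Mann-Whitney statistic: the probability that a random
# positive case receives a higher model score than a random negative
# case (ties count as half). Scores and labels below are invented.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Evaluate on a new set of randomly selected sites - NOT the training sites.
test_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
test_labels = [1,   1,   0,   1,   0,   0]
print(auc(test_scores, test_labels))  # 8/9, about 0.889
```

An AUC of 1 indicates the scores separate presences from absences perfectly; 0.5 indicates the model does no better than chance on the evaluation sites.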
The problem of an inadequate (non-ideal) gold standard is evident in several of our examples. In some cases this could have been solved by more rigorous testing - for example, parasitaemia can vary through the day, so the sensitivity of microscopy can be improved by taking three or four samples per individual. If there truly is nothing approaching an ideal gold standard, one can use experimental infections and known negatives - but such studies will suffer from spectrum bias. The possibility of measurement bias lurks behind every diagnostic test evaluation, yet measures to reduce bias are rare indeed! We found no examples of tests being given in random order to avoid bias, even though in some cases quite long periods elapsed between the two tests. Use of a double-blind system to avoid observer bias was reported in only two of our examples.
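The gain from repeated sampling is easy to quantify, if we are prepared to assume the samples are independent (a strong assumption - parasitaemia levels on successive samples may well be correlated). If an individual is declared positive when any of n samples tests positive, combined sensitivity rises to 1 - (1 - se)^n, at the cost of specificity falling to sp^n. A minimal sketch with invented values:

```python
# Effect of testing each individual n times and calling them positive
# if ANY sample is positive. Assumes independent samples - an assumption
# that may not hold for, say, repeated parasitaemia readings.
def repeated_test(se, sp, n):
    combined_se = 1 - (1 - se) ** n  # chance at least one sample tests positive
    combined_sp = sp ** n            # chance all n samples test negative
    return combined_se, combined_sp

se3, sp3 = repeated_test(se=0.6, sp=0.98, n=3)
print(round(se3, 3), round(sp3, 3))  # 0.936 0.941
```

So three samples can raise a mediocre sensitivity of 0.6 to above 0.9, with only a modest loss of specificity - provided the test is fairly specific to begin with.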
What the statisticians say

Woodward (2004) looks at diagnostic tests for medical epidemiologists in Chapter 2. Diagnostic tests for veterinary epidemiologists are covered by Pfeiffer (2010) in Chapter 8, Thrusfield (2005) in Chapter 17, and Dohoo (2003).
Gilchrist (2009) argues that all indices of diagnostic accuracy should be adjusted to correct for chance effects. Grimes & Schulz (2005) explain and promote the use of likelihood ratios for the results of diagnostic tests, both for dichotomous and multiple-category results. Loong (2003) uses a visual approach to explain sensitivity and specificity. Whiting et al. (2003), Knottnerus et al. (2002) and Reid et al. (1995) all review the key requirements for the validity of medical diagnostic studies. Sackett & Strauss (1998) argue that important information is lost by reducing diagnostic tests to dichotomies, and that it is better to use likelihood ratios over five levels. Altman & Bland (1994a, 1994b, 1994c) provide a useful series of three articles on diagnostic tests which cover most of the material we have included.
Gardner & Greiner (2006) advocate ROC curves and likelihood ratios as improvements over traditional methods for evaluating veterinary diagnostic tests, whilst Obuchowski (2004) reviews the uses and misuses of ROC curves in clinical chemistry. Stephan (2003) surveys available computer programs for carrying out ROC analysis. Greiner & Gardner (2000) look at the analysis of diagnostic data from a veterinary viewpoint with emphasis on methods used to adjust prevalence estimates for misclassification. Cannon (2001) looks at the problems of designing veterinary surveys based on an imperfect test.
Townsend Peterson et al. (2008) and Lobo (2007) criticize use of AUC to assess the performance of predictive distribution models. Lobo (2008) stresses the need for more representative data. McPherson et al. (2004) look at the practical use of various different measures of model accuracy including Cohen's kappa and the area under the curve (AUC) of receiver-operating characteristic (ROC) curves. Manel et al. (2001) compare overall accuracy, Cohen's kappa and area under the curve from ROC curves for evaluating distribution models in ecology.
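Cohen's kappa, compared against overall accuracy and AUC in several of the papers above, is the chance-corrected agreement Gilchrist advocates: observed agreement minus the agreement expected from the marginal totals alone, rescaled to a maximum of 1. A minimal sketch for a 2x2 confusion matrix (counts invented for illustration):

```python
# Cohen's kappa for a 2x2 table of model predictions vs observations:
# agreement corrected for the agreement expected by chance from the
# marginal totals. Counts below are hypothetical.
def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # chance agreement: product of marginal proportions for each category
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(tp=40, fp=10, fn=10, tn=40), 3))  # 0.6
```

Here overall accuracy is 0.8, but since chance agreement from the marginals is 0.5, kappa credits the model with only 0.6 of the possible above-chance agreement. Unlike AUC, kappa depends on prevalence, which is one point of contention in the distribution-modelling literature cited above.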
Wikipedia provides sections on diagnostic tests, gold standard test, sensitivity and specificity, positive predictive value, negative predictive value, the kappa coefficient and ROC curves. Pfeiffer (2002) has a chapter on diagnostic tests which covers likelihood ratios and ROC curves. Graphpad has a useful section on interpreting ROC curves.