What do we mean by data validation?

Data validation is a process which assesses (and ideally reduces) the bias component of measurement error. In other words, a variable is valid if it provides an accurate measure of what it is supposed to measure. The random error component of measurement error is assessed by repeatability - we consider this in Unit 3 and again in Unit 11. Data verification is a less clearly defined process which aims to ensure that data are free of human and instrument errors - including those which arise during data processing - and is considered later in this Unit. Be aware: this terminology is not universally accepted, and some authorities use data validation in the sense that we use data verification. The two processes should, however, be clearly distinguished.

Validity is assessed using various measures of validity which vary according to the type of variable you are measuring. In this More Information page we only consider those measures used for nominal variables, especially binary variables - namely sensitivity, specificity, overall accuracy and kappa-corrected accuracy.



Sensitivity and specificity

Sensitivity and specificity are measures of validity for the classification of a binary variable. They are most frequently used to describe the accuracy of diagnostic tests. Accuracy of a classification can only be evaluated if the true status of individuals can be determined using an 'ideal' gold standard. Given that, we can construct a 2 by 2 table comparing the true status with the test results for a number of individuals. Sensitivity and specificity are then defined as below:

True status (from gold standard)

                    Truly positive         Truly negative
   Test positive    (a) true positives     (b) false positives
   Test negative    (c) false negatives    (d) true negatives

Proportion of true positives that test positive
   =   Sensitivity   =   (a) / [(a) + (c)]

Proportion of true negatives that test negative
   =   Specificity   =   (d) / [(b) + (d)]

Several other proportions can also be defined:

The proportion of false positives (true negatives that test positive) = (b) / [(b) + (d)], or 1 − specificity
The proportion of false negatives (true positives that test negative) = (c) / [(a) + (c)], or 1 − sensitivity
The proportion of all those tested that are truly positive (true prevalence) = [(a) + (c)] / N
The proportion of all those tested that test positive (test prevalence) = [(a) + (b)] / N
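As a quick sketch (in Python, with invented cell counts), these proportions can all be computed directly from the four cells of the 2 by 2 table:

```python
# Sketch: validity measures from the four cells of a 2x2 table.
# Cell labels follow the text: (a) true positives, (b) false positives,
# (c) false negatives, (d) true negatives. The counts below are invented.

def validity_measures(a, b, c, d):
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),          # true positives that test positive
        "specificity": d / (b + d),          # true negatives that test negative
        "false_positive_rate": b / (b + d),  # = 1 - specificity
        "false_negative_rate": c / (a + c),  # = 1 - sensitivity
        "true_prevalence": (a + c) / n,
        "test_prevalence": (a + b) / n,
    }

m = validity_measures(a=45, b=15, c=5, d=135)
print(m["sensitivity"])  # 45/50 = 0.9
print(m["specificity"])  # 135/150 = 0.9
```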

The test result may be recorded directly as a binary variable (for example positive or negative). But more often a reading is taken, and the variable is then collapsed to the binary scale using a cut-off value. Values on one side of the cut-off are taken as positive, whilst those on the other side (or equal to the cut-off) are taken as negative.

Obtaining the optimal value for the cut-off value involves a trade-off between sensitivity and specificity. If we assume that values above some cut-off indicate a positive, then raising the cut-off will increase specificity but decrease sensitivity. Similarly lowering it will make the test more sensitive but at the expense of its specificity. We look at ways to determine the optimal cut-off value below.

Although sensitivity and specificity are most commonly used in relation to assessment of diagnostic tests, they can also be used to assess the accuracy of classification of other binary variables - for example, how accurately a mathematical model classifies grid cells on a map as having a particular species of insect vector present or absent.



Positive and negative predictive values

Predictive values indicate the probability at the individual level that classification is indeed correct, whether positive or negative.

Proportion of individuals that test positive who really are positive
= Positive predictive value = (a) / [(a) + (b)]

Proportion of individuals that test negative who really are negative
= Negative predictive value = (d) / [(c) + (d)]

Predictive values depend both on the characteristics of the test and on the prevalence of the characteristic being measured. If the prevalence is very low (say 1%), even a test with a high sensitivity and specificity (say 98%) will have a surprisingly low positive predictive value (in this case only 33%) simply because of the relatively high number of false positives.

There are short-cut formulae (based on Bayes theorem) to estimate predictive values for a given prevalence directly from sensitivity and specificity:

Positive predictive value   =   (sensitivity × prevalence) / [(sensitivity × prevalence) + (1 − specificity) × (1 − prevalence)]

Negative predictive value   =   [specificity × (1 − prevalence)] / [specificity × (1 − prevalence) + (1 − sensitivity) × prevalence]
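These short-cut formulae can be checked numerically. The sketch below (in Python) reproduces the example given above of a test with 98% sensitivity and specificity applied at 1% prevalence:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from Bayes theorem."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

# 1% prevalence with 98% sensitivity and specificity:
ppv, npv = predictive_values(0.98, 0.98, 0.01)
print(round(ppv, 2))  # 0.33 - only a third of test positives are true positives
print(round(npv, 4))  # 0.9998 - a negative result is almost always correct
```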



Measures of Overall Accuracy

If one test has higher sensitivity and specificity than another, and costs the same, it is clearly a better test. However if one test has higher specificity, and the other has higher sensitivity, the choice is not so easy. One indicator of the quality of a test is the overall accuracy.

  • Overall accuracy

    Proportion of individuals for whom test result is correct whether positive or negative
    = Overall accuracy = [(a) + (d)] / N


    However overall accuracy is misleading because it does not take account of chance agreement. The Kappa coefficient is considered by many (but not all) to be a more appropriate measure of overall accuracy.

  • Kappa coefficient

    Kappa coefficient   =   (po − pe) / (1 − pe)


    • po is the observed amount of agreement (in other words the overall accuracy as defined above, [(a) + (d)] / N)
    • pe is the expected amount of agreement if there is no association between true status and test result. Expected cell contents are estimated using proportions obtained from the overall margin totals (see worked example below for details).

    Note kappa scores are biased when prevalence is very high or very low.
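A minimal sketch of the kappa calculation (in Python, using invented 2 by 2 counts; the expected agreement is built from the margin totals as described above):

```python
def kappa(a, b, c, d):
    """Kappa coefficient for a 2x2 test-vs-gold-standard table."""
    n = a + b + c + d
    po = (a + d) / n  # observed agreement = overall accuracy
    # expected agreement if test result and true status were independent,
    # estimated from the margin totals
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)

# Invented counts: overall accuracy is 0.9 but chance agreement is 0.6,
# so kappa = (0.9 - 0.6) / (1 - 0.6) = 0.75
print(round(kappa(45, 15, 5, 135), 2))  # 0.75
```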

  • Likelihood ratios

    The positive likelihood ratio is the proportion of true positives that test positive, divided by the proportion of true negatives that test positive - or equivalently:

    Positive likelihood ratio  =  sensitivity / (1 − specificity)

    The positive likelihood ratio tells us how likely the test is to show a positive test result in a diseased, compared to a non-diseased, individual. If the likelihood ratio is equal to 1, the test cannot discriminate between infected and uninfected individuals. The greater the likelihood ratio is than 1, the better the performance of the test. The likelihood ratio may just be estimated for a single cut-off value, or multiple likelihood ratios may be calculated over a range of cut-off values.

    The negative likelihood ratio is the proportion of true positives that test negative, divided by the proportion of true negatives that test negative.

    Negative likelihood ratio  =  (1 − sensitivity) / specificity

    The negative likelihood ratio tells us how likely the test is to show a negative test result in a diseased, compared to a non-diseased, individual. The smaller the likelihood ratio is than 1, the better the performance of the test.
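Both likelihood ratios follow directly from sensitivity and specificity. A brief sketch, assuming a test with 90% sensitivity and 90% specificity:

```python
def likelihood_ratios(sensitivity, specificity):
    lr_pos = sensitivity / (1 - specificity)  # the larger than 1, the better
    lr_neg = (1 - sensitivity) / specificity  # the smaller than 1, the better
    return lr_pos, lr_neg

# A test with 90% sensitivity and 90% specificity:
lr_pos, lr_neg = likelihood_ratios(0.9, 0.9)
print(round(lr_pos, 1))  # 9.0 - a positive result is 9 times as likely in a diseased individual
print(round(lr_neg, 2))  # 0.11
```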

  • Receiver operating characteristic plots

    A more widely used method of obtaining a global assessment of the performance of the test is to construct a receiver operating characteristic plot (or ROC plot). This is a plot of the proportion of true positives that test positive, against the proportion of true negatives that test positive, for a range of cut-off values.

    {Fig. 1}

    The area under the curve (or AUC), expressed as a proportion of the total, gives a measure of the diagnostic accuracy of the test. If the test provides no discrimination between positive and negative, the curve will follow the 45 degree line, with the AUC equal to 0.5. The more the curve 'crowds' the corner, and the closer the AUC approaches 1, the better the discrimination. In this case the AUC is equal to 0.89.

    Provided a test does give good discrimination, the best cut-off point can then be chosen. The criteria used for this are discussed below.
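The construction of a ROC plot and its AUC can be sketched in plain Python (the readings and true statuses below are invented; a reading equal to the cut-off is counted as positive here for simplicity, and with real data a library such as scikit-learn would normally be used):

```python
def roc_points(readings, true_status):
    """(1 - specificity, sensitivity) pairs for every possible cut-off.

    Assumes higher readings indicate positives.
    """
    pos = [r for r, t in zip(readings, true_status) if t]
    neg = [r for r, t in zip(readings, true_status) if not t]
    points = [(0.0, 0.0)]
    for cut in sorted(set(readings), reverse=True):
        sens = sum(r >= cut for r in pos) / len(pos)
        fpr = sum(r >= cut for r in neg) / len(neg)
        points.append((fpr, sens))
    return points

def auc(points):
    # area under the curve by the trapezoidal rule
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented readings for 3 true positives and 4 true negatives:
readings    = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
true_status = [0,   0,   0,   1,   0,   1,   1]
print(round(auc(roc_points(readings, true_status)), 3))  # 0.917
```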



Assumptions and sources of bias

There are a number of assumptions made when estimating sensitivity, specificity and related measures:

  1. Study population is representative of target population

    The key assumption for evaluation of a diagnostic test is that the study population is a representative ( = random) sample of the target population to which the test is to be applied. A natural consequence of this is that selection criteria must always be clearly defined. Although this assumption is often given in textbooks, it is rarely adhered to in practice! As a result, various forms of selection bias frequently occur.

    The term spectrum bias has been used for the situation when the study population has a different demographic or clinical spectrum from the population in which the test is to be applied. This is common in veterinary studies where there is no 'ideal' gold standard test available. The only way to assess the accuracy of the test is then to evaluate sensitivity on individuals experimentally infected, and specificity on individuals which cannot possibly have been exposed to the disease. In this situation, spectrum bias results in the overestimation of sensitivity and specificity relative to the target population. In medical research spectrum bias occurs if those selected for the evaluation have already been selected as probable cases of the disease, yet the test is to be used on the general population.

  2. Gold standard has 100% specificity and sensitivity

    It goes without saying that you can only determine the accuracy of a test if you have some means of identifying true positives and negatives. Ideally this means having an 'ideal' gold standard test which can be performed on randomly selected individuals along with the test to be evaluated. Other methods (such as experimental infection to give the true positives) suffer from spectrum bias.

  3. Measurements are unbiased

    Observers should always be blinded to the result of the other test(s) in order to reduce observer bias. This cannot be eliminated completely - for example technicians may perform one test more carefully than the other - but blinding to other results is essential. The tests should also be applied to individuals in random order, and the time period between tests should be minimised so that no material change in infection status can occur.

Related topics :

Gold standard test

Bayes theorem