"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Some definitions

Data validation aims to assess (and ideally reduce) measurement bias and to ensure that a variable measures what it is supposed to measure. Validity is assessed using measures that depend on the type of variable you are measuring. We repeat the warning given previously that this terminology is not universally accepted, and some authorities use the term data validation to mean data verification. We have already dealt with the measures of validity used for nominal variables, namely sensitivity, specificity and overall accuracy. Now we consider measurement variables.

For measurement variables, data validation consists of comparing two measurements made on the same subject. The first measurement is of the variable you are able to measure in practice - sometimes called the practical variable. The second measurement is of the true value of the variable, or at least as close as you can get to the true value. This variable is sometimes called the criterion variable. With questionnaire data, the response of the interviewee would be the practical variable - for example if you are asking a farmer to tell you the number of cattle he or she owns. The researcher may carry out a ground-truthing operation and count the cattle directly - that would be the criterion variable.

Most commonly each variable is measured in the same units, so we are looking to see whether values of the two variables are the same. Sometimes however, the practical variable may be a proxy or surrogate variable, with quite different units. Here we are interested in whether there is a close and consistent relationship between the proxy and true values, so that we can predict one from the other.



The scatterplot and line of equality

Assessing the validity of a measurement variable is more complex than for a nominal variable, and there are several different approaches. But whichever approach is used, the first step is always:

Plot the values of the practical variable against those of the criterion variable.

This visualization of the relationship is important. If variables are measured in the same units, it can be aided by drawing in the line of equality on the graph to indicate where the points should lie if the two variables are in perfect agreement.

{Fig. 1}

You can then assess whether two key assumptions are met:

  1. Does the variability in the relationship remain constant as the value of the variable increases? If not, as in the first figure here, it is best to carry out a log transformation before proceeding. The effect of this is shown in the second figure.
  2. Is the relationship between the two variables linear? If not, again a transformation may be required.

If you are dealing with counts you will often find that a log transformation makes the level of variability constant and linearizes the relationship.
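The effect of a log transformation on the level of variability can be sketched with a small example. The counts below are invented for illustration: the practical count is assumed to be off by a constant proportion (alternately 20% low and 25% high), so the raw errors grow with the size of the count while the errors on the log scale stay the same.

```python
import math

# Hypothetical data: criterion counts, and practical counts that are off
# by a constant proportion rather than by a constant amount.
criterion = [10, 20, 40, 80, 160, 320]
factors = [0.8, 1.25, 0.8, 1.25, 0.8, 1.25]
practical = [c * f for c, f in zip(criterion, factors)]

# On the raw scale the absolute differences increase with the count.
raw_diffs = [abs(p - c) for p, c in zip(practical, criterion)]

# On the log scale the absolute differences have constant magnitude.
# (Zero counts would need separate handling, since log(0) is undefined.)
log_diffs = [abs(math.log10(p) - math.log10(c))
             for p, c in zip(practical, criterion)]

print("raw differences:", raw_diffs)
print("log differences:", [round(d, 3) for d in log_diffs])
```

Real data will not behave this neatly, so the scatterplot of the transformed values should still be checked.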

What not to do

What you should not do is just attach the best-fit regression line or (worse) calculate the correlation coefficient. There are three reasons for this:

  1. The correlation coefficient only tells you how closely the two measures are correlated - it does not tell you how closely the two measures agree. You can have a very high correlation, but the actual values may differ greatly.
  2. Regression is more useful if you consider both the value of the slope and the value of the intercept, and not just the significance level. But even these do not necessarily indicate that the relationship is linear. A linear function will nearly always provide a significant fit to a curvilinear relationship, even if it is totally inappropriate!
  3. In a calibration exercise the value of the correlation coefficient will depend on the range of observations - if you increase the range, you will generally increase the correlation coefficient.
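The first of these points is easy to demonstrate. In the sketch below (invented data, with a hypothetical helper `pearson_r` written out from the definition) the practical measure is always exactly double the criterion value. The correlation is perfect, yet the two measures plainly do not agree.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: the practical measure is always twice the criterion.
criterion = [10, 20, 30, 40, 50]
practical = [2 * c for c in criterion]

r = pearson_r(criterion, practical)
mean_difference = mean(p - c for p, c in zip(practical, criterion))

print("correlation coefficient:", r)
print("mean difference (practical - criterion):", mean_difference)
```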

Once you have carried out any necessary transformation to linearize the relationship, the correct approaches to validate your practical measure are detailed below:



  1. Linear regression and calibration

    If both practical and criterion variables are measured in the same units, and the criterion variable is measured without error, we can use simple linear regression to examine the relationship between the two variables. This is commonly the case if one is validating a measuring instrument, such as a balance, against a set of known standards in the laboratory. In this situation we can assess whether fixed bias or proportional bias is present by examining the intercept and slope of the regression equation. If only fixed bias is present, one method will give values that are higher (or lower) than those from the other by a constant amount, but the slope will not differ significantly from 1. The level of bias is given by the value of the intercept. If proportional bias is present, the slope of the relationship will differ significantly from 1.

    Regardless of any bias present (and regardless of whether measurement of the criterion variable is without error), we can still use the regression line to obtain a calibration equation to estimate the criterion variable from the practical variable providing the fit is good. In Unit 12 we consider how to obtain a measure of the reliability of any predictions that are made from the calibration equation by calculating confidence limits.

    Although we have said that the values of the criterion variable should be measured without error, in practice this approach is commonly used when this assumption is clearly false. This does not affect use of the regression line for predictive purposes, but it does complicate interpretation of the slope, since it will be biased downwards. We must again wait until Unit 12 to see how this problem is dealt with using errors-in-variables regression. Often, however, the Bland-Altman approach (below) is a better option.
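As a minimal sketch of this approach, the code below fits an ordinary least-squares line of the practical variable on the criterion variable and reports the intercept (fixed bias) and slope (proportional bias). The balance readings are invented, and the variable names are our own. The criterion values are assumed to be known standards measured without error. To judge whether the slope differs significantly from 1, the statistic `t_slope` would be compared with a t distribution on n - 2 degrees of freedom; that step is left out here.

```python
from math import sqrt
from statistics import mean

# Hypothetical calibration data: known standards (criterion) and the
# readings given by the instrument being validated (practical).
criterion = [10.0, 20.0, 30.0, 40.0, 50.0]
practical = [12.1, 21.9, 32.0, 42.1, 51.9]

n = len(criterion)
mx, my = mean(criterion), mean(practical)
sxx = sum((x - mx) ** 2 for x in criterion)
sxy = sum((x - mx) * (y - my) for x, y in zip(criterion, practical))

# Least-squares estimates: practical = intercept + slope * criterion
slope = sxy / sxx
intercept = my - slope * mx

# Standard error of the slope from the residual mean square.
residual_ss = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(criterion, practical))
se_slope = sqrt(residual_ss / (n - 2) / sxx)

# Test statistic for the hypothesis that the true slope equals 1.
t_slope = (slope - 1) / se_slope

print("intercept (fixed bias):", round(intercept, 3))
print("slope (proportional bias if not 1):", round(slope, 3))
print("t statistic for slope = 1:", round(t_slope, 3))
```

With these made-up readings the slope is close to 1 and the intercept is close to 2, which is the pattern expected when only fixed bias is present.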



  2. Bland-Altman plot

    For a Bland-Altman plot it is assumed that the true value remains unknown, but that you are comparing a new measure (the practical variable) with the current standard measure (the criterion variable). Both measures may be subject to error. Practical and criterion variables are measured in the same units. The difference between the measures is plotted against the mean of the measures. This mean is taken as the 'best estimate' of the unknown true value. Note that for this reason both measures must be made in the same units - it is not appropriate for proxy variables where measures are in different units.

    Providing there is no relationship between the difference and the mean, the mean of the differences between the two measures provides an estimate of the mean bias. It is common practice to also indicate the 95% limits of agreement on the plot. We are getting ahead of ourselves to go into too much detail on this here, but suffice it to say you would expect these limits to enclose 95% of the differences between the two variables providing the differences are normally distributed. They are calculated very simply as the mean difference plus and minus 1.96 times the standard deviation of the differences. The rationale, and assumptions, underlying this approach should become clear when we explore some properties of the normal distribution in Unit 5.
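The quantities shown on a Bland-Altman plot can be computed in a few lines. The sketch below uses invented paired measurements and our own variable names. It assumes the differences are roughly normally distributed and unrelated to the mean, and it takes the limits of agreement as the mean difference plus and minus 1.96 standard deviations of the differences.

```python
from statistics import mean, stdev

# Hypothetical paired measurements on the same five subjects.
new_measure = [10.2, 11.8, 14.1, 16.3, 17.9]       # practical variable
standard_measure = [10.0, 12.0, 14.0, 16.0, 18.0]  # criterion variable

# Each point on the plot: x = mean of the pair, y = difference of the pair.
pair_means = [(a + b) / 2 for a, b in zip(new_measure, standard_measure)]
differences = [a - b for a, b in zip(new_measure, standard_measure)]

# Mean bias and 95% limits of agreement.
bias = mean(differences)
sd_diff = stdev(differences)  # sample standard deviation (n - 1 divisor)
lower_limit = bias - 1.96 * sd_diff
upper_limit = bias + 1.96 * sd_diff

print("mean bias:", round(bias, 3))
print("95% limits of agreement:", round(lower_limit, 3), "to", round(upper_limit, 3))
```

Plotting `differences` against `pair_means`, with horizontal lines at the bias and the two limits, gives the usual display.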



  3. Validating a proxy measure

    In this case practical and criterion variables are not measured in the same units. This is usually the case when the practical variable is a proxy variable - for example, trap catch of mosquitoes being used as a proxy measure of mosquito population size. In this situation, we are no longer trying to assess whether two measures are the same - they obviously cannot be because they are measured in different units. We are only interested in whether the two measures are associated, and (often) in predicting the criterion variable from the practical variable.

    A scatterplot is still the first prerequisite in order to assess the linearity of the relationship. If the relationship (either before or after transformation) is linear, then (Pearson's) correlation coefficient is appropriate. Values of the criterion variable can be predicted from the practical variable using linear regression. As before, the fact that the criterion variable is measured with error does bias the slope, but this does not affect use of the regression line for predictive purposes. If the relationship is non-linear but monotonic (in other words continually increasing or decreasing), then non-parametric correlation is appropriate.
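One common non-parametric choice is Spearman's rank correlation, which is simply Pearson's coefficient applied to the ranks. The sketch below uses invented trap-catch and population figures and two hypothetical helpers (`ranks` and `pearson_r`). The relationship is monotonic but curved, so the rank correlation is perfect while Pearson's coefficient on the raw values is not.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

# Hypothetical data: trap catch (proxy) and population size (criterion).
trap_catch = [1, 2, 4, 8, 16]
population = [100, 150, 180, 200, 210]

r_pearson = pearson_r(trap_catch, population)
r_spearman = pearson_r(ranks(trap_catch), ranks(population))

print("Pearson's r on raw values:", round(r_pearson, 3))
print("Spearman's rank correlation:", round(r_spearman, 3))
```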
