"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



What is data verification?

The purpose of data verification is to ensure that data that are gathered are as accurate as possible, and to minimize human and instrument errors - including those which arise during data processing. Data verification is an on-going process which should start at the data gathering stage, and continue during data entry and analysis.

Be aware! Some authorities use the terms "data validation" and "data verification" much more narrowly. "Data validation" is taken to refer to an automatic computer check that the data are sensible and reasonable, and "data verification" to refer to a check that the data entered exactly match the original source. Under these definitions neither term refers to:

  1. whether the data actually measure what they are supposed to (the usual definition of validity)
  2. whether the data are free of errors (verification by our definition).

The lack of agreed terms may explain why there is so little interest in these two vitally important aspects of data analysis!



At the data gathering stage

At the data gathering stage it is probably best to make as few assumptions as possible about the accuracy of your equipment, or for that matter the human beings taking the readings. Common problems include mislabelling of samples, poor storage and transport of samples, and erroneous counts because of miscalibration and instrument error.

Observer bias is also common - one example is a carry-over effect where a set of samples containing high counts of eggs in faecal smears tends to be followed by excessively high counts even when numbers are low. Another example is a bias towards even numbers, especially if one is estimating a reading half way between marked positions on the scale. This is sometimes termed digit preference bias. However, observer bias can take many forms - often quite unexpected! Only by appropriate checking can you be confident that the data are as accurate as possible. Familiarity with both the type of data you are gathering and the common errors is essential.
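One simple way to look for digit preference is to tabulate the final digit of each recorded value. The sketch below assumes the readings are whole numbers; the function names and the readings themselves are invented for illustration. If final digits were spread roughly evenly, about half would be even, so a share well away from one half suggests (but does not prove) that observers are favouring certain digits.

```python
from collections import Counter


def terminal_digit_counts(readings):
    """Count how often each final digit (0-9) occurs in whole-number readings."""
    counts = Counter(abs(int(r)) % 10 for r in readings)
    return {digit: counts.get(digit, 0) for digit in range(10)}


def even_digit_share(readings):
    """Proportion of readings whose final digit is even."""
    counts = terminal_digit_counts(readings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no readings supplied")
    even = sum(counts[digit] for digit in (0, 2, 4, 6, 8))
    return even / total


# Hypothetical scale readings (invented data, not from any real study).
readings = [12, 14, 18, 20, 22, 13, 16, 24, 28, 30, 17, 26]

digit_table = terminal_digit_counts(readings)
share = even_digit_share(readings)  # 10 of the 12 readings end in an even digit
```

With a real data set the table of final digits is usually worth inspecting by eye before applying any formal test.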

Data gathering using a questionnaire is especially liable to inaccuracies. Many errors and biases are introduced when a questionnaire is translated into another language - the only way to guard against this is to have an independent person back-translate the translated questionnaire, and then compare the back-translation with the original. The other big problem, if the questionnaire is given verbally, is interviewer bias. Someone who has administered hundreds (or thousands) of questionnaires will expect particular answers to certain questions, and will often stop listening (or even not ask the question) and just insert the expected (or desired) answer. This can only be detected if a sample of interviewees is re-interviewed shortly afterwards by independent interviewers. We consider questionnaire design and implementation in more depth in Unit 7.



At the data entry stage

At the data entry stage, a number of data checking packages are available. These commonly check that data are in a specified format (format check), that they lie within a user-specified range of values (range check) and (sometimes) that they are consistent - for example, that there is no milk yield for male cattle! They cannot tell you if some data have been missed out, nor can they detect errors within the accepted range. Such errors can only be eliminated by a visual check (that is, proof-reading) or (better) by using double data entry. With this method two data entry operators enter the data independently, and the two data files are compared using a computer programme. Even this method may not detect errors arising from misreading of carelessly written numbers (for example 6 and 0).
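The checks described above can be sketched in a few lines. This is a minimal illustration, not the behaviour of any particular data checking package: the field names, the identifier pattern and the weight limits are all hypothetical. The second function compares two independently keyed files row by row, which is the essence of double data entry.

```python
import re

# Hypothetical rules for a cattle record; names and limits are invented.
ID_PATTERN = re.compile(r"[A-Z]{2}\d{4}")
WEIGHT_RANGE_KG = (20.0, 1500.0)


def check_record(record):
    """Return the problems found by format, range and consistency checks."""
    problems = []
    if not ID_PATTERN.fullmatch(record["animal_id"]):
        problems.append("format: animal_id")
    low, high = WEIGHT_RANGE_KG
    if not low <= record["weight_kg"] <= high:
        problems.append("range: weight_kg")
    if record["sex"] == "M" and record["milk_yield_l"] is not None:
        problems.append("consistency: milk yield recorded for a male")
    return problems


def compare_double_entry(first, second):
    """Compare two independently entered files (lists of dicts), row by row.

    Returns (row_index, field, value_1, value_2) for every disagreement.
    """
    mismatches = []
    if len(first) != len(second):
        mismatches.append((None, "row count", len(first), len(second)))
    for index, (a, b) in enumerate(zip(first, second)):
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((index, field, a.get(field), b.get(field)))
    return mismatches


# Invented example: the second operator keyed 670 instead of 610.
operator_1 = [
    {"animal_id": "AB0001", "sex": "F", "weight_kg": 540.0, "milk_yield_l": 21.5},
    {"animal_id": "AB0002", "sex": "M", "weight_kg": 610.0, "milk_yield_l": None},
]
operator_2 = [
    {"animal_id": "AB0001", "sex": "F", "weight_kg": 540.0, "milk_yield_l": 21.5},
    {"animal_id": "AB0002", "sex": "M", "weight_kg": 670.0, "milk_yield_l": None},
]

problems = [check_record(r) for r in operator_2]        # every record passes
mismatches = compare_double_entry(operator_1, operator_2)
# mismatches == [(1, "weight_kg", 610.0, 670.0)]
```

Both 610 and 670 lie inside the accepted range, so the range check passes either value; only the comparison of the two files reveals the disagreement. Neither check can say which entry is correct - that requires going back to the original record.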



At the data analysis stage

  • Outlier detection and rejection

    The last opportunity to avoid errors in your data is at the analysis stage - usually by eliminating 'outliers'. Outliers are points that do not follow the general picture, whether in terms of the frequency distribution of your data or its relationship to another variable. Outlier rejection techniques assume that improbable values are in error, and omit them from the analysis. This may be the case, but if so it reflects a failure of your data verification process to detect the error earlier!

    The crucial problem with rejecting outliers is that all data sets include a few 'odd' results. This is completely normal. The hard part is spotting which are genuine mistakes, and which are just odd data points. Making this judgement is particularly risky, as it relies upon your expectations of what is 'reasonable'. It is much better to identify outliers as they arise. Then you stand some chance of finding out why that particular point is an outlier. The biggest source of bias in any study is the researcher's expectations. So, if an observation is not a clear error, it is most unwise to remove it! As we shall see, some 'abnormal' observations are normal, and you may learn more by understanding why some points are outliers than by only looking at the 'normal' data points! A further problem with automatic outlier rejection is that it is very difficult to allow for it in any subsequent statistical analysis - by removing the most extreme observations, you are artificially reducing your sample variation.
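The effect on sample variation is easy to see with a small invented data set. The sketch below assumes a deliberately crude rejection rule (drop the single smallest and single largest value); the helper name and the numbers are hypothetical, and a real rejection rule would differ in detail.

```python
import statistics


def drop_extremes(values, k=1):
    """Remove the k smallest and k largest observations (a crude rejection rule)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every observation")
    return ordered[k:len(ordered) - k]


# Invented sample; nothing here is known to be an error.
sample = [4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.4, 7.8, 9.6]

sd_all = statistics.stdev(sample)
sd_after_rejection = statistics.stdev(drop_extremes(sample))
```

For these values the standard deviation after rejection is well below that of the full sample, so any standard error or test based on the reduced data would overstate the precision of the result.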

  • Trimmed means and robust estimators

    If you must remove a few extreme observations, you need to reduce the risk of bias as much as possible. To allow for this, a class of statistics was developed - known as robust estimators. The idea of a robust statistic is that, when all is well, it will behave nearly as well as more ordinary statistics - but when its assumptions are compromised, it will continue to behave more or less reasonably.

    To be valid, most robust estimators assume you are dealing with a reasonably distributed set of observations, contaminated by a small proportion of much more variable results. Of the various statistics which have been devised, the simplest to explain are the 'trimmed' means.

    Although a number of trimmed means have been devised, the most popular of them ensure the same number of unusually large and unusually small observations are removed. In other words, the mean is obtained from a symmetrically trimmed sample. The degree of trimming is usually expressed in terms of what proportion (or percent) of the most extreme observations have been removed either side of the median. An ordinary arithmetic mean is thus a zero (0%) trimmed mean. At the other extreme, the median is the 0.5 (50%) trimmed mean.
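A symmetrically trimmed mean can be sketched as follows. The function name is our own, and two details are assumptions rather than part of the definition above: the number of observations dropped from each tail is rounded down, and it is capped so that one central value (odd sample size) or two (even sample size) always remain, which makes the 50% trimmed mean coincide with the median.

```python
import statistics


def trimmed_mean(values, proportion):
    """Symmetrically trimmed mean.

    `proportion` is the fraction of observations removed from EACH tail:
    0 gives the ordinary arithmetic mean, 0.5 gives the median.
    """
    if not 0 <= proportion <= 0.5:
        raise ValueError("proportion must lie between 0 and 0.5")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations supplied")
    # Drop k values from each tail (rounded down), but always keep
    # the one or two central observations.
    k = min(int(proportion * n), (n - 1) // 2)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Invented counts with one very large value.
counts = [2, 3, 3, 4, 5, 5, 6, 7, 9, 46]

mean_0 = trimmed_mean(counts, 0)       # 9.0, the arithmetic mean
mean_10 = trimmed_mean(counts, 0.1)    # 5.25, smallest and largest removed
mean_50 = trimmed_mean(counts, 0.5)    # 5.0, the median
```

In this invented example the single large count pulls the arithmetic mean up to 9.0, while even 10% trimming brings the estimate close to the median.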

    Although the properties of robust estimators are fairly well understood, they are still relatively uncommon - partly because appropriate formulae are not readily available, although they are being increasingly assessed by simulation.


    As we noted above, a median is the most extreme trimmed mean. Generally, if you have data where you distrust extreme values, it is easier and more transparent to use medians. Although the tests for medians are less powerful than those for means, a good number are available - and, for reasonably large samples, the formulae for them are relatively straightforward. We consider how to compare medians in Unit 10.
