"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Data verification: Use and misuse

(data quality, observer bias, double data entry, outliers, trimmed means)

Statistics courses, especially for biologists, assume that formulae equal understanding, and teach how to do statistics, but largely ignore what those procedures assume, and how their results can mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

The purpose of data verification is to ensure that the data gathered are as accurate as possible, and to minimize human and instrument errors - including those which arise during data processing. Data verification is an ongoing process which should start at the data-gathering stage, and continue during data entry and analysis. Be aware! Some authorities use the terms "data validation" and "data verification" much more narrowly. Data validation is taken to refer to an automatic computer check that the data are sensible and reasonable, and "data verification" to refer to a check that the data entered exactly match the original source. Under these definitions neither term refers to whether the data actually measure what they are supposed to (validity) or whether the data are free of errors (verification). The lack of agreed terms may explain why there is so little interest in these two vitally important aspects of data analysis!
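Data validation in the narrow sense - an automatic computer check that the data are sensible and reasonable - can be as simple as a range check applied at data entry. A minimal sketch in Python; the field names and plausible limits below are hypothetical examples, not from any of the studies discussed:

```python
# A sketch of automatic data validation in the narrow sense: a computer
# check that entered values are sensible and reasonable.
# Field names and plausible limits are hypothetical examples.

def validate_record(record, limits):
    """Return a list of validation failures for one data record."""
    errors = []
    for field, (low, high) in limits.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing value")
        elif not (low <= value <= high):
            errors.append(f"{field}: {value} is outside the plausible range {low}-{high}")
    return errors

limits = {"age_years": (0, 120), "parasites_per_ul": (0, 2_000_000)}
record = {"age_years": 210, "parasites_per_ul": 150_000}  # age mis-keyed as 210
print(validate_record(record, limits))
```

Note that such a check can only catch impossible or implausible values; a mis-keyed value that remains within the plausible range passes silently, which is why validation is no substitute for verification against the original source.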

Pre-analysis data verification procedures are either very poorly reported in the literature or (perhaps more likely) simply not done. A few references to data checking, especially double data entry, can be found in the medical literature. Outlier exclusion, on the other hand, is reported much more frequently in all disciplines - although the justification for excluding such observations is seldom given. Robust estimators such as trimmed means are still rare, except in a few types of analyses.
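Double data entry can be sketched in a few lines: the same records are keyed twice, independently, and the two files are compared field by field, with every disagreement checked against the original data sheet. The record layout below is a hypothetical example:

```python
# A sketch of double data entry (two-pass verification): the same records
# are keyed twice, independently, and the two files are compared field by
# field; each disagreement is then checked against the original source.
# The record layout is a hypothetical example.

def compare_entries(first_pass, second_pass):
    """Return (record number, field, first value, second value) for each mismatch."""
    mismatches = []
    for i, (rec1, rec2) in enumerate(zip(first_pass, second_pass)):
        for field in rec1:
            if rec1[field] != rec2.get(field):
                mismatches.append((i, field, rec1[field], rec2.get(field)))
    return mismatches

pass1 = [{"id": 1, "weight_kg": 63.2}, {"id": 2, "weight_kg": 71.8}]
pass2 = [{"id": 1, "weight_kg": 63.2}, {"id": 2, "weight_kg": 17.8}]  # digits transposed
print(compare_entries(pass1, pass2))
```

A comparison like this only catches entry errors where the two operators disagree; errors already present on the original data sheet, or made identically by both operators, slip through - one reason why double entry alone is not sufficient to guarantee good quality data.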

Two of our examples look at levels of error in data: in one case government surveillance data on malaria incidence, in the other somatic cell counts of milk. Both studies revealed unacceptably high levels of error. Our other examples look at measures taken to detect errors, and (especially) at outlier rejection. Mostly one suspects that the offending data points are excluded not because they are in error, but because they do not happen to fit the researcher's preconceptions. In only one case was there any attempt to re-measure the observations to establish that they were genuinely in error.


What the statisticians say

Batini & Scannapieco (2006) provide a comprehensive overview of data quality issues over a wide range of disciplines. Armitage & Berry (2002) and Thrusfield (2005) both have short sections on data verification procedures. Gott & Duggan (2003) give some rather basic information on measurement error, and how to avoid common errors in measurement.

Bowling (2005) covers issues of both data validation and verification in the administration of questionnaires. Ersbøll & Ersbøll (2003) have an excellent (and all too rare) section on data verification for the veterinary epidemiologist as an essential component of scientific research. Arts (2002) looks at improving data quality in medical registries. Festing & Altman (2002) also emphasize the importance of data verification. Day et al. (1998) challenge the notion that double data entry is either sufficient or necessary to ensure good quality data in clinical trials. Tsang et al. (1998) give a good review of the procedures required to run a randomized controlled trial including a section on data verification using monitors to oversee the progress of the trial.

Feng et al. (2004) look at the quality-control methods used on a daily meteorological dataset in China covering a fifty year period. Meek & Hatfield (1994) also consider the issue of data quality checking for meteorological databases.

Wikipedia provides sections on data verification (under the title data validation), two-pass verification, outliers, trimmed means, winsorized means and the interquartile mean. NIST has a useful section on outliers.
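The robust estimators named above can be written out in a few lines of plain Python so that the arithmetic is explicit; tested library implementations exist (for example scipy.stats.trim_mean). The data values below are made up for illustration:

```python
# Sketches of the robust estimators named above, written out in plain
# Python so the arithmetic is explicit. The data values are made up.

def trimmed_mean(data, proportion):
    """Mean after discarding the given proportion of values from each tail."""
    x = sorted(data)
    k = int(len(x) * proportion)
    kept = x[k:len(x) - k]
    return sum(kept) / len(kept)

def winsorized_mean(data, proportion):
    """Mean after replacing each tail with its nearest retained value."""
    x = sorted(data)
    k = int(len(x) * proportion)
    x = [x[k]] * k + x[k:len(x) - k] + [x[-k - 1]] * k
    return sum(x) / len(x)

data = [2, 3, 3, 4, 4, 5, 5, 6, 9, 50]  # one gross outlier (50)
print(sum(data) / len(data))       # ordinary mean, dragged upwards by the outlier
print(trimmed_mean(data, 0.1))     # 10% trimmed mean
print(winsorized_mean(data, 0.1))  # 10% winsorized mean
print(trimmed_mean(data, 0.25))    # interquartile mean = 25% trimmed mean
```

Unlike outright outlier rejection, these estimators treat both tails symmetrically by a rule fixed in advance, so they cannot be accused of discarding only the observations that fail to fit the researcher's preconceptions.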