"It has long been an axiom of mine that the little things are infinitely the most important"
Data verification

On this page: Institute data checks · Know the common error types · Check data after entry · Simple outlier identification · Trimmed means
Institute data checks wherever possible
Get to know the common types of error
We will take the gathering of meteorological data as our example here, since such data are often essential for analyzing field data. Hence it pays to put some effort into ensuring that the data are of the highest quality. Despite this, gathering routine meteorological data is commonly assigned to (unsupervised) junior staff, sometimes with quite remarkable results.
Maximum-minimum thermometers may seem easy to use, but make sure that the people taking the readings really do know which end of the marker they should read.
If rainfall on one occasion is not recorded, it tends to be added to the next occurrence, or is entered when the next person comes on duty. You can sometimes resolve this by reference to your humidity readings from a wet-dry thermometer.
In the past the only way to obtain continuous measurements of temperature and humidity was to use a thermohygrograph, and many of these are still in operation. Measurements are recorded on paper charts, which are replaced either daily or weekly. Because these instruments were originally designed for laboratory use, they are notoriously inaccurate in the field, and require frequent and regular calibration checks against a (reliable) thermometer over a range of temperatures.
You might assume that electronic equipment, such as a modern 'data logger', is immune to these problems. In practice this is not the case, and you should never assume that electronic equipment is reliable and accurate. Some data loggers use the same sensors, such as horsehair for humidity, as were used fifty years ago. Modern electronic equipment does eliminate human error, provided it is set up correctly in the first place. But it is generally more complex than older equipment, and when it malfunctions it is more difficult and costly to put right. Check all sensors carefully at regular intervals. Wasp nests can produce very odd results, as can livestock or buffaloes if they collide with your equipment. Solarimeter sensors provide birds with a convenient perch, so if you leave them unchecked for extended periods you may find your solar radiation readings are steadily declining. We eventually tracked down one inexplicable rainfall (about 30 mm) to a casual worker relieving himself one dark night!
If you are using meteorological data gathered by someone else, remember that all of the problems described above may apply to those data too.
Always check your data after computer entry
When it comes to entering data on the computer, remember everyone makes typographical errors. This happens especially when people are tired, bored, distracted, or in a hurry.
You should not have to think about whether to check your data once it has been input. The only question is how best to check it. We strongly recommend double data entry. The two data entry operators should enter the data quite separately, ideally in different orders. If funds do not stretch to employing two data entry operators, a less 'high-tech' (and less efficient) method is for one person to read the data from a printout, whilst another person checks it against the original data sheets.
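The double-entry comparison can be sketched in a few lines. This is only a minimal illustration: the values and column layout below are invented, and a real system would also handle files of unequal length and report mismatches by field name.

```python
# Sketch: compare two independently entered copies of the same data
# (hypothetical values; in practice these would be read from two files).

def compare_entries(rows_a, rows_b):
    """Return (row, column, value_a, value_b) for every cell that differs."""
    mismatches = []
    for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
        for j, (a, b) in enumerate(zip(row_a, row_b), start=1):
            if a != b:
                mismatches.append((i, j, a, b))
    return mismatches

first  = [["12.4", "80.102"], ["13.1", "79.880"]]
second = [["12.4", "80102"],  ["13.1", "79.880"]]  # typist dropped the decimal point

print(compare_entries(first, second))  # → [(1, 2, '80.102', '80102')]
```

Every disagreement between the two operators is flagged for checking against the original data sheets; a cell is accepted only when both copies agree.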
Many data management packages now include data verification routines which check for 'out of range' values. For example, only body temperatures between 90 and 105 degrees Fahrenheit may be accepted, or the package may put up an error message if you enter a letter when a number is expected. If you are using such a data verification programme, do test that it works: try including some random 'silly' values, and remember that the program will only detect errors that you would probably have spotted eventually anyway. Such routines certainly do not dispense with the need for data checking by one of the methods mentioned above.
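A minimal 'out of range' check along these lines might look as follows. The 90-105 limits follow the body-temperature example above, and the data are invented, with one misplaced decimal point included as a 'silly' value.

```python
# Minimal out-of-range check, assuming human body temperatures (°F).

def check_range(values, low=90.0, high=105.0):
    """Flag any value outside [low, high], together with its position."""
    return [(i, v) for i, v in enumerate(values, start=1)
            if not (low <= v <= high)]

temps = [98.6, 99.1, 9.86, 101.2]  # 9.86 is a misplaced decimal point
print(check_range(temps))          # → [(3, 9.86)]
```

Note that such a check can only catch values that stray outside the stated limits; a typo of 98.1 for 89.1 would sail straight through.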
However carefully you ensure that what is on your data sheets is what has gone onto the computer, you will still have some errors that originate from when the data were recorded. Inspecting the original 'rough' data collection records is often the only way of resolving such anomalies, and field notes are an indispensable aid to tracking these errors down.
Some simple methods of outlier identification
One fairly obvious, and obviously quite serious, effect of measurement and human errors is the presence of outlying or extreme observations - for example where you type 80102 or .80102 instead of 80.102. Although these are neither the most common nor necessarily the most serious errors, they are relatively simple to detect - and many software packages do so automatically. The principal difficulty lies in deciding what range of values 'ought' to be considered valid. In many instances the choice is entirely arbitrary.
One common way to identify outliers is to regard any observation that falls outside the appropriate quartile ± 1.5 × the interquartile range (IQR) as an outlier.
Sometimes a further category may be recognised - that of extremes. Extremes would fall outside the appropriate quartile ± 3.0 × the interquartile range. The quantities Q ± (1.5 × IQR) and Q ± (3.0 × IQR) are sometimes termed the inner and outer fences.
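The inner and outer fences can be computed directly. The sketch below uses Python's statistics module with its default ('exclusive') quartile convention; other packages use other conventions, so the fences may fall in slightly different places. The data are invented, with one suspicious value included.

```python
# Quartile fences: points beyond Q1/Q3 ± 1.5×IQR are outliers;
# beyond Q1/Q3 ± 3.0×IQR, extremes.
import statistics

def fences(data):
    q1, _, q3 = statistics.quantiles(data, n=4)  # 'exclusive' convention
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

data = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # 30 looks suspicious
inner, outer = fences(data)
outliers = [x for x in data if not inner[0] <= x <= inner[1]]
extremes = [x for x in data if not outer[0] <= x <= outer[1]]
print(outliers, extremes)  # → [30] [30]
```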
Note that identifying a point as an outlier or even as an extreme does not by itself justify dropping that point from the data set. But it may justify close examination of that point to confirm that there are no obvious errors, say at the data entry stage.
A similar approach is sometimes used with the arithmetic mean in place of the median, and the standard deviation in place of the interquartile range. However, these methods are only valid if the data are normally distributed. We will briefly review them in the next section.
A popular abuse of this technique is to automatically reject any outlying points identified by these methods, then proceed with the rest of your analysis as normal. This practice is not recommended because it artificially reduces natural variation and readily introduces bias. There is no generally available statistical technique that can allow for these problems - and results based upon such data should be regarded with extreme suspicion.
A more extreme form of this method, rejecting extreme points until the desired result is obtained, is sometimes referred to as 'fudging'. Like many other forms of intentional bias, it is probably more common than officially acknowledged.
Some trimmed means
The simplest trimmed means are obtained by simply deleting a (predetermined) number of observations from each end of the distribution.
For example, consider a ranked data set whose arithmetic mean is 18.3. The 0.5 trimmed mean of these data is their median, 15, whilst the interquartile mean - the mean of the observations lying between the lower and upper quartiles - is 14.7.
A further variant of the trimmed mean is known as the trimean. This is calculated by adding together the upper and lower quartiles and twice the median, and dividing by four.
For example, taking the same data set again:
The lower quartile is 8.5, the median is 15, and the upper quartile is 21, so the trimean for these data is 14.9.
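These estimators can be illustrated in code. The data set below is hypothetical (the text's own example values are not reproduced here), chosen so that one extreme value inflates the ordinary mean while the trimmed estimators resist it.

```python
# Trimmed mean and trimean on a hypothetical ranked data set.
import statistics

data = sorted([3, 5, 8, 9, 12, 15, 18, 21, 22, 25, 90])  # 90 inflates the mean

def trimmed_mean(xs, k):
    """Drop k observations from each end, then average the rest."""
    xs = sorted(xs)
    return statistics.mean(xs[k:len(xs) - k])

q1, med, q3 = statistics.quantiles(data, n=4)
trimean = (q1 + 2 * med + q3) / 4  # (lower quartile + 2×median + upper quartile) / 4

print(round(statistics.mean(data), 1))  # 20.7 - pulled up by the extreme value
print(round(trimmed_mean(data, 1), 1))  # 15.0 - one observation trimmed per end
print(round(trimean, 1))                # 15.0
```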
An early method of trimming is known as winsorizing. It consists of replacing the highest and lowest observations with the next-highest and next-lowest values.
Let's take the same set of data again with an arithmetic mean of 18.3.
You should decide how many observations to winsorize before starting your analysis.
The winsorized mean of these data is now 17.4.
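Winsorizing as described - replacing the k most extreme observations at each end with their nearest remaining neighbours before averaging - can be sketched as follows, again with a hypothetical data set rather than the text's own example.

```python
# Winsorized mean: pull in the k highest and k lowest observations.
import statistics

def winsorized_mean(xs, k=1):
    xs = sorted(xs)
    # Replace the k lowest values with xs[k], and the k highest with xs[-k-1].
    xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return statistics.mean(xs)

data = [3, 5, 8, 9, 12, 15, 18, 21, 22, 25, 90]
print(round(statistics.mean(data), 1))     # 20.7 - ordinary mean
print(round(winsorized_mean(data, 1), 1))  # 15.0 - extremes pulled in
```

Unlike simple trimming, winsorizing keeps the sample size unchanged, since the extreme observations are replaced rather than deleted.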
Because its use depends strongly upon your expectations of the data, many statisticians still regard trimmed means with caution. However, unlike simple outlier rejection, it is possible to correct for the effects of trimming upon your statistical analysis. The formulae for doing this are a little cumbersome, and have only been worked out for a few situations.