"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

  1. Institute data checks wherever possible

    • Errors are much better dealt with via an active preventative routine than a passive curative one. Do not postpone data verification to the analysis / publication stage.
    • Ensure you are familiar with the entire process of data collection. This makes it much easier to track down sources of errors and to forestall problems.
    • Wherever possible institute a series of checks at several stages of the data collection process. If field assistants are gathering data, join them in the field at frequent intervals and inspect data carefully as it is recorded.
    • Avoid 'recoding' and transcribing data. These are a common source of errors and an invitation to conceal mistakes.
    • Always provide a mechanism by which those gathering your data can record problems and comments - and make certain it is used! Such notes not only help you identify dubious observations, they may enable you to prevent them - indeed, when correcting data, a few scribbled notes in a margin can be worth their weight in gold.
    • If you send samples to a laboratory insert 'known' samples at random. Check test procedures by submitting the same samples twice, or send the same samples to several laboratories.
    • Watch out for the 'Friday afternoon, Monday morning effect', when people may not be working to their full capacity. Identify staff with consistently good or poor performance and treat them accordingly. Never tolerate data fabrication.




  2. Get to know the common types of error

    We will take gathering meteorological data as an example here, since such data are often essential for analyzing field data. Hence it pays to put some effort into ensuring that the data are of the highest quality. Despite this, gathering routine meteorological data is often assigned to (unsupervised) junior staff, sometimes with quite remarkable results.

    Maximum-minimum thermometers may seem easy to use, but make sure that the people taking the readings really do know which end of the marker they should read. Because such readings have to be taken every day, inexperienced staff are often put on to this job on public holidays. If your maximum temperature readings suddenly jump 7 degrees, or your minimum temperature readings drop by 7 degrees, then you probably have this problem! Wet-dry thermometers can be used to measure humidity - but only if the water reservoir for the wet bulb is kept full. Drying out of the wet bulb is usually revealed by a break-down of the usual inverse temperature-humidity relationship.
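    The holiday jump described above is easy to screen for automatically. The sketch below flags any day whose reading differs from the previous day's by some threshold; the 7-degree threshold and the sample readings are illustrative assumptions, and you would tune the threshold to your own station's normal day-to-day variation.

```python
# Sketch: flag suspicious jumps in daily maximum temperature readings.
# The 7-degree threshold is an assumption based on the pattern described
# in the text, not a standard value.

def flag_jumps(readings, threshold=7.0):
    """Return (day_index, previous, current) for each day whose reading
    differs by at least `threshold` degrees from the day before."""
    suspects = []
    for i in range(1, len(readings)):
        if abs(readings[i] - readings[i - 1]) >= threshold:
            suspects.append((i, readings[i - 1], readings[i]))
    return suspects

# Hypothetical week of daily maxima; day 4 (a public holiday, perhaps)
# shows the tell-tale jump from reading the wrong end of the marker.
maxima = [31.2, 30.8, 31.5, 30.9, 38.1, 31.0, 31.3]
print(flag_jumps(maxima))
```

    Both the jump up and the return to normal are flagged, which is itself a useful signature of a one-day reading error.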

    If rainfall on one occasion is not recorded, it tends to be added to the next occurrence, or is entered when the next person comes on duty. You can sometimes resolve this by reference to your humidity readings from a wet-dry thermometer.

    In the past the only way to obtain continuous measurements of temperature and humidity was to use a thermohygrograph, and many of these are still in operation. Measurements are recorded on paper charts, which are replaced either daily or weekly. Since they were originally designed for use in the laboratory, when used in the field they are notoriously inaccurate, and require frequent and regular calibration checks against a (reliable) thermometer over a range of temperatures. If you find the readings are incorrect at lower or higher temperatures, make sure you have the right chart paper in the thermohygrograph (different models require different chart paper).

    You might assume that electronic equipment, such as a modern 'data logger', is immune to these problems. In practice this is not the case, and you should never assume that electronic equipment is reliable and accurate. Some data loggers use the same sensors, such as horsehair for humidity, as were used fifty years ago. Modern electronic equipment does eliminate human error, provided it is set up correctly in the first place. But it is generally more complex than older equipment, and when it malfunctions it is more difficult and costly to put right. Check all sensors carefully at regular intervals. Wasp nests can produce very odd results, as can livestock or buffaloes if they collide with your equipment. Solarimeter sensors provide birds with a convenient perch, so if you leave them unchecked for extended periods you may find your solar radiation readings are steadily declining. We eventually tracked down one inexplicable rainfall (about 30 mm) to a casual worker relieving himself one dark night!

    If you are using meteorological data gathered by someone else, remember:

    1. It is much more difficult to check other people's data.
    2. You will have to rely mainly on the internal consistency of the data.
    3. Relatively small distances can markedly affect meteorological data.
    4. If at all possible, collect some of your own meteorological data for comparison. A low correlation between the two sets of data indicates one or more of the problems we have described.
    5. An unannounced site visit can be most illuminating.



  3. Always check your data after computer entry

    When it comes to entering data on the computer, remember everyone makes typographical errors. This happens especially when people are tired, bored, distracted, or in a hurry.

    You should not have to think about whether to check your data once it has been input. The only question is how best to check it. We strongly recommend double data entry. The two data entry operators should enter the data quite separately, ideally in different orders. If funds do not stretch to employing two data entry operators, a less 'high-tech' (and less efficient) method is for one person to read the data from a printout, whilst another person checks it against the original data sheets.
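    The double-entry comparison described above amounts to a record-by-record, field-by-field check of the two operators' files. A minimal sketch, assuming the data have been read into lists of rows (the file layout and values are hypothetical):

```python
# Sketch of the double-entry check: two operators' versions of the data
# are compared and every disagreement is listed for resolution against
# the original data sheets.

def compare_entries(first, second):
    """Return a list of (record, field, value1, value2) mismatches."""
    mismatches = []
    for rec, (row1, row2) in enumerate(zip(first, second), start=1):
        for field, (v1, v2) in enumerate(zip(row1, row2), start=1):
            if v1 != v2:
                mismatches.append((rec, field, v1, v2))
    return mismatches

entry1 = [["1", "80.102", "M"], ["2", "79.844", "F"]]
entry2 = [["1", "80102",  "M"], ["2", "79.844", "F"]]
print(compare_entries(entry1, entry2))   # the 80102 / 80.102 typo is caught
```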

    Many data management packages now include data verification routines which check for 'out of range' values. For example, only body temperatures between 90 and 105 degrees may be accepted, or the package may put up an error message if you enter a letter when a number is expected. If you are using such a data verification program, do test that it works - try including some random 'silly' values - and remember that the program will only detect errors that you would probably have spotted eventually anyway. Such routines certainly do not dispense with the need for data checking by one of the methods mentioned above.
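    A minimal version of such an 'out of range' routine is easy to write yourself. The sketch below uses the body-temperature limits from the example above; it also catches entries that are not numbers at all.

```python
# Sketch of a simple 'out of range' verification routine. The 90-105
# degree limits come from the body-temperature example in the text.

def out_of_range(values, low=90.0, high=105.0):
    """Return (position, value) pairs for entries outside [low, high],
    including entries that are not valid numbers."""
    bad = []
    for i, v in enumerate(values, start=1):
        try:
            x = float(v)
        except (TypeError, ValueError):
            bad.append((i, v))        # a letter where a number is expected
            continue
        if not low <= x <= high:
            bad.append((i, v))        # a plausible-looking but silly value
    return bad

temps = ["98.6", "101.2", "9.86", "x98.6", "99.1"]   # hypothetical entries
print(out_of_range(temps))
```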

    However carefully you ensure that what is on your data sheets is what has gone onto the computer, you will still have some errors that originated when the data were recorded. Inspecting the original 'rough' data collection records and field notes is an indispensable way of tracking down and resolving such anomalies. Conversely, do not assume all unusual observations are errors - all data contain some unusual observations.



  4. Some simple methods of outlier identification

    One fairly obvious, and obviously quite serious, effect of measurement errors and human errors is the presence of outlying or extreme observations - for example where you type 80102 or .80102 instead of 80.102. Although these are neither the most common nor necessarily the most serious errors, they are relatively simple to detect - and many software packages do so automatically. The principal difficulty lies in deciding what range of values 'ought' to be considered valid. In many instances the choice is entirely arbitrary.

    One common way to identify outliers is to regard as an outlier any observation that falls more than 1.5 × the interquartile range (IQR) below the lower quartile or above the upper quartile.

    {Fig. 4}

    Sometimes a further category may be recognised - that of extremes. Extremes fall more than 3.0 × the interquartile range beyond the appropriate quartile. The limits given by quartile ± (1.5 × IQR) and quartile ± (3.0 × IQR) are sometimes termed the inner and outer fences respectively.
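    The fence rule is straightforward to apply in code. The sketch below uses the ranked sample from the trimmed-means examples later in this unit; quartiles are computed with the 'inclusive' method, which for this sample reproduces Tukey's hinges (8.5 and 21).

```python
# Sketch of the inner/outer fence rule for flagging outliers and extremes.
from statistics import quantiles

data = [1, 3, 5, 6, 8, 9, 12, 12, 13, 15, 16, 17, 18, 20, 22, 30, 36, 43, 62]

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                  # 21 - 8.5 = 12.5
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)       # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)       # outer fences

outliers = [x for x in data if x < inner[0] or x > inner[1]]
extremes = [x for x in data if x < outer[0] or x > outer[1]]
print(outliers, extremes)                      # [43, 62] [62]
```

    Note that different packages compute quartiles slightly differently, so the exact fence positions (and occasionally which points get flagged) can vary between programs.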

    Note that identifying a point as an outlier or even as an extreme does not by itself justify dropping that point from the data set. But it may justify close examination of that point to confirm that there are no obvious errors, say at the data entry stage.

    A similar approach is sometimes used with the arithmetic mean in place of the median, and the standard deviation in place of the interquartile range. However, these methods are only valid if the data are normally distributed. We will briefly review these in the next Unit.

    A popular abuse of this technique is to automatically reject any outlying points identified by these methods, then proceed with the rest of your analysis as normal. This practice is not recommended because it artificially reduces natural variation and readily introduces bias. There is no generally available statistical technique that can allow for these problems - and results based upon such data should be regarded with extreme suspicion.

    A more extreme form of this method, rejecting extreme points until the desired result is obtained, is sometimes referred to as 'fudging'. Like many other forms of intentional bias, it is probably more common than officially admitted.



  5. Some trimmed means

    The simplest trimmed means are obtained by simply deleting a (predetermined) number of observations from each end of the distribution.

    For example, consider the ranked data set below. The arithmetic mean of all these data is 18.3.

     1 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 62

    If we exclude the 3 most extreme observations from each end, the 0.16 trimmed mean is 15.2

     1 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 62

    Whereas, the 0.5 trimmed mean is the median, 15.
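    The calculation above can be sketched in a few lines: sort the data, drop k observations from each end, and average what remains (k = 3 out of n = 19 gives the 0.16 trimmed mean quoted above).

```python
# Sketch: a trimmed mean by deleting k observations from each end
# of the ranked data.
from statistics import mean

data = [1, 3, 5, 6, 8, 9, 12, 12, 13, 15, 16, 17, 18, 20, 22, 30, 36, 43, 62]

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k == 0:
        return mean(values)
    return mean(sorted(values)[k:-k])

print(round(mean(data), 1))              # 18.3 - the untrimmed mean
print(round(trimmed_mean(data, 3), 1))   # 15.2 - the 0.16 trimmed mean
```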


    Another type of trimmed mean is known as the interquartile mean - the mean of the numbers between the upper and lower quartiles, excluding the quartile values themselves:

     1 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 62

    The interquartile mean for these data is 14.7.
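    In code, the interquartile mean is just the average of the observations lying strictly between the two quartiles. As before, the 'inclusive' quartile method is used here because it reproduces the quartiles quoted in this unit (8.5 and 21); other quartile conventions may include or exclude borderline values differently.

```python
# Sketch of the interquartile mean: average the observations that lie
# strictly between the lower and upper quartiles.
from statistics import mean, quantiles

data = [1, 3, 5, 6, 8, 9, 12, 12, 13, 15, 16, 17, 18, 20, 22, 30, 36, 43, 62]

q1, _, q3 = quantiles(data, n=4, method="inclusive")   # 8.5 and 21
inner_values = [x for x in data if q1 < x < q3]
print(round(mean(inner_values), 1))                    # 14.7
```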


    A further variant of the trimmed mean is known as the trimean. This is calculated by adding together the upper and lower quartiles and twice the median, and dividing by four.

    For example, taking the same data set again:

     1 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 62

    The lower quartile is 8.5, the median is 15, and the upper quartile is 21, so the trimean for these data is 14.9.
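    The trimean calculation above can be written directly as (Q1 + 2 × median + Q3) / 4; the 'inclusive' quartile method again reproduces the 8.5, 15 and 21 quoted in the text.

```python
# Sketch of the trimean: (Q1 + 2 * median + Q3) / 4.
from statistics import quantiles

data = [1, 3, 5, 6, 8, 9, 12, 12, 13, 15, 16, 17, 18, 20, 22, 30, 36, 43, 62]

q1, med, q3 = quantiles(data, n=4, method="inclusive")
trimean = (q1 + 2 * med + q3) / 4
print(q1, med, q3, trimean)   # 8.5 15.0 21.0 14.875 (14.9 to one decimal)
```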



    An early method of trimming was known as winsorizing. It consists of deleting the highest and lowest observations, and replacing them with the next highest and next lowest values.

    Let's take the same set of data again with an arithmetic mean of 18.3.

     1 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 62

    You should decide how many observations to winsorize before starting your analysis. If we decide that we only need to winsorize the highest and lowest of the observations, we replace them with the next highest and the next lowest as below:

     3 3 5 6 8 9 12 12 13 15 16 17 18 20 22 30 36 43 43

    The winsorized mean of these data is now 17.4.
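    Winsorizing amounts to clamping each observation to the range spanned by the retained values, rather than deleting anything. A minimal sketch for the k = 1 case worked through above:

```python
# Sketch of winsorizing k observations from each end: extreme values are
# replaced by their nearest retained neighbours rather than deleted.
from statistics import mean

data = [1, 3, 5, 6, 8, 9, 12, 12, 13, 15, 16, 17, 18, 20, 22, 30, 36, 43, 62]

def winsorize(values, k=1):
    """Replace the k smallest and k largest values with the nearest
    retained values. Decide k before looking at the data."""
    v = sorted(values)
    lo, hi = v[k], v[-k - 1]
    return [max(lo, min(hi, x)) for x in v]

w = winsorize(data)
print(w[0], w[-1])          # 3 43 (the 1 and 62 have been replaced)
print(round(mean(w), 1))    # 17.4 - the winsorized mean
```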

    Because its use depends strongly upon your expectations of the data, many statisticians still regard trimmed means with caution. However, unlike outlier rejection, it is possible to correct for the effects of trimming upon your statistical analysis. The formulae for doing this are a little cumbersome, and have only been worked out for a few analyses. Alternatively, if you feel that extreme observations are unduly affecting the mean, it is sometimes simpler and better to transform your data.