Coping with missing data
Avoid missing data in the first place
This is the best approach to coping with missing values. The key to avoiding missing observations is good experimental technique. This requires careful experimental design, clear protocols and rigorous checking of all data at the time of collection - as well as thereafter.
Assess the degree of bias
This is commonly done for questionnaire surveys. The principle of the approach is to compare responders and non-responders for characteristics for which you do have data. For example, you may have independent data on the size of farms, so you can assess whether this factor affects the proportion of farmers responding. If it does not, you then assume that this lack of bias on size extends to the main topic of the questionnaire. This approach is certainly better than nothing, but it is more of an 'act of faith' than a scientific method of evaluating bias.
As well as comparing responders with non-responders, for which you have very little data, you can also compare early responders with late responders. This has the advantage that you have full data for both of these groups. But it does assume that the factor which causes delay in response is the same as that causing non-response. This may be true, but again this assumption is more an act of faith than science.
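As a rough illustration of the responder versus non-responder comparison, the sketch below computes a Pearson chi-square statistic for a 2x2 table of farm size against response. The counts, the two size classes and the function name are all hypothetical. The value 3.84 is the standard 5% critical value for one degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Hypothetical counts: rows are farm size classes, columns are responded / did not respond.
small_responded, small_silent = 40, 60
large_responded, large_silent = 60, 40

stat = chi_square_2x2(small_responded, small_silent, large_responded, large_silent)

# With 1 degree of freedom, a statistic above 3.84 suggests (at the 5% level)
# that the response rate differs between the two size classes.
response_depends_on_size = stat > 3.84
print(stat, response_depends_on_size)
```

A non-significant result here only shows no detectable bias on farm size. It says nothing directly about bias on the main topic of the questionnaire.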
Another way to detect missing observations is to see if one is getting rather odd distributions. For trap catches, an unusual number of zero observations may lead one to suspect that all is not well. There is a specific way to check for missing studies when doing a systematic search of the literature for a meta-analysis - this is known as a funnel plot. Here the treatment effect recorded in each study is plotted against some measure of study size (usually either the total sample size or the standard error of the treatment effect). If all studies have been found, you should get a greater spread of values for the estimated effect among trials with a small sample size, than among trials with a large sample size. Moreover, the distribution should be symmetrical. A non-symmetrical distribution indicates that some studies are 'missing'.
Estimate missing values
Sometimes the best option is to try to estimate (or impute) what the value of the missing observation(s) would have been. Critically, all of these methods make certain assumptions.
How you estimate the missing value depends on the design of the study:
Time series data
Other than inserting some arbitrary value, such as the mean or zero, the simplest way to fill in time series data is the last observation carried forward method - where the previous observation in the time series is simply repeated. Although widely used in clinical follow-ups, this method cannot be recommended. If there is any trend in the data over time, or the last observation is atypical, the estimated values are liable to be biased - and the variability of the data set will be affected.
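For concreteness, here is a minimal sketch of last observation carried forward, assuming the series is held as a Python list with `None` marking a missing observation (the function name and data are illustrative only).

```python
def locf(series):
    """Fill each missing value (None) with the most recent observed value.

    Leading gaps stay as None because there is nothing to carry forward.
    """
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical rising series with gaps.
series = [10, 12, None, None, 18, None]
filled = locf(series)
print(filled)
```

In this rising series the two filled values in the middle both sit at 12, below the next observed value of 18, which is the kind of bias under a trend that the text warns about.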
If there is any indication of a trend over time, a better estimate for repeated measures data is obtained using the arithmetic mean of the previous and following observations. This is known as linear interpolation - and is adequate, provided there is a roughly linear trend over the time period considered. If the trend is not linear, a curvilinear response can be fitted to the data, for example using spline fits. These are especially useful if the missing point is at a peak or trough. Alternatively, running means can be used to estimate the missing value.
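A minimal sketch of linear interpolation follows, assuming equally spaced observations stored in a list with `None` for missing values (names and data are illustrative). For a single gap it reduces to the mean of the two neighbours. For a run of several gaps it places the values on the straight line joining the observations on either side.

```python
def interpolate_linear(series):
    """Fill interior gaps (None) by linear interpolation between neighbours.

    Gaps at the very start or end are left as None, since they have
    an observed value on one side only.
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1          # last observed index before the gap
            end = i
            while end < n and filled[end] is None:
                end += 1           # first observed index after the gap
            if start >= 0 and end < n:
                step = (filled[end] - filled[start]) / (end - start)
                for k in range(i, end):
                    filled[k] = filled[start] + step * (k - start)
            i = end
        else:
            i += 1
    return filled

single_gap = interpolate_linear([10, None, 14])
double_gap = interpolate_linear([10, None, None, 16])
print(single_gap, double_gap)
```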
The first figure below shows a data set of prevalence values with no missing values. Prevalence increases from January, to reach a peak in May, and then declines to a minimum in November. If there were a missing observation at a time of increase or decrease, then linear interpolation would give a good estimate of the missing value. But if there were a missing value in May, linear interpolation would give a rather poor estimate - as can be seen in the second figure:
The only way to get a better estimate would be to fit a function to the data. The third figure shows what we get using a spline fit to the data. This predicts that the prevalence peaks in May at 30%. This is a more reasonable estimate of the missing value.
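The difference between the two estimates at a peak can be sketched numerically. The example below does not use a spline. As a simpler stand-in it passes a single cubic polynomial (Lagrange form) through the four observations nearest the gap. The prevalence values are hypothetical and are not the data behind the figures.

```python
def lagrange_estimate(xs, ys, x):
    """Evaluate at x the polynomial passing through all points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

# Hypothetical prevalence (%) for months 3, 4, 6 and 7; month 5 (the peak) is missing.
months = [3, 4, 6, 7]
prevalence = [14.0, 24.0, 24.0, 14.0]

# Linear interpolation: the mean of the two neighbouring months.
linear_estimate = (prevalence[1] + prevalence[2]) / 2

# Curve-based estimate: the cubic through the four surrounding months.
curve_estimate = lagrange_estimate(months, prevalence, 5)

print(linear_estimate, round(curve_estimate, 2))
```

The linear estimate can never exceed the two neighbouring values, so it flattens the peak, whereas the fitted curve is free to rise above them.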
Simple mean substitution can be used if you have replicated observations within each cell of your experimental design. Missing observations are replaced with the mean (or median) value of the group to which the individual belongs. This approach is used in a number of software packages that will accommodate missing data points. The advantage is that it can be used to restore balance to your design when you have no other information on which to base your estimate. The disadvantage of mean substitution is that it reduces the variability of the data. It is inappropriate for time series data.
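The loss of variability can be seen in a small sketch, assuming one cell of the design is held as a list with `None` for missing replicates (the values and function name are made up for illustration).

```python
from statistics import mean, variance

def mean_substitute(cell):
    """Replace missing replicates (None) with the mean of the observed ones."""
    observed = [v for v in cell if v is not None]
    cell_mean = mean(observed)
    return [cell_mean if v is None else v for v in cell]

# Hypothetical replicates within one treatment cell.
cell = [4.0, 6.0, None, 8.0, None]
observed = [v for v in cell if v is not None]
filled = mean_substitute(cell)

var_observed = variance(observed)   # sample variance of the real data
var_filled = variance(filled)       # sample variance after substitution

print(filled, var_observed, var_filled)
```

The substituted values sit exactly on the cell mean, so they add nothing to the sum of squares while increasing the apparent sample size. The variance estimate shrinks as a result.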
In many experimental designs, widely used in ecological and agricultural applications, you have only a single replicate per cell. Such designs include the randomized block and Latin square designs which we describe in
If you have more than one missing value, you must use an iterative
More sophisticated approaches
Broadly speaking there are two rather more sophisticated multivariate methods to cope with missing values. However, they still generally assume that such values are missing at random, and may require knowledge of covariate values for the missing individual. These methods are therefore not appropriate in many situations.
These do not estimate missing values. Only complete records are analysed, but special weights are assigned to those records based on the pattern of missing data. One example of this is maximum likelihood estimation using the EM (expectation-maximization) algorithm. These methods have been widely available in software packages for some years
Note that when you use any of these methods, you have not recovered the missing information. All you have done is to make the best, or most unbiased, use of the remaining data. These methods all have one thing in common: your final analysis is only as good as your model, and the remainder of your data. If too much of your data is estimated, your analysis will reflect your assumptions, rather than your data.
It is sometimes best to carry out an analysis using several very different methods of coping with the missing data. If the conclusion remains the same, irrespective of the methods used, we can have considerably more confidence in it. This is known as sensitivity analysis.
Constant missing responses and extreme case analysis
These are the methods you should use when observations are not missing at random, for example in clinical trials where withdrawals are caused by side effects of the drugs. In this situation missing responses are commonly assigned a constant value - namely treatment failure. This reduces the chance of falsely showing a difference between treatments when there is no difference - but it is a conservative approach. In other words, you may fail to show a difference between treatments when one really exists.
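As an illustration, the sketch below counts every missing response as a treatment failure and compares the result with an analysis of complete records only. Outcomes are coded as `True` (success), `False` (failure) or `None` (withdrawn). The coding and the counts are assumptions for the example.

```python
def rate_missing_as_failure(outcomes):
    """Success rate with every missing outcome (None) counted as a failure."""
    successes = sum(1 for o in outcomes if o is True)
    return successes / len(outcomes)

def rate_complete_cases(outcomes):
    """Success rate among subjects with a recorded outcome only."""
    recorded = [o for o in outcomes if o is not None]
    return sum(1 for o in recorded if o) / len(recorded)

# Hypothetical treatment arm: 30 successes, 10 failures, 10 withdrawals.
treatment = [True] * 30 + [False] * 10 + [None] * 10

complete_case_rate = rate_complete_cases(treatment)
conservative_rate = rate_missing_as_failure(treatment)

print(complete_case_rate, conservative_rate)
```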
An even more conservative method is extreme case analysis. It is the only way you can be certain that the missing observations are not affecting the outcome. The approach is mainly used where the response variable is measured on the binary or (rarely) the ordinal scale. Missing observations for the group faring better are classed as failures, those for the group faring worse are classed as successes, and the results re-analysed. If the conclusions of the trial are unaffected even by extreme case analysis, we can be reasonably certain that missing values are not biasing the result.
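The reclassification step for a binary response can be sketched as follows. Each group is summarised by hypothetical counts of successes, failures and missing observations. The function only recomputes the success rates under the extreme assumption. A full re-analysis would rerun whatever test the trial originally used.

```python
def observed_rate(group):
    """Success rate among recorded outcomes only."""
    return group["successes"] / (group["successes"] + group["failures"])

def extreme_case_rates(group_a, group_b):
    """Return (better_rate, worse_rate) under extreme case assumptions.

    Missing observations count as failures in the group faring better
    and as successes in the group faring worse.
    """
    if observed_rate(group_a) >= observed_rate(group_b):
        better, worse = group_a, group_b
    else:
        better, worse = group_b, group_a
    n_better = better["successes"] + better["failures"] + better["missing"]
    n_worse = worse["successes"] + worse["failures"] + worse["missing"]
    better_rate = better["successes"] / n_better
    worse_rate = (worse["successes"] + worse["missing"]) / n_worse
    return better_rate, worse_rate

# Hypothetical trial arms.
drug = {"successes": 40, "failures": 10, "missing": 5}
control = {"successes": 25, "failures": 25, "missing": 5}

better_rate, worse_rate = extreme_case_rates(drug, control)
still_better = better_rate > worse_rate
print(round(better_rate, 3), round(worse_rate, 3), still_better)
```

In this made-up example the group that looked better on the observed data still has the higher success rate after reclassification, although the gap narrows.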