Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




Avoid missing data in the first place

    This is the best approach to coping with missing values. The key to avoiding missing observations is good experimental technique. This requires careful experimental design, clear protocols and rigorous checking of all data at the time of collection - as well as thereafter.

      For example:
    • For questionnaires the responses should be checked through immediately to ensure that all questions have been answered. This makes it much easier to go back to the person in question to fill in the missing answer. For postal questionnaires, reminders should be sent to non-respondents, or they can be contacted by telephone. The response rate in postal questionnaires tends to be higher if the questionnaire is short and concise, if a respected sponsor is connected with the study, if the subject is of importance to the respondent, and if confidentiality is respected.

    • When taking samples (for example blood samples) try to take them in duplicate, so that if one sample is lost you can still analyse the other. If you are running a field experiment, you may be able to continue for an extra day to make up for data missed on an earlier day.

    • In clinical trials special attention should be paid to participants who show signs of dropping out of the study. The investigator should contact such individuals before drop-out so that negotiations can take place about which parts of the study they are willing to complete. Hence information may at least be gained on some, if not all, of the study endpoints.



Assess the degree of bias

    This is commonly done for questionnaire surveys. The principle of the approach is to compare responders and non-responders for characteristics for which you do have data. For example, you may have independent data on the size of farms, so you can assess whether this factor affects the proportion of farmers responding. If it does not, you then assume that this lack of bias on size extends to the main topic of the questionnaire. This approach is certainly better than nothing, but it is more of an 'act of faith' than a scientific method of evaluating bias.

    As well as comparing responders with non-responders, for which you have very little data, you can also compare early responders with late responders. This has the advantage that you have full data for both of these groups. But it does assume that the factor which causes delay in response is the same as that causing non-response. This may be true, but again this assumption is more an act of faith than science.
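The responder versus non-responder comparison on farm size can be sketched as a test on a 2x2 table. This is only an illustration: the counts are invented, the split into "small" and "large" farms is an assumption, and the Pearson chi-square statistic (without continuity correction) is just one reasonable choice of test.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

                 responded   did not respond
        small        a              b
        large        c              d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts, invented for illustration only.
small_responded, small_silent = 40, 60
large_responded, large_silent = 45, 55

chi2 = chi_square_2x2(small_responded, small_silent,
                      large_responded, large_silent)

# Tabulated 5% critical value of chi-square with 1 degree of freedom.
CRITICAL_5_PERCENT_1_DF = 3.841

print(f"response rate, small farms: {small_responded / 100:.2f}")
print(f"response rate, large farms: {large_responded / 100:.2f}")
print(f"chi-square = {chi2:.3f}")
if chi2 > CRITICAL_5_PERCENT_1_DF:
    print("farm size appears to be related to response")
else:
    print("no evidence that farm size affects response")
```

A non-significant result here only says that response is not detectably related to farm size. It says nothing directly about bias in the main topic of the questionnaire.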


    Another way to detect missing observations is to see if one is getting rather odd distributions. For trap catches, an unusual number of zero observations may lead one to suspect that all is not well. There is a specific way to check for missing studies when doing a systematic search of the literature for a meta-analysis - this is known as a funnel plot. Here the treatment effect recorded in each study is plotted against some measure of study size (usually either the total sample size or the standard error of the treatment effect). If all studies have been found, you should get a greater spread of values for the estimated effect among trials with a small sample size, than among trials with a large sample size. Moreover, the distribution should be symmetrical. A non-symmetrical distribution indicates that some studies are 'missing'.



Estimate missing values

    Sometimes the best option is to try to estimate (or impute) what the value of the missing observation(s) would have been. Critically, all of these methods make certain assumptions.

    1. All simple methods (and most of the more sophisticated ones) assume that observations are missing at random. In other words, it is assumed that the likelihood of missing an observation is unrelated to its value.
    2. All simple methods assume that the underlying biological model is additive. In other words the treatment will increase the response variable by (say) 10 units, rather than double or treble it. If the underlying biological model is multiplicative, then the data should be log transformed before estimating missing values.
    3. When you come to methods for particular experimental designs, you will have to assume there is no interaction between the treatment factor and the blocking factor(s) in your experiment.

    How you estimate the missing value depends on the design of the study:

    1. Time series data

      Other than inserting some arbitrary value, such as the mean or zero, the simplest way to fill in time series data is the last observation carried forward method - where the previous observation in the time series is simply repeated. Although widely used in clinical follow-ups, this method cannot be recommended. If there is any trend in the data over time, or the last observation is atypical, the estimated values are liable to be biased - and the variability of the data set will be affected.
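A minimal sketch of last observation carried forward shows why it is biased when there is a trend. The series is hypothetical, and `None` is used here as an assumed marker for a missing observation.

```python
def locf(series):
    """Fill None gaps by repeating the most recent observed value.

    Gaps at the very start (with no earlier observation) stay as None.
    """
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical series rising by 2 units per step, with two gaps.
observed = [10, 12, None, 16, None, 20]
filled = locf(observed)
print(filled)  # [10, 12, 12, 16, 16, 20]
```

Both filled values (12 and 16) fall below the values the steady trend would suggest (14 and 18), which is the bias described above.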

      If there is any indication of a trend over time, a better estimate for repeated measures data is obtained using the arithmetic mean of the previous and following observations. This is known as linear interpolation - and is adequate, providing there is a roughly linear trend over the time period considered. If the trend is not linear, a curvilinear response can be fitted to the data, for example using spline fits. These are especially useful if the missing point is at a peak or trough. Alternatively, running means can be used to estimate the missing value.

Worked example

Month            Jan  Feb  Mar  Apr  May  Jun  Jly  Aug  Sep  Oct  Nov  Dec
Prevalence (%)    15   23   25   29   33   27   22   18   15    9    7   12

The first figure below shows a data set of prevalence values with no missing values. Prevalence increases from January, to reach a peak in May, and then declines to a minimum in November. If there were a missing observation at a time of increase or decrease, then linear interpolation would give a good estimate of the missing value. But if there were a missing value in May, linear interpolation would give a rather poor estimate - as can be seen in the second figure:

{Fig. 1}

The only way to get a better estimate would be to fit a function to the data. The third figure shows what we get using a spline fit to the data. This predicts that the prevalence peaks in May at 30%. This is a more reasonable estimate of the missing value.
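The linear interpolation part of this worked example can be checked with a few lines of code. The prevalence values are those in the table above (the last four are read as 15, 9, 7 and 12). Each month is treated as missing in turn and estimated from its two neighbours. The function name is illustrative only, and the spline fit is not reproduced here.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jly", "Aug", "Sep", "Oct", "Nov", "Dec"]
prevalence = [15, 23, 25, 29, 33, 27, 22, 18, 15, 9, 7, 12]


def interpolate(values, i):
    """Estimate values[i] as the arithmetic mean of its two neighbours."""
    return (values[i - 1] + values[i + 1]) / 2


estimates = {}
for name in ("Mar", "May"):
    i = months.index(name)
    estimates[name] = interpolate(prevalence, i)
    error = estimates[name] - prevalence[i]
    print(f"{name}: observed {prevalence[i]}, "
          f"interpolated {estimates[name]:.1f}, error {error:+.1f}")

# Mar: observed 25, interpolated 26.0, error +1.0
# May: observed 33, interpolated 28.0, error -5.0
```

On the rising part of the curve (March) the estimate is within one unit of the observed value. At the peak (May) it is five units too low.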

    2. Designed experiments

    Simple mean substitution can be used if you have replicated observations within each cell of your experimental design. Missing observations are replaced with the mean (or median) value of the group to which the individual belongs. This approach is used in a number of software packages that will accommodate missing data points. The advantage is that it can be used to restore balance to your design when you have no other information on which to base your estimate. The disadvantage of mean substitution is that it reduces the variability of the data. It is inappropriate for time series data.
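The loss of variability caused by mean substitution is easy to demonstrate. The sketch below uses hypothetical replicate values for one cell of a design, with `None` as an assumed marker for a missing observation.

```python
from statistics import mean, variance


def substitute_group_mean(values):
    """Replace each None with the mean of the observed values in the group."""
    observed = [v for v in values if v is not None]
    group_mean = mean(observed)
    return [group_mean if v is None else v for v in values]


# Hypothetical replicates for one treatment group, two of them missing.
group = [4.0, 6.0, None, 8.0, None]

observed_only = [v for v in group if v is not None]
filled = substitute_group_mean(group)

var_observed = variance(observed_only)  # sample variance of the 3 real values
var_filled = variance(filled)           # sample variance after substitution

print(filled)                           # [4.0, 6.0, 6.0, 8.0, 6.0]
print(var_observed, var_filled)         # 4.0 2.0
```

The group mean is unchanged, but the sample variance is halved, because the substituted values contribute no deviation from the mean while still counting towards the sample size.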

    In many experimental designs, widely used in ecological and agricultural applications, you have only a single replicate per cell. Such designs include the randomized block and Latin square designs which we describe in Unit 7. You can estimate missing values for these designs using weighted mean substitution. Formulae for a single missing value are given below.

randomized block design

$$\text{Missing point} = \frac{aT + bB - S}{(a - 1)(b - 1)}$$

  • a = number of treatments,
  • b = number of blocks,
  • T = sum of items in same treatment as missing observation,
  • B = sum of items in same block as missing observation,
  • S = sum of all observations in the experiment.
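As a sketch of how the randomized block formula is applied, the code below estimates one missing cell. Here T is the total of the observed values for the same treatment as the missing observation, B the total for the same block, and S the total of all observed values. The data are hypothetical and deliberately constructed to be perfectly additive (treatment effect plus block effect), so the formula should return the removed value exactly. The table layout and function name are assumptions made for the illustration.

```python
def rb_missing_value(table, missing):
    """Estimate one missing cell in a randomized block design.

    table[i][j] holds the value for treatment i in block j;
    the missing cell contains None and its index is given by `missing`.
    """
    a = len(table)       # number of treatments
    b = len(table[0])    # number of blocks
    ti, bj = missing
    T = sum(v for v in table[ti] if v is not None)            # treatment total
    B = sum(row[bj] for row in table if row[bj] is not None)  # block total
    S = sum(v for row in table for v in row if v is not None) # grand total
    return (a * T + b * B - S) / ((a - 1) * (b - 1))


# Hypothetical additive data: treatment effects 0, 2, 5 added to
# block effects 10, 12, 15, 11. The removed value was 2 + 15 = 17.
table = [
    [10, 12, 15, 11],
    [12, 14, None, 13],
    [15, 17, 20, 16],
]

estimate = rb_missing_value(table, (1, 2))
print(estimate)  # 17.0
```

With real data the estimate will not reproduce the lost value, but it is the value that leaves the treatment and block means unbiased under the additive, no-interaction assumptions listed earlier.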

Latin square design

$$\text{Missing point} = \frac{a(R + C + T) - 2S}{(a - 1)(a - 2)}$$

  • a = number of treatments,
  • R = sum of the row containing the missing observation,
  • C = sum of the column containing the missing observation,
  • T = sum of the treatment containing the missing observation,
  • S = sum of all observations in that square.
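The Latin square formula can be sketched in the same way. The 3 x 3 square below is hypothetical and perfectly additive (row effect plus column effect plus treatment mean), so the formula should recover the removed value exactly. The letters A, B and C and the function name are assumptions made for the illustration.

```python
def latin_square_missing_value(values, treatments, missing):
    """Estimate one missing cell in an a x a Latin square.

    values[i][j]     - observation in row i, column j (None if missing)
    treatments[i][j] - treatment label applied in row i, column j
    """
    a = len(values)
    mi, mj = missing
    label = treatments[mi][mj]
    R = sum(v for v in values[mi] if v is not None)
    C = sum(values[i][mj] for i in range(a) if values[i][mj] is not None)
    T = sum(values[i][j] for i in range(a) for j in range(a)
            if treatments[i][j] == label and values[i][j] is not None)
    S = sum(v for row in values for v in row if v is not None)
    return (a * (R + C + T) - 2 * S) / ((a - 1) * (a - 2))


treatments = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

# Hypothetical additive data: row effects 0, 2, 1; column effects 0, 1, 3;
# treatment means A = 10, B = 15, C = 20. The removed value was 2 + 1 + 20 = 23.
values = [
    [10, 16, 23],
    [17, None, 15],
    [21, 12, 19],
]

estimate = latin_square_missing_value(values, treatments, (1, 1))
print(estimate)  # 23.0
```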

If you have more than one missing value, you must use an iterative method. Using this method will enable you to get an unbiased estimate of the treatment means. However, further corrections and adjustments need to be made when assessing the significance of treatment effects. A worked example of estimating a missing point in a Latin square experiment is given in the examples.
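The text does not say which iterative method is meant. One commonly described scheme, sketched here for a randomized block design, is to start each missing cell at the mean of the observed values and then repeatedly re-estimate each missing cell with the single-value formula, treating the current estimates of the other missing cells as if they were data. The data are hypothetical and additive, and the function names, number of sweeps and tolerance are assumptions.

```python
def rb_estimate(table, ti, bj):
    """Single-missing-value formula for cell (ti, bj), using the current
    contents of every other cell (observed or provisionally filled)."""
    a, b = len(table), len(table[0])
    T = sum(table[ti][j] for j in range(b) if j != bj)
    B = sum(table[i][bj] for i in range(a) if i != ti)
    S = sum(table[i][j] for i in range(a) for j in range(b)
            if (i, j) != (ti, bj))
    return (a * T + b * B - S) / ((a - 1) * (b - 1))


def iterate_missing(table, sweeps=50, tol=1e-9):
    """Estimate all None cells by repeated use of the single-value formula."""
    missing = [(i, j) for i, row in enumerate(table)
               for j, v in enumerate(row) if v is None]
    observed = [v for row in table for v in row if v is not None]
    start = sum(observed) / len(observed)
    work = [[start if v is None else v for v in row] for row in table]
    for _ in range(sweeps):
        biggest_change = 0.0
        for ti, bj in missing:
            new = rb_estimate(work, ti, bj)
            biggest_change = max(biggest_change, abs(new - work[ti][bj]))
            work[ti][bj] = new
        if biggest_change < tol:
            break
    return {cell: work[cell[0]][cell[1]] for cell in missing}


# Hypothetical additive data (treatment effects 0, 2, 5; block effects
# 10, 12, 15, 11) with two values removed: 17 at (1, 2) and 15 at (2, 0).
table = [
    [10, 12, 15, 11],
    [12, 14, None, 13],
    [None, 17, 20, 16],
]

estimates = iterate_missing(table)
for cell, value in estimates.items():
    print(cell, round(value, 6))
```

Because these data are exactly additive, the iteration settles on the two removed values. The sketch covers only the estimation step, not the further corrections needed when testing the significance of treatment effects.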



More sophisticated approaches

    Broadly speaking there are two rather more sophisticated multivariate methods to cope with missing values. However, they still generally assume that such values are missing at random, and may require knowledge of covariate values for the missing individual. These methods are therefore not appropriate in many situations.

    1. Maximum likelihood methods

      These do not estimate missing values. Only complete records are analysed, but special weights are assigned to those records based on the pattern of missing data. One example of this is maximum likelihood estimation using the EM (expectation-maximization) algorithm. These methods have been widely available in software packages for some years.

    2. Multiple imputation methods

      These do estimate missing values, but not just a single replacement value. Instead a number (n) of different possible imputed values are calculated for each missing value using a suitable statistical model derived from the patterns in the data. Multiple imputation gives n possible complete data sets, each of which is analysed in turn by the chosen statistical method (such as analysis of variance). The n intermediate results are then pooled to yield a final result, and an estimate of its uncertainty. Software packages are now available which adopt this approach.
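The text does not say how the n intermediate results are pooled. A widely used convention, assumed in this sketch, is Rubin's rules: the pooled estimate is the mean of the n estimates, and its variance is the mean within-imputation variance plus (1 + 1/n) times the between-imputation variance. The estimates and variances below are hypothetical, and the imputation step itself is not shown.

```python
from math import sqrt
from statistics import mean, variance


def pool(estimates, variances):
    """Pool n imputation-specific results using Rubin's rules (assumed here).

    estimates - the n point estimates, one per imputed data set
    variances - the n squared standard errors of those estimates
    """
    n = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance between imputations (divisor n - 1)
    total_variance = within + (1 + 1 / n) * between
    return pooled_estimate, total_variance


# Hypothetical results from n = 5 imputed data sets.
estimates = [4.8, 5.2, 5.0, 5.4, 4.6]
variances = [0.25, 0.25, 0.25, 0.25, 0.25]

pooled_estimate, total_variance = pool(estimates, variances)
print(f"pooled estimate = {pooled_estimate:.2f}")
print(f"pooled standard error = {sqrt(total_variance):.3f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing values. A single imputed data set analysed on its own would report only the within-imputation variance, and so would overstate precision.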

    Note that when you use any of these methods, you have not recovered the missing information. All you have done is to make the best, or most unbiased, use of the remaining data. These methods all have one thing in common: your final analysis is only as good as your model, and the remainder of your data. If too much of your data is estimated, your analysis will reflect your assumptions, rather than your data.

    It is sometimes best to carry out an analysis using several very different methods of coping with the missing data. If the conclusion remains the same, irrespective of the methods used, we can have considerably more confidence in it. This is known as sensitivity analysis.

Constant missing responses and extreme case analysis

    These are the methods you should use when observations are not missing at random, for example in clinical trials where withdrawals are caused by side effects of the drugs. In this situation missing responses are commonly assigned a constant missing response - namely treatment failure. This reduces the chance of falsely showing a difference between treatments when there is no difference - but it is a conservative approach. In other words you may fail to show a difference between treatments when one really exists.

    An even more conservative method is extreme case analysis. It is the only way you can be certain that the missing observations are not affecting the outcome. The approach is mainly used where the response variable is measured on the binary or (rarely) the ordinal scale. Missing observations for the group faring better are classed as failures, those for the group faring worse are classed as successes, and the results re-analysed. If the conclusions of the trial are unaffected even by extreme case analysis, we can be reasonably certain that missing values are not biasing the result.
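The two approaches can be compared on a binary outcome with a short sketch. The counts are hypothetical, and deciding which group is 'faring better' from the complete cases is an assumption made for the illustration.

```python
def success_rates(group_a, group_b):
    """Success rates for two groups under three treatments of missing data.

    Each group is a dict with counts for 'success', 'failure' and 'missing'.
    Returns a dict mapping method name to (rate_a, rate_b).
    """
    groups = (group_a, group_b)
    totals = [g["success"] + g["failure"] + g["missing"] for g in groups]

    # 1. Complete cases only: missing responses are ignored.
    complete = tuple(g["success"] / (g["success"] + g["failure"])
                     for g in groups)

    # 2. Constant missing response: every missing response is a failure.
    constant = tuple(g["success"] / n for g, n in zip(groups, totals))

    # 3. Extreme case: missing responses are failures in the better group
    #    and successes in the worse group.
    a_is_better = complete[0] >= complete[1]
    extreme = tuple(
        (g["success"] + (0 if better else g["missing"])) / n
        for g, n, better in zip(groups, totals, (a_is_better, not a_is_better))
    )
    return {"complete cases": complete,
            "missing = failure": constant,
            "extreme case": extreme}


# Hypothetical trial: 100 participants per arm, 10 missing in each.
drug = {"success": 60, "failure": 30, "missing": 10}
control = {"success": 45, "failure": 45, "missing": 10}

results = success_rates(drug, control)
for method, (rate_drug, rate_control) in results.items():
    print(f"{method:18s} drug {rate_drug:.2f}  control {rate_control:.2f}")
```

In this invented example the drug group still has the higher success rate under extreme case analysis (0.60 against 0.55), although the gap is far smaller than in the complete cases. Whether such a difference remains significant would have to be checked by re-running the chosen test on the reassigned counts.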

    A good worked example of extreme case analysis is given in one of the medical examples.