Can we ignore missing data?
In any research, you nearly always end up with some missing data. These are observations which are specified in your study design, but for which you are unable to obtain a reading. This may occur when you randomly select a sample for a questionnaire survey - but many of the questionnaires are not returned. Or participants in a clinical trial may drop out of the study before it is complete. Or you may be unable to collect regular monitoring data because all the roads to your field site are flooded.
The commonest response to such missing data is (effectively) to ignore them. This can be done by using complete case analysis, in which you only analyse cases for which there are complete data. Cases lacking data on any explanatory variable are deleted - this is known as casewise (or listwise) deletion. Another approach is pairwise deletion, in which a case is excluded only from those analyses that involve the particular variable missing for that case. If the missing observation is part of a matched group (for example, in veterinary trials animals in the different experimental groups are often matched by age), then the matched observations in the other groups are also deleted (matched deletion).
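As an illustration, here is a minimal sketch of casewise and pairwise deletion using pandas; the data frame and its variables are entirely hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: NaN marks a missing observation.
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 45, 29],
    "weight": [70, np.nan, 82, 77, 64],
    "dose":   [1.2, 0.8, 1.0, np.nan, 1.1],
})

# Casewise (listwise) deletion: keep only fully complete cases.
complete = df.dropna()
print(len(complete))   # only 2 of the 5 cases survive

# Pairwise deletion: each analysis uses every case that is complete
# for the variables involved, so different analyses use different n.
n_age_weight = df[["age", "weight"]].dropna().shape[0]   # 3 cases
n_age_dose = df[["age", "dose"]].dropna().shape[0]       # 3 cases
```

Note how pairwise deletion retains more data for each analysis, at the cost of the sample changing from one analysis to the next.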
However, as a rule, it is very unwise to just ignore missing observations - for the following reasons:
It can be very wasteful of data.
If you are gathering repeated measures data, over say 50 occasions, nearly all individuals may be affected by one or more missing observations. Discarding all data on a case as soon as one observation is missing may well result in very few cases remaining in the study.
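This attrition is easy to quantify. Assuming, purely for illustration, that each occasion is missing independently with a probability of 5%, the chance that a case has no missing observations at all is:

```python
# Assumed for illustration: each of 50 occasions is missing
# independently with probability 0.05.
p_missing = 0.05
n_occasions = 50

# Probability that a single case has a completely full record:
p_complete = (1 - p_missing) ** n_occasions
print(round(p_complete, 3))   # ~0.077: under 8% of cases survive casewise deletion
```

Even a low per-occasion missingness rate therefore leaves very few complete cases once the number of occasions is large.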
In clinical trials it conflicts with the intention to treat principle.
Under this strategy, patients are compared in the groups to which they were originally randomly assigned, irrespective of whether they actually took part in the trial, whether they received the correct treatment, whether they withdrew from the trial, or whether they deviated from the trial protocol.
This is done for two reasons:
- To maintain the benefits of the original randomization in terms of lack of bias in treatment allocation.
- To allow for the fact that in clinical practice there will also be mistakes and withdrawals.
Not surprisingly, this means there may be a number of patients who must be included in the analysis, but for whom there are no data!
Some analytical methods cannot cope with missing observations.
This is especially true for some approaches to time series analysis, and for experimental designs with several blocking factors, such as Latin squares. If any observations are missing, the design becomes unbalanced and more difficult to analyse.
But the most important reason for not ignoring missing data is that it can severely bias the results.
This is especially so if the observations are not missing at random.
Is it missing at random?
If an observation is missing at random, it means that its absence is independent of the outcome being studied. If it is not missing at random, then ignoring it will bias your result. Therefore we clearly need to assess why a particular observation is missing. We will look at three examples to make this point clear.
Consider a study on the incidence of a notifiable disease. Farmers are asked to complete a questionnaire on the number of cases they have had on their farm over the past five years. Five hundred farms are selected at random for the study. But responses are received from only two hundred and twenty of the farms. If the answers of the responders were similar to the answers of the non-responders, there would be no problem (apart from the smaller sample size). But what if the reason some of the farmers did not respond was precisely because the disease had occurred on their farm - and they had not reported it to the Ministry? In this situation the absence of a response is not independent of the outcome, resulting in non-response bias.
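The scale of such non-response bias can be sketched numerically. The figures below are hypothetical, chosen so that the expected number of responders is close to the 220 in the example, with diseased farms assumed much less likely to respond:

```python
# Hypothetical figures: 500 farms, true disease incidence 30%;
# diseased farms respond with probability 0.20, disease-free with 0.55.
n_farms, incidence = 500, 0.30
n_diseased = n_farms * incidence        # 150 farms
n_healthy = n_farms - n_diseased        # 350 farms

# Expected responders in each group:
resp_diseased = n_diseased * 0.20       # 30 farms
resp_healthy = n_healthy * 0.55         # 192.5 farms -> ~222 responders in all

# Incidence estimated from the responders alone:
estimated = resp_diseased / (resp_diseased + resp_healthy)
print(round(estimated, 2))              # 0.13 - less than half the true 30%
```

Because the diseased farms are under-represented among responders, the estimate is badly biased downwards, and no amount of extra responders of the same kind would fix it.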
Now consider a clinical trial to compare two nasal spray preparations for asthma. During the trial one of the treatments (A) gives rise to unpleasant side effects, causing some patients to withdraw from the trial. These patients therefore do not improve at all. For those that remain in the trial, spray A proves to be rather more effective, as can be seen in the table below:
[Table: Effect of treatment with two nasal spray preparations for asthma - numbers of patients for whom each treatment was effective or ineffective, numbers withdrawn, and % cure for patients remaining in the trial and for all patients.]
If we ignore those who withdrew because of side effects, we get a cure rate of 73.3% for spray A compared to only 54.1% for spray B. But, because these withdrawals were not missing at random, this is an example of participation bias. It would be quite misleading to conclude that 73.3% of all patients receiving treatment A would be cured if many of them refused to take the treatment! In this case we can correct our analysis by including the withdrawals as treatment failures. If we do this, we find that there is little or no difference between treatment A and treatment B.
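The correction can be shown with hypothetical counts chosen to match the percentages quoted above (11/15 = 73.3% for spray A among those remaining; 20/37 = 54.1% for spray B):

```python
# Hypothetical counts consistent with the quoted percentages:
# spray A: 11 effective, 4 ineffective, 5 withdrawn (side effects)
# spray B: 20 effective, 17 ineffective, 0 withdrawn
a_eff, a_ineff, a_wd = 11, 4, 5
b_eff, b_ineff, b_wd = 20, 17, 0

# Complete case analysis: withdrawals ignored.
cure_a = a_eff / (a_eff + a_ineff)        # 0.733 -> 73.3%
cure_b = b_eff / (b_eff + b_ineff)        # 0.541 -> 54.1%

# Intention to treat: withdrawals counted as treatment failures.
itt_a = a_eff / (a_eff + a_ineff + a_wd)  # 0.550 -> 55.0%
itt_b = b_eff / (b_eff + b_ineff + b_wd)  # 0.541 -> 54.1%
```

With the withdrawals counted as failures, spray A's apparent advantage all but disappears.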
Lastly, consider a long term study to assess the effect of climatic factors on mortality rates of an insect pest. Population parameters are monitored using mark-release-recapture carried out every month over a year, except during the months of April and May - when access is difficult because of flooding. If we just use the data from the remaining ten months, we have to assume that the relationship is the same in the two missing months. But the reason the data were missing was directly related to a factor that may affect the mortality rate - namely rainfall. So the absence of data may not be independent of the outcome.
Therefore, whenever missing observations occur, the most important thing is to assess why those observations are missing.
If they genuinely are missing at random, you can then either ignore missing observations, or estimate missing values if you need to restore balance to your design.
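The simplest way to estimate a missing value is to replace it with the mean of the observed values. A pandas sketch, with hypothetical yields (in a formal analysis you would also subtract one residual degree of freedom for each value estimated this way):

```python
import numpy as np
import pandas as pd

# Hypothetical yields from a balanced design with one missing plot.
yields = pd.Series([4.2, 3.9, np.nan, 4.4, 4.1])

# Replace the missing plot with the mean of the observed values
# (pandas skips NaN when computing the mean by default).
filled = yields.fillna(yields.mean())
print(filled.iloc[2])   # ~4.15, the mean of the four observed values
```

This restores the balance of the design, but remember it is only justified if the value really is missing at random.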
If they are not missing at random, you need to adopt a different approach, using either a constant missing response or extreme case analysis.
But before you consider these methods in detail, we need to ask one more question...
Do you always recognize missing observations?
There is only one thing worse than having missing data - and that is having missing data, but not knowing about it! You might think this is very unlikely. But there are many instances where it is a serious problem.
- In ecological research, it is not unusual to find missing observations masquerading as zero readings. For example if a rain gauge has not been checked or has malfunctioned, a zero reading may be wrongly recorded, instead of a missing observation.
- In radio telemetry studies, cessation of the signal may be taken to wrongly indicate death of the animal when, instead, the radio tag has stopped working - or the animal has left the area being surveyed.
- In trap catches of insects, cages with holes may be wrongly recorded as having a zero catch, rather than as a missing data point.
- A similar situation arises when searching for studies to carry out a meta-analysis. The studies you fail to find may vary in some systematic way from the ones you do find - again biasing your result.
Cryptic missing data such as these can give very misleading results - and should be avoided at all costs. Unnoticed cryptic missing data contribute to an insidious and intractable form of variation and bias - sometimes referred to as 'measurement error'. This includes degraded, or 'partly' missing observations - such as where no one records when foraging ants remove most of the flies from your sampling traps!
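One defence against cryptic zeros like the rain gauge example is to record whether each reading was actually taken, and to convert suspect zeros into explicit missing values. A pandas sketch, with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical rain-gauge log: 'checked' records whether the gauge
# was actually inspected on that day.
log = pd.DataFrame({
    "rain_mm": [0.0, 3.2, 0.0, 0.0],
    "checked": [True, True, False, True],
})

# A zero from an unchecked gauge is cryptic missing data, not a dry day:
# convert it to NaN so it cannot masquerade as a real reading.
log.loc[(log["rain_mm"] == 0) & (~log["checked"]), "rain_mm"] = np.nan

print(log["rain_mm"].isna().sum())   # 1 suspect zero flagged as missing
```

Recording the status of each reading at collection time makes this kind of check possible; without it, the suspect zero is indistinguishable from a genuine one.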
Missing observations, like death and taxes, are an unavoidable fact of life.
Any large body of data can be expected to have some missing, or otherwise unusable, observations.
You cannot deal with missing observations if you do not anticipate them - and actively search for them.