"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Methods for coping with missing data: Use and misuse

(non-response bias, intention to treat, last observation carried forward, linear interpolation, multiple imputation)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

Missing data points occur in virtually all studies, but they are only routinely reported in questionnaire surveys and in clinical trials. In most other studies they are 'quietly ignored' - indeed it has been suggested that the most common practice is for researchers to fill in the missing value when nobody is looking! All too often this means unthinkingly inserting one or more zeroes, then simply forgetting the matter. In ecological studies it is rare to find that missing data are even admitted, let alone the reasons for their absence discussed. Hence the researchers in the examples we selected should be applauded for at least acknowledging their existence - even if their solutions were not always the best.

On response bias, we have some examples of good practice in maximizing the response rate. But in other cases questionnaires are much too long, and there is a high degree of complacency about the resulting non-response bias. In some examples we find that authors focus on the response rate to the questionnaire as a whole, rather than on the response rate to specific key questions - which may be much lower! Or we are told there is a high response rate, but only among participants who have already volunteered for the study! Too much faith is often put in a comparison of the characteristics of responders and non-responders, when those characteristics have nothing to do with the questions being asked. Most importantly, there is seldom any discussion of why individuals did not respond.
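The distinction between the overall response rate and the response rate to individual questions can be made concrete with a small sketch. The data below are entirely hypothetical (the variable names `age`, `income` and `smokes` are illustrative, not from any study cited here); the point is simply that a respectable overall rate can hide severe item non-response on a sensitive question.

```python
# Hypothetical questionnaire returns: None marks a question left blank.
responses = [
    {"age": 34, "income": None,  "smokes": "no"},
    {"age": 51, "income": None,  "smokes": "yes"},
    {"age": 28, "income": 42000, "smokes": None},
    {"age": 45, "income": None,  "smokes": "no"},
]
mailed = 5                    # questionnaires sent out (assumed)
returned = len(responses)     # questionnaires returned

overall_rate = returned / mailed
print(f"overall response rate: {overall_rate:.0%}")   # 80%

# Item-level response rates may tell a very different story:
for q in ["age", "income", "smokes"]:
    answered = sum(r[q] is not None for r in responses)
    print(f"{q}: {answered / mailed:.0%}")
```

Here the headline rate of 80% conceals the fact that only one respondent in five answered the income question - exactly the situation the paragraph above warns about.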

Our examples on clinical trials include a review which shows that the term 'analysed by intention to treat' continues to be misused. It sometimes appears to be purely decorative, since individuals who withdrew are still excluded from the analysis. But we also give examples of good practice, either where all withdrawals were treated as treatment failures, or (perhaps better) where extreme case analysis was used. In most other research data are simply assumed to be missing at random, often unjustifiably, and a variety of methods are used to estimate the missing value(s). We give examples ranging from the tolerably good (linear interpolation) to the bad (last observation carried forward) to the truly awful (mean substitution for time series data). The more sophisticated 'multiple imputation' approach requires the user to model the distribution of each variable with missing values, in terms of the observed data. This can give excellent results, but depends on such modelling being done carefully and appropriately!
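The three single-imputation methods mentioned above can be contrasted in a few lines of code. This is a minimal sketch using invented data (a steadily rising series with one missing point), not any dataset from the examples; it shows why, for trended time series, linear interpolation is tolerable while last observation carried forward and mean substitution distort the trend.

```python
def locf(series):
    """Last observation carried forward: replace each None
    with the most recent observed value."""
    out, last = [], None
    for x in series:
        out.append(last if x is None else x)
        if x is not None:
            last = x
    return out

def linear_interpolate(series):
    """Replace each internal None by interpolating linearly
    between the nearest observed neighbours."""
    out = list(series)
    for i, x in enumerate(out):
        if x is None:
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = out[lo] + (out[hi] - out[lo]) * (i - lo) / (hi - lo)
    return out

def mean_substitute(series):
    """Replace each None with the mean of all observed values -
    this ignores the trend entirely, hence 'truly awful' here."""
    observed = [x for x in series if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]

y = [10, 20, None, 40, 80]       # rising series, one missing point
print(locf(y))                   # [10, 20, 20, 40, 80]
print(linear_interpolate(y))     # [10, 20, 30.0, 40, 80]
print(mean_substitute(y))        # [10, 20, 37.5, 40, 80]
```

Interpolation recovers a value consistent with the local trend (30), LOCF flattens the trend (20), and mean substitution imports information from the whole series regardless of when the gap occurred (37.5). None of this, of course, addresses *why* the point is missing - which is the question multiple imputation forces the analyst to confront.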


What the statisticians say

Milliken & Johnson (1992) look at the analysis of data with missing points derived from designed experiments. Little & Rubin (1987) is the standard (technical) text for coping with missing data. McKnight et al. (2007) give a more gentle introduction to the topic, emphasizing the need to avoid missing data in the first place.

Peduzzi et al. (2002) and Hollis & Campbell (1999) explain what is meant by intention to treat, and also summarize methods to cope with missing data. Roland & Torgerson (1998) explain what pragmatic trials are. Schulz & Grimes (2002) and Knatterud (2002) look at how to deal with missing data in clinical trials. Barzi et al. (2006) provide a case study in multiple imputation of missing values in a cohort study. Acock et al. (2005) provide an excellent review of strategies for missing values in survey analysis, whilst Edwards et al. (2002) look at methods to increase response rates to postal questionnaires. Rubin (1976) provides a mathematical treatment of the issue of inference and missing data, focusing on the process that causes the missing data. Horton & Kleinman (2007) review methods to deal with missing data, whilst Sterne et al. (2009) consider the potential and pitfalls of multiple imputation for missing data in epidemiological and clinical research.

Csada et al. (1996) claim to have demonstrated the existence of publication bias against non-significant results in the ecological literature - although their evidence is disputed by Bauchau (1997). Shaw & Mitchell-Olds (1993) describe methods for ecologists to use when dealing with missing observations in analysis of variance (though note that the process causing missingness is not considered). Shearer (1973) provides the standard formulae for estimating single missing points in various experimental designs, as well as the iterative techniques used when multiple points are missing.

Wikipedia provides sections on missing values, missing at random, imputation, participation bias and funnel plots. David C Howell and Burke both provide good reviews of missing data and multiple imputation. Other online sources of information include G. David Carson and Researcher Development Initiative.