"It has long been an axiom of mine that the little things are infinitely the most important"
Quantiles as summary statistics: Use and misuse
(box and whisker plots, interquartile range, outliers, quantile quantile plots)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Quantiles are under-used by researchers in providing summary statistics, largely because of the addiction of most scientists to the arithmetic mean and the standard deviation. These latter statistics are used even when such measures are wholly inappropriate, such as with heavily skewed distributions. We give a number of examples of the correct use of quantiles and box-and-whisker plots, including several where the box-and-whisker plot is given alongside a dot plot showing the full frequency distribution.
Whilst box-and-whisker plots can be very useful for comparative purposes, this is only true if the full distribution has been examined first. This is especially true if the range indicates maximum and minimum, since it provides no information on an important part of the distribution - between the quartiles and the range. Use of the 95%, 90% or 80% reference range, or 1.5 times the interquartile range, is preferable, with outliers shown individually. It is, of course, important to label exactly what the 'range' or whiskers represent - without labelling, the figure can be highly misleading.
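As a rough illustration of the 1.5 times interquartile range convention, the sketch below computes the quartiles, places each whisker at the most extreme observation lying within 1.5 × IQR of the box, and lists everything beyond that individually as outliers. The function name `box_summary`, the sample values and the choice of the "inclusive" quartile method are assumptions made for this example; other quartile definitions give slightly different fences.

```python
from statistics import quantiles


def box_summary(values, k=1.5):
    """Quartiles, whisker ends and outliers for a box-and-whisker plot.

    Whiskers stop at the most extreme observations that lie within
    k * IQR of the box; anything beyond is reported individually.
    """
    data = sorted(values)
    # Quartiles by linear interpolation (one of several common definitions).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical, heavily right-skewed sample.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = box_summary(sample)
print(summary)
```

With these values the upper whisker ends at 9 and the value 50 is shown separately, whereas a whisker drawn to the maximum would hide the gap between 9 and 50. Whatever rule is used should be stated in the figure legend.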
A common problem is to find a mismatch between the statistics displayed (whether as a box-and-whisker plot or a range plot), and the reported statistical analysis. If it is the difference between arithmetic means that is being tested, then means should appear in the figure or table. If values are log transformed, then the geometric mean is the appropriate statistic. If one of the non-parametric tests is used, then the median is usually the most appropriate statistic.
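The pairing of analysis and displayed statistic described above can be sketched as a small lookup. The function name `matching_statistic`, the three analysis labels and the sample values are hypothetical and chosen only for illustration; the point is simply that the three statistics can differ greatly for skewed data, so the one displayed should be the one being tested.

```python
from statistics import geometric_mean, mean, median


def matching_statistic(analysis, values):
    """Return the summary statistic that corresponds to the analysis used.

    'raw'           : test on arithmetic means of untransformed values
    'log'           : test on log-transformed values
    'nonparametric' : rank-based test
    """
    if analysis == "raw":
        return mean(values)
    if analysis == "log":
        # Back-transformed mean of the logs; requires strictly positive values.
        return geometric_mean(values)
    if analysis == "nonparametric":
        return median(values)
    raise ValueError(f"unknown analysis type: {analysis!r}")


# Hypothetical right-skewed sample.
skewed = [1, 10, 100]
for kind in ("raw", "log", "nonparametric"):
    print(kind, matching_statistic(kind, skewed))
```

For this sample the arithmetic mean is 37, while the geometric mean and the median are both 10, so a figure showing medians alongside a test of arithmetic means would describe a different quantity from the one analysed.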
We have given a few examples from the literature of the use of quantile-quantile plots to assess distributions, but it has to be admitted that they are still very rare! This is despite their advocacy in two well known texts on graphical display of data. Even where they are used, there is still a tendency to just condense the information to a simple summary statistic (such as the difference between the plot and the y = x line), rather than interpret them to bring out differences between the distributions.
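To make the idea of interpreting, rather than condensing, a quantile-quantile plot concrete, the sketch below pairs matching quantiles from two samples; plotting these pairs gives the empirical quantile-quantile plot. The function name `qq_points`, the use of deciles and the two samples are assumptions made for this example.

```python
from statistics import quantiles


def qq_points(sample_x, sample_y, n=10):
    """Pair the matching quantiles of two samples.

    Returns a list of (x_quantile, y_quantile) tuples; with n=10 these
    are the nine deciles of each sample.
    """
    qx = quantiles(sample_x, n=n, method="inclusive")
    qy = quantiles(sample_y, n=n, method="inclusive")
    return list(zip(qx, qy))


# Hypothetical samples: the second has every value doubled.
control = list(range(1, 21))
treated = [2 * v for v in control]

points = qq_points(control, treated)
for x, y in points:
    print(f"{x:6.2f} {y:6.2f} ratio={y / x:.2f}")
```

Here every point lies on a straight line through the origin with slope 2, which suggests a multiplicative difference between the distributions. Points lying parallel to the y = x line would instead suggest an additive shift, and a curved pattern a difference in shape. A single summary distance from the y = x line would not distinguish these cases.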
What the statisticians say
Chambers et al. (1983) and Cleveland (1989) (1993) provide excellent coverage of box-and-whisker plots, quantile scatterplots and quantile-quantile plots. Woodward (1999) has a good section on descriptive techniques for quantitative variables which includes quantiles, the five quantile summary and box-and-whisker plots. Griffiths et al. (1998) give a good account of the use of quantiles, box-and-whisker plots and the five quantile summary for summarizing data.
Altman & Bland (1996) provide a useful summary of the properties and uses of quantiles in medical research whilst Frigge et al. (1989) detail the uses of box-and-whisker plots for exploratory data analysis. Wartenberg & Northridge (1991) advocate the use of quantile-quantile plots in case-control studies for exploring and comparing the distribution of exposure among cases and controls.
Wikipedia has sections on quantiles, box-and-whisker plots and quantile-quantile plots. NetMBA provides a short section on interpretation of boxplots. NIST/SEMATECH provides a good account of quantile-quantile plots.