"It has long been an axiom of mine that the little things are infinitely the most important"
Frequency distributions and their display: Use and misuse
(class based methods, histograms, frequency polygons, line and bar diagrams, pie charts, jittered dot plots, rank scatterplots, empirical cumulative distribution functions)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
One of the first things one learns in 'stats' is how to display frequency distributions, commonly using histograms for measurement variables and pie charts for nominal variables. Both methods are indeed heavily used, but are emphatically not recommended. Histograms are a poor way to display distributions because the class intervals are defined arbitrarily, and that choice invariably biases the display. The more 'information-rich' jittered dot plots and (especially) rank scatterplots are much better suited to displaying information, but are greatly underused. Regrettably, it is hard to find any examples of rank scatterplots (= quantile scatterplots) in most disciplines. Dot plots are being used a little more, but the points are often arranged systematically, instead of being jittered to avoid bias.
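To make the point about class intervals concrete, here is a minimal sketch in Python (standard library only). The function name `class_counts` and the measurements are hypothetical, invented purely for illustration. It counts the same observations into classes of the same width, changing only the arbitrary starting point of the first class.

```python
import math
from collections import Counter


def class_counts(values, width, origin):
    """Count values per class interval [lower, lower + width).

    Returns a dict mapping each class's lower limit to its frequency.
    Empty classes are simply absent from the result.
    """
    counts = Counter(math.floor((v - origin) / width) for v in values)
    return {origin + k * width: counts[k] for k in sorted(counts)}


# Hypothetical measurements (illustrative values, not from any study)
lengths = [1.9, 2.1, 2.9, 3.1, 3.9, 4.1, 4.9, 5.1]

# Same data, same class width; only the origin of the classes differs
print(class_counts(lengths, width=2, origin=0))
print(class_counts(lengths, width=2, origin=1))
```

With these made-up values, an origin of 0 gives frequencies of 1, 4 and 3, whereas an origin of 1 gives 3, 4 and 1. The same eight observations appear skewed in opposite directions, depending only on where the classes happen to start.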
Simple circular pie charts are still quite widely used for distributions of nominal variables. However, they are also unsuitable for scientific work because it is much more difficult to compare areas visually than to compare lengths. Exploded and elliptical pie charts, which exaggerate one category of the distribution, are even worse and should only be used by journalists! Line diagrams or bar diagrams are optimal for displaying distributions of nominal variables, and these can be reduced to dot charts if preferred. The other common problem with distributions of nominal variables is the missing category. If, for example, one is looking at the causes of death, there is a tendency to omit the category of 'unknown cause', even though this may be the biggest category!
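As a rough illustration of both points (lengths rather than areas, and keeping the 'unknown' category), here is a minimal Python sketch of a text bar diagram. The function `bar_diagram`, its `missing_label` parameter and the records are all hypothetical. We assume unrecorded causes are stored as `None`, and they are counted as a category in their own right rather than silently dropped.

```python
from collections import Counter


def bar_diagram(categories, missing_label="unknown cause"):
    """Build a text bar diagram of a nominal variable.

    Records with no category (None) are counted under `missing_label`
    instead of being discarded. Bars are ordered from most to least frequent.
    """
    counts = Counter(missing_label if c is None else c for c in categories)
    width = max(len(name) for name in counts)
    return [
        f"{name:<{width}} | {'#' * n} {n}"
        for name, n in counts.most_common()
    ]


# Hypothetical records of cause of death (illustrative only)
causes = ["predation"] * 3 + ["disease"] * 2 + [None] * 5

for line in bar_diagram(causes):
    print(line)
```

Because each bar's length is proportional to its frequency, categories can be compared directly, and in this made-up example the 'unknown cause' bar is visibly the longest.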
Unfortunately there is little consistency in terminology for the different graphical techniques even in biostats textbooks. Histograms are called bar diagrams and vice versa, and terms such as dot plots, dot charts and scatterplots can all refer (essentially) to the same types of display.
We have more information on using R to display univariate distributions, including histograms, jittered dotplots, rank scatterplots and qq-plots. For beginners, our basic statistics pages include pie charts and dotplots.
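The coordinates behind these displays are simple to compute. The following is a minimal sketch in Python (standard library only) rather than R, and it makes some assumptions: the function names are hypothetical, rank is placed on the horizontal axis of the rank scatterplot, tied values are given consecutive ranks, and jitter is drawn from a uniform distribution.

```python
import random


def rank_scatter_points(values):
    """Return (rank, value) pairs: every observation plotted against its rank.

    Tied values receive consecutive ranks (an assumption of this sketch).
    """
    return [(rank, v) for rank, v in enumerate(sorted(values), start=1)]


def ecdf_points(values):
    """Return (value, rank / n) pairs for an empirical cumulative distribution.

    For tied values, the point with the largest proportion is the ECDF value.
    """
    n = len(values)
    return [(v, rank / n) for rank, v in rank_scatter_points(values)]


def jittered_points(values, spread=0.2, seed=1):
    """Return (value, offset) pairs for a jittered dot plot.

    Each offset is random within +/- spread, so overlapping observations
    separate without being arranged in any systematic pattern.
    """
    rng = random.Random(seed)
    return [(v, rng.uniform(-spread, spread)) for v in values]


# Hypothetical measurements (illustrative only)
data = [3.2, 1.5, 2.8, 2.8, 4.1]
print(rank_scatter_points(data))
print(ecdf_points(data))
print(jittered_points(data))
```

Unlike a histogram, none of these displays requires class intervals: every observation is shown individually.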
What the statisticians say
Tufte (2003) is one of the best texts on graphical display. Jacoby (1997) gives an overview of the range of graphical techniques available. Griffiths et al. (1998) is one of the better introductory texts on display methods, while Stuart & Ord (1991) give an authoritative account of empirical frequency distributions and their display in their first chapter. Chambers et al. (1989) and Cleveland (1985) are two informative older texts on graphical display with extensive coverage of quantile scatterplots.
Hoffman et al. (2006) introduce two further methods of visual display of data - the dartboard and the roulette wheel - intended to make estimates of risk more understandable to patients. Jacoby (2006) introduces dotplots for nominal variables. Indrayan & Satyanarayana (2000) provide a summary of methods of visual display of numerical data. Some of their cautions are useful, although they should also have warned about the bias involved in using elliptical pie charts.
The R Graph Gallery and Robert Kabacoff describe the different graphical techniques available in R, along with the appropriate code, whilst Michael Friendly provides a somewhat esoteric selection of the best and worst of statistical graphics.