Example, with R
Histograms are the most common way that elementary statistics textbooks display frequency distributions.
For comparison, we have overlaid that histogram with a dotplot of the same data.
Histograms are readily produced with R.
- Histograms were devised for continuous or measurement data - not for discrete or nominal variables.
- By applying a histogram to our whole-number data, we imply that values between them are possible.
- To cope with rounded or truncated data, values equal to the upper bound of an interval are included within that interval (for samples of continuous data the probability of that occurring is virtually zero).
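The upper-bound convention in the last point can be sketched in a few lines. The example is written in Python rather than R, purely as an illustration; the function name `count_right_closed`, the break points and the data are invented for this sketch. It mirrors the default behaviour of R's `hist()` (`right = TRUE`, `include.lowest = TRUE`), where each interval includes its upper bound and the lowest break is included in the first interval.

```python
from bisect import bisect_left

def count_right_closed(values, breaks):
    """Count values in intervals (breaks[i], breaks[i+1]].

    Each interval includes its upper bound. The lowest break is also
    counted in the first interval so that no value is dropped.
    """
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if v < breaks[0] or v > breaks[-1]:
            raise ValueError(f"{v} lies outside the breaks")
        # bisect_left gives the index of the first break >= v, so a value
        # equal to a break lands in the interval that ends at that break.
        i = max(bisect_left(breaks, v) - 1, 0)
        counts[i] += 1
    return counts

# Hypothetical whole-number data: many values sit exactly on a break.
data = [0, 1, 2, 2, 3, 4, 4, 4, 5, 6]
breaks = [0, 2, 4, 6]
counts = count_right_closed(data, breaks)
print(counts)  # [4, 4, 2]
```

With whole-number data like these, half of the values fall exactly on an interval boundary, so the choice of which side is closed changes the counts. For genuinely continuous measurements this almost never happens.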
Definition and Use
- Histograms are a type of bargraph in which the values of a variable are divided into intervals: the limits of each bar show its interval, and its height shows the frequency of values within that interval.
- Histograms are used to display the distribution of values of a continuous measurement variable.
- Provided intervals are of equal width, the area of each bar is directly related to its height, and is proportional to the frequency of values in its interval.
- Histograms are sometimes used to display the proportion within each interval (where p = f/n = f/sum(f)), rather than the frequency therein. The resulting histogram has the same shape, but the proportions sum to 1, not n.
- Histograms are held in high regard by statistical novices, and their teachers, but in much lower regard by most statisticians. Journal editors are generally uninterested in any attempt to examine distributions!
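The relationship between frequency, proportion and bar area described in this list can be checked numerically. What follows is a minimal Python sketch, not part of the original example; the frequencies and the interval width are made up for illustration.

```python
# Hypothetical frequencies for four equal-width intervals.
freqs = [3, 7, 8, 2]
width = 5.0  # assumed common interval width

n = sum(freqs)                       # total number of values
props = [f / n for f in freqs]       # p = f / n = f / sum(f)
areas = [f * width for f in freqs]   # bar area = height x width

# With equal widths, area / frequency is the same constant for every bar,
# so area is proportional to frequency.
ratios = [a / f for a, f in zip(areas, freqs)]

print(n)       # 20
print(props)   # same shape as freqs, but on a 0-1 scale
print(ratios)  # [5.0, 5.0, 5.0, 5.0]
```

Dividing by n rescales the vertical axis only, which is why the histogram of proportions has the same shape as the histogram of frequencies.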
Tips and Notes
- By convention, each interval includes any value which equals its upper bound.
- No gaps should appear between intervals of a histogram.
- Discrete variables (such as frequency of car crashes) should be plotted as a bargraph or as a lineplot - where each bar represents one possible value. This rule is sometimes relaxed where discrete data are divided into broad class intervals.
- Where there are many class-intervals, or if several distributions need to be compared, frequency polygons may be a better way to display the same information.
- A frequency polygon is a line graph of interval frequency plotted against interval midpoint.
- Histograms (and frequency polygons) may work tolerably well for large sets of 'normal' or fairly uniform data, but can produce very misleading results when used for other types of data (e.g. highly skewed, or polymodal) - especially if there are not huge numbers of values.
- Whilst recommendations exist for some sorts of data, the number and location of interval boundaries are inherently arbitrary - these can readily introduce artifacts, and obscure important fine-structure within a distribution.
- Although histograms are often used to check if data are normally distributed, they can be surprisingly insensitive to departures from the 'normal' distribution.
- Whether one is examining a single univariate distribution or comparing two or more distributions, rank scatterplots (quantile plots) or kernel density plots are much better options.
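Several of the points above (arbitrary interval boundaries, frequency polygons, and rank scatterplots as an alternative) can be illustrated with one small data set. The Python sketch below is an illustration only: the sample, the helper names `bin_counts` and `quantile_points`, and the plotting position (i - 0.5)/n are all assumptions made for this example.

```python
import math

def bin_counts(values, origin, width, n_bins):
    """Frequencies for n_bins right-closed intervals of equal width,
    the first of which starts just above origin."""
    counts = [0] * n_bins
    for v in values:
        # ceil(...) - 1 puts a value equal to an upper bound in that interval
        i = math.ceil((v - origin) / width) - 1
        if not 0 <= i < n_bins:
            raise ValueError(f"{v} falls outside the intervals")
        counts[i] += 1
    return counts

def quantile_points(values):
    """(cumulative rank proportion, value) pairs for a rank scatterplot.
    The plotting position (i - 0.5) / n for rank i is one common choice."""
    s = sorted(values)
    n = len(s)
    return [((i + 0.5) / n, v) for i, v in enumerate(s)]  # i is rank - 1

# Hypothetical sample with two clusters, around 2 and around 4.
data = [1.6, 1.8, 1.9, 2.1, 2.2, 2.4, 3.6, 3.8, 3.9, 4.1, 4.2, 4.4]

# Same data, same interval width, two different starting points.
a = bin_counts(data, origin=1.0, width=1.0, n_bins=4)
b = bin_counts(data, origin=1.5, width=1.0, n_bins=3)
print(a)  # [3, 3, 3, 3]
print(b)  # [6, 0, 6]

# Frequency polygon for the second binning: (interval midpoint, frequency).
polygon = [(1.5 + (i + 0.5) * 1.0, f) for i, f in enumerate(b)]
print(polygon)  # [(2.0, 6), (3.0, 0), (4.0, 6)]

# A rank scatterplot uses every value, so no boundaries need to be chosen.
points = quantile_points(data)
```

Starting the intervals at 1.0 makes these data look uniform, while starting at 1.5 reveals two separate clusters. The rank scatterplot coordinates in `points` show the gap between 2.4 and 3.6 whatever interval boundaries might have been chosen.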
References
- Chambers, J.M. et al. (1983). Graphical methods for data analysis. Wadsworth International Group/Duxbury Press, Belmont & Boston.
- Describes histograms in Chapter 2. Correctly draws attention to the drawbacks of histograms for data analysis given the arbitrary choice of the number and placement of intervals and other artifacts.
- Jacoby, W.G. Distributions.
- Slides from a lecture on histograms, quantile plots and other techniques for display of univariate data.
- Kabacoff, R.I. (2012). Quick-R: Histograms and Density Plots.
- Covers histograms, kernel density plots and comparing distributions with kernel density.
- Wikipedia: Bar chart.
- Useful text but lack of label on vertical axis of example makes it a classic example of how not to do it.