Example, with R
Cumulative frequency plots can be drawn as histograms. Below are a frequency histogram and a cumulative frequency histogram of the same data.
Cumulative histograms are readily produced with R.
- Because conventional histograms are used so heavily in elementary statistics, most statistical novices continue to employ them for the remainder of their careers, and find cumulative plots difficult, if not impossible, to interpret.
- So, although cumulative distributions are very important and cumulative distribution plots are very useful, they are seldom used by non-statisticians.
Definition and Use
- Frequency histograms use each bar height to show the number of values in that interval.
- Cumulative frequency histograms use each bar height to show the number of values in that interval, plus the number of values in all lower intervals.
- Cumulative plots are especially useful because, once you can interpret them, they are a more robust way to examine distributions than histograms - especially when examining a small to moderate number of values.
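The two definitions above can be sketched in R as follows. This is only an illustration: the data vector y and the unit-wide class intervals are hypothetical, chosen for the example rather than taken from the figures on this page.

```r
# Hypothetical sample data, invented purely for illustration
y <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.3, 7.7)

# Compute the histogram without drawing it, using unit-wide class intervals
h <- hist(y, breaks = 2:8, plot = FALSE)

# Number of values in each interval, and the running total over
# that interval plus all lower intervals
freq <- h$counts
cum_freq <- cumsum(freq)

# Draw the frequency and cumulative frequency histograms side by side
op <- par(mfrow = c(1, 2))
plot(h, main = "Frequency", xlab = "y")
h_cum <- h
h_cum$counts <- cum_freq   # replace the bar heights with the running totals
plot(h_cum, main = "Cumulative frequency", xlab = "y")
par(op)
```

The final bar of the cumulative histogram always equals the total number of values, which is a quick check that the intervals cover all the data.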
Tips and Notes
- Although cumulative frequency histograms have advantages over conventional frequency histograms, they still suffer from the general disadvantages of histograms - namely that class intervals are entirely arbitrary and can lead to bias.
- Cumulative scatterplots, such as the one shown below, commonly plot each item's rank (or proportional rank) against its value. Provided each item has a different value, its rank is the number of items whose value is less than or equal to that item's value.
- In that case the minimum value has a rank of 1, and the maximum has a rank of n (assuming there are n values).
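As a small check of this definition of rank, the sketch below counts, for each item, how many values are less than or equal to it, and compares that count with R's rank() function. The values in y are hypothetical and deliberately contain no ties.

```r
# Hypothetical values, all different (no ties)
y <- c(4.2, 1.7, 9.3, 5.5, 3.1)
n <- length(y)

# For each item, count the items whose value is <= its own value
count_le <- sapply(y, function(v) sum(y <= v))

# With no ties, that count is the same as the item's rank
r <- rank(y)
all(count_le == r)

# The minimum has rank 1 and the maximum has rank n
range(r)

# Proportional rank: rank divided by the number of values
prop_rank <- r / n
```

With tied values the two quantities differ, because rank() handles ties by its own rule.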
The graph below shows a cumulative scatterplot superimposed on a cumulative histogram of the same data.
- Notice how, unlike the cumulative histogram, this scatterplot reveals the presence of 'tied' values.
- In addition to this advantage, cumulative scatterplots are simpler to plot and are less artifact-prone than cumulative histograms.
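One way such a superimposed graph might be drawn is sketched below. The data are hypothetical and include tied values on purpose; the class intervals are an arbitrary choice made for the example.

```r
# Hypothetical data containing tied values (2.5 three times, 4.4 twice)
y <- c(1.2, 2.5, 2.5, 2.5, 3.1, 4.4, 4.4, 5.0, 6.3, 7.8)

# Build a cumulative histogram from an ordinary one
h <- hist(y, breaks = 0:8, plot = FALSE)
h$counts <- cumsum(h$counts)
plot(h, main = "Cumulative histogram and scatterplot",
     xlab = "y", ylab = "Cumulative frequency")

# Superimpose the cumulative scatterplot: sorted values against 1..n
points(sort(y), seq_along(y), pch = 16)
```

Tied values appear as points stacked vertically at the same value of y, whereas the bars give no indication of whether the values within an interval are tied.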
The textarea below shows one way to produce a cumulative scatterplot with R.
- plot(y, rank(y)) would give the same result, provided every value was different.
- By default, R's rank() function assigns tied values their mean rank.
- Cumulative scatterplots have a variety of names: a rank scatterplot, a plot of rank on value, a quantile plot, or an empirical cumulative distribution function (ECDF).
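A minimal sketch of these points, using a hypothetical vector y that contains one tie. It is not necessarily the code originally shown on this page; it only illustrates plotting sorted values against their order, the plot(y, rank(y)) alternative, and how rank() treats ties.

```r
# Hypothetical data: the value 1 occurs twice
y <- c(3, 1, 4, 1, 5)
n <- length(y)

# One way to draw a cumulative scatterplot: sorted values against 1..n
plot(sort(y), seq_along(y), xlab = "Value", ylab = "Rank")

# Alternative: plot each value against its rank
# (identical to the plot above only when every value is different)
plot(y, rank(y), xlab = "Value", ylab = "Rank")

# Default handling of ties: each tied value gets the mean of its ranks
r_mean <- rank(y)                       # 3.0 1.5 4.0 1.5 5.0

# Giving ties their highest rank matches the "less than or equal" count
r_max <- rank(y, ties.method = "max")   # 3 2 4 2 5

# Base R's ecdf() gives the same information as a proportion
plot(ecdf(y), main = "ECDF of y")
```

Dividing the "max" ranks by n gives the same proportions that ecdf(y) returns at each observed value.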
What does the R code below do, and in what ways might this be useful?
Hint: try running that code on the data provided above.
- Chambers, J.M. et al. (1983). Graphical methods for data analysis. Wadsworth International Group/Duxbury Press, Belmont & Boston.
- An excellent older text on graphical display of frequency distributions which deals with quantile plots (equivalent to rank scatterplots) in Chapter 2.
- Wikipedia: Frequency distribution.
- Includes a short section on cumulative frequency distributions.