"It has long been an axiom of mine that the little things are infinitely the most important"
Displaying frequency distributions
What is a frequency distribution?
Examining how values are distributed within a set of data, in other words its frequency distribution, is an important (albeit oft-neglected) first step in analyzing that data.
Given their importance, many ways have been developed to view and quantify how values are distributed. Perhaps the simplest approach is to plot the observations as shown below.
The value of every observation (the x-axis) is shown by a point. For obvious reasons, this type of dot plot is known as a rugplot - the graph's y-axis is not labelled because it does not vary, so conveys no information. Note, it is sometimes convenient to transpose these axes.
These simple rugplots work very well if there are relatively few observations, and their values are not too similar. For example, we can see that the mouse weights were (more or less) symmetrically distributed around a weight of 20 g, whilst the distribution of parasite numbers is much more skewed. As we increase the number of observations - and in particular the number of ties - our simple display method is no longer adequate. Only about 20 of the 30 cow weights have been displayed because of tied values. This problem gets much worse in the fourth plot, where we have lost most of the information contained in the raw data.
Where there are many observations the conventional approach is to subdivide the distribution into classes of similar values, and sum the frequencies within each class. Since this is currently the most popular approach, and its reasoning is closely linked to conventional statistical analysis, we consider that approach first.
We have summarized the main approaches
Class-intervals and histograms
Conventional methods require one or more predefined (arbitrary) breakpoints to divide all the values into non-overlapping class-intervals or bins - producing, for example, a frequency table, bar diagram or histogram. Since the values in each group are summed or averaged, narrow intervals reveal the most detail about the distribution.
Consider the data above on the weights of 30 zebu cattle. Note that the readings are taken to the nearest 5 kg. The conventional approach to looking at the frequency distribution of a continuous variable is to put the observations into a number of classes (or ranges, or intervals). We have done this in the table below using a class interval of 20 kg, from 401 kg upwards. Note that the choice of class interval is arbitrary. Had we used a larger or smaller class interval, we might have obtained a quite different picture.
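The tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights below are hypothetical values rounded to the nearest 5 kg (not the zebu data), and `frequency_table` is a helper name of our own.

```python
from collections import Counter

# Hypothetical weights (kg), rounded to the nearest 5 kg.
# These are illustrative values, not the zebu data discussed in the text.
weights = [405, 410, 420, 425, 425, 430, 440, 440, 445, 450, 455, 460]


def frequency_table(values, start, width):
    """Return (lower, upper, count) for consecutive classes of equal width.

    Classes run start..start+width-1, start+width..start+2*width-1, and so on.
    Assumes integer values, none of them below `start`.
    """
    counts = Counter((v - start) // width for v in values)
    return [
        (start + i * width, start + (i + 1) * width - 1, counts.get(i, 0))
        for i in range(max(counts) + 1)
    ]


# A class interval of 20 kg, from 401 kg upwards.
table = frequency_table(weights, start=401, width=20)
for lower, upper, count in table:
    print(f"{lower}-{upper} kg: {count}")
```

Changing `width` (or `start`) and re-running is a quick way to see how much the tabulated picture depends on those arbitrary choices.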
The tabulated distributions are then displayed using a histogram. Class frequencies are represented by the areas of bars centred on the class interval. If intervals are all the same width, then height represents the frequency. Use of a histogram assumes that the underlying distribution of a variable is continuous, even though the data will inevitably take discrete values - depending on the accuracy of measurement. A histogram can also be used to represent a cumulative frequency distribution. In this case, some or all of the verticals of the histogram may be dispensed with to give a step-plot.
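The point that frequency is represented by area rather than height matters only when class widths differ. A minimal sketch, assuming hypothetical classes whose last interval is deliberately twice as wide as the others, shows how bar heights and cumulative frequencies might be derived:

```python
from itertools import accumulate

# Hypothetical classes as (lower, upper, count), with bounds in whole kg.
# The last class is deliberately twice as wide as the other two.
classes = [(401, 420, 3), (421, 440, 5), (441, 480, 4)]

widths = [upper - lower + 1 for lower, upper, _ in classes]
counts = [count for _, _, count in classes]

# Bar height is frequency per unit width, so that area = height * width = frequency.
heights = [c / w for c, w in zip(counts, widths)]
areas = [h * w for h, w in zip(heights, widths)]

# Running totals give the cumulative frequency distribution (the step-plot).
cumulative = list(accumulate(counts))

print(heights)
print(cumulative)
```

If all widths were equal, the heights would simply be the counts divided by a constant, which is why height alone can stand for frequency in that case.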
An alternative display method is the frequency polygon. The frequencies are graphed against the class mid-points, and the points joined by straight lines. Frequency polygons can be useful for comparing several distributions, but like histograms the method is sensitive to the number and ranges of the class intervals chosen, even for relatively large amounts of data. A stem-and-leaf plot also provides a histogram-like display, but without losing the detail of the original data.
Tabulated distributions of discrete variables are traditionally displayed using a bar diagram (also known as a 'bar chart' or a 'bar graph'). A bar chart is superficially similar to a histogram but the bars do not touch each other. Frequency is represented by the height (= length) of the bar. Conventional wisdom dictates that bar diagrams should not be used for continuous measurement variables, except in one situation. This is when there is a high degree of rounding, resulting in multiple observations of each measurement. Observations should then not be grouped into class intervals. Instead frequencies of each value should be shown separately using a line diagram. Histograms should generally not be used to display discrete measurement variables, although they frequently are, especially when some values are grouped together in classes.
More powerful and less arbitrary approaches
The class interval approach works well when applied to extremely large sets of values, every one of which is different, provided there is a suitable choice of breakpoints. But when applied to small sets, or discrete data, these methods can produce very misleading results. This is partly because they are sensitive to the choice of breakpoints, which is essentially arbitrary. Judicious choice of breakpoints can make a non-symmetrical distribution look symmetrical or vice versa. Arbitrary elements in statistical analysis are always potentially biased and should be avoided. The other reason for misleading results is that a noticeable proportion of values may coincide with the breakpoints. This can be serious when dealing with rounded values (as for example the cow weight data).
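The sensitivity to breakpoints is easy to demonstrate. In the sketch below the data are hypothetical weights rounded to the nearest 5 kg, and `class_counts` is a helper of our own; the same twelve values are grouped into 20 kg classes twice, with the breakpoints shifted by 10 kg.

```python
from collections import Counter

# Hypothetical weights (kg), rounded to the nearest 5 kg (illustrative only).
weights = [405, 410, 420, 425, 425, 430, 440, 440, 445, 450, 455, 460]


def class_counts(values, start, width):
    """Count values in consecutive classes of `width`, the first beginning at `start`.

    Assumes integer values, none of them below `start`.
    """
    counts = Counter((v - start) // width for v in values)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# Same data, same class width, breakpoints shifted by 10 kg.
from_391 = class_counts(weights, start=391, width=20)  # 391-410, 411-430, ...
from_401 = class_counts(weights, start=401, width=20)  # 401-420, 421-440, ...

print(from_391)
print(from_401)
```

With these values the first grouping looks symmetrical while the second does not, even though nothing but the breakpoints has changed.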
An alternative approach is to examine how many values lie within a predetermined range of values. Where these ranges are non-overlapping and contiguous this is equivalent to the first approach (assuming equal-width bins) - but overlapping bins help to smooth the distribution. This method can be thought of as describing the 'density' of values.
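As a rough sketch of this idea, the function below counts how many values fall in a window around each of a set of centres. The data, the function name `window_counts` and the half-open window convention are all assumptions made for illustration.

```python
# Hypothetical weights (kg), rounded to the nearest 5 kg (illustrative only).
weights = [405, 410, 420, 425, 425, 430, 440, 440, 445, 450, 455, 460]


def window_counts(values, centres, half_width):
    """Count the values lying in the half-open window [c - h, c + h) for each centre c."""
    return [
        sum(1 for v in values if c - half_width <= v < c + half_width)
        for c in centres
    ]


# Centres spaced one full window apart: contiguous, non-overlapping windows,
# which is just ordinary equal-width binning.
binned = window_counts(weights, centres=[410, 430, 450, 470], half_width=10)

# Centres spaced half a window apart: overlapping windows, giving a smoother profile.
overlapping = window_counts(weights, centres=range(400, 480, 10), half_width=10)

print(binned)
print(overlapping)
```

Each value contributes to two of the overlapping windows here, so the counts no longer sum to the sample size; they describe local density rather than a partition of the data.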
For small samples the jittered dot plot is quite popular - where each value is plotted against a (uniformly-distributed) random number. Assuming the plot's x and y axes are at right-angles (90 degrees), that random variation is independent of the data-variable. A Gaussian smoothed distribution (described in
The rank scatterplot and related plots are powerful, but underused, ways of exploring and comparing distributions - irrespective of sample size. The rank of an observation is defined as the number of observations that are less than or equal to that observation. In a rank scatterplot the rank of each observation is plotted against its value, which reveals the cumulative distribution. Plotting each observation's rank within a given value, against that value, produces something which looks like a bar graph. More sophisticated plots employ other measures of rank, such as the relative rank, corrected relative rank, cumulative proportion, or P-value. Conventionally these all use the rank order, or sequential rank - but for estimation and inference, a mean rank (or expected rank, or mid-P-value) may be less biased.
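The rank definitions above translate directly into code. The sketch below uses a small hypothetical sample with one tied pair. The relative rank is taken as rank divided by sample size, which is one common convention and an assumption on our part, and the mean rank gives tied values the average of the sequential ranks they occupy.

```python
# Hypothetical observations containing one tied pair (illustrative only).
values = [3, 1, 4, 1, 5]
n = len(values)

# Rank as defined in the text: the number of observations <= this one.
rank = [sum(1 for w in values if w <= v) for v in values]

# Mean rank: tied values share the average of the sequential ranks they occupy,
# i.e. (number strictly smaller) + (number equal + 1) / 2.
mean_rank = [
    sum(1 for w in values if w < v) + (sum(1 for w in values if w == v) + 1) / 2
    for v in values
]

# Relative rank (assumed here to be rank / n), a cumulative proportion.
relative_rank = [r / n for r in rank]

# Points for a rank scatterplot: (value, rank) pairs, sorted by value.
points = sorted(zip(values, rank))

print(rank)
print(mean_rank)
print(points)
```

Plotting the second element of each pair in `points` against the first gives the rank scatterplot; dividing the ranks by `n` rescales the y-axis to run from 0 to 1.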
Values ordered by rank (or relative rank) divide up a distribution, and are known as quantiles. Accordingly a plot of rank on value is also known as a rank quantile plot, but when a step plot is added (to emphasize its discreteness) it is called an empirical cumulative distribution function (ECDF). Plotting quantiles of one variable against the same quantiles of another variable is known as a quantile-quantile plot - and can be used to compare two sample
What about nominal variables?
Bar diagrams are most commonly used to display the frequency distributions of nominal variables. Either frequency or relative frequency can be plotted on the y-axis. Multiple batches of nominal data can be displayed using either a multiple bar diagram or a stacked bar diagram. As such they can be used to show one frequency distribution within another frequency distribution. Another option is the dot plot. One axis of the dot plot (usually the horizontal) is a scale covering the range of quantitative values to be plotted. The other axis (usually the vertical) shows descriptive labels that are associated with each of the numeric values. The data objects usually are sorted according to the quantitative values.
A third option is the pie chart. They are called pie charts because each sector looks like the slice of a pie. The angle of each sector is proportional to the relative frequency of that category. Provided the 'pie' is circular, the area of the sector is then also proportional to that frequency. Pie charts are seen more often in newspapers, magazines and Government reports than in scientific journals - and are generally disapproved of by statisticians. This is because it is much harder to visually compare areas of segments than it is to compare heights of bars.
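The relationship between relative frequency and sector angle is simple arithmetic, sketched below for hypothetical category counts (the labels and numbers are invented for illustration).

```python
# Hypothetical counts for three categories of a nominal variable.
counts = {"A": 10, "B": 30, "C": 20}
total = sum(counts.values())

# Relative frequency of each category.
relative_frequency = {label: c / total for label, c in counts.items()}

# Sector angle in degrees: the category's share of the full 360-degree circle.
angles = {label: 360 * c / total for label, c in counts.items()}

for label in counts:
    print(f"{label}: {relative_frequency[label]:.2f} of total, {angles[label]:.0f} degrees")
```

The same relative frequencies could equally be used as bar heights, which is the comparison statisticians generally prefer.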