 InfluentialPoints.com
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

### What is a frequency distribution

Examining how values are distributed within a set of data, in other words its frequency distribution, is an important (albeit oft neglected) first step in analyzing that data:

1. Its frequency distribution will often indicate which is the most appropriate measure of location to use. For example, the arithmetic mean provides a suitable summary measure if values are distributed symmetrically, but if those values are skewed the geometric mean or median may be better.
2. It can be used in the process of data verification - in order to show up 'odd' points for special attention.
3. It can expose polymodal distributions, and enable you to assess any criteria used to divide them.
4. The frequency distribution of data may determine the most appropriate type of statistical analysis.
5. Less obviously perhaps, understanding frequency distributions is central to interpreting the behaviour of statistics - and understanding statistical inference.
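The first point in the list above can be illustrated with a minimal sketch. The counts below are hypothetical (they are not the parasite data plotted in this unit) and were chosen only to be strongly right-skewed; the three summary functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical, right-skewed counts (illustrative only, not data from this unit):
# most values are small, but a few are very large.
counts = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

mean = statistics.mean(counts)                      # pulled upwards by the large values
median = statistics.median(counts)                  # middle of the ordered values
geometric_mean = statistics.geometric_mean(counts)  # mean on the log scale, back-transformed

print(f"arithmetic mean = {mean:.1f}")
print(f"median          = {median:.1f}")
print(f"geometric mean  = {geometric_mean:.1f}")
```

For data like these the arithmetic mean lies well above most of the observations, whereas the median and geometric mean stay close to the bulk of the values.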

Given their importance, many ways have been developed to view and quantify how values are distributed. Perhaps the simplest approach is to plot the observations as shown below.

The value of every observation (the x-axis) is shown by a point. For obvious reasons, this type of dot plot is known as a rugplot - the graph's y-axis is not labelled because it does not vary, so conveys no information. Note, it is sometimes convenient to transpose these axes.

These simple rugplots work very well if there are relatively few observations, and their values are not too similar. For example we can see that the mice weights were (more or less) symmetrically distributed around a weight of 20 g, whilst the distribution of parasite numbers is much more skewed. As we increase the number of observations - and in particular the number of ties - our simple display method is no longer adequate. Only about 20 of the 30 cow weights have been displayed because of tied values. This problem gets much worse in the fourth plot where we have lost most of the information contained in the raw data.

Where there are many observations the conventional approach is to subdivide the distribution into classes of similar values, and sum the frequencies within each class. Since this is currently the most popular approach, and its reasoning is closely linked to conventional statistical analysis, we consider that approach first. We have summarized the main approaches here.

### Class-intervals and histograms

Conventional methods require one or more predefined (arbitrary) breakpoints to divide all the values into non-overlapping class-intervals or bins - producing, for example, a frequency table, bar diagram or histogram. Since the values in each group are summed or averaged, narrow intervals reveal the most detail about the distribution.

Consider the data above on the weights of 30 zebu cattle. Note that the readings are taken to the nearest 5 kg. The conventional approach to looking at the frequency distribution of a continuous variable is to put the observations into a number of classes (or ranges, or intervals). We have done this in the table below using a class interval of 20 kg, from 401 kg upwards. Note that the choice of class interval is arbitrary. Had we used a larger or smaller class interval, we might have obtained a quite different picture.

| Weight class (kg) | Class mid-point (kg) | Absolute frequency | Relative frequency (%) | Absolute cumulative frequency | Relative cumulative frequency (%) |
|---|---|---|---|---|---|
| 401-420 | 410 | 1 | 3.33 | 1 | 3.33 |
| 421-440 | 430 | 2 | 6.67 | 3 | 10.0 |
| 441-460 | 450 | 3 | 10.0 | 6 | 20.0 |
| 461-480 | 470 | 3 | 10.0 | 9 | 30.0 |
| 481-500 | 490 | 5 | 16.7 | 14 | 46.7 |
| 501-520 | 510 | 5 | 16.7 | 19 | 63.3 |
| 521-540 | 530 | 6 | 20.0 | 25 | 83.3 |
| 541-560 | 550 | 3 | 10.0 | 28 | 93.3 |
| 561-580 | 570 | 2 | 6.67 | 30 | 100.0 |

We have first given these data in the form of an absolute frequency distribution, in other words the number of observations in each weight class.

We then give it as a relative frequency distribution, in other words the percentage (or proportion) of observations in each class.

We then add together the frequencies of each successive class to give a cumulative frequency distribution, in other words the number of observations less than or equal to that class.

Lastly we give it as a relative cumulative frequency distribution, in other words the percentage (or proportion) of observations less than or equal to that class. Examination of the cumulative relative frequencies shows that the distribution is roughly symmetrical, with 46.7% of the observations at 500 kg or below, and 53.3% above 500 kg.
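The four kinds of distribution described above can be derived from the class frequencies alone. The following is a minimal sketch that recomputes the relative and cumulative columns from the absolute frequencies in the cattle table; the variable names are ours, chosen for illustration.

```python
from itertools import accumulate

# Class limits and absolute frequencies taken from the zebu cattle table.
classes = ["401-420", "421-440", "441-460", "461-480", "481-500",
           "501-520", "521-540", "541-560", "561-580"]
absolute = [1, 2, 3, 3, 5, 5, 6, 3, 2]

n = sum(absolute)                                    # total number of observations
relative = [100 * f / n for f in absolute]           # percentage in each class
cumulative = list(accumulate(absolute))              # running total of frequencies
relative_cumulative = [100 * c / n for c in cumulative]

for row in zip(classes, absolute, relative, cumulative, relative_cumulative):
    print("{:>8} {:>3} {:>6.2f} {:>3} {:>6.1f}".format(*row))
```

Recomputing the columns in this way is also a quick check on a hand-built table, since the last cumulative frequency must equal the number of observations and the last relative cumulative frequency must be 100%.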

The tabulated distributions are then displayed using a histogram. Class frequencies are represented by the areas of bars centred on the class interval. If intervals are all the same width, then height represents the frequency. Use of a histogram assumes that the underlying distribution of a variable is continuous, even though the data will inevitably take discrete values - depending on the accuracy of measurement. A histogram can also be used to represent a cumulative frequency distribution. In this case, some or all of the verticals of the histogram may be dispensed with to give a step-plot.
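The distinction between area and height only matters when class widths differ. As a sketch, suppose the two lowest and the two highest cattle weight classes were pooled (this pooling is our own illustration, and for simplicity the class boundaries are assumed to lie at 400, 440, 460 and so on). Dividing each frequency by its class width gives a bar height whose area recovers the frequency.

```python
# (lower boundary, upper boundary, frequency). The pooled outer classes and the
# exact boundary positions are assumptions made for this illustration.
bins = [
    (400, 440, 3),   # pooled 401-420 and 421-440
    (440, 460, 3),
    (460, 480, 3),
    (480, 500, 5),
    (500, 520, 5),
    (520, 540, 6),
    (540, 580, 5),   # pooled 541-560 and 561-580
]

# Bar height = frequency per kg, so that height x width = frequency.
heights = [freq / (upper - lower) for lower, upper, freq in bins]
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, bins)]

for (lower, upper, freq), h in zip(bins, heights):
    print(f"{lower}-{upper} kg: frequency {freq}, height {h:.3f} per kg")
```

The first two bars both represent 3 animals, but the pooled class is twice as wide so its bar is half as tall. Plotting raw frequencies as heights would overstate the pooled classes.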

An alternative display method is the frequency polygon. The frequencies are graphed against the class mid-points, and the points joined by straight lines. Frequency polygons can be useful for comparing several distributions, but like histograms the method is sensitive to the number and ranges of the class intervals chosen, even for relatively large amounts of data. A stem-and-leaf plot also provides a histogram-like display, but without losing the detail of the original data.

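| No. females | Absolute frequency | Relative frequency (%) | Absolute cumulative frequency | Relative cumulative frequency (%) |
|---|---|---|---|---|
| 1 | 47 | 72.3 | 47 | 72.3 |
| 2 | 12 | 18.5 | 59 | 90.8 |
| 3 | 3 | 4.6 | 62 | 95.4 |
| 5 | 2 | 3.1 | 64 | 98.5 |
| 12 | 1 | 1.5 | 65 | 100.0 |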

With discrete measurement variables (and ordinal variables) there is often no need to define arbitrary class intervals, although extreme classes are often pooled. The vole data (from Aars et al. (2001)) fall naturally into 5 non-overlapping classes. Note that this distribution has a quite different shape from the distribution of cattle weights, with 90.8% (47 + 12 out of 65) of the observations in the first two classes.

Tabulated distributions of discrete variables are traditionally displayed using a bar diagram (also known as a 'bar chart' or a 'bar graph'). A bar chart is superficially similar to a histogram but the bars do not touch each other. Frequency is represented by the height (= length) of the bar. Conventional wisdom dictates that bar diagrams should not be used for continuous measurement variables, except in one situation. This is when there is a high degree of rounding, resulting in multiple observations of each measurement. Observations should then not be grouped into class intervals. Instead frequencies of each value should be shown separately using a line diagram. Histograms should generally not be used to display discrete measurement variables, although they frequently are, especially when some values are grouped together in classes.

### More powerful and less arbitrary approaches

The class interval approach works well when applied to extremely large sets of values, every one of which is different, providing there is a suitable choice of breakpoints. But when applied to small sets, or discrete data, these methods can produce very misleading results. This is partly because they are sensitive to the choice of breakpoints which is essentially arbitrary. Judicious choice of breakpoints can make a non-symmetrical distribution look symmetrical or vice versa. Arbitrary elements in statistical analysis are always potentially biased and should be avoided. The other reason for misleading results is that a noticeable proportion of values may coincide with the breakpoints. This can be serious when dealing with rounded values (as for example the cow weight data).
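Both problems can be demonstrated with a short sketch. The weights below are hypothetical values rounded to 10 kg (they are not the cow weight data), and `bin_counts` is a helper written for this illustration rather than a library function. The same values are binned three times with the same 20 kg class width, changing only where the breakpoints fall and which side of each interval is closed.

```python
import math

def bin_counts(values, start, width, n_bins, right_closed=False):
    """Count values in n_bins equal-width classes beginning at start.

    By default classes are closed on the left, [a, b); with right_closed=True
    they are closed on the right, (a, b]. Values outside all classes are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        position = (v - start) / width
        k = math.ceil(position) - 1 if right_closed else math.floor(position)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical rounded weights (kg), so many values coincide with breakpoints.
weights = [410, 420, 420, 430, 440, 440, 440, 450, 460, 460, 480, 500]

print(bin_counts(weights, start=400, width=20, n_bins=6))                     # [400, 420), ...
print(bin_counts(weights, start=400, width=20, n_bins=6, right_closed=True))  # (400, 420], ...
print(bin_counts(weights, start=390, width=20, n_bins=6))                     # [390, 410), ...
```

With these values the first call gives a peak in the third class, whereas the other two move the peak and change the apparent shape of the tails, even though the data and class width are unchanged.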

#### Density plots

An alternative approach is to examine how many values lie within a predetermined range of values. Where these ranges are non-overlapping and contiguous this is equivalent to the first approach (assuming equal-width bins) - but overlapping bins help to smooth the distribution. This method can be thought of as describing the 'density' of values.
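One way to sketch the overlapping-bin idea is to slide a window of fixed width along the measurement scale and count the values falling inside it at each position. The data are hypothetical, and `window_counts`, the 10 kg step, and the 20 kg window width are all choices made for this illustration.

```python
def window_counts(values, centres, half_width):
    """For each centre, count values within half_width of it (limits included)."""
    return [sum(1 for v in values if abs(v - c) <= half_width) for c in centres]

# Hypothetical rounded weights (kg), for illustration only.
weights = [410, 420, 420, 430, 440, 440, 440, 450, 460, 460, 480, 500]

# Windows are 20 kg wide but their centres are only 10 kg apart,
# so neighbouring windows overlap by half their width.
centres = list(range(400, 511, 10))
counts = window_counts(weights, centres, half_width=10)

for c, k in zip(centres, counts):
    print(f"{c - 10}-{c + 10} kg: {'*' * k}")
```

Because each value contributes to more than one window, the counts no longer sum to the number of observations. They describe the local density of values rather than a partition of them.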

For small samples the jittered dot plot is quite popular - where each value is plotted against a (uniformly-distributed) random number. Assuming the plot's x and y axes are at right-angles (90 degrees), that random variation is independent of the data-variable. A Gaussian smoothed distribution (described in Unit 3) is easier to interpret visually - even though, in effect, this works by adding (normally-distributed) random variation to your data.
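Both ideas can be sketched in a few lines. The data are hypothetical, the function names are ours, and the smoothing bandwidth of 10 kg is an arbitrary assumption (Unit 3 may use a different rule for choosing it). The jitter function pairs each value with a uniform random y-coordinate; the smoothing function averages a normal curve centred on each observation.

```python
import math
import random

def jitter(values, seed=0):
    """Pair each value with a uniform random y-coordinate in [0, 1]."""
    rng = random.Random(seed)          # fixed seed so the plot is repeatable
    return [(v, rng.uniform(0.0, 1.0)) for v in values]

def gaussian_density(values, x, bandwidth):
    """Average of normal curves (standard deviation = bandwidth) centred on each value."""
    total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return total / (len(values) * bandwidth * math.sqrt(2 * math.pi))

# Hypothetical rounded weights (kg), for illustration only.
weights = [410, 420, 420, 430, 440, 440, 440, 450, 460, 460, 480, 500]

points = jitter(weights)               # (x, y) pairs for a jittered dot plot

step = 5
grid = range(350, 561, step)           # x positions at which to evaluate the curve
curve = [(x, gaussian_density(weights, x, bandwidth=10)) for x in grid]
area = sum(d * step for _, d in curve) # should be close to 1 for a density
```

The jitter only spreads tied points apart on the y-axis, leaving the x-values untouched, whereas the smoothed curve is the distribution you would expect if normally-distributed noise were added to each observation.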

#### Rank-based methods

The rank scatterplot and related plots are powerful, but underused, ways of exploring and comparing distributions - irrespective of sample size. The rank of an observation is defined as the number of observations that are less than or equal to that observation. In a rank scatterplot the rank of each observation is plotted against its value, which reveals the cumulative distribution. Plotting each observation's rank within a given value, against that value, produces something which looks like a bargraph. More sophisticated plots employ other measures of rank, such as the relative rank, corrected relative rank, cumulative proportion, or P-value. Conventionally these all use the rank order, or sequential rank - but for estimation and inference, a mean rank (or expected rank, or mid-P-value) may be less biased.
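As a minimal sketch of these definitions, the vole data tabulated earlier can be expanded back into 65 individual observations and ranked. The function names are ours; `mid_rank` is one common way to compute a mean rank for tied values (the average of the sequential ranks they would occupy), and dividing the rank by the number of observations is assumed here as the cumulative proportion.

```python
def rank(values, x):
    """Number of observations less than or equal to x."""
    return sum(1 for v in values if v <= x)

def mid_rank(values, x):
    """Mean of the sequential ranks shared by the observations tied at x."""
    below = sum(1 for v in values if v < x)
    tied = sum(1 for v in values if v == x)
    return below + (tied + 1) / 2

# Expand the vole frequency table into individual observations.
frequencies = {1: 47, 2: 12, 3: 3, 5: 2, 12: 1}
females = [value for value, freq in frequencies.items() for _ in range(freq)]
n = len(females)

for value in sorted(frequencies):
    r = rank(females, value)
    print(f"value {value:>2}: rank {r:>2}, mean rank {mid_rank(females, value):>4.1f}, "
          f"cumulative proportion {r / n:.3f}")
```

Plotting rank against value for every observation gives the rank scatterplot. The cumulative proportions reproduce the relative cumulative frequencies in the vole table, which is why the plot reveals the cumulative distribution.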

Be aware: P-values are seldom used for 'raw' data, but are commonly applied to statistics - or, to be more exact, to how those statistics are supposedly distributed. Statisticians sometimes use P-value plots to show how chance causes a statistic to be distributed, or as an alternative to confidence intervals. Most biologists assume P-values only refer to statistical tests, not confidence intervals or sample quantiles. If you wish to emphasize your plot represents observations, rather than statistics, you could describe it as an empirical P-value plot.

Values ordered by rank (or relative rank) divide up a distribution, and are known as quantiles. Accordingly a plot of rank on value is also known as a rank quantile plot, but when a step plot is added (to emphasize its discreteness) it is called an empirical cumulative distribution function (ECDF). Plotting quantiles of one variable against the same quantiles of another variable is known as a quantile-quantile plot - and can be used to compare two sample distributions, or to compare a sample with a theoretical distribution.

| Reason | Relative frequency (%) |
|---|---|
| Direct consumption | 59 |
| Bushmeat trade | 14 |
| Damage to poultry | 18 |
| Damage to crops | 7 |
| Other reasons | 2 |

Cumulative frequencies are useful for any variable that you can rank, but not for nominal variables. This is a small data set from Carpaneto & Fusari (2000) on the reasons people gave for hunting wildlife in Tanzania. We can still express the data as a frequency distribution, but not as a cumulative distribution - because, although we can pool classes, there is no logical order in which to rank them.

Bar diagrams are most commonly used to display the frequency distributions of nominal variables. Either frequency or relative frequency can be plotted on the y-axis. Multiple batches of nominal data can be displayed using either a multiple bar diagram or a stacked bar diagram. As such they can be used to show one frequency distribution within another frequency distribution. Another option is the dot plot. One axis of the dot plot (usually the horizontal) is a scale covering the range of quantitative values to be plotted. The other axis (usually the vertical) shows descriptive labels that are associated with each of the numeric values. The data objects usually are sorted according to the quantitative values.

A third option is the pie chart. They are called pie charts because each sector looks like the slice of a pie. The angle of each sector is proportional to the relative frequency of that category. Providing the 'pie' is circular, the area of the sector is then also proportional to that frequency. These are seen more often in newspapers, magazines and Government reports than in scientific journals - and are generally disapproved of by statisticians. This is because it is much harder to visually compare areas of segments than it is to compare heights of bars.
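The geometry of a pie chart follows directly from the relative frequencies. As a sketch, the sector angles and areas for the hunting data from Carpaneto & Fusari (2000) can be computed as follows; the unit radius is an arbitrary assumption, since only the proportions matter.

```python
import math

# Relative frequencies (%) of reasons for hunting, from the table above.
reasons = {
    "Direct consumption": 59,
    "Bushmeat trade": 14,
    "Damage to poultry": 18,
    "Damage to crops": 7,
    "Other reasons": 2,
}

radius = 1.0                           # arbitrary; any radius gives the same proportions
total = sum(reasons.values())

# Angle of each sector in degrees, proportional to its relative frequency.
angles = {name: 360 * pct / total for name, pct in reasons.items()}

# Area of a circular sector = (angle / 360) * pi * r^2, so area is proportional to angle.
areas = {name: angle / 360 * math.pi * radius ** 2 for name, angle in angles.items()}

for name in reasons:
    print(f"{name:<20} {angles[name]:6.1f} degrees, area {areas[name]:.3f}")
```

The smallest categories occupy only a few degrees each, which illustrates why they are hard to compare by eye, whereas the corresponding bar heights would differ visibly.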