Defining frequency distributions
Let us assume you wish to define the frequency distribution of n values of a variable 'X'.
- You might assume you could define a frequency distribution simply as the frequency (f) with which each value of X occurs, or (if n is very large) you might prefer to say the frequency distribution is the relative frequency of each value of X. So if X has n values, the relative frequency could be calculated as f/n for each possible value of X. In other words, the relative frequency is the proportion (p = f/n) with a given value, x.
That definition is OK provided X is a discrete variable, and n is moderate - but it does not work well if every value of X is different, especially if n is infinitely large.
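As a rough illustration of this first definition, the sketch below counts how often each value occurs and divides by n. The function name relative_frequencies and both data sets are invented for this example.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {x: f/n} for each distinct value x in `values` (hypothetical helper)."""
    n = len(values)
    counts = Counter(values)  # f for each distinct value
    return {x: f / n for x, f in sorted(counts.items())}

# Illustrative discrete data with n = 8 (made up for this sketch).
sample = [1, 2, 2, 3, 3, 3, 4, 4]
rel = relative_frequencies(sample)
print(rel)

# When every value is different, each one simply gets p = 1/n,
# which says very little about the shape of the distribution.
all_different = [0.31, 1.74, 2.05, 2.9]
flat = relative_frequencies(all_different)
print(flat)
```

The second call shows the weakness noted above: if no value repeats, every relative frequency is 1/n.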
- An alternative approach, much beloved by statisticians, is to define the frequency distribution in terms of what proportion of X is less than or equal to a given value, x. (Or, alternatively, what proportion of X is greater than or equal to a given value, x.) Notice that, although this definition refers to a cumulative distribution, it can be expressed as the proportion of values within a given interval - plus the proportion which equals that interval's upper bound (or lower bound).
Notice that when X can only have whole-number values, if the bounds fall between successive possible values of X, this method can produce the same results as those of the first definition.
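The cumulative definition can be sketched in the same way. The helper names proportion_le and proportion_ge, the sample, and the choice of bounds midway between whole numbers are all assumptions made for illustration.

```python
def proportion_le(values, x):
    """Proportion of values less than or equal to x (hypothetical helper)."""
    return sum(1 for v in values if v <= x) / len(values)

def proportion_ge(values, x):
    """Proportion of values greater than or equal to x (hypothetical helper)."""
    return sum(1 for v in values if v >= x) / len(values)

# Illustrative whole-number data (made up for this sketch).
sample = [1, 2, 2, 3, 3, 3, 4, 4]

# Bounds placed midway between each possible value of X.
bounds = [0.5, 1.5, 2.5, 3.5, 4.5]
cumulative = [proportion_le(sample, b) for b in bounds]

# The proportion within each interval is the difference between
# successive cumulative proportions.
interval_props = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
print(cumulative)
print(interval_props)
```

With these bounds, each interval contains exactly one possible value of X, so the interval proportions match the relative frequencies f/n from the first definition.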
- A third approach is to give each value of X a rank (r), or a relative rank r/n. If no two values of X are the same this definition produces identical results to the previous one, but if X is discrete (has tied values) this is only true if we use the highest rank (or the lowest rank) of any tied values. Nevertheless, in some situations it is more useful to use the mean rank of tied values.
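A minimal sketch of ranking with ties follows. The function name ranks, its ties argument, and the data are all invented here. It simply counts values directly, which is slow for large n but keeps the three tie rules easy to compare.

```python
def ranks(values, ties="max"):
    """Rank each value from 1 to n; `ties` is 'max', 'min' or 'mean' (hypothetical helper)."""
    result = []
    for v in values:
        lowest = sum(1 for u in values if u < v) + 1   # lowest rank among its ties
        highest = sum(1 for u in values if u <= v)     # highest rank among its ties
        if ties == "max":
            result.append(highest)
        elif ties == "min":
            result.append(lowest)
        elif ties == "mean":
            result.append((lowest + highest) / 2)
        else:
            raise ValueError("ties must be 'max', 'min' or 'mean'")
    return result

# Illustrative data with tied values (made up for this sketch).
sample = [3, 1, 2, 3, 2, 3]
n = len(sample)

max_ranks = ranks(sample, "max")
mean_ranks = ranks(sample, "mean")

# Using the highest rank of tied values, r/n equals the proportion
# of values less than or equal to x.
relative_max = [r / n for r in max_ranks]
proportion_le = [sum(1 for u in sample if u <= v) / n for v in sample]
print(max_ranks, mean_ranks, relative_max)
```

Here the relative ranks based on the highest tied rank reproduce the cumulative proportions, whereas the mean ranks do not.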
- One other approach is to define a frequency distribution in terms of how commonly similar values occur. For example, given a possible value (x), we could find what proportion of X lies within a given range either side of x. In other words, this defines a frequency distribution in terms of how densely observations are distributed about each value. If that range is extremely small, this corresponds to the cumulative distribution's slope. If that range is large, and the density is averaged for all values of x, the frequency distribution is smoothed - akin to obtaining its running mean. Among other things, smearing out and averaging a sample distribution can usefully remove spurious fine structure.
There is, of course, no reason why that range must be symmetrical about x, or calculated for all possible values of X - or why observations distant from x must be given the same weight as values close to x. If that range is sufficiently small, and we only consider observed values of x, then the results are identical to the very first approach, above.
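To make the density idea concrete, here is a small sketch assuming a symmetrical, equally weighted range. The names window_proportion, window_density and half_width, and the data, are assumptions for illustration only.

```python
def window_proportion(values, x, half_width):
    """Proportion of values lying within half_width either side of x (hypothetical helper)."""
    inside = sum(1 for v in values if abs(v - x) <= half_width)
    return inside / len(values)

def window_density(values, x, half_width):
    """Proportion per unit of X: the window proportion divided by the window's width."""
    return window_proportion(values, x, half_width) / (2 * half_width)

# Illustrative data (made up for this sketch).
sample = [1, 2, 2, 3, 3, 3, 4, 4]

# A narrow range around an observed value recovers its relative frequency f/n.
narrow = window_proportion(sample, 3, 0.25)

# A wider range also takes in neighbouring values, smoothing the distribution.
wide = window_proportion(sample, 3, 1)
wide_density = window_density(sample, 3, 1)
print(narrow, wide, wide_density)
```

An asymmetrical range, or one that gives distant observations less weight, could be handled by replacing the simple count with a weighted sum. That variation is not shown here.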