"It has long been an axiom of mine that the little things are infinitely the most important"
Why smooth a distribution?
The most common reason for looking at a sample's distribution is to decide if "it is normal" - either to fulfil the assumptions of some statistical procedure, or to decide how it might be made normal. But, since no finite sample can ever be truly
One practical difficulty of answering such questions (given only the information available from a single finite sample) is that, aside from information about its source (or parent) population, a sample also includes random
Despite their shortcomings, histograms (and frequency polygons) are still highly popular. One reason for this is that most people find them much easier to interpret and compare than jittered dotplots, cumulative distributions or QQ plots. Another important advantage of histograms is that, by summing (or averaging) the observed frequency of values within class intervals, the fine structure within each interval is removed. Unfortunately this does not remove sampling variation between class intervals - which causes problems if you have a small sample or use many class intervals. Furthermore, because class intervals are both discrete and arbitrary, they provide a biased estimate of the population
As a result, varying the width of class-interval within a histogram is generally frowned-upon, and while increasing these intervals may achieve some smoothing, it is unwise to deviate too far from the recommended number of
Let us therefore consider some other ways by which you might smooth a sample distribution.
Smoothing by running averages
Moving averages are seldom used to smooth frequency distributions - but this is not just because they require more arithmetic than a histogram. Nevertheless, to expose issues common to more sophisticated smoothing functions, let us consider how you might use running means (or medians) to smooth an observed frequency distribution.
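As a rough illustration of the idea, the sketch below smooths an observed frequency distribution with a centred running mean and a running median. The sample, the class-interval width of 1, and the window of k = 5 intervals are all assumptions chosen for illustration, not recommendations.

```r
# Hypothetical sample: 40 values from a skewed (gamma) population
set.seed(1)
y <- rgamma(40, shape = 2, scale = 5)

# Observed frequencies in narrow class intervals of width 1
breaks <- seq(floor(min(y)), ceiling(max(y)), by = 1)
freq <- hist(y, breaks = breaks, plot = FALSE)$counts

# Centred running mean over k adjacent intervals (k must be odd).
# Padding with k - 1 zeros at each end lets frequency 'spill' beyond
# the observed range, so the total frequency is unchanged.
k <- 5
padded <- c(rep(0, k - 1), freq, rep(0, k - 1))
run_mean <- as.numeric(stats::filter(padded, rep(1 / k, k), sides = 2))
run_mean <- run_mean[!is.na(run_mean)]

# Running median over the same window, for comparison
run_med <- runmed(freq, k)

sum(freq)      # total observed frequency
sum(run_mean)  # the same total after smoothing
```

Widening the window k gives a smoother result but spreads frequency further beyond the observed range, which is the same trade-off that arises with the bandwidth of more sophisticated smoothing functions.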
All of this is particularly frustrating when you are trying to decide how normal your data are - or hope to use your sample as some sort of 'model' of a wider population from which it was (presumably) drawn. Simply 'fitting' a normal distribution might produce a nice smooth curve, but it makes an awful lot of assumptions - among the most important of which is that your sample represents a normal distribution. So whilst the result is undoubtedly smooth, it completely ignores any skew or kurtosis in your sample, and can be horribly misleading.
Smoothing by jittering
Perhaps surprisingly, the simplest way of smoothing a sample distribution is to add a small amount of random error to each observation - a process known as 'jittering'. The difficult practical question is how those errors should be distributed. For applications such as jittered dot-plots a random uniform distribution is preferred. But errors that are uniformly distributed within an arbitrary range, and wholly absent outside it, would not smooth a sample distribution - on average they give a stepped distribution. When smoothing a sample distribution, the optimal distribution for jittering depends upon how that sample's source population is distributed. For normal-ish populations (even poly-modal ones) the optimal error distribution is normal.
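A minimal sketch of normal jittering is given below. It assumes a made-up left-skewed sample of n observations, R jittered copies of each observation, and an error standard deviation of half the sample standard deviation; that last value is an arbitrary illustrative choice, not an optimum.

```r
set.seed(42)

# Hypothetical left-skewed sample of n = 30 observations
y <- 100 - rgamma(30, shape = 2, scale = 4)
n <- length(y)

R <- 1000           # number of jittered copies of each observation (assumed)
h <- 0.5 * sd(y)    # sd of the added normal errors (arbitrary choice)

# Each observation is repeated R times, and every copy receives its own
# normally distributed error with mean zero
jittered <- rep(y, each = R) + rnorm(n * R, mean = 0, sd = h)

# Histogram of the n x R jittered values, with the original sample as a rug
hist(jittered, breaks = 50, freq = FALSE,
     main = "Jittered sample", xlab = "y")
rug(y)
```

Because the errors have a mean of zero, the jittered values keep (approximately) the sample mean, but their spread is necessarily larger than that of the sample.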
The distribution of these (n×R) jittered values is unarguably much smoother than that of the sample (of n observations). But, since we do not know what population these data represent, we cannot know how well the smoothed distribution approximates it. Since this sample is clearly left-skewed, it seems unlikely that it represents a normal or lognormal population. In which case, although jittering made the underlying distribution fairly obvious, we should accept that applying normally-distributed errors may have distorted the distribution slightly.
The main difference between these quantiles is that the jittered distribution has longer tails which, in the absence of information about their parent population, may be perfectly reasonable. Of course if these data represented a lognormal population, applying normal errors would tend to reduce the skew a little - and could make the lowest values fall below zero (thus outside its bounds). In that case we should either use a lognormal jittering function - or transform the data to approximate normal prior to jittering.
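The difference between jittering on the original scale and jittering after a log transformation can be sketched as follows. The lognormal sample, the number of copies R, and the error standard deviations (half the standard deviation on each scale) are assumptions made purely for illustration.

```r
set.seed(7)

# Hypothetical sample from a lognormal population (strictly positive values)
y <- rlnorm(30, meanlog = 0, sdlog = 1)
n <- length(y)
R <- 1000   # jittered copies per observation (assumed)

# Normal errors added on the original scale: some values fall below zero
h_raw <- 0.5 * sd(y)
jit_raw <- rep(y, each = R) + rnorm(n * R, mean = 0, sd = h_raw)

# Normal errors added on the log scale, then back-transformed:
# every jittered value remains above zero
ly <- log(y)
h_log <- 0.5 * sd(ly)
jit_log <- exp(rep(ly, each = R) + rnorm(n * R, mean = 0, sd = h_log))

# Compare selected quantiles of the sample and both jittered distributions
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
round(rbind(sample   = quantile(y, probs),
            raw      = quantile(jit_raw, probs),
            logscale = quantile(jit_log, probs)), 2)
```

Jittering the logged values and back-transforming is equivalent to applying multiplicative lognormal errors, so it respects the lower bound at zero.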
Whilst jittering is reasonably straightforward, it is computationally intensive for large samples, unless you want to construct some sort of 'model' population. Also, thus far, we have ignored any theoretical reasoning for it, and have said nothing about how we decided upon which parameters to use in our jittering function.
Mean probability density
Smoothing a sample distribution by adding random normal (jittered) error is only justifiable to the extent that the resulting values are possible - and if the variable is discrete, or strictly bounded, this may not be the case. But, provided you merely wish to gauge the overall shape, these concerns may not matter so much.
For normal (Gaussian) jittering, if we are to avoid changing the distribution's location, those added errors need to be taken from a population whose mean is zero. Therefore, if we consider just one observation from the sample, we should expect its jittered values to be normally distributed - with a mean equal to that observation's value.
At first sight, you might assume the optimum standard deviation for a Gaussian smoothing function would be the same as the standard deviation of whichever sample you are jittering. However, applying that to extreme observations produces overlong tails - and, because values near the distribution's centre are close together, the smoothing function does not need to be so strong. In other words, using the sample standard deviation produces too large a bandwidth - and biases the smoothed distribution unduly towards normal. Even so, all else being equal, the optimum bandwidth is directly related to the sample standard deviation. But, because small samples tend to have the least smooth distributions, the optimum bandwidth must also allow for the sample size.
For random samples of a normal population the optimum bandwidth for Gaussian smoothing is $1.06 \times s_y / n^{1/5}$, where $s_y$ is the sample standard deviation and $n$ is the sample size.
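The bandwidth rule above, and the idea of averaging one normal probability density per observation, can be sketched as follows. The sample is made up, and the grid of 512 points extending 4 bandwidths beyond the data is an arbitrary choice. R's built-in density() function is used only as a cross-check at the same bandwidth.

```r
set.seed(3)

# Hypothetical sample of n = 25 observations
y <- rnorm(25, mean = 50, sd = 10)
n <- length(y)

# Bandwidth from the rule for samples of a normal population
h <- 1.06 * sd(y) / n^(1/5)

# Grid of values at which to evaluate the smoothed distribution
x <- seq(min(y) - 4 * h, max(y) + 4 * h, length.out = 512)

# One normal density per observation (mean = that observation, sd = h),
# averaged across the n observations at each grid value
dens <- rowMeans(outer(x, y, function(xi, yi) dnorm(xi, mean = yi, sd = h)))

# The built-in kernel density estimator, given the same bandwidth
d <- density(y, bw = h, kernel = "gaussian")

plot(x, dens, type = "l", xlab = "y", ylab = "Mean probability density")
lines(d, lty = 2)   # built-in estimate, dashed
rug(y)              # the unsmoothed sample
```

Since each of the n densities encloses an area of one, so does their mean, which is why the result can be read as an estimated probability density without further rescaling.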
The histogram of jittered observations shown
Recall that, if the bandwidth were zero, their mean probability density would be equivalent to the unsmoothed sample distribution - but since those probability densities are infinitely high, we have rescaled them to a rugplot.
If you want R to plot a Gaussian smoothed sample distribution, plus a jittered dotplot, assign your sample to a variable called y then enter the instructions below. You do not need to sort the data into ascending order, but there must be no missing values.
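One way such instructions might look is sketched here; it is not necessarily the original code. The values assigned to y are placeholders to be replaced with your own sample, and the height of the band used for the jittered dotplot (a quarter of the peak density) is an arbitrary assumption.

```r
# Placeholder sample: replace with your own data (no missing values)
y <- c(12.1, 14.3, 15.0, 15.2, 16.8, 17.1, 17.4, 18.0, 18.3, 18.9,
       19.2, 19.5, 19.9, 20.4, 21.0)

n <- length(y)
h <- 1.06 * sd(y) / n^(1/5)   # bandwidth for Gaussian smoothing

# Gaussian smoothed sample distribution
d <- density(y, bw = h, kernel = "gaussian")
plot(d, main = "Gaussian smoothed sample distribution", xlab = "y")

# Jittered dotplot beneath the curve: the vertical positions are uniform
# random values, used only to separate overlapping points
set.seed(1)
points(y, runif(n, min = 0, max = 0.25 * max(d$y)), pch = 16, cex = 0.7)
```

Note that the dotplot is jittered uniformly and only in the vertical direction, so the horizontal position of each point is still its observed value.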