InfluentialPoints.com
Biology, images, analysis, design...
 Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Gaussian smoothing

#### Why smooth a distribution?

The most common reason for looking at a sample's distribution is to decide if "it is normal" - either to fulfil the assumptions of some statistical procedure, or to decide how it might be made normal. But, since no finite sample can ever be truly normal, what is really of interest is whether that data represents a normal population.

One practical difficulty of answering such questions (given only the information available from a single finite sample) is that, aside from information about its source (or parent) population, a sample also includes random variation. In addition to that sampling variation, the distribution of any finite set of values is discrete - which is why an empirical cumulative distribution function is stepped. The net result of which is that, while most population distributions employed by statisticians have very simple shapes, samples of those populations contain information which has nothing to do with whatever population they represent. Especially in small samples, that unwanted 'fine-structure' obscures the overall shape of the sample's distribution.

Despite their shortcomings, histograms (and frequency polygons) are still highly popular. One reason for this it most people find them much easier to interpret and compare than jittered dotplots, cumulative distributions or QQ plots. Another important advantage of histograms is that, by summing (or averaging) the observed frequency of values within class intervals the fine structure within each interval is removed. Unfortunately this does not remove sampling variation between class intervals - which causes problems if you have a small sample or use many class-intervals. Furthermore, because class intervals are both discrete and arbitrary, they provide a biased estimate of the population distribution. Moreover, the wider the class intervals, the more information is lost.

As a result, varying the width of class-interval within a histogram is generally frowned-upon, and while increasing these intervals may achieve some smoothing, it is unwise to deviate too far from the recommended number of intervals. Even so, given a fixed number of intervals of the same width, any number of histograms can be plotted if you vary their start-point. Perhaps surprisingly, plotting many histograms of the same data does not smooth a sample distribution very well - as illustrated by the second graph below.

These data are observed serum calcium concentration in cattle homozygous for renal tubial dysplasia. The data are from Ohba et al. (2002).

{Fig. 1}

Let us therefore consider some other ways by which you might smooth a sample distribution.

#### Smoothing by running averages

If you are familiar with running means you might assume they would be an attractively simple way to smooth frequency distributions.

Moving averages are seldom used to smooth frequency distributions - but this is not just because they require more arithmetic than a histogram. Nevertheless, to expose issues common to more sophisticated smoothing functions, let us consider how you might use running means (or medians) to smooth an observed frequency distribution.

For example, the graph below shows a 3 point running mean obtained from the head capsule widths of bamboo borer larvae.

{Fig. 2}

• A common first step, as when doing histograms intervals by hand, is to arrange (or rank) your observations in ascending order.
• If your observations are a large sample of some discrete variable (y), such as the number of animals observed, because values occur at regular intervals you can calculate a running mean frequency in the usual fashion - which ignores what value of y each frequency corresponds to.
• This reasoning does not work for observations of a measurement variable because:
1. Being continuous, every value has the same observed frequency (one) - unless you truncate them, or round them, or otherwise group them as class intervals.
2. Those observed frequencies almost never correspond to equally-spaced values of y.

• The next step is to decide upon how many values to use for your smoothing function - in other words you need to decide upon a suitable bandwidth. Running means are conventionally calculated from odd-numbers (1,3,5,7,9...) of adjacent values. If you are still thinking in terms of histograms, this is equivalent to their class interval. Using just one point per mean achieves no smoothing at all, and has zero bandwidth. When you use too many points (a high bandwidth), the smoothed distribution will be excessively flattened - in the worst case this is equivalent to calculating the sample mean, or using one class interval whose boundaries enclose the sample minimum and maximum. For samples of a unimodal distribution (such as a normal population), the more variable your data, the more points the smoothing function will need to provide effective smoothing. For roughly normal data the optimal bandwidth will be related to the sample's standard deviation - but that may not work with polymodal data such as those cornborer headcapsule widths. Then again, if the data are strongly skewed (e.g. lognormal), the optimal bandwidth for normal data can unduly flatten the steeper side.

• Running means can run into difficulties with the ends of a sample distribution - because there may be no information outside of the observed range.
• Allocating values outside the observed range a frequency of zero, and using that to calculate a running mean assumes those values can be observed - or actually mean something. For example a calcium plasma concentration of zero may be simply impossible for a living animal. Using such values can unrealistically round-off the distribution's ends.
• Conversely, when you treat observations outside the observed range of values as not available (or missing) a simple 5-point running mean will truncate each end of the distribution by two values. One way round this problem is to ignore missing values when calculating those means, and accept the ends are less smoothed.

• Like class intervals, all the running means we have discussed thus far give the same weight to every observation within their remit, and ignore all observations outside. This all-or-nothing criterion is why running means, medians, or multiple sets of class intervals do not smooth samples of continuous distributions very well - as shown by Fig.1 above. These problems are worst for small samples.

All of this is particularly frustrating when you are trying to decide how normal your data is - or hope to use your sample as some sort of 'model' of a wider population from which it was (presumably) drawn. Simply 'fitting' a normal distribution might produce a nice smooth curve, but it makes an awful lot of assumptions - among the most important of which is your sample represents a normal distribution. So whilst the result is undoubtedly smooth, it completely ignores any skew or kurtosis in your sample, and can be horribly misleading.

#### Smoothing by jittering

Perhaps surprisingly, the simplest way of smoothing a sample distribution is to add a small amount of random error to each observation - a process known as 'jittering'. The difficult practical question is how should those errors be distributed? For applications such as jittered dot-plots a random uniform distribution is preferred. But errors uniformly distributed within an arbitrary range, but wholly absent outside of it, would not smooth a sample distribution - and (on average) give a stepped distribution. When smoothing a sample distribution, the optimal distribution for jittering depends upon how that sample's source population is distributed. For normal-ish populations (even poly-modal ones) the optimal error distribution is normal.

The graph below shows the result of applying 5000 random normal errors to each of the calcium concentrations we showed above.

{Fig. 3}

The distribution of these (n×R) jittered values is unarguably much smoother than that of the sample (of n observations). But, since we do not know what population these data represent, we cannot know how well the smoothed distribution approximated it - since this sample is clearly left-skewed, it seems unlikely it represents a normal or lognormal population. In which case, although jittering made the underlying distribution fairly obvious, we should accept that applying normally-distributed errors may have distorted the distribution slightly.

A simple way to compare the observed and jittered values is to plot a scatterplot of their quantiles, as shown below.

{Fig. 4}

The main difference between these quantiles is the jittered distribution has longer tails which, in the absence of information about their parent population, may be perfectly reasonable. Of course if these data represented a lognormal population, applying normal errors would tend to reduce the skew a little - and could make the lowest values fall below zero (thus outside its bounds). In that situation case we should either use a lognormal jittering function - or transform the data to approximate normal prior to jittering.

Whilst jittering is reasonably straightforward it is computationally intensive for large samples, unless you want to construct some sort of 'model' population. Also, thus far, we have ignored any theoretical reasoning for it, and have said nothing about how we decided upon which parameters to use in our jittering function.

#### Mean probability density

Smoothing a sample distribution by adding random normal (jittered) error is only justifiable to the extent that the resulting values are possible - and if the variable is discrete, or strictly bounded, this may not be the case. But, provided you merely wish to gauge the overall shape, these concerns may not matter so much.

Assuming your data are sort-of normal, the key issues are:
1. What are the optimum parameters for the smoothing function?
2. What is the quickest and best way to calculate and display the smoothed function?

For normal (Gaussian) jittering, if we are to avoid changing the distribution's shape, those added errors need to be taken from a population whose mean is zero. Therefore, if you consider just one observation from your sample, we should expect its jittered values to be normally distributed - with a mean equal to that observation's value.

At first sight, you might assume the optimum standard deviation for a Gaussian smoothing function would be the same as the standard deviation of whichever sample you are jittering. However, applying that to extreme observations produces overlong tails - and, because values near the distribution's centre are close together, the smoothing function does not need to be so strong. In other words, using the sample standard deviation produces too large a bandwidth - and biases the smoothed distribution unduly towards normal. Even so, all else being equal, the optimum bandwidth is directly related to the sample standard deviation. But, because small samples tend to have the most unsmooth distribution, the optimum bandwidth must also allow for the sample size.

For random samples of a normal population the optimum bandwidth for Gaussian smoothing is 1.06×sy/n1/5

where:
• h is the standard deviation for normal jittering, known as the

bandwidth parameter

• sy is the sample standard deviation of variable y,
• n is the number of observations in that sample,
• n1/5 is the fifth root of n

The histogram of jittered observations shown above was obtained by applying 5000 random normal errors, with those parameters, to each of the n observations in that sample. For large samples a quicker way to find this distribution is to calculate their mean probability density for a series of possible Y values (not just the ones you have observed). Below you can see the effect of three different values of the bandwidth parameter, h, when smoothing a sample of calcium concentrations - compare the green line to our histogram of jittered observations.

{Fig. 5}

Recall that, if the bandwidth were zero, their mean probability density would be equivalent to the unsmoothed sample distribution - but since those probability densities are infinitely high, we have rescaled them to a rugplot.

If you want R to plot a Gaussian smoothed sample distribution, plus a jittered dotplot, assign your sample to a variable called y then enter the instructions below. You do not need you to sort the data into ascending order, but there must be no missing values.

Or, using its 'density' function, R can plot a smoothed distribution as shown in Unit 1.