"It has long been an axiom of mine that the little things are infinitely the most important"
Measures of location
On this page: Arithmetic mean, Geometric mean, Weighted mean, Median, mode & mid-range, Running means & medians
The table below shows the weights of thirty cattle.
As you can see, the distribution was a little skewed to the left, but with only thirty observations this is not surprising.
Using the individual observations, the arithmetic mean cattle weight
The arithmetic mean can also be estimated from data grouped in a frequency distribution by assuming the values are concentrated at the centre of the interval. If the distribution is symmetrical, this assumption will be
Using the grouped data, the arithmetic mean cattle weight
Note that the value estimated from grouped data is close to, but not identical to, the value calculated from the raw data.
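The two calculations described above can be sketched in Python. The weights below are hypothetical (they are not the thirty cattle in the table), and the class limits (20 kg classes starting at 451 kg) are an assumption chosen only to mirror the kind of grouping used on this page.

```python
from collections import Counter

# Hypothetical weights in kg; NOT the cattle data from the table.
weights = [452, 478, 495, 503, 512, 520, 520, 535, 541, 566]


def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)


def grouped_mean(values, lower, width):
    """Estimate the mean from a frequency distribution.

    Whole-number values are binned into classes lower..lower+width-1,
    lower+width..lower+2*width-1, and so on. Every value is then treated
    as if it sat at the midpoint of its class.
    """
    counts = Counter((v - lower) // width for v in values)
    total = sum(counts.values())
    weighted_sum = sum(
        (lower + k * width + (width - 1) / 2) * freq
        for k, freq in counts.items()
    )
    return weighted_sum / total


raw = arithmetic_mean(weights)
grouped = grouped_mean(weights, lower=451, width=20)
print(f"raw mean = {raw:.1f} kg, grouped estimate = {grouped:.1f} kg")
```

With these made-up values the grouped estimate differs slightly from the raw mean, because the observations are not spread evenly around the class midpoints.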
We will take as an example data on the number of helminth eggs in a small sample of cattle faeces.
The geometric mean (G) is calculated as follows:
The geometric mean number of helminth eggs is 34, much lower than the arithmetic mean of 64. The geometric mean gives a more representative idea of central tendency, whilst the arithmetic mean is heavily influenced by one large number. If there are any zeros in the data, the conventional approach is to add one to each observation before taking logarithms, and then subtract one from the geometric mean after detransformation. Unfortunately, this process makes the (corrected) geometric mean a biased estimator - a fact we will return to later.
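As a minimal sketch of the calculation described above, the geometric mean can be computed as the antilog of the mean of the logged observations. The egg counts below are hypothetical (they are not the sample used on this page), natural logarithms are used although any base gives the same result, and the function names are illustrative only.

```python
import math


def geometric_mean(values):
    """Antilog of the arithmetic mean of the logs (positive values only)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


def geometric_mean_plus_one(values):
    """Conventional zero-tolerant variant: add one before taking logs,
    then subtract one after detransformation. This estimator is biased."""
    mean_log = sum(math.log(v + 1) for v in values) / len(values)
    return math.exp(mean_log) - 1


# Hypothetical egg counts with one very large value.
egg_counts = [1, 10, 100, 1000]

arithmetic = sum(egg_counts) / len(egg_counts)
geometric = geometric_mean(egg_counts)
print(f"arithmetic = {arithmetic:.2f}, geometric = {geometric:.2f}")
```

For these made-up counts the single large value pulls the arithmetic mean far above the geometric mean, which is the same pattern as in the helminth example.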
We will use the data from the BSDS story in Unit 1 to demonstrate calculation of a weighted mean. The table gives seroprevalence of BSDS in 11 herds, each with different numbers of cattle.
Previously we obtained the overall prevalence by dividing the total number of infections (5) by the total number of animals (2370) and multiplying by 100, giving a prevalence of 0.211%.
But we get exactly the same
Median, mode and mid-range
We will use the same data on the weights of 30 cattle to work out the median, mode and mid-range.
To work out the median we first rank the weights of the 30 cattle from lowest to highest. The median is the centre-most value of the ranked data - in this case mid-way between the 15th and 16th value.
If we were to use the raw data to calculate the mode, we would find three modes at 520, 535 and 545 kg. However, the mode should normally be calculated from grouped frequency data as shown here:
With this grouping of the data the mode is at 531-550 kg, or 540.5 kg.
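The three measures can be sketched with Python's standard library. The weights are hypothetical (not the 30 cattle above), the 20 kg classes starting at 451 kg are an assumption, and `statistics.multimode` needs Python 3.8 or later.

```python
import statistics
from collections import Counter

# Hypothetical weights in kg; NOT the cattle data from the table.
weights = [452, 478, 495, 503, 512, 520, 520, 535, 541, 566]

# Median: middle value of the ranked data. With an even number of
# observations it is mid-way between the two central values.
median = statistics.median(weights)

# Raw-data mode(s): every value that occurs most often.
raw_modes = statistics.multimode(weights)

# Mid-range: mid-way between the smallest and largest observation.
mid_range = (min(weights) + max(weights)) / 2


def modal_class(values, lower, width):
    """Return (class start, class end, class midpoint) of the most
    frequent class for whole-number values grouped into equal classes."""
    counts = Counter((v - lower) // width for v in values)
    k = max(counts, key=counts.get)
    start = lower + k * width
    return start, start + width - 1, start + (width - 1) / 2


print(median, raw_modes, mid_range, modal_class(weights, 451, 20))
```

If two classes tie for the highest frequency, `modal_class` as written reports only the first one it meets, so a fuller version would need to report all tied classes.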
Running means and medians
The data shown here represent the number of butterflies of one species observed each week along a transect.
In the first figure below we replaced each observation in the series by the mean of that observation, the two observations immediately preceding it, and the two observations immediately following it. This gives a 5-point running mean with the smoothed line starting at week 3. The second figure below shows the effect of using a 9-point running mean. Note that each mean must be centred on the observation it is replacing, so only odd numbers of points are used. The last series shown is the result of taking three consecutive 3-point running means. Note that it produces a smoother result than either the 5-point or the 9-point running mean.
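A centred running mean of the kind described above can be sketched as follows. The weekly counts are hypothetical (not the butterfly data), and the function simply drops the end points that have too few neighbours, which is one of several possible ways of handling the ends of a series.

```python
def running_mean(series, window):
    """Centred running mean with an odd window.

    The result is shorter than the input: (window - 1) / 2 points are
    lost at each end because they cannot be centred.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]


# Hypothetical weekly counts; NOT the transect data.
counts = [2, 5, 3, 12, 7, 9, 4, 15, 10, 6, 8, 3]

five_point = running_mean(counts, 5)   # first value corresponds to week 3

# Three consecutive 3-point running means.
repeated = counts
for _ in range(3):
    repeated = running_mean(repeated, 3)

print(five_point)
print(repeated)
```

Each pass of the 3-point mean removes one point from each end, so three passes lose three points at each end of the series.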
We next apply exponential smoothing to the same data. Each point is calculated as a weighted average of all preceding observations. Weighting was done in the following way.
If a is close to zero, then greater weight is given to previous observations, which results in a smoother curve. As a is increased, then more weight is given to the raw data point, so the curve becomes more irregular and more closely follows the unsmoothed data.
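The behaviour of a described above matches the usual simple exponential smoothing recurrence, in which each smoothed value is a times the current observation plus (1 - a) times the previous smoothed value. The sketch below assumes that form and assumes the series is started at the first observation; neither detail is confirmed by the text here.

```python
def exponential_smoothing(series, a):
    """Simple exponential smoothing (assumed form, see lead-in).

    smoothed[t] = a * series[t] + (1 - a) * smoothed[t - 1]
    The first smoothed value is set to the first observation.
    """
    if not 0 < a <= 1:
        raise ValueError("a must lie in the interval (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(a * x + (1 - a) * smoothed[-1])
    return smoothed


# Hypothetical weekly counts; NOT the transect data.
counts = [2, 5, 3, 12, 7, 9, 4, 15, 10, 6, 8, 3]

smooth = exponential_smoothing(counts, 0.2)   # heavy smoothing
rough = exponential_smoothing(counts, 0.8)    # follows the raw data closely
print([round(v, 2) for v in smooth])
print([round(v, 2) for v in rough])
```

Unrolling the recurrence shows that earlier observations receive geometrically declining weights, which is why a small value of a gives a smoother curve.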
Running means are still sensitive to outlying values, so if there are a few very large (or very small) values in the data set, it is better to use running medians. The effect of using a 5-point running median for the butterfly data is shown below:
Note that for these data running medians have a disadvantage in that they tend to look rather jagged. The second plot, above, demonstrates a way to get the advantages of both running medians and running means. Running medians are used for the initial smoothing, and then running means are taken of the smoothed data. Alternatively, because a median may be considered an extreme form of trimmed mean, running trimmed means can be used - for example, omitting (zero weighting) the maximum and minimum of each set of 5 observations, and calculating the mean of the remaining three.
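The three approaches in this section can be sketched together. The counts are hypothetical and include one deliberately extreme value. The 3-point window used for the second pass is an assumption, since the text does not say which running mean was applied to the medians.

```python
import statistics


def _windows(series, window):
    """Yield each centred window of odd length, dropping incomplete ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    for i in range(half, len(series) - half):
        yield series[i - half:i + half + 1]


def running_mean(series, window):
    return [sum(w) / window for w in _windows(series, window)]


def running_median(series, window):
    return [statistics.median(w) for w in _windows(series, window)]


def running_trimmed_mean(series, window=5):
    """Drop the minimum and maximum of each window, average the rest."""
    if window < 3:
        raise ValueError("window must be at least 3")
    return [sum(sorted(w)[1:-1]) / (window - 2) for w in _windows(series, window)]


# Hypothetical weekly counts with one outlier (40); NOT the transect data.
counts = [2, 5, 3, 40, 7, 9, 4, 15, 10, 6]

medians = running_median(counts, 5)
median_then_mean = running_mean(medians, 3)   # window of 3 is an assumption
trimmed = running_trimmed_mean(counts, 5)
print(medians, median_then_mean, trimmed, sep="\n")
```

With a 5-point window the trimmed mean averages the middle three values, whereas the median keeps only the middle one, so the trimmed version sits between the running mean and the running median in its resistance to outliers.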