The arithmetic mean
The arithmetic mean is the sum (or total) of a set of measurements divided by the number of measurements in that set (or group, or list). It is (debatably) the most commonly used type of average.
Algebraically speaking -
$$\text{Arithmetic mean } (\bar{Y}) = \frac{\sum Y_i}{n}$$
where:
- ΣYi indicates the sum of each of the i values of the observations in variable Y,
- n is the number of observations in the set.
If you have any problems with mathematical notation, please refer to our (very simple) introduction to the topic:
The mean is a measure of location or measure of central tendency of the distribution on its scale. The term central tendency refers to the fact that, very often in a set of observations, some values are more common than others - and the most common values tend to be similar. The most central value of a set is termed their location. There are a number of ways of calculating this location - of which the arithmetic mean is most common.
The arithmetic mean gives equal weight to all of the measurements in a set. If you display the observations as a frequency histogram, the mean is the point at which the histogram would balance. You can test this for yourself by drawing a histogram on thick card, and cutting it out. For a symmetrical distribution, the mean also divides the area under the histogram into equal halves.
With the data set on worm lengths below, the data are more or less symmetrically distributed, so we find the mean is more or less central in the distribution.
This is not the case in the next data set (based on data given by Shenoy et al. (1998)) on the number of microfilariae per ml of blood in patients suffering from brugian filariasis. The distribution is given in the first figure below. It is skewed to the right because of the presence of a few patients with very large numbers of microfilariae. For this distribution the arithmetic mean no longer reflects the commonest values, but is pulled to the right - in fact, about 73% (16/22) of the patients had fewer than the arithmetic mean number of parasites.
The second figure above shows the distribution of retail prices of tetracycline for veterinary application in Kenya. The distribution appears to be bimodal with peaks at around Ksh 150-300 and Ksh 450-600. The arithmetic mean lies between the two 'sub-distributions', at a point unrepresentative of either of the two groups of data.
We may conclude that if the distribution of a data set is (more-or-less) symmetrical, the arithmetic mean provides a reasonable measure of location for that data set. If, however, the distribution is skewed or bimodal, the arithmetic mean may be misleading.
We have only considered measurement variables above, but it is worth noting that a proportion is simply the mean of a binary variable if we denote the values of the binary variable as '0' or '1'. For example say we have 20 observations of a variable which can only take the value '0' (uninfected) or '1' (infected). If 5 individuals are infected, we have 5 1's and 15 0's. The arithmetic mean of the variable (= proportion infected) is equal to (Σ Yi)/n = 5/20 = 0.25.
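The proportion example above can be checked with a few lines of Python. This is a minimal sketch; the list `infected` is simply a hypothetical encoding of the 20 observations described in the text.

```python
# Hypothetical encoding of the example: 5 infected (1) and 15 uninfected (0).
infected = [1] * 5 + [0] * 15

# Arithmetic mean of a 0/1 variable = (sum of Yi) / n = proportion of 1's.
proportion_infected = sum(infected) / len(infected)

print(proportion_infected)
```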
Geometric and harmonic means
As we have seen, the arithmetic mean may not always be the best measure of location of a distribution. If the frequency distribution is skewed to the right, the arithmetic mean is biased by the few very large numbers. In this situation, the geometric mean is (usually) a more appropriate measure of location.
The geometric mean is defined as the nth root of the product of the individual observations. In other words, instead of adding the observations together, you multiply them and then take the nth root. We can write this as:
Algebraically speaking -
$$\text{Geometric Mean } (G) = \sqrt[n]{\prod Y_i}$$
- n is the number of observations,
- n√ is the nth root (e.g. if n = 2 then you take the square root),
- ΠYi is the product of all of the observations.
An easier way to calculate a geometric mean is to first transform the data by taking the logarithm of each observation. Then add the transformed values together and divide by the number of observations. The antilogarithm yields the detransformed or geometric mean.
Algebraically speaking -
$$\text{Geometric Mean } (G) = \operatorname{antilog}\left(\frac{\sum \log Y_i}{n}\right)$$
where:
- Σlog Yi indicates the sum of the log of each of the observations in variable Y,
- n is the number of observations in your set.
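The two routes to the geometric mean described above (the nth root of the product, and the antilog of the mean log) can be compared in a short sketch. The function names and the data in `counts` are hypothetical, chosen only for illustration. Natural logarithms are used here, but any base gives the same result provided the antilog uses the same base. Both routes assume every observation is strictly positive.

```python
import math

def geometric_mean_root(values):
    """nth root of the product of the observations."""
    n = len(values)
    return math.prod(values) ** (1 / n)

def geometric_mean_logs(values):
    """Antilog of the arithmetic mean of the logged observations."""
    n = len(values)
    mean_log = sum(math.log(y) for y in values) / n
    return math.exp(mean_log)

# Hypothetical, strictly positive observations.
counts = [2, 4, 8, 16, 32]

print(geometric_mean_root(counts))
print(geometric_mean_logs(counts))
```

The log route is usually preferable for long lists, because the product of many observations can become too large (or too small) to represent accurately.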
If you have any problems with logarithms, please refer to our (very simple) introduction to the topic:
The harmonic mean (H) may be used when analysing rates of change. It is the reciprocal of the arithmetic mean of the reciprocals of the observations.
One example of the correct use of the harmonic mean is to calculate 'average' speed. If for half the distance of a journey you travel at 60 kilometres per hour, and the other half you travel at 80 kilometres per hour, then - in one sense - the 'average' speed is the harmonic mean of the two, namely 68.6 kilometres per hour. This is because that is the constant speed at which you would have to travel to complete the whole trip in the same total time.
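The speed example can be verified numerically. The sketch below computes the harmonic mean as n divided by the sum of the reciprocals, then checks it against total distance divided by total time. The 240 km trip length is an assumption made purely for the check; any distance gives the same answer.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return n / sum(1 / y for y in values)

speeds = [60, 80]  # km per hour, each for half the distance
average_speed = harmonic_mean(speeds)

# Check with a hypothetical 240 km trip: 120 km at each speed.
total_time = 120 / 60 + 120 / 80   # hours
check_speed = 240 / total_time     # total distance / total time

print(round(average_speed, 1), round(check_speed, 1))
```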
In a simple, or arithmetic mean, all of the observations have equal weight. Sometimes we wish to weight our observations according to their importance or our confidence in them. For example if we take a mean of percentages we should weight each percentage by the number of observations it is based upon.
For a weighted mean each observation is multiplied by its weight, and the sum of these products is divided by the total weighting applied.
$$\text{Weighted mean} = \frac{\sum w_i Y_i}{\sum w_i}$$
where:
- Yi indicates the ith value of the observations in variable Y,
- wi is the weight of each item in the set.
Obviously if every wi = 1, the total weighting equals n and this becomes the same as the arithmetic mean. Alternatively, the arithmetic mean could be estimated from the number of observations in each interval of a frequency distribution.
In which case, in the weighted mean:
- Yi is the class interval mean,
- wi is the number of observations within the ith interval.
The accuracy of this mean is set by the size of the class intervals.
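As a rough illustration of the weighted mean, the sketch below covers the three cases mentioned above: percentages weighted by sample size, equal weights, and a frequency distribution. All of the numbers are hypothetical and the function name `weighted_mean` is just a label for this example.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

# Hypothetical percentages, each based on a different number of observations.
percentages = [10.0, 20.0, 50.0]
sample_sizes = [100, 50, 10]
weighted = weighted_mean(percentages, sample_sizes)

# With every weight equal to 1 the result is the ordinary arithmetic mean.
unweighted = weighted_mean(percentages, [1, 1, 1])

# Hypothetical frequency distribution: class interval means and their counts.
class_means = [5, 15, 25]
frequencies = [2, 5, 3]
grouped = weighted_mean(class_means, frequencies)

print(weighted, unweighted, grouped)
```

Here the small sample with the high percentage pulls the unweighted mean upwards, whereas the weighted mean reflects the fact that most observations came from the low-percentage samples.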
If we have very little confidence in some points, we might decide to give them zero weight - in other words we omit the points altogether when computing the mean. One important example of this is where the most divergent values (such as the maximum and minimum) are given zero weight - and the mean of the remaining values is known as a 'trimmed mean'. Although we look at trimmed means in more depth in Unit 2, it is worth noting that trimmed means assume that equal proportions of the highest and lowest values are removed - and their removal is done impartially.
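A trimmed mean of this kind might be sketched as follows. The function removes the same number (k) of lowest and highest values before averaging; the parameter name `k` and the data are assumptions for illustration only.

```python
def trimmed_mean(values, k):
    """Mean after giving zero weight to the k lowest and k highest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one very divergent value.
data = [1, 2, 3, 4, 100]

print(trimmed_mean(data, 0))  # ordinary arithmetic mean
print(trimmed_mean(data, 1))  # maximum and minimum omitted
```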
It is often useful to smooth time series data to expose underlying trends. The extent to which this is true will depend on the signal-to-noise ratio. If the variations (the noise) about the underlying trend (the signal) are small, then the trend will be clear. If, however, there is a low signal-to-noise ratio, then it may be hard to discern if there is any real trend. The solution is to use some form of smoothing. This involves replacing the value of each observation in the list with an estimate of it based on a 'window' of observations around it; for each point in the list that window moves along the list. The reasoning behind this is that, if the observations are serially correlated, observations close to a given point in the list also have information about that point.
This can be done by calculating running means. Each running mean, also known as a moving average, is calculated from an overlapping group of n values. There are two types of running means. A prior running mean is the unweighted mean of the previous n data points. This is the only type that can be used if the data are 'live' - in other words you want to produce the average for that day on that day. These are mostly used in, for example, stock markets. A running mean of this type always lags behind the latest observation. Another option more commonly used by biologists is the central running mean - the mean is taken of the day in question and of equal numbers of points both before and after that day. In this case n is always an odd number.
Algebraically speaking -
$$\text{Prior } n\text{-point running mean } (PRM_t) = \frac{Y_{t-1} + Y_{t-2} + \dots + Y_{t-n}}{n}$$
$$\text{Central 5-point running mean } (CRM_t) = \frac{Y_{t+2} + Y_{t+1} + Y_t + Y_{t-1} + Y_{t-2}}{5}$$
where:
- Yt, Yt+1 and Yt-1 indicate the value of Y at the current time, the following time period and the previous time period respectively,
- n is the number of observations in each running mean.
Note that there are various equivalent algorithms for working out moving averages which do not involve a full summation every time.
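The prior and central running means might be sketched as below. How to treat the ends of the series, where a full window is not available, is not specified in the text; returning `None` at those positions is an assumption of this sketch. The third function is one example of an equivalent algorithm that avoids a full summation, by adding the value entering the window and subtracting the value leaving it.

```python
def prior_running_mean(series, n):
    """Mean of the n observations before each time t (None until n exist)."""
    result = []
    for t in range(len(series)):
        if t < n:
            result.append(None)
        else:
            result.append(sum(series[t - n:t]) / n)
    return result

def central_running_mean(series, n):
    """Mean of the point at t plus (n - 1) / 2 points on either side."""
    if n % 2 == 0:
        raise ValueError("n must be odd for a central running mean")
    half = n // 2
    result = []
    for t in range(len(series)):
        if t < half or t + half >= len(series):
            result.append(None)
        else:
            result.append(sum(series[t - half:t + half + 1]) / n)
    return result

def prior_running_mean_incremental(series, n):
    """Same output as prior_running_mean, updating the window sum in place."""
    result = [None] * min(n, len(series))
    if len(series) <= n:
        return result
    window_sum = sum(series[:n])
    result.append(window_sum / n)
    for t in range(n + 1, len(series)):
        # Add the newest value in the window, drop the oldest.
        window_sum += series[t - 1] - series[t - n - 1]
        result.append(window_sum / n)
    return result

series = [1, 2, 3, 4, 5, 6, 7]  # hypothetical observations
print(prior_running_mean(series, 3))
print(central_running_mean(series, 5))
```

Note how the prior running mean lags behind the series, while the central running mean stays aligned with it but cannot be computed for the most recent points.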
One modification of the simple running mean is to take repeated running means - in other words, running means of running means. This may be done in preference to running means with a larger number of points since it tends to produce a much smoother result. If the process is carried out repeatedly, the time series will eventually stabilise so that further repeats have no effect.
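To illustrate repeated running means, the sketch below applies a 3-point central running mean twice to a hypothetical series containing a single spike. In this version the end points that lack a full window are simply dropped, which is an assumption of the sketch rather than a rule from the text. Two passes of a 3-point mean turn out to be equivalent to a single 5-point weighted running mean with weights 1, 2, 3, 2, 1 (divided by 9), so the centre still receives the most weight.

```python
def central_running_mean(series, n):
    """Central n-point running mean; end points without a full window are dropped."""
    half = n // 2
    return [
        sum(series[t - half:t + half + 1]) / n
        for t in range(half, len(series) - half)
    ]

spike = [0, 0, 0, 9, 0, 0, 0]  # hypothetical series with one divergent value

once = central_running_mean(spike, 3)
twice = central_running_mean(once, 3)

# Equivalent single pass using weights 1, 2, 3, 2, 1 (total 9).
weights = [1, 2, 3, 2, 1]
weighted = [
    sum(w * y for w, y in zip(weights, spike[t - 2:t + 3])) / 9
    for t in range(2, len(spike) - 2)
]

print(once)
print(twice)
print(weighted)
```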
Another modification of a simple running mean is to use weighted running means. These give most weight to the central value and then progressively less weight to more outlying values. A popular form of weighted running means is exponential smoothing. Usually only previous observations (or the current and previous observations) are used to produce a smoothed value. Each smoothed value is calculated as a weighted average of the current observation and the preceding smoothed value.
Algebraically speaking -
$$\text{Exponential smoothed mean } (ESM_t) = aY_t + (1-a)\,ESM_{t-1}$$
where a is the smoothing constant, a weight between 0 and 1.
Because the smoothed data are used, the weights decrease geometrically with the age of prior observations.
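Exponential smoothing can be sketched in a few lines. The recurrence needs a starting value, which the text does not specify; using the first observation as the first smoothed value is a common choice but is an assumption here. The series and the value of `a` are hypothetical.

```python
def exponential_smoothing(series, a):
    """ESM_t = a * Y_t + (1 - a) * ESM_(t-1), seeded with the first observation."""
    if not 0 < a <= 1:
        raise ValueError("a must be greater than 0 and at most 1")
    smoothed = [series[0]]  # assumed starting value
    for y in series[1:]:
        smoothed.append(a * y + (1 - a) * smoothed[-1])
    return smoothed

series = [10, 20, 30]  # hypothetical observations
print(exponential_smoothing(series, 0.5))
```

With a = 0.5 the most recent observation has weight 0.5, the one before it 0.25, and so on, which is the geometric decrease in weights described above. A value of a close to 1 tracks the data closely; a value close to 0 smooths heavily.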
Assumptions of means
- You can calculate the mean of any measurement variable, whether continuous or discrete. For example, you can work out the mean height of a number of individuals, or the mean number of plants per quadrat.
It is usually not valid to calculate the mean of an ordinal (rank order) variable. If for example you score 0-10 insects as 1, 10-20 insects as 2 and more than 20 as 3, then you should not calculate a mean. If on the other hand you use what is called a visual analogue scale, for example to assess pain, then it is usually considered to reflect an underlying measurement scale. In this situation some statisticians are happy to use a mean, although others insist on the use of the median.
It is never valid to calculate the mean of a nominal variable. For example if you score psychotherapy as 1, chemotherapy as 2, and surgical intervention as 3, it makes no sense to add up all the ones, twos and threes, and then take a mean!
- For the arithmetic mean to be meaningful, the frequency distribution of the items should be symmetrical. If the frequency distribution is skewed as in the distribution of numbers of microfilariae, the arithmetic mean will give a biased estimate of the central tendency. The geometric mean may be more appropriate.
To give another example, let us say 19 laboratories have one fluorescent microscope each, and one laboratory has 41 fluorescent microscopes. On average, there are three fluorescent microscopes per laboratory. This mean does not describe most of the sample, because it is heavily influenced by one, very different, observation - or 'outlier'. Again a geometric mean might be more appropriate.
- The arithmetic mean will also be a poor measure of central tendency for a bimodal distribution, as in the distribution of drug prices (above). Here it would be better to describe the data with two means, which would summarize each of the 'sub-distributions'.
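Returning to the fluorescent microscope example, the effect of the single outlier on each measure can be checked directly. This is a minimal sketch using only the counts given in the text (19 laboratories with one microscope, one laboratory with 41).

```python
import math
from statistics import median

microscopes = [1] * 19 + [41]
n = len(microscopes)

arithmetic = sum(microscopes) / n
geometric = math.exp(sum(math.log(y) for y in microscopes) / n)
middle = median(microscopes)

print(arithmetic, round(geometric, 2), middle)
```

The arithmetic mean is 3, although 19 of the 20 laboratories have only one microscope. The geometric mean is only a little above 1, and the median is exactly 1, so both describe the typical laboratory far better.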
The median, mode and mid-range
- The median of a set of observations is the middle value when the data are arranged in order of magnitude. For an even number of ranked observations the median is usually taken to lie somewhere between the two centre-most values, so is estimated as their mean. The median corresponds to the mean rank, below which half the observations lie.
If you arrange a set of (n) different numbers in ascending order, their median is the value with rank [n + 1] / 2.
The median can also be regarded as the most extreme type of trimmed mean. Generally, if you have data where you distrust extreme values, it is easier and more transparent to use medians than (slightly) trimmed means. Although the tests for medians may be less powerful than those for means, a good number are available, and (for simple analytical analyses) medians do not entail the same cumbersome corrections as trimmed means. The median is the most appropriate measure of location for an ordinal variable.
A running median is used for smoothing data. Running means are still sensitive to outlying values, so if there are a few very divergent values in the data set, it is better to use running medians. Plots of running medians can be rather jagged simply because successive running medians are often the same. But for contaminated distributions, running medians may be less variable than plots of running means.
- The mode is the most common value, or range of values.
Among a set of values the mode is either their most common value, or the most common class of values.
However, this nice neat theoretical principle runs into difficulties when applied to real data - because more than one mode may be discernible.
- A distribution with a single mode is therefore unimodal,
- where there are 2 modes it is bimodal,
- and if there are many modes it is polymodal.
In practice it is often impossible to calculate the mode unless you group your data in some arbitrary fashion, or you assume your observations represent some (equally arbitrary) frequency distribution, or you smooth the observed frequency distribution - to fit your hopes and preconceptions.
- The mid-range is that value which is half way between the maximum and minimum.
Among a set of values the mid-range (or mean range) is their [maximum + minimum]/2.
Unfortunately, because the maximum and minimum are the least reliable observations, and diverge furthest from their midrange, the mean range provides a correspondingly unreliable measure of their location!
For large samples, where the frequency distribution is symmetrical, the arithmetic mean, median, mode and midrange are likely to be quite similar - on average, at least! Where the frequency distribution is skewed (such that one tail of the distribution is much the longer), the midrange will often be closest to the 'tail', followed by the arithmetic mean. The mode will be furthest from the 'tail', and the median will be in between the mode and the arithmetic mean. This can be seen in the worked example below.
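The ordering of these measures under skew can be illustrated with a small sketch. The data set below is hypothetical, invented to have a long right-hand tail; the helper names `modes` and `mid_range` are likewise just labels for this example. Because n = 10 is even, the median rank (n + 1) / 2 = 5.5 falls between the 5th and 6th ordered values, so the median is taken as their mean.

```python
from collections import Counter
from statistics import median

def modes(values):
    """All values sharing the highest frequency (there may be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

def mid_range(values):
    """Half way between the maximum and the minimum."""
    return (max(values) + min(values)) / 2

# Hypothetical right-skewed observations, already in ascending order.
data = [1, 2, 2, 2, 3, 3, 4, 5, 8, 20]

mean_value = sum(data) / len(data)
median_value = median(data)
mode_values = modes(data)
mid_range_value = mid_range(data)

print(mode_values, median_value, mean_value, mid_range_value)
```

For these data the mode (2) is furthest from the tail, followed by the median (3), the arithmetic mean (5) and the mid-range (10.5). Calling `modes([1, 1, 2, 3, 3])` returns two values, which shows how quickly the idea of a single mode can break down.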
Assumptions of the median, mode and mid-range
- You can calculate the median and mid-range of measurement and ordinal variables, but they are less precise than the mean for measurement variables. You cannot, validly, calculate the median or mid-range of nominal variables.
- The mode can be calculated for any type of variable, but tells you little where there are few observations, or too many classes, or no clear mode.
- The median may be an appropriate measure of location if the frequency distribution of the items is not symmetrical. The mode can also be useful in this situation, but the mid-range assumes a symmetrical frequency distribution.
Which measure of location to use?
Which measure of location is most useful depends upon
- the type of variable,
- the distribution of values within the data,
- the underlying model - whether additive or multiplicative,
- the purpose for which you wish to use the measure.
If the distribution is fairly symmetrical, and your model is additive, the arithmetic mean generally provides the most useful and informative measure of location.
If the distribution is heavily skewed to the right, and/or your model is multiplicative, the geometric mean may be the most appropriate measure of location.
If the distribution is polymodal, irregular, or has very long tails, the median may be the best measure - although it may not be as precise as the arithmetic mean.
Which measure of location is best also depends upon the purpose for which you wish to use the data:
Consider the number of nematode eggs per gram of faeces in horses:
To assess efficacy of treatment for the horses
- the geometric mean is the best measure of location, as it will be less affected by a few very high numbers.
To assess impact of treatment on the subsequent worm load in the field
- the arithmetic mean is the best measure of location, as it will be directly proportional to the total number of eggs dropped in the field.
Ordinal and nominal variables
In general the median is the best measure to use for ordinal variables. This is because, for an ordinal variable, we usually cannot say that the difference between score 1 and 2 is equivalent to the difference between score 2 and 3.
Think very carefully before using an arithmetic mean for an ordinal variable!
The mode could also be used for an ordinal variable - but remember it may not be reliable, especially if the distribution is flattened (platykurtic). The mode is, however, the only measure of location that can be used for a nominal variable.
The 'average man'
Sample and parametric means