|"It has long been an axiom of mine that the little things are infinitely the most important" |
Variance and standard deviation

On this page: Variance definition · Standard deviation definition · Within-subject standard deviation · Assumptions & Requirements
The variance provides a measure of the spread or dispersion of a population. It is computed as the average of the squared deviations of the observations from their mean, hence its alternative name, the mean square deviation:

σ² = Σ(x − μ)² / N

where μ is the population mean and N is the number of observations in the population.
If you were able to measure every member of the population, this is the equation you would use. More usually, though, you take a sample and use the variance of that sample to estimate the population variance.
If you work out the variance of your sample using the equation above, it will underestimate the true value of the population variance. In other words, it is a biased estimator of the population variance.
Correction for bias
You correct for this bias by dividing by n − 1 (where n is the number of observations), rather than by n. Hence the sample variance is given by the sum of the squared deviations of the observations from their mean, divided by n − 1:

s² = Σ(x − x̄)² / (n − 1)

where x̄ is the sample mean.
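As a minimal sketch, the definitional formula with the n − 1 divisor can be computed directly (the data values here are made up for illustration):

```python
def sample_variance(xs):
    """Sample variance: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2, 3, 5, 6, 9]          # mean = 5; squared deviations sum to 30
print(sample_variance(data))    # 30 / 4 = 7.5
```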
Alternative formulae for the variance
Calculator / text-book formula
If you are working out the sample variance with a calculator, or are writing a computer programme, it is easier to use the following formula. It is mathematically identical to the formula given above:

s² = [Σx² − (Σx)² / n] / (n − 1)
The advantage of this formula is that it does not require you to work out the mean before you can work out the deviations of the observations from it. When calculating by hand it is also less vulnerable to rounding errors, although in a computer programme using floating-point arithmetic it can lose precision when the deviations are small relative to the mean. Its disadvantage, when learning about statistics, is that it obscures what is going on.
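A sketch comparing the two formulae on the same made-up data - with these small integer values they agree exactly:

```python
def variance_definitional(xs):
    # two passes: first the mean, then the squared deviations from it
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def variance_calculator(xs):
    # one pass: sum of squares minus (sum squared over n), all over n - 1;
    # no need to know the mean in advance
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [2, 3, 5, 6, 9]
print(variance_definitional(data), variance_calculator(data))  # 7.5 7.5
```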
Calculation of variance from frequency distribution
Sometimes it is much easier to work out the variance of observations that have been divided into class intervals. The accuracy of the formula below depends upon the width of the class intervals used - the wider the class interval, the less accurate the estimated variance:

s² = [Σfx² − (Σfx)² / n] / (n − 1)

where x is the midpoint of each class interval, f is the frequency of observations in that interval, and n = Σf.
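A sketch of this calculation, using hypothetical class midpoints and frequencies:

```python
# hypothetical class midpoints and their observed frequencies
midpoints = [5, 15, 25, 35]
freqs     = [2, 6, 8, 4]

n    = sum(freqs)                                         # total observations
sfx  = sum(f * x for f, x in zip(freqs, midpoints))       # sum of f*x
sfx2 = sum(f * x * x for f, x in zip(freqs, midpoints))   # sum of f*x^2

var = (sfx2 - sfx ** 2 / n) / (n - 1)
print(var)
```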
Calculation of variance from relative frequency distribution
Occasionally you may need to estimate the variance from the proportion of observations in each class interval - in other words, when you do not know the overall sample size (n). This is possible because for large samples (n > 30), n − 1 approaches n, so that s² ≅ σ². The variance is then given approximately by:

σ² ≅ Σpx² − (Σpx)²

where p is the proportion of observations in each class interval and x is the class midpoint.
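A sketch using hypothetical class midpoints and proportions (the proportions must sum to 1):

```python
# hypothetical class midpoints and the proportion of observations in each
midpoints = [5, 15, 25, 35]
props     = [0.1, 0.3, 0.4, 0.2]

mean = sum(p * x for p, x in zip(props, midpoints))
var  = sum(p * x * x for p, x in zip(props, midpoints)) - mean ** 2
print(mean, var)
```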
Standard deviation
The standard deviation of a population is simply the square root of the population variance (σ = √σ²). It can also be described as the root mean square deviation from the mean.
Similarly, the standard deviation of a sample is the square root of the sample variance: s = √s².
Correction for bias
We noted above that the sample variance (s²) is corrected for bias by dividing by n − 1 rather than by n. However, taking the square root of this unbiased estimate does not give an unbiased estimate of the population standard deviation - the sample standard deviation still slightly underestimates it. For small samples you can correct for this by multiplying the standard deviation by an appropriate correction factor.
Alternatively, tables are available that give the correction factor directly for small sample sizes.
It is only really necessary to do this correction if you have small sample sizes, and you are quoting standard deviations. If you are estimating the standard deviation in order to then estimate the coefficient of variation or the confidence limits of the mean, you should not correct the standard deviation for bias. This is because the equations for these statistics include the necessary corrections.
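For normally distributed data, the correction factor usually tabulated is based on the constant known as c4 (where the expected value of s equals c4 × σ); dividing s by c4 gives an approximately unbiased estimate. A sketch, assuming normality:

```python
import math

def c4(n):
    # unbiasing constant for the sample sd of n normal observations:
    # E[s] = c4(n) * sigma, so s / c4(n) is approximately unbiased
    return math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def corrected_sd(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return s / c4(n)

# for n = 5 the factor is about 0.94, so the correction inflates s by ~6%
print(c4(5))
```

Note how c4 approaches 1 as the sample size grows, which is why the correction only matters for small samples.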
With the availability of personal computers, few people still use a calculator for doing statistics. However, statistical packages often have some 'bugs'. So it is wise to give these packages some small 'test' data sets, so you can easily check the results 'by hand'. In addition some packages do not actually tell you whether or not certain corrections have been applied. You can only find out by running a test data set.
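For example, a tiny test data set whose variance is easy to verify by hand will reveal which divisor a package uses. With Python's statistics module:

```python
import statistics

# Deviations from the mean (5) are -3, -2, 0, 1, 4, so the sum of squared
# deviations is 30: 30/4 = 7.5 with the n - 1 divisor, 30/5 = 6.0 without.
data = [2, 3, 5, 6, 9]
print(statistics.variance(data))   # 7.5 -> the n - 1 correction is applied
print(statistics.pvariance(data))  # 6.0 -> the population (n) divisor
```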
Within-subject standard deviation
This statistic provides a useful measure of both reproducibility (same test material sent to different laboratories) and repeatability (same test material analyzed by same person in same laboratory). In other words, it describes the random component of measurement error. It also goes under the rather misleading name standard error of measurement. (This may be abbreviated to SEM, but it has nothing to do with the standard error of the mean which is also abbreviated to SEM.)
For instance you might subdivide a single blood sample, and send each subsample to a different laboratory for haemoglobin assay. The standard deviation of their various results could be used to describe the measurement error.
In practice this sort of assessment uses a number of samples (say, each from a different patient), and each original sample is subsampled and independently assayed. Because the original samples were not identical, the results you obtain will include the variation between patients in addition to the measurement error. Therefore, simply pooling the results and calculating the standard deviation of the errors about their single common mean will overestimate the variation arising from measurement error.
There are two obvious ways of avoiding this problem:
- Work out the variance of the repeated measurements for each subject separately, average those within-subject variances, and take the square root of that average.
- Carry out what is called a one-way analysis of variance, using subject as the grouping factor.

Although the first method works perfectly well if you have the same number of measurements for each individual, the more flexible approach is the one-way analysis of variance. Taking the square root of the 'residual mean square' gives you the within-subject standard deviation. We cover the analysis of variance approach later.
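The residual mean square behind a one-way analysis of variance can be sketched directly, here with hypothetical subjects and unequal numbers of measurements:

```python
import math

# repeated measurements grouped by subject (hypothetical values;
# unequal numbers of measurements per subject are fine)
subjects = {
    "A": [10.1, 10.5, 10.3],
    "B": [12.0, 11.6],
    "C": [9.8, 10.0, 9.6, 9.9],
}

ss_within = 0.0   # residual (within-subject) sum of squares
df = 0            # residual degrees of freedom: N - number of subjects
for values in subjects.values():
    m = sum(values) / len(values)
    ss_within += sum((v - m) ** 2 for v in values)
    df += len(values) - 1

sw = math.sqrt(ss_within / df)  # within-subject standard deviation
print(sw)
```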
If there are only two measurements per original subject, there is a simpler formula, because the variance of two observations is equal to half the square of their difference. The within-subject standard deviation can then be obtained as follows:

s_w = √(Σd² / 2n)

where d is the difference between the two measurements on each subject and n is the number of subjects.
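A sketch of this shortcut, with hypothetical duplicate measurements:

```python
import math

# first and second measurement on each subject (hypothetical values)
first  = [10.1, 12.0, 9.8, 11.2]
second = [10.5, 11.6, 10.0, 11.0]

n = len(first)
sum_d2 = sum((a - b) ** 2 for a, b in zip(first, second))
sw = math.sqrt(sum_d2 / (2 * n))  # within-subject standard deviation
print(sw)
```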
This approach to getting a within-subject standard deviation is only valid if the standard deviation is independent of the mean. This can be checked by plotting the standard deviation for each individual against the individual's mean. If there is a relationship, the data should first be transformed (a log transformation is often effective) before replotting to ensure this assumption is met.
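As a numerical stand-in for that plot, you can correlate each subject's standard deviation with its mean (here from hypothetical duplicate measurements); a coefficient near zero supports the independence assumption:

```python
import math

# duplicate measurements per subject (hypothetical values)
pairs = [(10.1, 10.5), (12.0, 11.6), (9.8, 10.0), (25.3, 27.1)]

means = [(a + b) / 2 for a, b in pairs]
sds   = [abs(a - b) / math.sqrt(2) for a, b in pairs]  # sd of two values

# Pearson correlation between subject means and subject sds
mx, my = sum(means) / len(means), sum(sds) / len(sds)
num = sum((x - mx) * (y - my) for x, y in zip(means, sds))
den = math.sqrt(sum((x - mx) ** 2 for x in means)
                * sum((y - my) ** 2 for y in sds))
r = num / den
print(r)  # a clearly positive r suggests transforming the data first
```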
The within-subject standard deviation can also be used to quantify measurement error in repeated measurements over time. However, it will only reflect measurement error alone if there is no trend over time. If there is, the within-subject standard deviation will overestimate the amount of measurement variation.
A common error is to use Pearson's correlation coefficient between the first and second sets of measurements as an index of repeatability.
Unfortunately, this correlation coefficient suffers from a problem - the more your original subjects (or samples) differ, the more it will underestimate the measurement variation. Worse still, if there is any consistent trend in your results over time, that source of variation will bias the correlation coefficient. Fortunately there is a correlation coefficient which can be used - namely the 'intra-class correlation coefficient'. However, this can only be obtained after carrying out an analysis of variance.
Assumptions and Requirements
The variance and standard deviation can be calculated for any numeric variable. But the standard deviation is only an appropriate measure of dispersion for a measurement variable, and then only if the data have a symmetrical - and, in many cases, normal - distribution. Using the standard deviation to display the variability of observations in range plots and box-and-whisker plots is misleading if these assumptions are not met, as are statements about what proportion of observations fall within given limits of agreement.
For non-symmetrical (skewed) distributions there are two options:
- transform the data to a scale on which the distribution is roughly symmetrical (for example by taking logarithms) and calculate the standard deviation on that scale;
- use a rank-based measure of dispersion instead, such as the interquartile range or other percentiles.
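A sketch of both options on a hypothetical right-skewed sample:

```python
import math
import statistics

# a hypothetical right-skewed sample
data = [1.2, 1.5, 1.8, 2.1, 2.6, 3.4, 5.9, 12.7]

# Option 1: transform to a roughly symmetrical scale, then use the sd there
logged = [math.log(x) for x in data]
sd_log = statistics.stdev(logged)

# Option 2: quote a rank-based measure of spread instead
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(sd_log, iqr)
```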
For the within-subject standard deviation, it is assumed that the size of the deviation is not related to the magnitude of the measurement. This can be assessed graphically, by plotting the individual subject's standard deviations against their means.
We stress here that the standard deviation is only a measure of the variability of your observations. It is not a measure of the variability or the reliability of your estimated mean. We will come to measures of the variability and reliability of means later in this unit.