"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)





The variance provides a measure of spread or dispersion of a population. It is computed as the average of the squared deviations of the observations from their mean, hence its alternative name mean square error.

Algebraically speaking -

$$\sigma^2 = \frac{\sum (\mu - Y_i)^2}{n}$$
Where :
  • σ² (or υ) is the population variance,
  • μ is the population mean,
  • Yᵢ is the value of each observation in the population,
  • μ − Yᵢ is the difference between the value of each observation and the population mean,
  • Σ (μ − Yᵢ)² is the sum of the squares of those individual deviations from the mean, and
  • n is the number of observations.

If you were able to measure every member of the population, this is the equation you would use. More usually, however, you take a sample and use the variance of that sample to estimate the population variance.

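If you work out the variance of your sample using the equation above, it will tend to underestimate the true value of the population variance, because the deviations are measured from the sample mean rather than from the population mean. In other words, it is a biased estimator of the population variance.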


Correction for bias

You correct for this bias by dividing by n − 1 (where n is the number of observations), rather than by n. Hence the sample variance is given by the sum of the squared deviations of the observations from their mean divided by n − 1.
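The effect of the divisor can be checked by brute force on a very small population. The Python sketch below assumes sampling with replacement, so that every ordered sample of a given size is equally likely. The population values and function names are made up for illustration.

```python
from itertools import product


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def average_variance_estimates(population, sample_size):
    """Average the n-divisor and (n - 1)-divisor variances over every
    possible ordered sample drawn with replacement."""
    samples = list(product(population, repeat=sample_size))
    divide_by_n = mean([sum_of_squares(s) / sample_size for s in samples])
    divide_by_n_minus_1 = mean(
        [sum_of_squares(s) / (sample_size - 1) for s in samples]
    )
    return divide_by_n, divide_by_n_minus_1


population = [1, 2, 3]  # hypothetical population
true_variance = sum_of_squares(population) / len(population)
biased, corrected = average_variance_estimates(population, sample_size=2)
print(true_variance, biased, corrected)
```

Averaged over all nine possible samples of two, the divisor n gives 1/3, which is half the population variance of 2/3, whereas the divisor n − 1 gives 2/3.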

Algebraically speaking -

$$s^2 = \frac{\sum (\bar{Y} - Y_i)^2}{n - 1} \quad \text{or} \quad \frac{\sum y^2}{n - 1}$$
Where :
  • s² is the sample variance,
  • Ȳ is the sample mean,
  • Yᵢ is the value of each observation in the sample,
  • y = Yᵢ − Ȳ, the difference between each observation and the sample mean,
  • n is the number of observations in the sample,
    the quantity n − 1 is known as the degrees of freedom.

Because you have squared the deviations from the mean, the variance is expressed in squared units.
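As a minimal sketch, the two definitions differ only in their divisor. The function names and the data below are illustrative, not taken from any particular package.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((mean - y) ** 2 for y in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return sum((mean - y) ** 2 for y in values) / (n - 1)


# Hypothetical lengths in cm, so both variances are in cm squared.
lengths = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(lengths))  # 32 / 8
print(sample_variance(lengths))      # 32 / 7
```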


Alternative formulae for the variance

  1. Calculator / text-book formula

    If you are working out the sample variance with a calculator or are writing a computer programme, it is easier to use the following formula. It is mathematically identical to the formula given above.

    Algebraically -

    $$s^2 = \frac{\sum Y_i^2 - \frac{(\sum Y_i)^2}{n}}{n - 1}$$

    Where :

    • Σ (Yᵢ²) is the sum of the squares of the observations,
    • (Σ Yᵢ)² is the square of the sum of the observations,
    • s², Yᵢ and n are as above.

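    The advantage of this formula is that it does not require you to work out the mean before you can work out the deviations of the observations from it. In hand calculation it also avoids the rounding error that comes from using a rounded mean, although in floating-point computer arithmetic subtracting two large, nearly equal sums can itself lose precision. Its disadvantage, when learning about statistics, is that it obscures what is going on.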


  2. Calculation of variance from frequency distribution

    Sometimes it is much easier to work out the variance of observations that have been divided into class-intervals. The accuracy of the formula below depends upon the width of the class-intervals used - the wider the class-interval, the less accurate the estimated variance.

    Algebraically speaking -

    $$s^2 = \frac{\sum (f y^2)}{n - 1}$$
    Where :
    • s² is the sample variance,
    • f is the number (or frequency) of observations within a class-interval,
    • y is the difference between that class and the sample mean, Ȳ.
      If a class covers a range of values use the class average, or class mid-point.
    • n = Σ(f), the total number of observations in the sample


  3. Calculation of variance from relative frequency distribution

    Occasionally you may need to estimate the variance from the proportion of observations in each class interval, in other words where you do not know the overall sample size (n). This is possible because for large samples (n > 30) n − 1 is close to n, so that s² ≅ σ².

    Algebraically speaking -

    $$s^2 \cong \frac{\sum (f y^2)}{n} = \sum \left( \frac{f}{n} \, y^2 \right)$$
    Where :
    • s² is the sample variance,
    • n is the total number of observations in the sample,
    • f is the number (or frequency) of observations within a class-interval,
    • f/n is the proportion (or relative frequency) of the observations within that class interval,
    • y is the difference between that class midpoint and the sample mean.
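The three alternative formulae can be sketched in Python as follows. The calculator formula is taken here to be (Σ Y² − (Σ Y)²/n) / (n − 1), and class mid-points stand in for the observations in the two frequency versions. The function names and data are illustrative assumptions.

```python
def variance_calculator_formula(values):
    """(sum of Y^2 - (sum of Y)^2 / n) / (n - 1); the mean is never computed."""
    n = len(values)
    sum_y = sum(values)
    sum_y_squared = sum(y * y for y in values)
    return (sum_y_squared - sum_y ** 2 / n) / (n - 1)


def variance_from_frequencies(midpoints, frequencies):
    """Sample variance from class mid-points and class frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    sum_f_y2 = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return sum_f_y2 / (n - 1)


def variance_from_proportions(midpoints, proportions):
    """Approximate variance from relative frequencies (which must sum to 1)."""
    mean = sum(p * m for m, p in zip(midpoints, proportions))
    return sum(p * (m - mean) ** 2 for m, p in zip(midpoints, proportions))


observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical raw observations
print(variance_calculator_formula(observations))

midpoints = [1, 2, 3]  # hypothetical class mid-points
print(variance_from_frequencies(midpoints, [10, 20, 10]))
print(variance_from_proportions(midpoints, [0.25, 0.5, 0.25]))
```

With 40 observations the last two results differ only by the factor 39/40, which is why the relative frequency version is acceptable for large samples.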



Standard deviation


The standard deviation of a population is simply the square root of the population variance. It can also be described as the root mean squared deviation from the mean.

Algebraically speaking -

$$\sigma = \sqrt{\frac{\sum (\mu - Y_i)^2}{n}}$$
where :
  • σ is the population standard deviation,
  • μ, Yᵢ, and n are as above.


Similarly the standard deviation of a sample is the square root of the sample variance:

Algebraically speaking -

$$s = \sqrt{\frac{\sum (\bar{Y} - Y_i)^2}{n - 1}}$$
where :
  • s is the sample standard deviation,
  • Ȳ, Yᵢ, and n are as above


Correction for bias

We noted above that the sample variance (s²) is corrected for bias by dividing by n − 1 rather than n. Despite this, when we take the square root of the sample variance to obtain the sample standard deviation, we still get a biased estimate of the population standard deviation. If you wish to use the sample standard deviation as an estimate of the population standard deviation you should multiply s by a correction factor (Cn). For large samples (n > 30), this correction factor is readily obtained from the sample size:

Algebraically speaking -

$$C_n \approx 1 + \frac{1}{4(n - 1)}$$

Where :

  • Cn is the correction factor,
  • ≈ means 'is approximately equal to',
  • n is the number of observations.

This correction, however, makes little difference to the estimate of the standard deviation so it is seldom applied.
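A short Python sketch shows how small the large-sample correction is. The function name is illustrative.

```python
def approximate_correction_factor(n):
    """Large-sample approximation: Cn is roughly 1 + 1 / (4(n - 1))."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 1 + 1 / (4 * (n - 1))


for n in (31, 101, 1001):
    print(n, approximate_correction_factor(n))
```

For n = 31 the factor is 1 + 1/120, so the standard deviation changes by less than 1%.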

For small sample sizes the correction is more important. Unfortunately the formula is more complicated and you will need to have tables of the 'gamma function', Γ (the Greek capital letter gamma).

Algebraically speaking -

$$C_n = \sqrt{\frac{n - 1}{2}} \; \frac{\Gamma\left(\frac{n - 1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}$$
where :
  • Cn is the correction factor,
  • n is the number of observations in your sample,
  • Γ is the gamma function, which can be looked up in R or in mathematical tables.
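The gamma function is also available in Python's standard `math` module, so tables are not strictly needed. The sketch below assumes the exact factor is √((n − 1)/2) · Γ((n − 1)/2) / Γ(n/2) and prints it alongside the large-sample approximation 1 + 1/(4(n − 1)). The function name is illustrative.

```python
import math


def exact_correction_factor(n):
    """Cn = sqrt((n - 1) / 2) * Gamma((n - 1) / 2) / Gamma(n / 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Work with log-gamma so that large n does not overflow math.gamma.
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma(n / 2)
    return math.sqrt((n - 1) / 2) * math.exp(log_ratio)


for n in (2, 5, 10, 31):
    approximation = 1 + 1 / (4 * (n - 1))
    print(n, exact_correction_factor(n), approximation)
```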

Alternatively tables are available that give the correction factor directly for small samples. Another option is to use the jackknife to estimate the statistic corrected for bias - although in the case of the standard deviation the formula above (or tables) will be more accurate for small samples.

It is only really necessary to do this correction if you have small sample sizes, and you are quoting standard deviations. If you are estimating the standard deviation in order to then estimate the coefficient of variation or the confidence limits of the mean, you should not correct the standard deviation for bias. This is because the equations for these statistics include the necessary corrections.

With the availability of personal computers, few people still use a calculator for doing statistics. However, statistical packages often have some 'bugs'. So it is wise to give these packages some small 'test' data sets, so you can easily check the results 'by hand'. In addition some packages do not actually tell you whether or not certain corrections have been applied. You can only find out by running a test data set.
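As an example of such a check, the sketch below feeds a small test data set to Python's standard `statistics` module and reports which divisor each function has used. The data set and the helper `matching_divisor` are made up for illustration. The same approach should work for any package.

```python
import math
import statistics

test_data = [2, 4, 4, 4, 5, 5, 7, 9]  # small enough to check by hand

n = len(test_data)
mean = sum(test_data) / n
sum_sq = sum((y - mean) ** 2 for y in test_data)  # 32 for these values

by_hand = {
    "divisor n": sum_sq / n,
    "divisor n - 1": sum_sq / (n - 1),
}


def matching_divisor(package_value):
    """Report which hand-calculated variance the package value agrees with."""
    for label, value in by_hand.items():
        if math.isclose(package_value, value):
            return label
    return "neither"


print("statistics.variance:", matching_divisor(statistics.variance(test_data)))
print("statistics.pvariance:", matching_divisor(statistics.pvariance(test_data)))
```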



Within-subject standard deviation

This statistic provides a useful measure of both reproducibility (same test material sent to different laboratories) and repeatability (same test material analyzed by same person in same laboratory). In other words, it describes the random component of measurement error. It also goes under the rather misleading name standard error of measurement. (This may be abbreviated to SEM, but it has nothing to do with the standard error of the mean which is also abbreviated to SEM.)

For instance you might subdivide a single blood sample, and send each subsample to a different laboratory for haemoglobin assay. The standard deviation of their various results could be used to describe the measurement error.

In practice this sort of assessment uses a number of samples (say each from a different patient) - and each original sample is subsampled and independently assayed. Because the original samples were not identical, the results you obtain will include the variation between patients - in addition to the measurement error. Therefore, simply pooling the results, and calculating their standard deviation about a single, common mean, will overestimate the variation arising from measurement error.

There are two obvious ways of avoiding this problem:

  • The standard deviation of the results for each patient is given separately.
      The problem here is that the standard deviations will vary from individual to individual. What we want is an estimate of the common within-subject variability.

  • The set of standard deviations is combined into a single, overall measure of error.
      Provided the same number of results are obtained for each original sample, the common within-subject variability can be estimated by taking the mean of the individual variances - and then taking the square root of this, to give the within-subject standard deviation.

Algebraically speaking -

Within-subject standard deviation (sw):

$$s_w = \sqrt{\frac{\sum s_i^2}{n}}$$
where :
  • sᵢ² are the variances of measurements on each subject
  • n is the number of original samples, or subjects, in your set.
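A minimal Python sketch of this calculation, assuming the same number of measurements per subject and taking the square root of the mean of the per-subject variances. The haemoglobin values and function names are invented for illustration.

```python
import math


def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def within_subject_sd(measurements_by_subject):
    """Square root of the mean of the per-subject variances.

    Assumes every subject has the same number of measurements.
    """
    sizes = {len(m) for m in measurements_by_subject}
    if len(sizes) != 1:
        raise ValueError("each subject needs the same number of measurements")
    variances = [sample_variance(m) for m in measurements_by_subject]
    return math.sqrt(sum(variances) / len(variances))


# Hypothetical haemoglobin results: three subsamples from each of three patients.
results = [
    [13.1, 13.4, 13.0],
    [15.2, 15.0, 15.5],
    [11.8, 12.1, 11.9],
]
print(within_subject_sd(results))
```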

Although this method works perfectly well if you have the same number of measurements for each individual, a more flexible approach is to carry out what is called a one way analysis of variance. Taking the square root of the 'residual mean square' will give you the within-subject standard deviation. We cover the analysis of variance approach in Unit 11.
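The residual mean square can be worked out directly, without a full analysis of variance table. The Python sketch below assumes the usual one-way layout: the residual sum of squares is taken about each subject's own mean, and the residual degrees of freedom are the total number of measurements minus the number of subjects. The function name and data are illustrative.

```python
import math


def within_subject_sd_anova(measurements_by_subject):
    """Square root of the residual mean square of a one-way ANOVA."""
    residual_ss = 0.0
    total_measurements = 0
    for values in measurements_by_subject:
        subject_mean = sum(values) / len(values)
        # Squared deviations of each measurement from its own subject's mean.
        residual_ss += sum((y - subject_mean) ** 2 for y in values)
        total_measurements += len(values)
    residual_df = total_measurements - len(measurements_by_subject)
    return math.sqrt(residual_ss / residual_df)


# Hypothetical data with unequal numbers of measurements per subject.
print(within_subject_sd_anova([[1, 3], [2, 6, 4]]))
```

When every subject has the same number of measurements, this gives the same answer as the square root of the mean of the individual variances.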

If there are only two measurements per original subject, there is a simpler formula because the variance of two observations is equal to half the square of their difference. So, the within-subject standard deviation can be obtained as follows:

Algebraically speaking -

Within-subject standard deviation (sw):

$$s_w = \sqrt{\frac{\sum d_i^2}{2n}}$$
where :
  • dᵢ² are the squared differences between measurements for each individual
  • n is the number of subjects, or items, in your set.
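For duplicate measurements the calculation reduces to a few lines. This Python sketch assumes exactly two measurements per subject and uses √(Σ d² / 2n). The duplicate assay values are invented.

```python
import math


def within_subject_sd_from_pairs(pairs):
    """sqrt(sum of squared differences / (2 * number of subjects))."""
    squared_differences = [(first - second) ** 2 for first, second in pairs]
    return math.sqrt(sum(squared_differences) / (2 * len(pairs)))


# Hypothetical duplicate assays on four subjects.
duplicates = [(13.1, 13.4), (15.2, 15.0), (11.8, 12.1), (14.0, 14.2)]
print(within_subject_sd_from_pairs(duplicates))
```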


This approach to getting a within-subject standard deviation is only valid if the standard deviation is independent of the mean. This can be checked by plotting the standard deviation for each individual against the individual's mean. If there is a relationship, the data should first be transformed (a log transformation is often effective) before replotting to ensure this assumption is met.
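The values to plot can be obtained as below. This Python sketch returns each subject's mean and standard deviation, optionally after a transformation. The data are invented so that the spread is proportional to the size of the measurement, and the function name is illustrative.

```python
import math
import statistics


def mean_and_sd_per_subject(measurements_by_subject, transform=None):
    """Return (mean, standard deviation) for each subject.

    `transform` is an optional function, such as math.log, applied to
    every measurement first.
    """
    summary = []
    for values in measurements_by_subject:
        if transform is not None:
            values = [transform(v) for v in values]
        summary.append((statistics.mean(values), statistics.stdev(values)))
    return summary


# Hypothetical data in which the spread grows with the size of the measurement.
data = [[9, 11], [45, 55], [90, 110]]
for subject_mean, subject_sd in mean_and_sd_per_subject(data):
    print("raw", subject_mean, subject_sd)
for subject_mean, subject_sd in mean_and_sd_per_subject(data, transform=math.log):
    print("log", subject_mean, subject_sd)
```

On the raw scale the standard deviation rises with the mean. After taking logarithms it is the same for all three subjects, because each pair of measurements has the same ratio.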

The within-subject standard deviation can also be used to quantify measurement error in repeated measurements over time. However, it will reflect measurement error alone only if there is no trend over time. If there is a trend, the within-subject standard deviation will overestimate the amount of measurement variation.

A common error is to use Pearson's correlation coefficient (see Units 1 & 12) to quantify variation due to measurement error. The reasoning for this is simple enough - albeit flawed. If there is little measurement error, the pairs of results should be quite similar, so should (co)vary roughly to the same degree - and Pearson's correlation coefficient provides a standard measure of the degree of co-variance.

Unfortunately, this correlation coefficient suffers from a problem - the more your original subjects (or samples) differ, the more this correlation coefficient will underestimate the measurement variation. Worse still, if there was any consistent trend in your results over time, that source of variation will bias the correlation coefficient. Fortunately there is a correlation coefficient which can be used - namely the 'intra-class correlation coefficient'. However, this can only be obtained after carrying out an analysis of variance which is described in Unit 11.



Assumptions and Requirements

The variance and standard deviation can be calculated for any variable - providing it can be ordered. But the standard deviation is only an appropriate measure of dispersion for a measurement variable, and only then if the data have a symmetrical distribution - and, in many cases, a normal one. Use of the standard deviation to display the variability of observations in range plots and box-and-whisker plots is misleading if these assumptions are not met. Assumptions about what proportion of observations are included within limits of agreement are also dependent upon this assumption.

For non-symmetrical (skewed) distributions there are two options:

  1. Use the median as the measure of location and the interquartile range as a measure of dispersion.
  2. Transform the data (often by taking logarithms) so that the distribution of values is symmetrical, and then work out the standard deviation of the transformed data.
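Both options are available from Python's standard library, as sketched below with invented, right-skewed data. Note that `statistics.quantiles` is used with its default method. Other packages define quartiles slightly differently, so interquartile ranges may not agree exactly between packages.

```python
import math
import statistics

# Hypothetical right-skewed observations (one very large value).
observations = [1, 2, 3, 4, 5, 6, 100]

# Option 1: median and interquartile range.
lower_quartile, median, upper_quartile = statistics.quantiles(observations, n=4)
interquartile_range = upper_quartile - lower_quartile

# Option 2: standard deviation of the log-transformed observations.
logged = [math.log(y) for y in observations]
sd_of_logs = statistics.stdev(logged)

print(median, interquartile_range, sd_of_logs)
```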

For the within-subject standard deviation, it is assumed that the size of the deviation is not related to the magnitude of the measurement. This can be assessed graphically, by plotting the individual subject's standard deviations against their means.

We stress here that the standard deviation is only a measure of the variability of your observations. It is not a measure of the variability or the reliability of your estimated mean. We will come to measures of the variability and reliability of means later in this unit.
