InfluentialPoints.com
Biology, images, analysis, design...


Sample variance and Standard Deviation using R

Variance and SD

R can calculate the sample variance and sample standard deviation of our cattle weight data using these instructions:

Giving:

> var(y)
[1] 1713.333
> sd(y)
[1] 41.39243
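The echoed prompts show that the two commands are simply var(y) and sd(y). A minimal sketch follows, using a hypothetical vector in place of the cattle weights (which are not reproduced here), so its results differ from the output above. The hand-written formula is added only to show what var() computes.

```r
# Hypothetical weights; NOT the cattle data behind the output above
y = c(400, 450, 500, 550)

var(y)   # sample variance, n-1 divisor
sd(y)    # sample standard deviation, the square root of var(y)

# The same quantities written out by hand
n = length(y)
s2 = sum((y - mean(y))^2) / (n - 1)
s2         # agrees with var(y)
sqrt(s2)   # agrees with sd(y)
```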

    Note:
  • var(y) instructs R to calculate the sample variance of y. In other words, it uses n-1 'degrees of freedom', where n is the number of observations in y.

  • sd(y) instructs R to return the sample standard deviation of y, using n-1 degrees of freedom.

  • sd(y) = sqrt(var(y)). In other words, it is the square root of the n-1 variance, with no further correction for the slight bias that taking the square root introduces.

  • The var function cannot give the 'population variance', which divides by n rather than n-1. But there are 2 simple ways to achieve that:

  • Remember that if n=1, the second variance formula will always yield zero, because the mean of y will equal y, whereas the first formula will always yield NA, because 0/(1-1) = 0/0, which cannot be evaluated.

  • Similarly, to obtain the 'population' standard deviation, use:
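The instructions for the population variance and standard deviation are not reproduced here. A minimal sketch of what they could look like is given below. It assumes that the first formula rescales var(y) and the second averages the squared deviations directly, which is consistent with the n=1 behaviour described in the notes. The vector y and the names pop_var_1, pop_var_2 and pop_sd are hypothetical.

```r
# Hypothetical data standing in for the cattle weights
y = c(400, 450, 500, 550)
n = length(y)

# First way (assumed): rescale the sample variance from an n-1 to an n divisor
pop_var_1 = var(y) * (n - 1) / n

# Second way (assumed): mean of the squared deviations from the mean
pop_var_2 = mean((y - mean(y))^2)

# 'Population' standard deviation: square root of the n-divisor variance
pop_sd = sqrt(pop_var_2)

# With a single observation the two ways behave differently
y1 = 400
var(y1) * (1 - 1) / 1       # NA, because var() of one value is NA
mean((y1 - mean(y1))^2)     # 0
```

For this hypothetical y both ways give 3125, compared with 4166.667 from var(y).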

 

 

Variance from frequencies and midpoints

R can calculate the variance from the frequencies (f) of a frequency distribution with class midpoints (y) using these instructions:
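The notes for this section describe each instruction in turn. Assembled in that order, and assuming nothing beyond what the notes state, they can be sketched as:

```r
# Class interval midpoints
y = c(110, 125, 135, 155)

# Frequency of each class
f = c(23, 15, 6, 2)

# Arithmetic mean estimated from the frequencies and midpoints
ybar = sum(y*f)/sum(f)

# Sample variance, using sum(f)-1 as the divisor
sum(f*(y-ybar)^2) / (sum(f)-1)
```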

Giving:

[1] 143.8768

    Note:
  • y=c(110, 125, 135, 155) copies the class interval midpoints into a variable called y.

  • f=c(23, 15, 6, 2) copies the frequency of each class into a variable called f.

  • ybar=sum(y*f)/sum(f) creates a variable called ybar, containing the arithmetic mean - as calculated from these frequencies and midpoints.

    Even if you have a more accurate arithmetic mean, calculated directly from the observations themselves, you should still use this formula. Otherwise your estimated variance will be too high, because the sum of squared deviations is smallest when it is taken about the mean of the same midpoints and frequencies from which the variance is calculated.

  • sum(f*(y-ybar)^2) / (sum(f)-1) calculates the sample variance from the frequencies, f, midpoints, y, and the mean estimated from them, ybar.

    Alternatively, you could combine two of these instructions as: sum(f*(y-sum(y*f)/sum(f))^2)/(sum(f)-1)

  • Remember this only provides an estimate of the variance you would obtain from the original data - and is dependent upon the choice of midpoints, and the number of class intervals used.
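As a cross-check on the formula, one could expand the midpoints into a full vector with rep() and pass the result to var(). This is an illustration rather than part of the original instructions, and the name expanded is hypothetical.

```r
# Midpoints and frequencies as given in the notes
y = c(110, 125, 135, 155)
f = c(23, 15, 6, 2)

# Repeat each midpoint by its frequency: 23 values of 110, 15 of 125, and so on
expanded = rep(y, f)

length(expanded)   # 46, the same as sum(f)
var(expanded)      # agrees with sum(f*(y-ybar)^2)/(sum(f)-1)
```

The two approaches agree because every observation in a class is treated as lying exactly at its midpoint, which is also why the result is only an estimate of the variance of the original data.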