InfluentialPoints.com
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

### Purpose

The primary purpose of survey sampling is to obtain estimates of population parameters. These may be of the absolute (or relative) density of an organism, or of some characteristic of that population - such as the proportion infected with a disease, or the mean packed cell volume. As well as the measure of location, we also want a measure of precision, usually the standard error.

Reliable estimates of population parameters can only be obtained with the use of probability sampling - that is where each sampling unit has a known probability of selection. We therefore concentrate on these approaches below - with the proviso that in multistage sampling the requirement for strict probability sampling is often relaxed in the lower levels.

### Random and systematic sampling

#### Simple random sampling

A simple random sample of n sampling units is one in which every possible combination of n units is equally likely to be in the sample selected. Alternatively, we can define it as a sample where each unit is selected one at a time and all units not in the sample have the same probability of being selected. Note that because we are usually sampling from finite populations, as more units are selected, the probability of each remaining unit being selected increases.

You have already encountered formulations for the mean and standard error of both means and proportions when simple random sampling is being used - we summarize them again here for completeness.

#### Algebraically speaking -

$$\bar{Y} = \frac{\sum Y_i}{n} \qquad SE(\bar{Y}) = \frac{s}{\sqrt{n}}$$
Where :
• $\bar{Y}$ is the sample mean,
• $Y_i$ is the value of each observation in the sample,
• n is the number of observations in the sample,
• s is the sample standard deviation where $s^2 = \sum (\bar{Y} - Y_i)^2 / (n - 1)$,
• $SE(\bar{Y})$ is the estimated standard error of the sample mean. If the sample constitutes more than 5% of the population, multiply by the finite population correction: $\sqrt{1 - n/N}$ where N is the total population size.
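
As an illustration of the formulas above, here is a minimal Python sketch. The function name `srs_mean_se` and the example values are hypothetical, and the 5% threshold for applying the finite population correction simply follows the rule of thumb stated above.

```python
import math
import statistics

def srs_mean_se(values, population_size=None):
    """Mean and standard error of the mean under simple random sampling."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)      # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)
    # Finite population correction, applied only if N is given and n/N > 5%.
    if population_size is not None and n / population_size > 0.05:
        se *= math.sqrt(1 - n / population_size)
    return mean, se

# Hypothetical packed cell volumes from 8 animals out of a herd of 100.
pcv = [30, 32, 28, 35, 31, 29, 33, 30]
mean, se = srs_mean_se(pcv, population_size=100)
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```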

#### Algebraically speaking -

$$p = \frac{f}{n} \qquad SE(p) = \sqrt{\frac{p\,(1 - p)}{n - 1}}$$
Where :
• p is the proportion with the character of interest,
• f is the frequency with the character of interest,
• n is the number of observations in the sample,
• SE(p) is the estimated standard error of the proportion. If the sample constitutes more than 5% of the population, multiply by the finite population correction: √(1 - n/N) where N is the total population size.
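
The same calculation for a proportion can be sketched as follows. The function name `srs_proportion_se` and the example counts are hypothetical; the code uses the n − 1 denominator given above.

```python
import math

def srs_proportion_se(positives, n, population_size=None):
    """Proportion and its standard error under simple random sampling."""
    p = positives / n
    se = math.sqrt(p * (1 - p) / (n - 1))
    # Finite population correction, applied only if N is given and n/N > 5%.
    if population_size is not None and n / population_size > 0.05:
        se *= math.sqrt(1 - n / population_size)
    return p, se

# Hypothetical survey: 12 infected animals out of 60 sampled.
p, se = srs_proportion_se(positives=12, n=60)
print(f"p = {p:.3f}, SE = {se:.4f}")
```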

#### Systematic Sampling

A systematic sample is one in which the units selected for a sample occupy related positions in the sampling frame, the first unit being selected at random. For example, if you have a population (N) of 1000 units, you select the first unit at random (say unit 78) and then select (say) every 50th unit to get units 128, 178, 228 and so on. The sampling interval (k) in this case would be 50.
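
The selection procedure can be sketched in Python as below. The function name `systematic_sample` is hypothetical. The wrap-around past unit N is an assumption: with a start at unit 78, it lets the sample also pick up unit 28, so that N/k units are always selected.

```python
import random

def systematic_sample(population_size, interval, start=None):
    """Return the unit numbers (1-based) of a systematic sample.

    Assumes population_size / interval is an integer and treats the
    sampling frame as circular, so selection wraps past the last unit.
    """
    if start is None:
        start = random.randint(1, population_size)   # random first unit
    n = population_size // interval
    return sorted((start - 1 + j * interval) % population_size + 1
                  for j in range(n))

# The example from the text: N = 1000, k = 50, first unit 78.
units = systematic_sample(1000, 50, start=78)
print(units[:5])
```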

The sample mean and sample proportion are calculated in exactly the same way as for simple random sampling. These point estimates are unbiased provided that (a) the first unit is selected at random, (b) N/k is an integer and (c) there is no periodic variation. The first and third assumptions are critical. The second is less so: provided N is large, the bias resulting from N/k not being an integer is so small that it can be disregarded.

As regards actual precision, systematic samples are generally more precise than random samples - simply because they cover the population evenly. This is not, however, the case if there is periodic variation - in this case they can be much less precise than random samples. Estimating that precision is more problematical. Approximate values for standard errors are usually calculated in the same way as for simple random sampling, but they are invariably biased. For most situations estimated standard errors are greater than the true standard error, although with periodic variation the reverse is true.

One way round this problem is to take repeated sets of systematic samples. Suppose one wants a sample of 30 units from a total population (N) of 240. Instead of a single systematic sample of 30 units with k=8, one would take 6 systematic samples, each containing 5 units with a sampling interval of 48. The start point for each of those samples would be chosen at random. The mean and standard error would then be calculated in the same way as with one stage cluster sampling (see below), using the 6 systematic sample means in place of the cluster means ($\bar{Y}_i$).
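
A minimal sketch of this repeated systematic sampling scheme follows. The function name `repeated_systematic_se` and the stand-in population are hypothetical. Drawing distinct start points without replacement is an assumption; the text only says that each start is chosen at random.

```python
import math
import random
import statistics

def repeated_systematic_se(values, n_samples, interval, starts=None):
    """Mean and SE from several systematic samples with random starts."""
    if starts is None:
        # Assumption: distinct starts drawn from the first interval.
        starts = random.sample(range(interval), n_samples)
    # Each systematic sample takes every `interval`-th unit from its start.
    sample_means = [statistics.fmean(values[start::interval]) for start in starts]
    overall_mean = statistics.fmean(sample_means)
    # Treat the sample means exactly like cluster means.
    se = statistics.stdev(sample_means) / math.sqrt(len(sample_means))
    return overall_mean, se

population = list(range(240))   # hypothetical measurements, N = 240
mean, se = repeated_systematic_se(population, n_samples=6, interval=48)
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```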

### One stage cluster sampling

In one stage cluster sampling the clusters are chosen by simple random sampling, and within each cluster all secondary (evaluation) units are selected. The advantage of one stage cluster sampling is that you only need to be able to list all clusters to make the initial selection, and then to be able to detect all secondary units in the selected clusters.

If cluster sampling is used, the formula for a simple random sample will overestimate the precision of your estimate. This is because that formula assumes that members of the sample have been drawn independently with equal probabilities, which is not the case when cluster sampling is used. In fact, if secondary units within a cluster tend to be more similar to each other than to units in other clusters, the true standard error of your estimates will be much higher than that given by the simple random sampling formula.

#### Equally-weighted clusters

We first consider the situation where there are the same number of secondary units in each cluster. In many situations this is improbable. But it can occur - for example when sampling school classes, agricultural plots or cages of animals. The standard error of the overall mean is calculated very simply using the cluster means in place of the individual observations.

#### Algebraically speaking -

$$\bar{Y} = \frac{\sum \bar{Y}_i}{n} \qquad SE(\bar{Y}) = \frac{s}{\sqrt{n}}$$
Where :
• $\bar{Y}$ is now the overall mean of the clusters or, because each cluster has the same number of observations, the mean of means,
• $\bar{Y}_i$ is the value of each cluster mean,
• n is the number of clusters in the sample,
• s is the standard deviation of the cluster means where $s^2 = \sum (\bar{Y} - \bar{Y}_i)^2 / (n - 1)$,
• $SE(\bar{Y})$ is the estimated standard error of the overall cluster mean.
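
A short sketch of the calculation for equal-sized clusters is given below. The function name `cluster_mean_se` and the data are hypothetical. Only the cluster means enter the standard error.

```python
import math
import statistics

def cluster_mean_se(clusters):
    """Overall mean and SE from equal-sized clusters, using cluster means only."""
    cluster_means = [statistics.fmean(c) for c in clusters]
    n = len(cluster_means)                  # number of clusters, not observations
    overall_mean = statistics.fmean(cluster_means)
    se = statistics.stdev(cluster_means) / math.sqrt(n)
    return overall_mean, se

# Hypothetical weights of animals in three cages of four animals each.
cages = [[21, 23, 22, 24], [30, 29, 31, 30], [25, 26, 24, 25]]
mean, se = cluster_mean_se(cages)
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```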

The standard error of the overall proportion is calculated in the same way.

#### Algebraically speaking -

$$\bar{p} = \frac{\sum p_i}{n} = \frac{\sum f_i}{N} \qquad SE(\bar{p}) = \frac{s}{\sqrt{n}}$$
where
• $\bar{p}$ is the overall (mean) sample proportion,
• $p_i$ is each individual proportion,
• n is the number of clusters in the sample,
• $f_i$ is the number with the character of interest in each cluster,
• N is the total number of observations,
• s is the standard deviation of the individual proportions, where $s^2 = \sum (\bar{p} - p_i)^2 / (n - 1)$,
• $SE(\bar{p})$ is the estimated standard error of the overall proportion.
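
The proportion version can be sketched in the same way. The function name `cluster_proportion_se` and the counts are hypothetical, and the sketch assumes every cluster holds the same number of units.

```python
import math
import statistics

def cluster_proportion_se(positives, cluster_size):
    """Overall proportion and SE from equal-sized clusters."""
    proportions = [f / cluster_size for f in positives]
    n = len(proportions)                    # number of clusters
    p_bar = statistics.fmean(proportions)   # equals sum(positives) / N
    se = statistics.stdev(proportions) / math.sqrt(n)
    return p_bar, se

# Hypothetical: 3 classes of 10 pupils with 2, 4 and 6 infected.
p_bar, se = cluster_proportion_se([2, 4, 6], cluster_size=10)
print(f"p = {p_bar:.2f}, SE = {se:.4f}")
```

For these made-up counts the cluster-based standard error is about 0.115, whereas the simple random sampling formula applied to the pooled 12 out of 30 would give about 0.091.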

Note especially that the standard error of a proportion under cluster sampling is estimated quite differently from the way it was done under simple random sampling. In fact, it is handled in exactly the same way as a mean - which of course is what it is (except of a binary rather than of a measurement variable).

As a result, in cluster sampling the formulae for means and proportions are identical. Hence for the remainder of the section on cluster sampling we will only give the formulae for means. Simply substitute $p_i$ for $\bar{Y}_i$ to obtain proportions.

#### Unequally-weighted clusters

We will next consider the situation where all members of each cluster are examined, but clusters contain different numbers of individuals. This would be the case if for example we were using schools as our sampling unit.

We can estimate the overall mean by weighting each cluster mean by its relative size. The standard error of that weighted mean is estimated from the squared deviations of the cluster means from the overall weighted mean, each weighted by the square of the relative cluster size; these weighted squared deviations are then used in the usual standard error formula. The formulation below is given by Snedecor & Cochran (1967), p 515.

#### Algebraically speaking -

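$$\bar{Y}_w = \frac{\sum (\bar{Y}_i \times m_i)}{\sum m_i} \qquad SE(\bar{Y}_w) = \frac{s_w}{\sqrt{n}}$$

where

• $\bar{Y}_w$ is the overall weighted mean,
• $m_i$ is the number of secondary units in each cluster,
• n is the number of clusters in the sample,
• $\bar{m}$ is the average cluster size ($\sum m_i / n$),
• $s_w$ is the weighted standard deviation where

$$s_w^2 = \frac{1}{n - 1} \sum \left\{ \left( \frac{m_i}{\bar{m}} \right)^2 (\bar{Y}_i - \bar{Y}_w)^2 \right\}$$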

As we noted previously, simply substitute $p_i$ for $\bar{Y}_i$ to obtain proportions. This will give a formula that is identical to (but we hope more illuminating than) the usual formulation for proportions which was devised for use on calculators. Whichever formulation is used, note that it is only a good approximation to the true value of the standard error.
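
The weighted calculation for clusters of unequal size can be sketched as follows. The function name `weighted_cluster_mean_se` and the school data are hypothetical. When all clusters are the same size, the result reduces to the equal-cluster formula.

```python
import math

def weighted_cluster_mean_se(cluster_means, cluster_sizes):
    """Weighted overall mean and its approximate SE for unequal clusters."""
    n = len(cluster_means)
    total_units = sum(cluster_sizes)
    m_bar = total_units / n                 # average cluster size
    y_w = sum(y * m for y, m in zip(cluster_means, cluster_sizes)) / total_units
    # Squared deviations weighted by the squared relative cluster size.
    s_w_sq = sum((m / m_bar) ** 2 * (y - y_w) ** 2
                 for y, m in zip(cluster_means, cluster_sizes)) / (n - 1)
    return y_w, math.sqrt(s_w_sq / n)

# Hypothetical proportions infected in four schools of different sizes.
p_i = [0.10, 0.25, 0.40, 0.15]
m_i = [120, 300, 80, 200]
p_w, se = weighted_cluster_mean_se(p_i, m_i)
print(f"weighted proportion = {p_w:.3f}, SE = {se:.4f}")
```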

### Two stage cluster sampling

#### Equally-weighted clusters

In two stage cluster sampling, only a sample of secondary (evaluation) units is selected. We start with the equally-weighted clusters - in other words all clusters are of similar size. We have included the finite population correction in the formulation given below (see Bart et al. (1998) p 116 and Krebs (1999) p 296) because we can then see how the standard error is comprised of two parts: the first due to variation between the primary units (the clusters) and the second due to variation within the primary units:

#### Algebraically speaking -

$$\bar{Y} = \frac{\sum \bar{Y}_i}{n}$$
where:
• $\bar{Y}$ is the overall mean,
• $\bar{Y}_i$ is the mean of the ith unit of the sample,
• n is the number of primary units.

$$SE(\bar{Y}) = \sqrt{ [1 - f_1] \frac{s_1^2}{n} + [f_1 (1 - f_2)] \frac{s_2^2}{mn} }$$
where:
• n is the number of primary units sampled out of a total of N,
• m is the number of secondary units sampled out of a total of M,
• $f_1$ is the sampling fraction in the first stage (n/N),
• $f_2$ is the sampling fraction in the second stage (m/M),
• $s_1^2$ is the variance among the primary unit means,
• $s_2^2$ is the variance among observations within the primary units.
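
A sketch of the two stage calculation follows. The function name `two_stage_se` and its argument names are hypothetical. Estimating the within-unit variance as the average of the per-cluster sample variances is an assumption that holds for equal subsample sizes.

```python
import math
import statistics

def two_stage_se(cluster_samples, total_clusters, units_per_cluster):
    """Overall mean and SE for two stage sampling with equal-sized clusters."""
    n = len(cluster_samples)            # primary units sampled
    m = len(cluster_samples[0])         # secondary units sampled per cluster
    means = [statistics.fmean(c) for c in cluster_samples]
    s1_sq = statistics.variance(means)  # variance among primary unit means
    # Assumption: pooled within-cluster variance, equal m in every cluster.
    s2_sq = statistics.fmean(statistics.variance(c) for c in cluster_samples)
    f1 = n / total_clusters
    f2 = m / units_per_cluster
    variance = (1 - f1) * s1_sq / n + f1 * (1 - f2) * s2_sq / (m * n)
    return statistics.fmean(means), math.sqrt(variance)

# Hypothetical: 3 of 6 clusters sampled, 2 of 4 units measured in each.
samples = [[1, 3], [3, 5], [5, 7]]
mean, se = two_stage_se(samples, total_clusters=6, units_per_cluster=4)
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```

With a very large number of clusters in the population, f1 is close to zero and the result approaches the one stage value s1/√n.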

This clearly has the potential to get quite complicated. However, usually only a very small proportion of primary units is sampled, so f1 tends to zero. Under these conditions the formula simplifies to being identical to that used for one stage cluster sampling. Hence the variance of the overall mean is estimated solely from the variation between cluster means. This has very important implications:

• Secondary samples need only provide unbiased estimates of the mean of each cluster - an unbiased estimate of the variance of each mean is not required. Hence systematic (rather than random) sampling can be used for this purpose - and it no longer matters that you cannot obtain an unbiased estimate of the variance using this type of sampling. The cluster means are used to estimate the standard error of the overall mean. Note however that it is vitally important to get an unbiased estimate of the mean of each cluster - hence, although systematic sampling is fine, convenience sampling is not!
• Because the variance of the overall mean is estimated solely from the variation between cluster means, it is important that the clusters represent the full range of variation present in the population.
• Standard errors can be estimated for statistics (such as diversity indices) whose standard errors are difficult to estimate from a one stage sample.

#### Unequally-weighted clusters

Where clusters differ in size, they can be chosen either by simple random sampling or by probability proportional to size.

• If you select clusters by simple random sampling, you first have to determine the number of units in the selected clusters. The number of units sampled per cluster should then be proportional to the number of units in the cluster. Authorities differ on the optimal formulation to use for the standard error, but in practice it is usually assumed that only a very small proportion of primary units is sampled - so the same formulation for the standard error is used as when doing one stage cluster sampling.

• Using probability proportional to size simplifies the estimation of standard errors - but you have to know the number of secondary units in every cluster. A worked example of how to select by probability proportional to size is given below. An equal number of secondary units is then sampled per cluster. Use of probability proportional to size means that the sample mean per sub-unit is then an unbiased estimate of the population mean, and its standard error can be obtained by the formulation below (see Snedecor & Cochran (1967), p 536)

#### Algebraically speaking -

$$SE(\bar{Y}) = \sqrt{ \frac{1}{n (n - 1)} \sum (\bar{Y}_i - \bar{Y})^2 }$$

where

• $\bar{Y}$ is the overall weighted mean,
• $\bar{Y}_i$ is the mean of the ith unit of the sample,
• n is the number of primary units.
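
Selection with probability proportional to size and the standard error above can be sketched as follows. The function names `pps_select` and `pps_mean_se` and the data are hypothetical. Selecting clusters with replacement via `random.choices` is an assumption made to keep the sketch short, not necessarily the selection method the authors have in mind.

```python
import math
import random
import statistics

def pps_select(cluster_sizes, n):
    """Pick n cluster indices with probability proportional to size.

    Assumption: selection is with replacement, so a large cluster
    may be chosen more than once.
    """
    return random.choices(range(len(cluster_sizes)), weights=cluster_sizes, k=n)

def pps_mean_se(cluster_means):
    """Overall mean and SE from the means of the PPS-selected clusters."""
    n = len(cluster_means)
    overall = statistics.fmean(cluster_means)
    sum_sq = sum((y - overall) ** 2 for y in cluster_means)
    return overall, math.sqrt(sum_sq / (n * (n - 1)))

sizes = [120, 300, 80, 200, 50]         # hypothetical numbers of units per cluster
chosen = pps_select(sizes, n=3)
mean, se = pps_mean_se([2.0, 4.0, 6.0])  # hypothetical cluster means
print(chosen, f"mean = {mean:.2f}, SE = {se:.3f}")
```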

### Stratified sampling

Stratified random sampling is similar to systematic sampling in one respect - it provides a more even coverage of the population. However, it has a major advantage over systematic sampling in that estimates of variability in the population are straightforward and unbiased. A further advantage is that separate estimates can be obtained for each part of the study area. It can be done using either equal allocation (same number of units sampled in each stratum) or proportional allocation (same proportion of units sampled in each stratum); appropriate formulae for each are given below.

The overall mean ($\bar{Y}$) is estimated as the weighted mean of the stratum means:

#### Algebraically speaking -

$$\bar{Y} = \sum W_i \bar{Y}_i$$

where
• $\bar{Y}$ is the overall mean,
• $W_i$ is the proportion of the population in the ith stratum,
• $\bar{Y}_i$ is the mean of the ith stratum.

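Where the strata are of equal size (so that equal and proportional allocation coincide and every $W_i = 1/N$), this simplifies to the arithmetic mean of the individual strata:

$$\bar{Y} = \frac{\sum \bar{Y}_i}{N}$$

where:

• N is the number of strata,
• $\bar{Y}$ and $\bar{Y}_i$ are as above.

More generally, under proportional allocation $W_i = n_i / n$, so the weighted mean equals the plain mean of all the sample observations.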

Estimation of the standard error for a mean obtained by stratified sampling is very different from that with cluster sampling. This is because with stratified sampling, the random element is within each stratum - with cluster sampling it is in the selection of the clusters.

In order to estimate the standard error of the mean from a stratified sample, we need to first estimate the sample variance in each stratum. Assuming that simple random sampling is used, this is given by the sum of the square of the deviations of individual measurements from the stratum mean divided by the number of observations per stratum less one. We then combine these variances in a weighted average to estimate the standard error of the overall mean as the square root of the sum of weighted stratum variances divided by the number of observations per stratum:

#### Algebraically speaking -

$$SE(\bar{Y}) = \sqrt{ \sum \frac{W_i^2 s_i^2}{n_i} }$$
where
• $SE(\bar{Y})$ is the standard error of the overall mean, $\bar{Y}$,
• $s_i^2$ is the variance of the ith stratum, $\sum_j (Y_{ij} - \bar{Y}_i)^2 / (n_i - 1)$,
• $n_i$ is the number of observations in the ith stratum.

For proportional allocation this simplifies to:
$$SE(\bar{Y}) = \sqrt{ \frac{\sum W_i s_i^2}{n} }$$
where

• Wi is the proportion of the population in the ith stratum,
• n is the total number of observations.
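
The stratified estimate of the mean and its standard error can be sketched as below. The function name `stratified_mean_se` and the data are hypothetical. The stratum weights are assumed to be known population proportions that sum to 1.

```python
import math
import statistics

def stratified_mean_se(strata_samples, weights):
    """Weighted mean and SE from simple random samples within strata."""
    mean = sum(w * statistics.fmean(s) for s, w in zip(strata_samples, weights))
    # Sum of W_i^2 * s_i^2 / n_i over strata.
    variance = sum(w ** 2 * statistics.variance(s) / len(s)
                   for s, w in zip(strata_samples, weights))
    return mean, math.sqrt(variance)

# Hypothetical counts from two habitat strata covering 50% of the area each.
strata = [[1, 2, 3], [10, 12, 14]]
mean, se = stratified_mean_se(strata, weights=[0.5, 0.5])
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```

In this made-up example the allocation is proportional, so the general formula and the simplified one give the same standard error.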

For proportions, we can estimate the overall proportion by multiplying the proportion for each stratum by its weight, and summing the result. Because sampling within each stratum is random, we can use binomial standard error (√(pq/n)) within strata, and then weight these to estimate the overall standard error. The standard error of the overall proportion is then obtained from:

#### Algebraically speaking -

$$\bar{p} = \sum W_i p_i$$

$$SE(\bar{p}) = \sqrt{ \sum \frac{W_i^2 p_i q_i}{n_i} }$$
where:

• $\bar{p}$ is the overall (weighted) proportion,
• $W_i$ is the weight of each stratum (obtained by dividing the number of sampling units in each stratum by the total number of units),
• $p_i$ is the proportion with the characteristic of interest in the ith stratum,
• $SE(\bar{p})$ is the standard error of the overall proportion,
• $q_i$ is equal to $1 - p_i$.

For proportional allocation this simplifies to:

$$SE(\bar{p}) = \sqrt{ \frac{\sum W_i p_i q_i}{n} }$$
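
A sketch of the stratified proportion calculation, using the general (not the simplified) formula, is shown below. The function name `stratified_proportion_se` and the inputs are hypothetical.

```python
import math

def stratified_proportion_se(proportions, sample_sizes, weights):
    """Weighted proportion and SE using the binomial variance within strata."""
    p_bar = sum(w * p for p, w in zip(proportions, weights))
    # Sum of W_i^2 * p_i * q_i / n_i over strata.
    variance = sum(w ** 2 * p * (1 - p) / n
                   for p, n, w in zip(proportions, sample_sizes, weights))
    return p_bar, math.sqrt(variance)

# Hypothetical: two strata of equal weight, 10 units sampled in each.
p_bar, se = stratified_proportion_se([0.2, 0.5], [10, 10], [0.5, 0.5])
print(f"p = {p_bar:.3f}, SE = {se:.4f}")
```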

### Adaptive cluster sampling

In adaptive cluster sampling an initial random sample of units is taken, and additional sampling units are then taken in the immediate neighbourhood of each 'positive' sampling unit. This creates a set of 'networks' of sampling units, each comprising a different number of sampling units. Krebs (1999) gives the following as an unbiased estimator of the mean and standard error.

#### Algebraically speaking -

$$\bar{Y} = \frac{\sum \bar{Y}_i}{n}$$
where:
• $\bar{Y}$ is an unbiased estimate of the overall mean per sampling unit,
• $\bar{Y}_i$ is the mean of the ith network, given by $\sum_j Y_{ij} / m_i$ for each network,
• $Y_{ij}$ is the number in each sampling unit of the ith network, and $m_i$ is the number of sampling units in each network,
• n is the number of initial sampling units selected by random sampling.

Note that if a given network i includes k primary sample units, it is counted k times in the estimate of the overall mean.

The standard error of the mean is given by:

$$SE(\bar{Y}) = \sqrt{ \frac{\sum (\bar{Y} - \bar{Y}_i)^2}{n (n - 1)} }$$
where:
• $SE(\bar{Y})$ is the standard error of the overall mean. If the sample constitutes more than 5% of the population, multiply by the finite population correction: $\sqrt{1 - n/N}$ where N is the total possible number of sampling units.
• $\bar{Y}$, $\bar{Y}_i$ and n are as above.
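
The network-based estimator can be sketched as follows. The function name `adaptive_mean_se` and the counts are hypothetical. The input is assumed to hold one entry per initial sampling unit, each being the list of counts in that unit's network, so a network hit by k initial units appears k times.

```python
import math
import statistics

def adaptive_mean_se(networks, total_units=None):
    """Mean per sampling unit and SE from the networks of the initial units."""
    n = len(networks)                        # number of initial sampling units
    network_means = [statistics.fmean(net) for net in networks]
    mean = statistics.fmean(network_means)
    sum_sq = sum((mean - w) ** 2 for w in network_means)
    se = math.sqrt(sum_sq / (n * (n - 1)))
    # Finite population correction, applied only if N is given and n/N > 5%.
    if total_units is not None and n / total_units > 0.05:
        se *= math.sqrt(1 - n / total_units)
    return mean, se

# Hypothetical: four initial quadrats, one of which fell in a 3-unit network.
networks = [[0], [0], [5, 3, 1], [0]]
mean, se = adaptive_mean_se(networks)
print(f"mean = {mean:.2f}, SE = {se:.3f}")
```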