"It has long been an axiom of mine that the little things are infinitely the most important"
Survey Sampling Methods
Purpose
The primary purpose of survey sampling is to obtain estimates of population parameters. These may be of the absolute (or relative) density of an organism, or of some characteristic of that population - such as the proportion infected with a disease, or the mean packed cell volume. As well as the measure of location, we also want a measure of precision, usually the standard error.
Reliable estimates of population parameters can only be obtained with the use of probability sampling - that is where each sampling unit has a known probability of selection. We therefore concentrate on these approaches below - with the proviso that in multistage sampling the requirement for strict probability sampling is often relaxed in the lower levels.
Random and systematic sampling
Simple random sampling
A simple random sample of n sampling units is one in which every possible combination of n units is equally likely to be in the sample selected. Alternatively, we can define it as a sample where each unit is selected one at a time and all units not in the sample have the same probability of being selected. Note that because we are usually sampling from finite populations, as more units are selected, the probability of each remaining unit being selected increases.
You have already encountered formulations for the mean and standard error of both means and proportions when simple random sampling is being used - we summarize them again here for completeness.
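As a reminder of what those calculations look like, here is a minimal Python sketch. The function names are hypothetical, and the optional finite population correction (1 - n/N) is an assumption added for illustration; the binomial form √(pq/n) is the one used for proportions later on this page.

```python
import math

def srs_mean_se(values, population_size=None):
    """Mean and standard error of the mean for a simple random sample."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with n - 1 in the denominator.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    # Finite population correction: an assumption, applied only if N is given.
    fpc = 1.0 if population_size is None else 1.0 - n / population_size
    return mean, math.sqrt(fpc * variance / n)

def srs_proportion_se(positives, n, population_size=None):
    """Proportion and its binomial standard error, sqrt(pq/n)."""
    p = positives / n
    fpc = 1.0 if population_size is None else 1.0 - n / population_size
    return p, math.sqrt(fpc * p * (1.0 - p) / n)
```

For example, `srs_proportion_se(20, 100)` treats 20 infected animals out of 100 sampled as a simple random sample and returns the proportion with its standard error.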
Systematic sampling
A systematic sample is one in which the units selected for a sample occupy related positions in the sampling frame, the first unit being selected at random. For example, if you have a population (N) of 1000 units and want to select every 50th unit, you select the first unit at random from among the first 50 (say unit 28) and then select every 50th unit after it to get units 78, 128, 178 and so on, giving a sample of 20 units. The sampling interval (k) in this case would be 50.
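The selection procedure can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, units are assumed to be numbered 1 to N, and the random start is assumed to be drawn from the first k units so that every unit has the same 1/k chance of selection.

```python
import random

def systematic_sample(population_size, interval, rng=random):
    """Return the unit numbers (1-based) of a systematic sample."""
    # Random start within the first sampling interval (an assumption).
    start = rng.randint(1, interval)
    # Every interval-th unit from the start to the end of the frame.
    return list(range(start, population_size + 1, interval))
```

With `population_size=1000` and `interval=50` this always returns 20 units spaced 50 apart.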
The sample mean and sample proportion are calculated in exactly the same way as for simple random sampling. These point estimates are unbiased provided (a) the first unit is selected at random, (b) the ratio N/k is an integer and (c) there is no periodic variation. The first and third assumptions are critical. The second is less so: provided N is large, the bias resulting from N/k not being an integer is so small that it can be disregarded.
As regards actual precision, systematic samples are generally more precise than random samples - simply because they cover the population evenly. This is not, however, the case if there is periodic variation - in this case they can be much less precise than random samples. Estimating that precision is more problematic. Approximate standard errors are usually calculated in the same way as for simple random sampling, but they are invariably biased. In most situations the estimated standard error is greater than the true standard error, although with periodic variation the reverse is true.
One way round this problem is to take repeated sets of systematic samples. Suppose one wants a sample of 30 units from a total population (N) of 240. Instead of a single systematic sample of 30 units with k=8, one would take 6 systematic samples, each containing 5 units, with a sampling interval of 48. The start point for each of those samples would be chosen at random. The mean and standard error would then be calculated in the same way as with one stage cluster sampling (see below) using the 6 systematic sample means in place of the cluster means (x̄ᵢ).
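The repeated-sample design can be sketched as follows. The function names are hypothetical, and two choices are assumptions made for illustration: the random starts are drawn without replacement so that the samples never share a unit, and no finite population correction is applied to the standard error.

```python
import math
import random
import statistics

def repeated_systematic_samples(population_size, n_samples, units_per_sample, rng=random):
    """Return n_samples systematic samples, each a list of 1-based unit numbers."""
    interval = population_size // units_per_sample
    # Distinct random starts within the first interval (an assumption).
    starts = rng.sample(range(1, interval + 1), n_samples)
    return [list(range(s, population_size + 1, interval)) for s in starts]

def mean_and_se_from_sample_means(sample_means):
    """Treat each systematic sample mean as one observation."""
    overall_mean = statistics.mean(sample_means)
    se = statistics.stdev(sample_means) / math.sqrt(len(sample_means))
    return overall_mean, se
```

Calling `repeated_systematic_samples(240, 6, 5)` gives 6 samples of 5 units with an interval of 48, 30 units in all; the mean of the measurements in each sample is then passed to `mean_and_se_from_sample_means`.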
One stage cluster sampling
In one stage cluster sampling the clusters are chosen by simple random sampling, and within each cluster all secondary (evaluation) units are selected. The advantage of one stage cluster sampling is that you only need to be able to list all clusters to make the initial selection, and then to be able to detect all secondary units in the selected clusters.
If cluster sampling is used, the formula for a simple random sample will overestimate the precision of your estimate. This is because that formula assumes that members of the sample have been drawn independently with equal probabilities, which is not the case when cluster sampling is used. In fact, if secondary units within a cluster tend to be more similar to each other than to units in other clusters, then the true standard error of your estimate will be much higher than that given by the simple random sampling formula.
We first consider the situation where there are the same number of secondary units in each cluster. In many situations this is improbable. But it can occur - for example when sampling school classes, agricultural plots or cages of animals. The standard error of the overall mean is calculated very simply using the cluster means in place of the individual observations.
The standard error of the overall proportion is calculated in the same way.
Note especially that the standard error of a proportion under cluster sampling is estimated quite differently from the way it was done under simple random sampling.
As a result, in cluster sampling the formulae for means and proportions are identical. Hence for the remainder of the section on cluster sampling we will only give the formulae for means. Simply substitute x̄ᵢ with pᵢ to obtain proportions.
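For equal-sized clusters, the calculation can be sketched in Python as below. The function names and the small data set in the usage note are hypothetical, and the finite population correction is omitted as a simplifying assumption. The second function applies the simple random sampling formula to the same data, purely to show how it can understate the standard error when units within a cluster resemble each other.

```python
import math
import statistics

def cluster_mean_se(clusters):
    """Overall mean and SE from equal-sized clusters, using cluster means."""
    if len({len(c) for c in clusters}) != 1:
        raise ValueError("this sketch assumes clusters of equal size")
    cluster_means = [statistics.mean(c) for c in clusters]
    overall_mean = statistics.mean(cluster_means)
    # The cluster means take the place of individual observations.
    se = statistics.stdev(cluster_means) / math.sqrt(len(cluster_means))
    return overall_mean, se

def naive_srs_se(clusters):
    """SE obtained by (wrongly) treating all units as a simple random sample."""
    values = [x for c in clusters for x in c]
    return statistics.stdev(values) / math.sqrt(len(values))
```

With `clusters = [[1, 1, 2], [5, 6, 5], [9, 8, 9]]`, where units within a cluster are very alike, `cluster_mean_se` returns a larger standard error than `naive_srs_se`. For proportions, pass 0/1 values so that each cluster mean is the cluster proportion.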
We will next consider the situation where all members of each cluster are examined, but clusters contain different numbers of individuals. This would be the case if for example we were using schools as our sampling unit.
We can estimate the overall mean by weighting each cluster mean (x̄ᵢ) by the number of secondary units in that cluster.
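A minimal sketch of one common choice for clusters of unequal size is the size-weighted (ratio) estimator, in which each cluster mean counts in proportion to the number of units in its cluster. The function name, and the use of this particular estimator, are assumptions made for illustration.

```python
def weighted_cluster_mean(cluster_means, cluster_sizes):
    """Overall mean as the cluster means weighted by cluster size."""
    if len(cluster_means) != len(cluster_sizes):
        raise ValueError("need one size per cluster mean")
    # Sum of (size x mean) is the total of all observations.
    total = sum(m * x for m, x in zip(cluster_sizes, cluster_means))
    return total / sum(cluster_sizes)
```

Passing cluster proportions instead of cluster means gives the overall proportion in the same way.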
As we noted previously, simply substitute x̄ᵢ with pᵢ to obtain proportions. This will give a formula that is identical to (but we hope more illuminating than) the usual formulation for
Two stage cluster sampling
In two stage cluster sampling, only a sample of secondary (evaluation) units is selected. We start with the equally-weighted clusters - in other words all clusters are of similar size. We have included the finite population correction in the formulation given below (see Bart et al.
This clearly has the potential to get quite complicated. However, usually only a very small proportion of primary units is sampled, so the first-stage sampling fraction f1 tends to zero. Under these conditions the formula simplifies to being identical to that used for one stage cluster sampling. Hence the variance of the overall mean is estimated solely from the variation between cluster means. This has very important implications:
Where clusters differ in size, they can be chosen either by simple random sampling or by probability proportional to size.
Stratified sampling
Stratified random sampling is similar to systematic sampling in one respect - it provides a more even coverage of the population. However, it has a major advantage over systematic sampling in that estimates of variability in the population are straightforward and unbiased. A further advantage is that separate estimates can be obtained for each part of the study area. It can be done using either equal allocation (same number of units sampled in each stratum) or proportional allocation (same proportion of units sampled in each stratum); appropriate formulae for each are given below.
The overall mean is estimated as the weighted mean of the stratum means.
Estimation of the standard error for a mean obtained by stratified sampling is very different from that with cluster sampling. This is because with stratified sampling, the random element is within each stratum - with cluster sampling it is in the selection of the clusters.
In order to estimate the standard error of the mean from a stratified sample, we need to first estimate the sample variance in each stratum. Assuming that simple random sampling is used, this is given by the sum of the squared deviations of individual measurements from the stratum mean, divided by the number of observations in that stratum less one. We then combine these variances in a weighted average to estimate the standard error of the overall mean as the square root of the sum of weighted stratum variances divided by the number of observations per stratum.
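The stratified calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the function name is hypothetical, each stratum weight is taken to be that stratum's share of the population (N_h/N), the weights enter the variance squared as in the standard textbook formula, and no finite population correction is applied.

```python
import math
import statistics

def stratified_mean_se(strata_samples, stratum_sizes):
    """Overall mean and its SE from simple random samples within strata."""
    total = sum(stratum_sizes)
    weights = [size / total for size in stratum_sizes]  # assumed W_h = N_h / N
    means = [statistics.mean(s) for s in strata_samples]
    # Sample variance within each stratum (n_h - 1 in the denominator).
    variances = [statistics.variance(s) for s in strata_samples]
    overall_mean = sum(w * m for w, m in zip(weights, means))
    se = math.sqrt(sum(w ** 2 * v / len(s)
                       for w, v, s in zip(weights, variances, strata_samples)))
    return overall_mean, se
```

Because the variance is computed separately within each stratum, differences between strata do not inflate the standard error, which is the source of the gain in precision over a simple random sample.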
For proportions, we can estimate the overall proportion by multiplying the proportion for each stratum by its weight, and summing the result. Because sampling within each stratum is random, we can use the binomial standard error (√(pq/n)) within each stratum, and then weight these to obtain the standard error of the overall proportion.
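The same approach for proportions can be sketched as below. The function name is hypothetical, the stratum weights are again assumed to be population shares, and the squared weights and the absence of a finite population correction are assumptions following the usual textbook form.

```python
import math

def stratified_proportion_se(positives, sample_sizes, stratum_sizes):
    """Overall proportion and its SE from binomial SEs within strata."""
    total = sum(stratum_sizes)
    weights = [size / total for size in stratum_sizes]  # assumed W_h = N_h / N
    props = [x / n for x, n in zip(positives, sample_sizes)]
    overall = sum(w * p for w, p in zip(weights, props))
    # Within-stratum binomial variance pq/n, combined with squared weights.
    variance = sum(w ** 2 * p * (1.0 - p) / n
                   for w, p, n in zip(weights, props, sample_sizes))
    return overall, math.sqrt(variance)
```

For example, `stratified_proportion_se([10, 30], [100, 100], [500, 1500])` combines a 10% prevalence in a small stratum with a 30% prevalence in a stratum three times its size.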
Adaptive cluster sampling
Here an initial random sample of units is taken, but then additional sampling units are taken in the immediate neighbourhood of each 'positive' sampling unit. This creates a set of 'networks' of sampling units, each comprising different numbers of sampling units.
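How the sample grows can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the study area is a rectangular grid of counts, a unit is 'positive' if its count is above zero, the neighbourhood is the four adjacent cells, and sampling keeps spreading from every newly found positive unit.

```python
def adaptive_cluster_sample(grid, initial_units, is_positive=lambda count: count > 0):
    """Return the set of (row, col) units sampled after adaptive expansion."""
    n_rows, n_cols = len(grid), len(grid[0])
    sampled = set()
    to_visit = list(initial_units)
    while to_visit:
        r, c = to_visit.pop()
        if (r, c) in sampled:
            continue
        sampled.add((r, c))
        # Only positive units trigger sampling of their neighbours.
        if is_positive(grid[r][c]):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < n_rows and 0 <= nc < n_cols and (nr, nc) not in sampled:
                    to_visit.append((nr, nc))
    return sampled
```

In practice `initial_units` would be a simple random sample of grid cells, for example drawn with `random.sample`. Each positive initial unit then ends up with its connected patch of positive cells plus the empty cells bordering that patch, while a negative initial unit stays on its own.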