What is pseudoreplication?
Pseudoreplication in its various forms is one of the commonest errors in the design and analysis of biological research. It is defined as the use of inferential statistics to test for treatment effects with data from experiments or observational studies where either treatments are not replicated (though samples may be) or replicates are not statistically independent. Strictly speaking this definition is inaccurate because there are a few situations where inferential statistics can be used when treatments are not replicated. But in most situations to test for treatment effects you must have (independent) replications of your treatments.
In medical and veterinary research pseudoreplication is usually referred to as 'using the wrong unit of analysis'. It is commonly encountered where the herd or village is the sampling or experimental unit, yet analysis is performed using the individual as the unit of analysis. For example, say people in two villages are given mosquito nets and the incidence of malaria is compared with that in two control villages without nets. Using individual people as replicates (for example attaching a confidence interval to the incidence rate with 'n' equal to the number of people) would be pseudoreplication as individuals within a village are not independent replicates. No matter how many individuals there are in each village, you only have two replicates of each treatment level since treatments were allocated to villages - not to individuals.
In most other disciplines, such as crop protection, forestry and wildlife research, the term pseudoreplication is used. For example, say fertilizer is applied to one field of cabbages and not to another (control) field. If 10 cabbages were sampled randomly from each field and mean yield determined, it would be quite wrong to use inferential statistics to demonstrate a difference between treatments. This would be pseudoreplication because the cabbages within a field are not independent replicates. If (as in this case) the cabbages were sampled randomly, one could test for a difference between the two individual fields - but not between the two treatments, which is what one is interested in. The same would apply if one took repeated samples over time from the same two fields, and tried to use the repeated samples as treatment replicates.
One might (wrongly) conclude from the points above that there is therefore no point in taking multiple samples over space or time. To return to the cabbages, perhaps it would be better to only sample two (or even one) cabbage(s) per field and use lots of replicate fields! The problem with that is that you would have a very low level of precision in measuring the yield for each replicate. Exactly the same would apply when comparing prevalences of malaria in treated versus untreated villages. If there were only ten people living in one of the villages, your estimate of prevalence there is inevitably going to be imprecise. In practice therefore we need to have enough observations to guarantee an adequate level of precision for each sampling 'level'.
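The trade-off between the number of fields and the number of cabbages per field can be made concrete with the standard variance formula for a balanced two-level nested design. What follows is only a minimal sketch: the function name and the two variance components are hypothetical values chosen for illustration, not figures from any real trial.

```python
import math

def se_treatment_mean(n_fields, cabbages_per_field,
                      var_between_fields, var_within_field):
    """Standard error of a treatment mean in a balanced nested design.

    Each field mean has variance
        var_between_fields + var_within_field / cabbages_per_field,
    and the treatment mean averages n_fields independent field means.
    """
    var_field_mean = var_between_fields + var_within_field / cabbages_per_field
    return math.sqrt(var_field_mean / n_fields)

# Hypothetical variance components (kg^2), assumed for illustration only.
VAR_BETWEEN = 0.04  # natural variation between fields
VAR_WITHIN = 0.25   # variation between cabbages within a field

for n_fields, per_field in [(4, 1), (4, 10), (4, 100), (8, 10)]:
    se = se_treatment_mean(n_fields, per_field, VAR_BETWEEN, VAR_WITHIN)
    print(f"{n_fields} fields x {per_field} cabbages per field: SE = {se:.3f} kg")
```

Under these assumed values, going from 1 to 10 cabbages per field roughly halves the standard error, whereas going from 10 to 100 gains very little. Once each field is measured with reasonable precision, only additional fields reduce the standard error further.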
Types of pseudoreplication
Hurlbert described three ways in which pseudoreplication can arise, which we detail below. It is possibly not the best way to categorise the different types, but we will stick to it since the terms are widely used in the literature.
Simple pseudoreplication
This is where there is only a single experimental unit (= replicate) per treatment, but multiple measurements are made on each experimental unit.
Simple pseudoreplication is probably most common in observational studies - especially multiple group studies. If random samples are taken from one polluted area and one unpolluted area, they can validly be used to compare the two areas, but not to examine the effect of pollution. That would require comparison of a number of polluted areas with a number of unpolluted areas.
Another way of looking at this is to say that the treatment factor is completely confounded with area. In other words, we do not know whether the differences arise from some other factor which differs between the two areas.
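The consequence of this confounding can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis of real data: pollution is given no effect at all, each simulated study samples 10 points from one 'polluted' and one 'unpolluted' area, and the areas differ only through natural area-to-area variation. All names and parameter values are hypothetical.

```python
import random
import statistics

T_CRIT_DF18 = 2.101  # two-sided 5% critical value of Student's t, 18 d.f.

def pooled_t(sample_a, sample_b):
    """Two-sample Student t statistic using the pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / (n_a + n_b - 2)
    se = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / se

def false_positive_rate(area_sd, n_sims=1000, n_per_area=10, seed=1):
    """Share of simulated studies declaring a 'pollution effect' when none exists.

    area_sd is the assumed standard deviation of natural differences
    between areas; within-area standard deviation is fixed at 1.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        samples = []
        for _area in range(2):
            area_mean = rng.gauss(0.0, area_sd)  # natural level of this area
            samples.append([rng.gauss(area_mean, 1.0) for _ in range(n_per_area)])
        if abs(pooled_t(samples[0], samples[1])) > T_CRIT_DF18:
            rejections += 1
    return rejections / n_sims

print("no natural area differences:  ", false_positive_rate(area_sd=0.0))
print("natural area differences, sd 1:", false_positive_rate(area_sd=1.0))
```

When the areas do not differ naturally, the test rejects at close to its nominal 5% rate. As soon as areas differ for reasons unrelated to pollution, the test very often detects a real difference between the two areas, which would be misread as a pollution effect.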
Temporal pseudoreplication
This is identical to simple pseudoreplication except that multiple samples are taken from the same unit over time, rather than over space.
Sacrificial pseudoreplication
This occurs where treatments have been genuinely replicated, but where the analysis does not correctly use the variation between replicates to assess the treatment effect. This is frequently a problem with nested designs. Hurlbert divided this into two categories but, as we shall see, the two categories only reflect the different ways in which binary and measurement variables are analysed:
- The data for replicates are pooled before analysis.
This is most commonly done for binary variables when proportions are being compared - such as prevalences or sex ratios. For example say an insecticide treatment is applied to ten villages, and not to another ten villages. The prevalence of malaria is then compared by pooling all the observations in treated and untreated villages, and comparing overall prevalences (using for example Pearson's chi square test). This is not valid because it assumes that treatment has been randomized to individuals - when in fact it was randomized to villages. The correct approach is to work out the prevalence for each village and then compare the mean prevalences for treated and untreated villages.
- Repeated measurements (evaluation units) on one experimental unit are treated as independent replicates. This is usually done for measurement variables where analysis is then carried out using (say) a t-test or analysis of variance. To return to our cabbages, say two levels of treatment are randomly allocated to four fields of cabbages. You then sample 10 cabbages from each field and weigh them. This is a nested design with cabbages nested in field, and fields nested in treatment. It is incorrect to analyze this using cabbages as replicates because treatment was randomized to fields - not to cabbages. In other words you have only two replicates (each the mean of 10 weights) for each treatment level - not 20 replicates!
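The cabbage example can be sketched in code to show how the choice of replicate changes the degrees of freedom. This is a minimal illustration with simulated weights: the treatment effect, the variance values and all names are assumptions made up for the sketch, and a pooled-variance t statistic stands in for whatever test would actually be used.

```python
import random
import statistics

def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    se = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / se, df

rng = random.Random(42)

# Hypothetical data: 2 treatments x 2 fields x 10 cabbages, weights in kg.
# Assumed: fields differ naturally (sd 0.3), cabbages vary within a field (sd 0.2).
weights = {}  # (treatment, field) -> list of 10 cabbage weights
for treatment, effect in [("control", 0.0), ("fertilizer", 0.2)]:
    for field in (1, 2):
        field_mean = 1.5 + effect + rng.gauss(0.0, 0.3)
        weights[(treatment, field)] = [rng.gauss(field_mean, 0.2) for _ in range(10)]

def cabbages(treatment):
    """All individual weights for a treatment (the wrong unit of analysis)."""
    return [w for (trt, _), ws in weights.items() if trt == treatment for w in ws]

def field_means(treatment):
    """One mean per field for a treatment (the correct unit of analysis)."""
    return [statistics.mean(ws) for (trt, _), ws in weights.items() if trt == treatment]

t_wrong, df_wrong = pooled_t(cabbages("fertilizer"), cabbages("control"))
t_right, df_right = pooled_t(field_means("fertilizer"), field_means("control"))

print(f"cabbages as replicates: t = {t_wrong:.2f} on {df_wrong} d.f.")
print(f"fields as replicates:   t = {t_right:.2f} on {df_right} d.f.")
```

The first analysis claims 38 degrees of freedom, the second only 2. With 2 degrees of freedom the two-sided 5% critical value of t is 4.303, compared with about 2.02 at 38 degrees of freedom, which shows how little information two fields per treatment really provide.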
Ensuring independence
Even if treatments are allocated randomly, and the analysis is carried out using the correct unit of analysis, lack of independence can still arise after the process of randomization. We have already discussed this issue at length in the core text so we will restrict ourselves to two further examples.
Efficacy of anthelmintic treatment for calves
Say we want to compare three anthelmintic drugs with a 'no treatment' control for their efficacy in keeping calves free of worms when they are released on to pasture in spring. Forty calves (= experimental units) are selected for the experiment and the four treatments (three drugs plus the control) are randomly allocated to the calves. We then want to turn the calves out to pasture to assess the efficacy of treatment. If we turn them all out on to pasture together, then infections will develop in the control calves and continually reinfect the pasture. Hence treatments will not be independent. If we divide the pasture into four, and keep each treatment group separate, we again have non-independence and effectively only one replicate for each treatment.
One possible solution in this case would be to divide the pasture into 40 separate pastures, and keep each calf separate. However, one could argue that this is a very unnatural arrangement and may affect their behaviour. Probably the best option would be to change the experimental unit to a herd of (say) five calves, allocate treatment to herds and then release each herd on to separate pasture. Because areas of pasture close to each other are likely to have similar worm loads (spatial autocorrelation) care would have to be taken to ensure that allocation of herds to pasture area is also done randomly. Note that in doing this we again change the experimental unit, which has now become the herd/pasture.
Comparison of insect trap types
Say we want to compare four different trap types for their efficacy in catching a species of insect. Let's start with a completely randomized experimental design. We could just make twenty replicate traps of each design, select eighty sites (= experimental units) and randomly allocate the four trap types to the different sites. Sites would have to be selected such that no two sites were too close together (that is within the 'range of attraction' of another trap) or treatments would not be independent. One also has to assume that the insect population is (infinitely) large and mobile so that there is no 'trapping out' effect around the best traps. Catch would be recorded over a period of time. It would probably be best to record over several days in order to increase precision - but note we cannot use the repeated samples over time as replicates in this design as that would be pseudoreplication.
Such a design would be statistically valid. But the main problem it would face is that trap catches often vary enormously between sites. Twenty replicates per treatment may not be adequate - we might need to use hundreds of replicates to show a difference. Such a massive trapping effort could drastically reduce the size of the insect population - not to mention your research funds.
This problem is often got round in practice by using some sort of crossover design - usually a multiperiod Latin square design, where the treatment at each site changes over time. The rows and columns of the Latin square represent sites and days respectively. A key assumption for this design is that relative catches in different sites remain the same over time (technically, that there is no 'day times site' interaction). Ideally several replicate Latin squares are run in several locations, but often just one square is used.
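The layout of such a square can be sketched as follows. This is an illustrative sketch only: the trap labels and function names are hypothetical, and shuffling the rows and columns of a cyclic square is a simple randomisation that does not sample uniformly from all possible Latin squares.

```python
import random

def latin_square(treatments, seed=None):
    """Return a randomised Latin square as a list of rows (sites).

    Starts from the cyclic square and shuffles whole rows and whole
    columns, which keeps every treatment once per row and once per column.
    """
    k = len(treatments)
    rng = random.Random(seed)
    square = [[treatments[(row + col) % k] for col in range(k)]
              for row in range(k)]
    rng.shuffle(square)        # permute rows (sites)
    col_order = list(range(k))
    rng.shuffle(col_order)     # permute columns (days)
    return [[row[c] for c in col_order] for row in square]

def is_latin(square):
    """Check that each symbol occurs exactly once in every row and column."""
    k = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == k and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(k))
    return len(symbols) == k and rows_ok and cols_ok

TRAPS = ["A", "B", "C", "D"]  # hypothetical labels for the four trap types
design = latin_square(TRAPS, seed=7)

print("         day1 day2 day3 day4")
for site, row in enumerate(design, start=1):
    print(f"site {site}:  " + "    ".join(row))
```

Each trap type is used once at every site and once on every day, so differences between sites and between days are balanced across the trap types.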
So what are the implications of this design for pseudoreplication?
This can be answered most readily by asking how many replicates of each trap we are actually comparing. Ideally one would use a different replicate of each trap type each day. But, in practice, what is usually done (and what we have done in the past!) is to rotate the same trap round to a different position each day. If we only have one square, there is therefore only one of each trap type in existence. This means that treatments within a square are not independent - and we cannot possibly be comparing trap type, only those particular traps! What should be done here is either use a different replicate trap each day - so that each square is using four replicate traps - or treat each square as providing only a single (relatively precise) replicate, and replicate the squares.
Some contrasting viewpoints
The term 'pseudoreplication' was first introduced by Hurlbert (1984) in one of the most widely cited publications of all time. However, a number of ecologists have since taken issue with Hurlbert, especially in relation to terminology, and the need to distinguish the issues of replication and independence. A few, most notably Oksanen (2001), have gone further and rejected some of the key tenets of Hurlbert's thesis. Oksanen considered what the options were for a researcher who wished to investigate the impact of an intervention in a large scale system such as a lake or a ranch.
These were:
- Carry out microcosm experiments - in other words create miniature ecosystems in the laboratory and experiment with these.
- Carry out a field trial but at a reduced scale - say with small ponds or enclosure experiments. Both these options have their place in investigating an ecological issue but one can never be sure that things would not be very different at a larger scale.
- Use a single treatment but compare it with multiple controls. Various analytical options are then available although most make questionable assumptions.
- Conduct an unreplicated experiment and then either (a) just present results descriptively with no statistical testing (as recommended by Hurlbert) or (b) use inferential statistics and commit pseudoreplication.
Oksanen evaluates these different options and concludes that only Hurlbert's approach of using descriptive statistics is suboptimal - other approaches all have their place. He also uses the (distinctly dubious) argument that, since statistical tests of spatial and temporal differences in observational studies still abound in the literature, pseudoreplication can therefore be condoned in experimental studies. Cottenie & De Meester (2003) attempt to take a middle view between Oksanen and Hurlbert. They regard inferential statistics for one replicate large scale experiments as 'gentleman's behaviour towards the reader'. The results of statistical tests are considered as merely an extension of descriptive statistics.
Our own viewpoint is close to Hurlbert's. Pseudoreplication is simply a mistake, and it can and should be avoided wherever possible. The error lies in treating a statistical test as a test of a treatment effect when there is no independent replication of that treatment. However, we see nothing wrong with using inferential statistics on randomly-located samples to demonstrate that two specific areas are different in some respect.