"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




Cluster sampling and goodness of fit tests

Categorical data are often obtained by some form of cluster sampling rather than simple random sampling. In health surveys, the clusters may be city blocks or schools. In ecological research, a cluster may be a location, a plot or a tree. In each case, a number of sampling units are taken from each cluster. Units within a cluster are often more similar to each other than units sampled at random. If so, there is a positive intracluster correlation. One result is that the variance among clusters of m units each will be larger than the variance among groups of m randomly selected units.

In tests for goodness of fit and for independence, both Pearson's X² and G are seriously inflated by such cluster effects. There are a number of techniques for dealing with this problem. One of the simplest is due to Rao & Scott (1981, 1992). The data are transformed by dividing by the variance inflation due to clustering, also known as the design effect (D). If certain assumptions are made, one can instead simply divide the chi-square (or G) statistic by D.
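As a minimal sketch of the simpler adjustment described above, the Python below computes Pearson's X² for a goodness-of-fit table and divides it by a design effect. The function names, the counts, the expected proportions and the value D = 2.0 are all made up for illustration. They do not come from Rao & Scott or from any real survey, and the division is only appropriate under the assumptions mentioned above.

```python
def pearson_chi_square(observed, expected_props):
    """Pearson's X² for observed counts against expected proportions."""
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected_props must have the same length")
    n = sum(observed)
    total = 0.0
    for obs, prop in zip(observed, expected_props):
        expected = n * prop  # expected count in this category
        total += (obs - expected) ** 2 / expected
    return total


def adjust_for_clustering(statistic, design_effect):
    """Divide a chi-square (or G) statistic by the design effect D."""
    if design_effect <= 0:
        raise ValueError("design_effect must be positive")
    return statistic / design_effect


# Hypothetical data: 100 units in three categories, expected ratio 1:2:1.
observed = [30, 50, 20]
expected_props = [0.25, 0.5, 0.25]
assumed_d = 2.0  # illustrative design effect, not estimated from any data

x2 = pearson_chi_square(observed, expected_props)
x2_adjusted = adjust_for_clustering(x2, assumed_d)
print(f"unadjusted X2 = {x2:.3f}, adjusted X2 = {x2_adjusted:.3f}")
```

In this sketch the adjusted statistic would then be referred to the chi-square distribution with the usual degrees of freedom (here 3 - 1 = 2), in place of the unadjusted value.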

The design effect can be defined in terms of the intracluster correlation coefficient (rI), a measure of the similarity among members of a group relative to the differences found among groups, and the number of units in a cluster (m).

Hence D = 1 + (m − 1) rI.

The larger the intracluster correlation and the larger the cluster size, the larger the design effect.
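The formula D = 1 + (m − 1) rI can be tabulated to show how the design effect grows with the cluster size and the intracluster correlation. This is a small sketch in Python. The function name and the values of m and rI are hypothetical, chosen only to show the trend.

```python
def design_effect(m, r_i):
    """Design effect D = 1 + (m - 1) * r_i for clusters of m units each."""
    if m < 1:
        raise ValueError("cluster size m must be at least 1")
    return 1 + (m - 1) * r_i


# Hypothetical cluster sizes and intracluster correlations.
for m in (2, 10, 50):
    for r_i in (0.0, 0.05, 0.2):
        print(f"m = {m:2d}, rI = {r_i:.2f}, D = {design_effect(m, r_i):.2f}")
```

When rI is zero the design effect is 1 whatever the cluster size, so dividing by D leaves the test statistic unchanged.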