Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



  1. Sampling methodology

    The target population should first be clearly defined, whether by location, species, age, breed, gender and so on. The response variable can be a binary variable (such as infected/uninfected) or a measurement variable (such as species richness). In some analytical surveys, there will be a clear hypothesis about which explanatory variable(s) are responsible for variation in the response variable. In others it is more of a 'fishing expedition' to assess which of a range of explanatory variables affect the response variable. Remember there will also be a number of potentially confounding variables which may affect the outcome. In an analytical survey the only way to adjust for confounding variables is at the analysis stage - but this can only be done if you have measured them in the first place! In many cases the value that the response variable takes may be related to the value of the explanatory variable at some time in the past - for example, past exposure to harmful materials. In medical studies one can ask participants about conditions in the past - but remember that recall bias can be a severe problem. Every effort should be made to minimize measurement error.

    In medical studies the sampling unit may be the individual (the classic cross-sectional study) or it may be a group of individuals such as schools, districts or countries. In veterinary studies, the sampling unit may be the individual animal, but is more commonly the farm or herd. In ecological studies the sampling unit is commonly the plot, or a specific habitat such as a woodland or nature reserve. You must then define your sampling frame for the primary sampling unit - whether individuals, farms or nature reserves. You can then either take a simple random sample, or more commonly use some form of multistage sampling.
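    As an illustration of the multistage option, here is a minimal sketch of a two-stage sample: primary sampling units are drawn at random from the sampling frame, then members are drawn at random within each selected unit. The function name `two_stage_sample`, the farm/animal frame and the sample sizes are all hypothetical, chosen only for the example.

```python
import random

def two_stage_sample(frame, n_primary, n_secondary, seed=None):
    """Two-stage sample: random primary units, then random members within each.

    frame: dict mapping each primary unit (e.g. a farm) to a list of its members.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of primary sampling units, without replacement.
    primary = rng.sample(sorted(frame), n_primary)
    # Stage 2: simple random sample of members within each selected unit
    # (the whole unit is taken if it has fewer than n_secondary members).
    return {
        unit: rng.sample(frame[unit], min(n_secondary, len(frame[unit])))
        for unit in primary
    }

# Hypothetical sampling frame: 6 farms with herd sizes 5, 8, 11, ...
frame = {
    f"farm_{i}": [f"farm_{i}_animal_{j}" for j in range(5 + 3 * i)]
    for i in range(6)
}
sample = two_stage_sample(frame, n_primary=3, n_secondary=4, seed=1)
```

    Note that taking a fixed number of members from units of unequal size gives individuals unequal selection probabilities, so the analysis would normally need to weight for this.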


  2. Analysis of cross-sectional data

    We have to assume that exposure precedes the response if causality is to be inferred. Since this assumption usually cannot be made, cross-sectional studies should be considered to be exploratory and to provide only initial indications of possible causative factors which need to be followed up with other studies.

    • Binary response variable

      Pearson's chi square test can be used to assess the significance of the association between a binary response variable and each categorical or ordinal explanatory variable. One should also calculate a (prevalence) risk ratio with its confidence interval as a measure of the effect size. After checking each variable separately (the univariate analysis), one should look at how all important variables act together on the response variable by carrying out a multivariate analysis. The most commonly used method of multivariate analysis is logistic regression. This produces an estimate of the effect size adjusted for the action of confounding variables. However, it has the (possible) drawback that the effect size is expressed as an odds ratio for each variable, not a risk ratio.
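      For a single binary explanatory variable, the univariate step can be sketched as follows. This is a minimal illustration using only the standard library, assuming a 2x2 table with no continuity correction and a Wald interval on the log scale for the ratio; the function names and the counts are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square (no continuity correction) for a 2x2 table.

    Layout:            diseased   not diseased
        exposed           a            b
        unexposed         c            d
    Returns (chi2, p_value) on 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, the upper tail of the chi-square distribution is erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def prevalence_ratio(a, b, c, d, z=1.96):
    """Prevalence risk ratio with an approximate 95% Wald interval (log scale)."""
    pr = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(pr) - z * se_log)
    upper = math.exp(math.log(pr) + z * se_log)
    return pr, lower, upper

# Hypothetical counts: 40 of 100 exposed and 20 of 100 unexposed are infected.
chi2, p = chi_square_2x2(40, 60, 20, 80)
pr, lo, hi = prevalence_ratio(40, 60, 20, 80)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, PR = {pr:.2f} ({lo:.2f} to {hi:.2f})")
```

      Both calculations assume a simple random sample of individuals.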

      Many authorities prefer the prevalence risk ratio as the effect measure since the absolute value remains constant irrespective of the prevalence. The problem with the odds ratio is that it only provides a good approximation to the risk ratio if the prevalence is low in both the exposed and unexposed groups. As you can see from the figure below, for high prevalence and/or high risk ratios the odds ratio may greatly overestimate the risk ratio. Hence if the odds ratio is 3.0, you cannot state that the factor in question increases the risk of disease three times.

      {Fig. 1}
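      The relationship between the two measures can also be checked numerically. The sketch below computes the odds ratio implied by a given risk ratio and a given prevalence in the unexposed group; the function name and the chosen prevalences are illustrative assumptions.

```python
def odds_ratio_from_risk_ratio(risk_ratio, p_unexposed):
    """Odds ratio implied by a risk ratio and the prevalence in the unexposed."""
    p_exposed = risk_ratio * p_unexposed
    if not 0 < p_unexposed < 1 or not 0 < p_exposed < 1:
        raise ValueError("prevalences must lie strictly between 0 and 1")
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_unexposed = p_unexposed / (1 - p_unexposed)
    return odds_exposed / odds_unexposed

# Risk ratio fixed at 2.0, prevalence in the unexposed rising from 1% to 30%.
for p0 in (0.01, 0.10, 0.30):
    print(p0, round(odds_ratio_from_risk_ratio(2.0, p0), 2))
```

      With a risk ratio of 2.0, the odds ratio is about 2.02 when 1% of the unexposed are affected, 2.25 at 10%, and 3.5 at 30%.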

      It has been argued (Pearce (2004)) that, subject to certain assumptions, the odds ratio from a cross-sectional study provides the best estimate of the incidence rate ratio - which may indeed be a better measure of what we are interested in rather than the risk ratio. Those assumptions are that we have a steady state population and that the disease is rare. The jury is still out on which approach will eventually become standard - the important thing is to specify exactly what measure you are using, and to interpret it correctly.

      Always specify exactly what effect measure has been used. If you are interested in the absolute value of the ratio, remember the odds ratio overestimates the risk ratio to a greater or lesser extent. However, it may provide the best estimate of the incidence rate ratio.

      There are other options for analyzing cross-sectional survey data, although opinions differ on what is best. One option is to still use logistic regression (generalized linear model with a binomial distribution and a logit link), but use it to estimate prevalences. Taking the ratios of these gives the prevalence risk ratio. Unfortunately it is difficult (though not impossible) to attach a confidence interval to this ratio. Another option is to use Cox's proportional hazards regression, which involves the (questionable) assumption of Poisson rather than binomial variability. It does however give risk ratios rather than odds ratios. A third approach is to use a generalized linear model with a binomial distribution and a log link - this model has the right distributional assumptions but may require constrained estimation to avoid prevalence estimates that are greater than 1. We look at these modelling approaches in Unit 14.
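      The difference between the logit-link and log-link binomial models can be written out for a single explanatory variable. The symbols here are assumptions made for the illustration: p is the prevalence, x the explanatory variable, and the betas the model coefficients.

```latex
\begin{align*}
\text{logit link:}\quad & \log\frac{p}{1-p} = \beta_0 + \beta_1 x
  & \Rightarrow\quad e^{\beta_1} &= \text{odds ratio} \\
\text{log link:}\quad & \log p = \beta_0 + \beta_1 x
  & \Rightarrow\quad e^{\beta_1} &= \text{prevalence risk ratio}
\end{align*}
```

      Under the log link, p = exp(β0 + β1 x) stays at or below 1 only if β0 + β1 x ≤ 0 for every observed x, which is why constrained estimation may be needed.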

      Simple methods of analysis assume that a simple random sample has been taken. If cluster sampling has been used, the result of Pearson's chi square test must be adjusted appropriately. The presence of clusters must also be accounted for in the generalized linear models, usually by inclusion of a random effects term.
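      One simple approximate adjustment, shown here only as a sketch, is to divide the chi square statistic by a design effect. The version below assumes clusters of roughly equal size and uses Kish's approximation, deff = 1 + (m - 1) x ICC, where m is the mean cluster size and ICC the intracluster correlation. The function names and all the numbers are hypothetical, and this is not necessarily the adjustment intended above.

```python
import math

def design_effect(mean_cluster_size, icc):
    """Kish's approximate design effect for roughly equal-sized clusters."""
    return 1 + (mean_cluster_size - 1) * icc

def adjusted_chi_square(chi2, deff):
    """Deflate a 1-df chi-square statistic by the design effect.

    Returns (adjusted chi2, p_value); the p-value uses the 1-df identity
    P(X > x) = erfc(sqrt(x / 2)).
    """
    chi2_adj = chi2 / deff
    return chi2_adj, math.erfc(math.sqrt(chi2_adj / 2))

# Hypothetical survey: about 20 animals per herd, intracluster correlation 0.05,
# and an unadjusted chi-square of 9.52 from the naive analysis.
deff = design_effect(mean_cluster_size=20, icc=0.05)
chi2_adj, p_adj = adjusted_chi_square(9.52, deff)
print(f"deff = {deff:.2f}, adjusted chi2 = {chi2_adj:.2f}, p = {p_adj:.3f}")
```

      Even a modest intracluster correlation nearly halves the test statistic here, which is why ignoring clustering tends to overstate significance.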


      All the above comments refer to the situation when the response variable is a binary variable and the explanatory variables are nominal or ordinal. If the explanatory variable is a measurement variable, an alternative approach is to compare the mean level of the explanatory variable for the two levels of the response variable using either a two-sample t-test or the Wilcoxon-Mann-Whitney test. For multiple groups use analysis of variance. The effect size should be summarized as the difference between the means (or medians) with its confidence interval.
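      The effect size summary for two groups can be sketched as below. This is an approximation rather than a full t-test: it uses the unequal-variance standard error with a normal quantile, so it is only reasonable for large samples (a t quantile should replace it for small ones). The function name and the data are hypothetical, and the data are kept tiny purely for illustration.

```python
import math
from statistics import NormalDist, mean, variance

def mean_difference_ci(group_a, group_b, level=0.95):
    """Difference in means (a minus b) with a large-sample confidence interval."""
    diff = mean(group_a) - mean(group_b)
    # Standard error without assuming equal variances in the two groups.
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return diff, diff - z * se, diff + z * se

# Hypothetical measurements of an explanatory variable in the two response groups.
infected = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
uninfected = [10.4, 11.9, 12.3, 10.8, 11.5, 12.0]
diff, lower, upper = mean_difference_ci(infected, uninfected)
print(f"difference = {diff:.2f} ({lower:.2f} to {upper:.2f})")
```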


    • Measurement response variable

      When both response and explanatory variables are ordinal or measurement variables, the usual approach is to calculate either a (non-parametric) rank correlation coefficient or parametric Pearson's correlation coefficient. Alternatively, if there is a clear distinction between the dependent and independent variable(s), then linear regression, multiple regression or various generalized linear models can be used. Linear regression and correlation are briefly introduced in Unit 1, but modelling is examined repeatedly through this course. The best measure of effect size is given by the coefficient of determination (r²) which quantifies the percentage of the variance of the response variable that can be accounted for by a linear fit of the response variable on the explanatory variable.
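      The two correlation coefficients and the coefficient of determination can be computed as in the sketch below. It uses only the standard library and takes Spearman's coefficient as the rank correlation (Pearson's coefficient applied to ranks, with ties given their average rank). The function names and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical paired measurements.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
r_squared = r ** 2          # proportion of variance explained by a linear fit
rho = spearman_rho(x, y)
```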