



Case-control designs

Sampling methodology

    The source population has to be (at least partially) defined by specified criteria so that appropriate controls can be selected. Such criteria will often be geographic, although they need not necessarily be so. Cases must be rigorously defined to avoid misspecification bias.

    Selection of cases is often straightforward since usually all cases from the source population over a defined period are included. If not all cases are selected, then a random sample of cases must be taken. If cases are only selected from hospitals, there may still be bias, since some people (for example those able to afford medical insurance) are more likely to be hospitalized than others for the same disease. It is preferable to use new (incident) cases because previously diagnosed cases represent long-term survivors.


    Selection of controls is probably the most important (and most difficult) part of a case-control study. In general, controls should be representative of those in the source population at risk of becoming a case. The way in which controls are obtained depends on how well defined the source population is.

    1. Population controls
      If the source population is fully defined (in other words all units in the population can be listed), a random sample of controls can be obtained from that population. The study can then be properly described as population-based. This improves the validity of conclusions, and decreases the likelihood of selection bias. Usually a listing is only available in special circumstances, such as a case-control study nested in a cohort study. More often various pseudo-random methods are used to obtain the sample including systematic and haphazard sampling and (for human studies) random digit telephone dialling - all of which are subject to bias. If matching is carried out (see below), controls only need to be representative (that is randomly sampled) within strata - for example within each age group. If a density case-control design is being used, then the selection probability for each control should be proportional to the individual's person-time at risk - a technique known as density sampling. A major problem with population controls is participation bias. It is not unusual to get a participation rate of less than 50% - and non-participation may be correlated with socioeconomic status and/or education level.
    2. Neighbourhood controls
      Where the source population is not fully defined, controls can be obtained from the vicinity of the place of residence of the case. They are commonly used as matched controls. Selection bias should be avoided by using an element of random or systematic sampling in the final selection - for example visiting neighbours along a street in a set order, or at predetermined distances and angles from the case's home. Ecologists looking at the reasons for choice of particular nest sites (or prey kill sites) often use neighbourhood controls selected in some predetermined manner from the vicinity of the chosen nest site. In human studies there is still a problem with low participation rates using neighbourhood controls.

    3. Hospital controls
      These are patients suffering from other diseases selected at random in the same or neighbouring hospitals. They are easier to obtain than population controls and often more cooperative - leading to much higher participation rates. Hospital controls may either be matched to cases by some characteristic or selected at random from the hospital 'population'. There are several possible sources of bias in using hospital controls.

      1. The risk factor for the disease being studied may also be a risk factor for one of the diseases which controls are suffering from. This will result in underestimation of the importance of the risk factor. For example, smoking is a risk factor for a wide range of diseases, so case-control studies using hospital controls are inappropriate for identifying smoking as a risk factor.
      2. Patients suffering from two diseases at the same time may be more likely to be hospitalized than patients suffering from either one. This leads to a spurious association between the two and is known as Berkson's bias.
    4. Friend, associate or relative controls
      Here cases are asked to name friends, associates or relatives who could act as matched controls in the study. There is clearly a risk of selection bias here - for example, there is evidence that cases tend to identify friends who are better educated. On the other hand the level of participation bias is much reduced.
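
    The density sampling mentioned under population controls can be sketched as a weighted draw, in which each individual's chance of selection is proportional to their person-time at risk. This is only a minimal illustration: the function name, the individual IDs and the person-year values are all hypothetical, and sampling with replacement is an assumption of the sketch rather than something stated above.

```python
import random


def density_sample_controls(person_time, n_controls, seed=None):
    """Pick control IDs with probability proportional to person-time at risk.

    person_time: dict mapping a (hypothetical) individual ID to the
    person-time that individual contributed while at risk.
    Sampling is with replacement, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    ids = list(person_time)
    weights = [person_time[i] for i in ids]
    # random.choices draws k items with probability proportional to weights.
    return rng.choices(ids, weights=weights, k=n_controls)


# Hypothetical person-years at risk for five individuals.
person_years = {"A": 10.0, "B": 5.0, "C": 2.5, "D": 0.0, "E": 7.5}
controls = density_sample_controls(person_years, n_controls=4, seed=1)
print(controls)
```

    Individual "D" contributed no time at risk and so can never be drawn, whereas "A" is twice as likely to be drawn as "B".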


    We then have to consider the optimal number of controls. Probably the commonest approach is to have just one control per case. This is optimal if one has a sufficient number of cases. But if there are very few cases available, then it is better to increase the number of controls, up to a maximum of about 4 controls per case, beyond which there is little further gain in statistical power.
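
    One way to see why about 4 controls per case is usually taken as the ceiling is a commonly quoted approximation, brought in here as an assumption and not taken from the text above: with m controls per case, the statistical efficiency relative to an unlimited number of controls is roughly m/(m+1) when the odds ratio is close to 1. The function name below is hypothetical.

```python
def relative_efficiency(m):
    """Approximate efficiency of m controls per case relative to
    unlimited controls, using the m / (m + 1) rule of thumb
    (an assumption of this sketch, valid near an odds ratio of 1)."""
    return m / (m + 1)


for m in range(1, 7):
    print(f"{m} control(s) per case: {relative_efficiency(m):.2f}")
```

    Under this approximation, going from 1 to 4 controls per case raises efficiency from 0.50 to 0.80, whereas a fifth control adds only about 0.03.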


    There are big advantages to using two different types of control groups, even though this increases the work involved, and hence the cost of the study. If both control groups give the same sort of answers, then the credibility of the results is strengthened. Some authors have argued against the use of two control groups on the basis that one does not know which result to ignore if one gets disparate results. This argument seems to be rather along the lines of 'ignorance is bliss'. It is far better to know if there is a possible problem of selection bias in one of the groups so it can be further investigated.

    Once cases and controls have been selected, then their exposure to the suspected risk factor(s) must be assessed. This is done from past records or by questionnaire. If at all possible, the same method should be used for cases and controls.


Analytical methods

  • Unmatched (and frequency matched) studies

    Since we are taking separate samples of cases and controls, we cannot estimate prevalences and hence cannot directly estimate the risk ratio. However, we can estimate the odds ratio as explained in Unit 1. Exactly what this odds ratio approximates to depends on which particular variant of the design is being used. For a cumulative case-control design, the odds ratio will only approximate to the risk ratio if the condition is rare (low incidence). Otherwise it will over-estimate the risk ratio (that is, it will lie further from 1). For a density case-control design, the odds ratio will estimate the rate ratio (the ratio of incidence rates) irrespective of whether the condition is rare or not.
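
    The rare-disease condition for the cumulative design can be checked with a little arithmetic. The sketch below compares the risk ratio and the odds ratio for two hypothetical pairs of risks, both with a true risk ratio of 2. The risks and function names are illustrative assumptions only.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Ratio of the risk in the exposed to the risk in the unexposed."""
    return risk_exposed / risk_unexposed


def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Odds ratio implied by the same two risks."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical rare condition: risks of 2% and 1%.
rare_rr = risk_ratio(0.02, 0.01)
rare_or = odds_ratio_from_risks(0.02, 0.01)
print(f"rare:   RR = {rare_rr:.2f}, OR = {rare_or:.2f}")      # RR = 2.00, OR = 2.02

# Hypothetical common condition: risks of 40% and 20%.
common_rr = risk_ratio(0.40, 0.20)
common_or = odds_ratio_from_risks(0.40, 0.20)
print(f"common: RR = {common_rr:.2f}, OR = {common_or:.2f}")  # RR = 2.00, OR = 2.67
```

    With a rare condition the two measures nearly coincide, but with a common condition the odds ratio clearly over-estimates the risk ratio.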

    For an unmatched case-control study, continuous explanatory variables can be compared using the parametric two-sample t-test or the non-parametric Wilcoxon-Mann-Whitney test. The significance of an association between a risk factor and case status can be tested using Pearson's chi-square test or Fisher's exact test (although the latter is not recommended), or by attaching a confidence interval to the odds ratio. The use of Mantel-Haenszel methods to deal with multiple 2×2 tables, and modelling approaches for case-control designs using logistic regression, are covered elsewhere. When cases and controls are frequency matched, Szklo & Nieto (2004) suggest that the most efficient strategy is to use ordinary logistic regression and include the matching variables in the model.
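
    As a rough sketch of attaching a confidence interval to the odds ratio in an unmatched study, the code below uses the log-based (Woolf) normal approximation. The choice of that particular interval, the function name and the counts are all assumptions made for illustration. The method requires every cell of the 2×2 table to be non-zero.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Odds ratio with an approximate 95% CI (log / Woolf method).

    All four counts must be greater than zero.
    """
    a, b = exposed_cases, exposed_controls
    c, d = unexposed_cases, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 40 of 100 cases and 20 of 100 controls were exposed.
or_hat, lower, upper = odds_ratio_ci(40, 20, 60, 80)
print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

    If the interval excludes 1, as it does for these made-up counts, the association would be judged significant at the 5% level.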


  • Individually matched studies

    Matching must be taken account of at the analysis stage since cases and controls are no longer being sampled independently. Consequently the data are arranged in a contingency table as the number of study pairs:

                                  Control exposed to risk factor
                                  Yes         No
    Case exposed         Yes      c1          d1
    to risk factor       No       d2          c2

    • c1 and c2 are concordant pairs of case and control (with the same exposure, either positive or negative)
    • d1 and d2 are discordant pairs of case and control (with different exposure)

    The odds ratio is then given by the number of discordant pairs where the case is exposed and the control is not exposed (d1), divided by the number of discordant pairs where the case is not exposed and the control is exposed (d2).

    Algebraically speaking -

    Odds ratio (ω)   =    d1/d2
    • d1 and d2 are defined as above.


    For a matched case-control study with 1:1 matching, continuous explanatory variables can be compared using a parametric paired t-test or the non-parametric Wilcoxon matched-pairs signed-ranks test. The significance of the association between a (categorical) risk factor and case status can be tested using McNemar's test, or by attaching a confidence interval to the odds ratio. Analysis for more than one control matched to each case can also be done using Mantel-Haenszel methods. Modelling for individually matched case-control designs should be done with conditional logistic regression.
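
    For 1:1 matching, the matched odds ratio d1/d2 and McNemar's test depend only on the discordant pairs, so both can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: McNemar's statistic is computed without a continuity correction, the confidence interval uses a normal approximation on the log scale, and the function name and pair counts are hypothetical. Both d1 and d2 must be greater than zero.

```python
import math


def matched_pair_analysis(d1, d2, z=1.96):
    """Matched odds ratio, approximate 95% CI and McNemar's test.

    d1: pairs with case exposed, control unexposed.
    d2: pairs with case unexposed, control exposed.
    """
    odds_ratio = d1 / d2
    # Normal approximation on the log scale (an assumption of this sketch).
    se_log = math.sqrt(1 / d1 + 1 / d2)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    # McNemar's chi-square statistic (1 df), no continuity correction.
    chi2 = (d1 - d2) ** 2 / (d1 + d2)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, (lower, upper), chi2, p_value


# Hypothetical discordant pair counts.
or_hat, ci, chi2, p = matched_pair_analysis(d1=30, d2=10)
print(f"OR = {or_hat:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"McNemar chi-square = {chi2:.2f}, P = {p:.4f}")
```

    The concordant pairs (c1 and c2) do not enter either calculation, which is why they are not passed to the function.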