"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Case-control designs



    The case-control design is an observational design in which study groups are defined by the response variable rather than by the explanatory variable. The response variable is usually binary - that is, an individual either has a particular condition (a case) or does not have that condition (a control). Having defined the two groups, the subsequent direction of the study is backwards in time: past exposure to possible risk factors among cases is compared with that experienced by controls. The aim is to find out which risk factors were most closely associated with an individual becoming a case.

    In human or veterinary epidemiology the condition of interest is usually a disease. Information on risk factors is obtained by examining past records or (in the case of human studies) by interviewing each individual or their relatives. In ecology the design has been used to assess why particular locations are used as nest sites or prey kill sites. Hence the condition of interest is whether the site has been utilized or not.

    Assessment of a relationship between the response and explanatory variable(s) in a case-control design is always done by calculating odds ratios. The odds ratio can sometimes be interpreted as a risk ratio or a rate ratio, depending on whether cases are incident or prevalent, the type of source population (fixed or dynamic), the sampling strategy, and the underlying assumptions. If those assumptions are not met, it can only be interpreted as an odds ratio.
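
    As a minimal sketch of the calculation described above, the Python function below computes an odds ratio from the four cell counts of a 2x2 table of exposure by case/control status. The function name, the example counts, and the approximate 95% confidence interval (Woolf's logit method) are illustrative assumptions and do not come from the text above.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Return (odds ratio, lower, upper) with an approximate 95% interval.

    The interval uses Woolf's logit method, which needs all four
    cell counts to be positive.
    """
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    # Odds of exposure among cases divided by odds of exposure among controls.
    estimate = (a / b) / (c / d)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


# Hypothetical study: 100 cases (40 exposed) and 100 controls (20 exposed).
estimate, lower, upper = odds_ratio(40, 60, 20, 80)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

    Whether this figure can also be read as a risk ratio or rate ratio depends on the design criteria, not on the arithmetic.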


Criteria to define type of case-control designs

  1. Cases incident or prevalent

    If cases are prevalent (in other words pre-existing) rather than incident (newly arising), the odds ratio estimated is the prevalence odds ratio. Under certain conditions this may estimate the incidence rate ratio (if the duration of disease does not depend on exposure status) or the prevalence ratio (if the disease is rare). However, it is always preferable to use incident (newly arising) cases in a case-control study if at all possible.


  2. Fixed cohort or dynamic population

    Cases and controls can be selected from fixed cohorts (= closed population), for example a birth cohort born in 1 calendar year, or from a dynamic population (= open population) affected by births and deaths, immigration, and emigration. If rates remain reasonably constant for a period of time then a dynamic population can be considered 'stable' - in other words the composition of the population (including the exposure distribution) will not change markedly over time. A fixed population is by definition not stable (it will for example keep getting older). Dynamic populations may be stable over short time periods. The term nested case-control study is commonly used when a case-control study is carried out within a cohort study.


  3. Control selection
      1. Exclusive case-control design (or cumulative case-control design)
        This is the 'classical' retrospective model. Controls are selected from survivors at the end of the period of interest. For a fixed cohort the odds ratio (introduced in Unit 1) calculated using this type of case-control design only reflects the risk ratio (also introduced in Unit 1) if the probability of becoming a case is low - the so-called 'rare disease assumption'. For a dynamic population it only reflects the rate ratio if the population is stable. If respective assumptions are not met, then the odds ratio can only be interpreted as an odds ratio.

      2. Case cohort design (or case base design)
        A random sample of controls is selected from the entire base population at the start of the period of interest. This sample will of course include individuals who subsequently become cases and the analysis must handle the potential duplication of cases as controls. For a fixed cohort the odds ratio will estimate the risk ratio providing that censoring (withdrawals/losses) from the population is unrelated to exposure to the risk factor. For a dynamic population the odds ratio only reflects the rate ratio if the population is stable.

      3. Density case-control design (or synthetic case-control design)
        Here controls are selected throughout the period of interest. For a fixed cohort the odds ratio will estimate the rate ratio providing controls are matched with cases on time and the analysis takes this into account. For a dynamic population the odds ratio will estimate the rate ratio providing controls are matched with cases on time or, if they are not, providing the population is stable.
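
      The three control-sampling schemes above can be compared numerically. The sketch below works with expected counts in a fixed cohort (there is no random sampling in it). It assumes constant incidence rates within each exposure group, no censoring, and, for the density design, controls drawn in proportion to person-time at risk. The function name, parameter names, and rates are all hypothetical.

```python
import math


def expected_odds_ratios(rate_exposed, rate_unexposed, follow_up=1.0,
                         n_exposed=1000.0, n_unexposed=1000.0):
    """Expected odds ratio under each control-sampling scheme (fixed cohort)."""
    if rate_exposed <= 0 or rate_unexposed <= 0 or follow_up <= 0:
        raise ValueError("rates and follow-up time must be positive")
    # Cumulative risk over the follow-up period, assuming a constant rate.
    risk1 = 1 - math.exp(-rate_exposed * follow_up)
    risk0 = 1 - math.exp(-rate_unexposed * follow_up)
    cases1 = n_exposed * risk1
    cases0 = n_unexposed * risk0
    case_exposure_odds = cases1 / cases0

    # Exposure odds in the group that controls are drawn from:
    # survivors at the end (exclusive / cumulative design),
    survivors_odds = (n_exposed - cases1) / (n_unexposed - cases0)
    # the whole base population at the start (case cohort design),
    base_odds = n_exposed / n_unexposed
    # person-time at risk during follow-up (density design).
    person_time1 = n_exposed * risk1 / rate_exposed
    person_time0 = n_unexposed * risk0 / rate_unexposed
    person_time_odds = person_time1 / person_time0

    return {
        "risk_ratio": risk1 / risk0,
        "rate_ratio": rate_exposed / rate_unexposed,
        "or_cumulative": case_exposure_odds / survivors_odds,
        "or_case_cohort": case_exposure_odds / base_odds,
        "or_density": case_exposure_odds / person_time_odds,
    }


for label, rates in {"rare": (0.002, 0.001), "common": (1.0, 0.5)}.items():
    result = expected_odds_ratios(*rates)
    print(label, {name: round(value, 3) for name, value in result.items()})
```

      With the rare-condition rates the cumulative odds ratio is very close to the risk ratio. With the common-condition rates it overstates the risk ratio, while the case cohort and density odds ratios still match the risk ratio and rate ratio respectively.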



      Instead of selecting controls at random, controls are often selected in such a way that they share certain (potentially confounding) risk factors - such as age - with the case. Alternatively they may be matched on time as noted above. The main reason for matching in a case-control study is to improve study efficiency - in other words to obtain the same degree of precision of estimates for a smaller sample size.

      Matching can be done either on a group basis or on an individual basis:

      1. Group or frequency matching
        Controls are selected so that the overall make-up of the control group is similar to that of the cases. If for example 55% of the cases are male, and 45% are female, then controls will be selected so that the sexes are in the same relative numbers. Within that constraint, however, selection would still be random.
      2. Individual matching
        One or several controls are matched to individual cases. This is often done in a non-random manner, especially where the 'nearest neighbour' control is selected. Selection of a non-random control can be accepted if there really is no subjective choice involved (in other words no risk of selection bias), but usually a random element is essential. For example, in a case-control study on sleeping sickness, the control village was the nearest village in a randomly chosen direction from the village where the case lived.
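
      Group (frequency) matching as described above can be sketched as stratified random sampling: count the cases in each stratum of the matching variable, then draw that many controls at random from the candidates in the same stratum. The function name, the `key` and `ratio` parameters, and the example records are hypothetical.

```python
import random
from collections import Counter


def frequency_match(cases, candidates, key, ratio=1, seed=None):
    """Draw controls at random so each stratum has `ratio` controls per case."""
    rng = random.Random(seed)
    cases_per_stratum = Counter(key(case) for case in cases)
    controls = []
    for stratum, n_cases in cases_per_stratum.items():
        pool = [person for person in candidates if key(person) == stratum]
        needed = n_cases * ratio
        if len(pool) < needed:
            raise ValueError(f"not enough candidate controls in stratum {stratum!r}")
        # Within the stratum, selection is still random (without replacement).
        controls.extend(rng.sample(pool, needed))
    return controls


# Hypothetical data: 55 male and 45 female cases, 1000 candidate controls.
cases = [{"sex": "M"}] * 55 + [{"sex": "F"}] * 45
candidates = ([{"id": i, "sex": "M"} for i in range(500)]
              + [{"id": i, "sex": "F"} for i in range(500, 1000)])
controls = frequency_match(cases, candidates, key=lambda person: person["sex"], seed=1)
print(Counter(control["sex"] for control in controls))
```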

      Although matching will tend to improve study efficiency, it is important to note that, unlike with a cohort design, confounding is not automatically controlled by matching in a case-control design. One needs a statistical analysis that properly accounts for the matching to obtain a valid estimate of effect.
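
      The text does not name a particular method of analysis. For 1:1 individually matched pairs with a binary exposure, one standard choice is the conditional (matched-pair) odds ratio, which uses only the discordant pairs. The sketch below compares it with the crude odds ratio obtained by ignoring the matching. The function names and the pair counts are hypothetical.

```python
def matched_pair_odds_ratio(pairs):
    """Conditional odds ratio from (case_exposed, control_exposed) pairs."""
    pairs = list(pairs)
    # Only discordant pairs carry information about the exposure effect.
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no pairs with exposed control and unexposed case")
    return case_only / control_only


def crude_odds_ratio(pairs):
    """Odds ratio from the same data with the pairing ignored."""
    pairs = list(pairs)
    exposed_cases = sum(1 for case, _ in pairs if case)
    exposed_controls = sum(1 for _, control in pairs if control)
    unexposed_cases = len(pairs) - exposed_cases
    unexposed_controls = len(pairs) - exposed_controls
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical study of 100 matched pairs.
pairs = ([(True, True)] * 10      # both exposed
         + [(True, False)] * 30   # only the case exposed
         + [(False, True)] * 10   # only the control exposed
         + [(False, False)] * 50) # neither exposed
print("matched:", matched_pair_odds_ratio(pairs))
print("crude:  ", round(crude_odds_ratio(pairs), 2))
```

      In this made-up example the crude estimate is closer to 1 than the matched estimate, which illustrates why the analysis should reflect the way the controls were chosen.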

      There are also important disadvantages to matching in a case-control design. Firstly, you can no longer use the study to assess the effect of the matching variables on the outcome. Secondly, it may be difficult or impossible to find an exact match to a case, which adds to the cost of the study. A further disadvantage is that overmatching may occur, which makes the analysis highly inefficient and so counteracts any potential gain in efficiency. Overmatching occurs when (1) the matching factor is not a true confounder, but is instead the result of exposure to the risk factor or (2) the matching factor is highly correlated with other matching variables. Lastly, if cases and controls are matched, the effects are irreversible - there is nothing you can do at the analysis stage if you conclude that matching was not a good idea!



Pros and cons of case-control designs


  • They are well suited to studying rare conditions. In such a situation a random sample for a cross-sectional study would have to be very large to produce sufficient positives. This applies equally to rare diseases (where only one in ten thousand may be affected) or nest sites for a bird (where only a tiny proportion of suitable sites are occupied).
  • They have the advantage over cohort studies that results can be obtained at one point in time, avoiding the time and expense of following individuals over a long period. Existing records on explanatory variables can sometimes be utilized.
  • Many explanatory variables (risk factors) can be examined together. Because the numbers of cases and controls are usually balanced (either 1:1 or 1:n), interaction and confounding can be more readily evaluated.
  • The case-control study is an efficient design so sample size can generally be smaller than for other observational designs.
  • This design is in common use in medical research and, when the response variable is binary, can be analysed using standard 'logistic regression' methods.


  • In the classical case-control design we can never be certain that exposure to the explanatory variable preceded the individual getting the condition. The information we have on the explanatory variable is dependent on recall or past records - both can be unreliable (high level of measurement error) and difficult to validate. This is much less of a problem for density case-control designs.
  • There is a high risk of selection bias. Some of this bias can be minimized by using randomly selected population controls. But it is much more difficult to avoid incidence-prevalence bias. This arises because cases reflect survival as well as acquisition. So in the epidemiological situation, cases that die before making it to hospital will not be included in the sample! Similarly bird nest sites that get destroyed will not be included in the sample.
  • The retrospective nature of the design results in a high risk of recall bias. This is especially a problem where cases have a better recall of explanatory factors than controls. It is not, however, a problem if exposure information is taken from records. In ecological case-control studies, current conditions are often used as a proxy for past conditions - which may well be very unreliable.
  • Misclassification bias can also occur if controls are wrongly assumed to be free of the condition. In veterinary studies many disease cases go unreported - lack of reported cases may not indicate freedom from disease. Similarly in ecological studies, it is essential to check that the randomly chosen 'control' locations are indeed unoccupied!
  • In the classical case-control design, only one response variable can be studied at one time. This is because you are defining your groups with reference to the response variable rather than the explanatory variable(s). It is possible, however, to conduct a set of case-control studies nested within the same population using several outcomes (say five different diseases) but the same control group.
  • If controls are selected from survivors at the end of the period of interest (the classical design), the odds ratio will only approximate to the risk ratio if the condition is rare (the 'rare disease assumption').
