Distinguishing between variables
In a study of relationships between variables, we can often (but not always) distinguish between two types of variables:
The response variable (also called the dependent variable) is the variable you are studying.
An explanatory variable (also called the independent variable) is any measured variable that may affect the level of the response variable. An explanatory variable is also commonly termed a factor in an experimental study, or a risk factor in an epidemiological study.
In many studies the distinction between response and explanatory variables is quite clear. Let's take as an example an epidemiological study of the disease cysticercosis in a rural population. The aim is to determine if there is a relationship between anyone having cysticercosis in a household, and the keeping of pigs in the compound. Here the explanatory variable (risk factor) is whether or not pigs are kept, and the response variable is infection status.
But sometimes this distinction cannot be made - for example, you might want to assess the relationship between eye colour and hair colour. It is hard to argue that eye colour affects hair colour, or vice versa, although the two may be associated in some way.
Relationships between nominal variables
The contingency table
Data on relationships between nominal variables are usually tabulated in the form of a contingency table. This is a table of frequencies classified according to two or more sets of values of nominal or ordinal variables. Such a table is also known as a cross tabulation.
The simplest cross tabulation is a 2 row × 2 column (usually shortened to a 2 × 2) contingency table, where each variable can only take one of two values - in other words they are binary variables. The data may have been recorded as a binary variable (for example, whether infection is present or absent). Or the data may have been recorded as a measurement variable, but collapsed to a binary variable (for example size measurements collapsed to 'large' or 'small'). The data are then commonly laid out as below:
| Explanatory variable ⇓ | Response + | Response − | Totals |
|---|---|---|---|
| + | a | b | a + b |
| − | c | d | c + d |
| Totals | a + c | b + d | a + b + c + d = n |
- a,b,c and d are the number of individuals in each cell,
- n is the total number of individuals.
Let's return to our example of an epidemiological study of cysticercosis in a rural population. Let us assume that households are listed, and a random sample of households is taken. The household is the sample unit - not the individual people - so we are only recording whether or not anyone in the household is infected. The frequencies of households, in each of the four mutually exclusive categories, are shown in the table.
| Risk factor | Infected | Not infected | Totals |
|---|---|---|---|
| With pigs | 173 (a) | 530 (b) | 703 |
| Without pigs | 72 (c) | 328 (d) | 400 |
| Totals | 245 | 858 | 1103 |
In a contingency table there is unfortunately no agreed convention on whether to have the explanatory variable as rows or columns. Epidemiologists and experimental biologists usually put the explanatory variable in the rows and the response variable in the columns. We will do our contingency tables this way. Social scientists usually do the reverse - which, when expressed graphically, would correspond to plotting the explanatory variable on the x-axis.
In addition there is no agreed convention on whether to give the levels of each variable as + followed by - (as we have done) or as - followed by +. For an explanatory variable such as age, the ordering is entirely arbitrary. But it obviously dictates the value of any ratios you may calculate. So, before calculating any of the ratios detailed below, check which way round the variables have been listed, and which level is the 'reference level' of the explanatory variable (in our examples shown in deep yellow).
Contingency tables are not limited to 2 × 2 tables. A common type is the r × 2 table - where there are multiple rows for the categories of the explanatory variable, but the response variable is still binary. For such tables the explanatory variable is often measured on the ordinal scale - for example, age categories, or income categories. The ratios we detail below are equally applicable for r × 2 tables. You can also have r × c tables, where there are multiple rows and columns. These cannot be summarized with the ratios below.
Ratios for summarizing relationships
Epidemiologists often need to summarize relationships between nominal variables, because both the response and explanatory variables they study are usually nominal, and often binary - for example whether an individual has a disease or not, and whether that person smokes or not. Hence we will use their terminology for the methods we examine - but remember that the same designs can be (and are being) used in other disciplines. The response variable is commonly either a measure of disease frequency, or a measure of mortality. The ratios used are either a risk ratio, an odds ratio, or a rate ratio. For some study designs, only one type of ratio is appropriate.
The risk ratio
The risk of disease is the number of cases of disease divided by the number of people at risk. In other words it is the proportion infected with the disease (but see below that there are two ways in which this can be estimated). The risk ratio (sometimes termed relative risk although this is also used in a less precise way) is the proportion infected (= risk) for those exposed to a risk factor divided by the proportion infected (= risk) for those not exposed to that risk factor.
If the value of the risk ratio is close to 1, it is unlikely that exposure to the risk factor is associated with infection with the disease. The further the value is from unity, the more likely it is that the exposure is related to infection with the disease.
There are two types of risk ratios, depending on how the proportion infected is obtained:
- A survey is carried out at a single point in time on a population. All individuals are either exposed, or not exposed, to the risk factor of interest. This is known as an analytical survey. The proportion infected (prevalence) for both the exposed and not-exposed groups is obtained from a random sample. The ratio of prevalences is called the prevalence risk ratio.
- Two defined groups of individuals are followed-up over a period of time. One group is exposed, the other not-exposed. This is known as a cohort study. The proportion of each group that becomes infected (the cumulative incidence) is determined. The ratio of the cumulative incidences is called the cumulative incidence risk ratio.
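Using the household frequencies from the cysticercosis table above, the prevalence risk ratio can be computed directly from the four cell counts. A minimal sketch in Python (the variable names follow the a, b, c, d notation of the contingency table):

```python
# Cell counts from the cysticercosis contingency table in the text:
# rows = risk factor (pigs / no pigs), columns = infected / not infected
a, b = 173, 530   # households with pigs: infected, not infected
c, d = 72, 328    # households without pigs: infected, not infected

risk_exposed = a / (a + b)        # proportion infected among households with pigs
risk_unexposed = c / (c + d)      # proportion infected among households without pigs
risk_ratio = risk_exposed / risk_unexposed

print(round(risk_exposed, 3))     # 0.246
print(round(risk_unexposed, 3))   # 0.18
print(round(risk_ratio, 3))       # 1.367
```

A ratio of about 1.37 suggests that infection is more frequent in households keeping pigs, although whether this deviation from unity is statistically significant still has to be assessed.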
Although the risk ratio is a very useful effect measure for a particular risk factor, it cannot indicate the overall importance of a risk factor for a particular condition. This is because it does not take account of the prevalence of the risk factor. For example, making love whilst driving might have a very high risk ratio for having a fatal accident - but since (hopefully) the prevalence of such behaviour whilst driving is quite low, one would not expect this to be an important risk factor for accidents. We therefore need a measure which combines the risk ratio with prevalence of the risk factor to give the proportion of cases that are attributable to a particular risk factor. This is known as the attributable risk proportion (or attributable risk, attributable proportion or aetiologic fraction). We give details on how to estimate the attributable risk proportion along with a worked example in the related topic on attributable risk proportion.
The odds ratio
Another way to summarize a relationship is to calculate an odds ratio. There are two ways to do this depending on the design of the study.
- Analytical survey
For an analytical survey one takes a random sample and then records the number of individuals with/without infection and the number of individuals exposed or not exposed to a particular risk factor. The odds of infection for each group (exposed or unexposed) is the number of individuals with the disease, divided by the number of people without the disease. The odds ratio is then the odds of infection for those exposed to a risk factor, divided by the odds of infection for those not exposed to that risk factor.
Strictly speaking, what we have calculated above is a prevalence odds ratio - because the frequencies in each category are obtained from a (cross-sectional) analytical survey. Note that it is similar to, but slightly larger than, the prevalence risk ratio for the same data. When the risk of infection is very small, the value of the odds ratio is very similar to that of the risk ratio. If the risk of infection is large, the odds ratio will be much larger than the risk ratio. The risk ratio is usually (but not always) the preferred measure for prevalence studies since it is more readily interpretable in terms of risk of infection. However, the prevalence odds ratio is still heavily used.
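For the same cysticercosis data, the prevalence odds ratio works out slightly larger than the prevalence risk ratio, as noted above. A sketch:

```python
# Cell counts from the cysticercosis contingency table in the text
a, b = 173, 530   # households with pigs: infected, not infected
c, d = 72, 328    # households without pigs: infected, not infected

odds_exposed = a / b       # odds of infection among households with pigs
odds_unexposed = c / d     # odds of infection among households without pigs
odds_ratio = odds_exposed / odds_unexposed   # algebraically equal to (a * d) / (b * c)

print(round(odds_ratio, 3))   # 1.487 - compare with the risk ratio of 1.367
```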
In a case-control study, the groups to be compared are selected on the basis of the response variable - so one group comprises (usually) all the cases in the population, and the other a randomly-selected group of controls. Since you are not taking a random sample from the entire population, you cannot estimate the proportion infected - so a risk ratio cannot be calculated. Nor can you estimate the odds of infection in exposed and unexposed groups. But you can estimate the odds that each group of individuals (cases and controls) has been exposed to a particular risk factor. Because of a simple mathematical identity - the ratio of the odds of exposure (a/c divided by b/d) equals the ratio of the odds of infection (a/b divided by c/d), both reducing to ad/bc - you can then estimate the odds ratio indirectly.
Depending on the precise type of case-control study and on the assumptions that can be made, the odds ratio may approximate to either the risk ratio or the rate ratio. We will consider these various designs in more depth in Unit 7.
The incidence rate ratio
The incidence rate ratio is calculated as the ratio of the incidence rates in exposed and unexposed individuals. Incidence rate can be estimated either as the number of cases divided by the sum of time at risk, or as the number of cases divided by the average size of the group over the period. Rate ratios can only be estimated from cohort studies, because we need to know the number of cases over a defined period of time.
Algebraically speaking -
Incidence rate ratio = ρ1 / ρ2 = (e1 / n1) / (e2 / n2)
- ρ1 and ρ2 are the rates in the exposed and unexposed populations respectively;
- e1 & e2 are the number of events in each population;
- n1 & n2 are the size of the two groups, midway through the time period.
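As a numerical illustration of the formula above (the counts here are hypothetical, not from the text):

```python
# Hypothetical cohort data: e = number of cases, n = group size midway through the period
e1, n1 = 30, 1000    # exposed group
e2, n2 = 12, 1200    # unexposed group

rate_exposed = e1 / n1       # incidence rate in the exposed group
rate_unexposed = e2 / n2     # incidence rate in the unexposed group
incidence_rate_ratio = rate_exposed / rate_unexposed

print(round(incidence_rate_ratio, 6))   # 3.0
```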
As with the risk ratio and odds ratio, the further the value is from unity, the more likely it is that the exposure is related to infection with the disease.
Note that all the ratios covered here may also be used to summarize relationships in r × 2 tables. One of the levels of the explanatory variable must be chosen as the reference or control level - other levels are then compared with this (see below for an example of this in the 'how to display' section).
Significance of the association
The mere fact that one of the ratios above does not precisely equal one does not indicate that there is any real association between the two variables. It is quite possible that the observed deviation from one arose by chance. The greater the deviation from one (for a given sample size), the less likely it is that the association arose by chance, and the more likely it is to be statistically significant.
There are a number of methods that can be used to assess the significance of an observed association between nominal variables. These include the well known (and much abused) 'chi square' test. Another approach is to attach a confidence interval to the ratio - although this should be done primarily to give an idea of the reliability of the estimate, rather than as a surrogate statistical test. We consider the analysis of contingency tables in depth in Unit 9.
Relationships between measurement variables
When we come to measurement variables, we have a lot more information about the relationship between the two variables. The relationship can be displayed by plotting one variable against the other on a scatterplot as shown here.
We could of course collapse each variable into two classes (as shown in the second figure) and still put our data into a contingency table, as we did above. But this would not be a good idea for two reasons. Firstly, we would lose all the extra information we have gained by using a measurement variable. Secondly, our dividing points between light and heavy, and between tall and short, would be entirely arbitrary, and might therefore introduce bias. Instead we want to assess the degree to which a change in one variable (say weight) is associated with a change in another variable (say height). This is usually done using correlation or regression analysis.
However, the first step is always to make a scatterplot, as above. The reason for this is very simple. There are many 'models' which can be used to describe the relationship between two variables. The commonest of these assume a straight line or linear relationship between them. If you blindly apply regression or correlation analysis to data, without first checking that any relationship really is linear, you are liable to obtain a quite misleading result. We look at the practicalities of this below when we cover display of relationships.
Correlation and regression - a brief introduction
Correlation and regression are relatively advanced topics in statistics which we look at in depth in Units 12 and 14. However, the techniques are used so widely in exploratory data analysis that we need to provide a brief introduction to the topic here.
There are several measures of the strength of association or correlation between two measurement variables. The most commonly used measure is the Pearson correlation coefficient (r).
Algebraically speaking -
r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]

- r is the Pearson product moment correlation coefficient,
- X and Y are the individual observations of the two variables,
- X̄ and Ȳ are the arithmetic means of the two sets of observations.
This describes the strength and direction of the linear association between two variables. In other words it assesses to what extent the two variables covary.
The value of the correlation coefficient can vary from +1 (perfect positive correlation) through 0 (no correlation) to -1 (perfect negative correlation) as shown in the graph. Note, however, that correlation analysis is only valid if each variable has a symmetrical (or normal) distribution. For correlation analysis there is no distinction between Y and X in terms of which is an explanatory variable and which a response variable.
There are other measures of correlation (known as rank correlation coefficients) which do not assume a linear relationship, but still assume that the relationship is monotonic (in other words, if the value of one variable increases so does the other, and vice versa). No simple measure of correlation can deal with (for example) a U-shaped relationship.
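The Pearson coefficient is straightforward to compute from its definition. A minimal sketch, using a small set of made-up height and weight values for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data (heights in cm, weights in kg)
heights = [150, 160, 165, 170, 180]
weights = [55, 60, 63, 70, 80]

r = pearson_r(heights, weights)
print(round(r, 3))   # 0.979 - a strong positive linear association
```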
Linear regression is used to fit a straight line relationship between a response variable and an explanatory variable. The simplest way to fit the line is to use the method of least squares. This minimizes the squared deviations of the points from the predicted line.
The line is of the form: Y = a + bX where Y is the value of the dependent variable, a is the intercept (value of y when x = 0), b is the slope, and X is the value of the independent variable. The figure below shows what the intercept (a) and slope (b) in the regression equation represent on a graph:
The best-fit regression line can be obtained using the method of least squares in the following way:
Calculate the slope (b) of the best-fit line:
Algebraically speaking -
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

- b is the regression coefficient,
- X and Y are the individual observations of the independent and dependent variables respectively,
- X̄ and Ȳ are the arithmetic means of the independent and dependent variables respectively.
Calculate the intercept of the best-fit line by substituting the estimated value of b in the equation below:
a = Ȳ − bX̄
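The two calculations above can be sketched in a few lines of Python; the height and weight values are made up for illustration:

```python
def least_squares(xs, ys):
    """Return the intercept (a) and slope (b) of the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Made-up illustrative data (heights in cm, weights in kg)
heights = [150, 160, 165, 170, 180]
weights = [55, 60, 63, 70, 80]

a, b = least_squares(heights, weights)
print(round(b, 2))   # 0.85 kg per cm (slope)
print(round(a, 2))   # -74.65 (intercept)
```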
At first sight regression analysis seems very similar to correlation. Both measures estimate how well a linear relationship fits the scatter of observations. However, regression analysis assumes that the independent variable is measured without error and that the distribution of the deviations of the observations from the fitted line are normally distributed. In addition the variability of the observations should not be related to the value of the independent variable.
As we shall see in the examples, these simple rules are widely ignored - and it is common to find regression equations fitted to all types of data. If this is done purely for descriptive purposes, there are no major problems with this approach. But if the values of the slopes of the lines are important, then ignoring the assumptions of regression analysis will result in biased estimates.
Significance of the association
As with nominal variables, you need a way to assess whether an apparent relationship between measurement variables could have arisen by chance, or is instead significant. For the correlation coefficient there are various statistical tests that can be carried out, or you can simply look up the value in statistical tables or in your software package. The smaller the sample size, the larger the value of r has to be before one can be confident that the apparent association did not arise just by chance. For regression you can test whether the slope differs significantly from zero using a t-test, or you can carry out an analysis of variance. We go into this in depth in Units 12 and 14.
Association and causation
There is a big difference between showing that two variables are associated, and concluding that changes in one variable are causing changes in the other variable. We have dealt with the problem of chance association above, but two other factors can also produce a spurious association:
- Bias - there are many sources of bias in a study which can give rise to an apparent association. Two of the most important are selection bias and measurement bias. Random sampling (for descriptive and observational studies) and random allocation (for experimental studies) help to reduce bias, although the effects of measurement errors can be difficult (if not impossible) to allow for - and are routinely ignored.
- Confounding variables - there is always the possibility that the association results from both variables being related to a third, confounding variable. This may either cause an illusory association between two variables, when no such association exists, or it may mask a true association. Confounding variables are a particular problem in descriptive and observational studies - random allocation should largely eliminate the problem in experimental studies.
We consider the problem of bias and confounding factors in Unit 2, and look at measurement error in more detail in Unit 12.
Even if we are confident that we have excluded the effects of chance, bias and confounding factors, we still cannot 'prove' causation statistically. It can only be inferred by considering evidence from a number of different sources - including whether there is a viable biological mechanism for the relationship to operate. Since the design of the study is also very important in this respect, we examine the issue of causation in some depth in Unit 7.