|"It has long been an axiom of mine that the little things are infinitely the most important" |
Relationships between variables: how to summarize and display them
Distinguishing between variables
In a study of relationships between variables, we can often (but not always) distinguish between two types of variables: the response variable and the explanatory variable.
In many studies the distinction between response and explanatory variables is quite clear. Let's take as an example an epidemiological study of the disease cysticercosis in a rural
Relationships between nominal variables
The contingency table
Data on relationships between nominal variables are usually tabulated in the form of a contingency table. This is a table of frequencies classified according to two (or more) sets of values of nominal or ordinal variables. Such a table is also known as a cross tabulation.
The simplest cross tabulation is a 2 row × 2 column (usually shortened to a 2 × 2) contingency table, where each variable can only take one of two values - in other words they are binary variables. The data may have been recorded as a binary variable (for example, whether infection is present or absent). Or the data may have been recorded as a measurement variable, but collapsed to a binary variable (for example size measurements collapsed to 'large' or 'small'). The data are then commonly laid out as below:
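As a minimal sketch of this layout (using hypothetical counts), a 2 × 2 table can be held as a nested list in Python, with the explanatory variable as rows and the response variable as columns, and the marginal totals computed from it:

```python
# A 2 x 2 contingency table: explanatory variable in rows,
# response variable in columns. Counts are hypothetical.
table = [
    [30, 70],   # exposed:   30 with disease, 70 without
    [10, 90],   # unexposed: 10 with disease, 90 without
]

# Marginal (row and column) totals and the grand total
row_totals = [sum(row) for row in table]          # one total per row
col_totals = [sum(col) for col in zip(*table)]    # one total per column
grand_total = sum(row_totals)
```

Keeping the explanatory variable in the rows matches the convention used in this page's tables.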
Let's return to our example of an epidemiological study of cysticercosis in a rural
In a contingency table there is unfortunately no agreed convention on whether to have the explanatory variable as rows or columns. Epidemiologists and experimental biologists usually put the explanatory variable in the rows and the response variable in the columns. We will do our contingency tables this way. Social scientists usually do the reverse - which, when expressed graphically, would correspond to plotting the explanatory variable on the x-axis.
In addition there is no agreed convention on whether to give levels as + followed by - for each variable (as we have done) or whether to list them as - and then +. For explanatory variables this is arbitrary if the explanatory variable is something like age. But it obviously dictates the value of any ratios you may calculate. So, before calculating any of the ratios detailed below, ensure you check which way round the variables have been listed, and which level is the 'reference level' of the explanatory variable (in our examples shown in deep yellow).
Contingency tables are not limited to
Ratios for summarizing relationships
Epidemiologists often need to summarize relationships between nominal variables, because both the response and explanatory variables they study are usually nominal, and often binary - for example whether an individual has a disease or not, and whether that person smokes or not. Hence we will use their terminology for the methods we examine - but remember that the same designs can be (and are) used in other disciplines. The response variable is commonly either a measure of disease frequency or a measure of mortality. The ratios used are either a risk ratio, an odds ratio, or a rate ratio. For some study designs, only one type of ratio is appropriate.
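To illustrate, the risk ratio and odds ratio can be computed directly from the four cells of a 2 × 2 table. The counts below are hypothetical; the rows are the exposure (exposed first, unexposed as the reference level) and the columns are disease status:

```python
# Cells of a hypothetical 2 x 2 table
a, b = 30, 70   # exposed:   diseased, not diseased
c, d = 10, 90   # unexposed: diseased, not diseased

# Risk = proportion diseased in each exposure group
risk_exposed = a / (a + b)                    # 30/100 = 0.3
risk_unexposed = c / (c + d)                  # 10/100 = 0.1
risk_ratio = risk_exposed / risk_unexposed    # about 3.0

# Odds ratio = cross-product ratio of the four cells
odds_ratio = (a * d) / (b * c)                # (30*90)/(70*10), about 3.86
```

Note how the odds ratio exceeds the risk ratio here; the two only approximate each other when the disease is rare.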
Note that all the ratios covered here may also be used to summarize relationships in
Significance of the association
The fact that one of the ratios above does not precisely equal one does not, by itself, indicate that there is any real association between the two variables. It is quite possible that the observed deviation from one arose by chance. The greater the deviation from one (for a given sample size), the less likely it is that the association arose by chance, and the more likely it is to be statistically significant.
There are a number of methods that can be used to assess the significance of an observed association between nominal variables. These include the well known (and much abused) 'chi square' test. Another approach is to attach a confidence interval to the ratio - although this should be done primarily to give an idea of the reliability of the estimate, rather than as a surrogate statistical test. We consider the analysis of contingency tables in depth in
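As a sketch of what the chi-square test actually computes (pure Python, hypothetical counts): each expected frequency is row total × column total ÷ grand total, and the statistic sums (observed − expected)² / expected over all cells:

```python
# Hypothetical 2 x 2 table: rows = exposure, columns = disease status
table = [[30, 70],
         [10, 90]]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_tot[i] * col_tot[j] / n      # expected frequency
        chi2 += (obs - expected) ** 2 / expected

# For a 2 x 2 table there is 1 degree of freedom; the 5% critical
# value of chi-square on 1 df is 3.84
significant = chi2 > 3.84
```

In practice one would use a library routine (and consider a continuity correction for small samples), but the arithmetic is exactly this.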
Relationships between measurement variables
When we come to measurement variables, we have a lot more information about the relationship between the two variables. The relationship can be displayed by plotting one variable against the other on a scatterplot as shown here.
We could of course collapse each variable into two classes (as shown in the second figure) and still put our data into a contingency table as we did above. But this would not be a good idea, for two reasons. Firstly, we would lose all the extra information we have gained by using a measurement variable. Secondly, our dividing points between light and heavy and tall and short would be entirely arbitrary, and might therefore introduce bias. Instead we want to assess the degree to which a change in one variable (say weight) is associated with a change in another variable (say height). This is usually done using correlation or regression analysis.

However, the first step is always to make a scatterplot, as above. The reason for this is very simple. There are many 'models' which can be used to describe the relationship between two variables. The commonest of these assume a straight-line, or linear, relationship between them. If you blindly apply regression or correlation analysis to data, without first checking that the relationship really is linear, you are liable to obtain a quite misleading result. We look at the practicalities of this below when we cover display of relationships.
Correlation and regression - a brief introduction
Correlation and regression are relatively advanced topics in statistics which we look at in depth in
There are several measures of the strength of association or correlation between two measurement variables. The most commonly used measure is the Pearson correlation coefficient (r).
This describes the strength and direction of the linear association between two variables. In other words it assesses to what extent the two variables covary.
The value of the correlation coefficient can vary from +1 (perfect positive correlation) through 0 (no correlation) to -1 (perfect negative correlation) as shown in the graph. Note, however, that correlation analysis is only valid if each variable has a symmetrical (or normal) distribution. For correlation analysis there is no distinction between Y and X in terms of which is an explanatory variable and which a response variable.
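As a minimal sketch of how r is calculated (with hypothetical height and weight data), the coefficient is the sum of cross-products of deviations divided by the square root of the product of the two sums of squares:

```python
import math

# Hypothetical paired measurements
heights = [150, 155, 160, 165, 170, 175, 180]
weights = [52, 57, 60, 64, 69, 72, 78]

n = len(heights)
mx = sum(heights) / n
my = sum(weights) / n

# Sum of cross-products and sums of squared deviations
sxy = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
sxx = sum((x - mx) ** 2 for x in heights)
syy = sum((y - my) ** 2 for y in weights)

r = sxy / math.sqrt(sxx * syy)   # close to +1 for these data
```

Because weight rises almost linearly with height in this made-up sample, r comes out close to +1.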
There are other measures of correlation (known as rank correlation coefficients) which do not assume a linear relationship, but still assume that the relationship is monotonic (in other words, if the value of one variable increases so does the other, and vice versa). No simple measure of correlation can deal with (for example) a U-shaped relationship.
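To illustrate the difference, a rank correlation gives exactly +1 for any perfectly monotonic relationship, however curved. Below is a sketch of Spearman's coefficient, using the no-ties formula rho = 1 − 6Σd²/(n(n² − 1)), applied to data following y = x³:

```python
# Monotonic but strongly non-linear relationship: y = x**3
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

def ranks(values):
    # Rank from 1 (smallest) upward; assumes no ties, for simplicity
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
# rho is exactly 1.0 here: the ranks agree perfectly even though
# the relationship is far from linear
```

A Pearson coefficient on the same data would be below 1, since the points do not lie on a straight line.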
Linear regression is used to fit a straight-line relationship between a response variable and an explanatory variable. The simplest way to fit the line is the method of least squares, which minimizes the sum of the squared vertical deviations of the points from the fitted line.
The line is of the form: Y = a + bX, where Y is the value of the dependent variable, a is the intercept (the value of Y when X = 0), b is the slope, and X is the value of the independent variable. The figure below shows what the intercept (a) and slope (b) in the regression equation represent on a graph:
The best-fit regression line can be obtained using the method of least squares as follows: the slope is b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², and the intercept is a = Ȳ − bX̄, where X̄ and Ȳ are the means of the two variables.
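A minimal sketch of this calculation in Python, on hypothetical data that follow roughly Y = 2X:

```python
# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sum of cross-products and sum of squared deviations of X
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sxy / sxx            # least-squares slope
a = ybar - b * xbar      # least-squares intercept
```

For these data the fitted slope comes out very close to 2 and the intercept close to 0, as expected.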
At first sight regression analysis seems very similar to correlation. Both measures estimate how well a linear relationship fits the scatter of observations. However, regression analysis assumes that the independent variable is measured without error and that the deviations of the observations from the fitted line are normally distributed. In addition, the variability of the observations should not be related to the value of the independent variable.
As we shall see in the examples, these simple rules are widely ignored - and it is common to find regression equations fitted to all types of data. If this is done purely for descriptive purposes, there are no major problems with this approach. But if the values of the slopes of the lines are important, then ignoring the assumptions of regression analysis will result in biased estimates.
Significance of the association

As with nominal variables, you need a way to assess whether an apparent relationship between measurement variables could have arisen by chance, or is instead significant. For the correlation coefficient there are various statistical tests that can be carried out, or you can simply look up the value in statistical tables or in your software package. The smaller the sample size, the larger the value of r has to be before one can be confident that the apparent association did not arise merely by chance. For regression you can test whether the slope differs significantly from zero using a t-test, or you can carry out an analysis of variance. We go into this in depth in
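As a sketch of the t-test on the slope (pure Python, hypothetical data): t is the fitted slope divided by its standard error, where SE(b) = √(RSS / (n − 2) / Sxx), and it is compared against the critical t value on n − 2 degrees of freedom:

```python
import math

# Hypothetical paired observations with a clear linear trend
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.1, 4.2, 4.8, 6.1, 6.9, 7.7, 9.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx            # least-squares slope
a = ybar - b * xbar      # least-squares intercept

# Residual sum of squares, on n - 2 degrees of freedom
resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(resid_ss / (n - 2) / sxx)   # standard error of slope

t = b / se_b
# Two-tailed 5% critical value of t on n - 2 = 6 df is about 2.447
slope_significant = abs(t) > 2.447
```

With these tightly clustered points the t statistic is very large, so the slope is clearly distinguishable from zero.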
Association and causation
There is a big difference between showing that two variables are associated, and concluding that changes in one variable are causing changes in the other variable. We have dealt with the problem of chance association above, but two other factors can also produce a spurious association: bias and confounding.
Even if we are confident that we have excluded the effects of chance, bias and confounding factors, we still cannot 'prove' causation statistically. It can only be inferred by considering evidence from a number of different sources - including whether there is a viable biological mechanism for the relationship to operate. Since the design of the study is also very important in this respect, we examine the issue of causation in some depth in