Relationships between nominal variables are often only presented in tabular form. This is especially the case for 2×2 tables. For r×2 tables, they may also be presented graphically, especially if (as here) the explanatory variable is measured on the ordinal scale. The results from this table are shown graphically, as a bar diagram, in the first figure below. The response variable (prevalence) is shown on the vertical or y-axis, with the explanatory variable (age) on the horizontal or x-axis.
Table: Risk factors for cysticercosis in pigs in Bolivia (columns: +ve, -ve)
In the second figure above, the risk ratios are plotted rather than the original prevalences. The ratio is plotted on the y-axis and age on the x-axis. The attached confidence intervals indicate the reliability of the estimates. Note that a log scale is used for the y-axis because of the skewed distribution of ratios. When using ratios, it is important to always specify the reference category (in this case 2-7 month old pigs) on the graph.
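A minimal Python sketch of how a risk ratio and its approximate 95% confidence interval might be calculated for one age group against the reference category. The counts are invented for illustration (they are not the Bolivian data), and the function name `risk_ratio_ci` is hypothetical. The interval is built on the log scale using the usual large-sample standard error of log(RR), which is one reason the intervals look symmetrical only when the ratio is plotted on a log axis.

```python
import math

def risk_ratio_ci(pos_group, n_group, pos_ref, n_ref, z=1.96):
    """Risk ratio of a group versus the reference category, with an approximate CI.

    The interval is calculated on the log scale and then back-transformed.
    """
    risk_group = pos_group / n_group
    risk_ref = pos_ref / n_ref
    rr = risk_group / risk_ref
    # Large-sample standard error of log(RR)
    se_log_rr = math.sqrt(1 / pos_group - 1 / n_group + 1 / pos_ref - 1 / n_ref)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Invented counts: 30 of 100 older pigs positive, 10 of 100 reference pigs positive
rr, lower, upper = risk_ratio_ci(30, 100, 10, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The reference category itself always has a ratio of exactly 1, so it carries no interval of its own on such a plot.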
As with a bar diagram, the response variable is shown on the y-axis, and the explanatory variable on the x-axis. If response and explanatory variables cannot be distinguished, the choice of which variable to put on which axis is arbitrary. The only exception to this is if you wish to predict the value of one variable (say weight of a cow) from another (say girth) - in this case the variable you wish to predict is put on the y-axis.
Ensure that the units are clearly stated for each of the variables. The minimum and maximum values on the x and y axes should be slightly below and above the minimum and maximum values in your data.
Scatterplots are the main tool of exploratory data analysis for looking at relationships between variables. Exploratory data analysis is concerned with understanding what the data are trying to tell you, and getting the best out of your data. There are several issues you can clarify with scatterplots:
- What is the shape of the relationship?
The first figure below shows a reasonably good linear relationship between two variables.
The second graph shows a very close relationship between Y and X, but is emphatically not linear - it is, in fact, described as a 'sigmoid' (S-shaped) curve. If we were to analyse these data using correlation, or linear regression analysis, we would conclude that the relationship was 'significant'. But such a model is clearly quite inappropriate for these data. The fourth graph also shows a clear relationship between Y and X - in this case it is U-shaped. Here a linear analysis would indicate no relationship between the variables.
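The way a linear analysis can mislead for these two shapes can be checked numerically. The sketch below, in Python, uses invented, noise-free data (not the data in the figures) and a hand-written helper, `pearson_r`, to calculate the correlation coefficient for a sigmoid curve and for a U-shaped curve.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented X values from -5.0 to 5.0, symmetric about zero
xs = [x / 2 for x in range(-10, 11)]

sigmoid_y = [1 / (1 + math.exp(-2 * x)) for x in xs]   # S-shaped curve
u_shaped_y = [x ** 2 for x in xs]                      # U-shaped curve

r_sigmoid = pearson_r(xs, sigmoid_y)
r_u = pearson_r(xs, u_shaped_y)

print(f"sigmoid:  r = {r_sigmoid:.2f}")   # high, although the curve is not a straight line
print(f"U-shaped: r = {r_u:.2f}")         # essentially zero, although Y depends entirely on X
```

In both cases Y is completely determined by X, so the correlation coefficient on its own says little about whether a straight line is an appropriate model.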
In the data we have shown above, the shape of each relationship is immediately clear because there is not much variability about the general trend. In other words, there is a high signal-to-noise ratio. Very often, however, this is not the case - and it may be difficult to assess the underlying shape of the relationship. In the More Information page on Measures of location we look at the use of running means and medians for this purpose. These can also be used for bivariate data, with the data arranged in increasing order of the X-variable.
Alternatively one can calculate a median trace as shown here. Again the data are arranged in increasing order of the X-variable, but this time they are simply grouped, and the median X and Y values are calculated for each group.
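The grouping described above might be sketched in Python as follows. The function name `median_trace`, the choice of equal-sized consecutive groups, and the data are all assumptions made for illustration.

```python
from statistics import median

def median_trace(xs, ys, n_groups):
    """Return one (median X, median Y) point per group.

    Points are sorted by X and split into n_groups consecutive groups
    of roughly equal size.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    trace = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue  # more groups than points: skip empty groups
        trace.append((median(x for x, _ in group), median(y for _, y in group)))
    return trace

# Invented data: nine points split into three groups of three
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 9, 3, 5, 4, 6, 1, 8, 7]
trace = median_trace(xs, ys, 3)
print(trace)   # [(2, 3), (5, 5), (8, 7)]
```

Joining the trace points with straight lines gives a rough picture of the trend that is not pulled about by single extreme values such as the 9 and the 1 in this example.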
If the relationship between two variables is not linear, it is often possible to linearize the relationship with a transformation. This means we change the scale of one (or both) of the variables. A logarithmic scale is often appropriate because many biological processes operate on a multiplicative rather than additive scale. A unit change in the X-variable then produces not an additive increase in the Y-variable (of, for example, 2 units) but a proportionate increase (for example, Y multiplied by 1.5, or perhaps doubled).
In this situation a logarithmic transform of the Y variable will often linearize the relationship - as has been done here. If Y increases with X, but at a decreasing rate (the opposite of what we have here), we would take the logarithm of the X-variable rather than the Y-variable. A different transformation - either the probit or logit transformation - can be used to linearize a sigmoid relationship. The reasoning underlying probit and logit transformations is explored in Unit 14.
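As a rough illustration of the logarithmic case, the Python sketch below fits a straight line to log(Y) for invented data in which Y is multiplied by 1.5 for every unit increase in X. The helper `fit_line` is a hand-written ordinary least-squares fit, not a library routine.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: Y starts at 2 and is multiplied by 1.5 per unit of X
xs = list(range(10))
ys = [2 * 1.5 ** x for x in xs]

# On the log scale the relationship is a straight line
log_ys = [math.log(y) for y in ys]
slope, intercept = fit_line(xs, log_ys)

# Back-transforming the coefficients recovers the multiplicative process
print(f"multiplier per unit of X: {math.exp(slope):.2f}")   # 1.50
print(f"Y at X = 0: {math.exp(intercept):.2f}")             # 2.00
```

Real data would of course scatter about the line, but the interpretation is the same: the slope on the log scale is the logarithm of the proportionate change in Y per unit of X.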
- Does the relationship result from very few points?
Sometimes a relationship that your software package tells you is 'significant' results from very few points. This happens when you have one or more influential points. An influential point is an extreme value of the response and/or explanatory variable that has a disproportionate effect on the regression analysis, both in terms of the slope of the line and the significance level. In the graph we have shown, with that point we find there is a significant relationship - without it (see second graph) there is clearly no relationship.
In the sense that they are extreme values, such points are a special type of outlier. Outliers have extreme values for either the response or explanatory variable. Unfortunately researchers have a habit of including them if they happen to fit what the researcher wants to get out of the data, yet excluding them if they do not fit the expected trend. In general it is best to analyse and display the data both with, and without, influential points and outliers - to make it clear how much a conclusion depends upon one or two observations. Only if a value can be shown to be in error can it be safely excluded from the data set.
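Analysing the data both with and without a suspect point is easy to do. The Python sketch below uses invented data (six points with no linear trend, plus one extreme point) and a hand-written `pearson_r` helper; none of the values come from the graphs shown here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: six points with no linear trend
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 4, 5, 3]

# One extreme (influential) point, also invented
extreme_x, extreme_y = 20, 15

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [extreme_x], ys + [extreme_y])

print(f"without the extreme point: r = {r_without:.2f}")
print(f"with the extreme point:    r = {r_with:.2f}")
```

Reporting both values makes it plain that the apparent relationship rests almost entirely on a single observation.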
- Can I extrapolate outside the range of data?
In general you should only predict the value of the response variable from the value of the explanatory variable if the point lies within the range of your observations; predicting beyond that range is extrapolation. This is why the solid line of a regression plot should never be extended outside the range of observations, as has been done in the first figure here. The correct way to show this relationship is shown in the second figure. If you wish to predict the rate of development at, say, 25°C, then a dotted line should be used (as shown in the third figure) to indicate that one has much less confidence in the relationship outside the range of observations.
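One way to build this caution into an analysis is to have the prediction step report whether the requested value lies inside the observed range. The Python sketch below is an illustration only: the temperatures and development rates are invented, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(x_new, xs, slope, intercept):
    """Return the predicted Y and whether x_new lies within the observed X range."""
    within_range = min(xs) <= x_new <= max(xs)
    return slope * x_new + intercept, within_range

# Invented observations: temperature (degrees C) and rate of development
temps = [10, 12, 14, 16, 18, 20]
rates = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07]

slope, intercept = fit_line(temps, rates)

for temp in (15, 25):
    rate, within = predict(temp, temps, slope, intercept)
    style = "solid line" if within else "dotted line (extrapolated)"
    print(f"{temp} degrees C: predicted rate {rate:.3f}, draw as {style}")
```

The flag does not make an extrapolated prediction any more reliable; it simply ensures that the reader is told when the line has been pushed beyond the data.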
It should be clear by now why you need to show both the line and all the data points when you display a regression-type relationship. Just displaying the regression line without the data is very bad practice - if you see it in a paper, suspect the worst!