Example, with R
Scatterplots are among the simplest sorts of graph (other than rugplots). For example:
This can be done using a pencil and ruler, or with R.
The pairing of values in variables x and y is assumed to be important. In other words each value of x is usually 'paired' with a value of y for some good reason.
- For instance, those 5 pairs of x,y values could be the result of examining 5 different farms - each pair of values representing one farm.
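Note, with R: plot(x, y) puts its first argument on the horizontal axis and its second on the vertical axis, so plot(y, x) would give a plot of x against y (y horizontal, x vertical).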
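As a minimal sketch of the point about argument order, the R code below plots five paired values both ways round. The numbers are hypothetical (they are not the data behind the original figure) and simply stand in for five farms, one pair per farm.

```r
# Hypothetical data: 5 farms, one (x, y) pair per farm (values are made up)
x <- c(2, 4, 5, 7, 9)
y <- c(10, 14, 13, 19, 22)

# Pairing matters: x[i] and y[i] must describe the same farm,
# so the two vectors have the same length and the same order
n_pairs <- length(x)

# plot(x, y): first argument on the horizontal axis, second on the vertical axis
plot(x, y, xlab = "x", ylab = "y", main = "y against x")

# Swapping the arguments swaps the axes
plot(y, x, xlab = "y", ylab = "x", main = "x against y")
```

Sorting x or y on its own before plotting would break the pairing and give a meaningless graph.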
Definition and Use
- A scatterplot (also called a scattergram or scattergraph) is the graph that results from plotting one variable (Y) against another (X). Each point represents one unit and is positioned at the intersection of that unit's values of the two variables.
- The pattern of the points indicates the strength and direction of the association or correlation between the two variables.
- If the points cluster along a band from the lower left to the upper right, this suggests a positive association.
- If the points cluster along a band from the upper left to the lower right, this suggests a negative association.
- If there is no suggestion of the points clustering, then there is no evidence for any association between the two variables.
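The three patterns described above can be sketched in R with made-up data. The values below are hypothetical, chosen only so that one set rises with x, one falls, and one shows no trend. cor() is used here as a rough numerical companion to the plots, and it only measures linear association.

```r
# Hypothetical values chosen only to illustrate the three patterns
x      <- 1:10
y_pos  <- c(2, 3, 5, 4, 6, 8, 7, 9, 11, 10)  # tends to rise with x
y_neg  <- rev(y_pos)                         # tends to fall with x
y_none <- c(5, 9, 2, 7, 4, 4, 7, 2, 9, 5)    # no trend with x

# Pearson correlation coefficients for each pattern
r_pos  <- cor(x, y_pos)
r_neg  <- cor(x, y_neg)
r_none <- cor(x, y_none)

# Three scatterplots side by side
old_par <- par(mfrow = c(1, 3))
plot(x, y_pos,  main = "Positive association")
plot(x, y_neg,  main = "Negative association")
plot(x, y_none, main = "No evident association")
par(old_par)
```

In the first panel the points run from lower left to upper right, in the second from upper left to lower right, and in the third there is no band at all.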
Tips and Notes
- Association between two variables can never prove that one variable CAUSES the other. It can provide supporting evidence for such a relationship, but ONLY if various other criteria for causality are also met.
- The association must be strong and confirmed in different places and at different times.
- Cause must occur before effect.
- There should be a dose-response relationship.
- The relationship must be biologically plausible.
- There should be experimental evidence for a causal link.
- Beware of relationships that result from very few points. Sometimes you will find that inclusion of just one 'influential' point can suggest a relationship whereas its exclusion would indicate no relationship.
- In general you should only predict the value of Y from a value of X that lies WITHIN the range of your observations (interpolation). Predicting beyond that range (extrapolation) assumes the relationship continues unchanged where you have no data. If you fit a line to a relationship, only use a solid line within those limits.
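The warning above about influential points can be checked with a small R sketch. The data are hypothetical: six points with no trend, plus one extreme point. Comparing the correlation with and without that point is only one simple check, not a full diagnostic.

```r
# Hypothetical data: six points with no trend, plus one extreme point (the 7th)
x <- c(1, 2, 3, 1, 2, 3, 10)
y <- c(1, 2, 1, 3, 2, 3, 10)

# Logical index that drops the suspected influential point
keep <- seq_along(x) != 7

r_all     <- cor(x, y)              # all seven points
r_without <- cor(x[keep], y[keep])  # extreme point excluded

# Show the extreme point in red so it stands out on the plot
plot(x, y, pch = 19, col = ifelse(keep, "black", "red"))

cat("r with all points:        ", round(r_all, 2), "\n")
cat("r without the 7th point:  ", round(r_without, 2), "\n")
```

With these made-up values the apparent strong relationship depends entirely on the single red point.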
Inspect the scatterplot shown below.
Would you be convinced by this relationship between the level of glutamate dehydrogenase and the number of flukes in cattle?
- The red line was fitted by ordinary least-squares regression of y on x, for all the points shown.
Data courtesy of Leclipteux et al. (1998)
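A fitted line of the kind described above, drawn only within the range of the observations as advised earlier, might be produced in R roughly as follows. The data are hypothetical and are not the glutamate dehydrogenase and fluke data. The axis limits are set wider than the data only to show where the line stops.

```r
# Hypothetical data (not the data from the figure)
x <- c(3, 5, 6, 8, 11, 12)
y <- c(4.1, 6.2, 6.8, 9.1, 11.9, 13.2)

# Ordinary least-squares regression of y on x
fit <- lm(y ~ x)

# Predict only at the two ends of the observed range of x
x_range <- range(x)
y_hat   <- predict(fit, newdata = data.frame(x = x_range))

# Axes deliberately wider than the data
plot(x, y, xlim = c(0, 15), ylim = c(0, 16))

# lines() joins the two end points, so the solid line stops at the data limits
lines(x_range, y_hat, col = "red", lwd = 2)
```

By contrast, abline(fit) would draw the line across the whole plotting region, including values of x where there are no observations.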
- Griffiths, D. et al. (1998). Understanding Data. Principles and Practice of Statistics. Wiley, Brisbane.
- Gives an excellent account of exploratory data analysis of bivariate relationships using scatterplots, including use of the median trace.
- Kabacoff, R.I. (2012). Quick-R: Scatterplots.
- Covers simple scatterplots, scatterplot matrices, high density scatterplots and 3D scatterplots
- Kuo (2002). Extrapolation of correlation between 2 variables in 4 general medical journals. JAMA 287 (21), 2815-2817.
- Looks at the prevalence of unjustified extrapolation in recent medical literature.
- Wikipedia: Scatter Plot.