"It has long been an axiom of mine that the little things are infinitely the most important"
Transformations: Use & misuse
(rescaling, monotonic, log transformation, normalizing distributions, linearizing curves, homogenizing variances, additive model, detransformation)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
Transforming a variable re-scales it. A transformation can be any mathematical operation applied to data, and a detransformation reverses (inverts) that process. Although an infinite variety of transformations is possible, the most important are those applied to all values. Those that do not alter the ranks of the data are known as monotonic transformations. A transformation is linear if plotting the transformed data against the untransformed data produces a straight line; linear transformations are mainly used to ease data handling or display.
Non-linear transformations are those that do not give a straight line when the transformed data are plotted against the original data. They are of special value in data analysis for several purposes. Probably the commonest are to normalize distributions of data and their statistics, and to linearize a curvilinear relationship between two variables. They are also used to equalize the variances of different groups of observations, to produce an additive model for multiplicative biological processes, and to provide a more appropriate measure of central location.
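The distinctions drawn above can be illustrated with a short Python sketch. The data values and the helper `ranks` are hypothetical, chosen only for illustration; the sketch shows that a linear rescaling and a log transformation both preserve ranks, that a non-monotonic operation need not, that exponentiation detransforms the logs, and that a multiplicative effect (here a doubling) becomes an additive shift on the log scale.

```python
import math

def ranks(values):
    """Return the rank (0 = smallest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

counts = [3.0, 120.0, 15.0, 48.0, 7.0]  # hypothetical positive measurements

rescaled = [2.5 * x + 10 for x in counts]      # linear and monotonic
logged = [math.log(x) for x in counts]         # non-linear but monotonic
squared_dev = [(x - 20) ** 2 for x in counts]  # non-monotonic: ranks can change

# Detransformation: exponentiating the logs recovers the original values.
detransformed = [math.exp(y) for y in logged]

# A multiplicative effect (doubling) is a constant additive shift on the log scale.
doubled_logged = [math.log(2 * x) for x in counts]
shifts = [b - a for a, b in zip(logged, doubled_logged)]

print(ranks(counts), ranks(logged), ranks(squared_dev))
print(shifts)
```

Every entry of `shifts` equals log(2), which is what is meant by the log transformation producing an additive model for a multiplicative process.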
In medical and veterinary research the only transformation commonly encountered is the log transformation. Veterinarians have been engaged in a long-running debate over whether or not to log transform faecal egg counts before estimating the percentage reduction caused by anthelmintic treatments. Applied ecologists have traditionally made extensive use of transformations, whether log, square root or angular. However, increased use of generalized linear models is now reducing the use of some transformations - in particular the angular transformation, which was formerly used for proportions.
Probably the commonest misuse of transformations is not using a transformation when one is required. Another general problem is that, whilst the need for a transformation may be correctly identified, there is often no check made after the data have been transformed to see if distributions have actually been normalized or variances homogenized. Graphical display methods rather than significance tests should be used to assess normality, because with a small sample a test has so little power that normality will almost inevitably be accepted. We give examples where tests have been used and the result misinterpreted as showing a 'significant fit' to normality rather than (as was actually the case) a significant deviation from normality. Observational studies based on convenience samples are a further problem, and not very useful. One should never automatically use a particular transformation: we give an example where the effectiveness of a treatment against nematode larvae was not well assessed using either arithmetic or geometric means, and the percentage of cattle with only a small number of larvae might have been a better indicator.
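As a minimal sketch of a graphical check made after transforming, the Python below computes the points of a normal quantile-quantile plot before and after a log transformation. The data are hypothetical, the plotting position (i - 0.5)/n is one common convention assumed here, and the helper names are illustrative. The correlation of the plotted points is included only as a rough numeric companion to the plot, not as a test of normality.

```python
import math
from statistics import NormalDist, mean

def normal_qq_points(values):
    """Pair each sorted value with a standard normal quantile."""
    n = len(values)
    ordered = sorted(values)
    # (i - 0.5) / n is an assumed plotting position; other conventions exist.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the Q-Q points: a rough summary, not a test."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical right-skewed counts.
raw = [2, 3, 5, 8, 12, 20, 33, 55, 90, 150, 240, 400]
logged = [math.log(x) for x in raw]

raw_points = normal_qq_points(raw)
log_points = normal_qq_points(logged)

print(straightness(raw_points), straightness(log_points))
```

The point pairs can be passed to any plotting tool; a roughly straight line after transformation suggests the transformation has done its job, whereas a curved line shows it has not.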
Presentation of results after a transformation is highly variable - not least because many statistical texts are unclear on the matter. We recommend strongly that only detransformed means (and, just as importantly, their confidence intervals) are reported, but some authors still follow the advice of texts that suggest that means and intervals based on the untransformed data should be presented. One clear error of presentation is to calculate asymmetric confidence intervals, but show only one of the tails in the figure. One important issue (about which there is little awareness) is the bias caused by adding constants to cope with zeros - especially with the log(x+1) and √(x+0.5) transformations. For the Box-Cox transformation, it is better to use a 'convenient estimator' rather than the precise power transformation, although only the log transformation is really useful, since it allows one to attach a confidence interval to the ratio of the geometric means.
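The following Python sketch shows what reporting a detransformed mean with its confidence interval can look like for log-transformed data. The data are hypothetical, the function name `geometric_mean_ci` is illustrative, and the critical value of t (2.365 for 7 degrees of freedom at the 95% level) is supplied by the caller from tables. The last lines illustrate one aspect of the added-constant problem: detransforming the mean of log(x+1) does not recover the geometric mean of the original values, even when there are no zeros.

```python
import math
from statistics import mean, stdev

def geometric_mean_ci(values, t_crit):
    """Detransformed mean and confidence limits from a log-scale analysis.

    t_crit is the two-sided critical value of Student's t for
    len(values) - 1 degrees of freedom, supplied by the caller.
    """
    logs = [math.log(x) for x in values]
    centre = mean(logs)
    half_width = t_crit * stdev(logs) / math.sqrt(len(logs))
    # Symmetric on the log scale, asymmetric once detransformed.
    return (math.exp(centre - half_width),
            math.exp(centre),
            math.exp(centre + half_width))

counts = [4, 9, 15, 22, 40, 75, 130, 310]  # hypothetical positive counts
lower, gm, upper = geometric_mean_ci(counts, t_crit=2.365)  # t(0.975), 7 df

# Adding a constant before logging, then detransforming and subtracting it:
gm_plus_one = math.exp(mean([math.log(x + 1) for x in counts])) - 1

print(lower, gm, upper)
print(gm_plus_one)
```

Because the interval is asymmetric about the detransformed mean, both limits need to be shown in any figure.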
What the statisticians say
Armitage & Berry (2002) cover transformations in medical statistics in Chapter 10, along with other approaches to analyzing data that do not have a normal distribution. Bland (2000) provides a more accessible introduction to the topic for medics in Chapter 10. Krebs (1999) provides a good account of transformations in Chapter 15, including some valuable do's and don'ts for detransforming log transformed data. Standard biometry texts including Zar (1999), Sokal & Rohlf (1995) and Snedecor & Cochran (1989) all provide basic coverage of the logarithmic, square root and arcsine transformations.
For medical researchers, Bland & Altman (1996a, 1996b, 1996c) provide a good introduction to the use of transformations, especially the log transformation. Keene (1995) argues that the log transformation has particular advantages and should frequently be preferred to untransformed analyses.
For ecological researchers, O'Hara & Kotze (2010) advise against log-transforming count data, especially if there are zeros present, and instead recommend the use of generalized linear models. McArdle et al. (1990) and Gaston & McArdle (1994) provide an excellent discussion of ways of assessing the temporal variability of animal abundances. They point out that the standard deviation of log(n+1) data will underestimate the true standard deviation if zeros are present, so rare populations look less variable than they really are. Box & Cox (1964) describe how to obtain the optimum transformation for data being analyzed with a linear model. Anscombe (1948) and Bartlett (1947) consider the theory of, and need for, transformations, and then look at several in more detail.
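The effect of the added constant on measured variability can be sketched in Python. The abundances below are hypothetical and contain no zeros, so that the standard deviation of log(n) can be computed for comparison; with zeros present log(n) is undefined, which is why the constant is added in the first place. The two populations fluctuate by identical proportions and differ only in mean abundance.

```python
import math
from statistics import stdev

def sd_log(values, constant=0.0):
    """Standard deviation of log10(value + constant)."""
    return stdev([math.log10(x + constant) for x in values])

common = [100, 200, 400, 800, 400, 200]  # hypothetical abundant population
rare = [x / 100 for x in common]         # same proportional changes, 100 times rarer

# Without the constant the two populations are equally variable on the log scale.
print(sd_log(common), sd_log(rare))

# With the constant the rare population appears much less variable.
print(sd_log(common, 1), sd_log(rare, 1))
```

Adding 1 compresses the low values far more than the high ones, so the rarer the population, the more its variability is understated.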
Wikipedia provides sections on data transformation and the Box-Cox distribution. The NIST/SEMATECH e-Handbook of Statistical Methods explains how to use the Box-Cox transformation to normalize data. The Handbook of Biological Statistics covers the common transformations, as does Will Hopkins (2000a, 2000b). Pengfei Li gives an overview of Box-Cox transformations.