InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

Beginners statistics introduction

Transformation and recoding


Example, with R

Let us rank-transform (rescale to rank) the heights jumped by 5 mice:

32.2 10.0 135.2 145.3 145.301

You could number these items from [1] (the lowest) to [5] (the highest):

32.2[2] 10.0[1] 135.2[3] 145.3[4] 145.301[5]

Transformed to rank, these five items become: 2 1 3 4 5

Or you could get their rank with R, for example using its rank() function.
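In R, a minimal sketch of this step, assuming base R's rank() function (the variable name jumps is our own), might look like this:

```r
# Heights jumped by the 5 mice (values from the example above)
jumps <- c(32.2, 10.0, 135.2, 145.3, 145.301)

# rank() numbers the values from 1 (the lowest) upwards
ranks <- rank(jumps)
print(ranks)   # 2 1 3 4 5

# By default, tied values share the average of the ranks they occupy,
# which is how a rank such as 2.5 can arise
tied <- rank(c(10, 20, 20, 30))
print(tied)    # 1.0 2.5 2.5 4.0
```

The second call uses made-up values purely to show how ties are handled.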


  • Notice each rank ignores the relative difference between the jumps - it just numbers them in order of magnitude.
  • These ranks only apply within this set of results. The top ranking jump of another set of results may be feeble or Herculean compared to these efforts.
  • As with recoding, we cannot de-transform rank-transformed data without a corresponding list of values.
  • To reflect that loss of information, we often talk about 'collapsing' data to ranks.
  • Unlike recoding, ranks do tell us something about their relative size - and we assume a rank of 2.5 would reflect a value somewhere between that corresponding to ranks 2 and 3.
  • The results of ranking may, or may not, be linearly related to the values that gave those ranks.


Definition and Use

  • Data transformation is a process by which we replace one series of values with another series of equally distinct values. This process usually assumes data are at least ordinal and, at least in principle, should be reversible.
    • Linear transformations do not affect the relative difference between values - non linear transformations (such as logarithmic rescaling) change the relative difference between values, but do so systematically and impartially.
    • Monotonic increasing transformations preserve the rank (relative order) of transformed values.
    • Some transformations, such as the reciprocal (=1/y) of positive values, reverse the ranks - other transformations just change some of them.
    • Not all transformed data can be de-transformed. Collapsing data, for example converting temperatures (in °C) to a binary scale (such as 'normal'/'abnormal') may be non-reversible. This is also true of rounded data, or truncated data, or 'pooled' data.
  • Data recoding replaces one set of values, or names, by another set. For instance you might code 'red', 'dead', 'round' and 'lost' as 'a', 'b', 'c' and 'd', or code them as 1, 2, 3, and 4.
    • Coding is arbitrary, and without the decode list, cannot be reversed.
  • Transformation and recoding (collectively termed remapping) are performed for many reasons - not all of which are sensible.
    • For instance, to avoid having to write down lots of zeroes, numbers may be divided (or multiplied) by 1000.
    • Logarithmic rescaling is useful where changes vary proportionally, or where data range from very small to very large positive numbers. It is often used in an attempt to ensure that errors are more or less normally distributed, or to weaken any relationship between the variance and the mean.
    • Data are commonly converted to whole numbers (integers), or letters, so it is easy to write them on a data sheet.
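
The effect of different transformations on rank can be checked directly in R. This is a minimal sketch with hypothetical positive values; the breakpoint of 5 and the labels 'high'/'low' are our own assumptions, chosen only for illustration:

```r
# Hypothetical positive measurements, chosen only for illustration
y <- c(0.5, 2, 10, 250)

rank(y)              # original ranks: 1 2 3 4

# Linear transformation (dividing by 1000): ranks unchanged
rank(y / 1000)       # 1 2 3 4

# Monotonic increasing non-linear transformation: ranks still unchanged
rank(log10(y))       # 1 2 3 4

# Reciprocal of all-positive values: ranks reversed
rank(1 / y)          # 4 3 2 1

# Collapsing to a binary scale: several values share one code,
# so the original values cannot be recovered from the result
collapsed <- ifelse(y > 5, "high", "low")
print(collapsed)     # "low" "low" "high" "high"
```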


Simple formula

One of the most familiar linear transformations is to remap temperature measured on the Celsius scale to the Fahrenheit scale.

Assuming F is Fahrenheit and C is Celsius

°F = °C*9/5 + 32

Transforming Fahrenheit back to Celsius is equally straightforward

°C = (°F - 32)*5/9
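
Both directions can be sketched in R as follows. The function names c_to_f and f_to_c and the test temperatures are our own; note the parentheses, since 32 must be subtracted before multiplying by 5/9:

```r
# Celsius to Fahrenheit
c_to_f <- function(celsius) celsius * 9 / 5 + 32

# Fahrenheit back to Celsius: subtract 32 first, then rescale
f_to_c <- function(fahrenheit) (fahrenheit - 32) * 5 / 9

temps_c <- c(-40, 0, 37, 100)
temps_f <- c_to_f(temps_c)
print(temps_f)   # -40.0 32.0 98.6 212.0

# The round trip recovers the original values (up to floating-point error)
round_trip_ok <- isTRUE(all.equal(f_to_c(temps_f), temps_c))
print(round_trip_ok)   # TRUE
```

Because the transformation is linear, equal steps in °C remain equal steps in °F.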

Probably the commonest non-linear transformation is the log transformation. If we denote the original data as X, and the transformed data as Y, then

Y_i = log10(X_i)
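
A minimal R sketch of this transformation and its back-transformation (the antilog, 10^Y), using hypothetical values that span several orders of magnitude:

```r
# Hypothetical positive values, chosen only for illustration
x <- c(0.01, 1, 100, 10000)

# Log-transform: each 100-fold change in x becomes a step of 2 in y
y <- log10(x)
print(y)   # -2 0 2 4

# Back-transform with the antilog to recover the original values
back <- 10^y
print(isTRUE(all.equal(back, x)))   # TRUE

# The logarithm of zero is not a finite number, so zeroes need care
print(log10(0))   # -Inf
```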

A common example of recoding is the box numbers used in post offices. For example:

Box 3076 = Mrs. A. Scrivena
Box 3078 = The Society in Favour of Unusual Practices
Box 3079 = The Bank of Uber Wallop Ltd.

Notice the box numbers tell you nothing about their users' relative size, importance, merit or rank. Nor can you 'back-transform' (or 'de-transform') box numbers to holders without a list of names.
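
In R, one way to hold such a decode list is a named character vector. This is only a sketch; the variable name box_holders is our own:

```r
# The decode list: box number -> holder (taken from the example above)
box_holders <- c(
  "3076" = "Mrs. A. Scrivena",
  "3078" = "The Society in Favour of Unusual Practices",
  "3079" = "The Bank of Uber Wallop Ltd."
)

# Decoding a box number works only because we have the list
holder <- box_holders[["3078"]]
print(holder)

# Coding in the other direction: find the box number for a given holder
box <- names(box_holders)[box_holders == "Mrs. A. Scrivena"]
print(box)   # "3076"
```

The box numbers are stored as character strings rather than numbers, which reflects the fact that they are labels, not measurements.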


Tips and Notes

  • Where possible, avoid recoding, rounding, pooling or collapsing data when you record it. Data thus lost is seldom recoverable, and this can be a source of needless and irritating error.
  • Beware: many people find transformed data hard to interpret, especially data on non-linear scales.
  • Be especially cautious when interpreting statistics or graphs based upon pooled or collapsed data.
    Ask yourself:
    "why were those groups pooled?" or
    "why was that particular breakpoint or criterion chosen?"
    Arbitrary ill-considered criteria can produce surprisingly misleading results!
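
To see how much a breakpoint can matter, here is a small R sketch. The temperatures and both breakpoints are invented purely for illustration and carry no clinical meaning:

```r
# Hypothetical body temperatures in °C, invented for illustration
temps <- c(36.4, 36.8, 37.1, 37.3, 37.6, 38.2)

# Collapse the same data to 'normal'/'abnormal' at two different breakpoints
at_370 <- ifelse(temps > 37.0, "abnormal", "normal")
at_375 <- ifelse(temps > 37.5, "abnormal", "normal")

# Count how many readings fall in each class under each criterion
print(table(at_370))   # 4 abnormal, 2 normal
print(table(at_375))   # 2 abnormal, 4 normal
```

The data are identical in both cases; only the criterion has changed.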


Test yourself

  1. Assuming you have a set of values in Y, could a square transformation (Y^2) change their rank?
    Hint: what is (-1)^2?

  2. If, whenever you weigh yourself, you truncate your weight to whole kilograms, what effect would that have upon the average of those weights?

  3. A biologist records the number of spiders on each leaf using the following scale:
    • Up to 20:       write the actual count
    • From 20 to 100: round to the nearest 10
    • More than 100:       write 'over 100'

    Any comments?

    Hint: how would he find their mean or maximum? What might this do to their median?

  4. Given PO Box numbers are a recoding of the holders' names, would the average of those box numbers provide a useful summary of their box-holders?
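
Once you have thought about questions 1 and 2, you could check your reasoning in R. This is a sketch with hypothetical values of our own choosing; run it and compare the output with your answers:

```r
# Question 1: compare the ranks before and after squaring
y <- c(-3, -1, 0, 2)
ranks_before <- rank(y)
ranks_after <- rank(y^2)
print(ranks_before)
print(ranks_after)

# Question 2: compare the mean before and after truncating to whole kilograms
weights <- c(70.2, 70.9, 71.5, 69.8)
mean_full <- mean(weights)
mean_truncated <- mean(trunc(weights))
print(mean_full)
print(mean_truncated)
```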


Useful references

Hopkins, W.G. (2000). A new view of statistics: the log transformation.
A useful account of the log transformation and its uses.


Keene, O.N. (1995). The log transformation is special. Statistics in Medicine 14 (8), 811-819.
Argues that in medical research the log transformation should frequently be preferred to untransformed analyses.


O'Hara, R. B. & Kotze, D. J. (2010). Do not log-transform count data. Methods in Ecology and Evolution 1, 118-122.
Advises ecological researchers against log transforming count data, especially if there are zeroes present, and instead recommends the use of generalized linear models.


Wikipedia: Data transformation (statistics).