InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

Simple ways to re-scale a variable using R

Collapsing to class intervals

In order to collapse values to class-intervals you need to specify breakpoints for those intervals - or at least to specify how many (equal-width) class intervals should be used. One way to achieve this is to use R's histogram function, for example as follows:
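A minimal sketch of such instructions, assuming y holds the 30 values implied by the sorted listing and the other outputs shown on this page (a reconstruction), and that i is the number of intervals requested:

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

i <- 4                                              # number of class intervals wanted
counts <- hist(y, breaks = i, plot = FALSE)$counts  # frequencies only, no plot drawn
counts
```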

Gave us:

[1] 5 9 14 2

    Note:
  • Because the result of evaluating these instructions is an array of numbers, R uses a [1] to tell you the value to its right (in this case 5) is the first value of your result.
  • These instructions yield the number of values of y that fell within each interval - so, given 4 intervals, they produce 4 frequencies. They do not recode each value of y according to which interval it fell within. To recode the values of y to their interval midpoints, you could use these instructions:

  • This code sets the number of intervals (i=4) so, if you need to find what breakpoints were actually used, you could enter hist(y,breaks=i,plot=FALSE)$breaks - which, in this case, gave: [1] 400 450 500 550 600. To obtain the midpoints, rather than the breakpoints, use hist(y,breaks=i,plot=FALSE)$mids
  • The following instructions would enable you to set those breakpoints yourself:

  • Be aware that, although these breakpoints do not have to be given in ascending order, and their intervals do not have to be the same width (these were not), the lowest and highest breakpoints must not lie inside the range (minimum to maximum value) of y - between them, the breakpoints must span every value of y.

    For example, the following instructions would give an error message if applied to the values of y we assigned above.
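The notes above can be sketched as follows. This is only an illustration: y is the reconstructed data set, the breakpoints in br and bad are hypothetical values chosen for the example, and cut() is one of several ways to assign values to intervals.

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

# Recode each value of y to the midpoint of its interval
h <- hist(y, breaks = 4, plot = FALSE)
k <- cut(y, breaks = h$breaks, labels = FALSE,
         include.lowest = TRUE)          # interval number of each value, as hist() counts them
y_mid <- h$mids[k]

# Breakpoints set by hand (hypothetical): unequal widths, but spanning range(y)
br <- c(400, 500, 525, 600)
counts_br <- hist(y, breaks = br, plot = FALSE)$counts

# Outer breakpoints lying inside the range of y: hist() stops with an error
bad <- c(450, 500, 550)
msg <- tryCatch(hist(y, breaks = bad, plot = FALSE)$counts,
                error = function(e) conditionMessage(e))
msg                                      # the error message, captured as text
```

Wrapping the last call in tryCatch() is only there so the sketch runs to the end; entered directly, the same call simply halts with that message.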

 

Collapsing to ranks

The simplest way to collapse a set of (n) values to their ranks is to sort those values into ascending order, number them from 1 to n, and accept those numbers as their ranks - for example as shown below:
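A minimal sketch of those instructions, assuming y holds the 30 values listed in the output that follows (entered here in the order implied by the other outputs on this page):

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

sort(y)        # show y in ascending order
1:length(y)    # show the rank of y
```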

Gave us:

> sort(y) # show y in ascending order
 [1] 420 430 430 445 450 460 470 475 480 485 490
[12] 495 495 500 505 510 520 520 520 530 530 535
[23] 535 535 540 545 545 545 570 570
> 1:length(y) # show the rank of y
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
[16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Note:

  • For clarity, we have shown our instructions in red
  • This code assumes there are no missing values, and 'sequentially' gives every value a different (unique) whole-numbered rank.
  • If you want to obtain the rank of y, without reordering it, you can get the same result using the rank function as follows:

  • ties.method specifies how ties are treated: ties.method='first' gives tied observations sequential ranks in their order of appearance, 'random' puts those ranks in random order, 'average' (the default) replaces them by their mean, and 'max' and 'min' replace them by their maximum and minimum respectively.

  • When every value of y is different, the following instructions will produce the same result - but if two or more values are the same (tied), it gives the average rank of tied values.
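The rank function described in these notes might be used as sketched below. Here y is the reconstructed data set, and the object names (r_first and so on) are just labels chosen for the illustration.

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

r_first <- rank(y, ties.method = "first")  # unique sequential ranks, y left in its original order
r_avg   <- rank(y)                         # default: tied values share their average rank
r_min   <- rank(y, ties.method = "min")    # tied values all get their lowest rank
r_max   <- rank(y, ties.method = "max")    # tied values all get their highest rank

# The two values of 430 are the 15th and 18th elements of y
rbind(first = r_first[c(15, 18)], average = r_avg[c(15, 18)],
      min = r_min[c(15, 18)], max = r_max[c(15, 18)])
```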

 

 

Collapsing to binary (I)

A common way of collapsing the values in a variable (y) to a binary scale is to decide upon a 'breakpoint' (x), and classify each value of y as being either above that breakpoint or not above it. For example, you could use the following instructions:
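A minimal sketch, assuming y is the reconstructed data set and the breakpoint is x = 500 (a value inferred from the output, not stated in the text):

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

x <- 500            # breakpoint (inferred, see above)
b <- (y > x) * 1    # TRUE/FALSE multiplied by 1 becomes 1/0
b
```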

Gave us:

 [1] 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 0
[24] 0 0 1 0 0 0 0

    Note:
  • To replace the values in y with their recoded values use y=(y>x)*1 instead of (y>x)*1
  • If you use y>x to collapse y, instead of (y>x)*1, the result would be TRUE or FALSE rather than 1 or 0.
  • By using y>x, any values of y that are exactly equal to x are recoded to 0 not 1 - which automatically introduces a bias. Removing that bias without introducing others is harder than you might think. (One solution is to randomly recode half of the offending values as 0, and the remainder as 1.)
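The random recoding of tied values suggested in the last note could be sketched as follows. The breakpoint of 520 is hypothetical, chosen only because three values of the reconstructed y equal it, and set.seed() is there just to make the sketch repeatable.

```r
set.seed(1)  # only so that the illustration is repeatable

# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

x <- 520                      # hypothetical breakpoint with three tied values
b <- (y > x) * 1              # values equal to x are 0 at this point
ties <- which(y == x)         # positions of the values exactly equal to x
n <- length(ties)

# As near to half 0s and half 1s as n allows; when n is odd,
# which code gets the extra value is itself decided at random
codes <- rep(sample(c(0, 1)), length.out = n)
b[ties] <- codes[sample.int(n)]   # hand the codes out in random order
b[ties]
```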

Collapsing to binary (II)

There are any number of ways of collapsing values to a binary scale - depending upon what sort of values need to be collapsed - and what form of binary values you require. (For example 0/1, TRUE/FALSE, or accept/reject, or commonplace/divergent.) Of these, classifying which values fall within a specified range is especially popular - once you have decided upon a suitable range. For instance the following instructions classify values (in variable y) according to whether they fall outside a specified range.
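A minimal sketch, assuming y is the reconstructed data set. The breakpoints x1 = 460 and x2 = 540 are assumed values that reproduce the output shown, and the comparison uses <= and >= so that values equal to a breakpoint count as outside the range:

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

x1 <- 460                       # lower breakpoint (assumed)
x2 <- 540                       # upper breakpoint (assumed)
outside <- y <= x1 | y >= x2    # TRUE when a value is not strictly inside the range
outside
```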

Gave us:

 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [8]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[15]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
[22]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
[29] FALSE FALSE

Note:

  • To collapse y to 1/0 rather than TRUE/FALSE, use (y<=x1 | y>=x2)*1
  • If any value in y is exactly equal to the breakpoints it is classified as TRUE, in other words, only values within those breakpoints are classified as FALSE. Whilst this is the conventional rule, it automatically introduces bias.
  • If you wish to classify values according to, for example their 95% 'reference interval' you could use these instructions:

  • names=FALSE asks the quantile function to only give the quantile values, rather than their names (in this case 5% and 95%).
  • To calculate what range encloses the most typical 90% of the values of variable y, you could use:

  • In which case, quantile(y, p = 0.75, names = FALSE) - quantile(y, p = 0.25, names = FALSE) is the same as the interquartile range of y, IQR(y).
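The quantile-based notes above might be sketched like this, assuming y is the reconstructed data set and taking the 5% and 95% quantiles named above, which between them enclose the central 90% of values. The object names are illustrative only.

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

# Lower and upper limits, as plain numbers without the "5%" and "95%" names
q <- quantile(y, probs = c(0.05, 0.95), names = FALSE)

outside <- y <= q[1] | y >= q[2]   # classify values on or outside those limits
width90 <- q[2] - q[1]             # width of the range enclosing the central 90%

# The same subtraction with the quartiles gives the interquartile range
iqr <- quantile(y, probs = 0.75, names = FALSE) -
       quantile(y, probs = 0.25, names = FALSE)
c(width90 = width90, iqr = iqr, IQR = IQR(y))
```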

 

Truncating and rounding

It is sometimes necessary to reduce the accuracy of information, most commonly because you can only record it to a set number of decimal places. Perhaps the simplest way to collapse values to whole-numbers (integers) is to 'truncate' them by discarding anything to the right of their decimal point - for example using these instructions:
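A minimal sketch, assuming y holds the ten values that appear in the detransformed listings further down this page:

```r
# y: taken from the detransformed output shown later on this page (an assumption)
y <- c(0.8245746, 0.8002361, 1.5467949, 0.6834304, 1.2829302,
       5.2123680, 1.0356487, 0.6603671, 30.7693175, 0.3076987)

y_trunc <- trunc(y)   # discard everything to the right of the decimal point
y_trunc
```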

Gave us:

[1] 0 0 1 0 1 5 1 0 30 0

Note:

  • Whereas truncating positive values rounds them down to the nearest whole number, truncating negative values rounds them up.
  • If all the values of y are positive, truncation will almost certainly reduce their mean - in other words it introduces a bias in their location.

    A common way to reduce that bias is to add 0.5 to each value before truncation - but this assumes the values in y are positive.

  • One way to avoid these problems is to 'round' to the nearest whole number, for instance using round(y)
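The effect of truncation on the mean, and the alternatives mentioned in these notes, can be compared with a short sketch. It assumes the same ten values of y as appear in the listings later on this page, and the two-value vector v is made up for the illustration.

```r
# y: taken from the detransformed output shown later on this page (an assumption)
y <- c(0.8245746, 0.8002361, 1.5467949, 0.6834304, 1.2829302,
       5.2123680, 1.0356487, 0.6603671, 30.7693175, 0.3076987)

m_raw   <- mean(y)               # mean of the original values
m_trunc <- mean(trunc(y))        # truncation pulls the mean of positive values down
m_half  <- mean(trunc(y + 0.5))  # adding 0.5 first (positive values only)
m_round <- mean(round(y))        # rounding to the nearest whole number
c(raw = m_raw, trunc = m_trunc, half = m_half, round = m_round)

# Truncation moves values towards zero, so negative values are rounded up
v <- c(-1.7, 1.7)                # made-up values
trunc(v)
round(v)
```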

 

Linear transformations

The simplest linear transformation is to shift the location of every value by adding a constant, for instance as follows:
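A minimal sketch, assuming y is the reconstructed data set and the constant added is -500 (a value inferred from the output, not stated in the text):

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

shift <- -500          # constant added to every value (inferred, see above)
y_shift <- y + shift
y_shift
```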

Gave us:

 [1] -55  30  40  10  70  30  45  45   5  35 -50   0  20 -40
[15] -70  20  20 -70  35  35 -25  45 -80  -5 -15  70 -20  -5
[29] -30 -10

    Note:
  • Adding the same constant to every value of y, uniformly shifts their location - so mean(y+shift) always equals mean(y)+shift, and median(y+shift) equals median(y)+shift.
  • Shifting a set of values is a common way to ensure there are no negative values. If the shift is minus their minimum value, it changes their minimum to zero; if the shift is minus their mean, it changes those values to deviations from that mean, so the mean of the shifted values is zero.
  • Shifting does not affect the dispersion (range) of values, in other words it has no effect upon the maximum-minimum.
To uniformly alter the dispersion (range) of y, you multiply every value by a constant. For example as follows:
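A minimal sketch, assuming y is the reconstructed data set and the multiplier is 0.1 (a value inferred from the output, not stated in the text):

```r
# y: reconstructed from the outputs shown elsewhere on this page (an assumption)
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

scale_by <- 0.1          # constant multiplying every value (inferred, see above)
y_scaled <- y * scale_by
y_scaled
```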

Gave us:

 [1] 44.5 53.0 54.0 51.0 57.0 53.0 54.5 54.5 50.5 53.5 45.0
[12] 50.0 52.0 46.0 43.0 52.0 52.0 43.0 53.5 53.5 47.5 54.5
[23] 42.0 49.5 48.5 57.0 48.0 49.5 47.0 49.0

    Note:
  • Unless the mean is zero, changing the range will also change the mean.
In some situations it is useful to change both the location and the dispersion. There are two ways of doing this:
  1. Change (a) the location, then (b) the dispersion.
      For instance, using (y+a)*b
    Or,
  2. Change the dispersion (b), then the location (a).
      For instance, using (y*b)+a
Since these usually give different results, which is most appropriate depends upon the situation.
  • Notice that, provided b is not zero, if x=(y-a)/b then y=bx+a

    So, if (y-a)/b linearly transforms y to x, then bx+a de-transforms x back to y - provided that a and b remain the same throughout.
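To see that the order of operations matters, and that the transformation can be undone, consider this short sketch. The three values of y and the constants a and b are hypothetical, chosen only for the illustration.

```r
y <- c(420, 500, 570)   # three illustrative values
a <- -500               # location constant (hypothetical)
b <- 0.1                # dispersion constant (hypothetical, must not be zero)

loc_then_disp <- (y + a) * b   # 1. change the location, then the dispersion
disp_then_loc <- (y * b) + a   # 2. change the dispersion, then the location
rbind(loc_then_disp, disp_then_loc)

# Transform y to x, then de-transform x back to y using the same a and b
x <- (y - a) / b
y_back <- b * x + a
all.equal(y_back, y)
```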

 

 

Non-linear transformations

Potentially there are infinitely many types of non-linear transformation, but in practice these are limited by the ingenuity of mathematicians and what might be useful - or interpretable. Three of the more popular transformations are the logarithmic, square-root, and reciprocal. Applying these transformations, and their detransformations can be quite straightforward.

For example:
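A minimal sketch of such instructions, assuming y holds the ten values shown in the detransformed output. Enclosing an assignment in brackets makes R display the result as well as storing it.

```r
# y: taken from the detransformed output shown on this page (an assumption)
y <- c(0.8245746, 0.8002361, 1.5467949, 0.6834304, 1.2829302,
       5.2123680, 1.0356487, 0.6603671, 30.7693175, 0.3076987)

(x <- log10(y))   # log transform & display
10^x              # display detransformed data
(x <- log(y))     # ln transform & display
exp(x)            # display detransformed data
(x <- sqrt(y))    # sqrt transform & display
x^2               # display detransformed data
(x <- 1/y)        # reciprocally transform & display
1/x               # display detransformed data
```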

Gave us:

> (x=log10(y)) # log transform & display
[1] -0.08377005 -0.09678186  0.18943273 -0.16530571
[5]  0.10820303  0.71703507  0.01521246 -0.18021457
[9]  1.48811786 -0.51187434
> 10^x # display detransformed data
[1]  0.8245746  0.8002361  1.5467949  0.6834304  1.2829302
[6]  5.2123680  1.0356487  0.6603671 30.7693175  0.3076987
> (x=log(y)) # ln transform & display
[1] -0.19288766 -0.22284847  0.43618498 -0.38063046
[5]  0.24914668  1.65103426  0.03502799 -0.41495939
[9]  3.42651801 -1.17863422
> exp(x) # display detransformed data
[1]  0.8245746  0.8002361  1.5467949  0.6834304  1.2829302
[6]  5.2123680  1.0356487  0.6603671 30.7693175  0.3076987
> (x=sqrt(y)) # sqrt transform & display
[1] 0.9080609 0.8945592 1.2437021 0.8266985 1.1326651
[6] 2.2830611 1.0176683 0.8126297 5.5470098 0.5547060
> x^2 # display detransformed data
[1]  0.8245746  0.8002361  1.5467949  0.6834304  1.2829302
[6]  5.2123680  1.0356487  0.6603671 30.7693175  0.3076987
> (x=1/y) # reciprocally transform & display
[1] 1.21274655 1.24963120 0.64649812 1.46320679 0.77946563
[6] 0.19185138 0.96557839 1.51430924 0.03249991 3.24993248
> 1/x # display detransformed data
[1]  0.8245746  0.8002361  1.5467949  0.6834304  1.2829302
[6]  5.2123680  1.0356487  0.6603671 30.7693175  0.3076987

    Note:
  • Provided every value of y is positive, none of these transformations change the order (rank) of values within y, with the exception of reciprocals, which reverse it - and (unlike collapsing data) none of these transformations discard any information.
  • R can deal with some expressions usually considered as impossible.
      For example, R evaluates:
    • log10(0) and log(0) as -Inf,
    • 1/0 as Inf,
    • 1/Inf and 10^-Inf and exp(-Inf) as 0.
    But infinitely large numbers have nasty properties.
      For example, if you evaluate the geometric mean of 0, 1000 & 1000000 as 10^mean(log10(c(0,1000,1000000))), or as (0*1000*1000000)^(1/3), the result is zero - and the same is true of any set of numbers containing a zero. In other words those geometric means are biased. This bias can be reduced by shifting every value prior to transformation (by adding a small constant, such as 1), then subtracting that constant following detransformation. Notice however, if your data contains many zeroes and the remaining values are small, using a shift of 1 introduces its own (very noticeable) bias, and a smaller constant (e.g. 0.1 or 0.01) may be preferable.
  • R cannot evaluate the log or square root of a negative number (it returns NaN, with a warning), and the log of zero is -Inf. So, if any zeroes or negative numbers are present, you must shift your values to ensure every value is positive before applying those transformations.
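The behaviour described in these notes can be checked with a short sketch. The three values are those of the geometric mean example above; the shifting constant k = 1 is the one suggested there, and the object names are illustrative.

```r
v <- c(0, 1000, 1000000)

log10(0)                          # -Inf
1/0                               # Inf
c(1/Inf, 10^-Inf, exp(-Inf))      # all 0

# A single zero forces the geometric mean to zero
gm <- 10^mean(log10(v))

# Shift before transforming, then remove the shift after detransforming
k <- 1                            # shifting constant
gm_shift <- 10^mean(log10(v + k)) - k
c(unshifted = gm, shifted = gm_shift)

# Negative values give NaN (R also issues a warning, suppressed here)
neg <- suppressWarnings(c(log(-1), sqrt(-1)))
neg
```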