Collapsing a variable
The table below shows the weights of 7 mice (in grams). Weight is a continuous measurement variable. We have first grouped our 7 mice into 4 class intervals, namely 0-9 grams, 10-19 grams, 20-29 grams and 30-39 grams. Note that this is precisely equivalent to a rounding operation, in this case rounding the weights to the class mid-points of 5, 15, 25 and 35 grams. Our data now have a discrete distribution, even though the underlying distribution is continuous.
Mouse weights data collapsed in different ways
Weight (grams): 7, 9, 19, 21, 22, 29, 31
Next we have collapsed the data to rank order. Note that the rank implies nothing about the magnitude of the difference between the categories. The differences between ranks are not assumed to be equal, but each category has a clear relationship to every other category.
Lastly we have collapsed the data to two arbitrary categories of small and large to give a binary variable. We have not specified the limits of these categories, so we can no longer carry out any arithmetic operations on them.
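The three ways of collapsing the mouse weights can be sketched in a few lines of Python. This is only an illustration: the helper names are our own, the ranking assumes there are no tied weights, and the 20 gram cutoff for the binary variable is purely hypothetical, since the limits of the small and large categories are not specified above.

```python
# Weights of the 7 mice, taken from the table above (grams).
weights = [7, 9, 19, 21, 22, 29, 31]

WIDTH = 10  # class interval width in grams


def class_interval(weight, width=WIDTH):
    """Return the label of the class interval containing weight, e.g. '20-29'."""
    lower = (weight // width) * width
    return f"{lower}-{lower + width - 1}"


def class_midpoint(weight, width=WIDTH):
    """Return the mid-point that the class interval rounds the weight to."""
    return (weight // width) * width + width / 2


intervals = [class_interval(w) for w in weights]
midpoints = [class_midpoint(w) for w in weights]

# Rank order: 1 = lightest. This simple version assumes no tied weights.
ordered = sorted(weights)
ranks = [ordered.index(w) + 1 for w in weights]

# Binary variable. ASSUMPTION: the text does not give the limits of the
# small/large categories; the 20 g cutoff here is purely hypothetical.
HYPOTHETICAL_CUTOFF = 20
size = ["small" if w < HYPOTHETICAL_CUTOFF else "large" for w in weights]

print(intervals)
print(midpoints)
print(ranks)
print(size)
```

Each step discards information: the mid-points still support arithmetic, the ranks keep only the order, and the binary labels keep only which side of the cutoff a mouse falls on.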
Transforming a variable
Linear transformation (coding)
One common example of coding is to multiply each data point by ten (or one hundred or one thousand) to eliminate the decimal point prior to calculation of the mean or some other statistic. This reduces the risk of errors during data entry. After calculations are complete, you decode by dividing the mean by the same factor (ten in this case). If you are using old computer programmes, you may have to recode data in this way to prevent 'overflows' or serious rounding errors.
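As a minimal sketch of this kind of coding, the Python below multiplies some measurements by ten, takes the mean of the whole numbers, and then decodes. The three measurements are made-up values used only for illustration.

```python
from statistics import mean

# ASSUMPTION: made-up measurements, each recorded to one decimal place.
raw = [1.2, 3.4, 2.6]

FACTOR = 10  # multiply by ten to eliminate the decimal point

# Code: round() guards against tiny floating-point errors after multiplying.
coded = [round(x * FACTOR) for x in raw]

# Calculate the statistic on the coded (whole number) data.
coded_mean = mean(coded)

# Decode: divide the mean by the same factor.
decoded_mean = coded_mean / FACTOR

print(coded, coded_mean, decoded_mean)
```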
Another common use of coding is when you want to calculate a mean from a frequency distribution.
The frequency distribution here shows the weight distribution of a small herd of zebu cattle.
We can calculate the mean from this distribution by multiplying each class midpoint by the number of observations in that class. We then add these together and divide by the total number of observations to give a mean of 207 kg.
Total: N = 40
A quicker and easier way to do this would be to recode the class midpoints, so that they become 0, 1, 2 etc. This is achieved by subtracting the value of the lowest class midpoint (in this case 185) from each midpoint, and dividing by the class interval (in this case 10).
We can decode the coded mean (2.2) by multiplying by 10 and adding 185, to give the same mean of 207 kg. This method can be useful if you are out in the field, and need to calculate a mean value when you do not have access to a calculator or computer.
Coded mean = 2.2
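The direct and coded calculations can be compared with a short Python sketch. The lowest class midpoint (185) and the class interval (10) come from the text, but the frequency table itself is not reproduced here, so the six classes and their frequencies below are hypothetical values chosen only to be consistent with N = 40 and a coded mean of 2.2. They should not be read as the original data.

```python
LOWEST_MIDPOINT = 185  # from the text (kg)
CLASS_INTERVAL = 10    # from the text (kg)

# ASSUMPTION: hypothetical classes and frequencies, chosen only so that
# N = 40 and the coded mean is 2.2. These are not the original data.
class_midpoints = [185, 195, 205, 215, 225, 235]
frequencies = [4, 8, 12, 10, 4, 2]

n = sum(frequencies)

# Direct method: multiply each midpoint by its frequency, sum, divide by N.
direct_mean = sum(m * f for m, f in zip(class_midpoints, frequencies)) / n

# Coded method: recode the midpoints to 0, 1, 2, ...
coded_midpoints = [(m - LOWEST_MIDPOINT) // CLASS_INTERVAL for m in class_midpoints]
coded_mean = sum(c * f for c, f in zip(coded_midpoints, frequencies)) / n

# Decode: multiply by the class interval and add the lowest midpoint.
decoded_mean = coded_mean * CLASS_INTERVAL + LOWEST_MIDPOINT

print(n, direct_mean, coded_mean, decoded_mean)
```

The coded sums involve only small whole numbers, which is what makes the method practical by hand.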
Coding may also be required before or after a non-linear transformation. If there are zeros in the data, coding is essential before you can carry out a log transformation, because you cannot take the log of zero. The commonest solution is to add one to all observations before transforming. Beware, however: if you have many zeros or low numbers, adding one will substantially bias your results. After calculations are complete, remember to decode by subtracting one.
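A minimal sketch of the add-one coding in Python follows. The counts are made-up values chosen so that the logarithms come out as whole numbers, and base 10 logarithms are assumed.

```python
import math

# ASSUMPTION: made-up counts that include a zero.
counts = [0, 9, 99]

# Code by adding one, then transform. log10(0) is undefined, log10(0 + 1) is 0.
logged = [math.log10(x + 1) for x in counts]

mean_log = sum(logged) / len(logged)

# Detransform (antilog), then decode by subtracting one.
detransformed_mean = 10 ** mean_log - 1

print(logged, mean_log, detransformed_mean)
```

Here the detransformed mean is far below the arithmetic mean of the raw counts (36), which is expected for a mean calculated on the log scale.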
Coding may also be done after a non-linear transformation. For example, mean insect catches may vary from 0.1 to 1000. Expressing results as their logarithms (base 10) condenses this to a range of -1 to 3. Since some workers find graphs containing negative numbers difficult to assimilate, such results may be recoded to yield all positive numbers, by the addition of one to their logarithm. A more familiar example to chemists is pH. This is the negative logarithm (base 10) of the hydrogen ion concentration of a solution. Optical densities are commonly rendered positive for the same reason.
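The recoding of logarithms described above can be sketched as follows. The end points 0.1 and 1000 come from the text; the intermediate values are hypothetical and are included only to fill out the range.

```python
import math

# End points from the text; intermediate values are hypothetical.
mean_catches = [0.1, 1, 10, 100, 1000]

# Transform: base 10 logarithms run from -1 to 3.
logs = [math.log10(x) for x in mean_catches]

# Code after transforming: add one so that no value is negative.
recoded = [v + 1 for v in logs]

# To recover the original scale, subtract one before taking the antilog.
decoded = [10 ** (v - 1) for v in recoded]

print(logs)
print(recoded)
print(decoded)
```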
Do not confuse coding and scoring. Ordinal data may be scored, to indicate their relative magnitude. For example, low, medium and high levels of infestation with a pest could be scored as 0, 1 and 2. Sometimes scores are arbitrarily applied, for example to different eye colours, and do not imply any ranking. Whichever is the case, the mean of scored data is meaningless because we do not know how greatly the measurements differ.
Non-linear transformations are a major topic, and you need to understand more about distributions before going into them in depth. For now we will just give an example of the logarithmic transformation:
We will take as an example data on the number of Plasmodium trophozoites per mm³ in the blood of patients about to be treated for malaria. As is commonly the case with parasitic infections, one patient has a very large number of parasites compared to the rest. The arithmetic mean of 112,837 is unduly influenced by this high number. Instead we transform the data by taking the logarithm of each number, and work out the mean of the transformed data:
Mean of transformed data = (4.63 + 4.81 + 4.58 + 5.60 + 4.88 + 4.73) / 6 = 4.87
Detransformed mean = antilog (4.87) = 74,131
The detransformed mean of log transformed data is known as the geometric mean to distinguish it from the arithmetic mean. The geometric mean provides a more reasonable measure of location of the distribution than the arithmetic mean, since it is much less affected by the single high value. We will look at the geometric mean in more depth in the More Information page on measures of location.
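The calculation of the geometric mean above can be reproduced with a short Python sketch. It starts from the six base 10 logarithms given in the text, since the raw counts are not listed here. Like the worked example, it rounds the mean logarithm to two decimal places before taking the antilog; skipping that rounding gives a slightly higher value.

```python
# Base 10 logarithms of the six parasite counts, as given in the text.
logs = [4.63, 4.81, 4.58, 5.60, 4.88, 4.73]

# Mean of the transformed data.
mean_log = sum(logs) / len(logs)

# The worked example rounds the mean to two decimal places (4.87).
mean_log_rounded = round(mean_log, 2)

# Detransform: the antilog of the mean log is the geometric mean.
geometric_mean = 10 ** mean_log_rounded    # about 74,131, as in the text
geometric_mean_unrounded = 10 ** mean_log  # antilog without the rounding step

print(mean_log_rounded, round(geometric_mean), round(geometric_mean_unrounded))
```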