InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

Beginners statistics introduction

Types of data


Example

Consider these 5 items of information about one volunteer in a survey:

S.T. Nugent, smoker, feels good today, big eater, weighs 84 kg

Assuming 5 similar items of information (variables) are recorded for each volunteer, it should be obvious that some of these variables convey more information than others.

Variable:   Quality of such data:

  1. S.T. Nugent (a nominal value) tells us the volunteer's name. This value cannot be ranked, other than by some external criterion (such as its position in a list, or by how many letters the name has). Also, we cannot be certain there is only one S.T. Nugent - a volunteer number plus date would be a more reliable unique identifier.
  2. smoker implies he/she could also be classified as a non-smoker, but not both at once. This value, like gender, presumably has just two possible values (a binary value). Notice smoker/nonsmoker cannot be ranked, other than by some external criterion (e.g. cancer risk, prosperity, or social bonding).
  3. feels good today could be ranked without having to apply some external criterion (it is an ordinal value). In other words we assume some days the volunteer may feel 'not-so-good', other days feels 'really good', and so forth. This scale of values may be internally consistent to this S.T. Nugent, but we cannot impartially compare these values to those of any other volunteer - S.T. Nugent may be especially privileged or disadvantaged or neurotic for all we know.
  4. big eater is also an ordinal value. Furthermore, if volunteers were assessed by an impartial observer, that assessment should be comparable to other volunteers. However this measure does not tell us how much a 'big eater' eats (e.g. kcal or kg per day).
  5. weighs 84 kg (a measurement value) clearly conveys the most information - even allowing for the fact it may have been rounded from 84.3 or 83.8127 kg.

Lastly note that, when these 5 variables are used to record information from each volunteer, or repeatedly from one volunteer, we can describe these as 'multi-variate' data. If we had only recorded two things about each volunteer, such as height and weight, this would be bi-variate. In other words, by recording more than one variable for each volunteer, we can see if those values are associated. For instance, we might expect volunteers' weight will be related to food intake - but is that a simple relationship?
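
As a rough sketch of how one such multi-variate record might be stored, the Python below gives each of the 5 variables a type that reflects how much information it carries. The class names, field names and the extra levels (such as 'small eater') are assumptions made for illustration; they are not part of the survey described above.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Smoking(Enum):
    """Binary (nominal) value: two labels, no inherent order."""
    NON_SMOKER = "non-smoker"
    SMOKER = "smoker"


class Mood(IntEnum):
    """Ordinal value: ordered levels, but the gaps between them are unknown."""
    NOT_SO_GOOD = 1
    GOOD = 2
    REALLY_GOOD = 3


class Appetite(IntEnum):
    """Ordinal value (level names are hypothetical)."""
    SMALL_EATER = 1
    AVERAGE_EATER = 2
    BIG_EATER = 3


@dataclass
class Volunteer:
    name: str           # nominal
    smoking: Smoking    # binary
    mood: Mood          # ordinal
    appetite: Appetite  # ordinal
    weight_kg: float    # measurement (ratio scale)


nugent = Volunteer("S.T. Nugent", Smoking.SMOKER, Mood.GOOD,
                   Appetite.BIG_EATER, 84.0)

# Ordinal values can be ranked without any external criterion.
mood_is_rankable = Mood.GOOD < Mood.REALLY_GOOD

# Plain Enum members define no ordering, so trying to rank them fails.
try:
    Smoking.SMOKER < Smoking.NON_SMOKER
    smoking_is_rankable = True
except TypeError:
    smoking_is_rankable = False

print(mood_is_rankable, smoking_is_rankable)
```

One caveat with this choice: IntEnum also permits arithmetic on the levels (for example adding two moods), which an ordinal scale does not justify. It is used here only because it provides ordering cheaply.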


Definition and Use

Whilst the amount of information recorded by each item (datum) can be classified according to any number of criteria, one of the most useful is as follows:
  1. Nominal data (also known as categorical) have any number of non-overlapping values (e.g. a fruit could be classed as an apple or as an orange, but cannot be both) and cannot be ordered. Binary data are nominal data with just 2 mutually-exclusive values (e.g. dead / not dead).
  2. Ordinal data also have any number of non-overlapping values, but they can be ordered or ranked along a scale. In general the rank implies nothing about the magnitude of the difference between the categories.
  3. With measurement data the distance between any two numbers on the scale is of known size. Thus a difference of one pound is the same whether you are referring to 5 pounds or 500000 pounds.
    • Some measurement variables (such as the Celsius scale for temperature) only have an arbitrary zero, in which case they are described as being on an interval scale of measurement. Nevertheless the ratios of differences on the scale are independent of the unit of measurement and of the zero point.
    • When a scale has all the characteristics of an interval scale and also has a true zero point (such as weight and height) it is described as a ratio scale of measurement.
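
The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses made-up temperatures and weights, and the function and variable names are purely illustrative. It converts the same values to different units and compares ratios of values with ratios of differences.

```python
def celsius_to_fahrenheit(c):
    """Change both the unit and the (arbitrary) zero point."""
    return c * 9 / 5 + 32


def kg_to_pounds(kg):
    """Change the unit only; zero kg is still zero pounds."""
    return kg * 2.20462  # approximate conversion factor


# Interval scale: three illustrative temperatures in Celsius.
a_c, b_c, c_c = 10.0, 20.0, 30.0
a_f, b_f, c_f = (celsius_to_fahrenheit(t) for t in (a_c, b_c, c_c))

# Ratios of values depend on where zero happens to sit ...
ratio_of_values_c = b_c / a_c
ratio_of_values_f = b_f / a_f

# ... but ratios of differences do not.
ratio_of_diffs_c = (c_c - a_c) / (b_c - a_c)
ratio_of_diffs_f = (c_f - a_f) / (b_f - a_f)

# Ratio scale: two illustrative weights.
w1_kg, w2_kg = 42.0, 84.0
ratio_kg = w2_kg / w1_kg
ratio_lb = kg_to_pounds(w2_kg) / kg_to_pounds(w1_kg)

print("values, C vs F:     ", round(ratio_of_values_c, 2), round(ratio_of_values_f, 2))
print("differences, C vs F:", round(ratio_of_diffs_c, 2), round(ratio_of_diffs_f, 2))
print("weights, kg vs lb:  ", round(ratio_kg, 2), round(ratio_lb, 2))
```

So '20 degrees is twice as hot as 10 degrees' is true only in Celsius, whereas '84 kg is twice as heavy as 42 kg' holds in any unit of weight.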

This classification is useful in many ways, not least since it indicates what sort of summary variables may be of use, and what possible outcomes such summaries may have. For instance the mean of a variable consisting exclusively of zeroes and ones cannot be greater than one or less than zero, but can be anywhere in that range - whereas its median has at most just 3 possible values (0, 0.5 or 1).
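
The claim about means and medians of zeroes and ones can be checked by brute force. This is a minimal sketch using only Python's standard library; the upper limit of 8 observations is an arbitrary choice to keep the enumeration small.

```python
from itertools import product
from statistics import mean, median

means, medians = set(), set()

# Enumerate every possible sample of 1 to 8 binary observations.
for n in range(1, 9):
    for sample in product((0, 1), repeat=n):
        means.add(mean(sample))
        medians.add(median(sample))

print(sorted(medians))         # [0, 0.5, 1]
print(min(means), max(means))  # the mean never leaves the range 0 to 1
print(len(means))              # many distinct means, and more as n grows
```

The median is 0.5 only when the number of observations is even and the two middle values differ; otherwise it is 0 or 1.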


Tips and Notes

  • Notice that, whilst nominal data can be converted to numbers, such recoding cannot add any information. So, if we substituted a number for each volunteer's name, calculating their average is unlikely to tell us much of use!

  • Whilst recoding and rescaling cannot add information, they can easily discard it. Rounding or truncating weights to whole kilograms obviously discards information, as would classifying weights in kg as underweight, normal or overweight. Replacing data by their ranks also discards information: for instance saying volunteer x was the heaviest does not tell us whether she was slightly overweight or morbidly obese - or normal, since the other volunteers were anorexic.
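
A short sketch of how each recoding discards information is given below. The weights are made up, and the cut-offs in weight_class are hypothetical (a realistic classification would also need each volunteer's height). In each case different inputs collapse to the same output, so the original values cannot be recovered.

```python
def weight_class(kg):
    """Classify a weight using hypothetical cut-offs, for illustration only."""
    if kg < 55:
        return "underweight"
    if kg < 80:
        return "normal"
    return "overweight"


def ranks(values):
    """Return the rank (1 = lightest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Rounding: two different weights become indistinguishable.
rounded = [round(w) for w in (84.3, 83.8127)]

# Classifying: a slightly and a very heavy volunteer share one label.
classes = [weight_class(w) for w in (80.1, 140.0)]

# Ranking: two very different groups of volunteers give identical ranks.
group_a = [61.4, 72.9, 84.3]
group_b = [38.2, 41.7, 45.0]

print(rounded)
print(classes)
print(ranks(group_a), ranks(group_b))
```

The heaviest volunteer in group_b has rank 3, just like the heaviest in group_a, although she weighs barely half as much.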

  • None of these points allow for how trustworthy, or biased, or variable, these data are - nor how the volunteer was selected, nor how many volunteers there were, nor how often they were examined, nor what if anything these results are supposed to be representative of.

  • Although the classification into nominal, ordinal, interval and ratio is used in nearly every introductory statistics text, it was only proposed in the 1940s. Since then it has been criticized on the basis it is too strict to apply to real data, and often leads to prescriptive analysis based only on the type of variable, rather than all the characteristics of the data.


Test yourself

  • If we arbitrarily replaced each volunteer's name by a number, would we increase the amount of information about that volunteer?
  • If we replaced the recorded weight of each volunteer (kg) by a weight-class (such as underweight, normal, overweight, obese) would this add or lose information?
    Hint: If we recorded volunteers' weight-class, instead of their weight in kg, could we convert their weight back to kg?


Useful references

Altman, D.G. & Royston, P. (2006). The cost of dichotomizing continuous variables. BMJ 332, 1080.
Highlights the cost of dichotomizing continuous variables.


Velleman, P.F. & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician 47 (1), 65-72.
Discusses the inadequacies of classifying variables as being on the nominal, ordinal, interval or ratio scale.


Wikipedia: Level of measurement.
Describes the four main levels of measurement along with a brief account of the debate on the classification scheme.