"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Some terminology

Even when scientists can be persuaded to look at their (raw, unanalyzed) results, they seldom consider the properties inherent to that form of information - or how those properties might affect any statistics they may use to summarize their results. Since the properties of their information have major implications for the validity of their subsequent analysis, this is rather a shame. The first obstacle to be overcome in this matter lies in describing what sort of information you are dealing with. But before we can proceed, we need to define a few key terms.

A datum is a single item of information. In practice, we are interested in a number of items of information for which we use the plural of datum - data.

Data exist as observations. These are classifications or measurements made on the smallest sampling unit, known as the evaluation unit. The evaluation unit is commonly a single individual. The weight of a mouse is a single observation. If we weigh it again, or weigh another mouse, that is a second observation.

Do not confuse observations with events. An event is something that happens, such as a storm. You can make observations on different aspects of an event (or a mouse), such as how big it is, when it begins (or is born, or conceived), how long it survives for, and how it affects things around it.

For their conclusions to be of practical use, scientists need to distinguish treatments they apply from results they observe. Both may be measured, are subject to error, and can apply to the same item - such as a mouse. The difference is the extent to which you can infer that one causes the other.

Inevitably, we are interested in how our observations vary. If they never varied, there would be little reason to make the observations. Thus, each type of observation we make is called a variable. We can formally define a variable as a characteristic of an evaluation unit that can assume a number of different values. There can be any number of variables pertaining to an individual item or event. For example, we may measure the length and width of a tooth. The length is one variable, the width is another. Or we could measure the duration, rainfall and windspeed of a storm, or the temperature, humidity and wind direction preceding it.

There are several criteria we can use to classify variables. The most important are the scale of measurement used, whether the distribution of the variable is discrete or continuous, and whether it is a primary or derived variable.



Scales of measurement

Nominal variables

    A nominal variable (also known as a categorical variable) has a number of categories to which an observation can be assigned. Thus colour is measured on the nominal scale (or categorical scale). For example, fungal cultures could be classified as 'white', 'yellow' or 'green'. Or you may record the number of mice that are white, brown, or brown and white. If you also record whether they are 'furry', 'nude' or 'wrinkled', this is another nominal variable. If you wish to have intermediate categories, you must provide them before you make your measurements. You could assign a number or score to each category, say 1 for furry, 2 for nude and 3 for wrinkled. But these numbers are for convenience only - they do not indicate any formal relationship between the categories. Gender is a special type of nominal variable in that it can only take one of two values - male or female. This is known as a binary or dichotomous variable. Binary variables may be labelled as 0 or 1 or sometimes + or −.

Ordinal variables

    On the ordinal scale different categories can be ranked relative to each other. The descriptions or numbers indicate rank order (see below), so we may use a grading system to classify the severity of disease from mild (1) to moderate (2) to serious (3). An ordinal variable describes the relative magnitude of observations, rather than how much they differ.

    In general with ordinal scales the rank implies nothing about the magnitude of the difference between the categories. However, there is a special type of ordinal scale that attempts to do this and thus may approach a measurement scale - this is known as the visual analogue scale. It is most commonly used in medical research to assess patient conditions that cannot be readily quantified - such as level of pain or state of asthma. A straight line, usually 10 cm in length, is drawn - with for example, 0 mm corresponding to no pain, and 100 mm to the worst pain possible. The patient then has to indicate where on the line his or her pain level lies.

Measurement variables

    In the measurement scale, the distances between any two numbers on the scale are of known size. A given difference is the same wherever it lies on the scale. Thus a difference of one pound is the same whether you are referring to 5 pounds or 50,000 pounds. Examples of measurement variables include length, weight and number of births.

    Some measurement variables have only an arbitrary zero, in which case they are described as being on an interval scale of measurement. For example, the Celsius scale of temperature measurement has an arbitrary zero point. On an interval scale the ratios of differences on the scale are independent of the unit of measurement and of the zero point. Say you have three temperatures on the Celsius scale: 0, 10 and 100 degrees. The equivalent temperatures on the Fahrenheit scale are 32, 50 and 212. The ratio of the differences on the Celsius scale ((100 − 10)/(10 − 0) = 9) is the same as the ratio on the Fahrenheit scale ((212 − 50)/(50 − 32) = 9).
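    The invariance of difference ratios can be checked numerically. This sketch converts the Celsius temperatures above to Fahrenheit (a linear, interval-scale conversion) and compares the two ratios:

```python
# Ratios of differences are unchanged by the linear conversion
# F = 1.8*C + 32, even though the two scales have different zero points.

def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return 1.8 * c + 32

celsius = [0, 10, 100]
fahrenheit = [c_to_f(c) for c in celsius]   # [32.0, 50.0, 212.0]

ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])

print(ratio_c, ratio_f)  # both 9.0
```

    Note that the ratio of the temperatures themselves (100/10 versus 212/50) is not preserved, which is exactly why an interval scale lacks the properties of a ratio scale.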

    When a scale has all the characteristics of an interval scale and also has a true zero point it is described as a ratio scale of measurement. Many measurement variables, such as weight and height, have a true zero point.

Is this classification meaningful?

    Before just accepting the classification above, we should perhaps think a little more carefully about the matter. Although it appears in nearly every introductory statistics text, it was only proposed in the 1940s. Since then it has been criticized (for example by Velleman & Wilkinson (1993) and Bergman (1996)) on three grounds:

    • It is too strict to apply to real data - for example the visual analogue scale has more information than a purely ordinal scale, but cannot be described as a measurement variable.
    • It often leads to the use of 'non-parametric' methods of data analysis, when parametric methods could be used.
    • It leads to prescriptive analysis based only on the type of variable, rather than all characteristics of the data.
    It has to be said, however, that the main issue of contention is whether the arithmetic mean is an appropriate measure of location for an ordinal variable. It is true that some variables measured on the visual analogue scale may well approach a measurement scale sufficiently closely to be accepted as 'honorary' measurement variables. But it is still highly questionable to use the mean for variables that cannot possibly be regarded as approaching the measurement scale. We return to this issue in the More information page on measures of location.



Continuous versus discrete variables

Variables can also be classified as to whether they are continuous or discrete. The mathematical definition of a continuous variable is that, for a random sample, no two values will be identical. In more general terms, a continuous variable is one which can (at least theoretically) take any intermediate value between its maximum and minimum value. The exact value we record is limited by the accuracy of our measurements.

    For example, ordinary mercury-in-glass clinical thermometers are usually calibrated in divisions of 1/10 of a degree. Therefore we are unlikely to be able to measure to less than half that amount, that is one twentieth of a degree (i.e. 0.05°). A measurement of 35.102 degrees would imply the actual measurement is somewhere between 35.1015 and 35.1025 degrees. This level of accuracy is not possible with such equipment, and is therefore misleading.

A discrete variable (also known as a meristic variable) is one where the measurement can only exist as a whole number (an integer). Discrete variables are usually counts, for example the number of children in a school, or the number of vehicles using a road. Some variables, such as monetary income, can be difficult to assign as continuous or discrete. Although income can take many intermediate values, values below the smallest unit of currency are not possible. As a result, bills can only be presented, or cheques written, to the nearest cent, penny, or shilling.

Rounding is a process applied to observations of a measurement variable where intermediate values are recorded to the nearest whole number. Weight, for example, may be rounded to the nearest gram. Some rounding is, of course, inevitable in any measurement. But too much rounding reduces the information content of a measurement, and makes the data behave like a discrete, rather than continuous variable. This results in 'tied' values which can badly upset the assumptions of many commonly-used statistical models. Some methods of rounding also introduce bias to the measurement, if for example the data are truncated by just omitting all numbers after the decimal point. Hence one should always use an unbiased method of rounding.
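One unbiased scheme is 'round half to even' (sometimes called banker's rounding), which Python's built-in round() happens to use; simple truncation, by contrast, biases every value downwards. A minimal illustration:

```python
# Compare two rounding schemes on values ending in .5: truncation always
# rounds down (biased), while 'round half to even' rounds ties to the
# nearest even integer, so the errors cancel out on average (unbiased).
import math

values = [0.5, 1.5, 2.5, 3.5]

truncated = [math.trunc(v) for v in values]   # [0, 1, 2, 3]
half_to_even = [round(v) for v in values]     # [0, 2, 2, 4]

# The mean of the unbiased roundings matches the mean of the raw values;
# the truncated mean is systematically too low.
print(sum(values) / 4, sum(half_to_even) / 4, sum(truncated) / 4)
```

As the means show, truncation shifts every estimate downward, whereas the unbiased scheme leaves the average unchanged.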

Note that whether a variable is continuous or discrete tells us nothing about the shape of its frequency distribution. For example many statistical tests assume that the data (or statistics derived from them) follow the normal distribution, the theoretical continuous probability distribution shown in virtually every elementary statistics textbook. Whilst some measurement variables do indeed tend towards this symmetrical bell-shaped distribution, many (such as time) do not, and are instead heavily skewed. Moreover, some discrete variables (such as counts) can approximate to one of the theoretical continuous distributions.



Derived and proxy variables

A derived variable is one that is derived from two (or more) primary variables. Hence percentages, ratios, indices and rates are all derived variables. Care must be taken when analysing derived variables, for two reasons:
  1. Since the value a derived variable takes depends on two other variables, changes in a derived variable may have arisen from a change in either or both of the primary variables. For example an increase in the proportion of males in a population can arise from an increase in the number of males, a decrease in the number of females, a greater increase in the number of males than of the number of females, or a lesser decrease in the number of males than the number of females.

  2. Derived variables can have unexpected properties. You may be able to treat percentages much as other variables, for example when analysing packed cell volumes (PCVs). Or they may have a skewed frequency distribution which will need to be handled differently. If you are comparing prevalences, take special care if they are based on different sample sizes.
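    The first point can be shown with made-up counts: the same rise in the proportion of males can arise either because males increased or because females decreased.

```python
def prop_males(males, females):
    """Proportion of males in a population (a derived variable)."""
    return males / (males + females)

baseline = prop_males(40, 60)          # 0.4

# The same increase in the derived variable arises in different ways:
more_males = prop_males(60, 60)        # number of males increased
fewer_females = prop_males(40, 40)     # number of females decreased

print(baseline, more_males, fewer_females)  # 0.4 0.5 0.5
```

    Looking only at the proportion, the two situations are indistinguishable, which is why the primary variables should be examined as well.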

An important difference between the different types of variables is the amount of information each observation can contain, and therefore the analyses that can usefully be applied to each. Observations on a ratio scale can contain more information than those on an interval scale. Continuous variables can describe more than discrete ones. Ordinal variables convey less information, and categorical variables give least. Data derived from combining different types of variable must be analysed according to the least informative variable included. In other words, combining different types of variables loses information. For example, the weight of mice times an ordinal aggression score (1 = fierce, 2 = normal, 3 = docile) has to be analysed as ordinal data, not as a measurement variable.

A proxy variable (also termed a surrogate variable) is an indirect measure of the variable that a researcher wishes to study. Proxy variables are widely used when the variable under study is difficult (or impossible) to measure or observe directly. For example:

  1. Medical care premiums can be used as a proxy variable for socio-economic status as they are worked out based on income. The area where you live is generally not such a good proxy measure for socio-economic status, nor is educational attainment.
  2. The stem diameter of a bush or tree (which is easy to measure) can often be used as a proxy variable for relative plant size and height
  3. Evapotranspiration can be used as a proxy variable for plant primary productivity

Proxy variables should always be validated if at all possible - we will examine this in detail in the next unit.



Collapsing and transforming variables

Collapsing a variable means changing the scale of a variable, or reducing the number of categories. It is most commonly done to facilitate display or analysis of data. Continuous measurement data can be collapsed to give discrete categories by putting them in class intervals. Providing each interval is fully defined, it still remains as measurement data, because we can carry out arithmetic operations on it (as below when we work out the arithmetic mean from a frequency distribution). However as soon as an interval is defined merely as 'greater than' some value, the variable is reduced to an ordinal scale. Rounding is, of course, the first step in the process of collapsing continuous measurement data to discrete observations.
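A sketch of collapsing continuous measurements into fully-defined class intervals (the weights and bin edges here are invented for illustration):

```python
# Collapse continuous weights (grams) into fully-defined class intervals.
# While every interval has known limits the data remain measurement data;
# an open-ended 'greater than 30' class would reduce them to ordinal.

weights = [12.4, 18.9, 21.0, 25.7, 29.3, 14.2]
edges = [10, 20, 30]  # defines the intervals [10, 20) and [20, 30)

def interval(x, edges):
    """Return the (lower, upper) limits of the class containing x."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= x < hi:
            return (lo, hi)
    raise ValueError(f"{x} falls outside the defined intervals")

collapsed = [interval(w, edges) for w in weights]
print(collapsed)
```

Because each class has known limits, we can still (for example) compute an approximate mean from the resulting frequency distribution using the interval midpoints.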

Measurement data can also be collapsed to the ordinal scale by assigning ranks to the observations. Similarly, data measured on the measurement, ordinal or nominal scale can be collapsed (or 'dichotomized') to a binary or dichotomous variable having just two categories. Always think carefully before collapsing a variable, as you are always losing information in the process.


An item's rank (r) within a given set of values describes how many of those values are less than or equal to it. The simplest way to assign ranks to a set of (n) values is to sort them in ascending order and number them from 1 to n.

    For instance, sorting these (n=4) values: 9.9, 1.1, 1.3, 1.2 into ascending order gives (1) 1.1, (2) 1.2, (3) 1.3, (4) 9.9 - here each item's rank is shown in parentheses. Hence an item's rank is its position when the values are arranged thus.

The inverse rank describes each item's location when values are in descending order. The relative rank (=r/n) is the proportion of items of equal or lesser rank (we discuss this further elsewhere).

    The relative rank of these unsorted values, among their n=4 fellows, is: (1.00) 9.9, (0.25) 1.1, (0.75) 1.3, (0.50) 1.2


Assigning ranks runs into difficulties when not every item has a different value. For tied items, whose values are identical, there are several alternative ways to define rank (when every value differs, their results are identical). In the examples below the ranks of the tied items are shown in parentheses.

  1. The maximal rank uses their highest rank (yielding tied ranks). This is equivalent to the most common and simplest definition, above.
      The maximal rank of these (n=4) values is: (4) 9.9, (2) 1.2, (3) 1.3, (2) 1.2

  2. The minimal rank uses their lowest rank (yielding tied ranks).
      The minimal rank of these (n=4) values is: (4) 9.9, (1) 1.2, (3) 1.3, (1) 1.2

  3. The ascending rank or sequential rank breaks ties by assigning ranks according to the order the values were observed - so every rank is different.
      Arranged in that order, the ascending rank of these (n=4) values is: (4) 9.9, (1) 1.2, (3) 1.3, (2) 1.2

  4. The jittered rank breaks ties by randomly assigning ranks within tied values, in other words by assigning them a random order. This is equivalent to adding or subtracting an infinitesimally small, but different, variation to each tied value prior to ranking.
      Applying that rule, we found the jittered rank of these (n=4) values was: (4) 9.9, (2) 1.2, (3) 1.3, (1) 1.2

  5. The mean rank allocates tied values their mean rank. This is equivalent to their average jittered rank, or expected rank, after many random assignments - hence mean rankings are tied.
      The mean rank of these (n=4) values is: (4) 9.9, (1.5) 1.2, (3) 1.3, (1.5) 1.2
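    The tie-handling rules above can be reproduced in a few lines of plain Python (scipy.stats.rankdata offers equivalent 'max', 'min', 'ordinal' and 'average' methods, but spelling the definitions out makes them explicit):

```python
def ranks(values, method="mean"):
    """Rank values (1..n, ascending), handling ties by the given rule."""
    out = []
    for i, v in enumerate(values):
        less = sum(1 for w in values if w < v)    # items below v
        equal = sum(1 for w in values if w == v)  # items tied with v
        if method == "max":            # highest rank shared by ties
            out.append(less + equal)
        elif method == "min":          # lowest rank shared by ties
            out.append(less + 1)
        elif method == "ascending":    # break ties by order observed
            earlier = sum(1 for w in values[:i] if w == v)
            out.append(less + earlier + 1)
        elif method == "mean":         # average of the tied ranks
            out.append(less + (equal + 1) / 2)
    return out

data = [9.9, 1.2, 1.3, 1.2]
print(ranks(data, "max"))        # [4, 2, 3, 2]
print(ranks(data, "min"))        # [4, 1, 3, 1]
print(ranks(data, "ascending"))  # [4, 1, 3, 2]
print(ranks(data, "mean"))       # [4.0, 1.5, 3.0, 1.5]
```

    Dividing any of these ranks by n gives the corresponding relative rank. A jittered rank could be obtained by shuffling the positions of tied items before applying the ascending rule, but since its output is random it is not shown here.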

Collapsing a variable is just one form of data transformation. There are many types of data transformation. Some lose information, such as the rank transformation covered above, but many do not.

Linear transformation or coding is the addition of, or multiplication by, a constant number. This may be done to ease the burden of calculations or recording, or to enable a non-linear transformation to be performed. We give some examples of this below.

Non-linear transformation is the application of any other mathematical function to data - other than simple multiplication or addition. Non-linear transformation is commonly used, either to change the frequency distribution of a variable, to make it more amenable to analysis, or to linearize (straighten) a relationship between two variables. The commonest non-linear transformations are the logarithmic (or log) transformation (the logarithm is taken of each observation) and the square root transformation (the square root is taken of each observation). Means estimated from transformed data should normally be subjected to a detransformation (by carrying out the reverse mathematical procedure) for presentation. We return to non-linear transformations in more depth in Unit 6, both in the core text and in a More Information page on the topic.
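    A minimal sketch of a log transformation followed by detransformation of the mean. Note that the detransformed mean is the geometric mean, which for skewed data is smaller than the arithmetic mean:

```python
import math

# Right-skewed data: log-transform, average, then back-transform.
data = [1, 10, 100]

logs = [math.log10(x) for x in data]   # [0.0, 1.0, 2.0]
mean_log = sum(logs) / len(logs)       # 1.0

# Detransformation: the antilog of the mean log is the geometric mean,
# much smaller than the arithmetic mean (37.0) for these skewed data.
geometric_mean = 10 ** mean_log
print(geometric_mean)  # 10.0
```

    The square root transformation works the same way: take square roots, analyse, then square the result for presentation.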



Assumptions and requirements

Usually, the key assumption for any data is that they are selected randomly. In other words they are selected without pattern and with equal probability.

For example, there is little point analysing data on farm income if your selection procedure tends to favour the most well-off landowners, or those nearest to major roads. Similarly you will get a biased assessment of herd health if you only sample the most easily caught animals. Nor can you hope to assess a pest control campaign by only inspecting the places you are taken to by the District Officer.

Many analytical procedures also assume the observations are independent of each other. In other words you did not choose to take a sample from patient B just because the sample from patient A was positive (or negative).

Observations taken in series over time are usually not independent, and require special methods of analysis. For example, the weather on one day is often quite similar to that of the previous day, animal health one week is generally quite similar to that the previous week, and worm infestation one year tends to be influenced by their prevalence the previous year. A number of procedures have been developed to test for the simpler patterns of serial correlation. However, there is no test which can assess whether observations are truly independent in all respects.
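One simple diagnostic for the most basic pattern of serial correlation is the lag-1 autocorrelation coefficient: the correlation between each observation and the one before it. This plain-Python sketch is an illustration of the idea, not one of the formal tests alluded to above:

```python
def lag1_autocorrelation(x):
    """Correlation between each observation and the preceding one."""
    n = len(x)
    mean = sum(x) / n
    # Sum of products of successive deviations, over sum of squares.
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# A steadily trending series (like day-to-day temperatures) is strongly
# serially correlated; independent observations give a value near zero.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
print(lag1_autocorrelation(trending))  # 0.625
```

Values near zero are consistent with independence at lag 1, but, as noted above, no single statistic can establish that observations are truly independent in all respects.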

Related topics :

The computer revolution

Data verification