"It has long been an axiom of mine that the little things are infinitely the most important"
Types of variables and collapsing and transforming them
Properties & Assumptions
Even when scientists can be persuaded to look at their (raw, unanalyzed) results, they seldom consider the properties inherent to that form of information - or how those properties might affect any statistics they may use to summarize their results. Since the properties of their information have major implications for the validity of their subsequent analysis, this is rather a shame. The first obstacle to be overcome in this matter lies in describing what sort of information you are dealing with. But before we can proceed, we need to define a few key terms.
A datum is a single item of information. In practice, we are interested in a number of items of information for which we use the plural of datum - data.
Data exist as observations. These are classifications or measurements made on the smallest sampling unit, known as the evaluation unit. The evaluation unit is commonly a single individual. The weight of a mouse is a single observation. If we weigh it again, or weigh another mouse, that is a second observation.
Do not confuse observations with events. An event is something that happens, such as a storm. You can make observations on different aspects of an event (or a mouse), such as how big it is, when it begins (or is born, or conceived), how long it survives for, and how it affects things around it.
For their conclusions to be of practical use, scientists need to distinguish treatments they apply from results they observe. Both may be measured, are subject to error, and can apply to the same item - such as a mouse. The difference is the extent to which you can infer that one causes the other.
Inevitably, we are interested in how our observations vary. If they never varied, there would be little reason to make the observations. Thus, each type of observation we make is called a variable. We can formally define a variable as a characteristic of an evaluation unit that can assume a number of different values. There can be any number of variables pertaining to an individual item or event. For example, we may measure the length and width of a tooth. The length is one variable, the width is another. Or we could measure the duration, rainfall and windspeed of a storm, or the temperature, humidity and wind direction preceding it.
There are several criteria we can use to classify variables. The most important are the scale of measurement used, whether the distribution of the variable is discrete or continuous, and whether it is a primary or derived variable.
Scales of measurement
A nominal variable (also known as a categorical variable) has a number of categories to which an observation can be assigned. Thus colour is measured on the nominal scale (or categorical scale). For example, fungal cultures could be classified as 'white', 'yellow' or 'green'. Or you may record the number of mice that are white, brown, or brown and white. If you also record whether they are 'furry', 'nude' or 'wrinkled', this is another nominal variable. If you wish to have intermediate categories, you must provide them before you make your measurements. You could assign a number or score to each category, say 1 for furry, 2 for nude and 3 for wrinkled. But these numbers are for convenience only - they do not indicate any formal relationship between the categories. Gender is a special type of nominal variable in that it can only take one of two values - male or female. This is known as a binary or dichotomous variable. Binary variables may be labelled as 0 or 1 or sometimes + or −.
In the ordinal scale different categories can be related to each other. The descriptions or numbers indicate their rank order (see
In general with ordinal scales the rank implies nothing about the magnitude of the difference between the categories. However, there is a special type of ordinal scale that attempts to do this and thus may approach a measurement scale - this is known as the visual analogue scale. It is most commonly used in medical research to assess patient conditions that cannot be readily quantified - such as level of pain or state of asthma. A straight line, usually 10 cm in length, is drawn - with for example, 0 mm corresponding to no pain, and 100 mm to the worst pain possible. The patient then has to indicate where on the line his or her pain level lies.
In the measurement scale, the distances between any two numbers on the scale are of known size. Any difference is the same, wherever it lies on its scale. Thus a difference of one pound is the same whether you are referring to 5 pounds, or 50,000 pounds. Examples of a measurement variable are variables such as length, weight or number of births.
Some measurement variables only have an arbitrary zero, in which case they are described as being on an interval scale of measurement. For example the Celsius (Centigrade) scale of temperature measurement has an arbitrary zero point. In an interval scale the ratios of differences on the scale are independent of the unit of measurement and of the zero point. Say you have three temperatures on the Celsius scale, 0, 10 and 100 degrees. Equivalent temperatures on the Fahrenheit scale are 32, 50 and 212. The ratio of the differences on the Celsius scale, (100 - 10)/(10 - 0) = 9, is the same as the ratio on the Fahrenheit scale, (212 - 50)/(50 - 32) = 9.
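The arithmetic in the temperature example can be checked with a short Python sketch. The function names here are ours, chosen for illustration only. It also shows that ratios of the values themselves (as opposed to ratios of differences) are not preserved when the zero point is arbitrary.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def ratio_of_differences(a, b, c):
    """Return (c - b) / (b - a) for three values measured on the same scale."""
    return (c - b) / (b - a)


celsius = [0, 10, 100]
fahrenheit = [celsius_to_fahrenheit(t) for t in celsius]

# Ratios of differences do not depend on the unit or the zero point.
ratio_c = ratio_of_differences(*celsius)
ratio_f = ratio_of_differences(*fahrenheit)
print(fahrenheit)        # the equivalent Fahrenheit temperatures
print(ratio_c, ratio_f)  # both ratios of differences

# Ratios of the raw values DO depend on the zero point, so they differ.
raw_ratio_c = celsius[2] / celsius[1]
raw_ratio_f = fahrenheit[2] / fahrenheit[1]
print(raw_ratio_c, raw_ratio_f)
```

This is why statements such as 'twice as hot' make no sense on an interval scale, although 'twice as large a difference' does.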
When a scale has all the characteristics of an interval scale and also has a true zero point it is described as a ratio scale of measurement. Many measurement variables, such as weight and height, have a true zero point.
Is this classification meaningful?
Before just accepting the classification above, we should perhaps think a little more carefully about the matter. Despite the fact it is used in nearly every introductory statistics text, it was only proposed in the 1940s. Since then it has been criticized (for example Velleman & Wilkinson
Continuous versus discrete variables
Variables can also be classified as to whether they are continuous or discrete. The mathematical definition of a continuous variable is that, for a random sample, no two values will be identical. In more general terms, a continuous variable is one which can (at least theoretically) take any intermediate value between its maximum and minimum value. The exact value we record is limited by the accuracy of our measurements.
For example, ordinary mercury-in-glass clinical thermometers are usually calibrated in divisions of 1/10 of a degree. Therefore we are unlikely to be able to measure to less than half that amount, that is one twentieth of a degree (i.e. 0.05°). A measurement of 35.102 degrees would imply the actual measurement is somewhere between 35.1015 and 35.1025 degrees. This level of accuracy is not possible for such equipment, and is therefore
A discrete variable (also known as a meristic variable) is one where the measurement can only exist as a whole number (an integer). Discrete variables are usually counts, for example the number of children in a school, or the number of vehicles using a road. Some variables, such as monetary income, can be difficult to assign as continuous or discrete. Although income can take many intermediate values, values below the smallest unit of currency are not possible. As a result, bills can only be presented, or cheques written, to the nearest cent, penny, or
Rounding is a process applied to observations of a measurement variable where intermediate values are recorded to the nearest whole number. Weight, for example, may be rounded to the nearest gram. Some rounding is, of course, inevitable in any measurement. But too much rounding reduces the information content of a measurement, and makes the data behave like a discrete, rather than continuous variable. This results in 'tied' values which can badly upset the assumptions of many commonly-used statistical models. Some methods of rounding also introduce bias to the measurement, if for example the data are truncated by just omitting all numbers after the decimal point. Hence one should always use an unbiased method of
Note that whether a variable is continuous or discrete tells us nothing about the shape of its frequency
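The effect of rounding described above can be illustrated with a minimal Python sketch. The weights are hypothetical values invented for this example. Truncation is compared with Python's built-in round(), which rounds to the nearest whole number and sends exact halves to the nearest even number; this is one possible choice of rounding rule, not necessarily the one the text has in mind.

```python
import math

# Hypothetical weights in grams, invented purely for illustration.
weights = [12.2, 12.7, 13.5, 14.9, 15.1, 15.8]

# Truncation: simply drop everything after the decimal point.
truncated = [math.trunc(w) for w in weights]

# Rounding to the nearest gram (exact halves go to the nearest even number).
rounded = [round(w) for w in weights]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


print(mean(weights), mean(truncated), mean(rounded))

# Both methods create tied values that were not tied in the original data.
print(len(set(weights)), len(set(truncated)), len(set(rounded)))
```

In this example the truncated mean falls well below the mean of the original weights, because truncation always moves positive values downwards. Rounding to the nearest gram stays much closer, although it still produces ties.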
Derived and proxy variables
A derived variable is one that is derived from two (or more) primary variables. Hence percentages, ratios, indices and rates are all derived variables. Care must be taken when analysing derived variables, for two reasons:
An important difference between the different types of variables is the amount of information each observation can contain, and therefore the analyses that can usefully be applied to each. Observations using a ratio scale can contain more information than those employing an interval scale. Continuous variables can describe more than discrete ones. Ordinal variables convey less information, and categorical variables give least. Data derived from combining different types of variable must be analysed according to the least informative variable included. In other words, combining different types of variables loses information. For example, the weight of mice times an ordinal aggression score (1 = fierce, 2 = normal, 3 = docile), has to be analysed as ordinal data, not as a measurement variable.
A proxy variable (also termed a surrogate variable) is an indirect measure of the variable that a researcher wishes to study. Proxy variables are widely used when the variable under study is difficult (or impossible) to measure or observe directly. For example:
Collapsing and transforming variables
Collapsing a variable means changing the scale of a variable, or reducing the number of
Measurement data can also be collapsed to the ordinal scale by assigning ranks to the observations. Similarly, data measured on the measurement, ordinal or nominal scale can be collapsed (or 'dichotomized') to a binary or dichotomous variable having just two categories. Always think carefully before collapsing a variable, as you always lose information in the process.
An item's rank (r) within a given set of values describes how many of those values are less than or equal to it. The simplest way to assign ranks to a set of (n) values is to sort them in ascending order and number them from 1 to n.
Arbitrarily assigning a rank-order can run into difficulties when every item does not have a different value. For tied items, whose values are identical, there are several alternative ways to define rank (when every value differs their results are identical). The ranks of the tied items are shown in light blue
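The ranking methods discussed above can be sketched in plain Python. The function and the method names ('ordinal', 'min', 'max', 'average') are our own labels, used here as assumptions for illustration. The 'max' method corresponds to the definition given earlier (the number of values less than or equal to the item), and 'average' gives tied items the mean of the ranks they jointly occupy.

```python
def rank(values, method="average"):
    """Rank values in ascending order, handling ties by the chosen method.

    method: 'ordinal' (arbitrary order within ties), 'min', 'max' or 'average'.
    """
    # Indices of the values, sorted by value (sorted() is stable).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the block of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # The tied block occupies sorted positions pos+1 .. end+1.
        for offset, idx in enumerate(order[pos:end + 1]):
            if method == "ordinal":
                ranks[idx] = pos + 1 + offset
            elif method == "min":
                ranks[idx] = pos + 1
            elif method == "max":
                ranks[idx] = end + 1
            elif method == "average":
                ranks[idx] = (pos + 1 + end + 1) / 2
            else:
                raise ValueError("unknown method: " + method)
        pos = end + 1
    return ranks


data = [3, 1, 4, 1, 5]  # hypothetical values; the two 1s are tied
for m in ("ordinal", "min", "max", "average"):
    print(m, rank(data, m))
```

When no values are tied, all four methods give the same ranks. They only differ in how they treat the tied block.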
Collapsing a variable is just one form of data transformation. There are many types of data transformation. Some lose information, such as the rank transformation covered above, but many do not.
Linear transformation or coding is the addition of, or multiplication by, a constant number. This may be done to ease the burden of calculations or recording, or to enable a non-linear transformation to be performed. We give some examples of this below.
Non-linear transformation is the application of any other mathematical function to data - other than simple multiplication or addition. Non-linear transformation is commonly used, either to change the frequency distribution of a variable, to make it more amenable to analysis, or to linearize (straighten) a relationship between two variables. The commonest non-linear transformations are the logarithmic (or log) transformation (the logarithm is taken of each observation) and the square root transformation (the square root is taken of each observation). Means estimated from transformed data should normally be subjected to a detransformation (by carrying out the reverse mathematical procedure) for presentation. We return to non-linear transformations in more depth in Unit 6, both in the core text and in a More Information page on the topic.
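As a minimal sketch of the ideas above, the following Python code applies a logarithmic transformation, estimates the mean on the transformed scale, and then detransforms it. It also shows linear coding (adding 1) used to enable a log transformation when the data contain zeros. All the data values are hypothetical and were chosen so that the arithmetic is easy to follow.

```python
import math

# Hypothetical, strongly right-skewed observations.
counts = [1, 10, 100, 1000]

# Log transformation: take the logarithm (base 10) of each observation.
logged = [math.log10(x) for x in counts]
mean_log = sum(logged) / len(logged)

# Detransformation: reverse the procedure to present the mean
# on the original scale (this is the geometric mean).
detransformed_mean = 10 ** mean_log
arithmetic_mean = sum(counts) / len(counts)
print(mean_log, detransformed_mean, arithmetic_mean)

# Linear coding to enable a non-linear transformation:
# log(0) is undefined, so add 1 to every observation first.
with_zero = [0, 9, 99]
coded_logged = [math.log10(x + 1) for x in with_zero]
mean_coded_log = sum(coded_logged) / len(coded_logged)

# Detransform by reversing both steps: antilog, then subtract 1.
detransformed_coded_mean = 10 ** mean_coded_log - 1
print(detransformed_coded_mean)
```

Note that the detransformed mean is not the same as the arithmetic mean of the original observations. For skewed data such as these it is much smaller.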
Assumptions and requirements
For example, there is little point analysing data on farm income if your selection procedure tends to favour the most well-off landowners, or those nearest to major roads. Similarly you will get a biased assessment of herd health if you only sample the most easily caught animals. Nor can you hope to assess a pest control campaign by only inspecting the places you are taken to by the District Officer.
Many analytical procedures also assume the observations are independent of each other. In other words you did not choose to take a sample from patient B just because the sample from patient A was positive.
Observations taken in series over time are usually not independent, and require special methods of analysis. For example, the weather on one day is often quite similar to that of the previous day, animal health one week is generally quite similar to that of the previous week, and worm infestation one year tends to be influenced by its prevalence the previous year. A number of procedures have been developed to test for the simpler patterns of serial correlation. However, there is no test which can assess whether observations are truly independent in all respects.
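One of the simplest patterns of serial correlation is the relationship between each observation and the one immediately before it. The Python sketch below computes a lag-1 autocorrelation coefficient for two hypothetical series. It is a descriptive measure offered as an illustration, not a formal test, and the function name is our own. A value near zero does not show that the observations are independent in all respects.

```python
def lag1_autocorrelation(series):
    """Correlation between each observation and the one before it.

    Uses the sum of products of successive deviations from the mean,
    divided by the sum of squared deviations.
    """
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


# Hypothetical series in which each value resembles the previous one.
trending = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical series in which successive values alternate.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(lag1_autocorrelation(trending))     # strongly positive
print(lag1_autocorrelation(alternating))  # strongly negative
```

A strongly positive value suggests that successive observations resemble each other, as with the daily weather. A strongly negative value suggests that they tend to alternate.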