"It has long been an axiom of mine that the little things are infinitely the most important"
Descriptive surveys over time and space
Longitudinal surveys (monitoring, sampling methodology); Displaying monitoring data; Analysis of monitoring data; Distribution surveys (mapping); Display of spatial data; Analysis of spatial data
Longitudinal surveys (monitoring)
The first step is to define the target population and the response variable - whether disease incidence, pest density or population size. If using passive surveillance, ensure that you have recorded all the cases correctly. For active surveillance and population monitoring, use probability sampling if at all possible. For different sampling methods see the More Information page on Sampling methods.
Sometimes randomly located samples are taken on each occasion. Such an approach may be used if you are monitoring the development of a pest population on a crop. Alternatively, repeated samples may be taken from a number of fixed locations - such as trap positions, or line transects. By restricting sampling to the same positions on each sampling occasion, the catch or number observed will not be differentially affected over time by any 'site effect' - providing any site-dependent bias remains constant over time, which it may not! If the aim is to study the dynamics of a population, the range of habitats sampled should cover the full range of habitats used over the year. Otherwise apparent changes in population may merely reflect changes in its distribution.
As regards the pattern of sampling over time, the most informative way of analysing monitoring data (time-series analysis) assumes that the sampling interval (the period between sampling occasions) remains constant. Hence as far as possible, sampling should be done daily, weekly, or after a set number of days. The length of the sampling interval required depends upon the rate of change of the characteristic being studied - a rapidly changing population should be sampled more frequently than one which has a slow rate of change.
It is important to take sufficient samples so that you have adequate power to detect population changes. This necessitates (again) use of probability sampling methods so that meaningful confidence intervals can be attached.
One needs to think carefully about what will be reported in disease monitoring data. The first data available are the numbers of suspected cases by date of reporting. However, these can be very misleading if there are diagnostic problems. Alternatively, one can report the number of confirmed cases by week of confirmation. This is more reliable as regards getting real cases, but can be misleading because it is affected by delays at the laboratories where confirmation takes place. The best way is to report confirmed cases according to the week in which they were reported. If the period of time between onset and reporting is known or can be estimated, then confirmed cases can be plotted by date of clinical onset of the disease to give what is known as an epidemic curve.
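As a rough illustration of the tallying described above, the sketch below counts confirmed cases by the week in which they were reported, and optionally shifts each case back by an estimated onset-to-report delay to approximate an epidemic curve. The function names, the fixed delay and the report dates are all hypothetical; in practice the delay would vary between cases.

```python
from collections import Counter
from datetime import date, timedelta

def week_start(d):
    """Return the Monday of the week containing date d."""
    return d - timedelta(days=d.weekday())

def epidemic_curve(report_dates, onset_delay_days=0):
    """Count confirmed cases per week.

    onset_delay_days is an assumed, constant delay between clinical onset
    and reporting; 0 tallies cases by week of report.
    """
    counts = Counter()
    for reported in report_dates:
        onset = reported - timedelta(days=onset_delay_days)
        counts[week_start(onset)] += 1
    return dict(sorted(counts.items()))

# Hypothetical report dates for confirmed cases (not real data).
reports = [date(2024, 3, 4), date(2024, 3, 6),
           date(2024, 3, 12), date(2024, 3, 13), date(2024, 3, 14)]

by_report_week = epidemic_curve(reports)
by_onset_week = epidemic_curve(reports, onset_delay_days=7)

for week, n in by_onset_week.items():
    print(week.isoformat(), n)
```

Keying each week by its Monday keeps the weeks in date order and avoids ambiguity at year boundaries.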
If your observations are made at equally-spaced intervals, for example every 15 days, it is straightforward to plot the variable being measured against its day number. Monthly readings will not be precisely equally spaced, as the number of days in a month varies. If observations are not equally spaced, ensure that points are plotted on the correct day number, rather than just as an observation number or month. Month can then be superimposed on the day number axis as we have done below.
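A minimal sketch of the day-number conversion, assuming readings are taken on the 15th of each month (hypothetical dates). It shows that "monthly" readings are not equally spaced, which is why the x-axis should be day number rather than observation number.

```python
from datetime import date

def day_numbers(sample_dates):
    """Convert sampling dates to days elapsed since the first sample."""
    first = sample_dates[0]
    return [(d - first).days for d in sample_dates]

# Hypothetical monthly sampling dates.
monthly = [date(2023, 1, 15), date(2023, 2, 15),
           date(2023, 3, 15), date(2023, 4, 15)]

x = day_numbers(monthly)
intervals = [b - a for a, b in zip(x, x[1:])]

print(x)          # day numbers to use on the x-axis
print(intervals)  # gaps between successive readings, in days
```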
Readings are sometimes missed, so there may be several months between readings instead of the usual one. If you do have missing data (for example, the two values for May), you should join the points on either side of the gap with a dotted line, as we have done in the second figure above, to indicate that you have much less confidence in what was happening during that period. Of course, you may be missing fluctuations between the regular sampling occasions, especially if the sampling interval is excessively long for the organism you are studying. Time trend plots are also sometimes done using bar diagrams rather than line plots, but this is not recommended.
When we come to consider multiple time series plots, we may find some quite complex relationships between them. Sometimes the series vary quite independently of each other; at other times they vary together. But sometimes there is a time lag or time delay between a change in one variable and a change in the other. In the example below there appears to be a 1-2 week lag between when host density changes and when percentage parasitism changes.
It can be informative to plot this type of data as a trajectory plot, shown in the second figure above. Here percentage parasitism is plotted against host density, and the points are joined in sequence. If percentage parasitism is related to host density at some earlier time period, we will get an anticlockwise spiral in the trajectory.
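The direction of rotation in a trajectory plot can also be checked numerically. The sketch below is not part of the original analysis: it uses the shoelace (signed area) formula as a convenient stand-in for judging the plot by eye, and the cyclic host density and parasitism series are artificial. With host density on the x-axis and parasitism on the y-axis, a positive signed area corresponds to an anticlockwise loop.

```python
import math

def signed_area(xs, ys):
    """Shoelace formula for the polygon through the points in sequence.

    Positive for an anticlockwise loop, negative for a clockwise one.
    The last point is joined back to the first to close the loop.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        j = (i + 1) % n
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return total / 2.0

# Artificial series covering one full cycle of 8 sampling occasions.
period = 8
host = [50 + 30 * math.cos(2 * math.pi * t / period) for t in range(period)]
# Parasitism that follows host density two occasions later (lagging).
lagging = [40 + 20 * math.cos(2 * math.pi * (t - 2) / period) for t in range(period)]
# Parasitism that changes two occasions before host density (leading).
leading = [40 + 20 * math.cos(2 * math.pi * (t + 2) / period) for t in range(period)]

print(signed_area(host, lagging) > 0)   # anticlockwise when parasitism lags
print(signed_area(host, leading) < 0)   # clockwise when parasitism leads
```

Real monitoring data rarely complete a clean cycle, so the sign is only a rough guide and the plot itself remains the main diagnostic.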
There is another way that lags in systems can be identified - that is by plotting one variable against the other variable, at some specified earlier time period. The first figure below shows percentage parasitism plotted against host density with no time lag - there is no obvious relationship between the two variables.
The second and third figures show the effect of adding a progressively greater time lag. There is a very clear relationship between percentage parasitism and host density two weeks earlier. It is frequently necessary to test a number of different time lags when looking at relationships between (say) pest numbers and meteorological factors such as temperature or rainfall.
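Testing several lags can be sketched as below: the response at each occasion is paired with the explanatory variable a fixed number of occasions earlier, and the correlation is computed for each candidate lag. The data are artificial and constructed so that parasitism depends exactly on host density two weeks earlier; the function names are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(explanatory, response, lag):
    """Correlate response[t] with explanatory[t - lag] (lag >= 0)."""
    if lag == 0:
        return pearson(explanatory, response)
    return pearson(explanatory[:-lag], response[lag:])

# Artificial weekly data (not from the study described in the text).
host_density = [10, 14, 25, 40, 32, 18, 12, 20, 35, 42, 28, 15]
# Parasitism set by host density two weeks earlier; first two values arbitrary.
parasitism = [20, 22] + [0.5 * h + 5 for h in host_density[:-2]]

correlations = {lag: lagged_correlation(host_density, parasitism, lag)
                for lag in range(4)}
best_lag = max(correlations, key=correlations.get)

for lag, r in correlations.items():
    print(lag, round(r, 3))
```

Each extra week of lag drops one pair of observations, so correlations at long lags rest on fewer points. Trying many lags also increases the chance of finding a strong correlation by accident.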
Probably the most common model for time series data is to fit a simple linear regression line to a trend over time. If the trend over time is curvilinear, the response variable can be log transformed before fitting the regression line. We looked briefly at regression in the More Information page on Relationships between
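As a hedged illustration of fitting a trend after a log transformation, the sketch below fits an ordinary least-squares line to log counts against week number. The counts are artificial and double each week, so the back-transformed slope recovers a weekly multiplication factor of 2. The sketch ignores serial correlation between successive observations.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Artificial counts that double each week: 5, 10, 20, ...
weeks = [0, 1, 2, 3, 4, 5]
counts = [5 * 2 ** w for w in weeks]

# Log transform turns the exponential trend into a straight line.
log_counts = [math.log(c) for c in counts]
slope, intercept = fit_line(weeks, log_counts)

weekly_factor = math.exp(slope)      # estimated multiplication per week
initial_count = math.exp(intercept)  # estimated count at week 0

print(round(weekly_factor, 3), round(initial_count, 3))
```

Counts of zero cannot be log transformed directly; a common workaround is log(count + 1), which changes the interpretation of the fitted coefficients.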
There are more rigorous methods of analysis for serially correlated observations - namely time series analysis. One version of this is ARIMA modelling. The acronym stands for Auto-Regressive, Integrated, Moving Average. The first step is often to transform the response variable to ensure homogeneity of variances. ARIMA models are then fitted to both the explanatory and response variables to remove the effects of serial correlation. The two series are then cross-correlated to determine whether an association exists. Note however that this approach can only be used with very long runs of data - if, for example, you wish to identify seasonal trends, you will need 10-20 years of data.
Extrapolation of a multivariate regression relationship is another approach, which is widely used for forecasting pest damage levels to crops. Any forecast based upon only a few years' data is, of course, likely to be less reliable than one based on many years' data. But however many years' data are used, it is still possible that relationships which have held in the past may not hold in future. This will lead to the breakdown of any forecasting system.
Often a more valid use of past data is to use the level of past variability to identify unusually large increases in the number of cases. If we know that the number of cases of a disease has only varied historically between 10 and 100 per year, then the occurrence of 400 cases in a year would certainly be a cause for concern.
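One simple way to turn past variability into an alert rule is sketched below. The threshold of mean plus two standard deviations is an assumption chosen for illustration, not a rule given in the text, and the historical counts are invented to lie between 10 and 100 cases per year.

```python
import statistics

def alert_threshold(history, k=2.0):
    """Mean of past counts plus k sample standard deviations (assumed rule)."""
    return statistics.mean(history) + k * statistics.stdev(history)

def is_unusual(current, history, k=2.0):
    """True if the current count exceeds the threshold derived from history."""
    return current > alert_threshold(history, k)

# Hypothetical annual case counts, all within the historical range 10-100.
past_cases = [10, 35, 60, 100, 45, 80, 25, 55]

print(round(alert_threshold(past_cases), 1))
print(is_unusual(400, past_cases))  # far outside past variability
print(is_unusual(90, past_cases))   # high, but within past variability
```

Case counts are often skewed, so a threshold based on the historical maximum or on upper percentiles may behave better than one based on the standard deviation.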
Distribution surveys (mapping)
With passive surveillance you often get 'presence only' data - in other words you know the locations where an individual occurs, but you cannot say it is absent in other locations. With active sampling, where the aim is to map distribution, random sampling is not necessarily the best approach. A systematic grid sample is often preferable in order to provide good coverage.
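A systematic grid sample can be laid out as in the sketch below, which generates sampling points at a fixed spacing across a rectangular study area. The coordinates, units (kilometres) and spacing are hypothetical. In practice the grid origin is often given a random start, and points falling outside the actual study boundary are discarded.

```python
def systematic_grid(x_min, y_min, x_max, y_max, spacing):
    """Return (x, y) sampling points on a regular grid within the rectangle."""
    nx = int((x_max - x_min) // spacing) + 1  # points along x
    ny = int((y_max - y_min) // spacing) + 1  # points along y
    return [(x_min + i * spacing, y_min + j * spacing)
            for i in range(nx)
            for j in range(ny)]

# Hypothetical 10 km by 6 km study area sampled every 2 km.
grid_points = systematic_grid(0, 0, 10, 6, 2)

print(len(grid_points))
print(grid_points[:4])
```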
For mapping surveys a great deal of additional spatial data are required, including topographical features (contours, rivers, roads etc) and vegetation types. Such data may be obtained from a variety of sources. Map data can be obtained from existing maps, ground survey or remote sensing. Often remote sensing data must be ground-truthed to be useful. For example remote sensing will provide multicoloured maps of the different vegetation types in an area - but the different types identified must then be ground-truthed to link up the particular vegetation types to the particular colours on the maps. Attribute data comprise any georeferenced data describing features of the area.
Usually several data layers are input to the GIS multilayer data base - for example vegetation, rivers, roads, villages and disease prevalence. The base map data are stored in one of two formats - either raster-based (also called grid-based, or pixel-based), or vector-based. For a raster-based system a map is scanned, whilst for a vector-based system a map's features are traced over using a digitizing tablet. Some GIS systems can analyse both raster and vector
Data are output as different types of maps. Such maps may incorporate contours (such as altitude contours commonly found on maps), shading, and a variety of symbols of different sizes and shapes. The base map can be displayed in two or three dimensions.
We will take our examples of mapping from work carried out by us, at Nguruman in the Rift Valley of Kenya, on the distribution and abundance of tsetse flies - vectors of trypanosomosis in cattle. Two geographic information systems were used: ARC/INFO, which is a vector-based system, and CRIES, which is a raster-based system. Data could be converted between the two systems, so each system could be used for the tasks best suited to it. Data sources included Kenya Ordnance Survey maps, aerial photographs, satellite imagery, and extensive ground-based field data. Long-term data were available on tsetse fly numbers from 22 traps that were placed within the tsetse habitat.
The first map below shows a three-dimensional representation of the topography of the area. The escarpment edge bordering the Rift Valley lies to the west and is topped by the Losuate Hills. The trap positions marked in pink lie at the base of the escarpment.
The second map (above) is rotated round a little and shows the patch of denser vegetation within which the traps are set. That denser vegetation results from a series of streams which descend from the hills to the plains. The third map (above) is further rotated and shows the full range of vegetation types present in the area. This map was obtained from satellite imagery, which was then ground-truthed to match up vegetation types to the different colour signatures. Where there is no source of groundwater, the vegetation is dry grassland and scrub.
We then have to consider how to display the main response variable in the map - for example disease prevalence or number of insects caught. One option is to use a proportional circle map. The area of the
So far the maps used have only indicated the value of the variable of interest at specific locations. But usually the real interest is in how the variable is distributed over the whole area - not just at particular sites. For this one has to interpolate numbers or proportions in areas that fall between actual observations (in other words, where data are absent). This can be done using simple arithmetic or geometric means - or values can be weighted, depending upon values in the other data layers.
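The three interpolation options mentioned above can be sketched for a single unsampled location as follows. The neighbouring trap catches and the weights are hypothetical. In a real GIS the weights might be derived from another data layer such as vegetation type, but how they are derived is an assumption here.

```python
import math

def arithmetic_mean(values):
    """Simple average of the neighbouring observations."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean; requires all values to be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def weighted_mean(values, weights):
    """Average in which each neighbour contributes according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical catches at the three traps nearest the unsampled location.
neighbour_catches = [4, 16, 64]
# Hypothetical weights, e.g. doubling the influence of a trap that sits
# in the same vegetation type as the unsampled location.
layer_weights = [1.0, 1.0, 2.0]

print(arithmetic_mean(neighbour_catches))
print(round(geometric_mean(neighbour_catches), 6))
print(weighted_mean(neighbour_catches, layer_weights))
```

The geometric mean is pulled less strongly by a single very high catch than the arithmetic mean, which can matter for skewed count data.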
One way to do this would be to draw a choroplethic map. The whole area would be divided into discrete
A better way to visualise the distribution of tsetse would be to use isoplethic maps. An isoplethic map aims to represent the true boundaries between different values by joining all points of equal value. This produces a set of 'contour lines'. The areas between contour lines can be shaded or coloured to represent different ranges of values.