"It has long been an axiom of mine that the little things are infinitely the most important"
Cox's proportional hazards regression model: Use & misuse
(proportional hazards assumption, multicollinearity, explained variation, model selection)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
The semiparametric Cox proportional hazards model is widely used to model survival in medical research. It is less heavily used in veterinary and livestock production research (where the Weibull model is popular) and relatively rarely in ecological/wildlife research where it may be more difficult to follow a cohort of individuals.
Although the Cox model makes no assumptions about the distribution of failure times, it does assume that the hazard functions in the different strata (the groups being compared) are proportional, so that their ratio stays constant over time - the so-called proportional hazards assumption. If one is to make any sense of the individual coefficients, it also assumes that there is no multicollinearity among covariates. Both these assumptions should always be tested, but in the small sample of papers we reviewed there was a tendency to check either both assumptions or neither. Where neither was reported as having been checked - for example in the study on survival of midwife toads infected with chytridiomycosis - failure to meet the assumptions may well have affected the conclusions of the study. Most authors interpreted coefficients correctly (that is, a higher hazard ratio indicates a shorter survival time relative to the comparison group), but there were sufficient (typographical?) errors in one paper to make one suspect that referees had (partially) corrected this aspect.
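To make the interpretation of coefficients concrete, here is a minimal sketch of how a fitted Cox coefficient translates into a hazard ratio with a Wald confidence interval. The function name `hazard_ratio_with_ci` and the numerical values are purely illustrative assumptions, not taken from any of the papers reviewed. The sketch also assumes the proportional hazards assumption holds, since a single hazard ratio is only meaningful if it is constant over time.

```python
import math
from statistics import NormalDist


def hazard_ratio_with_ci(beta, se, level=0.95):
    """Convert a Cox regression coefficient into a hazard ratio.

    beta  : estimated log hazard ratio (the model coefficient)
    se    : standard error of beta
    level : confidence level for the Wald interval

    Returns (hazard_ratio, lower_limit, upper_limit).
    """
    # Two-sided normal quantile, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # The interval is built on the log scale, then back-transformed
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


# Hypothetical coefficient and standard error, for illustration only
hr, lower, upper = hazard_ratio_with_ci(beta=0.693, se=0.25)
print(f"hazard ratio = {hr:.2f} ({lower:.2f} to {upper:.2f})")

# HR > 1: higher hazard, hence shorter survival than the comparison group
# HR < 1: lower hazard, hence longer survival than the comparison group
```

A coefficient of zero gives a hazard ratio of exactly one (no difference between groups), which is why an interval that spans one corresponds to a non-significant Wald test.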
The remaining issues were the same as those noted for other multiple regression modelling. Misspecification bias - missing out (the most) important variable(s) - was especially apparent in a study of factors affecting the likelihood of suicide after leaving the armed services. Sexual orientation is known to be one of the biggest factors affecting suicide of young men, but was not considered in the study. Nor were home environment factors considered in a study of factors affecting relapse after treatment for schizophrenia. Such issues would become more apparent if authors included some measure of explained variation (an R-squared analogue), but this seems to be rarely done and indeed is rarely recommended in the standard texts.
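As a rough illustration of what an R-squared analogue might look like, the sketch below computes a generalized (likelihood-ratio based) R-squared from the log partial likelihoods of the null and fitted models. This is only one of several possible measures, and we are not claiming it is the one the specialist literature on explained variation prefers. The function name, the choice of `n`, and the example values are all assumptions made for illustration.

```python
import math


def generalized_r2(loglik_null, loglik_model, n):
    """Likelihood-ratio based R-squared analogue.

    loglik_null  : log partial likelihood with no covariates
    loglik_model : log partial likelihood of the fitted model
    n            : sample size used for scaling (an assumption: some
                   would use the number of events rather than the
                   number of subjects when censoring is heavy)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Likelihood ratio statistic comparing fitted and null models
    lr_statistic = 2.0 * (loglik_model - loglik_null)
    return 1.0 - math.exp(-lr_statistic / n)


# Hypothetical log partial likelihoods, for illustration only
r2 = generalized_r2(loglik_null=-100.0, loglik_model=-90.0, n=50)
print(f"generalized R-squared = {r2:.3f}")
```

A value close to zero alongside several 'significant' coefficients would be one warning sign that important explanatory variables have been left out.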
As for model selection methods, the preferred approach of limited manual backwards variable selection followed by validation procedures (recommended by Harrell) is still little used. The automated stepwise procedure is known to be biased and inefficient at obtaining the optimal model, as is preliminary screening followed by forward selection. Sometimes we are simply not told what simplification procedure was followed. However, the wildlife study of natal dispersal distance in American martens was a good example of how it should be done. Lastly, we note the importance of adequate group sizes - using group sizes of 4-5 animals (as in a study of scrapie in sheep) is virtually guaranteed to demonstrate no significant difference in survival between groups!
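The point about group sizes can be made with some simple arithmetic. The power of a two-group survival comparison depends mainly on the number of events (deaths) observed, not the number of animals enrolled. The sketch below uses Schoenfeld's approximation for the total number of events required. This is a widely used formula that we have brought in for illustration, not something drawn from the studies reviewed, and the function name and default settings are assumptions.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    """Approximate total number of events needed to detect a hazard ratio.

    Uses Schoenfeld's approximation for a two-sided test comparing
    two groups under proportional hazards.

    hazard_ratio : the true hazard ratio one hopes to detect
    alpha        : two-sided significance level
    power        : desired probability of detecting the effect
    allocation   : proportion of subjects in the first group
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and not equal to 1")
    if not 0 < allocation < 1:
        raise ValueError("allocation must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    log_hr = math.log(hazard_ratio)
    events = (z_alpha + z_beta) ** 2 / (
        allocation * (1.0 - allocation) * log_hr ** 2
    )
    return math.ceil(events)


# Detecting a doubling of the hazard with 80% power at the 5% level
print(required_events(2.0))  # → 66
```

With two groups of 4-5 animals there can be at most 8-10 events in total, far short of the 66 or so needed to detect even a doubling of the hazard with reasonable power.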
What the statisticians say
Recent texts on survival analysis include Kleinbaum & Klein (2011) and Machin et al. (2006). Classic older texts include Hosmer & Lemeshow (1999) and Collett (1991). Most general medical statistics texts also cover survival analysis, for example Armitage & Berry (2002) in Chapter 17. Fox (2001) introduces ecologists to failure time analysis. McCullagh & Nelder (1989) look at treating Cox proportional hazards regression as a generalized linear model in Chapter 9.
Chan (2004) provides a concise primer on survival analysis for medics. Clark et al. (2003a, 2003b) and Bradburn et al. (2003a, 2003b) provide a comprehensive review of survival techniques and Cox regression for medical researchers. Peduzzi et al. (2002) include a good (but brief) summary of methods. Fleming & Lin (2000) review past developments and future directions of survival analysis in clinical trials. Thompson et al. (1998), Lee (1994), and Lee & Chia (1993) advocate estimation of the prevalence ratio rather than the odds ratio in cross-sectional studies using (amongst other methods) the Cox proportional hazards model.
Xue et al. (2007) consider the analysis of time to event data in the presence of collinearity between covariates. Heagerty & Zheng (2002) look at ROC curves for describing the predictive accuracy of survival models. Schemper & Henderson (2000) and Schemper & Stare (1996) consider measures of explained variation for the Cox proportional hazards regression model. Harrell et al. (1996) look at issues in model development and evaluating goodness of fit with special emphasis on the Cox regression model. Schemper (1993) considers how to assess the relative importance of prognostic factors in studies of survival. Lin et al. (1993) and Grambsch & Therneau (1994) look at the use of various residuals for checking the Cox model. Schemper et al. (1992) consider Cox analysis of survival data with non-proportional hazard functions.
Zens & Peart (2003), Lebreton et al. (1993) and Muenchow (1986) review techniques available to ecologists for analyzing survival data. Martel et al. (2008) use the Cox model to investigate patch time allocation in parasitoids following earlier work by Wajnberg (2006), Wajnberg et al. (2003) and van Alphen et al. (2003). Eberhardt (2002) looks at the use of Cox's proportional hazards model for trend data. Dungan et al. (2003) use interval censored failure time analysis to investigate leaf lifespans. Tenhumberg et al. (2001) and Moya-Laraño & Wise (2000) look at the use of survival regression analysis in behavioural studies.
Wikipedia provides sections on survival analysis, the Kaplan-Meier survival estimator, the logrank test and proportional hazards models. John Fox (2002) provides an excellent guide to using R for Cox proportional hazards regression. Other R guides include those by Michael Crawley, David Diez and Mai Zhou. Thomas Lumley (2004) describes the survival package in R. Stephen Jenkins (2004) provides a fairly detailed account of different aspects of survival analysis.