Biology, images, analysis, design...
|"It has long been an axiom of mine that the little things are infinitely the most important" |
Cox's proportional hazards regressionOn this page: Principles Cox's proportional hazards model Fitting the model Significance testing Model checking The multivariate model Time-dependent variables Checking a multivariate model Assumptions
The most common approach (although not perhaps the most obvious) is to base the regression model on the proportional hazards assumption. If we are comparing a new treatment with the standard treatment, it is assumed that the ratio of the hazard for an individual on a new treatment to that for an individual on the standard treatment remains constant over time. This applies even if the magnitude of hazards varies over time. The constant ratio is known as the hazard ratio. This assumption is met surprisingly often in medical research - a new treatment tends to change the overall mortality rate, rather than change the pattern of mortality over time. In this approach any explanatory variable acts multiplicatively on the hazard ratio - not directly on the failure time. In the most popular of these models - Cox's proportional hazards model - no underlying distribution of failure times is assumed. In another model - the Weibull proportional hazards model - the failure times are assumed to follow a theoretical distribution known as the Weibull distribution.
In an alternative group of models, the explanatory variables act multiplicatively directly on the failure time. These are known as the accelerated time failure models, and generally do not assume proportional hazards.
Cox's proportional hazards model
The basic model
The most frequently used regression model for survival analysis is Cox's proportional hazards model. We will first consider the model for the 'two group' situation since it is easier to understand the implications and assumptions of the model. We will then extend the model to the multivariate situation.
If we have two groups, one receiving the standard treatment and the other receiving the new treatment, and the proportional hazards assumption is met, then the ratio of the hazard function in one group to that in the other group at any point of time will be constant:
The hazard ratio can be regarded as a measure of relative risk. If the hazard ratio is less than 1, the new treatment is superior. If the hazard ratio is greater than 1, then the standard treatment is superior.
Because hazards are always positive, the hazard ratio must also be positive. It is more convenient when modelling to have parameters in the model that can vary in an unlimited way
This is known as the proportional hazards model for two groups. Note that, unlike the regression models we have encountered so far, there is no constant on the right hand side of the equation other than the regression coefficient β. This is because, were we to include a constant, it could be cancelled out by dividing the baseline hazard function by exp (constant). What we have to do now is to see how we estimate the unknown regression coefficient, and hence fit our data on survival times to Cox's model.
Fitting the model
The coefficient is estimated using the maximum likelihood method in much the same way as we did with logistic regression. But to do this we need the likelihood function for the proportional hazards model.
Let's consider a hypothetical data set comprising two treatment groups each of five individuals suffering from new variant CJD. One group receives the standard treatment, whilst the other receives a new treatment previously untested in humans. The first individual, a member of the standard treatment group, dies on day 20. At the time just before that death all the individuals in the two groups were at risk or are in the risk set. So what is the probability (p1) that the death observed on day 20 was of the subject who did die at that time?
This obviously depends on how many other individuals are in the risk set and their relative risks of death. If all others had the same risk of death, the probability would be one out of ten, or 0.1. But the risk of death varies depending on which group the individual is in. The risk of death for each of the risk set relative to baseline is equal to
A similar probability (p2 to p10)can be calculated each time a death occurs with the risk set steadily decreasing in size. These probabilities are then combined by multiplying them together to obtain the likelihood function for the proportional hazards model. Because probabilities tend to be very small, we usually use the log likelihood function which is obtained by adding together the natural logarithms of the probabilities.
All that then remains is to determine the value of β which maximises this function - in other words gives the maximum likelihood. This is usually done by software such as R using an iterative procedure such as the Newton-Raphson method.
This gives a hazard ratio of 0.1003. Note especially that the actual times of death are not used in maximising the likelihood function - only the rank order of the deaths in each group determines the risk set. Hence the treatment of time is non-parametric. Not surprisingly we therefore have to consider how to deal with ties. What happens if we have two individuals who die at the same time? The conventional approach (termed the Breslow method) is to consider the several deaths at time t are distinct and occur sequentially. Alternative methods include the Efron method and the exact method (all of which are offered in R).
The data we have used to demonstrate maximising the likelihood function contained no censored points, but censored points are readily dealt with. The product of probabilities is only taken over the individuals for whom death times are recorded. Hence the censored individuals do not contribute to the numerator of the likelihood function. But they are included in the summation over the risk sets at death times that occur before a censored time.
The preferred method to test the null hypothesis that the hazard ratio is equal to one (β = 0) is to use a likelihood ratio test, much like the G-test in Unit 9. We have already determined the log likelihood for our model which incorporates the explanatory variable. We can also determine the log likelihood for the null model - that model for which there are no explanatory variables in the linear component of the model. We have done this by taking β as equal to 0
Hence the log likelihood ratio statistic is given by -2 log Lnull model - (-2 log Lfull model).
The analysis in R including estimation of the hazard ratio and significance testing is given below. We conclude that the explanatory variable has a significant effect on the hazard ratio.
As you can see from the R output there are two other ways of testing the significance of the hazard ratio. If you assume the coefficient β is approximately normally distributed, we can use a Z-test to determine if the value of β differs significantly from 0. For this you need the standard error of β which is computed from the second order partial derivatives of the log-likelihood function. Squaring the Z value gives the Wald statistic which is approximately distributed as χ2. In our example we get a standard error of 1.112, which gives a Z-value of 2.06748 (P = 0.0387).
You have probably noticed that the two P-values are rather different - 0.0137 compared to 0.0387. This highlights the point that in these tests we have not met the 'large sample' assumptions of the tests. We only have a sample size of ten 'deaths' observed, so at best the tests can only be regarded as very approximate. For large samples (n > 100) the two methods should give similar results where we have only one explanatory variable. However, the log ratio tests are much to be preferred when dealing with several explanatory variables.
Given that we now have the standard error of the coefficient (and assuming a large sample size), we can also attach an approximate 95% confidence interval, assuming that the maximum likelihood estimate follows a normal distribution. This is given by β ±1.96 SE(β) which for our example is -4.4795 to -0.1205. We can convert back to the hazard ratios using exp(β) which gives us 0.011 to 0.886. The Z-test we have given above is equivalent to determining whether or not this confidence interval includes unity.
We look now at the matter of checking the assumptions of the model, in particular the assumption of proportional hazards. We saw previously that we could test this by simply plotting the cumulative survival function for each group against time. If the lines were parallel, we could assume proportional hazards.
There is another way of doing this which tells us not only whether the proportional hazards assumption is met, but also, as we shall see later, informs us about the distribution of failure times. Instead of using the cumulative survival function we use the cumulative hazard
In practice we estimate H from our data by taking minus the natural log of the cumulative survival function as shown here. We then plot the natural log of H(t) against either time or the natural log of time - a plot sometimes known (confusingly) as a log minus log plot because one is plotting ln(-ln S(t)) on the y axis. These plots for our example above are shown here:
If the two lines or curves are parallel, then the hazards can be considered proportional. The vertical separation of the two lines gives an approximate estimate of the log hazard ratio.
Another way to test the proportional hazards assumption (more appropriate with multivariate models) is to plot score residuals (also termed scaled Schoenfeld residuals or beta (t) - see below) against (transformed) time. If the proportional hazards assumption is true, beta(t) will be a horizontal line. The printout gives a test for slope=0.
As expected in this case we can readily accept that the line is horizontal, given that we could draw such a line within the dotted lines showing the 95% confidence band.
Let's see what these plots looks like for a data set where the proportional hazards assumption is clearly not met. These data are the emergence times of seedlings subjected to two different treatments to break dormancy. Since the survival plots cross we already suspect that the hazard ratio is not constant over time. the log cumulative hazard plot confirms our suspicions - clearly the lines are not parallel and we cannot assume proportional hazards.
This example can also be used to demonstrate the dangers of premature censoring. If the experiment had been terminated at day 400, we would have concluded that treatment one produced a more rapid emergence rate - a conclusion not supported by events after day 400.
We reach the same conclusions about the failure to meet the proportional hazards assumption if we look at the time-dependent coefficient beta(t). Clearly the line is not horizontal. The significance test also indicates a highly significant deviation (P = 0.001).
The multivariate model
Up till now we have only considered the situation where there are two groups, with a dummy variable taking the value 0 or 1 to define whether the individual is in the control or treatment group. The full power of the Cox regression model is realised when we extend it to the multivariate situation. Let us first recap on the assumptions and implications of the model for two groups (that is with a single explanatory variable), before we extend it to cover several explanatory variables:
Multivariate Cox regression is most heavily used in medical research, although it also increasingly used in veterinary and ecological studies. One common use in medical research is to adjust the estimator of the treatment effect in a randomised controlled clinical trial. Use of covariates allows one to deal with any confounding problems if there are any imbalances between the covariate and the treatment
How then do we extend the Cox regression model to include several explanatory variables? The explanatory variables are assumed to act multiplicatively on the hazard ratio. So if we have several such variables, the log of the hazard ratio for an individual is equal to the sum of the effects of the explanatory variables:
The model accommodates factors with multiple levels as well as variates in the same way as the other linear models that we have considered:
In order to fit the multivariate model we have to modify the likelihood function to allow for multiple explanatory variables. We will not go into the details of this aspect, but the modification is straight forward, and model fitting can be carried out on most good statistical software packages. Selecting the 'best' model is then carried out much as in other multiple regression techniques.
Up till now we have assumed that all explanatory variables are measured for each individual at the start of the study, and do not change over time. Whilst this is sufficient for variables that remain constant (such as ethnic group), it may not be so for variables that are time dependent. These may be internal variables that relate to a particular individual in the study such as blood pressure, or external variables such as levels of atmospheric pollutants. Such time-dependent variables can also be introduced into the Cox regression model to give what is known as the updated covariates (proportional hazards) model.
It is important to note at this stage that we are still assuming that the coefficients of the regression model remain constant - it is the values taken by the covariates that are changing. However, since the hazard of death at time t is no longer proportional to the baseline hazard, one could challenge whether it should still be described as a proportional hazards model. The hazard ratio can only be given after a specified time interval, rather than as a constant measure of risk.
The basic model is modified such that the hazard for an individual at time t is dependent on the value of the explanatory variable not at the start of the study but at time t:
The likelihood function is modified in like manner and the model is fitted in the usual way, although computational time can be greatly increased. The time dependent variable values must be known at the time of each death for all members of the risk set. The only problem comes in knowing exactly what value of the variable to use if they are not being measured very frequently and do not vary in a predictable way. For continuous variables linear interpolation can be used to estimate values between recording occasions. For categorical variables the value from the most recent reading is used.
Checking the multivariate model
We have spent much of this page discussing checking the various assumptions, but we summarize them here briefly to be consistent with other More Information pages.
Related topics :