Logistic regression
Linear regression models assume that the response variable is a continuous measurement variable - or at least can be treated as such. Logistic regression (more strictly, binary logistic regression), on the other hand, is appropriate for binary response variables, where individuals are assigned to one of two classes (say infected or uninfected). There are other forms of logistic regression where the response variable is instead ordinal (such as 'alive', 'half dead' or 'dead' - perhaps applied to members of the House of Lords in the UK) or nominal with more than two categories - we cover these briefly in the related topics.
With a binary variable it is customary to denote a positive response (say infected) by Y = 1 (a 'success' in statistical terminology) and a negative response (say uninfected) by Y = 0 (a 'failure'). The mean of this response variable equals the probability of success p (that is, the proportion infected).
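We might then consider a model such as p = a + b1X1 + b2X2 and so on, but the right-hand side of such a model can take any value from -∞ to +∞, whereas p is bounded by 0 and 1, so the model could predict impossible probabilities.
Hence we need an appropriate transformation of p, which is provided by the logit transformation.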
The logit transformation
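The logit of p is the natural log of the odds of success:
logit(p) = ln[p / (1 - p)]
With this transformation, as p increases from 0 to 1, the logit increases from -∞ to +∞. It also linearizes the relationship, so the logistic regression model can be specified as:
logit(p) = ln[p / (1 - p)] = a + b1X1 + b2X2 + ...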
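The estimated probability of success (p) can be obtained by rearranging the logistic regression equation thus:
p = exp(a + b1X1 + b2X2 + ...) / [1 + exp(a + b1X1 + b2X2 + ...)] = 1 / [1 + exp(-(a + b1X1 + b2X2 + ...))]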
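The logit transformation and its inverse can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values (a, b1) are hypothetical and do not come from any fitted model.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability corresponding to a linear predictor eta = a + b1*X1 + ..."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and predictor value, for illustration only
a, b1 = -2.0, 0.5
x1 = 3.0

eta = a + b1 * x1        # linear predictor on the logit scale
p_hat = inv_logit(eta)   # back-transformed to a probability
print(eta, p_hat)
```

Whatever value the linear predictor takes, the back-transformed probability always lies between 0 and 1.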
In modern parlance logistic regression is viewed as a generalized linear model. The parameters of the best-fit model are estimated using maximum likelihood rather than least squares. Maximum likelihood estimation is iterative: starting from initial values, the coefficients are repeatedly adjusted (using calculus) until the likelihood is maximized, which is equivalent to finding the smallest possible deviance between the observed and predicted values. The final value for the deviance is called -2 Log Likelihood (or -2LL or D). Note that, unlike least squares linear regression with normal errors, significance tests in logistic regression rest on large-sample approximations rather than exact distributional theory.
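To make the iterative nature of maximum likelihood concrete, here is a minimal sketch that fits a model with a single predictor by Newton-Raphson steps. It is not how any particular statistical package is implemented, and the data, the function names and the fixed number of iterations are all assumptions made for illustration.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = a + b*x by Newton-Raphson; returns (a, b, deviance)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [inv_logit(a + b * xi) for xi in x]
        # Score vector: first derivatives of the log likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix: minus the second derivatives
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information * step = score)
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    p = [inv_logit(a + b * xi) for xi in x]
    log_lik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                  for yi, pi in zip(y, p))
    return a, b, -2.0 * log_lik   # deviance of raw binary data is -2LL

# Hypothetical raw binary data (1 = infected, 0 = uninfected)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

a, b, deviance = fit_logistic(x, y)
print(a, b, deviance)
```

A fixed iteration count keeps the sketch short; real software stops when the change in deviance falls below a tolerance, and the iterations do not converge if the two classes are perfectly separated by the predictor.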
There are two different approaches to maximum likelihood estimation in logistic regression - the unconditional approach and the conditional approach. The unconditional approach is used when the number of degrees of freedom for the model is small relative to the number of observations. However, when individual matching is used (whether 1:1 or 1:m), the model degrees of freedom are much larger and the conditional approach must be used (see related topic above).
Ungrouped or grouped data?
When analyzing data on a binary response variable (say infected versus uninfected), we must decide whether to input the raw binary data or to group them (for example, as number infected versus number uninfected). The key issue is whether each individual case has a unique combination of values of the explanatory variables. If so, one should input the raw binary data case by case. If not, one can sum the counts for each combination and re-code the binary response as counts of successes and failures (the counts of a 2-level factor). Both approaches are commonly termed logistic regression. The maximum likelihood estimates for the grouped data will be the same as for the ungrouped data, and the increase in log-likelihood when extra regressors are added will also be the same. For a given model and set of coefficients, the two log-likelihoods differ only by an additive constant. However, there are some important differences between the two approaches related to overdispersion, which we will highlight in the worked examples.
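The additive constant can be checked numerically. The sketch below uses one hypothetical group of ten individuals sharing the same covariate values, and compares the log likelihood of the raw 0/1 responses with that of the same data summarized as a binomial count. The function names and data are invented for illustration.

```python
import math

def bernoulli_loglik(p, y):
    """Log likelihood of raw 0/1 responses y sharing success probability p."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in y)

def binomial_loglik(p, successes, n):
    """Log likelihood of the same data summarized as successes out of n."""
    return (math.log(math.comb(n, successes))
            + successes * math.log(p)
            + (n - successes) * math.log(1.0 - p))

# Hypothetical group: 10 individuals with identical covariates, 3 infected
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
n, successes = len(y), sum(y)

gaps = []
for p in (0.2, 0.3, 0.5):
    gap = binomial_loglik(p, successes, n) - bernoulli_loglik(p, y)
    gaps.append(gap)
    print(p, gap)   # the gap is the log of the binomial coefficient, whatever p is
```

Because the gap does not depend on p, it cancels whenever two models are compared on the same data, which is why the estimates and the changes in log-likelihood agree between the two approaches.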
Significance testing in logistic regression
The overall significance of a logistic regression can be assessed with a likelihood ratio test, in which the null (constant only) model is compared to the current model including predictors. The larger the difference in log likelihood, the greater the evidence that the predictors improve the fit. The log of the likelihood ratio is the difference between these two log likelihoods. In practice we work with minus twice the log of the likelihood ratio, since log likelihoods are always negative. As with the G-test, provided all frequencies are large, this statistic (minus twice the natural log of the likelihood ratio, denoted G) has an approximately chi-square distribution.
Hence the log likelihood ratio statistic is given by:
G = -2 log L(null model) - (-2 log L(full model))
The number of degrees of freedom is equal to the difference between the number of β-parameters being fitted under the two models.
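As a sketch of the arithmetic, the statistic and its P-value can be computed as follows for the case where the two models differ by a single parameter. The log likelihood values are hypothetical, and the helper for the chi-square tail is only valid for one degree of freedom (it uses the identity linking that distribution to the standard normal).

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def likelihood_ratio_statistic(loglik_null, loglik_full):
    """G = -2 log L(null model) - (-2 log L(full model))."""
    return -2.0 * loglik_null - (-2.0 * loglik_full)

# Hypothetical log likelihoods: constant-only model versus one added predictor
loglik_null = -34.2
loglik_full = -30.1

G = likelihood_ratio_statistic(loglik_null, loglik_full)
p_value = chi2_sf_1df(G)   # df = 1 because the models differ by one parameter
print(G, p_value)
```

For models differing by more than one parameter, the same G would be referred to a chi-square distribution with the corresponding number of degrees of freedom, which needs a statistical library or tables.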
Testing of individual coefficients
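The best way to test individual predictors is to again use a likelihood ratio test, this time comparing the log likelihood for the model without the predictor with the log likelihood for the full model:
G = -2 log L(model without predictor) - (-2 log L(full model))
The degrees of freedom equal the number of parameters dropped along with the predictor (1 for a measurement variable).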
An alternative approach to testing the statistical significance of each coefficient in the model is to use a Wald test. Here a Z statistic is computed as the coefficient divided by its standard error. This is then squared, yielding a Wald statistic with an approximately chi-square distribution on one degree of freedom.
However, the Wald statistic can be unreliable for small sample sizes and/or large coefficients, so it may be better to stick to the likelihood-ratio test. The only justification we have found given for using the Wald statistic is that it is computationally easy and is given automatically in the output of most statistical computer packages.
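The Wald calculation itself is short, as this sketch shows. The coefficient and standard error are hypothetical values of the kind read off package output, and the function name is invented for illustration.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (Z, Wald statistic, two-sided P-value) for one coefficient."""
    z = coefficient / standard_error
    wald = z * z
    # Two-sided normal tail for Z, identical to the chi-square (1 df) tail for Z squared
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, wald, p_value

# Hypothetical estimate and standard error
z, wald, p_value = wald_test(0.85, 0.40)
print(z, wald, p_value)
```

The ease of this calculation is exactly the justification mentioned above. It says nothing about whether the large-sample approximation is adequate for a given dataset.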
Model simplification
Selecting the most parsimonious model is done in the same way as for multiple linear regression. It can be done on the basis of differences in deviance between model pairs, although nowadays the Akaike Information Criterion (AIC) is generally preferred. For generalized linear models AIC is minus twice the log likelihood penalized for the number of parameters fitted (k), that is AIC = -2 log L + 2k, and the model with the lower AIC is preferred. When comparing two nested models from a family of models, the difference in AIC reduces to G - 2df, where G is the log likelihood ratio statistic and df is the difference in degrees of freedom of the two models.
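The link between AIC and the log likelihood ratio statistic can be checked with a few lines of arithmetic. This sketch assumes the usual definition of AIC (minus twice the log likelihood plus twice the number of fitted parameters). The log likelihoods and parameter counts are hypothetical.

```python
def aic(loglik, n_parameters):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * n_parameters

# Hypothetical nested models: the full model has two extra parameters
loglik_reduced, k_reduced = -30.1, 2
loglik_full, k_full = -28.9, 4

G = -2.0 * loglik_reduced - (-2.0 * loglik_full)   # log likelihood ratio statistic
df = k_full - k_reduced                            # difference in parameters
delta_aic = aic(loglik_reduced, k_reduced) - aic(loglik_full, k_full)

print(G, df, delta_aic)   # delta_aic equals G - 2*df
```

Here the difference is negative, meaning the reduced model has the lower AIC: the improvement in fit from the two extra parameters does not outweigh the penalty for including them.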
Assessing explained variation ('goodness of fit')
There are several summary measures for the difference between observed and fitted values. The Pearson goodness of fit statistic is obtained by summing the squares of the Pearson residuals. The deviance goodness of fit statistic is obtained by summing the squares of the deviance residuals. Each of these statistics has degrees of freedom equal to the number of (grouped) data items minus the number of independent parameters fitted in the model. The number of independent parameters is 1 for the constant, 1 for each measurement variable and (number of levels − 1) for each nominal variable. Note this method can only be used when data are grouped and not when raw binary data are input.
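For grouped data the two kinds of residual, and the statistics obtained by summing their squares, can be sketched as below. The group counts and fitted probabilities are hypothetical (they are not the output of a real fit), and the function names are invented for illustration.

```python
import math

def pearson_residual(y, n, p):
    """(observed - fitted) / binomial standard deviation, for y successes out of n."""
    return (y - n * p) / math.sqrt(n * p * (1.0 - p))

def deviance_residual(y, n, p):
    """Signed square root of one group's contribution to the deviance."""
    fitted = n * p
    term1 = y * math.log(y / fitted) if y > 0 else 0.0
    term2 = (n - y) * math.log((n - y) / (n - fitted)) if y < n else 0.0
    sign = 1.0 if y >= fitted else -1.0
    return sign * math.sqrt(max(0.0, 2.0 * (term1 + term2)))

# Hypothetical groups: (number infected, group size, fitted probability)
groups = [(3, 20, 0.12), (7, 20, 0.30), (9, 20, 0.52), (16, 20, 0.75)]

pearson = [pearson_residual(y, n, p) for y, n, p in groups]
deviance = [deviance_residual(y, n, p) for y, n, p in groups]

X2 = sum(r * r for r in pearson)    # Pearson goodness of fit statistic
D = sum(r * r for r in deviance)    # deviance goodness of fit statistic
df = len(groups) - 2                # 4 groups minus (constant + one measurement variable)
print(X2, D, df)
```

With reasonably large groups the two statistics are usually close to each other. A large discrepancy between them is itself a hint that the chi-square approximation may be unreliable.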
If the data are sparse (in other words there are few observed values for some combinations of explanatory variables), the deviance goodness of fit statistic given above may not be distributed as χ². In this situation, and provided one only has continuous (measurement) explanatory variables and a large number of observations (>400), it is preferable to use the Hosmer-Lemeshow test. The cases are ordered by predicted probability and divided into ten groups (deciles), and a Pearson chi-square is then computed that compares the predicted to the observed frequencies (in a 2 × 10 table). A non-significant value indicates no evidence of lack of fit.
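The grouping step of the Hosmer-Lemeshow test can be sketched as follows. This is a simplified version that cuts the ordered cases into ten equal-sized groups and ignores ties. The data are simulated purely for illustration, and the degrees of freedom are taken as the number of groups minus 2, the usual convention for this test.

```python
import random

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic from 0/1 outcomes y and fitted probabilities p."""
    order = sorted(range(len(p)), key=lambda i: p[i])   # cases ranked by fitted p
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        observed = sum(y[i] for i in idx)    # observed successes in this group
        expected = sum(p[i] for i in idx)    # expected successes in this group
        # Pearson contributions from the success and failure cells
        statistic += (observed - expected) ** 2 / expected
        statistic += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return statistic, groups - 2

# Simulated illustration only: 400 fitted probabilities, outcomes drawn from them
random.seed(1)
n = 400
p = [0.05 + 0.9 * (i + 0.5) / n for i in range(n)]
y = [1 if random.random() < pi else 0 for pi in p]

statistic, df = hosmer_lemeshow(y, p)
print(statistic, df)   # refer the statistic to a chi-square distribution on df
```

Because the outcomes here are generated from the fitted probabilities themselves, the statistic should usually be unremarkable. Reversing the outcomes gives a very large value, as expected for a badly fitting model.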
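There is no true R2 value in logistic regression, but statisticians have tried very hard to come up with analogous measures by treating deviances in the same way as the residual sums of squares in a least squares regression. One such attempt is termed the Cox & Snell pseudo-R2:
R2 = 1 - (Lnull / Lmodel)^(2/n)
where Lnull and Lmodel are the likelihoods of the constant-only and fitted models and n is the number of observations. Its maximum is less than 1, so Nagelkerke's R2 rescales it by dividing by that maximum, 1 - (Lnull)^(2/n).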
Although Nagelkerke's R2 does range from zero to 1, the resemblance to R2 in least squares regression remains superficial. Different pseudo R2s (there are at least eight in common use) all give somewhat different answers including those that range from zero to 1.
Pseudo R-squareds therefore cannot be interpreted as the proportion of variance explained, nor can they be compared across datasets. However, they are valid and useful for comparing multiple models predicting the same outcome on the same dataset, provided the same type of pseudo R-squared is used throughout.
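Two of the pseudo R-squared measures mentioned above can be computed directly from the log likelihoods of the null and fitted models. The sketch below uses the standard published formulas for the Cox & Snell and Nagelkerke measures. The log likelihoods and sample size are hypothetical.

```python
import math

def cox_snell_r2(loglik_null, loglik_model, n):
    """1 - (L_null / L_model)^(2/n), computed on the log scale."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Cox & Snell value divided by its maximum possible value."""
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell_r2(loglik_null, loglik_model, n) / max_r2

# Hypothetical log likelihoods from a dataset of 50 observations
loglik_null, loglik_model, n = -34.2, -30.1, 50

cs = cox_snell_r2(loglik_null, loglik_model, n)
nk = nagelkerke_r2(loglik_null, loglik_model, n)
print(cs, nk)
```

The two measures give different answers for the same pair of models, which illustrates why one type must be used consistently when comparing models.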
It is more difficult to use residuals to examine fits in logistic regression than in linear regression because the response is discrete. However, deviance residuals and Pearson residuals can both be used. Diagnostic plots can be made but are hard to interpret. High leverage is not really a problem as the observations are bounded.
If you want the coefficients to be meaningful, it should be noted that multicollinearity (two or more explanatory variables being correlated) is as much of a problem for logistic regression as it is for multiple regression. It should therefore be checked at the beginning of the analysis.