Principles

Linear regression models assume that the response variable is a continuous measurement variable - or at least can be treated as such. Logistic regression (more strictly binary logistic regression), on the other hand, is appropriate for binary response variables, where individuals are assigned to one of two classes (say infected or uninfected). There are other forms of logistic regression where the response variable is instead ordinal (such as 'alive', 'half dead' or 'dead' - perhaps applied to members of the House of Lords in the UK) or nominal with more than two categories - we cover these briefly in the related topics.

With a binary variable it is customary to denote a positive response (say infected) by Y = 1 (in statistical terminology, 'successes') and a negative response (say uninfected) by Y = 0 ('failures'). The mean of this response variable equals the probability of success, p (that is, the proportion infected).

We might then consider a model such as p = a + b1X1 + b2X2 and so on, but

  1. the response variable is bounded (p must fall between 0 and 1)
  2. the typical response is not linear but is instead sigmoidal.

Hence we need an appropriate transformation - which is achieved using the logit transformation.
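
To see the problem concretely, here is a minimal sketch in Python (using only numpy, with made-up coefficients and dose values): a straight-line model for p can predict impossible probabilities, whereas a sigmoidal (logistic) curve always stays between 0 and 1.

```python
import numpy as np

dose = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

# A straight-line model for the probability itself can stray outside [0, 1]...
p_linear = 0.1 + 0.15 * dose
print(p_linear)              # [0.1 0.4 0.7 1.0 1.3] - the last value is impossible

# ...whereas the sigmoidal (logistic) form is always bounded between 0 and 1
p_logistic = 1 / (1 + np.exp(-(-2.0 + 0.9 * dose)))
print(np.round(p_logistic, 3))
```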

 

The logit transformation

Algebraically speaking -

logit (p)   =   ln [ p / (1 − p) ]
where
  • p is the probability of success.
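
As a quick numerical check (again just a Python sketch with numpy and some arbitrary probabilities), the logit stretches probabilities near 0 and 1 out towards −∞ and +∞, and its inverse maps any real number back into the interval (0, 1):

```python
import numpy as np

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit (the logistic function): maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.001, 0.1, 0.5, 0.9, 0.999])
print(np.round(logit(p), 2))   # [-6.91 -2.2   0.    2.2   6.91]
print(inv_logit(logit(p)))     # recovers the original probabilities
```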

With this transformation as p increases from 0 to 1, the logit increases from -∞ to +∞. It also linearizes the relationship so the logistic regression model can be specified as below:

Algebraically speaking -

logit (p)   =   β0 + β1X1 + β2X2 + ... + βkXk
where
  • p is the probability of success
  • β0 is the intercept
  • β1 to βk are the regression coefficients for explanatory variables X1 to Xk; each represents a change in log odds. They are more interpretable in exponent form (exp β or eβ), which converts them to odds ratios, as illustrated in the sketch below.
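
For example (a hypothetical sketch in Python - the coefficient values are invented, not taken from any real model), exponentiating each coefficient converts it from a log odds to an odds ratio:

```python
import numpy as np

# Hypothetical coefficients on the log odds scale (invented for illustration)
coefs = {"intercept": -2.3, "dose": 0.7, "sex_male": -0.4}

# Exponentiating converts log odds (and log odds ratios) to odds (and odds ratios)
odds_ratios = {name: round(np.exp(b), 2) for name, b in coefs.items()}
print(odds_ratios)
# exp(0.7) is about 2.01, so each unit increase in dose roughly doubles the odds of infection
```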

The estimated probability of success (p) can be obtained by rearranging the logistic regression equation thus:

Algebraically speaking -

p   =   exp(β0 + β1X1 + β2X2 + ... + βkXk) / [ 1 + exp(β0 + β1X1 + β2X2 + ... + βkXk) ]
where
  • p is the probability of success.
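
The rearranged equation is easy to apply directly. Below is a short Python sketch (numpy only, with invented coefficients) that converts a linear predictor into predicted probabilities of infection:

```python
import numpy as np

# Invented coefficients: intercept and a single explanatory variable (dose)
b0, b1 = -2.3, 0.7

dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
eta = b0 + b1 * dose                  # the linear predictor (log odds scale)
p = np.exp(eta) / (1 + np.exp(eta))   # the rearranged logistic regression equation
print(np.round(p, 3))                 # predicted probabilities of infection
```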

 

Estimating parameters

In modern parlance logistic regression is viewed as a generalized linear model. The parameters of the best fit model are estimated using maximum likelihood rather than least squares. Maximum likelihood is an iterative procedure that searches for the parameter values giving the smallest possible deviance between the observed and predicted values. The final value of the deviance is called −2 Log Likelihood (or −2LL or D). Note that although logistic regression generally 'works', there is little theory to say why the logistic model should describe any particular relationship (unlike with linear regression) - it is used mainly because it works well in practice.
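
The following Python sketch illustrates the idea on simulated infection data (the data and model are invented, and scipy's general-purpose optimizer stands in for the iterative algorithm a statistical package would use). It searches for the parameter values that minimize −2LL:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated data: probability of infection increases with dose
n = 200
dose = rng.uniform(0, 4, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

def neg2_log_lik(beta):
    """-2LL for a logistic model with an intercept and a dose effect."""
    eta = beta[0] + beta[1] * dose
    p = 1 / (1 + np.exp(-eta))
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg2_log_lik, x0=[0.0, 0.0])  # iterative search for the minimum deviance
print(fit.x)    # estimated intercept and slope (close to the true -2.0 and 0.8)
print(fit.fun)  # the final -2 log likelihood (-2LL)
```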

There are two different approaches to maximum likelihood estimation in logistic regression - the unconditional approach and the conditional approach. The unconditional approach is used when the number of degrees of freedom for the model is small relative to the number of observations. However when individual matching is used (whether 1:1 or 1:m), the model degrees of freedom are much larger and the conditional approach must be used (see related topic above).

 

Ungrouped or grouped data?

When analyzing data on a binary response variable (say infected versus uninfected), we must decide whether to input the raw binary data or to group them (for example as the number infected versus the number uninfected). The key issue here is whether one has unique values of one or more explanatory variables for each individual case. If so, one should input the raw binary data case by case. If not, one should sum the counts and re-code the binary response as the counts for each level of a two-level factor. Both approaches are commonly termed logistic regression. The maximum likelihood estimates for the grouped data will be the same as for the ungrouped data, and the increase in log-likelihood when extra regressors are added will also be the same: for a given model and coefficient the two log-likelihoods differ only by an additive constant. However, there are some important differences between the two approaches related to overdispersion, which we highlight in the worked examples.
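
One way to convince yourself of this equivalence is to fit the same (made-up) data both ways. The sketch below uses the statsmodels package in Python, one of several packages that accept either raw binary responses or a two-column matrix of success and failure counts:

```python
import numpy as np
import statsmodels.api as sm

# Raw (ungrouped) binary data: infection status for control (x = 0) and treated (x = 1) animals
x = np.repeat([0, 1], [50, 50])
y = np.r_[np.repeat([1, 0], [10, 40]),   # controls: 10 infected, 40 uninfected
          np.repeat([1, 0], [30, 20])]   # treated: 30 infected, 20 uninfected

ungrouped = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()

# The same data grouped: one row per level of x, columns = (successes, failures)
counts = np.array([[10, 40],
                   [30, 20]])
grouped = sm.GLM(counts, sm.add_constant(np.array([0, 1])),
                 family=sm.families.Binomial()).fit()

print(ungrouped.params)  # identical maximum likelihood estimates...
print(grouped.params)    # ...from the grouped analysis
```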

 

Significance testing in logistic regression

Overall significance

The overall significance of a logistic regression can be assessed with a likelihood ratio test, in which the null (constant only) model is compared to the current model including predictors. The larger the difference in likelihood, the greater the evidence that the model is significant. The log of the likelihood ratio is the difference between the two log likelihoods. In practice we work with minus twice the log of the likelihood ratio, since log likelihoods are always negative. As with the G-test, provided all frequencies are large, this statistic (G, minus twice the natural log of the likelihood ratio) has an approximately chi-square distribution.

Algebraically speaking -

G   =   −2 ln [ Lnull / Lfull ]
where
  • Lnull is the likelihood for the null model,
  • Lfull is the likelihood for the current (full) model

Hence the likelihood ratio statistic can equally be written as G = [−2 ln Lnull] − [−2 ln Lfull], the difference between the two values of −2LL.

The number of degrees of freedom is equal to the difference between the number of β-parameters being fitted under the two models.
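
The calculation is straightforward once the null and full models have been fitted. Below is a Python sketch on simulated data using statsmodels and scipy (both the data and the choice of package are ours, purely for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 150
dose = rng.uniform(0, 4, n)
age = rng.normal(3, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose + 0.3 * age))))

fit = sm.Logit(y, sm.add_constant(np.column_stack([dose, age]))).fit(disp=0)

# Likelihood ratio test of the full model against the constant-only (null) model
G = -2 * (fit.llnull - fit.llf)   # = (-2LL null) - (-2LL full)
df = fit.df_model                 # number of extra beta-parameters in the full model
print(G, df, chi2.sf(G, df))      # statsmodels also reports these as fit.llr and fit.llr_pvalue
```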

 

Testing of individual coefficients

The best way to test individual predictors is to again use a likelihood ratio test, this time comparing the log likelihood for the model without the predictor with the log likelihood for the full model:

Algebraically speaking -

G   =   −2 ln [ Lreduced / Lfull ]
where
  • Lreduced is the likelihood for the model without the predictor,
  • Lfull is the likelihood for the full model
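
A Python sketch of this comparison on simulated data (using statsmodels; the data are invented, and the predictor 'age' is deliberately given no real effect here):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 150
dose = rng.uniform(0, 4, n)
age = rng.normal(3, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))   # age has no real effect

full = sm.Logit(y, sm.add_constant(np.column_stack([dose, age]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(dose)).fit(disp=0)       # model without 'age'

# Likelihood ratio test for the 'age' coefficient (1 degree of freedom)
G = -2 * (reduced.llf - full.llf)
print(G, chi2.sf(G, df=1))
```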

An alternative approach to testing the statistical significance of each coefficient in the model is to use a Wald test. Here a z statistic is computed as the coefficient divided by its standard error. This is then squared, yielding a Wald statistic with an approximately chi-square distribution (one degree of freedom).

Algebraically speaking -

Wald statistic   =   b1² / SE1²
where
  • b1 is the coefficient for explanatory variable 1,
  • SE1 is its standard error

However, the Wald statistic can be unreliable for small sample sizes and/or large coefficients, so it may be better to stick to the likelihood-ratio test. The only justification we have found given for using the Wald statistic is that it is computationally easy and is given automatically in the output of most statistical computer packages.
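
For completeness, the Wald calculation itself is trivial, as this Python sketch (simulated data, statsmodels) shows:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 150
dose = rng.uniform(0, 4, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

fit = sm.Logit(y, sm.add_constant(dose)).fit(disp=0)

b1, se1 = fit.params[1], fit.bse[1]   # coefficient for dose and its standard error
wald = (b1 / se1) ** 2                # squared z statistic
print(wald, chi2.sf(wald, df=1))      # compare with the z test reported in the model summary
```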

 

Model simplification

Selecting the most parsimonious model is done in the same way as for multiple linear regression. It can be done on the basis of differences in deviance between pairs of models, although nowadays the Akaike Information Criterion (AIC) is generally preferred. For generalized linear models the AIC is minus twice the log likelihood penalized by twice the number of fitted parameters, so models carrying unnecessary predictors are penalized. When comparing two nested models from the same family, the difference in AIC reduces to ΔAIC = G − 2df, where G is the likelihood ratio statistic and df is the difference in the number of parameters between the two models.
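
As a sketch (Python with statsmodels, simulated data), AIC values can be extracted directly or reconstructed from the log likelihood and the number of fitted parameters:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 150
dose = rng.uniform(0, 4, n)
age = rng.normal(3, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

m1 = sm.Logit(y, sm.add_constant(dose)).fit(disp=0)                           # dose only
m2 = sm.Logit(y, sm.add_constant(np.column_stack([dose, age]))).fit(disp=0)   # dose + age

# AIC = -2 log likelihood + 2 x (number of fitted parameters); smaller is better
for m in (m1, m2):
    print(m.aic, -2 * m.llf + 2 * (m.df_model + 1))   # the two values agree
```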

 

Assessing explained variation ('goodness of fit')

Residual measures

There are several summary measures of the difference between observed and fitted values. The Pearson goodness of fit statistic is obtained by summing the squares of the Pearson residuals; the deviance goodness of fit statistic is obtained by summing the squares of the deviance residuals. Each of these statistics has degrees of freedom equal to the number of (grouped) data items minus the number of independent parameters fitted in the model. The number of independent parameters is 1 for the constant, 1 for a measurement variable and (number of levels − 1) for a nominal variable. Note this method can only be used when data are grouped, not when raw binary data are input.

If the data are sparse (in other words there are few observed values for some combinations of explanatory variables), the deviance goodness of fit statistic given above may not be distributed as χ2. In this situation, and providing one only has continuous (measurement) explanatory variables and a large number of observations (>400), it is preferable to use the Hosmer-Lemeshow goodness of fit test. The predicted probabilities are divided into deciles and a Pearson chi-square statistic is then computed comparing the predicted with the observed frequencies (in a 2 × 10 table). Non-significant values indicate a good fit to the data and, therefore, good overall model fit.
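
Since grouping into deciles is the only fiddly step, here is a from-scratch Python sketch of the Hosmer-Lemeshow calculation on simulated data (our own illustrative implementation, not a standard library routine):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(6)
n = 500
dose = rng.uniform(0, 4, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

fit = sm.Logit(y, sm.add_constant(dose)).fit(disp=0)
p_hat = fit.predict()   # fitted probabilities

# Assign each case to a decile of predicted probability
edges = np.quantile(p_hat, np.linspace(0, 1, 11))
group = np.digitize(p_hat, edges[1:-1])   # group labels 0..9

# Hosmer-Lemeshow statistic: compare observed and expected successes/failures per decile
hl = 0.0
for g in range(10):
    idx = group == g
    n_g = idx.sum()
    obs1, exp1 = y[idx].sum(), p_hat[idx].sum()
    obs0, exp0 = n_g - obs1, n_g - exp1
    hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0

print(hl, chi2.sf(hl, df=10 - 2))   # a non-significant result suggests no evidence of poor fit
```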

There is no true R2 value in logistic regression, but statisticians have tried hard to come up with analogous measures by treating deviances in the same way as the residual sums of squares in a least squares regression. One such attempt is termed the Cox & Snell pseudo-R2.

Algebraically speaking -

Cox & Snell pseudo-R2
R2   =   1 − [ Lnull / Lfull ]^(2/n)
where
  • Lnull is the likelihood for the null model,
  • Lfull is the likelihood for the full model,
  • n is the number of observations

Because this R-squared value cannot reach 1.0, Nagelkerke (1991) proposed a correction. The correction rescales the Cox & Snell version so that 1.0 becomes an attainable value for R-square.

Algebraically speaking -

Nagelkerke-R2
R2   =   [ 1 − ( Lnull / Lfull )^(2/n) ]  /  [ 1 − Lnull^(2/n) ]
where
  • Lnull is the likelihood for the null model,
  • Lfull is the likelihood for the full model,
  • n is the number of observations

Although Nagelkerke's R2 does range from zero to 1, the resemblance to R2 in least squares regression remains superficial. Different pseudo-R2s (there are at least eight in common use) give somewhat different answers, even among those that range from zero to 1.

Pseudo R-squareds therefore cannot be interpreted as the proportion of variance explained, nor can they be compared across datasets. However, they are valid and useful for comparing multiple models predicting the same outcome on the same dataset, providing the same type of pseudo R-squared is used throughout.
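
Both the Cox & Snell and Nagelkerke values are easily computed from the null and full model log likelihoods, as in this Python sketch (statsmodels, simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
dose = rng.uniform(0, 4, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

fit = sm.Logit(y, sm.add_constant(dose)).fit(disp=0)

# Cox & Snell: 1 - (Lnull / Lfull)^(2/n), computed on the log scale to avoid underflow
cox_snell = 1 - np.exp((2 / n) * (fit.llnull - fit.llf))
# Nagelkerke rescales Cox & Snell so that 1.0 is an attainable maximum
nagelkerke = cox_snell / (1 - np.exp((2 / n) * fit.llnull))
print(cox_snell, nagelkerke)
```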

It is more difficult to use residuals to examine fits in logistic regression than in linear regression because the response is discrete. However, deviance residuals and Pearson residuals can both be used. Diagnostic plots can be made but are hard to interpret. High leverage is not really a problem as the observations are bounded.
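
A brief Python sketch (statsmodels GLM, simulated data) showing how the two types of residual can be extracted:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
dose = rng.uniform(0, 4, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * dose))))

fit = sm.GLM(y, sm.add_constant(dose), family=sm.families.Binomial()).fit()

# Pearson and deviance residuals; with raw binary data they fall into discrete bands,
# which is why the usual residual-versus-fitted plots are hard to interpret
print(fit.resid_pearson[:5])
print(fit.resid_deviance[:5])
```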

If you want the coefficients to be meaningful, it should be noted that multicollinearity (two or more explanatory variables being correlated) is as much of a problem for logistic regression as it is for multiple regression. It should therefore be checked at the beginning of the analysis.
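
One simple check is to compute variance inflation factors for the explanatory variables before fitting the logistic model. The sketch below (Python with statsmodels, using deliberately correlated made-up variables) illustrates this:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
dose = rng.uniform(0, 4, n)
weight = 2 * dose + rng.normal(0, 0.5, n)   # deliberately correlated with dose

X = sm.add_constant(np.column_stack([dose, weight]))

# Variance inflation factors for the explanatory variables (column 0 is the constant);
# values much above about 5-10 warn that the coefficients will be poorly determined
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```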

Related topics:

Conditional logistic regression

Ordinal regression

Poisson regression