The regression model

Simple linear regression provides a means to model a straight-line relationship between two variables. In classical (or asymmetric) regression one variable (Y) is called the response or dependent variable, and the other (X) is called the explanatory or independent variable. This is in contrast to correlation, where there is no distinction between Y and X in terms of which is an explanatory variable and which a response variable.

The regression model is given by:

Y = α + βX + ε
where α is the y intercept (the value of Y where X = 0), β is the slope of the line, and ε is a random error term.

It may also be given as:

Y = β0 + β1X + ε
where β0 is the y intercept, β1 is the slope of the line, and ε is a random error term.

In the traditional regression model, values of the X variable are assumed to be fixed by the experimenter. The model is still valid if X is random (as is more commonly the case), but only if X is measured without error. If there is substantial measurement error on X, and the values of the estimated parameters are of interest, then errors-in-variables regression should be used. Errors on the response variable are assumed to be independent, identically distributed and normally distributed.


Estimating model parameters

The parameters of the regression model are estimated from the data using ordinary least squares.

Algebraically speaking -

$$ b = \frac{\mathrm{Covariance}(X,Y)}{\mathrm{Variance}(X)} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n} $$
  • b is the estimate of the slope coefficient (β),
  • X and Y are the individual observations,
  • $\bar{X}$ and $\bar{Y}$ are the means of X and Y,
  • n is the number of bivariate observations.

$$ a = \bar{Y} - b\bar{X} $$

  • a is the estimate of the Y intercept (α) (the value of Y where X = 0).
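
The least squares estimates can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `ols_fit` and the five data points are made up for the example and do not come from any real study.

```python
def ols_fit(x, y):
    """Return (a, b): intercept and slope estimated by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Corrected sum of cross-products and corrected sum of squares of X
    sxy = sum_xy - sum_x * sum_y / n
    sxx = sum_x2 - sum_x ** 2 / n
    b = sxy / sxx                    # slope
    a = sum_y / n - b * sum_x / n    # intercept: mean of Y minus b times mean of X
    return a, b


# Made-up illustrative data (an assumption, not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

For these illustrative values the slope comes out at about 1.99 and the intercept at about 0.05.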


Testing the significance of a regression

There are several ways the significance of a regression can be tested. Providing errors are normally and identically distributed, a parametric test can be used. Analysis of variance is often the preferred approach, although one can also use a t-test to test whether the slope is significantly different from zero. If errors are not normally and identically distributed, then a randomization test should be used.
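
One way a randomization test of the slope might be set up is sketched below. It assumes the null hypothesis is that Y is unrelated to X, so the Y values can be shuffled freely against X; the function names, the number of permutations and the data are all illustrative assumptions rather than a prescribed procedure.

```python
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx


def randomization_test_slope(x, y, n_perm=2000, seed=0):
    """Two-sided randomization P value for the slope (null: no relationship)."""
    rng = random.Random(seed)
    b_obs = slope(x, y)
    y_shuffled = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any association between X and Y
        # Small tolerance so floating-point ties count as "at least as extreme"
        if abs(slope(x, y_shuffled)) >= abs(b_obs) - 1e-12:
            count += 1
    # The observed arrangement is counted as one of the permutations
    return b_obs, (count + 1) / (n_perm + 1)


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b_obs, p_value = randomization_test_slope(x, y)
print(f"observed slope = {b_obs:.3f}, randomization P = {p_value:.4f}")
```

With only five observations there are just 120 possible arrangements, so the smallest attainable two-sided P value is limited; larger samples give a finer-grained test.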

  1. Using analysis of variance (ANOVA)

    The total sums of squares of the response variable (Y) is partitioned into the variation explained by the regression and the unexplained error variation. The error sums of squares is obtained by subtracting the regression sums of squares from the total sums of squares.

    Algebraically speaking -

    $$ SS_{Total} = \sum Y^2 - \frac{(\sum Y)^2}{n} $$
    $$ SS_{Regression} = \frac{[\sum XY - (\sum X)(\sum Y)/n]^2}{\sum X^2 - (\sum X)^2/n} $$
    $$ SS_{Error} = SS_{Total} - SS_{Regression} $$

    • SSTotal is the total sums of squares, or $\sum (Y - \bar{Y})^2$,
    • SSRegression is the sums of squares explained by the regression, or $\sum (\hat{Y} - \bar{Y})^2$,
    • SSError is the unexplained error, or $\sum (Y - \hat{Y})^2$,
    • $\hat{Y}$ is the expected (predicted) value of Y for each value of X,
    • X and Y are the individual observed values,
    • $\bar{X}$ and $\bar{Y}$ are the means of X and Y,
    • n is the number of bivariate observations.

    Mean squares are obtained by dividing sums of squares by their respective degrees of freedom. The significance test is carried out by dividing the mean square regression by the mean square error. Under a null hypothesis of a zero slope this F-ratio will be distributed as F with 1 and n − 2 degrees of freedom.

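    Source of variation | df    | SS           | MS              | F-ratio         | P
    Regression          | 1     | SSRegression | MSReg           | MSReg / MSError | P value
    Error               | n − 2 | SSError      | MSError (s2Y.X) |                 |
    Total               | n − 1 | SSTotal      |                 |                 |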

    The coefficient of determination, r2, (R2 is sometimes used instead) compares the explained variation with the total variation of the data. It can therefore be estimated from the ratio of the regression sums of squares to the total sum of squares.

    Algebraically speaking -

    $$ r^2 = \frac{SS_{Reg}}{SS_{Total}} $$
    • SSReg is the sums of squares explained by the regression,
    • SSTotal is the total sums of squares.

    Often the adjusted coefficient of determination (r2adj) is quoted instead. The adjustment takes account of the sample size and the number of explanatory variables. With simple linear regression (only one explanatory variable) the adjustment only becomes noticeable for small sample sizes (n < 20).

    Algebraically speaking -

    $$ r^2_{adj} = 1 - (1 - r^2)\,\frac{n - 1}{n - 1 - k} $$
    • r2 is the unadjusted coefficient of determination,
    • n is the number of bivariate observations, and
    • k is the number of explanatory variables.
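
The partition of the sums of squares, the F-ratio and both coefficients of determination can be computed together. The sketch below uses only the Python standard library; the function name `regression_anova`, the dictionary keys and the data are illustrative assumptions. The P value is not computed because that needs the F distribution, which the standard library does not provide.

```python
def regression_anova(x, y, k=1):
    """ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x ** 2 / n
    sxy = sum(p * q for p, q in zip(x, y)) - sum_x * sum_y / n
    ss_total = sum(v * v for v in y) - sum_y ** 2 / n
    ss_reg = sxy ** 2 / sxx
    ss_error = ss_total - ss_reg
    ms_reg = ss_reg / 1              # regression has 1 degree of freedom
    ms_error = ss_error / (n - 2)    # error has n - 2 degrees of freedom
    r2 = ss_reg / ss_total
    return {
        "ss_total": ss_total,
        "ss_reg": ss_reg,
        "ss_error": ss_error,
        "ms_error": ms_error,
        "f_ratio": ms_reg / ms_error,
        "df": (1, n - 2),
        "r2": r2,
        "r2_adj": 1 - (1 - r2) * (n - 1) / (n - 1 - k),
    }


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = regression_anova(x, y)
for name, value in table.items():
    print(name, value)
```

With so few observations the adjusted coefficient is visibly smaller than the unadjusted one, which illustrates the small-sample effect of the adjustment.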


  2. Using a t-test on the slope
    Alternatively a t-test can be carried out using the studentized slope statistic. The null hypothesis may be either that β = 0, or that β is some expected value.

    This test assumes you have a large sample.

    Algebraically speaking -

    $$ t = \frac{b - \beta}{SE_b} $$
    • b is the estimated slope and β is the slope under the null hypothesis (for testing whether the slope is significant or not, β = 0),
    • SEb is the standard error of the slope coefficient (b):
    $$ SE_b = \sqrt{\frac{\mathrm{Variance}(Y - \hat{Y})}{\mathrm{Variance}(X)\,(n-1)}} = \sqrt{\frac{MS_{error}}{\sum X^2 - (\sum X)^2/n}} $$
    • MSerror is the mean square error, which can be obtained from the analysis of variance above, or directly from $\sum (Y - \hat{Y})^2 / (n - 2)$, where $\hat{Y}$ is the expected (predicted) value of Y for each value of X,
    • X and Y are the individual observations,
    • n is the number of bivariate observations.
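
A minimal sketch of the slope t-test follows, again using only the standard library. The function name `slope_t_test` and the data are illustrative assumptions. The statistic would be compared with the t-distribution on n − 2 degrees of freedom, using tables or a statistics package.

```python
import math


def slope_t_test(x, y, beta0=0.0):
    """Return (b, se_b, t, df) for testing the slope against beta0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # Mean square error: residual sum of squares over n - 2
    ms_error = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y)) / (n - 2)
    se_b = math.sqrt(ms_error / sxx)
    return b, se_b, (b - beta0) / se_b, n - 2


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, se_b, t_stat, df = slope_t_test(x, y)
print(f"b = {b:.3f}, SE = {se_b:.4f}, t = {t_stat:.2f} on {df} df")
```

When the null hypothesis is β = 0, the square of this t statistic equals the F-ratio from the analysis of variance, so the two tests give the same P value.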


  3. Using a t-test to compare 2 slopes

    The slopes of two different regression lines can be compared using a t-test provided we can assume that the variances of Y for the two lines are equal. The two estimates of the variance are pooled as in the two-sample t-test. The t-test then compares the difference of the two slopes to the standard error of the difference.

    Algebraically speaking -

    $$ t = \frac{b_2 - b_1}{\sqrt{s^2_{pooled}\left(\frac{1}{\sum x_1^2} + \frac{1}{\sum x_2^2}\right)}} $$
    • t is the estimated t-statistic; under the null hypothesis that β1 = β2, it is a random quantile of the t-distribution with (n1 + n2 − 4) degrees of freedom,
    • b1 and b2 are the estimated slopes,
    • x1 = X − $\bar{X}$ in regression 1,
    • x2 = X − $\bar{X}$ in regression 2,
    • s2pooled is the pooled estimate of the error variance as below:
    $$ s^2_{pooled} = \frac{SS(1)_{error} + SS(2)_{error}}{n_1 + n_2 - 4} $$
    • SS(1)error and SS(2)error are the sums of squares (error) for the two regressions,
    • n1 and n2 are the numbers of bivariate observations in each regression.
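
The comparison of two slopes can be sketched as follows, assuming (as the test requires) that the two error variances are equal. The helper names and both data sets are made up for illustration.

```python
import math


def _fit(x, y):
    """Return slope, corrected sum of squares of x, error SS and n."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ss_error = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y))
    return b, sxx, ss_error, n


def compare_slopes(x1, y1, x2, y2):
    """Return (t, df) for the difference b2 - b1 using a pooled error variance."""
    b1, sxx1, sse1, n1 = _fit(x1, y1)
    b2, sxx2, sse2, n2 = _fit(x2, y2)
    df = n1 + n2 - 4
    s2_pooled = (sse1 + sse2) / df
    se_diff = math.sqrt(s2_pooled * (1 / sxx1 + 1 / sxx2))
    return (b2 - b1) / se_diff, df


# Two made-up illustrative data sets
x1, y1 = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
x2, y2 = [1, 2, 3, 4, 5], [1.2, 2.1, 2.9, 4.2, 4.9]
t_diff, df_diff = compare_slopes(x1, y1, x2, y2)
print(f"t = {t_diff:.2f} on {df_diff} df")
```

The sign of t simply reflects which slope is subtracted from which; a two-sided test uses its absolute value.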

    If there are more than two slopes, one must use an approach analogous to analysis of variance called analysis of covariance. We consider an alternate approach to this topic in Unit 13.


Standard errors and confidence intervals

Normal approximation 95% confidence intervals can be attached to various statistics produced by a regression analysis by multiplying the appropriate standard error by t with n − 2 degrees of freedom. The standard errors are as follows:

1. Standard error of slope

This is the most frequently calculated standard error since it is used in significance tests of the slope.
$$ SE_b = \sqrt{\frac{MS_{error}}{\sum X^2 - (\sum X)^2/n}} $$

2. Standard error of the observed sample mean ($\bar{Y}$) at $\bar{X}$

The variance of Y around the point ($\bar{X}$, $\bar{Y}$) is now less than s2 because we have accounted for some of the variation in Y in terms of the variation in X.
$$ SE_{\bar{Y}} = \sqrt{\frac{MS_{error}}{n}} $$

3. Standard error of an estimated Y value ($\hat{Y}$) for a specified value of Xp

When the estimate is at any value of X other than the mean, we have to add an additional error term. Hence the further away that Xp is from its mean, the greater is the error of the estimate. If one calculates a series of confidence intervals for different values of X, one gets a biconcave confidence belt reflecting the lower reliability of our estimates as we move further away from the mean.
$$ SE_{\hat{Y}} = \sqrt{MS_{error}\left[\frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum X^2 - (\sum X)^2/n}\right]} $$

4. Estimated standard error of a predicted Y value (Y) for a specified value of Xp

If one wishes to predict the Y value that would be obtained in a new experiment on the basis of the regression equation, then the error is again increased.
$$ SE_{Y} = \sqrt{MS_{error}\left[1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum X^2 - (\sum X)^2/n}\right]} $$

5. Estimated standard error of a predicted mean ($\bar{Y}$) of m replicates for a specified value of Xp

It is sometimes more useful to predict the result of a mean of a number of replicates (m) for a given value of X.
$$ SE_{\bar{Y}} = \sqrt{MS_{error}\left[\frac{1}{m} + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum X^2 - (\sum X)^2/n}\right]} $$

  • MSerror is the mean square error, which is $\sum (Y - \hat{Y})^2 / (n - 2)$, where $\hat{Y}$ is the expected value of Y for each value of X,
  • X and Y are the individual observations,
  • Xp is the value of X for which Y is being estimated or predicted,
  • $\bar{X}$ is the mean of the X observations,
  • n is the number of bivariate observations, and m is the number of replicates for the predicted mean.
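
The five standard errors differ only in the terms inside the square root, which a short sketch makes explicit. The function name, dictionary keys and data are illustrative assumptions; the entry for the sample mean at the mean of X is taken as the special case Xp equal to the mean of X.

```python
import math


def regression_standard_errors(x, y, xp, m=1):
    """Standard errors from a simple regression, evaluated at X = xp."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ms_error = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y)) / (n - 2)
    dev = (xp - x_bar) ** 2 / sxx  # extra term for distance from the mean of X
    return {
        "slope": math.sqrt(ms_error / sxx),
        "mean_at_xbar": math.sqrt(ms_error / n),
        "estimated_y": math.sqrt(ms_error * (1 / n + dev)),
        "predicted_y": math.sqrt(ms_error * (1 + 1 / n + dev)),
        "predicted_mean": math.sqrt(ms_error * (1 / m + 1 / n + dev)),
    }


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se_at_mean = regression_standard_errors(x, y, xp=3.0)
se_at_5 = regression_standard_errors(x, y, xp=5.0)
print(se_at_mean["estimated_y"], se_at_5["estimated_y"], se_at_5["predicted_y"])
```

Each standard error would be multiplied by the appropriate t quantile (n − 2 degrees of freedom) to give the half-width of a confidence interval. Evaluating the function over a range of Xp values traces out the confidence belt, which is narrowest at the mean of X.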


Prediction of X from Y

Sometimes we may need to estimate X from Y, rather than the more usual procedure of estimating Y from X. If the X values are fixed and measured without error, it would not be valid to simply regress X on Y rather than vice versa as the assumptions of regression would not be met. Instead one should use the method of inverse regression. This method is unbiased providing the usual assumptions of regression are met - in particular that X is measured without error.

An unbiased estimate of X ($\hat{X}$) for a specified Y value (Yp) is obtained simply by rearranging the regression equation. Hence:

$$ \hat{X} = \frac{Y_p - a}{b} $$
  • b is the slope and a is the intercept of the regression of Y on X.

However, the confidence interval of this estimate is rather tedious to obtain as it requires computation of two further statistics, commonly known as D and H.

$$ D = b^2 - t^2\,SE_b^2 $$
  • b is the slope of the regression of Y on X,
  • t is a quantile from the t-distribution for the chosen type I error rate (α) and n − 2 degrees of freedom,
  • SEb is the standard error of the slope.

$$ H = \frac{t}{D}\sqrt{MS_{error}\left[D\left(1 + \frac{1}{n}\right) + \frac{(Y_p - \bar{Y})^2}{\sum X^2 - (\sum X)^2/n}\right]} $$
  • t and D are as specified above,
  • n is the number of bivariate observations,
  • Yp is the value of Y for which X is being predicted,
  • $\bar{X}$ and $\bar{Y}$ are the means of X and Y,
  • X and Y are the individual observations.


$$ \text{Lower confidence limit} = \bar{X} + \frac{b\,(Y_p - \bar{Y})}{D} - H $$
$$ \text{Upper confidence limit} = \bar{X} + \frac{b\,(Y_p - \bar{Y})}{D} + H $$
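
The inverse prediction calculation can be sketched as below. It assumes the limits take the form mean of X + b(Yp − mean of Y)/D ± H, and that the caller supplies the two-tailed t quantile for n − 2 degrees of freedom (3.182 is the tabulated 5% value for 3 degrees of freedom). The function name and data are illustrative assumptions.

```python
import math


def inverse_prediction(x, y, yp, t_crit):
    """Return (x_hat, lower, upper) for predicting X from an observed Y = yp."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ms_error = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y)) / (n - 2)
    se_b_sq = ms_error / sxx           # squared standard error of the slope
    x_hat = (yp - a) / b               # point estimate of X
    d = b ** 2 - t_crit ** 2 * se_b_sq
    if d <= 0:
        # Slope not significantly different from zero: no finite interval
        raise ValueError("D is not positive; confidence limits are undefined")
    h = (t_crit / d) * math.sqrt(
        ms_error * (d * (1 + 1 / n) + (yp - y_bar) ** 2 / sxx)
    )
    centre = x_bar + b * (yp - y_bar) / d
    return x_hat, centre - h, centre + h


# Made-up illustrative data; t_crit is the tabulated two-tailed 5% value for 3 df
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_hat, lower, upper = inverse_prediction(x, y, yp=6.02, t_crit=3.182)
print(f"X estimate = {x_hat:.3f}, 95% limits = ({lower:.3f}, {upper:.3f})")
```

The check on D matters in practice: when the slope is not significantly different from zero, D is zero or negative and no sensible interval for X exists.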


Spurious correlations

This is a controversial topic which has generated considerable debate in the journals. If the dependent and independent variables, Y and X, are not independent, then regression or correlation analysis may well indicate they are correlated when in fact the relationship derives solely from the presence of a shared variable. This has come to be known as the 'spurious correlation' issue. It has been suggested that erroneous conclusions derived from spurious correlations may be more widespread and persistent in the literature than pseudo-replication ever was. Prairie & Bird (1989) on the other hand argue that it is not the correlations that are spurious, but the inferences drawn from them. There is, however, general agreement that one cannot use the usual parametric tests of significance in this situation.

One common reason for two variables not being independent is when they share a common term. For example two variables (X and Y) may be standardized by dividing by a third variable (Z), and Y/Z is then regressed on X/Z. Pearson (1897) was the first to note that two variables having no correlation between themselves become correlated when divided by a third uncorrelated variable. In this case the spurious correlation tends to be positive. Another example is when one is plotting Y/X against X, for example some weight-specific function against an organism's body weight. In this case the spurious correlation is usually negative. The level of (spurious) correlation depends on the precise way in which the common term is shared, on the variability of each measure and on the amount of measurement error in the shared term. But for the two common forms shown above, the coefficient of determination just for the spurious component commonly exceeds 0.5. This means that testing against a null model of zero correlation is highly misleading. Some have argued that regressions of non-independent ratios should never be carried out, and that researchers should instead use analysis of covariance. The alternative is to specify the appropriate null model using a randomization test.
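
A small simulation illustrates both forms of shared-term correlation. All three variables are drawn independently, so any correlation between the ratios is purely a product of the shared divisor. The distributions, sample size and seed are arbitrary assumptions chosen for illustration.

```python
import random


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((p - u_bar) * (q - v_bar) for p, q in zip(u, v))
    suu = sum((p - u_bar) ** 2 for p in u)
    svv = sum((q - v_bar) ** 2 for q in v)
    return suv / (suu * svv) ** 0.5


rng = random.Random(1)
n = 5000
# Three mutually independent variables (arbitrary uniform ranges)
X = [rng.uniform(8, 12) for _ in range(n)]
Y = [rng.uniform(8, 12) for _ in range(n)]
Z = [rng.uniform(5, 15) for _ in range(n)]

r_raw = pearson_r(X, Y)                        # X against Y: no shared term
r_ratio = pearson_r([xi / zi for xi, zi in zip(X, Z)],
                    [yi / zi for yi, zi in zip(Y, Z)])  # X/Z against Y/Z
r_per_unit = pearson_r([yi / zi for yi, zi in zip(Y, Z)], Z)  # Y/Z against Z

print(f"X vs Y:     r = {r_raw:.3f}")
print(f"X/Z vs Y/Z: r = {r_ratio:.3f}")
print(f"Y/Z vs Z:   r = {r_per_unit:.3f}")
```

The raw variables are essentially uncorrelated, whereas the two ratios sharing a divisor are strongly positively correlated and the ratio plotted against its own divisor is strongly negatively correlated. The size of the effect depends on how variable the shared term is relative to the other variables.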


Another situation where regression is used on variables that are not independent is in the detection of density dependence. Population ecologists commonly detect density dependence by regressing a measure of population change (r) over a specified period against a measure of population size (N) at the start of that period. The slope of this regression is biased downwards, and use of a conventional parametric test based on a null hypothesis of β = 0 would be incorrect. Turchin (2004) reviews the troubled history of these analyses. He concludes that two methods, both of which use appropriate nulls, are adequate. In one method, log population change ($\ln N_{t+1} - \ln N_t$) is regressed on log population size at time t, and a randomization test is used to assess the correlation coefficient. In the other method log population change is regressed on population size at time t, and a parametric bootstrap is used to test significance.
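
The downward bias of the slope can be demonstrated by simulation. The sketch below generates populations that follow a pure random walk on the log scale, so there is no density dependence at all, and then regresses log population change on log population size. It illustrates the bias only; it is not an implementation of either of the tests reviewed by Turchin. Series length, step size, number of series and seed are arbitrary assumptions.

```python
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx


def random_walk_slopes(n_series=500, length=20, sd=0.2, seed=1):
    """Slopes of r on ln(N) for density-independent random-walk populations."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_series):
        log_n = [0.0]
        for _ in range(length):
            # Each step is independent of current population size
            log_n.append(log_n[-1] + rng.gauss(0.0, sd))
        r = [log_n[t + 1] - log_n[t] for t in range(length)]
        slopes.append(slope(log_n[:-1], r))
    return slopes


slopes = random_walk_slopes()
mean_slope = sum(slopes) / len(slopes)
frac_negative = sum(s < 0 for s in slopes) / len(slopes)
print(f"mean slope = {mean_slope:.3f}, fraction negative = {frac_negative:.2f}")
```

Although the true slope is zero by construction, most simulated series give a negative slope, which is why a null hypothesis of β = 0 tested with the usual parametric test is inappropriate here.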

There still remains the problem of measurement error, highlighted recently by Freckleton et al. (2006). The measurement error in density for a given year appears in the corresponding change in population density with equal magnitude but opposite sign. This can lead to a spurious relationship, even using a randomization or bootstrap test with the appropriate null hypothesis. Unfortunately population estimates tend to have a high level of measurement error, both because estimates are based on samples rather than exact counts, and because the group of organisms measured is often not a truly self-contained population, but part of a wider population. Whilst the former error can (sometimes) be estimated by repeated sampling, the latter is much more intractable. Note that approaches described in the More Information page on errors-in-variables regression are not appropriate as errors are not independent.




Assumptions

  1. The relationship between Y and X is linear. Violations of this assumption are especially serious - and probably the most common in the literature. It should be checked for initially using a scatterplot of Y against X. After regression parameters are estimated, it can be checked by plotting observed versus predicted values, and by examination of plots of residuals (differences between observed and predicted values) against predicted values. A 'bowed' pattern indicates the model is making systematic errors when making large or small predictions. A transformation of one or both variables may improve linearity of the response.
  2. The errors are independent. It is assumed that successive residuals are not correlated over time (serial correlation). This can be checked by plotting the serial correlation coefficient against time lag (a correlogram), and/or by using the Durbin-Watson test.
  3. The variance of the errors is constant (homoscedasticity) (a) versus time and (b) versus the predictions (or versus the independent variable). Failure to meet this assumption will result in confidence intervals that are too wide or too narrow. This can be checked by plotting the residuals versus time, versus predicted values and versus the independent variable. If the variance is not constant, a transformation can be used (although this has implications for linearity). Alternatively one can use a weighted regression in which data with a larger variance are given less weight relative to data with small variance (for details see Steel & Torrie (1980).)
  4. Errors are normally distributed. This is probably the least important assumption, as the regression line is essentially a mean, so the central limit theorem applies. Even so, the presence of a few large outliers can bias estimates of coefficients, and affect confidence intervals. It can be checked using a normal probability plot (a qq plot) of residuals.
  5. X is measured without error or values of X are fixed by the experimenter. This assumption must be met if the parameter values are of interest; if the regression is purely descriptive, or is being used for prediction, then this assumption can be disregarded.
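
As an illustration of checking the independence assumption, the Durbin-Watson statistic can be computed directly from the residuals taken in time order. The sketch below assumes the observations are already in time order; the function names and data are illustrative. The statistic lies between 0 and 4, with values near 2 suggesting no first-order serial correlation. Judging significance requires Durbin-Watson tables or a statistics package, which is not shown here.

```python
def residuals(x, y):
    """Residuals from the least squares regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    return [q - (a + b * p) for p, q in zip(x, y)]


def durbin_watson(e):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den


# Made-up illustrative data, assumed to be in time order
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e = residuals(x, y)
dw = durbin_watson(e)
print(f"Durbin-Watson statistic = {dw:.3f}")
```

In this toy example the residuals alternate in sign, so the statistic is well above 2, the pattern associated with negative serial correlation. With only five points this is an illustration, not evidence.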



Regression diagnostics are used to detect outliers and/or influential points in the regression and to assess whether the various assumptions of the model are met. They can also suggest improvements to the regression model.

  • Residuals & residual plots

    The most important measure used in regression diagnostics is the simple residual (e). This is the difference between each observed and predicted value of Y.

    Algebraically speaking -

    $$ e_i = Y_i - \hat{Y}_i = Y_i - (a + bX_i) $$
    • Yi and $\hat{Y}_i$ are the observed and predicted values of Y,
    • a is the intercept and b is the slope of the regression model.

    Points with unusually high residuals - in other words points that do not fit the current model - are known as regression outliers.

    Some researchers prefer to use standardized (or studentized) residuals. These are ordinary residuals divided by an estimate of their standard error so they have equal variance. Since the mean of these residuals is zero, and standardized residuals are assumed to be t-distributed, residuals whose absolute value exceeds the critical value (or roughly 2) may be described as deviating significantly from the trend of the regression, and should be subject to further scrutiny.

    Other researchers prefer un-standardised residuals, partly because these have the same scale as the outcome variable (Y).

    Algebraically speaking -

    $$ r_i = \frac{e_i}{\sqrt{MS_{error}\,(1 - h_i)}} $$
    • ei is the residual for observation Yi,
    • MSerror is the mean square residual for the regression model. Internally studentized residuals use the residual mean square from the model fitted to the full data set. Externally studentized residuals use the residual mean square from the model fitted to all the data except the ith observation.
    • hi is the leverage for observation Xi (see below).

    The most important residual plot is of ordinary residuals against the predicted Y value. The scatter of the residuals in the vertical direction should be symmetrical around zero, and should be constant in relation to the predicted Y value. Standardized residuals often give a very similar picture to ordinary residuals, although they can differ when one value has unusually large leverage.
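
Ordinary and internally studentized residuals can be computed as in the sketch below, which uses the full-data mean square error and the simple-regression leverage 1/n + x²/Σx² (x being the deviation of X from its mean). The function name and data are illustrative assumptions.

```python
import math


def studentized_residuals(x, y):
    """Return (ordinary residuals, internally studentized residuals)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    e = [q - (a + b * p) for p, q in zip(x, y)]
    ms_error = sum(v * v for v in e) / (n - 2)
    # Leverage of each observation in a simple linear regression
    h = [1 / n + (p - x_bar) ** 2 / sxx for p in x]
    r = [ei / math.sqrt(ms_error * (1 - hi)) for ei, hi in zip(e, h)]
    return e, r


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e, r = studentized_residuals(x, y)
for xi, ei, ri in zip(x, e, r):
    print(f"X = {xi}: residual = {ei:+.3f}, studentized = {ri:+.3f}")
```

Plotting either set of values against the predicted Y gives the residual plot described above.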

  • Leverage & leverage plots

    Another useful diagnostic measure is the leverage coefficient (H) - Unit 14 examines it in more detail. Outliers with respect to the X-variable will have a high leverage - in other words they will have a considerable influence on the model parameters. The coefficient for each observed value of X (hi) is computed as follows:

    Algebraically speaking -

    $$ h_i = \frac{1}{n} + \frac{x_i^2}{\sum x^2} $$
    • hi is the leverage for observation Xi,
    • n is the number of bivariate observations,
    • xi = Xi − $\bar{X}$, where Xi is the ith value of X and $\bar{X}$ is the mean of all X values,
    • x = X − $\bar{X}$, where X is the entire set of individual X values.

    A leverage plot is of leverage values plotted against index value (usually the order in which values are entered or obtained). Observations with a leverage of more than 2p/n (where p = Σhi) should be examined more closely.
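
Leverage depends only on the X values, so it can be examined before Y is even considered. The sketch below computes the leverages and flags those above 2p/n; the function name and the X values (which include one deliberately extreme point) are illustrative assumptions.

```python
def leverage(x):
    """Leverage of each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return [1 / n + (v - x_bar) ** 2 / sxx for v in x]


# Made-up X values with one point far from the rest
x = [1, 2, 3, 4, 10]
h = leverage(x)
p = sum(h)                     # equals 2 for simple linear regression
threshold = 2 * p / len(x)
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print("leverages:", [round(hi, 2) for hi in h])
print("threshold:", round(threshold, 2), "flagged indices:", flagged)
```

Here only the last observation, which lies well away from the other X values, exceeds the threshold.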

  • Detecting influential observations

    We can define an influential point as one whose removal from the dataset would cause a large change in the fit. Any influential point is likely (but not necessarily certain) to either be an outlier or to have large leverage. The most popular measure of influence is Cook's Distance. This combines the information from residuals and leverage:

    Algebraically speaking -

    $$ D_i = \frac{1}{p}\; r_i^2\; \frac{h_i}{1 - h_i} $$
    • Di is Cook's Distance for observation i,
    • p is Σhi, where hi is the leverage for observation Xi,
    • ri is the studentized residual of the ith observation.
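
Cook's Distance can be sketched by combining the internally studentized residuals with the leverages, taking p as the sum of the leverages (2 for a simple linear regression). The function name and data are illustrative assumptions.

```python
def cooks_distance(x, y):
    """Cook's Distance for each observation in a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b = sum((u - x_bar) * (q - y_bar) for u, q in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    e = [q - (a + b * u) for u, q in zip(x, y)]
    ms_error = sum(v * v for v in e) / (n - 2)
    h = [1 / n + (u - x_bar) ** 2 / sxx for u in x]
    p = sum(h)
    # Squared internally studentized residuals
    r_sq = [ei ** 2 / (ms_error * (1 - hi)) for ei, hi in zip(e, h)]
    return [ri2 * hi / (p * (1 - hi)) for ri2, hi in zip(r_sq, h)]


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
d = cooks_distance(x, y)
for xi, di in zip(x, d):
    print(f"X = {xi}: Cook's D = {di:.3f}")
```

In this example the largest value belongs to the last observation, which has only a moderate residual but high leverage, showing how the measure combines the two sources of information.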

Related topics :

Serial correlation