Principles
The regression model
Simple linear regression provides a means to model a straight-line relationship between two variables. In classical (or asymmetric) regression one variable (Y) is called the response or dependent variable, and the other (X) is called the explanatory or independent variable. This is in contrast to correlation, where there is no distinction between Y and X in terms of which is the explanatory variable and which the response variable.
The regression model is given by:
Y = α + βX + ε
where α is the y intercept (the value of Y where X = 0), β is the slope of the line, and ε is a random error term.
It may also be given as:
Y = β_{0} + β_{1}X + ε
where β_{0} is the y intercept, β_{1} is the slope of the line, and ε is a random error term.
In the traditional regression model, values of the X variable are assumed to be fixed by the experimenter. The model is still valid if X is random (as is more commonly the case), but only if X is measured without error. If there is substantial measurement error on X, and the values of the estimated parameters are of interest, then errors-in-variables regression should be used. Errors on the response variable are assumed to be independent, identically distributed and normal.
Estimating model parameters
The parameters of the regression model are estimated from the data using ordinary least squares.
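The least-squares estimates have a simple closed form: b = Σxy/Σx² and a = Ȳ − bX̄, where x and y are deviations from the respective means. A minimal sketch in Python, using a small illustrative dataset of our own (not from the text):

```python
def ols_fit(X, Y):
    """Ordinary least squares estimates of the intercept (a) and slope (b)."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    # slope b = Σxy / Σx², where x = X − X̄ and y = Y − Ȳ
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    s_xx = sum((x - mean_x) ** 2 for x in X)
    b = s_xy / s_xx
    a = mean_y - b * mean_x  # the fitted line passes through (X̄, Ȳ)
    return a, b

# illustrative data only
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = ols_fit(X, Y)
```

Note that because the fitted line must pass through the bivariate mean, the intercept follows immediately once the slope is estimated.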
Testing the significance of a regression
There are several ways the significance of a regression can be tested. Provided errors are normally and identically distributed, a parametric test can be used. Analysis of variance is often the preferred approach, although one can also use a t-test to test whether the slope is significantly different from zero. If errors are not normally and identically distributed, then a randomization test should be used.
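The randomization test mentioned above can be sketched by repeatedly shuffling the Y values against X and asking how often a slope as extreme as the observed one arises by chance. The dataset, permutation count and seed are illustrative:

```python
import random

def slope(X, Y):
    # least-squares slope: Σxy / Σx²
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    return sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)

def randomization_p(X, Y, n_perm=2000, seed=1):
    """Two-sided randomization test of the regression slope."""
    rng = random.Random(seed)
    observed = abs(slope(X, Y))
    shuffled = list(Y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any X-Y association
        if abs(slope(X, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # count the observed ordering itself

p = randomization_p([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```

With such a strong straight-line trend, only the original ordering and its exact reversal give a slope this extreme, so the Monte Carlo P value is small.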
 Using analysis of variance (ANOVA)
The total sums of squares of the response variable (Y) is partitioned into the variation explained by the regression and the unexplained error variation. The error sums of squares is obtained by subtracting the regression sums of squares from the total sums of squares.
Mean squares are obtained by dividing sums of squares by their respective degrees of freedom. The significance test is carried out by dividing the mean square regression by the mean square error. Under a null hypothesis of a zero slope this F-ratio will be distributed as F with 1 and n − 2 degrees of freedom.
Source of variation | df    | SS         | MS                       | F-ratio               | P value
Regression          | 1     | SS_{Reg}   | MS_{Reg}                 | MS_{Reg} / MS_{Error} |
Error               | n − 2 | SS_{Error} | MS_{Error} (s^{2}_{Y.X}) |                       |
Total               | n − 1 | SS_{Total} |                          |                       |

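The partition in the ANOVA table can be computed directly: SS_{Reg} = (Σxy)²/Σx², SS_{Error} by subtraction, and the F-ratio from the two mean squares. A sketch (dataset illustrative):

```python
def regression_anova(X, Y):
    """Return (SS_reg, SS_error, F) for a simple linear regression."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ss_total = sum((y - my) ** 2 for y in Y)    # total SS, df = n − 1
    ss_reg = s_xy ** 2 / s_xx                   # explained by the regression, df = 1
    ss_error = ss_total - ss_reg                # by subtraction, df = n − 2
    ms_reg = ss_reg / 1
    ms_error = ss_error / (n - 2)
    return ss_reg, ss_error, ms_reg / ms_error  # F with 1 and n − 2 df

ss_reg, ss_error, F = regression_anova([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```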
The coefficient of determination, r^{2}, (R^{2} is sometimes used instead) compares the explained variation with the total variation of the data. It can therefore be estimated from the ratio of the regression sums of squares to the total sum of squares.
Algebraically speaking

r^{2} = SS_{Reg} / SS_{Total}
where:
 SS_{Reg} is the sums of squares explained by the regression,
 SS_{Total} is the total sums of squares.

Often the adjusted coefficient of determination (r^{2}_{adj}) is quoted instead. The adjustment takes account of the sample size and the number of explanatory variables. With simple linear regression (only one explanatory variable) the adjustment only becomes noticeable for small sample sizes (n < 20).
Algebraically speaking

r^{2}_{adj} = 1 − (1 − r^{2}) (n − 1) / (n − 1 − k)
where:
 r^{2} is the unadjusted coefficient of determination,
 n is the number of bivariate observations, and
 k is the number of explanatory variables.

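Both coefficients of determination follow directly from the sums of squares above. A sketch, with k = 1 for simple linear regression (dataset illustrative):

```python
def r_squared(X, Y):
    """Unadjusted coefficient of determination: explained / total variation."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ss_total = sum((y - my) ** 2 for y in Y)
    ss_reg = s_xy ** 2 / s_xx
    return ss_reg / ss_total

def r_squared_adj(r2, n, k=1):
    """Adjust for sample size n and number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - 1 - k)

r2 = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
r2_adj = r_squared_adj(r2, n=5)
```

As the text notes, the adjustment always pulls r² downwards, and the effect is largest for small n.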
 Using a t-test on the slope
Alternatively a t-test can be carried out using the studentized slope statistic. The null hypothesis may be either that β = 0, or that β is some expected value.
This test assumes you have a large sample.
Algebraically speaking

t = (b − β) / SE_{b}
where
 b is the estimated slope and β is the slope under the null hypothesis (for testing whether the slope is significant or not, β = 0)
 SE_{b} is the standard error of the slope coefficient (b)
SE_{b} = √[ Variance(Y − Ŷ) / (Variance(X) × (n − 1)) ] = √[ MS_{error} / (ΣX^{2} − (ΣX)^{2}/n) ]
where
 MS_{error} is the mean square error, which can be obtained from the analysis of variance above, or directly from Σ(Y − Ŷ)^{2}/(n − 2) where Ŷ is the expected (predicted) value of Y for each value of X,
 X and Y are the individual observations,
 n is the number of bivariate observations.

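A sketch of the slope t-test using the quantities just defined (same illustrative data as above; note that for simple regression t² equals the ANOVA F-ratio):

```python
import math

def slope_t(X, Y, beta_null=0.0):
    """t statistic for H0: β = beta_null, with n − 2 degrees of freedom."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)      # = ΣX² − (ΣX)²/n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    # MS_error = Σ(Y − Ŷ)² / (n − 2)
    ms_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)
    se_b = math.sqrt(ms_error / s_xx)
    return (b - beta_null) / se_b

t = slope_t([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```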
 Using a t-test to compare 2 slopes
The slopes of two different regression lines can be compared using a t-test provided we can assume that the variances of Y for the two lines are equal. The two estimates of the variance are pooled as in the two-sample t-test. The t-test then compares the difference of the two slopes to the standard error of the difference.
Algebraically speaking

t = (b_{2} − b_{1}) / √[ s^{2}_{pooled} (1/Σx_{1}^{2} + 1/Σx_{2}^{2}) ]
where
 t is the estimated t-statistic; under the null hypothesis that β_{1} = β_{2}, it is a random quantile of the t-distribution with (n_{1} + n_{2} − 4) degrees of freedom,
 b_{1} and b_{2} are the estimated slopes,
 x_{1} = X − X̄ in regression 1,
 x_{2} = X − X̄ in regression 2.
 s^{2}_{pooled} is the pooled estimate of the error variance as below:
s^{2}_{pooled} = (SS(1)_{error} + SS(2)_{error}) / (n_{1} + n_{2} − 4)
where
 SS(1)_{error} and SS(2)_{error} are the sums of squares (error) for the two regressions,
 n_{1} and n_{2} are the numbers of bivariate observations for the two regressions.

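A sketch of the two-slope comparison; fed the same dataset twice, the statistic should be essentially zero (datasets illustrative):

```python
import math

def fit_parts(X, Y):
    """Return (slope, Σx², SS_error, n) for one regression."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
    return b, s_xx, ss_error, n

def compare_slopes(X1, Y1, X2, Y2):
    """t statistic for H0: β1 = β2, with n1 + n2 − 4 degrees of freedom."""
    b1, sxx1, ss1, n1 = fit_parts(X1, Y1)
    b2, sxx2, ss2, n2 = fit_parts(X2, Y2)
    s2_pooled = (ss1 + ss2) / (n1 + n2 - 4)   # pooled error variance
    se_diff = math.sqrt(s2_pooled * (1 / sxx1 + 1 / sxx2))
    return (b2 - b1) / se_diff

X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.0, 9.8]
t_same = compare_slopes(X, Y, X, Y)                    # identical data, so t ≈ 0
t_diff = compare_slopes(X, Y, X, [2 * y for y in Y])   # second slope is doubled
```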
If there are more than two slopes, one must use an approach analogous to analysis of variance called analysis of covariance. We consider an alternate approach to this topic in Unit 13.
Standard errors and confidence intervals
Normal approximation 95% confidence intervals can be attached to various statistics produced by a regression analysis by multiplying the appropriate standard error by t with n − 2 degrees of freedom. The standard errors are as follows:
1. Standard error of slope
This is the most frequently calculated standard error since it is used in significance tests of the slope.
SE_{b} = √[ MS_{error} / (ΣX^{2} − (ΣX)^{2}/n) ]
2. Standard error of the observed sample mean (Ȳ) at X̄
The variance of Y around the point (X̄, Ȳ) is now less than s^{2}_{Y} because we have accounted for some of the variation in Y in terms of the variation in X.

SE_{Ȳ} = √( MS_{error} / n )
3. Standard error of an estimated Y value (Ŷ) for a specified value of X (X_{p})
When the estimate is at any value of X other than the mean, we have to add an additional error term. Hence the further away X_{p} is from its mean, the greater is the error of the estimate. If one calculates a series of confidence intervals for different values of X, one gets a biconcave confidence belt reflecting the lower reliability of our estimates as we move further away from the mean.

SE_{Ŷ} = √( MS_{error} [ 1/n + (X_{p} − X̄)^{2} / (ΣX^{2} − (ΣX)^{2}/n) ] )
4. Estimated standard error of a predicted Y value (Ŷ) for a specified value of X_{p}
If one wishes to predict the Y value that would be obtained in a new experiment on the basis of the regression equation, then the error is again increased.

SE_{Ŷ} = √( MS_{error} [ 1 + 1/n + (X_{p} − X̄)^{2} / (ΣX^{2} − (ΣX)^{2}/n) ] )
5. Estimated standard error of a predicted mean (Ȳ) of m replicates for a specified value of X_{p}
It is sometimes more useful to predict the mean of a number of replicates (m) for a given value of X.

SE_{Ȳ} = √( MS_{error} [ 1/m + 1/n + (X_{p} − X̄)^{2} / (ΣX^{2} − (ΣX)^{2}/n) ] )
where
 MS_{error} is the mean square error, which is Σ(Y − Ŷ)^{2}/(n − 2) where Ŷ is the expected value of Y for each value of X,
 X and Y are the individual observations,
 X_{p} is the value of X for which Y is being estimated or predicted,
 X̄ is the mean of the X observations,
 n is the number of bivariate observations, and m is the number of replicates for the predicted mean.

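Cases 2 to 5 differ only in the bracketed term under the square root. A sketch of case 3, showing the biconcave widening of the confidence belt away from X̄ (data illustrative):

```python
import math

def se_fitted(X, Y, x_p):
    """Standard error of the estimated Ŷ at a specified x_p (case 3 above)."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    ms_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)
    # the (x_p − X̄)² term makes the belt widen away from the mean
    return math.sqrt(ms_error * (1 / n + (x_p - mx) ** 2 / s_xx))

X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.0, 9.8]
se_at_mean = se_fitted(X, Y, 3.0)   # narrowest point of the belt, at X̄
se_at_edge = se_fitted(X, Y, 5.0)
```

Adding 1 (case 4) or 1/m (case 5) inside the bracket gives the corresponding prediction errors.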
Prediction of X from Y
Sometimes we may need to estimate X from Y, rather than the more usual procedure of estimating Y from X. If the X values are fixed and measured without error, it would not be valid simply to regress X on Y rather than vice versa, as the assumptions of regression would not be met. Instead one should use the method of inverse regression. This method is unbiased providing the usual assumptions of regression are met, in particular that X is measured without error.
An unbiased estimate of X (X̂) for a specified Y value (Y_{p}) is obtained simply by reversing the regression equation. Hence:

X̂ = (Y_{p} − a) / b

where
 b is the slope and a is the intercept of the regression of Y on X.

However, the confidence interval of this estimate is rather tedious to obtain, as it requires computation of two further statistics, commonly known as D and H:

D = b^{2} − t^{2} SE_{b}^{2}

where
 b is the slope of the regression of Y on X,
 t is a quantile from the t-distribution for the chosen type I error rate (α) and n − 2 degrees of freedom,
 SE_{b} is the standard error of the slope.

H = t √( MS_{error} [ D (1/n + 1) + (Y_{p} − Ȳ)^{2} / (ΣX^{2} − (ΣX)^{2}/n) ] )

where
 t and D are as specified above,
 n is the number of bivariate observations,
 Y_{p} is the value of Y for which X is being predicted,
 Ȳ is the mean of Y,
 X and Y are the individual observations.
Then
Lower confidence limit = X̄ + [ b (Y_{p} − Ȳ) − H ] / D

Upper confidence limit = X̄ + [ b (Y_{p} − Ȳ) + H ] / D

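Putting X̂, D and H together, a sketch of inverse prediction using the Fieller-type statistics D = b² − t²SE_b² and H (data illustrative; the t quantile is supplied by the caller, e.g. 3.182 for 95% limits with 3 df):

```python
import math

def inverse_predict(X, Y, y_p, t):
    """Point estimate and confidence limits for X at a specified y_p."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    ms_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)
    se_b = math.sqrt(ms_error / s_xx)
    x_hat = (y_p - a) / b                    # reversed regression equation
    D = b ** 2 - t ** 2 * se_b ** 2
    H = t * math.sqrt(ms_error * (D * (1 / n + 1) + (y_p - my) ** 2 / s_xx))
    lower = mx + (b * (y_p - my) - H) / D
    upper = mx + (b * (y_p - my) + H) / D
    return x_hat, lower, upper

# t = 3.182 is (approximately) the 95% two-tailed quantile for n − 2 = 3 df
x_hat, lo, hi = inverse_predict([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8],
                                y_p=6.0, t=3.182)
```

Note that the interval is not symmetric about X̂ in general, and that D must be positive (i.e. the slope must be significant at the chosen α) for the limits to be meaningful.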
Spurious correlations
This is a controversial topic which has generated considerable debate in the journals. If the dependent and independent variables, Y and X, are not independent, then regression or correlation analysis may well indicate they are correlated when in fact the relationship derives solely from the presence of a shared variable. This has come to be known as the 'spurious correlation' issue. It has been suggested that erroneous conclusions derived from spurious correlations may be more widespread and persistent in the literature than pseudoreplication ever was. Prairie & Bird (1989) on the other hand argue that it is not the correlations that are spurious, but the inferences drawn from them. There is, however, general agreement that one cannot use the usual parametric tests of significance in this situation.
One common reason for two variables not being independent is when they share a common term. For example two variables (X and Y) may be standardized by dividing by a third variable (Z), and Y/Z is then regressed on X/Z. Pearson (1897) was the first to note that two variables having no correlation between themselves become correlated when divided by a third uncorrelated variable. In this case the spurious correlation tends to be positive. Another example is when one is plotting Y/X against X, for example some weight-specific function against an organism's body weight. In this case the spurious correlation is usually negative. The level of (spurious) correlation depends on the precise way in which the common term is shared, on the variability of each measure and on the amount of measurement error in the shared term. But for the two common forms shown above, the coefficient of determination just for the spurious component commonly exceeds 0.5. This means that testing against a null model of zero correlation is highly misleading. Some have argued that regressions of non-independent ratios should never be carried out, and that researchers should instead use analysis of covariance. The alternative is to specify the appropriate null model using a randomization test.
Another situation where regression is used on variables that are not independent is in the detection of density dependence. Population ecologists commonly detect density dependence by regressing a measure of population change (r) over a specified period against a measure of population size (N) at the start of that period. The slope of this regression is biased downwards, and use of a conventional parametric test based on a null hypothesis of β = 0 would be incorrect. Turchin (2004) reviews the troubled history of these analyses. He concludes that two methods, both of which use appropriate nulls, are adequate. In one method, log population change (ln(N_{t+1}) − ln(N_{t})) is regressed on log population size at time t, and a randomization test is used to assess the correlation coefficient. In the other method log population change is regressed on population size at time t, and a parametric bootstrap is used to test significance.
There still remains the problem of measurement error highlighted recently by Freckleton et al. (2006). The measurement error in density for a given year appears in the corresponding change in population density with equal magnitude but opposite sign. This can lead to a spurious relationship, even using a randomization or bootstrap test with the appropriate null hypothesis. Unfortunately population estimates tend to have a high level of measurement error, both because estimates are based on samples rather than exact counts, and because the group of organisms measured is often not a truly self-contained population, but part of a wider population. Whilst the former error can (sometimes) be estimated by repeated sampling, the latter is much more intractable. Note that the approaches described in the More Information page on errors-in-variables regression are not appropriate here, as the errors are not independent.
Assumptions
 The relationship between Y and X is linear. Violations of this assumption are especially serious, and probably the most common in the literature. It should be checked initially using a scatterplot of Y against X. After the regression parameters are estimated, it can be checked by plotting observed versus predicted values, and by examining plots of residuals (differences between observed and predicted values) against predicted values. A 'bowed' pattern indicates the model is making systematic errors when making large or small predictions. A transformation of one or both variables may improve linearity of the response.
 The errors are independent. It is assumed that successive residuals are not correlated over time (serial correlation). This can be checked by plotting the serial correlation coefficient against time lag (a correlogram), and/or by using the Durbin-Watson test.
 The variance of the errors is constant (homoscedasticity), (a) versus time and (b) versus the predictions (or versus the independent variable). Failure to meet this assumption will result in confidence intervals that are too wide or too narrow. This can be checked by plotting the residuals versus time, versus predicted values and versus the independent variable. If the variance is not constant, a transformation can be used (although this has implications for linearity). Alternatively one can use a weighted regression, in which data with a larger variance are given less weight relative to data with a smaller variance (for details see Steel & Torrie (1980)).
 Errors are normally distributed. This is probably the least important assumption, as the regression line is essentially a mean, so the central limit theorem applies. Even so, the presence of a few large outliers can bias estimates of the coefficients and affect confidence intervals. It can be checked using a normal probability plot (a q-q plot) of the residuals.
 X is measured without error or values of X are fixed by the experimenter. This assumption must be met if the parameter values are of interest; if the regression is purely descriptive, or is being used for prediction, then this assumption can be disregarded.
Diagnostics
Regression diagnostics are used to detect outliers and/or influential points in the regression and to assess whether the various assumptions of the model are met. They can also suggest improvements to the regression model.
 Residuals & residual plots
The most important measure used in regression diagnostics is the simple residual (e). This is the difference between each observed and predicted value of Y.
Algebraically speaking

e_{i} = Y_{i} − Ŷ_{i} = Y_{i} − (a + bX_{i})

where
 Y_{i} and Ŷ_{i} are the observed and predicted values of Y,
 a is the intercept and b is the slope of the regression model.

Points with unusually high residuals, in other words points that do not fit the current model, are known as regression outliers.
Some researchers prefer to use standardized (or studentized) residuals. These are ordinary residuals divided by an estimate of their standard error, so they have equal variance. Since the mean of these errors is zero, and standardized residuals are assumed to be t-distributed, residuals greater than the critical value (or > 2) may be described as deviating significantly from the trend of the regression, and are subject to further scrutiny.
Other researchers prefer unstandardised residuals, partly because these have the same scale as the outcome variable (Y).
Algebraically speaking

r_{i} = e_{i} / √( MS_{error} (1 − h_{i}) )
where
 e_{i} is the residual for observation Y_{i},
 MS_{error} is the mean square residual for the regression model. Internally studentized residuals use the residual mean square from the model fitted to the full data set. Externally studentized residuals use the residual mean square from the model fitted to all the data except the ith observation.
 h_{i} is the leverage for observation X_{i} (see below).

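A sketch computing ordinary residuals, their leverages, and the internally studentized residuals defined above (data illustrative):

```python
import math

def studentized_residuals(X, Y):
    """Internally studentized residuals r_i = e_i / sqrt(MS_error * (1 - h_i))."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    e = [y - (a + b * x) for x, y in zip(X, Y)]    # ordinary residuals
    ms_error = sum(ei ** 2 for ei in e) / (n - 2)
    h = [1 / n + (x - mx) ** 2 / s_xx for x in X]  # leverages (see below)
    return [ei / math.sqrt(ms_error * (1 - hi)) for ei, hi in zip(e, h)]

r = studentized_residuals([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```

For this well-behaved dataset no studentized residual exceeds 2, so none of the points would be flagged as a regression outlier.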
The most important residual plot is of ordinary residuals against the predicted Y value. The scatter of the residuals in the vertical direction should be symmetrical around zero, and should be constant in relation to the predicted Y value. Standardized residuals often give a very similar picture to ordinary residuals, although they can differ when one value has unusually large leverage.
Leverage & leverage plots
Another useful diagnostic measure is the leverage coefficient (h); Unit 14 examines it in more detail. Outliers with respect to the X variable will have a high leverage, in other words they will have a considerable influence on the model parameters. The coefficient for each observed value of X (h_{i}) is computed as follows:
Algebraically speaking

h_{i} = 1/n + x_{i}^{2} / Σ(x^{2})
where
 h_{i} is the leverage for observation X_{i},
 n is the number of bivariate observations,
 x_{i} = X_{i} − X̄,
 X_{i} is the ith value of X, and X̄ is the mean of all X values,
 x = X − X̄,
 X is the entire set of individual X values.

A leverage plot is of leverage values plotted against index value (usually the order in which values are entered or obtained). Observations with a leverage of more than 2p/n (where p = Σh_{i}) should be examined more closely.
Detecting influential observations
We can define an influential point as one whose removal from the dataset would cause a large change in the fit. Any influential point is likely (but not necessarily certain) to either be an outlier or to have large leverage. The most popular measure of influence is Cook's Distance. This combines the information from residuals and leverage:
Algebraically speaking

D_{i} = (r_{i}^{2} / p) × h_{i} / (1 − h_{i})

where
 p is Σh_{i}, where h_{i} is the leverage for observation X_{i},
 r_{i} is the standardized (studentized) residual of the ith observation.

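Combining the two previous measures, Cook's distance can be sketched as follows (data illustrative; note that p = Σh_i equals 2, the number of fitted parameters, for simple linear regression):

```python
def cooks_distance(X, Y):
    """Cook's distance D_i = (r_i²/p) · h_i/(1 − h_i) for each observation."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    s_xx = sum((x - mx) ** 2 for x in X)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    b = s_xy / s_xx
    a = my - b * mx
    e = [y - (a + b * x) for x, y in zip(X, Y)]    # ordinary residuals
    ms_error = sum(ei ** 2 for ei in e) / (n - 2)
    h = [1 / n + (x - mx) ** 2 / s_xx for x in X]  # leverages
    p = sum(h)                                     # = number of fitted parameters (2)
    # squared studentized residuals r_i² = e_i² / (MS_error (1 − h_i))
    r2 = [ei ** 2 / (ms_error * (1 - hi)) for ei, hi in zip(e, h)]
    return [(r2i / p) * hi / (1 - hi) for r2i, hi in zip(r2, h)]

d = cooks_distance([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```

In this dataset the last point has the largest Cook's distance, because it combines a sizeable residual with the high leverage of an extreme X value.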
Related topics:
 Serial correlation
 Piecewise regression