The term "independent variable" suggests that its value can be chosen at will, and the term "dependent variable" suggests an effect, i.e., something causally dependent on the independent variable, as in a stimulus-response model. Although many linear regression models are formulated as models of cause and effect, the direction of causation may just as well go the other way, or there may be no causal relation at all.
Regression, in general, is the problem of estimating a conditional expected value. Linear regression is called "linear" because the relation of the dependent to the independent variables is a linear function of some parameters. Regression models which are not a linear function of the parameters are called nonlinear regression models. A neural network is an example of a nonlinear regression model.
Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the dependent and independent variables can be constructed from the conditional distribution of the dependent variable and the marginal distribution of the independent variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the dependent variable can be derived.
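The relation between the joint, conditional, and marginal distributions described above can be written out explicitly. This is only a sketch: it assumes that the distributions have densities, written here as p, a notation that does not appear in the original text.

```latex
p(x, y) = p(y \mid x)\, p(x), \qquad
p(y \mid x) = \frac{p(x, y)}{\int p(x, y')\, dy'}, \qquad
E(y \mid x) = \int y\, p(y \mid x)\, dy .
```

The first identity builds the joint distribution from the conditional and the marginal; the second works in the other direction; the third is the conditional expected value that regression estimates.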
Historical remarks
The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809. The term "least squares" comes from Legendre's term, moindres quarrés. However, Gauss claimed that he had known the method since 1795. Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss-Markov theorem.
The term "reversion" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon and applied the term "regression" to it. For Galton, regression had only this biological meaning, but his work (1877, 1885) was extended to a more general context by Karl Pearson and G.U. Yule (1897, 1903). In the work of Pearson and Yule, the joint distribution of the dependent and independent variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the dependent variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
Statement of the linear regression model
A linear regression model is typically stated in the form
$$y = \alpha + \beta x + \varepsilon.$$
The right-hand side may take other forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the dependent variable; it is conventionally called the "error" whether it is really a measurement error or not. The error term is conventionally assumed to have expected value equal to zero, as a nonzero expected value could be absorbed into α. See also errors and residuals in statistics; the difference between an error and a residual is also dealt with below.
An equivalent formulation which explicitly shows the linear regression as a model of conditional expectation is
$$E(y \mid x) = \alpha + \beta x,$$
with the conditional distribution of y given x essentially the same as the distribution of the error term, shifted by α + βx.
A linear regression model need not be affine, let alone linear, in the independent variables x. For example,
$$y = \alpha + \beta x + \gamma x^2 + \varepsilon$$
is a linear regression model, because the right-hand side is a linear combination of the parameters α, β, and γ.
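A minimal sketch of fitting a model that is quadratic in x but linear in its parameters, y = α + βx + γx² + ε, by ordinary least squares. The helper name `fit_quadratic` and the data are hypothetical, chosen only for illustration; the sketch assumes NumPy is available and treats x² as an additional column of the design matrix.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit of y = alpha + beta*x + gamma*x**2 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with columns 1, x, x^2: the model is linear in the parameters,
    # even though it is not linear in x.
    design = np.column_stack([np.ones_like(x), x, x**2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # array [alpha, beta, gamma]

# Illustrative, noise-free data generated from y = 1 + 2x + 3x^2.
x_demo = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y_demo = 1.0 + 2.0 * x_demo + 3.0 * x_demo**2
alpha_hat, beta_hat, gamma_hat = fit_quadratic(x_demo, y_demo)
```

Because the data here contain no noise, the fitted coefficients should reproduce the generating values up to rounding error.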
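Often in linear regression problems statisticians rely on the Gauss-Markov assumptions: the random errors $\varepsilon_i$ have expected value 0, are uncorrelated (a weaker condition than independence), and are homoscedastic, i.e., they all have the same variance.
(See also Gauss-Markov theorem. That result says that under the assumptions above, the least-squares estimators are optimal in the sense that they have the smallest variance among all linear unbiased estimators.)
Sometimes stronger assumptions are relied on: the random errors $\varepsilon_i$ have expected value 0, are independent, are normally distributed, and all have the same variance.
Parameter estimation
A statistician will usually estimate the unobservable values of the parameters α and β by the method of least squares, which consists of finding the values of a and b that minimize the sum of squares of the residuals
$$\sum_{i=1}^{n} (y_i - a - b x_i)^2.$$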
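Notice that, whereas the errors are independent, the residuals cannot be independent, because the use of least-squares estimates implies that the sum of the residuals must be 0, and the dot product of the vector of residuals with the vector of x-values must be 0, i.e., we must have
$$\sum_{i=1}^{n} e_i = 0 \quad\text{and}\quad \sum_{i=1}^{n} e_i x_i = 0,$$
where $e_i = y_i - a - b x_i$ is the ith residual and a and b are the least-squares estimates of α and β.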
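The two constraints on the residuals can be checked numerically. The following is a small sketch in plain Python; the function name `least_squares_line` and the data are hypothetical and serve only as an illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing sum((y - a - b*x)**2) for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = least_squares_line(xs, ys)

residuals = [y - a - b * x for x, y in zip(xs, ys)]
sum_residuals = sum(residuals)                            # zero up to rounding
dot_with_x = sum(e * x for e, x in zip(residuals, xs))    # zero up to rounding
```

Both quantities vanish up to floating-point rounding, which is why the n residuals carry only n − 2 degrees of freedom.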
These facts make it possible to use Student's t-distribution with n − 2 degrees of freedom (so named in honor of the pseudonymous "Student") to find confidence intervals for α and β.
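As a rough sketch of how such intervals might be computed, assuming independent, normally distributed errors with equal variance: the standard errors below are the usual textbook expressions for simple regression, and the critical value of the t-distribution is passed in by the caller rather than computed. The function name, the argument `t_crit`, and the data are hypothetical.

```python
import math

def slope_intercept_intervals(xs, ys, t_crit):
    """Confidence intervals for alpha and beta in y = alpha + beta*x + error.

    t_crit is the two-sided critical value of Student's t-distribution with
    n - 2 degrees of freedom, supplied by the caller (for example from a table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    # The variance estimate divides by n - 2 because two parameters are fitted.
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return {
        "alpha": (a - t_crit * se_a, a + t_crit * se_a),
        "beta": (b - t_crit * se_b, b + t_crit * se_b),
    }

# Hypothetical data; 3.182 is the tabulated 97.5% point of t with 3 degrees
# of freedom, giving 95% intervals for n = 5 observations.
intervals = slope_intercept_intervals(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0], t_crit=3.182
)
```

Taking the critical value as an argument keeps the sketch free of external dependencies; a statistics library could supply it instead.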
Denote by capital Y the column vector whose ith entry is $y_i$, and by capital X the n × 2 matrix whose first column contains n 1s and whose second column contains $x_i$ as its ith entry. Let ε be the column vector containing the errors $\varepsilon_i$. Let δ and d be, respectively, the 2 × 1 column vector containing α and β and the 2 × 1 column vector containing the estimates a and b. Then the model can be written as
$$Y = X\delta + \varepsilon.$$
Then it can be shown that
$$d = (X'X)^{-1}X'Y,$$
where X' is the transpose of X, and that the sum of squares of the residuals is
$$Y'\left(I_n - X(X'X)^{-1}X'\right)Y.$$
The matrix $I_n - X(X'X)^{-1}X'$ that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix G'MG is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries in G'MG must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So $I_n - X(X'X)^{-1}X'$, after diagonalization, has n − 2 1s and two 0s on the diagonal, since its rank is n − 2. That is most of the work in showing that the sum of squares of residuals, divided by the error variance, has a chi-square distribution with n − 2 degrees of freedom.
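These matrix facts can be checked numerically. The sketch below assumes NumPy is available and uses hypothetical data; the variable names are illustrative and not part of the original text.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# n x 2 design matrix: a column of 1s, then the x values.
X = np.column_stack([np.ones(n), x])

# d = (X'X)^{-1} X'Y, computed by solving the normal equations
# rather than forming the inverse explicitly.
d = np.linalg.solve(X.T @ X, X.T @ y)

# M = I_n - X (X'X)^{-1} X'
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

is_symmetric = np.allclose(M, M.T)
is_idempotent = np.allclose(M @ M, M)
eigenvalues = np.linalg.eigvalsh(M)   # real eigenvalues, in ascending order
rss = y @ M @ y                       # sum of squares of the residuals, Y'MY
```

With n = 5, the eigenvalues come out as two values near 0 and three values near 1, matching the rank n − 2.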
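We sum the observations, the squares of the Xs and Ys, and the products XY to obtain the following quantities:
$$S_X = \sum_{i=1}^{n} x_i, \quad S_Y = \sum_{i=1}^{n} y_i, \quad S_{XX} = \sum_{i=1}^{n} x_i^2, \quad S_{YY} = \sum_{i=1}^{n} y_i^2, \quad S_{XY} = \sum_{i=1}^{n} x_i y_i.$$
We use the summary statistics above to calculate b, the estimate of β:
$$b = \frac{n S_{XY} - S_X S_Y}{n S_{XX} - S_X^2}.$$
We use the estimate of β and the other statistics to estimate α by
$$a = \frac{S_Y - b S_X}{n}.$$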
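The calculation from summary statistics can be sketched as follows. The function name `fit_from_sums` and the data are hypothetical; the formulas used are the standard closed-form least-squares expressions for a single predictor.

```python
def fit_from_sums(xs, ys):
    """Estimate alpha and beta from running sums (hypothetical helper)."""
    n = len(xs)
    s_x = sum(xs)                                # sum of the x values
    s_y = sum(ys)                                # sum of the y values
    s_xx = sum(x * x for x in xs)                # sum of squares of x
    s_xy = sum(x * y for x, y in zip(xs, ys))    # sum of products x*y
    b = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)
    a = (s_y - b * s_x) / n
    return a, b

# Hypothetical data, chosen only for illustration.
a_hat, b_hat = fit_from_sums([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
```

The raw-sum form is convenient for hand calculation, but it can lose precision when the values are large relative to their spread; centering the data before summing is a common alternative.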
The first method of displaying the residuals uses a histogram or cumulative distribution to depict their similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.
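We plot the residuals against the independent variable, X. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are: a curved pattern (residuals that first rise and then fall, or the reverse), which suggests that a term such as a quadratic is missing from the model; a vertical spread that widens or narrows as X increases, which indicates heteroscedasticity; and one residual much larger than the others, which suggests an outlier.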
The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent variable is explained by the independent variable.
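The correlation coefficient, r, can be calculated by
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}}.$$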
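The partition of the sum of squared deviations and its link to the correlation coefficient can be illustrated with a short sketch. The function names and data are hypothetical; the sketch assumes a single independent variable and a least-squares fit with an intercept.

```python
import math

def anova_partition(xs, ys):
    """Return (total, explained, residual) sums of squares for a fitted line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_explained = sum((f - y_bar) ** 2 for f in fitted)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return ss_total, ss_explained, ss_residual

def correlation(xs, ys):
    """Correlation coefficient r computed from raw sums."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs)
    s_yy = sum(y * y for y in ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * s_xy - s_x * s_y) / math.sqrt(
        (n * s_xx - s_x ** 2) * (n * s_yy - s_y ** 2)
    )

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ss_total, ss_explained, ss_residual = anova_partition(xs, ys)
r = correlation(xs, ys)
```

For a least-squares line with an intercept, the total sum of squares equals the explained part plus the residual part, and r squared equals the explained fraction.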