Ordinary least squares
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed values of the dependent variable in the dataset and the values predicted by the linear function of the independent variables. Some sources consider OLS to be synonymous with linear regression.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect collinearity (the rank condition), and it is consistent for the variance estimate of the residuals when the regressors have finite fourth moments. By the Gauss–Markov theorem it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator, and it outperforms all non-linear unbiased estimators.
Linear model
Suppose the data consists of $n$ observations $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $\mathbf{x}_i$ of values of $p$ regressors, i.e. $\mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^{\mathsf T}$. In a linear regression model, the response variable, $y_i$, is a linear function of the regressors:
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$
or in vector form,
$$y_i = \mathbf{x}_i^{\mathsf T} \boldsymbol\beta + \varepsilon_i,$$
where, as introduced previously, $\mathbf{x}_i$ is a column vector of the $i$-th observation of all the explanatory variables; $\boldsymbol\beta$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables (errors) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $\mathbf{x}_i$. This model can also be written in matrix notation as
$$\mathbf{y} = X \boldsymbol\beta + \boldsymbol\varepsilon,$$
where $\mathbf{y}$ and $\boldsymbol\varepsilon$ are $n \times 1$ vectors of the response variables and the errors of the $n$ observations, and $X$ is an $n \times p$ matrix of regressors, also sometimes called the design matrix, whose row $i$ is $\mathbf{x}_i^{\mathsf T}$ and contains the $i$-th observations on all the explanatory variables.
Typically, a constant term is included in the set of regressors, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept. Without the intercept, the fitted line is forced to pass through the origin, i.e. to predict $y = 0$ when all regressors are zero.
Regressors do not have to be independent of each other for estimation to be consistent; for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard error around such estimates increases and the precision of the estimates falls. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation for these parameters cannot converge.
As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, suppose we suspect the response depends linearly both on a value and on its square; in that case we would include one regressor whose value is just the square of another regressor. The model would then be quadratic in the second regressor, but it is nonetheless still considered a linear model because the model is still linear in the parameters $\boldsymbol\beta$.
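As a minimal sketch of this idea (using NumPy with simulated data; all variable names are illustrative, not from the text), the following fits a model that is quadratic in a regressor but linear in the parameters by including the squared value as an extra column of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
# Simulated response with a true quadratic relationship (assumed for illustration)
y = 1.5 + 2.0 * x - 0.7 * x**2 + rng.normal(scale=0.5, size=100)

# Design matrix: intercept, x, and x**2 -- quadratic in x, but linear in the parameters
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of the (intercept, linear, quadratic) coefficients
```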
Matrix/vector formulation
Consider an overdetermined system
$$\sum_{j=1}^{p} x_{ij} \beta_j = y_i, \qquad i = 1, 2, \dots, n,$$
of $n$ linear equations in $p$ unknown coefficients, $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in matrix form as
$$X \boldsymbol\beta = \mathbf{y},$$
where
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \qquad \boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
Such a system usually has no exact solution, so the goal is instead to find the coefficients $\boldsymbol\beta$ which fit the equations "best", in the sense of solving the quadratic minimization problem
$$\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\; S(\boldsymbol\beta),$$
where the objective function $S$ is given by
$$S(\boldsymbol\beta) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right|^2 = \left\| \mathbf{y} - X \boldsymbol\beta \right\|^2.$$
A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $X$ are linearly independent, given by solving the so-called normal equations:
$$\left(X^{\mathsf T} X\right) \hat{\boldsymbol\beta} = X^{\mathsf T} \mathbf{y}.$$
The matrix $X^{\mathsf T} X$ is known as the normal matrix or Gram matrix, and the matrix $X^{\mathsf T} \mathbf{y}$ is known as the moment matrix of the regressand by the regressors. Finally, $\hat{\boldsymbol\beta}$ is the coefficient vector of the least-squares hyperplane, expressed as
$$\hat{\boldsymbol\beta} = \left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T} \mathbf{y},$$
or
$$\hat{\boldsymbol\beta} = \boldsymbol\beta + \left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T} \boldsymbol\varepsilon.$$
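The following is a small illustrative sketch (NumPy, simulated data; the names are not from the text) of obtaining $\hat{\boldsymbol\beta}$ by solving the normal equations directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # include an intercept column
beta_true = np.array([1.0, -2.0, 0.5])                          # assumed true coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) beta = X^T y without forming an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```

In practice, forming $X^{\mathsf T} X$ explicitly can be numerically ill-conditioned; routines such as `np.linalg.lstsq`, which work from a factorization of $X$ itself, are usually preferred.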
Estimation
Suppose b is a "candidate" value for the parameter vector β. The quantity, called the residual for the i-th observation, measures the vertical distance between the data point and the hyperplane, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals or residual sum of squares ) is a measure of the overall model fit:where T denotes the matrix transpose, and the rows of X, denoting the values of all the independent variables associated with a particular value of the dependent variable, are Xi = xiT. The value of b which minimizes this sum is called the OLS estimator for β. The function S is quadratic in b with positive-definite Hessian, and therefore this function possesses a unique global minimum at, which can be given by the explicit formulaProofs involving ordinary least squares#Least squares estimator for.CE.B2|
The product $N = X^{\mathsf T} X$ is a Gram matrix, and its inverse, $Q = N^{-1}$, is the cofactor matrix of $\boldsymbol\beta$, closely related to its covariance matrix, $C_\beta$.
The matrix $\left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T} = Q X^{\mathsf T}$ is called the Moore–Penrose pseudoinverse matrix of $X$. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
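As a hedged check of this formulation (simulated data; the setup is illustrative), the explicit expression $(X^{\mathsf T} X)^{-1} X^{\mathsf T}$ coincides with NumPy's Moore–Penrose pseudoinverse whenever $X$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.normal(size=40)

pinv_explicit = np.linalg.inv(X.T @ X) @ X.T   # (X^T X)^{-1} X^T, valid only with full column rank
pinv_numpy = np.linalg.pinv(X)                  # Moore-Penrose pseudoinverse computed via SVD
print(np.allclose(pinv_explicit, pinv_numpy))          # True when there is no perfect multicollinearity
print(np.allclose(pinv_explicit @ y, pinv_numpy @ y))  # the same OLS estimate either way
```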
Prediction
After we have estimated $\boldsymbol\beta$, the fitted values (or predicted values) from the regression will be
$$\hat{\mathbf{y}} = X \hat{\boldsymbol\beta} = P \mathbf{y},$$
where $P = X \left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T}$ is the projection matrix onto the space $V$ spanned by the columns of $X$. This matrix $P$ is also sometimes called the hat matrix because it "puts a hat" onto the variable $y$. Another matrix, closely related to $P$, is the annihilator matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to $V$. Both matrices $P$ and $M$ are symmetric and idempotent, and relate to the data matrix $X$ via the identities $PX = X$ and $MX = 0$. Matrix $M$ creates the residuals from the regression:
$$\hat{\boldsymbol\varepsilon} = \mathbf{y} - X \hat{\boldsymbol\beta} = M \mathbf{y} = M \boldsymbol\varepsilon.$$
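A brief numerical sketch (simulated data, illustrative names) verifying the stated properties of the hat matrix $P$ and the annihilator matrix $M$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection ("hat") matrix
M = np.eye(n) - P                      # annihilator matrix

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # both are idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))   # P X = X and M X = 0

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = M @ y                                     # M y reproduces the residuals
print(np.allclose(residuals, y - X @ beta_hat))
```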
The variances of the predicted values are found in the main diagonal of the variance-covariance matrix of the predicted values:
$$\widehat{\operatorname{Var}}\!\left(\hat{\mathbf{y}}\right) = s^2 P,$$
where $P$ is the projection matrix and $s^2$ is the sample estimate of the error variance (defined below).
The full matrix is very large; its diagonal elements can be calculated individually as:
$$\widehat{\operatorname{Var}}\!\left(\hat{y}_i\right) = s^2\, X_i \left(X^{\mathsf T} X\right)^{-1} X_i^{\mathsf T},$$
where $X_i$ is the $i$-th row of the matrix $X$.
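A sketch under the same simulated-data assumptions, computing these diagonal elements row by row rather than building the full $n \times n$ matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                 # unbiased estimate of the error variance

XtX_inv = np.linalg.inv(X.T @ X)
# Diagonal of s^2 * X (X^T X)^{-1} X^T, i.e. s^2 * x_i^T (X^T X)^{-1} x_i for each row
var_pred = s2 * np.einsum('ij,jk,ik->i', X, XtX_inv, X)
print(var_pred[:5])
```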
Sample statistics
Using these residuals we can estimate the error variance $\sigma^2$ using the reduced chi-squared statistic:
$$s^2 = \frac{\hat{\boldsymbol\varepsilon}^{\mathsf T} \hat{\boldsymbol\varepsilon}}{n - p} = \frac{S(\hat{\boldsymbol\beta})}{n - p}, \qquad \hat\sigma^2 = \frac{n - p}{n}\, s^2.$$
The denominator, $n - p$, is the statistical degrees of freedom. The first quantity, $s^2$, is the OLS estimate for $\sigma^2$, whereas the second, $\hat\sigma^2$, is the MLE estimate for $\sigma^2$. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice $s^2$ is used more often, since it is more convenient for hypothesis testing. The square root of $s^2$ is called the regression standard error, standard error of the regression, or standard error of the equation.
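A short simulated-data sketch comparing the two estimators of $\sigma^2$ (the true error scale used to generate the data is an assumption of the example, not something from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.5, size=n)  # true sigma = 1.5

resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
rss = resid @ resid
s2 = rss / (n - p)          # unbiased OLS estimate of sigma^2
sigma2_mle = rss / n        # MLE estimate: biased downward, smaller mean squared error
print(s2, sigma2_mle)       # both should be near 1.5**2 = 2.25
```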
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto $X$. The coefficient of determination $R^2$ is defined as a ratio of the "explained" variance to the "total" variance of the dependent variable $y$, in the cases where the total sum of squares decomposes into the regression (explained) sum of squares plus the sum of squared residuals:
$$R^2 = \frac{\sum \left(\hat y_i - \bar y\right)^2}{\sum \left(y_i - \bar y\right)^2} = 1 - \frac{\hat{\boldsymbol\varepsilon}^{\mathsf T} \hat{\boldsymbol\varepsilon}}{\mathbf{y}^{\mathsf T} L \mathbf{y}} = 1 - \frac{\text{RSS}}{\text{TSS}},$$
where TSS $= \mathbf{y}^{\mathsf T} L \mathbf{y}$ is the total sum of squares for the dependent variable, $L = I_n - \tfrac{1}{n} J_n$ is the centering matrix, and $J_n$ is an $n \times n$ matrix of ones. In order for $R^2$ to be meaningful, the matrix $X$ of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, $R^2$ will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
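A minimal sketch (simulated data) of computing $R^2$ as $1 - \text{RSS}/\text{TSS}$, with the required column of ones included in $X$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant column => intercept
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = 1 - rss / tss
print(r2)   # between 0 and 1 when an intercept is included
```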
Simple linear regression model
If the data matrix $X$ contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in introductory statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as $(\alpha, \beta)$:
$$y_i = \alpha + \beta x_i + \varepsilon_i.$$
The least squares estimates in this case are given by the simple formulas
$$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta\, \bar x.$$
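As an illustrative sketch (simulated data), the closed-form simple-regression estimates can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=60)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=60)   # assumed true (alpha, beta) = (0.5, 2.0)

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)   # should be close to (0.5, 2.0)
```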
Alternative derivations
In the previous section the least squares estimator $\hat{\boldsymbol\beta}$ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: $\hat{\boldsymbol\beta} = \left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T} \mathbf{y}$; the only difference is in how we interpret this result.
Projection
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X \boldsymbol\beta \approx \mathbf{y}$, where $\boldsymbol\beta$ is the unknown. Assuming the system cannot be solved exactly, we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies
$$\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta}{\operatorname{arg\,min}} \left\| \mathbf{y} - X \boldsymbol\beta \right\|,$$
where $\| \cdot \|$ is the standard $L^2$ norm in the $n$-dimensional Euclidean space $\mathbb{R}^n$. The predicted quantity $X \boldsymbol\beta$ is just a certain linear combination of the vectors of regressors. Thus, the residual vector $\mathbf{y} - X \boldsymbol\beta$ will have the smallest length when $\mathbf{y}$ is projected orthogonally onto the linear subspace spanned by the columns of $X$. The OLS estimator $\hat{\boldsymbol\beta}$ in this case can be interpreted as the coefficients of the vector decomposition of $\hat{\mathbf{y}} = P \mathbf{y}$ along the basis of $X$.
In other words, the gradient equations at the minimum can be written as:
$$\left(\mathbf{y} - X \hat{\boldsymbol\beta}\right)^{\mathsf T} X = 0.$$
A geometrical interpretation of these equations is that the vector of residuals, $\mathbf{y} - X \hat{\boldsymbol\beta}$, is orthogonal to the column space of $X$, since the dot product $\left(\mathbf{y} - X \hat{\boldsymbol\beta}\right) \cdot X \mathbf{v}$ is equal to zero for any conformal vector $\mathbf{v}$. This means that $\mathbf{y} - X \hat{\boldsymbol\beta}$ is the shortest of all possible vectors $\mathbf{y} - X \boldsymbol\beta$, that is, the variance of the residuals is the minimum possible.
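A quick numeric check of this orthogonality condition on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
print(X.T @ resid)   # numerically zero: residuals are orthogonal to the column space of X
```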
Introducing $\hat{\boldsymbol\gamma}$ and a matrix $K$ with the assumption that the matrix $\begin{bmatrix} X & K \end{bmatrix}$ is non-singular and $K^{\mathsf T} X = 0$, the residual vector should satisfy the following equation:
$$\hat{\mathbf{r}} := \mathbf{y} - X \hat{\boldsymbol\beta} = K \hat{\boldsymbol\gamma}.$$
The equation and solution of linear least squares are thus described as follows:
$$\mathbf{y} = \begin{bmatrix} X & K \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\boldsymbol\gamma} \end{bmatrix}, \qquad \begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\boldsymbol\gamma} \end{bmatrix} = \begin{bmatrix} X & K \end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left(X^{\mathsf T} X\right)^{-1} X^{\mathsf T} \\ \left(K^{\mathsf T} K\right)^{-1} K^{\mathsf T} \end{bmatrix} \mathbf{y}.$$
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through any two points in the dataset. Although this way of calculating it is more computationally expensive, it provides better intuition about OLS.
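This claim can be checked numerically for the simple-regression case; the sketch below assumes (since the text does not specify them) weights proportional to the squared horizontal distance $(x_j - x_i)^2$ between the two points of each pair:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
x = rng.normal(size=30)
y = 1.0 + 2.5 * x + rng.normal(scale=0.4, size=30)

# OLS slope and intercept from the usual closed-form formulas
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Weighted average of the lines through every pair of points,
# with weights proportional to (x_j - x_i)**2
num_slope = num_icept = denom = 0.0
for i, j in combinations(range(len(x)), 2):
    w = (x[j] - x[i]) ** 2
    pair_slope = (y[j] - y[i]) / (x[j] - x[i])
    pair_icept = y[i] - pair_slope * x[i]
    num_slope += w * pair_slope
    num_icept += w * pair_icept
    denom += w

print(np.allclose(slope, num_slope / denom),
      np.allclose(intercept, num_icept / denom))   # both True
```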