Local regression


Local regression or local polynomial regression, also known as moving regression, is a generalization of the moving average and polynomial regression.
Its most common methods, initially developed for scatterplot smoothing, are LOESS and LOWESS, both pronounced /ˈloʊɛs/. They are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model.
In some fields, LOESS is known and commonly referred to as the Savitzky–Golay filter.
LOESS and LOWESS thus build on "classical" methods, such as linear and nonlinear least squares regression. They address situations in which the classical procedures do not perform well or cannot be effectively applied without undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of nonlinear regression. It does this by fitting simple models to localized subsets of the data to build up a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not required to specify a global function of any form to fit a model to the data, only to fit segments of the data.
The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would have been practically impossible to use in the era when least squares regression was being developed. Most other modern methods for process modelling are similar to LOESS in this respect. These methods have been consciously designed to use our current computational ability to the fullest possible advantage to achieve goals not easily achieved by traditional approaches.
A smooth curve through a set of data points obtained with this statistical technique is called a loess curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least squares regression over the span, this is known as a lowess curve. However, some authorities treat lowess and loess as synonyms.

History

Local regression and closely related procedures have a long and rich history, having been discovered and rediscovered in different fields on multiple occasions. An early work by Robert Henderson studying the problem of graduation (an actuarial term for data smoothing) introduced local regression using cubic polynomials.
Specifically, let $u_1, u_2, \ldots$ denote an ungraduated sequence of observations. Following Henderson, suppose that only the terms from $u_{x-m}$ to $u_{x+m}$ are to be taken into account when computing the graduated value of $u_x$, and that $w_t$ is the weight to be assigned to $u_{x+t}$. Henderson then uses a local cubic polynomial approximation, $a + bt + ct^2 + dt^3$, and sets up the following four equations for the coefficients:
$$ \sum_{t=-m}^{m} w_t\, t^k \left( u_{x+t} - a - b t - c t^2 - d t^3 \right) = 0, \qquad k = 0, 1, 2, 3. $$
Solving these equations for the polynomial coefficients yields the graduated value, $\hat{u}_x = \hat{a}$.
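This construction can be written out directly. The minimal numpy sketch below is only an illustration: the function name, the triangular weight sequence, and the example series are arbitrary choices, not anything prescribed by Henderson.

```python
import numpy as np

def henderson_graduation_at(u, x, m, w):
    """Graduate u[x] by a weighted local cubic fit to u[x-m .. x+m].

    u : 1-D array of ungraduated, equally spaced observations
    x : index of the value being graduated
    m : half-width of the smoothing window
    w : weights for offsets t = -m .. m (array of length 2*m + 1)
    """
    t = np.arange(-m, m + 1)                    # local offsets
    y = u[x - m: x + m + 1]                     # the 2m + 1 values entering the fit
    T = np.vander(t, N=4, increasing=True)      # columns 1, t, t^2, t^3
    W = np.diag(w)
    # The four normal equations  T' W T (a, b, c, d)' = T' W y
    a, b, c, d = np.linalg.solve(T.T @ W @ T, T.T @ W @ y)
    return a                                    # graduated value: fitted cubic at t = 0

# Illustrative use with simulated data and a triangular weight sequence
rng = np.random.default_rng(0)
u = np.sin(np.linspace(0, 3, 41)) + 0.05 * rng.standard_normal(41)
m = 7
w = (m + 1 - np.abs(np.arange(-m, m + 1))).astype(float)
print(henderson_graduation_at(u, 20, m, w))
```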
Henderson went further. In preceding years, many 'summation formula' methods of graduation had been developed, which derived graduation rules based on summation formulae. Two such rules are the 15-point and 21-point rules of Spencer. These graduation rules were carefully designed to have a quadratic-reproducing property: If the ungraduated values exactly follow a quadratic formula, then the graduated values equal the ungraduated values. This is an important property: a simple moving average, by contrast, cannot adequately model peaks and troughs in the data. Henderson's insight was to show that any such graduation rule can be represented as a local cubic fit for an appropriate choice of weights.
Further discussions of the historical work on graduation and local polynomial fitting can be found in Macaulay; Cleveland and Loader; and Murray and Bellhouse.
The Savitzky–Golay filter, introduced by Abraham Savitzky and Marcel J. E. Golay, significantly expanded the method. Like the earlier graduation work, their focus was data with an equally-spaced predictor variable, where local regression can be represented as a convolution. Savitzky and Golay published extensive sets of convolution coefficients for different orders of polynomial and smoothing window widths.
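For equally spaced data, the local fit at each point reduces to a fixed set of convolution coefficients. The following numpy sketch illustrates this reduction for the unweighted local cubic case; the window half-width and the example data are illustrative choices.

```python
import numpy as np

def cubic_smoothing_coeffs(m):
    """Convolution coefficients for an unweighted local cubic fit on a window
    of 2*m + 1 equally spaced points (Savitzky-Golay smoothing)."""
    t = np.arange(-m, m + 1)
    T = np.vander(t, N=4, increasing=True)       # columns 1, t, t^2, t^3
    # First row of (T'T)^{-1} T' gives the weights that produce the fitted
    # value at the window centre.
    return np.linalg.solve(T.T @ T, T.T)[0]

coeffs = cubic_smoothing_coeffs(2)               # 5-point smoother: (-3, 12, 17, 12, -3)/35
y = np.cos(np.linspace(0, 2, 50))                # illustrative equally spaced data
smoothed = np.convolve(y, coeffs[::-1], mode="valid")
```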
Local regression methods started to appear extensively in statistics literature in the 1970s; for example, Charles J. Stone, Vladimir Katkovnik and William S. Cleveland. Katkovnik's book is the earliest devoted primarily to local regression methods.
Theoretical work continued to appear throughout the 1990s. Important contributions include Jianqing Fan and Irène Gijbels studying efficiency properties, and David Ruppert and Matthew P. Wand developing an asymptotic distribution theory for multivariate local regression.
An important extension of local regression is Local Likelihood Estimation, formulated by Robert Tibshirani and Trevor Hastie. This replaces the local least-squares criterion with a likelihood-based criterion, thereby extending the local regression method to the generalized linear model setting; for example, binary data, count data or censored data.
Practical implementations of local regression began appearing in statistical software in the 1980s. Cleveland introduced the LOWESS routines, intended for smoothing scatterplots. These implement local linear fitting with a single predictor variable, and also introduce robustness downweighting to make the procedure resistant to outliers. An entirely new implementation, LOESS, was described by Cleveland and Susan J. Devlin. LOESS is a multivariate smoother, able to handle spatial data with two predictor variables, and uses local quadratic fitting. Both LOWESS and LOESS are implemented in the S and R programming languages. See also Cleveland's Local Fitting Software.
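Comparable routines are available outside S and R as well. For instance, the minimal sketch below uses the `lowess` function from the Python statsmodels package (not Cleveland's original code); the simulated data and the smoothing fraction are illustrative.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# Local linear fitting with robustness iterations, using a nearest-neighbour
# span covering 30% of the data; returns sorted (x, fitted value) pairs.
smoothed = lowess(y, x, frac=0.3, it=3)
```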
While Local Regression, LOWESS and LOESS are sometimes used interchangeably, this usage should be considered incorrect. Local Regression is a general term for the fitting procedure; LOWESS and LOESS are two distinct implementations.

Model definition

Local regression uses a data set consisting of observations of one or more ‘independent’ or ‘predictor’ variables, and a ‘dependent’ or ‘response’ variable. The dataset will consist of a number $n$ of observations. The observations of the predictor variable can be denoted $x_1, \ldots, x_n$, and the corresponding observations of the response variable by $y_1, \ldots, y_n$.
For ease of presentation, the development below assumes a single predictor variable; the extension to multiple predictors is conceptually straightforward. A functional relationship between the predictor and response variables is assumed:
$$ y_i = \mu(x_i) + \varepsilon_i , $$
where $\mu(x)$ is the unknown ‘smooth’ regression function to be estimated, and represents the conditional expectation of the response, given a value of the predictor variables. In theoretical work, the ‘smoothness’ of this function can be formally characterized by placing bounds on higher order derivatives. The $\varepsilon_i$ represent random errors; for estimation purposes these are assumed to have mean zero. Stronger assumptions may be made when assessing properties of the estimates.
Local regression then estimates the function $\mu(x)$, for one value of $x$ at a time. Since the function is assumed to be smooth, the most informative data points are those whose $x_i$ values are close to $x$. This is formalized with a bandwidth $h$ and a kernel or weight function $W$, with observation $i$ assigned the weight
$$ w_i(x) = W\!\left( \frac{x_i - x}{h} \right). $$
A typical choice of $W$, used by Cleveland in LOWESS, is the tricube function $W(u) = (1 - |u|^3)^3$ for $|u| < 1$ (and $W(u) = 0$ otherwise), although any similar function can be used. Questions of bandwidth selection and specification are deferred for now.
A local model, expressed as
$$ \mu(x_i) \approx \beta_0 + \beta_1 (x_i - x) + \cdots + \beta_p (x_i - x)^p , $$
is then fitted by weighted least squares: choose regression coefficients $(\hat\beta_0, \ldots, \hat\beta_p)$
to minimize
$$ \sum_{i=1}^{n} w_i(x) \left( y_i - \beta_0 - \beta_1 (x_i - x) - \cdots - \beta_p (x_i - x)^p \right)^2 . $$
The local regression estimate of $\mu(x)$ is then simply the intercept estimate:
$$ \hat\mu(x) = \hat\beta_0 , $$
while the remaining coefficients can be interpreted (up to a factorial scaling)
as derivative estimates: $\widehat{\mu^{(j)}}(x) = j!\,\hat\beta_j$.
It is to be emphasized that the above procedure produces the estimate $\hat\mu(x)$ for one value of $x$. When considering a new value of $x$, a new set of weights must be computed, and the regression coefficients estimated afresh.
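The procedure can be sketched directly in numpy. In the sketch below, the tricube weight, the local quadratic degree, the bandwidth, and the simulated data are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def tricube(u):
    """Tricube weight function W(u) = (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = np.abs(u)
    return np.where(u < 1, (1 - u**3) ** 3, 0.0)

def local_polynomial_fit(x0, x, y, h, degree=2):
    """Estimate mu(x0) by a weighted least squares fit of a degree-p
    polynomial in (x_i - x0), using weights W((x_i - x0) / h)."""
    w = tricube((x - x0) / h)
    X = np.vander(x - x0, N=degree + 1, increasing=True)   # columns 1, (x - x0), ...
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                                          # intercept = estimate of mu(x0)

# Evaluate the estimate over a grid; each point requires a fresh set of weights.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + rng.normal(scale=0.2, size=150)
grid = np.linspace(0.5, 9.5, 50)
mu_hat = np.array([local_polynomial_fit(x0, x, y, h=1.5) for x0 in grid])
```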

Matrix representation of the local regression estimate

As with all least squares estimates, the estimated regression coefficients can be expressed in closed form:
$$ \hat\beta = \left( X^{\mathsf{T}} W X \right)^{-1} X^{\mathsf{T}} W y , $$
where $\hat\beta$ is the vector of the local regression coefficients; $X$ is the $n \times (p+1)$ design matrix with entries $(x_i - x)^j$; $W$ is a diagonal matrix of the smoothing weights $w_i(x)$; and $y$ is the vector of the responses.
This matrix representation is crucial for studying the theoretical properties of local regression estimates. With appropriate definitions of the design and weight matrices, it immediately generalizes to the multiple-predictor setting.
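For example, the matrix form makes explicit that the fitted value at a point is a linear combination of the responses, with coefficients given by the first row of $(X^{\mathsf{T}} W X)^{-1} X^{\mathsf{T}} W$. A minimal numpy sketch of this (with illustrative tricube weights, bandwidth, and data) is:

```python
import numpy as np

def hat_weights(x0, x, h, degree=2):
    """The vector l(x0) with mu_hat(x0) = l(x0)' y, i.e. the first row of
    (X' W X)^{-1} X' W evaluated at the fitting point x0."""
    w = np.clip(1 - np.abs((x - x0) / h) ** 3, 0, None) ** 3   # tricube weights
    X = np.vander(x - x0, N=degree + 1, increasing=True)       # design matrix
    XtW = X.T * w                                              # X' W, with W diagonal
    return np.linalg.solve(XtW @ X, XtW)[0]

# With i.i.d. errors of variance sigma^2, Var(mu_hat(x0)) = sigma^2 * sum(l**2),
# which is one way the matrix form is used in theoretical calculations.
x = np.linspace(0, 10, 101)
l = hat_weights(5.0, x, h=1.5)
```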

Selection issues: bandwidth, local model, fitting criteria

Implementation of local regression requires specification and selection of several components:
  1. The bandwidth, and more generally the localized subsets of the data.
  2. The degree of local polynomial, or more generally, the form of the local model.
  3. The choice of weight function.
  4. The choice of fitting criterion.
Each of these components has been the subject of extensive study; a summary is provided below.

Localized subsets of data; Bandwidth

The bandwidth $h$ controls the resolution of the local regression estimate. If $h$ is too small, the estimate may show high-resolution features that represent noise in the data, rather than any real structure in the mean function. Conversely, if $h$ is too large, the estimate will only show low-resolution features, and important structure may be lost. This is the bias-variance tradeoff: if $h$ is too small, the estimate exhibits large variance, while at large $h$ the estimate exhibits large bias.
Careful choice of bandwidth is therefore crucial when applying local regression. Mathematical methods for bandwidth selection require, firstly, formal criteria to assess the performance of an estimate. One such criterion is prediction error: if a new observation is made at $x_{\mathrm{new}}$, how well does the estimate $\hat\mu(x_{\mathrm{new}})$ predict the new response $y_{\mathrm{new}}$?
Performance is often assessed using a squared-error loss function. The mean squared prediction error is
$$ E\left[ \left( y_{\mathrm{new}} - \hat\mu(x_{\mathrm{new}}) \right)^2 \right] = \sigma^2 + E\left[ \left( \hat\mu(x_{\mathrm{new}}) - \mu(x_{\mathrm{new}}) \right)^2 \right]. $$
The first term $\sigma^2$ (the error variance) is the random variation of the observation; this is entirely independent of the local regression estimate. The second term,
$$ E\left[ \left( \hat\mu(x_{\mathrm{new}}) - \mu(x_{\mathrm{new}}) \right)^2 \right], $$
is the mean squared estimation error. This relation shows that, for squared error loss, minimizing prediction error and estimation error are equivalent problems.
In global bandwidth selection, these measures can be integrated over the $x$ space, or averaged over the actual $x_i$. Some standard techniques from model selection can be readily adapted to local regression:
  1. Cross Validation, which estimates the mean-squared prediction error.
  2. Mallows's $C_p$ and Akaike's Information Criterion, which estimate mean squared estimation error.
  3. Other methods which attempt to estimate the bias and variance components of the estimation error directly.
Any of these criteria can be minimized to produce an automatic bandwidth selector. Cleveland and Devlin prefer a graphical method to visually display the bias-variance trade-off and guide bandwidth choice.
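As a concrete illustration of the first approach, the sketch below selects a global bandwidth by leave-one-out cross-validation of a local linear fit; the candidate bandwidths and the simulated data are arbitrary choices.

```python
import numpy as np

def loocv_score(x, y, h, degree=1):
    """Leave-one-out cross-validation estimate of mean squared prediction
    error for a local polynomial fit with bandwidth h."""
    errors = []
    for i in range(len(x)):
        x_train = np.delete(x, i)
        y_train = np.delete(y, i)
        w = np.clip(1 - np.abs((x_train - x[i]) / h) ** 3, 0, None) ** 3
        if np.count_nonzero(w) <= degree:          # too few points in the window
            continue
        X = np.vander(x_train - x[i], N=degree + 1, increasing=True)
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y_train)
        errors.append((y[i] - beta[0]) ** 2)       # prediction error at the held-out point
    return np.mean(errors)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 120))
y = np.sin(x) + rng.normal(scale=0.3, size=120)
candidates = [0.3, 0.5, 1.0, 1.5, 2.5]
best_h = min(candidates, key=lambda h: loocv_score(x, y, h))
```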
One question not addressed above is: how should the bandwidth depend upon the fitting point $x$? Often a constant bandwidth is used, while LOWESS and LOESS prefer a nearest-neighbor bandwidth, meaning $h$ is smaller in regions with many data points. Formally, the smoothing parameter, $\alpha$, is the fraction of the total number $n$ of data points that are used in each local fit. The subset of data used in each weighted least squares fit thus comprises the $n\alpha$ points whose explanatory variables' values are closest to the point at which the response is being estimated.
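A nearest-neighbor bandwidth of this kind can be computed directly, as in the following sketch (the span value and the data are illustrative):

```python
import numpy as np

def nearest_neighbour_bandwidth(x0, x, span=0.3):
    """Bandwidth at the fitting point x0 chosen so that a fraction `span`
    of the data points fall inside the smoothing window."""
    k = max(int(np.ceil(span * len(x))), 1)        # number of points to include
    return np.sort(np.abs(x - x0))[k - 1]          # distance to the k-th nearest point

# h adapts to the data density: small where points are dense, large where sparse.
x = np.sort(np.random.default_rng(3).exponential(scale=2.0, size=200))
h_dense = nearest_neighbour_bandwidth(np.median(x), x)
h_sparse = nearest_neighbour_bandwidth(x.max(), x)
```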
More sophisticated methods attempt to choose the bandwidth adaptively; that is, choose a bandwidth at each fitting point by applying criteria such as cross-validation locally within the smoothing window. An early example of this is Jerome H. Friedman's "supersmoother", which uses cross-validation to choose among local linear fits at different bandwidths.