Generalized additive model
In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
GAMs were originally developed by Trevor Hastie and Robert Tibshirani to blend properties of generalized linear models with additive models. They can be interpreted as the discriminative generalization of the naive Bayes generative model.
The model relates a univariate response variable, $Y$, to some predictor variables, $x_i$. An exponential family distribution is specified for $Y$ along with a link function $g$ relating the expected value of $Y$ to the predictor variables via a structure such as

$g(\operatorname{E}(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m).$
The functions $f_i$ may be functions with a specified parametric form or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for $f_1(x_1)$, and then use a factor model for $f_2(x_2)$. This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.
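As a concrete illustration of this structure, the following Python sketch simulates data from a GAM of this form; the particular smooth shapes and the Poisson/log-link choice are purely illustrative, not part of the model definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two predictor variables
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)

# Arbitrary smooth functions f1 and f2 (illustrative choices only)
f1 = np.sin(2 * np.pi * x1)
f2 = 4 * (x2 - 0.5) ** 2

# Additive predictor on the link scale: g(E[Y]) = beta0 + f1(x1) + f2(x2)
beta0 = 0.5
eta = beta0 + f1 + f2

# Poisson response with log link, so E[Y] = exp(eta)
y = rng.poisson(np.exp(eta))
```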
Theoretical background
It had been known since the 1950s that any multivariate continuous function could be represented as sums and compositions of univariate functions,

$f(\vec{x}) = \sum_{q} \Phi_q\!\left(\sum_{p} \phi_{q,p}(x_p)\right).$

Unfortunately, though the Kolmogorov–Arnold representation theorem asserts the existence of a function of this form, it gives no mechanism whereby one could be constructed. Certain constructive proofs exist, but they tend to require highly complicated functions, and thus are not suitable for modeling approaches. Therefore, the generalized additive model drops the outer sum, and demands instead that the function belong to a simpler class,

$f(\vec{x}) = \Phi\!\left(\sum_{p} \phi_p(x_p)\right),$

where $\Phi$ is a smooth monotonic function. Writing $g$ for the inverse of $\Phi$, this is traditionally written as

$g(f(\vec{x})) = \sum_i f_i(x_i).$

When this function is approximating the expectation of some observed quantity, it could be written as

$g(\operatorname{E}(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m),$

which is the standard formulation of a generalized additive model. It was then shown that the backfitting algorithm will always converge for these functions.
Generality
The GAM model class is quite broad, given that smooth function is a rather broad category. For example, a covariate $x_j$ may be multivariate and the corresponding $f_j$ a smooth function of several variables, or $f_j$ might be the function mapping the level of a factor to the value of a random effect. Another example is a varying coefficient term such as $z_j f_j(x_j)$, where $z_j$ and $x_j$ are both covariates. Or if $x_j(t)$ is itself an observation of a function, we might include a term such as $\int f_j(t)\,x_j(t)\,dt$. $f_j$ could also be a simple parametric function as might be used in any generalized linear model. The model class has been generalized in several directions, notably beyond exponential family response distributions, beyond modelling of only the mean and beyond univariate data.
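For instance, when the model matrix is built, a varying coefficient term $z_j f_j(x_j)$ simply multiplies each basis function of $x_j$ by $z_j$. A rough numpy illustration follows; the polynomial basis here is only a stand-in for the spline bases used in practice.

```python
import numpy as np

def poly_basis(x, degree=3):
    """Toy polynomial basis for a smooth f(x); real software would use splines."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)   # covariate entering the smooth
z = rng.normal(size=200)     # covariate multiplying the smooth

B = poly_basis(x)            # columns for an ordinary smooth term f(x)
B_vc = B * z[:, None]        # columns for the varying coefficient term z * f(x)
```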
GAM fitting methods

The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers via the backfitting algorithm. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the $f_j(x_j)$ terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels.

If the $f_j(x_j)$ are represented using smoothing splines then the degree of smoothness can be estimated as part of model fitting using generalized cross validation (GCV), or by restricted maximum likelihood (REML) which exploits the duality between spline smoothers and Gaussian random effects. This full spline approach carries an $O(n^3)$ computational cost, where $n$ is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up-front reduction of the size of the basis used for smoothing (rank reduction) or by finding sparse representations of the smooths using Markov random fields, which are amenable to the use of sparse matrix methods for computation. These more computationally efficient methods use GCV or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an empirical Bayes method.
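A minimal numpy sketch of the backfitting idea for a Gaussian additive model is given below. The crude running-mean smoother, the span, and the convergence tolerance are illustrative choices, not those of any particular implementation.

```python
import numpy as np

def moving_average_smoother(x, r, span=0.3):
    """Crude scatterplot smoother: local mean of the partial residuals r over x."""
    n = len(x)
    k = max(int(span * n), 2)
    order = np.argsort(x)
    fitted = np.empty(n)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k // 2), min(n, rank + k // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted

def backfit(X, y, n_iter=50, tol=1e-6):
    """Backfitting for an additive Gaussian model y = alpha + sum_j f_j(x_j) + error."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # Partial residuals: remove the intercept and all the other smooths
            partial = y - alpha - f[:, np.arange(p) != j].sum(axis=1)
            f[:, j] = moving_average_smoother(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # centre each smooth for identifiability
        if np.abs(f - f_old).max() < tol:  # stop once the smooths stabilize
            break
    return alpha, f
```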
An alternative approach with particular advantages in high dimensional settings is to use boosting, although this typically requires bootstrapping for uncertainty quantification. GAMs fit using bagging and boosting have been found to generally outperform GAMs fit using spline methods.
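A toy illustration of the boosting idea, in the spirit of component-wise L2 boosting, is sketched below (reusing the moving_average_smoother from the backfitting sketch above; the step length and number of steps are arbitrary choices, and real boosting implementations differ in many details).

```python
import numpy as np

def boost_additive(X, y, smoother, n_steps=200, nu=0.1):
    """Component-wise L2 boosting: at each step fit every candidate smooth to the
    current residuals, keep only the best-fitting one, and add a shrunken copy."""
    n, p = X.shape
    fit = np.full(n, y.mean())
    f = np.zeros((n, p))
    for _ in range(n_steps):
        r = y - fit                                   # current residuals
        trials = [smoother(X[:, j], r) for j in range(p)]
        sse = [np.sum((r - t) ** 2) for t in trials]
        j = int(np.argmin(sse))                       # best component this step
        f[:, j] += nu * trials[j]
        fit += nu * trials[j]
    return fit, f

# Example call (moving_average_smoother defined in the backfitting sketch):
# fit, f = boost_additive(X, y, moving_average_smoother)
```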
The rank reduced framework
Many modern implementations of GAMs and their extensions are built around the reduced rank smoothing approach, because it allows well-founded estimation of the smoothness of the component smooths at comparatively modest computational cost, and also facilitates implementation of a number of model extensions in a way that is more difficult with other methods. At its simplest the idea is to replace the unknown smooth functions in the model with basis expansions

$f_j(x_j) = \sum_{k=1}^{K_j} \beta_{jk} b_{jk}(x_j),$

where the $b_{jk}(x_j)$ are known basis functions, usually chosen for good approximation theoretic properties, and the $\beta_{jk}$ are coefficients to be estimated as part of model fitting. The basis dimension $K_j$ is chosen to be sufficiently large that we expect it to overfit the data to hand, but small enough to retain computational efficiency. If $K = \sum_j K_j$ then the computational cost of model estimation this way will be $O(nK^2)$.
Notice that the $f_j$ are only identifiable to within an intercept term (we could add any constant to one smooth while subtracting it from another without changing the model), so identifiability constraints have to be imposed on the smooth terms to remove this ambiguity. Sharpest inference about the $f_j$ is generally obtained by using the sum-to-zero constraints

$\sum_i f_j(x_{ji}) = 0 \quad \forall\, j,$

i.e. by insisting that the sum of each of the $f_j$, evaluated at its observed covariate values, should be zero. Such linear constraints can most easily be imposed by reparametrization at the basis setup stage, so below it is assumed that this has been done.
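A minimal numpy sketch of the basis setup for one smooth term is given below, using a truncated power cubic spline basis and imposing the sum-to-zero constraint by centring the basis columns. This is one simple reparametrization chosen for illustration; production software typically uses a more careful one.

```python
import numpy as np

def spline_basis(x, n_knots=8):
    """Truncated-power cubic spline basis for one smooth term (illustrative choice)."""
    knots = np.quantile(x, np.linspace(0.05, 0.95, n_knots))
    cols = [x, x ** 2, x ** 3] + [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def centred_basis(x):
    """Impose the sum-to-zero identifiability constraint by column centring:
    with every column summing to zero, any fitted smooth sums to zero over the data."""
    B = spline_basis(x)
    return B - B.mean(axis=0)
```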
Having replaced all the $f_j$ in the model with such basis expansions we have turned the GAM into a generalized linear model, with a model matrix that simply contains the basis functions evaluated at the observed $x_j$ values. However, because the basis dimensions, $K_j$, have been chosen to be somewhat larger than is believed to be necessary for the data, the model is over-parameterized and will overfit the data if estimated as a regular GLM. The solution to this problem is to penalize departure from smoothness in the model fitting process, controlling the weight given to the smoothing penalties using smoothing parameters. For example, consider the situation in which all the smooths are univariate functions. Writing all the parameters in one vector, $\beta$, suppose that $D(\beta)$ is the deviance for the model. Minimizing the deviance by the usual iteratively re-weighted least squares would result in overfit, so we seek to minimize

$D(\beta) + \sum_j \lambda_j \int f_j''(x)^2 \, dx,$
where the integrated square second derivative penalties serve to penalize wiggliness of the $f_j$ during fitting, and the smoothing parameters $\lambda_j$ control the tradeoff between model goodness of fit and model smoothness. In the example, $\lambda_j \to \infty$ would ensure that the estimate of $f_j(x_j)$ would be a straight line in $x_j$.
Given the basis expansion for each $f_j$, the wiggliness penalties can be expressed as quadratic forms in the model coefficients. That is, we can write

$\int f_j''(x)^2 \, dx = \beta_j^{T} \bar{S}_j \beta_j = \beta^{T} S_j \beta,$

where $\bar{S}_j$ is a matrix of known coefficients computable from the penalty and basis, $\beta_j$ is the vector of coefficients for $f_j$, and $S_j$ is just $\bar{S}_j$ padded with zeros so that the second equality holds and we can write the penalty in terms of the full coefficient vector $\beta$. Many other smoothing penalties can be written in the same way, and given the smoothing parameters the model fitting problem now becomes

$\hat{\beta} = \operatorname{argmin}_{\beta}\; D(\beta) + \sum_j \lambda_j \beta^{T} S_j \beta,$
which can be found using a penalized version of the usual iteratively re-weighted least squares (IRLS) algorithm for GLMs: the algorithm is unchanged except that the sum of quadratic penalties is added to the working least squares objective at each iteration of the algorithm.
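The following numpy sketch illustrates this penalized IRLS step for a binomial (logit link) model. Here X is assumed to be the model matrix of evaluated basis functions and S_total the combined penalty $\sum_j \lambda_j S_j$, both built beforehand; the names, starting values and tolerance are illustrative.

```python
import numpy as np

def penalized_irls(X, y, S_total, n_iter=100, tol=1e-8):
    """Penalized IRLS for a binomial GAM with logit link: minimizes the deviance
    plus the quadratic penalty beta' S_total beta."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))        # inverse logit
        w = mu * (1.0 - mu)                    # IRLS weights
        z = eta + (y - mu) / w                 # working response
        # Penalized weighted least squares step: (X'WX + S_total) beta = X'Wz
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X + S_total, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```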
Penalization has several effects on inference, relative to a regular GLM. For one thing the estimates are subject to some smoothing bias, which is the price that must be paid for limiting estimator variance by penalization. However, if smoothing parameters are selected appropriately the smoothing bias introduced by penalization should be less than the reduction in variance that it produces, so that the net effect is a reduction in mean square estimation error, relative to not penalizing. A related effect of penalization is that the notion of degrees of freedom of a model has to be modified to account for the penalties' action in reducing the coefficients' freedom to vary. For example, if $W$ is the diagonal matrix of IRLS weights at convergence, and $X$ is the GAM model matrix, then the model effective degrees of freedom is given by $\operatorname{tr}(F)$, where

$F = \left(X^{T} W X + \sum_j \lambda_j S_j\right)^{-1} X^{T} W X$
is the effective degrees of freedom matrix. In fact, summing just the diagonal elements of $F$ corresponding to the coefficients of $f_j$ gives the effective degrees of freedom for the estimate of $f_j$.
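Continuing with the hypothetical names used in the sketch above, the effective degrees of freedom can be read off the converged IRLS weights as the trace of F.

```python
import numpy as np

def effective_df(X, w, S_total):
    """tr(F), where F = (X'WX + S_total)^{-1} X'WX is the EDF matrix;
    w holds the diagonal IRLS weights at convergence."""
    XtWX = (X.T * w) @ X
    F = np.linalg.solve(XtWX + S_total, XtWX)
    # Total EDF plus the full matrix, whose relevant diagonal blocks give per-term EDF
    return np.trace(F), F
```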
Bayesian smoothing priors
Smoothing bias complicates interval estimation for these models, and the simplest approach turns out to involve a Bayesian approach. Understanding this Bayesian view of smoothing also helps to understand the REML and full Bayes approaches to smoothing parameter estimation. At some level smoothing penalties are imposed because we believe smooth functions to be more probable than wiggly ones, and if that is true then we might as well formalize this notion by placing a prior on model wiggliness. A very simple prior might be

$\pi(\beta) \propto \exp\!\left(-\frac{\beta^{T} \sum_j \lambda_j S_j \beta}{2\phi}\right),$

but we can immediately recognize this as a multivariate normal prior with mean $0$ and precision matrix $S_\lambda = \sum_j \lambda_j S_j / \phi$, where $\phi$ is the scale parameter. Since the penalty allows some functions through unpenalized (e.g. straight lines, given the example penalties above), $S_\lambda$ is rank deficient, and the prior is actually improper, with a covariance matrix given by the Moore–Penrose pseudoinverse of $S_\lambda$.
Now if this prior is combined with the GLM likelihood, we find that the posterior mode for $\beta$ is exactly the $\hat{\beta}$ found above by penalized IRLS. Furthermore, we have the large sample result that

$\beta \mid y \sim N\!\left(\hat{\beta},\, (X^{T} W X + S_\lambda)^{-1} \phi\right),$
which can be used to produce confidence/credible intervals for the smooth components, $f_j$.
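As a rough sketch, again with the hypothetical names used above and taking the scale parameter as 1 (as for a binomial model), pointwise credible bands for a smooth can be computed from the diagonal of the posterior covariance.

```python
import numpy as np

def credible_band(X_pred, X, w, S_total, beta, z=1.96):
    """Approximate 95% credible band for f evaluated at prediction points;
    X_pred holds the basis functions evaluated at those points."""
    V_beta = np.linalg.inv((X.T * w) @ X + S_total)   # posterior covariance (scale = 1)
    f_hat = X_pred @ beta
    # Pointwise standard errors: diag(X_pred V_beta X_pred')
    se = np.sqrt(np.einsum('ij,jk,ik->i', X_pred, V_beta, X_pred))
    return f_hat - z * se, f_hat + z * se
```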
The Gaussian smoothness priors are also the basis for fully Bayesian inference with GAMs, as well as methods estimating GAMs as mixed models that are essentially empirical Bayes methods.