Gradient boosting


Gradient boosting is a machine learning technique based on boosting in a functional space, where the target is pseudo-residuals instead of residuals as in traditional boosting. It gives a prediction model in the form of an ensemble of weak prediction models, i.e., models that make very few assumptions about the data, which are typically simple decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. As with other boosting methods, a gradient-boosted trees model is built in stages, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.

History

The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
The latter two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms: that is, algorithms that optimize a cost function over function space by iteratively choosing a function that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

Informal introduction

Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner iteratively. It is easiest to explain in the least-squares regression setting, where the goal is to teach a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
  • $\hat{y}_i =$ the predicted value $F(x_i)$
  • $y_i =$ the observed value
  • $n =$ the size of the sample, i.e. the number of observations in $y$
If the algorithm has $M$ stages, at each stage $m$ ($1 \le m \le M$), suppose some imperfect model $F_m$ (for low $m$, this model may simply predict the mean of $y$). In order to improve $F_m$, our algorithm should add some new estimator, $h_m(x)$. Thus,
$$F_{m+1}(x_i) = F_m(x_i) + h_m(x_i) = y_i$$
or, equivalently,
$$h_m(x_i) = y_i - F_m(x_i).$$
Therefore, gradient boosting will fit $h_m$ to the residual $y_i - F_m(x_i)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $h_m(x_i)$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function with respect to $F(x_i)$:
$$L_{\rm MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - F(x_i)\right)^2,$$
$$-\frac{\partial L_{\rm MSE}}{\partial F(x_i)} = \frac{2}{n}\left(y_i - F(x_i)\right) = \frac{2}{n}\, h_m(x_i).$$
So, gradient boosting could be generalized to a gradient descent algorithm by plugging in a different loss and its gradient.
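To make the residual-fitting idea concrete, here is a minimal sketch of least-squares gradient boosting in Python, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the function names and parameter values are illustrative, not part of the original presentation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_least_squares(X, y, n_stages=100):
    """Build F_M by repeatedly fitting shallow trees to the residuals y - F_m(X)."""
    f0 = np.mean(y)              # F_0: the constant minimizing squared error
    residual = y - f0
    trees = []
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_depth=2)   # a weak learner
        tree.fit(X, residual)                       # h_m is fit to the residual
        residual -= tree.predict(X)                 # residual of improved model F_{m+1}
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X):
    # F_M(x) = F_0 + h_1(x) + ... + h_M(x)
    return f0 + sum(tree.predict(X) for tree in trees)
```

Each tree is fit to the current residuals, so adding its predictions moves the ensemble toward the targets; in the squared-error setting the residual and the negative gradient coincide up to a constant factor.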

Algorithm

Many supervised learning problems involve an output variable $y$ and a vector of input variables $x$, related to each other with some probabilistic distribution. The goal is to find some function $\hat{F}(x)$ that best approximates the output variable from the values of input variables. This is formalized by introducing some loss function $L(y, F(x))$ and minimizing it in expectation:
$$\hat{F} = \underset{F}{\arg\min}\; \mathbb{E}_{x,y}\left[L(y, F(x))\right].$$
The gradient boosting method assumes a real-valued $y$. It seeks an approximation $\hat{F}(x)$ in the form of a weighted sum of $M$ functions $h_m(x)$ from some class $\mathcal{H}$, called base learners:
$$\hat{F}(x) = \sum_{m=1}^{M} \gamma_m h_m(x) + \mathrm{const},$$
where $\gamma_m$ is the weight at stage $m$. We are usually given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ of known values of $x$ and corresponding values of $y$. In accordance with the empirical risk minimization principle, the method tries to find an approximation $\hat{F}(x)$ that minimizes the average value of the loss function on the training set, i.e., minimizes the empirical risk. It does so by starting with a model consisting of a constant function
$$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma),$$
and incrementally expands it in a greedy fashion:
$$F_m(x) = F_{m-1}(x) + \underset{h_m \in \mathcal{H}}{\arg\min} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + h_m(x_i)\right)$$
for $m \ge 1$, where $h_m \in \mathcal{H}$ is a base learner function.
Unfortunately, choosing the best function $h_m$ at each step for an arbitrary loss function $L$ is a computationally infeasible optimization problem in general. Therefore, we restrict our approach to a simplified version of the problem: apply a steepest descent step to this minimization, finding a local minimum of the loss function by iterating on $F_{m-1}(x)$. Since the local maximum-descent direction of the loss function is the negative gradient, we move a small amount $\gamma$ such that the linear approximation remains valid:
$$F_m(x) = F_{m-1}(x) - \gamma \sum_{i=1}^{n} \nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right),$$
where $\gamma > 0$. For small $\gamma$, this implies that $L\left(y_i, F_m(x_i)\right) \le L\left(y_i, F_{m-1}(x_i)\right)$.
To prove the following, consider the objective
$$O = \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + h_m(x_i)\right).$$
Doing a Taylor expansion around the fixed point $F_{m-1}(x_i)$ up to first order,
$$O \approx \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i)\right) + h_m(x_i)\, \nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right).$$
Now differentiating w.r.t. $h_m(x_i)$, only the derivative of the second term remains: $\nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right)$. This is the direction of steepest ascent, and hence we must move in the opposite direction in order to move in the direction of steepest descent.
Furthermore, we can optimize $\gamma$ by finding the value for which the loss function has a minimum:
$$\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right)\right).$$
If we considered the continuous case, i.e., where $\mathcal{H}$ is the set of arbitrary differentiable functions on $\mathbb{R}$, we would update the model in accordance with the following equations:
$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right),$$
where $\gamma_m$ is the step length, defined as
$$\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L\left(y_i, F_{m-1}(x_i)\right)\right).$$
In the discrete case however, i.e. when the set $\mathcal{H}$ is finite, we choose the candidate function $h$ closest to the gradient of $L$, for which the coefficient $\gamma$ may then be calculated with the aid of line search on the above equations. Note that this approach is a heuristic and therefore doesn't yield an exact solution to the given problem, but rather an approximation.
In pseudocode, the generic gradient boosting method is:
Input: training set $\{(x_i, y_i)\}_{i=1}^{n}$, a differentiable loss function $L(y, F(x))$, number of iterations $M$.
Algorithm:
  1. Initialize model with a constant value:
     $F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma).$
  2. For $m$ = 1 to $M$:
     1. Compute so-called pseudo-residuals:
        $r_{im} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \dots, n.$
     2. Fit a base learner $h_m(x)$ closed under scaling to pseudo-residuals, i.e. train it using the training set $\{(x_i, r_{im})\}_{i=1}^{n}$.
     3. Compute multiplier $\gamma_m$ by solving the following one-dimensional optimization problem:
        $\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right).$
     4. Update the model:
        $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).$
  3. Output $F_M(x)$.
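The pseudocode translates almost line for line into code. The following is a hedged sketch of the generic algorithm for any loss supplied as a value/gradient pair, using scikit-learn trees as base learners and SciPy's scalar minimizer for the line search; the squared-error loss and the helper names are illustrative assumptions, not part of the algorithm statement:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def loss(y, F):
    """Example loss: squared error L(y, F) = (y - F)^2 / 2."""
    return 0.5 * (y - F) ** 2

def loss_grad(y, F):
    """dL/dF for the squared-error loss above."""
    return F - y

def gradient_boost(X, y, M=100):
    # Step 1: initialize with the constant minimizing sum_i L(y_i, gamma);
    # for squared error this is the mean of y.
    f0 = np.mean(y)
    F = np.full(len(y), f0)
    learners, gammas = [], []
    for m in range(M):
        # Step 2.1: pseudo-residuals r_im = -dL/dF evaluated at F_{m-1}.
        r = -loss_grad(y, F)
        # Step 2.2: fit base learner h_m to {(x_i, r_im)}.
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)
        h_pred = h.predict(X)
        # Step 2.3: one-dimensional line search for the multiplier gamma_m.
        gamma = minimize_scalar(lambda g: loss(y, F + g * h_pred).sum()).x
        # Step 2.4: update the model.
        F = F + gamma * h_pred
        learners.append(h)
        gammas.append(gamma)
    return f0, learners, gammas

def predict(f0, learners, gammas, X):
    # F_M(x) = F_0 + sum_m gamma_m * h_m(x)
    return f0 + sum(g * h.predict(X) for g, h in zip(gammas, learners))
```

Swapping in a different differentiable loss only changes the two loss functions at the top; the boosting loop itself stays the same, which is the point of the generic formulation.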

Gradient tree boosting

Gradient boosting is typically used with decision trees of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the $m$-th step would fit a decision tree $h_m(x)$ to pseudo-residuals. Let $J_m$ be the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \dots, R_{J_m m}$ and predicts a constant value in each region. Using the indicator notation, the output of $h_m(x)$ for input $x$ can be written as the sum
$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x),$$
where $b_{jm}$ is the value predicted in the region $R_{jm}$.
Then the coefficients $b_{jm}$ are multiplied by some value $\gamma_m$, chosen using line search so as to minimize the loss function, and the model is updated as follows:
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right).$$
Friedman proposes to modify this algorithm so that it chooses a separate optimal value $\gamma_{jm}$ for each of the tree's regions, instead of a single $\gamma_m$ for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients $b_{jm}$ from the tree-fitting procedure can then be simply discarded and the model update rule becomes:
$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \underset{\gamma}{\arg\min} \sum_{x_i \in R_{jm}} L\left(y_i, F_{m-1}(x_i) + \gamma\right).$$
When the loss is mean squared error, the coefficients $\gamma_{jm}$ coincide with the coefficients $b_{jm}$ of the tree-fitting procedure.
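To illustrate the per-region refit, here is a sketch of a single TreeBoost step for the absolute-error loss, where the optimal $\gamma_{jm}$ in each leaf works out to the median of the current residuals falling in that leaf; this example and its helper name are an illustration, not Friedman's code (scikit-learn's tree.apply maps samples to leaf indices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def treeboost_step(X, y, F, max_leaf_nodes=8):
    """One TreeBoost update for the absolute-error loss L(y, F) = |y - F|."""
    r = np.sign(y - F)            # pseudo-residuals: -dL/dF for absolute error
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, r)
    leaf = tree.apply(X)          # region index j(x_i) of each training sample
    # A separate gamma_jm per region: for absolute error the minimizer is
    # the median of the raw residuals y_i - F_{m-1}(x_i) within the leaf.
    gamma = {j: np.median((y - F)[leaf == j]) for j in np.unique(leaf)}
    F_new = F + np.array([gamma[j] for j in leaf])
    return tree, gamma, F_new
```

At prediction time the same apply-then-lookup pair replaces the tree's own leaf values, which is why the fitted coefficients $b_{jm}$ can be discarded.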

Tree size

The number of terminal nodes in the trees, $J$, is a parameter which controls the maximum allowed level of interaction between variables in the model. With $J = 2$ (decision stumps), no interaction between variables is allowed. With $J = 3$ the model may include effects of the interaction between up to two variables, and so on. $J$ can be adjusted for a data set at hand.
Hastie et al. comment that typically $4 \le J \le 8$ works well for boosting and results are fairly insensitive to the choice of $J$ in this range, $J = 2$ is insufficient for many applications, and $J > 10$ is unlikely to be required.
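In a library such as scikit-learn, the tree size can be set directly; a hedged example, with illustrative parameter values:

```python
from sklearn.ensemble import GradientBoostingRegressor

# J (terminal nodes per tree) maps to max_leaf_nodes here; the value 6
# follows the 4 <= J <= 8 guideline quoted above.
model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300)
```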

Regularization

Fitting the training set too closely can lead to degradation of the model's generalization ability, that is, its performance on unseen examples. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.
One natural regularization parameter is the number of gradient boosting iterations M. Increasing M reduces the error on the training set, but increases the risk of overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set.
Another regularization parameter for tree boosting is tree depth. The higher this value the more likely the model will overfit the training data.

Shrinkage

An important part of gradient boosting is regularization by shrinkage, which uses a modified update rule:
$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \qquad 0 < \nu \le 1,$$
where the parameter $\nu$ is called the "learning rate".
Empirically, it has been found that using small learning rates (such as $\nu < 0.1$) yields dramatic improvements in models' generalization ability over gradient boosting without shrinking ($\nu = 1$). However, it comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
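The trade-off is visible in scikit-learn, where the shrinkage parameter is exposed as learning_rate; the two configurations below are illustrative assumptions, not recommended settings:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Shrinkage replaces F_m = F_{m-1} + gamma_m * h_m with
# F_m = F_{m-1} + nu * gamma_m * h_m, 0 < nu <= 1 (here: learning_rate).
# A smaller rate typically needs many more iterations (n_estimators)
# to reach the same training fit, trading compute for generalization.
slow_model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=2000)
fast_model = GradientBoostingRegressor(learning_rate=0.5, n_estimators=100)
```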

Stochastic gradient boosting

Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bootstrap aggregation method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction $f$ of the size of the training set. When $f = 1$, the algorithm is deterministic and identical to the one described above. Smaller values of $f$ introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Friedman obtained that $0.5 \le f \le 0.8$ leads to good results for small and moderate sized training sets. Therefore, $f$ is typically set to 0.5, meaning that one half of the training set is used to build each base learner.
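Subsampling is a small change to the boosting loop: each base learner is fit on a random fraction of the rows, while the model update still applies everywhere. A minimal sketch in the least-squares setting, with illustrative names and defaults:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_boost(X, y, M=100, f=0.5, seed=0):
    """Least-squares boosting where each tree sees only a fraction f of the data."""
    rng = np.random.default_rng(seed)
    f0 = np.mean(y)
    F = np.full(len(y), f0)
    trees = []
    n_sub = int(f * len(y))
    for _ in range(M):
        # Draw the subsample without replacement, as Friedman proposed.
        idx = rng.choice(len(y), size=n_sub, replace=False)
        residual = y - F
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], residual[idx])
        # The complement of idx is the "out-of-bag" set usable for the
        # improvement estimate discussed below.
        F = F + tree.predict(X)   # the update still applies to all points
        trees.append(tree)
    return f0, trees
```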
Also, like in bagging, subsampling allows one to define an out-of-bag error of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.