Regularization (mathematics)


In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer of a problem to a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.
Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
  • Explicit regularization is regularization in which a term is explicitly added to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
  • Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods.
In explicit regularization, independent of the problem or model, there is always a data term that corresponds to a likelihood of the measurement, and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more aligned to the data or to enforce regularization. There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. The choice can also be physically motivated by common sense or intuition.
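For example, for parameters $w$ and data $D$, with likelihood $p(D \mid w)$ and prior $p(w)$, maximizing the posterior $p(w \mid D) \propto p(D \mid w)\, p(w)$ is the same as minimizing the negative log-posterior

  $-\log p(w \mid D) = -\log p(D \mid w) - \log p(w) + \text{const}$

so the data term is the negative log-likelihood and the regularization term is the negative log-prior; a Gaussian prior $p(w) \propto \exp(-\lambda \|w\|_2^2)$, for instance, corresponds to an $L_2$ (Tikhonov) penalty $\lambda \|w\|_2^2$.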
In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score with the trained model on the evaluation set and not the training data.
One of the earliest uses of regularization is Tikhonov regularization, related to the method of least squares.

Regularization in machine learning

In machine learning, a key challenge is enabling models to accurately predict outcomes on unseen data, not just on familiar training data. Regularization is crucial for addressing overfitting, where a model memorizes training data details but cannot generalize to new data. The goal of regularization is to encourage models to learn the broader patterns within the data rather than memorizing it. Techniques like early stopping, L1 and L2 regularization, and dropout are designed to prevent overfitting and underfitting, thereby improving the model's ability to generalize to new data.

Early stopping

Stops training when validation performance deteriorates, preventing overfitting by halting before the model memorizes training data.

L1 and L2 regularization

Adds penalty terms to the cost function to discourage complex models:
  • L1 regularization leads to sparse models by adding a penalty based on the absolute value of coefficients.
  • L2 regularization encourages smaller, more evenly distributed weights by adding a penalty based on the square of the coefficients.
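As a minimal, library-agnostic sketch (assuming only NumPy; penalized_loss and lam are illustrative names, not a standard API), the two penalties can be added to a squared-error data term as follows:

```python
import numpy as np

def penalized_loss(w, X, y, lam, kind="l2"):
    """Squared-error data term plus an L1 or L2 penalty on the weights w.

    lam is the regularization strength; larger values shrink w more.
    """
    residual = X @ w - y
    data_term = np.mean(residual ** 2)          # fit to the training data
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))       # L1: encourages sparse w
    else:
        penalty = lam * np.sum(w ** 2)          # L2: encourages small, spread-out weights
    return data_term + penalty
```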

Dropout

In the context of neural networks, the Dropout technique repeatedly ignores random subsets of neurons during training, which simulates the training of multiple neural network architectures at once to improve generalization.
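A minimal sketch of the idea, assuming NumPy activations and the common "inverted dropout" convention of rescaling the surviving units by $1/(1-p)$:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Randomly zero each activation with probability p during training.

    Inverted dropout: surviving activations are scaled by 1/(1-p) so the
    expected value of the output matches the no-dropout case at test time.
    """
    if not training or p == 0.0:
        return activations
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1-p
    return activations * mask / (1.0 - p)
```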

Classification

Empirical learning of classifiers is always an underdetermined problem, because it attempts to infer a function of any $x$ given only examples $x_1, x_2, \ldots, x_n$.
A regularization term (or regularizer) $R(f)$ is added to a loss function:

  $\min_f \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f)$

where $V$ is an underlying loss function that describes the cost of predicting $f(x)$ when the label is $y$, such as the square loss or hinge loss; and $\lambda$ is a parameter which controls the importance of the regularization term. $R(f)$ is typically chosen to impose a penalty on the complexity of $f$. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure into the learning problem.
The same idea arose in many fields of science. A simple form of regularization applied to integral equations is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.

Generalization

Regularization can be motivated as a technique to improve the generalizability of a learned model.
The goal of this learning problem is to find a function $f_n$ that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels. The expected error of a function $f_n$ is:

  $I[f_n] = \int_{X \times Y} V(f_n(x), y) \, \rho(x, y) \, dx \, dy$

where $X$ and $Y$ are the domains of input data $x$ and their labels $y$ respectively, and $\rho(x, y)$ is their joint distribution.
Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the $n$ available samples $(\hat{x}_i, \hat{y}_i)$:

  $I_S[f_n] = \frac{1}{n} \sum_{i=1}^{n} V(f_n(\hat{x}_i), \hat{y}_i)$
Without bounds on the complexity of the function space available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
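As a small sketch, assuming NumPy and the square loss, the empirical error is just the average loss over the available noisy samples:

```python
import numpy as np

def empirical_error(f, x_hat, y_hat):
    """Empirical (training-set) error of a predictor f under the square loss.

    This is the measurable surrogate for the expected error, which would
    require integrating over the unknown joint distribution of (x, y).
    """
    predictions = np.array([f(x) for x in x_hat])
    return np.mean((predictions - np.asarray(y_hat)) ** 2)
```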

Tikhonov regularization (ridge regression)

These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.
When learning a linear function $f$, characterized by an unknown vector $w$ such that $f(x) = w \cdot x$, one can add the $L_2$-norm of the vector $w$ to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:

  $\min_{w} \sum_{i=1}^{n} V(\hat{x}_i \cdot w, \hat{y}_i) + \lambda \|w\|_2^2$

where $(\hat{x}_i, \hat{y}_i)$, $1 \le i \le n$, would represent samples used for training.
In the case of a general function, with the norm of the function taken in its reproducing kernel Hilbert space, the problem becomes:

  $\min_{f} \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda \|f\|_{\mathcal{H}}^2$

As the $L_2$ norm is differentiable, learning can be advanced by gradient descent.
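As an illustrative sketch of the linear (ridge) case, assuming NumPy and a fixed, untuned step size:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, lr=0.01, n_iters=1000):
    """Minimize (1/n)||Xw - y||^2 + lam * ||w||^2 by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w   # data gradient + penalty gradient
        w -= lr * grad
    return w
```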

Tikhonov-regularized least squares

The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, with data matrix $\hat{X}$ and label vector $\hat{Y}$, the optimal $w$ is the one for which the gradient of the loss function with respect to $w$ is 0:

  $\min_{w} \frac{1}{n} (\hat{X} w - \hat{Y})^{\mathsf{T}} (\hat{X} w - \hat{Y}) + \lambda \|w\|_2^2$

  $\nabla_w = \frac{2}{n} \hat{X}^{\mathsf{T}} (\hat{X} w - \hat{Y}) + 2 \lambda w$

  $0 = \hat{X}^{\mathsf{T}} (\hat{X} w - \hat{Y}) + n \lambda w$

  $w = (\hat{X}^{\mathsf{T}} \hat{X} + \lambda n I)^{-1} \hat{X}^{\mathsf{T}} \hat{Y}$

where the third statement is a first-order condition.
By construction of the optimization problem, other values of $w$ give larger values for the loss function. This can be verified by examining the second derivative $\nabla_{ww}$, which is positive definite.
During training, this algorithm takes $O(d^3 + n d^2)$ time. The terms correspond to the matrix inversion and calculating $\hat{X}^{\mathsf{T}} \hat{X}$, respectively. Testing takes $O(nd)$ time.
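A direct sketch of this closed-form solution, assuming NumPy, an $n \times d$ data matrix X and label vector y (solving the linear system rather than forming the inverse explicitly):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Tikhonov-regularized least squares: w = (X^T X + lam * n * I)^(-1) X^T y."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # O(n d^2) to form X^T X
    return np.linalg.solve(A, X.T @ y)  # O(d^3) to solve the d x d system
```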

Early stopping

Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization.
Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.
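A schematic of the procedure (the callables train_one_epoch, validation_loss, snapshot and restore are placeholders for whatever training framework is in use, not a specific library's API):

```python
def train_with_early_stopping(train_one_epoch, validation_loss, snapshot, restore,
                              patience=5, max_epochs=200):
    """Run training epochs until the validation loss stops improving.

    The four callables stand in for one epoch of optimization, a validation
    metric, and saving/restoring the best parameters seen so far.
    """
    best_loss = float("inf")
    stale_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss = loss
            snapshot()          # remember the best parameters so far
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break           # validation performance stopped improving
    restore()                   # roll back to the best parameters
    return best_loss
```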

Theoretical motivation in least squares

Consider the finite approximation of the Neumann series for an invertible matrix $A$ where $\|I - A\| < 1$:

  $\sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}$

This can be used to approximate the analytical solution of unregularized least squares, if $\gamma$ is introduced to ensure the norm is less than one:

  $w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}$

The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting $T$, the only free parameter in the algorithm above, the problem is regularized in time, which may improve its generalization.
The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk

  $I_s[w] = \frac{1}{2n} \|\hat{X} w - \hat{Y}\|^2_{\mathbb{R}^n}$

with the gradient descent update:

  $w_0 = 0, \qquad w_{t+1} = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) w_t + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y}$

The base case is trivial. The inductive case is proved as follows:

  $w_T = \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y} + \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{Y} = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\right)^i \hat{X}^{\mathsf{T}} \hat{Y}$
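The equivalence can be checked numerically; a small sketch, assuming NumPy and a step size $\gamma$ chosen so that $\|I - \frac{\gamma}{n} \hat{X}^{\mathsf{T}} \hat{X}\| < 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

gamma = 0.5 / np.linalg.norm(X.T @ X / n, 2)   # keeps ||I - (gamma/n) X^T X|| < 1
T = 25

# T gradient-descent steps on the empirical risk (1/2n)||Xw - y||^2, starting from w_0 = 0.
w = np.zeros(d)
for _ in range(T):
    w = w - (gamma / n) * X.T @ (X @ w - y)

# Truncated Neumann series: (gamma/n) * sum_{i<T} (I - (gamma/n) X^T X)^i X^T y.
M = np.eye(d) - (gamma / n) * X.T @ X
w_neumann = sum(np.linalg.matrix_power(M, i) for i in range(T)) @ ((gamma / n) * X.T @ y)

print(np.allclose(w, w_neumann))   # True: both give the same iterate w_T
```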

Regularizers for sparsity

Assume that a dictionary $\phi_j$ with dimension $p$ is given such that a function in the function space can be expressed as:

  $f(x) = \sum_{j=1}^{p} \phi_j(x) w_j$
Enforcing a sparsity constraint on can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.
A sensible sparsity constraint is the $L_0$ norm $\|w\|_0$, defined as the number of non-zero elements in $w$. Solving an $L_0$-regularized learning problem, however, has been demonstrated to be NP-hard.
The $L_1$ norm can be used to approximate the optimal $L_0$ norm via convex relaxation. It can be shown that the $L_1$ norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing.
$L_1$ regularization can occasionally produce non-unique solutions; a simple example occurs when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining $L_1$ with $L_2$ regularization in elastic net regularization, which takes the following form:

  $\min_{w \in \mathbb{R}^p} \frac{1}{n} \|\hat{X} w - \hat{Y}\|^2 + \lambda \left(\alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2\right), \quad \alpha \in [0, 1]$
Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.
Elastic net regularization is commonly used in practice and is implemented in many machine learning libraries.
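In practice one would usually rely on a library implementation (for example, scikit-learn's ElasticNet); purely as an illustrative sketch of the objective above, assuming NumPy and an untuned step size, a proximal-gradient (ISTA-style) update could look like:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net(X, y, lam, alpha=0.5, lr=0.01, n_iters=2000):
    """Proximal-gradient sketch for
    (1/n)||Xw - y||^2 + lam * (alpha * ||w||_1 + (1 - alpha) * ||w||^2).

    alpha = 1 recovers the LASSO penalty, alpha = 0 recovers ridge regression.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of the smooth part: data term plus the L2 portion of the penalty.
        grad_smooth = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * (1 - alpha) * w
        # Proximal step handles the non-smooth L1 portion.
        w = soft_threshold(w - lr * grad_smooth, lr * lam * alpha)
    return w
```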