Normalization (machine learning)
In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range. This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.
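As an illustration, min-max normalization can be sketched in a few lines of NumPy (the function name, the target range [0, 1], and the guard against constant features are choices made here, not a standard API):

import numpy as np

def min_max_normalize(X):
    # Rescale each feature (column) of X to the range [0, 1].
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    # Avoid division by zero for constant features.
    return (X - X_min) / np.maximum(X_max - X_min, 1e-12)

After this transform, a feature measured in kilometers and one measured in nanometers both lie in the same range.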
Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks.
Normalization is often used to:
- increase the speed of training convergence,
- reduce sensitivity to variations and feature scales in input data,
- reduce overfitting,
- and produce better model generalization to unseen data.
Batch normalization
Batch normalization (BatchNorm) operates on the activations of a layer for each mini-batch.

Consider a simple feedforward network, defined by chaining together modules:

$x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots$

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. Here $x^{(0)}$ is the input vector, $x^{(1)}$ is the output vector from the first module, etc.
BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after $x^{(l)}$; then the network would operate accordingly:

$\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots$
The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.
Concretely, suppose we have a batch of $B$ inputs $x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}$, fed all at once into the network. We would obtain in the middle of the network some vectors:

$x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}$

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

$\mu_i^{(l)} = \frac{1}{B} \sum_{b=1}^B x^{(l)}_{(b),i}, \qquad (\sigma_i^{(l)})^2 = \frac{1}{B} \sum_{b=1}^B \left( x^{(l)}_{(b),i} - \mu_i^{(l)} \right)^2$

where $i$ indexes the coordinates of the vectors, and $b$ indexes the elements of the batch. In other words, we are considering the $i$-th coordinate of each vector in the batch, and computing the mean and variance of these $B$ numbers.

It then normalizes each coordinate to have zero mean and unit variance:

$\hat{x}^{(l)}_{(b),i} = \frac{x^{(l)}_{(b),i} - \mu_i^{(l)}}{\sqrt{(\sigma_i^{(l)})^2 + \epsilon}}$

Here $\epsilon$ is a small positive constant such as $10^{-9}$, added to the variance for numerical stability, to avoid division by zero.

Finally, it applies a linear transformation:

$y^{(l)}_{(b),i} = \gamma_i \hat{x}^{(l)}_{(b),i} + \beta_i$

Here, $\gamma$ and $\beta$ are parameters inside the BatchNorm module. They are learnable parameters, typically trained by gradient descent.
The following is a Python implementation of BatchNorm:
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)    # shape (N,)
    var = np.var(x, axis=0)    # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(var + epsilon)    # shape (B, N)

    # Apply the linear transform
    y = gamma * x_hat + beta    # shape (B, N)

    return y
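For example, applied to a batch of 4 vectors with 3 features (shapes chosen here purely for illustration), each output column has approximately zero mean and unit variance before the learned scale and shift:

x = np.random.randn(4, 3)        # a batch of B = 4 vectors with N = 3 features
gamma = np.ones(3)               # scale, initialized to 1
beta = np.zeros(3)               # shift, initialized to 0
y = batchnorm(x, gamma, beta)    # shape (4, 3)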
Interpretation
$\gamma$ and $\beta$ allow the network to learn to undo the normalization, if this is beneficial. BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.

It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters and detractors.
Special cases
The original paper recommended only using BatchNorm after a linear transform, not after a nonlinear activation. That is, $\phi(\mathrm{BN}(Wx + b))$, not $\mathrm{BN}(\phi(Wx + b))$. Also, the bias $b$ does not matter, since it would be canceled by the subsequent mean subtraction, so the form $\mathrm{BN}(Wx)$ is used. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.
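As a sketch of this recommended ordering, reusing the batchnorm function above (the weight shape and the ReLU nonlinearity are illustrative choices):

import numpy as np

x = np.random.randn(4, 3)                        # illustrative batch
W = np.random.randn(3, 5)                        # linear transform, bias omitted
h = batchnorm(x @ W, np.ones(5), np.zeros(5))    # BatchNorm right after the linear map
a = np.maximum(h, 0.0)                           # nonlinearity (ReLU) applied last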
For convolutional neural networks (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch. This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.

Concretely, suppose we have a 2-dimensional convolutional layer defined by:

$x^{(l)}_{h,w,c} = \sum_{h',w',c'} K^{(l)}_{h'-h,\, w'-w,\, c,\, c'} \, x^{(l-1)}_{h',w',c'} + b^{(l)}_c$
where:
- $x^{(l)}_{h,w,c}$ is the activation of the neuron at position $(h, w)$ in the $c$-th channel of the $l$-th layer.
- $K^{(l)}$ is a kernel tensor. Each channel $c$ corresponds to a kernel $K^{(l)}_{\cdot,\cdot,c,\cdot}$, with indices $\Delta h, \Delta w, c, c'$.
- $b^{(l)}_c$ is the bias term for the $c$-th channel of the $l$-th layer.
BatchNorm is then applied once per kernel $c$ (equivalently, once per channel $c$), not per activation, with statistics pooled over the batch and both spatial dimensions:

$\mu_c^{(l)} = \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b),h,w,c}$

$(\sigma_c^{(l)})^2 = \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W \left( x^{(l)}_{(b),h,w,c} - \mu_c^{(l)} \right)^2$

where $B$ is the batch size, $H$ is the height of the feature map, and $W$ is the width of the feature map.
That is, even though there are only $B$ data points in a batch, all $BHW$ outputs from the kernel in this batch are treated equally.
Subsequently, normalization and the linear transform are also done per kernel:

$\hat{x}^{(l)}_{(b),h,w,c} = \frac{x^{(l)}_{(b),h,w,c} - \mu_c^{(l)}}{\sqrt{(\sigma_c^{(l)})^2 + \epsilon}}, \qquad y^{(l)}_{(b),h,w,c} = \gamma_c \hat{x}^{(l)}_{(b),h,w,c} + \beta_c$
Similar considerations apply to BatchNorm for n-dimensional convolutions.
The following is a Python implementation of BatchNorm for 2D convolutions:
import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)

    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized tensor.
    y = gamma * x_hat + beta

    return y
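For example, with activations laid out as (B, H, W, C) (the axis order is an assumption of this sketch):

x = np.random.randn(8, 32, 32, 16)    # batch of 8 feature maps with 16 channels
gamma = np.ones(16)                   # one scale per channel
beta = np.zeros(16)                   # one shift per channel
y = batchnorm_cnn(x, gamma, beta)     # statistics pooled over batch, height, width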
It is also possible to apply BatchNorm to LSTMs.
Improvements
BatchNorm has been very popular, and there were many attempted improvements. Some examples include:
- ghost batching: randomly partition a batch into sub-batches and perform BatchNorm separately on each (a minimal sketch follows this list);
- weight decay on $\gamma$ and $\beta$;
- and combining BatchNorm with GroupNorm, where the relative weighting of the two is a hyperparameter to be optimized on a validation set.
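A minimal sketch of ghost batching, built on the batchnorm function above (for simplicity the batch is split in order rather than randomly, and the function name and number of sub-batches are illustrative choices):

import numpy as np

def ghost_batchnorm(x, gamma, beta, num_sub_batches=4, epsilon=1e-9):
    # Normalize each sub-batch separately, using its own statistics.
    sub_batches = np.array_split(x, num_sub_batches, axis=0)
    normalized = [batchnorm(xb, gamma, beta, epsilon) for xb in sub_batches]
    return np.concatenate(normalized, axis=0)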
Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.
Layer normalization
Layer normalization (LayerNorm) is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of transformer models.

For a given data input and layer, LayerNorm computes the mean $\mu$ and variance $\sigma^2$ over all the neurons in the layer. Similar to BatchNorm, learnable parameters $\gamma$ and $\beta$ are applied. It is defined by:

$y_i = \gamma_i \hat{x}_i + \beta_i, \qquad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

where:

$\mu = \frac{1}{D} \sum_{i=1}^D x_i, \qquad \sigma^2 = \frac{1}{D} \sum_{i=1}^D (x_i - \mu)^2$

and the index $i$ ranges over the $D$ neurons in that layer.
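A minimal NumPy sketch of LayerNorm under these definitions (the function name is a choice made here):

import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-9):
    # Statistics are computed over the features of each sample (last axis),
    # so the result does not depend on the batch size.
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta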
Examples
For example, in a CNN, a LayerNorm applies to all activations in a layer. In the previous notation, we have:

$\mu^{(l)}_{(b)} = \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W \sum_{c=1}^C x^{(l)}_{(b),h,w,c}$

$(\sigma^{(l)}_{(b)})^2 = \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W \sum_{c=1}^C \left( x^{(l)}_{(b),h,w,c} - \mu^{(l)}_{(b)} \right)^2$

Notice that the batch index $b$ is removed, while the channel index $c$ is added.
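In code, the only change from the layernorm sketch above is the set of axes over which the statistics are computed (a sketch assuming a (B, H, W, C) layout):

import numpy as np

def layernorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Per-sample statistics over all spatial positions and channels.
    mu = np.mean(x, axis=(1, 2, 3), keepdims=True)
    var = np.var(x, axis=(1, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta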
In recurrent neural networks and transformers, LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep $t$ is $x^{(t)} \in \mathbb{R}^D$, where $D$ is the dimension of the hidden vector, then LayerNorm will be applied with:

$\hat{x}^{(t)}_i = \frac{x^{(t)}_i - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}$

where:

$\mu^{(t)} = \frac{1}{D} \sum_{i=1}^D x^{(t)}_i, \qquad (\sigma^{(t)})^2 = \frac{1}{D} \sum_{i=1}^D \left( x^{(t)}_i - \mu^{(t)} \right)^2$
Root mean square layer normalization
Root mean square layer normalization (RMSNorm) changes LayerNorm by:

$y_i = \frac{x_i}{\sqrt{\frac{1}{D} \sum_{j=1}^D x_j^2}} \, \gamma_i$

Essentially, it is LayerNorm where we enforce $\mu = 0$ and $\epsilon = 0$. It is also called L2 normalization. It is a special case of Lp normalization, or power normalization:

$y_i = \frac{x_i}{\left( \frac{1}{D} \sum_{j=1}^D |x_j|^p \right)^{1/p}} \, \gamma_i$

where $p > 0$ is a constant.
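A minimal sketch under these definitions (an epsilon term is included here for numerical stability, as is common in practical implementations, even though the formula above sets it to zero):

import numpy as np

def rmsnorm(x, gamma, epsilon=1e-9):
    # Divide by the root mean square of the features; no mean subtraction.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + epsilon)
    return gamma * x / rms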
Adaptive
Adaptive layer norm (adaLN) computes the $\gamma, \beta$ in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs, and has been used effectively in diffusion transformers (DiTs). For example, in a DiT, the conditioning information is processed by a multilayer perceptron into $\gamma, \beta$, which is then applied in the LayerNorm module of a transformer.
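A minimal sketch of the idea (a single linear layer stands in for the conditioning MLP here, and all names and shapes are illustrative):

import numpy as np

def adaptive_layernorm(x, cond, W, b, epsilon=1e-9):
    # gamma and beta are computed from the conditioning vector `cond`,
    # not learned as free parameters of the norm itself.
    gamma, beta = np.split(cond @ W + b, 2, axis=-1)
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + epsilon) + beta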
Weight normalization

Weight normalization (WeightNorm) is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.

One example is spectral normalization, which divides weight matrices by their spectral norm $\|W\|_s$. Spectral normalization is used in generative adversarial networks (GANs) such as the Wasserstein GAN. The spectral norm can be efficiently approximated by the following power-iteration algorithm: given the matrix $W$ and an initial guess $x$, iterate

$x \mapsto \frac{Wx}{\|Wx\|_2}$

to convergence $x^*$; then $\|W\|_s \approx \|Wx^*\|_2$.
By reassigning $W_i \leftarrow \frac{W_i}{\|W_i\|_s}$ after each update of the discriminator, we can upper-bound $\|W_i\|_s \leq 1$, and thus upper-bound the Lipschitz norm of the discriminator $\|D\|_L$.
The algorithm can be further accelerated by memoization: at step $t$, store $x^*_i(t)$. Then, at step $t+1$, use $x^*_i(t)$ as the initial guess for the algorithm. Since $W_i(t+1)$ is very close to $W_i(t)$, so is $x^*_i(t)$ to $x^*_i(t+1)$, thus allowing rapid convergence.
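A minimal sketch of spectral normalization with this memoization (here the power iteration is applied to $W^\top W$, a standard variant that also handles non-square matrices; the fixed iteration count and function name are illustrative choices):

import numpy as np

def spectral_normalize(W, x, num_iters=5):
    # Power iteration on W^T W: x converges to the dominant right singular vector.
    for _ in range(num_iters):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    sigma = np.linalg.norm(W @ x)    # approximate spectral norm ||W||_s
    return W / sigma, x              # return x so it can be reused at the next step

# Memoization across training steps: reuse the previous x as the initial guess.
W = np.random.randn(64, 32)
x = np.random.randn(32)
W_sn, x = spectral_normalize(W, x)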