Batch normalization
In artificial neural networks, batch normalization is a normalization technique used to make training faster and more stable by adjusting the inputs to each layer—re-centering them around zero and re-scaling them to a standard size. It was introduced by Sergey Ioffe and Christian Szegedy in 2015.
Experts still debate why batch normalization works so well. It was initially thought to tackle internal covariate shift, a problem where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network. However, newer research suggests it doesn’t fix this shift but instead smooths the objective function—a mathematical guide the network follows to improve—enhancing performance. In very deep networks, batch normalization can initially cause a severe gradient explosion—where updates to the network grow uncontrollably large—but this is managed with shortcuts called skip connections in residual networks. Another theory is that batch normalization adjusts data by handling its size and path separately, speeding up training.
Internal covariate shift
Each layer in a neural network has inputs that follow a specific distribution, which shifts during training due to two main factors: the random starting values of the network's settings and the natural variation in the input data. This shifting pattern affecting the inputs to the network's inner layers is called internal covariate shift. While a strict definition isn't fully agreed upon, experiments show that it involves changes in the means and variances of these inputs during training.

Batch normalization was first developed to address internal covariate shift. During training, as the parameters of preceding layers adjust, the distribution of inputs to the current layer changes accordingly, so the current layer needs to constantly readjust to new distributions. This issue is particularly severe in deep networks, because small changes in shallower hidden layers are amplified as they propagate through the network, resulting in significant shift in deeper hidden layers. Batch normalization was proposed to reduce these unwanted shifts, speeding up training and producing more reliable models.
Beyond possibly tackling internal covariate shift, batch normalization offers several additional advantages. It allows the network to use a higher learning rate—a setting that controls how quickly the network learns—without causing problems like vanishing or exploding gradients, where updates become too small or too large. It also appears to have a regularizing effect, improving the network's ability to generalize to new data and reducing the need for dropout, a technique used to prevent overfitting. Additionally, networks using batch normalization are less sensitive to the choice of starting settings or learning rates, making them more robust and adaptable.
Procedures
Transformation
In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but to use this step jointly with stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restrained to each mini-batch in the training process.

Let us use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ could thus be denoted as

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{and} \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2.$$

For a layer of the network with $d$-dimensional input, $x = \left(x^{(1)}, \ldots, x^{(d)}\right)$, each dimension of its input is then normalized separately,

$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}},$$

where $k \in [1, d]$ and $i \in [1, m]$; $\mu_B^{(k)}$ and $\sigma_B^{(k)}$ are the per-dimension mean and standard deviation, respectively.

$\epsilon$ is added in the denominator for numerical stability and is an arbitrarily small positive constant. The resulting normalized activation $\hat{x}^{(k)}$ has zero mean and unit variance, if $\epsilon$ is not taken into account. To restore the representation power of the network, a transformation step then follows as

$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform $BN_{\gamma^{(k)}, \beta^{(k)}} \colon x_{1 \ldots m}^{(k)} \rightarrow y_{1 \ldots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform $y^{(k)} = BN_{\gamma^{(k)}, \beta^{(k)}}\left(x^{(k)}\right)$ is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.
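As an illustration, the following NumPy sketch implements the Batch Normalizing transform for a mini-batch (the function name batch_norm_forward, the batch shape, and the value of ε are our own choices for the example, not part of the original formulation):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing transform applied to a mini-batch.

    x     : array of shape (m, d), a mini-batch B of m examples with d features
    gamma : array of shape (d,), learned per-dimension scale
    beta  : array of shape (d,), learned per-dimension shift
    """
    mu = x.mean(axis=0)                      # empirical mean mu_B per dimension
    var = x.var(axis=0)                      # empirical variance sigma_B^2 per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean and unit variance
    y = gamma * x_hat + beta                 # learned scale and shift restore expressiveness
    return y, x_hat, mu, var

# A mini-batch of m = 4 examples with d = 3 features
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
y, x_hat, mu, var = batch_norm_forward(x, np.ones(3), np.zeros(3))
print(x_hat.mean(axis=0), x_hat.var(axis=0))  # approximately 0 and 1 per dimension
```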
Backpropagation
The described BN transform is a differentiable operation, and the gradient of the loss $l$ with respect to the different parameters can be computed directly with the chain rule.

Specifically, $\frac{\partial l}{\partial y_i^{(k)}}$ depends on the choice of activation function, and the gradient against other parameters could be expressed as a function of $\frac{\partial l}{\partial y_i^{(k)}}$:

$$\frac{\partial l}{\partial \hat{x}_i^{(k)}} = \frac{\partial l}{\partial y_i^{(k)}}\, \gamma^{(k)},$$

$$\frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i^{(k)}} \left(x_i^{(k)} - \mu_B^{(k)}\right) \left(-\frac{1}{2}\right) \left(\left(\sigma_B^{(k)}\right)^2 + \epsilon\right)^{-3/2},$$

$$\frac{\partial l}{\partial \mu_B^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i^{(k)}} \frac{-1}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} + \frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2} \frac{\sum_{i=1}^{m} -2\left(x_i^{(k)} - \mu_B^{(k)}\right)}{m},$$

$$\frac{\partial l}{\partial x_i^{(k)}} = \frac{\partial l}{\partial \hat{x}_i^{(k)}} \frac{1}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} + \frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2} \frac{2\left(x_i^{(k)} - \mu_B^{(k)}\right)}{m} + \frac{\partial l}{\partial \mu_B^{(k)}} \frac{1}{m},$$

$$\frac{\partial l}{\partial \gamma^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}}\, \hat{x}_i^{(k)},$$

and

$$\frac{\partial l}{\partial \beta^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}}.$$
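Collected into a single routine, these gradients give the backward pass of the BN transform. The sketch below (our own naming; it reuses the quantities returned by the forward sketch above) follows the chain-rule expressions directly:

```python
import numpy as np

def batch_norm_backward(dy, x, x_hat, mu, var, gamma, eps=1e-5):
    """Backward pass of the BN transform via the chain rule.

    dy : array of shape (m, d), the upstream gradient dl/dy for the mini-batch.
    Returns dl/dx, dl/dgamma and dl/dbeta.
    """
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dx_hat = dy * gamma                                              # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * std_inv**3     # dl/d(sigma_B^2)
    dmu = (np.sum(-dx_hat * std_inv, axis=0)
           + dvar * np.mean(-2.0 * (x - mu), axis=0))                # dl/d(mu_B)
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m      # dl/dx_i
    dgamma = np.sum(dy * x_hat, axis=0)                              # dl/dgamma
    dbeta = np.sum(dy, axis=0)                                       # dl/dbeta
    return dx, dgamma, dbeta
```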
Inference
During the training stage, the normalization steps depend on the mini-batches to ensure efficient and reliable training. However, in the inference stage, this dependence is not useful any more. Instead, the normalization step in this stage is computed with the population statistics such that the output could depend on the input in a deterministic manner. The population mean, $E\left[x^{(k)}\right]$, and variance, $\operatorname{Var}\left[x^{(k)}\right]$, are computed as:

$$E\left[x^{(k)}\right] = E_B\left[\mu_B^{(k)}\right] \quad \text{and} \quad \operatorname{Var}\left[x^{(k)}\right] = \frac{m}{m-1} E_B\left[\left(\sigma_B^{(k)}\right)^2\right].$$
The population statistics are thus a complete representation of the mini-batches.
The BN transform in the inference step thus becomes

$$y^{(k)} = BN^{\text{inf}}_{\gamma^{(k)}, \beta^{(k)}}\left(x^{(k)}\right) = \gamma^{(k)} \frac{x^{(k)} - E\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right] + \epsilon}} + \beta^{(k)},$$

where $y^{(k)}$ is passed on to the next layers instead of $x^{(k)}$. Since the parameters are fixed in this transformation, the batch normalization procedure is essentially applying a linear transform to the activation.
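Because the population statistics are fixed at inference time, the whole BN step collapses into a per-dimension scale and shift that can be precomputed. The sketch below illustrates this (a minimal sketch; in practice libraries usually estimate E[x] and Var[x] with running averages accumulated during training, which is an implementation choice rather than part of the definition):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a deterministic linear transform of the activation.

    pop_mean and pop_var are the population statistics E[x] and Var[x]
    estimated from the training mini-batches.
    """
    scale = gamma / np.sqrt(pop_var + eps)   # fixed per-dimension scale
    shift = beta - scale * pop_mean          # fixed per-dimension shift
    return scale * x + shift                 # y = scale * x + shift, deterministic in x
```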
Theory
Although batch normalization has become popular due to its strong empirical performance, the working mechanism of the method is not yet well-understood. The explanation made in the original paper was that batch norm works by reducing internal covariate shift, but this has been challenged by more recent work. One experiment trained a VGG-16 network under three different training regimes: standard (no batch norm), batch norm, and batch norm with noise added to each layer during training. In the third model, the noise has non-zero mean and non-unit variance, i.e. it explicitly introduces covariate shift. Despite this, it showed similar accuracy to the second model, and both performed better than the first, suggesting that covariate shift is not the reason that batch norm improves performance.

Using batch normalization causes the items in a batch to no longer be iid, which can lead to difficulties in training due to lower quality gradient estimation.
Smoothness
One alternative explanation is that the improvement with batch normalization is instead due to producing a smoother parameter space and smoother gradients, as formalized by a smaller Lipschitz constant.

Consider two identical networks, one of which contains batch normalization layers while the other does not; the behaviors of these two networks are then compared. Denote the loss functions as $\hat{L}$ and $L$, respectively. Let the input to both networks be $x$, and the output be $y$, for which $y = Wx$, where $W$ is the layer weights. For the second network, $y$ additionally goes through a batch normalization layer. Denote the normalized activation as $\hat{y}$, which has zero mean and unit variance. Let the transformed activation be $z = \gamma \hat{y} + \beta$, and suppose $\gamma$ and $\beta$ are constants. Finally, denote the standard deviation of $y_j$ over a mini-batch as $\sigma_j$.
First, it can be shown that the gradient magnitude of a batch normalized network, $\left\|\nabla_{y_j} \hat{L}\right\|$, is bounded, with the bound expressed as

$$\left\|\nabla_{y_j} \hat{L}\right\|^2 \le \frac{\gamma^2}{\sigma_j^2}\left(\left\|\nabla_{y_j} L\right\|^2 - \frac{1}{m}\left\langle \mathbf{1}, \nabla_{y_j} L\right\rangle^2 - \frac{1}{m}\left\langle \nabla_{y_j} L, \hat{y}_j\right\rangle^2\right).$$
Since the gradient magnitude represents the Lipschitzness of the loss, this relationship indicates that a batch normalized network could achieve a comparatively smaller Lipschitz constant, i.e. a smoother loss. Notice that the bound gets tighter when the gradient $\nabla_{y_j} L$ correlates with the activation $\hat{y}_j$, which is a common phenomenon. The scaling factor $\frac{\gamma^2}{\sigma_j^2}$ is also significant, since the variance is often large.
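A small numerical sketch of this scaling (our own toy setup: one activation dimension over a single mini-batch, ε ignored, with the upstream gradient of the batch-normalized network playing the role of $\nabla_{y_j} L$ in the comparison) shows how the $\gamma/\sigma_j$ factor and the subtracted projections shrink the gradient with respect to the pre-normalization activation:

```python
import numpy as np

# Toy comparison of gradient magnitudes with and without a BN layer
# for a single activation dimension over one mini-batch (eps ignored).
rng = np.random.default_rng(0)
m, gamma = 256, 1.0
y = rng.normal(loc=0.0, scale=10.0, size=m)   # activations with large mini-batch std sigma
g = rng.normal(size=m)                        # upstream gradient (dL/dy in the non-BN network)

sigma = y.std()
y_hat = (y - y.mean()) / sigma                # normalized activation: zero mean, unit variance

# Chain rule through z = gamma * y_hat + beta: gradient w.r.t. the pre-BN activation y
grad_y = (gamma / sigma) * (g - g.mean() - y_hat * np.mean(g * y_hat))

print(np.linalg.norm(g))       # gradient norm without batch normalization
print(np.linalg.norm(grad_y))  # smaller: scaled by gamma/sigma, minus projections onto 1 and y_hat
```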
Secondly, the quadratic form of the loss Hessian with respect to activation in the gradient direction can be bounded as

$$\left(\nabla_{y_j} \hat{L}\right)^{T} \frac{\partial \hat{L}}{\partial y_j \partial y_j} \left(\nabla_{y_j} \hat{L}\right) \le \frac{\gamma^2}{\sigma_j^2}\left(\left(\nabla_{y_j} L\right)^{T} \frac{\partial L}{\partial y_j \partial y_j} \left(\nabla_{y_j} L\right) - \frac{1}{m\gamma}\left\langle \nabla_{y_j} L, \hat{y}_j\right\rangle \left\|\frac{\partial \hat{L}}{\partial y_j}\right\|^2\right).$$
The scaling factor $\frac{\gamma^2}{\sigma_j^2}$ indicates that the loss Hessian is resilient to the mini-batch variance, whereas the second term on the right-hand side suggests that it becomes smoother when the Hessian and the inner product $\left\langle \nabla_{y_j} L, \hat{y}_j\right\rangle$ are non-negative. If the loss is locally convex, then the Hessian is positive semi-definite, while the inner product is positive if $\hat{y}_j$ is in the direction towards the minimum of the loss. It could thus be concluded from this inequality that the gradient generally becomes more predictive with the batch normalization layer.
It then follows to translate the bounds related to the loss with respect to the normalized activation to a bound on the loss with respect to the network weights:

$$\hat{g}_j \le \frac{\gamma^2}{\sigma_j^2}\left(g_j^2 - m\,\mu_{g_j}^2 - \lambda^2\left\langle \nabla_{y_j} L, \hat{y}_j\right\rangle^2\right),$$

where $g_j = \max_{\|X\| \le \lambda}\left\|\nabla_W L\right\|^2$ and $\hat{g}_j = \max_{\|X\| \le \lambda}\left\|\nabla_W \hat{L}\right\|^2$ are the worst-case gradient magnitudes with respect to the weights over inputs of bounded norm, and $\mu_{g_j}$ denotes the mean of the gradient $\nabla_{y_j} L$.
In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization with the following inequality:

$$\left\|W_0 - \hat{W}^*\right\|^2 \le \left\|W_0 - W^*\right\|^2 - \frac{1}{\left\|W^*\right\|^2}\left(\left\|W^*\right\|^2 - \left\langle W^*, W_0\right\rangle\right)^2,$$

where $W^*$ and $\hat{W}^*$ are the local optimal weights for the two networks, respectively, and $W_0$ is the initialization.
Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points. It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.
Measure
Since it is hypothesized that batch normalization layers could reduce internal covariate shift, an experiment is set up to measure quantitatively how much covariate shift is reduced. First, the notion of internal covariate shift needs to be defined mathematically. Specifically, to quantify the adjustment that a layer's parameters make in response to updates in previous layers, the correlation between the gradients of the loss before and after all previous layers are updated is measured, since gradients could capture the shifts from the first-order training method. If the shift introduced by the changes in previous layers is small, then the correlation between the gradients would be close to 1.

The correlation between the gradients is computed for four models: a standard VGG network, a VGG network with batch normalization layers, a 25-layer deep linear network (DLN) trained with full-batch gradient descent, and a DLN network with batch normalization layers. Interestingly, it is shown that the standard VGG and DLN models both have higher correlations of gradients compared with their counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.
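As an illustration of the measurement itself (not of the VGG/DLN experiment; the toy two-layer linear network, loss, and step size below are our own), one can compute the cosine similarity between a layer's gradient before and after the preceding layer is updated:

```python
import numpy as np

# Toy version of the measurement: correlation between a layer's gradients
# before and after the preceding layer takes a gradient step.
rng = np.random.default_rng(0)
n, d, h = 128, 10, 16
X = rng.normal(size=(n, d))            # inputs
t = rng.normal(size=(n, 1))            # targets
W1 = 0.1 * rng.normal(size=(d, h))     # first (preceding) layer
W2 = 0.1 * rng.normal(size=(h, 1))     # second layer, whose gradient we track

def grads(W1, W2):
    """Gradients of the mean squared error for a two-layer linear network."""
    a = X @ W1                          # hidden activations
    err = a @ W2 - t                    # prediction residual
    return X.T @ (err @ W2.T) / n, a.T @ err / n

dW1, dW2_before = grads(W1, W2)
_, dW2_after = grads(W1 - 0.1 * dW1, W2)   # update only the preceding layer

cos = np.sum(dW2_before * dW2_after) / (
    np.linalg.norm(dW2_before) * np.linalg.norm(dW2_after))
print(cos)   # close to 1 means little internal covariate shift for this layer
```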