Softmax function


The softmax function, also known as softargmax or normalized exponential function, converts a tuple of real numbers into a probability distribution over possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Definition

The softmax function takes as input a tuple of real numbers and normalizes it into a probability distribution consisting of probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
Formally, the standard softmax function σ : ℝ^K → (0, 1)^K, where K ≥ 1, takes a tuple z = (z_1, ..., z_K) and computes each component of the vector σ(z) ∈ (0, 1)^K with

\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.
In words, the softmax applies the standard exponential function to each element z_i of the input tuple z, and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector σ(z) is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element.
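The definition above can be sketched directly in Python. This is a minimal illustration of the standard softmax, not a numerically hardened implementation:

```python
import math

def softmax(z):
    """Standard softmax: exponentiate each component, then normalize."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# The components sum to 1, and the exponential amplifies the maximum:
probs = softmax([1.0, 2.0, 8.0])
# probs is approximately [0.001, 0.002, 0.997]
```

Almost all of the unit weight lands on the third position, matching the example above.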
In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing b = e^β or b = e^(−β) (for real β) yields the expressions:

\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}}.
A value proportional to the reciprocal of β is sometimes referred to as the temperature: β = 1/(kT), where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution, while a lower temperature results in a sharper output distribution, with one value dominating.
In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter β (or T) is varied.
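The effect of the parameter β can be demonstrated with a short Python sketch (illustrative only; `softmax_beta` is a name chosen here, not a library function):

```python
import math

def softmax_beta(z, beta):
    """Softmax with base b = e**beta, i.e. beta = 1/T at temperature T (k = 1)."""
    exps = [math.exp(beta * v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.0, 1.0, 2.0]
cold = softmax_beta(z, 10.0)  # low temperature (T = 0.1): sharp, near one-hot
hot = softmax_beta(z, 0.1)    # high temperature (T = 10): near uniform
```

With β = 10 nearly all the weight goes to the largest input, while with β = 0.1 the three outputs are almost equal.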
The softmax function is a multiple-variable generalization of the logistic function.

Interpretations

Smooth arg max

The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a tuple's largest element. The name "softmax" may be misleading: softmax is not a smooth maximum. That role belongs to the closely related LogSumExp function, for which the term "softmax" is also sometimes used. For this reason, some prefer the more accurate term "softargmax", though "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity.
Formally, instead of considering the arg max as a function with categorical output 1, ..., n, consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

\operatorname{arg\,max}(z_1, \dots, z_n) = (y_1, \dots, y_n) = (0, \dots, 0, 1, 0, \dots, 0),
where the output coordinate y_i = 1 if and only if i is the arg max of (z_1, ..., z_n), meaning z_i is the unique maximum value of (z_1, ..., z_n). For example, in this encoding arg max(1, 5, 10) = (0, 0, 1), since the third argument is the maximum.
This can be generalized to multiple arg max values by dividing the 1 between all max args; formally each maximal coordinate receives 1/k, where k is the number of arguments assuming the maximum. For example, arg max(1, 5, 5) = (0, 1/2, 1/2), since the second and third arguments are both the maximum. In case all arguments are equal, this is simply arg max(z, ..., z) = (1/n, ..., 1/n). Points z with multiple arg max values are singular points – these are the points where arg max is discontinuous – while points with a single arg max are known as non-singular or regular points.
With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as β → ∞, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z, σ_β(z) → arg max(z) as β → ∞. However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal, the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, σ_β(1, 1.0001) → (0, 1), but σ_β(1, 0.9999) → (1, 0), and σ_β(1, 1) = (1/2, 1/2) for all β: the closer the points are to the singular set, the slower they converge. However, softargmax does converge compactly on the non-singular set.
Conversely, as β → −∞, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring, and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".
It is also the case that, for any fixed β, if one input z_i is much larger than the others relative to the temperature T = 1/β, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

\sigma(0, 10) = \left(\frac{1}{1 + e^{10}}, \frac{e^{10}}{1 + e^{10}}\right) \approx (0.00005, 0.99995).
However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

\sigma_{1/100}(0, 10) = \left(\frac{1}{1 + e^{1/10}}, \frac{e^{1/10}}{1 + e^{1/10}}\right) \approx (0.475, 0.525).
As β → ∞, the temperature goes to zero, T = 1/β → 0, so eventually all differences become large (relative to the shrinking temperature), which gives another interpretation of the limit behavior.
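These temperature effects can be checked numerically. The sketch below (plain Python, illustrative only) evaluates softmax with inverse temperature β on the two-point input (0, 10):

```python
import math

def softmax_beta(z, beta):
    """Softmax with inverse temperature beta = 1/T."""
    exps = [math.exp(beta * v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# A difference of 10 at temperature T = 1 (beta = 1): essentially arg max.
sharp = softmax_beta([0.0, 10.0], 1.0)    # ~ (0.00005, 0.99995)
# The same difference at temperature T = 100 (beta = 0.01): nearly uniform.
flat = softmax_beta([0.0, 10.0], 0.01)    # ~ (0.475, 0.525)
# Raising beta again recovers the one-hot arg max in the limit.
limit = softmax_beta([0.0, 10.0], 10.0)
```

As the text describes, the same absolute difference looks "large" or "small" depending only on its size relative to the temperature.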

Statistical mechanics

In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution): the index set {1, ..., k} indexes the microstates of the system; the inputs ε_i are the energies of the corresponding states; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).

Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression, multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T} \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^\mathsf{T} \mathbf{w}_k}}.
This can be seen as the composition of K linear functions x ↦ x^T w_1, ..., x ↦ x^T w_K and the softmax function (where x^T w denotes the inner product of x and w). The operation is equivalent to applying a linear operator defined by w to vectors x, thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space.
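This composition of linear scores and softmax can be sketched as follows. The function and variable names are illustrative, not from any particular library, and the weights are arbitrary example values:

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, weights):
    """P(y = j | x) for each class j: softmax of the K linear scores x . w_j.

    `weights` is a list of K per-class weight vectors.
    """
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
    return softmax(scores)

x = [1.0, 2.0]
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # K = 3 classes
probs = predict_proba(x, weights)
```

The K inner products map the input into a K-dimensional score space, and softmax then turns the scores into class probabilities.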

Neural networks

The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss regime, giving a non-linear variant of multinomial logistic regression.
Since the function maps a tuple q and a specific index i to a real value, the derivative needs to take the index into account:

\frac{\partial}{\partial q_k} \sigma(\mathbf{q}, i) = \sigma(\mathbf{q}, i)\,(\delta_{ik} - \sigma(\mathbf{q}, k)).
This expression is symmetrical in the indexes i, k and thus may also be expressed as

\frac{\partial}{\partial q_k} \sigma(\mathbf{q}, i) = \sigma(\mathbf{q}, k)\,(\delta_{ik} - \sigma(\mathbf{q}, i)).
Here, the Kronecker delta δ_{ik} (equal to 1 if i = k and 0 otherwise) is used for simplicity.
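The derivative formula can be verified numerically. The sketch below assembles the full Jacobian from the expression σ_i(δ_{ik} − σ_k) and checks it against a finite-difference approximation (illustrative code, not a library implementation):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][k] = d sigma_i / d q_k = sigma_i * (delta_ik - sigma_k)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == k else 0.0) - s[k]) for k in range(n)]
            for i in range(n)]

z = [0.2, -1.0, 0.5]
J = softmax_jacobian(z)

# Sanity check against a central finite-difference approximation.
eps = 1e-6
for k in range(len(z)):
    zp, zm = list(z), list(z)
    zp[k] += eps
    zm[k] -= eps
    fd = [(a - b) / (2 * eps) for a, b in zip(softmax(zp), softmax(zm))]
    assert all(abs(fd[i] - J[i][k]) < 1e-8 for i in range(len(z)))
```

Because the softmax outputs always sum to 1, each column of the Jacobian sums to zero, and the symmetry in i and k noted above makes the matrix symmetric.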
To ensure stable numerical computations, subtracting the maximum value from the input tuple is common. While this does not alter the output or the derivative theoretically, it enhances stability by directly controlling the magnitude of the largest exponent computed.
If the function is scaled with the parameter β, then these expressions must be multiplied by β.
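The max-subtraction trick for numerical stability can be sketched as follows. Subtracting max(z) from every component leaves the output unchanged, because the common factor e^(−max(z)) cancels between numerator and denominator, but it guarantees that the largest exponent computed is 0:

```python
import math

def softmax_stable(z):
    """Softmax with the maximum subtracted before exponentiation.

    The shift cancels in numerator and denominator, so the result is
    mathematically identical to the unshifted softmax, but no exponent
    ever exceeds 0, avoiding overflow.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# A naive implementation would overflow on exp(1000); this one does not.
probs = softmax_stable([1000.0, 1000.5, 999.0])
```

For moderate inputs the shifted and unshifted versions agree to machine precision; for large inputs only the shifted version runs at all.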
See multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:

P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^{n} \exp(q_t(i)/\tau)},
where the action value q_t(a) corresponds to the expected reward of following action a, and τ is called a temperature parameter. For high temperatures, all actions have nearly the same probability; the lower the temperature, the more the expected rewards affect the probabilities. For a low temperature, the probability of the action with the highest expected reward tends to 1.
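This softmax (Boltzmann) action selection can be sketched in Python. The names and example action values are illustrative only:

```python
import math
import random

def action_probabilities(q_values, tau):
    """Boltzmann exploration: P(a) = exp(q(a)/tau) / sum_b exp(q(b)/tau)."""
    exps = [math.exp(q / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = action_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [1.0, 2.0, 5.0]
greedy = action_probabilities(q, 0.1)     # low temperature: near-greedy
explore = action_probabilities(q, 100.0)  # high temperature: near-uniform
```

At τ = 0.1 almost all probability falls on the best action, while at τ = 100 the three actions are chosen almost uniformly.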

Computational complexity and remedies

In neural network applications, the number of possible outcomes is often large, e.g. in the case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words. This can make the calculations for the softmax layer computationally expensive. Moreover, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.
Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax. The hierarchical softmax uses a binary tree structure where the outcomes are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables. The desired probability of a leaf can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf. Ideally, when the tree is balanced, this would reduce the computational complexity from O(K) to O(log K). In practice, results depend on choosing a good strategy for clustering the outcomes into classes. A Huffman tree was used for this in Google's word2vec models to achieve scalability.
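The root-to-leaf product can be sketched as follows, assuming (as in word2vec-style hierarchical softmax) that each internal node makes a binary left/right decision via a logistic sigmoid. The tree shape, node scores, and all names here are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(path, node_scores):
    """Probability of a leaf as the product of binary decisions on its path.

    `path` is a list of (node_id, go_right) pairs from root to leaf; each
    internal node contributes sigmoid(score) for a right turn and
    1 - sigmoid(score) for a left turn.
    """
    p = 1.0
    for node_id, go_right in path:
        s = sigmoid(node_scores[node_id])
        p *= s if go_right else 1.0 - s
    return p

# A balanced 4-leaf tree has internal nodes 0 (root), 1, and 2.
node_scores = {0: 0.3, 1: -1.2, 2: 0.7}
paths = {
    "leaf0": [(0, False), (1, False)],
    "leaf1": [(0, False), (1, True)],
    "leaf2": [(0, True), (2, False)],
    "leaf3": [(0, True), (2, True)],
}
probs = {leaf: leaf_probability(p, node_scores) for leaf, p in paths.items()}
# The leaf probabilities sum to 1 by construction.
```

Evaluating one leaf costs a number of sigmoids proportional to the tree depth, rather than a sum over all K outcomes, which is the source of the O(log K) cost for a balanced tree.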
A second kind of remedy is based on approximating the softmax with modified loss functions that avoid the calculation of the full normalization factor. These include methods that restrict the normalization sum to a sample of outcomes.