Kullback–Leibler divergence


In mathematical statistics, the Kullback–Leibler (KL) divergence, also called relative entropy and denoted $D_{\text{KL}}(P \parallel Q)$, is a type of statistical distance: a measure of how much an approximating probability distribution $Q$ is different from a true probability distribution $P$. Mathematically, it is defined as
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x)\,\log\frac{P(x)}{Q(x)}.$$
A simple interpretation of the KL divergence of $P$ from $Q$ is the expected excess surprisal from using $Q$ as an approximation instead of $P$ when the actual distribution is $P$. While it is a measure of how different two distributions are and is thus a distance in some sense, it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.
KL divergence is always a non-negative real number, with value 0 if and only if the two distributions in question are identical. It has diverse applications, both theoretical, such as characterizing the relative entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience, bioinformatics, and machine learning.

Introduction and context

Consider two probability distributions $P$ and $Q$: a true distribution $P$ and an approximating distribution $Q$. Often, $P$ represents the data, the observations, or a measured probability distribution, and the distribution $Q$ represents instead a theory, a model, a description, or another approximation of $P$. However, sometimes the true distribution $P$ represents a model and the approximating distribution $Q$ represents data that are intended to match the true distribution. The Kullback–Leibler divergence $D_{\text{KL}}(P \parallel Q)$ is then interpreted as the average difference in the number of bits required for encoding samples of $P$ using a code optimized for $Q$ rather than one optimized for $P$.
Note that the roles of $P$ and $Q$ can be reversed in some situations where that is easier to compute and the goal is to minimize $D_{\text{KL}}(Q \parallel P)$, such as with the expectation–maximization (EM) algorithm and evidence lower bound (ELBO) computations. This role-reversal approach exploits the fact that $D_{\text{KL}}(Q \parallel P) = 0$ if and only if $D_{\text{KL}}(P \parallel Q) = 0$ (both hold exactly when $P = Q$) and that, in many cases, reducing one has the effect of reducing the other.

Etymology

The relative entropy was introduced by Solomon Kullback and Richard Leibler in their 1951 paper as "the mean information for discrimination between $H_1$ and $H_2$ per observation from $\mu_1$", where one is comparing two probability measures $\mu_1$ and $\mu_2$, and $H_1$, $H_2$ are the hypotheses that one is selecting from measure $\mu_1$ or $\mu_2$. They denoted this by $I(1:2)$, and defined the "divergence" between $\mu_1$ and $\mu_2$ as the symmetrized quantity $J(1,2) = I(1:2) + I(2:1)$, which had already been defined and used by Harold Jeffreys in 1948. In Kullback's 1959 book Information Theory and Statistics, the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as "directed divergences" between two distributions; Kullback preferred the term discrimination information. The term "divergence" is in contrast to a distance, since the symmetrized divergence does not satisfy the triangle inequality. Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given there. The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.

Definition

For discrete probability distributions $P$ and $Q$ defined on the same sample space $\mathcal{X}$, the relative entropy from $Q$ to $P$ is defined to be
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x)\,\log\frac{P(x)}{Q(x)},$$
which is equivalent to
$$D_{\text{KL}}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x)\,\log\frac{Q(x)}{P(x)}.$$
In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken using the probabilities $P$.
Relative entropy is only defined in this way if, for all $x$, $Q(x) = 0$ implies $P(x) = 0$ (absolute continuity). Otherwise, it is often defined as $+\infty$, but the value $+\infty$ is possible even if $Q(x) \neq 0$ everywhere, provided that the sample space $\mathcal{X}$ is infinite in extent. Analogous comments apply to the continuous and general measure cases defined below.
Whenever $P(x)$ is zero the contribution of the corresponding term is interpreted as zero because
$$\lim_{x \to 0^{+}} x \log(x) = 0.$$
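The discrete definition translates directly into code. The following is a minimal sketch (the function name and conventions are illustrative, not from the source) that returns the relative entropy of two finite distributions given as sequences of probabilities, using the conventions above for zero probabilities:

```python
import math

def kl_divergence(p, q, base=math.e):
    """Discrete relative entropy D_KL(P || Q) of two finite distributions.

    p, q : sequences of probabilities over the same sample space.
    base : 2 for bits, math.e (default) for nats.
    Terms with P(x) = 0 contribute 0; the result is +infinity if
    Q(x) = 0 while P(x) > 0 (absolute continuity fails).
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue                 # 0 * log(0 / q) is taken to be 0
        if qx == 0.0:
            return math.inf          # P is not absolutely continuous w.r.t. Q
        total += px * math.log(px / qx, base)
    return total
```

For example, kl_divergence([0.5, 0.5], [0.9, 0.1]) gives about 0.5108 nats, while kl_divergence([0.9, 0.1], [0.5, 0.5]) gives about 0.3681 nats, illustrating the asymmetry discussed above.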
For distributions $P$ and $Q$ of a continuous random variable, relative entropy is defined to be the integral
$$D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{\infty} p(x)\,\log\frac{p(x)}{q(x)}\,dx,$$
where $p$ and $q$ denote the probability densities of $P$ and $Q$.
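For continuous distributions the integral can be checked numerically against known closed forms. The sketch below (the function names and parameter values are illustrative assumptions) approximates the integral for two univariate normal distributions with a simple Riemann sum and compares it with the well-known closed-form expression for Gaussian relative entropy:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl_normal_numeric(mu1, s1, mu2, s2, lo=-40.0, hi=40.0, n=160000):
    """Midpoint Riemann sum of p(x) * log(p(x) / q(x)) over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, mu1, s1)
        q = normal_pdf(x, mu2, s2)
        if p > 0.0 and q > 0.0:      # skip regions where the densities underflow
            total += p * math.log(p / q) * dx
    return total

def kl_normal_closed(mu1, s1, mu2, s2):
    """Closed form for D_KL(N(mu1, s1^2) || N(mu2, s2^2)), in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

print(kl_normal_numeric(0.0, 1.0, 1.0, 2.0))   # ~0.4431
print(kl_normal_closed(0.0, 1.0, 1.0, 2.0))    # 0.4431...
```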
More generally, if $P$ and $Q$ are probability measures on a measurable space $\mathcal{X}$, and $P$ is absolutely continuous with respect to $Q$, then the relative entropy from $Q$ to $P$ is defined as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} \log\frac{dP}{dQ}\,dP,$$
where $\frac{dP}{dQ}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$, i.e. the unique $Q$-almost-everywhere defined function $r$ on $\mathcal{X}$ such that $dP = r\,dQ$, which exists because $P$ is absolutely continuous with respect to $Q$. Also we assume the expression on the right-hand side exists. Equivalently, this can be written as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} \frac{dP}{dQ}\,\log\frac{dP}{dQ}\,dQ,$$
which is the entropy of $P$ relative to $Q$. Continuing in this case, if $\mu$ is any measure on $\mathcal{X}$ for which densities $p$ and $q$ with $dP = p\,d\mu$ and $dQ = q\,d\mu$ exist (meaning that $P$ and $Q$ are both absolutely continuous with respect to $\mu$), then the relative entropy from $Q$ to $P$ is given as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} p\,\log\frac{p}{q}\,d\mu.$$
Note that such a measure $\mu$ for which densities can be defined always exists, since one can take $\mu = \tfrac{1}{2}(P + Q)$, although in practice it will usually be one that arises naturally in the context, such as the counting measure for discrete distributions, or the Lebesgue measure or a convenient variant thereof (such as a Gaussian measure, the uniform measure on the sphere, the Haar measure on a Lie group, etc.) for continuous distributions.
The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits, or to base $e$ if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.
Various conventions exist for referring to $D_{\text{KL}}(P \parallel Q)$ in words. Often it is referred to as the divergence between $P$ and $Q$, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of $P$ from $Q$, or as the divergence from $Q$ to $P$. This reflects the asymmetry in Bayesian inference, which starts from a prior $Q$ and updates to the posterior $P$. Another common way to refer to $D_{\text{KL}}(P \parallel Q)$ is as the relative entropy of $P$ with respect to $Q$, or the information gain from $P$ over $Q$.

Basic example

Kullback gives the following example. Let $P$ and $Q$ be the distributions shown in the table below. $P$ is a binomial distribution with $N = 2$ and $p = 0.4$. $Q$ is a discrete uniform distribution over the three possible outcomes $x = 0, 1, 2$, each with probability $1/3$.
x        0        1        2
P(x)     9/25     12/25    4/25
Q(x)     1/3      1/3      1/3

Relative entropies $D_{\text{KL}}(P \parallel Q)$ and $D_{\text{KL}}(Q \parallel P)$ are calculated as follows. This example uses the natural log, with base $e$ and designated $\ln$, to get results in nats:
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x)\,\ln\frac{P(x)}{Q(x)} = \frac{9}{25}\ln\frac{9/25}{1/3} + \frac{12}{25}\ln\frac{12/25}{1/3} + \frac{4}{25}\ln\frac{4/25}{1/3} \approx 0.0853\ \text{nats},$$
$$D_{\text{KL}}(Q \parallel P) = \sum_{x \in \mathcal{X}} Q(x)\,\ln\frac{Q(x)}{P(x)} = \frac{1}{3}\ln\frac{1/3}{9/25} + \frac{1}{3}\ln\frac{1/3}{12/25} + \frac{1}{3}\ln\frac{1/3}{4/25} \approx 0.0975\ \text{nats}.$$
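The same numbers can be reproduced with a few lines of code; this sketch simply re-evaluates the sums above using the table values:

```python
import math

P = [9/25, 12/25, 4/25]   # binomial with N = 2, p = 0.4
Q = [1/3, 1/3, 1/3]       # uniform over {0, 1, 2}

def kl(p, q):
    """Relative entropy in nats (natural logarithm)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl(P, Q))   # ~0.0853 nats
print(kl(Q, P))   # ~0.0975 nats: the two directions differ
```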

Interpretations

Statistics

In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions $P$ and $Q$ based on an observation $Y$ (drawn from one of them) is through the log of the ratio of their likelihoods: $\log P(Y) - \log Q(Y)$. The KL divergence is the expected value of this statistic if $Y$ is actually drawn from $P$. Kullback motivated the statistic as an expected log likelihood ratio.
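This interpretation suggests a simple Monte Carlo estimator: draw samples from $P$ and average the log-likelihood ratio. The sketch below does this for two Bernoulli distributions (the parameters and names are illustrative) and compares the estimate with the exact value:

```python
import math
import random

p, q = 0.7, 0.4            # Bernoulli parameters for P and Q (illustrative)
random.seed(0)

def log_likelihood_ratio(y):
    """log P(y) - log Q(y) for a single observation y in {0, 1}."""
    py = p if y == 1 else 1 - p
    qy = q if y == 1 else 1 - q
    return math.log(py / qy)

# Average the statistic over samples drawn from P.
samples = [1 if random.random() < p else 0 for _ in range(100_000)]
estimate = sum(log_likelihood_ratio(y) for y in samples) / len(samples)

exact = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(estimate, exact)     # both close to ~0.1838 nats
```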

Coding

In the context of coding theory, $D_{\text{KL}}(P \parallel Q)$ can be constructed by measuring the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$ rather than the code optimized for $P$.
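Concretely, the expected extra code length is the cross-entropy $H(P, Q)$ minus the entropy $H(P)$, which equals $D_{\text{KL}}(P \parallel Q)$ when measured in bits. A small sketch with illustrative distributions:

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # source distribution (illustrative)
Q = [0.25, 0.25, 0.25, 0.25]    # distribution the code was optimized for

def entropy(p):
    """H(P) in bits: optimal average code length for samples from P."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: average length when coding samples of P with a code optimized for Q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

extra_bits = cross_entropy(P, Q) - entropy(P)
kl_bits = sum(px * math.log2(px / qx) for px, qx in zip(P, Q) if px > 0)
print(extra_bits, kl_bits)      # both 0.25 bits per symbol
```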

Inference

In the context of machine learning, $D_{\text{KL}}(P \parallel Q)$ is often called the information gain achieved if $P$ would be used instead of $Q$, which is currently used. By analogy with information theory, it is called the relative entropy of $P$ with respect to $Q$.
Expressed in the language of Bayesian inference, $D_{\text{KL}}(P \parallel Q)$ is a measure of the information gained by revising one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.
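As a small illustration of the Bayesian reading (a sketch with made-up numbers, not taken from the source), consider a uniform prior over three hypotheses updated on a single observation; the information gained by the update is $D_{\text{KL}}(\text{posterior} \parallel \text{prior})$:

```python
import math

prior = [1/3, 1/3, 1/3]          # Q: beliefs before seeing the data
likelihood = [0.8, 0.3, 0.1]     # P(data | hypothesis), illustrative values

# Bayes' rule: posterior is proportional to likelihood times prior.
unnorm = [l * pr for l, pr in zip(likelihood, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

# Information gained by revising beliefs from the prior to the posterior, in nats.
gain = sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0)
print(posterior)   # [2/3, 1/4, 1/12]
print(gain)        # ~0.275 nats of information gain
```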

Information geometry

In applications, $P$ typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while $Q$ typically represents a theory, model, description, or approximation of $P$. In order to find a distribution $Q$ that is closest to $P$, we can minimize the KL divergence and compute an information projection.
While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence. While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_{\text{KL}}(P \parallel Q)$ does not equal $D_{\text{KL}}(Q \parallel P)$, and the asymmetry is an important part of the geometry. The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric. The Fisher information metric on the manifold of probability distributions determines the natural gradient used in information-geometric optimization algorithms; its quantum analogue is the Fubini–Study metric. Relative entropy satisfies a generalized Pythagorean theorem for exponential families, and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.
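The statement about the Hessian can be checked numerically in a one-parameter family. The sketch below (illustrative names and values) uses the Bernoulli family, for which the Fisher information is known to be $I(\theta) = 1/(\theta(1-\theta))$, and compares it with the second-order behaviour $D_{\text{KL}}(P_\theta \parallel P_{\theta+\varepsilon}) \approx \tfrac{1}{2} I(\theta)\,\varepsilon^2$:

```python
import math

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

theta, eps = 0.3, 1e-4
fisher = 1.0 / (theta * (1.0 - theta))                    # known Fisher information
hessian_estimate = 2.0 * kl_bernoulli(theta, theta + eps) / eps ** 2

print(fisher)             # 4.7619...
print(hessian_estimate)   # approximately the same value for small eps
```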
The relative entropy is the Bregman divergence generated by the negative entropy, but it is also of the form of an $f$-divergence. For probabilities over a finite alphabet, it is unique in being a member of both of these classes of statistical divergences. Applications of the Bregman divergence can be found in mirror descent.
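The Bregman claim is easy to verify directly on the probability simplex: with generator $\varphi(p) = \sum_i p_i \log p_i$ (negative entropy), the Bregman divergence $B_\varphi(p, q) = \varphi(p) - \varphi(q) - \langle \nabla\varphi(q),\, p - q\rangle$ reduces to the relative entropy. A sketch with illustrative vectors:

```python
import math

def neg_entropy(p):
    """Generator phi(p) = sum_i p_i * log(p_i) (negative entropy)."""
    return sum(x * math.log(x) for x in p if x > 0)

def bregman_neg_entropy(p, q):
    """B_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    grad_q = [math.log(x) + 1 for x in q]
    inner = sum(g * (a - b) for g, a, b in zip(grad_q, p, q))
    return neg_entropy(p) - neg_entropy(q) - inner

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]
print(bregman_neg_entropy(P, Q))   # equals D_KL(P || Q) on the simplex
print(kl(P, Q))
```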

Finance (game theory)

Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes. The rate of return expected by such an investor is equal to the relative entropy between the investor's believed probabilities and the official odds. This is a special case of a much more general connection between financial returns and divergence measures.
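A sketch of this connection under standard assumptions (fair odds whose implied probabilities sum to one, and an investor who stakes wealth in proportion to their believed probabilities, i.e. Kelly-style betting); the probabilities below are illustrative:

```python
import math

p = [0.5, 0.3, 0.2]   # investor's believed probabilities (illustrative)
r = [0.4, 0.4, 0.2]   # probabilities implied by the fair official odds (sum to 1)

# A fair bet on outcome i pays 1/r[i] per unit staked. Betting the fraction p[i]
# of wealth on each outcome multiplies wealth by p[i] / r[i] when outcome i occurs.
payout = [1.0 / ri for ri in r]
expected_log_growth = sum(pi * math.log(pi * oi) for pi, oi in zip(p, payout))

# The same quantity, written as a relative entropy between beliefs and the odds.
kl = sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))
print(expected_log_growth, kl)   # equal; zero only when the beliefs match the odds
```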
Financial risks are connected to $D_{\text{KL}}(P \parallel Q)$ via information geometry. Investors' views, the prevailing market view, and risky scenarios form triangles on the relevant manifold of probability distributions. The shape of the triangles determines key financial risks. For instance, obtuse triangles in which investors' views and risk scenarios appear on "opposite sides" relative to the market describe negative risks, acute triangles describe positive exposure, and the right-angled situation in the middle corresponds to zero risk. Extending this concept, relative entropy can hypothetically be used to identify the behaviour of informed investors, if one takes this to be represented by the magnitude of, and deviations away from, the prior expectations of fund flows, for example.