F-divergence
In probability theory, an $f$-divergence is a certain type of function $D_f(P\|Q)$ that measures the difference between two probability distributions $P$ and $Q$. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of $f$-divergence.
History
These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár, Morimoto, and Ali & Silvey, and are sometimes known as Csiszár $f$-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.
Definition
Non-singular case
Let $P$ and $Q$ be two probability distributions over a space $\Omega$, such that $P \ll Q$, that is, $P$ is absolutely continuous with respect to $Q$. Then, for a convex function $f: [0, +\infty) \to (-\infty, +\infty]$ such that $f(x)$ is finite for all $x > 0$, $f(1) = 0$, and $f(0) = \lim_{t \to 0^+} f(t)$ (which could be infinite), the $f$-divergence of $P$ from $Q$ is defined as
$$D_f(P \| Q) \equiv \int_\Omega f\!\left(\frac{dP}{dQ}\right)\, dQ.$$
We call $f$ the generator of $D_f$.
In concrete applications, there is usually a reference distribution $\mu$ on $\Omega$, such that $P, Q \ll \mu$; then we can use the Radon–Nikodym theorem to take their probability densities $p$ and $q$, giving
$$D_f(P \| Q) = \int_\Omega f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, d\mu(x).$$
When there is no such reference distribution ready at hand, we can simply define $\mu = P + Q$, and proceed as above. This is a useful technique in more abstract proofs.
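As a minimal numerical sketch of this definition in the discrete case, the snippet below evaluates $D_f(P\|Q) = \sum_i q_i f(p_i/q_i)$ with the counting measure playing the role of $\mu$; the example distributions and the helper name `f_divergence` are illustrative choices, not part of the definition itself.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i) for strictly positive discrete p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

kl = f_divergence(p, q, lambda t: t * np.log(t))          # KL divergence, f(t) = t log t
tv = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))    # total variation, f(t) = |t - 1| / 2
h2 = f_divergence(p, q, lambda t: (np.sqrt(t) - 1) ** 2)  # squared Hellinger, f(t) = (sqrt(t) - 1)^2
print(kl, tv, h2)
```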
Extension to singular measures
The above definition can be extended to cases where $P \ll Q$ is no longer satisfied.
Since $f$ is convex, and $f(1) = 0$, the function $\frac{f(x)}{x-1}$ must be nondecreasing, so there exists $f'(\infty) := \lim_{x \to +\infty} \frac{f(x)}{x}$, taking value in $(-\infty, +\infty]$.
Since for any $p(x) > 0$, we have $\lim_{q(x) \to 0^+} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = p(x)\, f'(\infty)$, we can extend the f-divergence to the case of $P \not\ll Q$.
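A hedged sketch of this extension for discrete distributions whose supports may differ: points with $q_i = 0 < p_i$ contribute $p_i\, f'(\infty)$, and points with $p_i = q_i = 0$ contribute nothing. The helper name and the example distributions below are illustrative.

```python
import numpy as np

def f_divergence_extended(p, q, f, f_prime_inf):
    """D_f(P || Q) for discrete p, q that may put zero mass on some points.
    f_prime_inf is lim_{x -> inf} f(x)/x, possibly np.inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)   # ordinary term q * f(p/q)
        elif pi > 0:
            total += pi * f_prime_inf  # singular part: p * f'(inf)
    return total

p = [0.6, 0.4, 0.0]
q = [0.5, 0.0, 0.5]

# Total variation: f(t) = |t - 1| / 2 has f'(inf) = 1/2, so the extension stays finite.
print(f_divergence_extended(p, q, lambda t: 0.5 * abs(t - 1), 0.5))                    # 0.5
# KL divergence: f(t) = t log t has f'(inf) = inf, so D_KL(P || Q) is infinite here.
print(f_divergence_extended(p, q, lambda t: t * np.log(t) if t > 0 else 0.0, np.inf))  # inf
```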
Properties
Basic relations between f-divergences
- Linearity: $D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i}$ given a finite sequence of nonnegative real numbers $a_i$ and generators $f_i$.
- $D_f = D_g$ if and only if $f(x) = g(x) + c\,(x - 1)$ for some $c \in \mathbb{R}$.
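Both relations can be checked numerically; a small sketch, assuming the discrete setting and an ad-hoc helper `D` (distributions chosen for illustration):

```python
import numpy as np

def D(p, q, f):
    """Discrete f-divergence: sum_i q_i * f(p_i / q_i)."""
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

f_kl = lambda t: t * np.log(t)          # generator of KL divergence
f_tv = lambda t: 0.5 * np.abs(t - 1)    # generator of total variation

# Linearity: D_{a1 f1 + a2 f2} = a1 D_{f1} + a2 D_{f2}.
a1, a2 = 2.0, 0.5
print(np.isclose(D(p, q, lambda t: a1 * f_kl(t) + a2 * f_tv(t)),
                 a1 * D(p, q, f_kl) + a2 * D(p, q, f_tv)))

# Adding c*(t - 1) to the generator leaves the divergence unchanged.
c = 3.7
print(np.isclose(D(p, q, f_kl), D(p, q, lambda t: f_kl(t) + c * (t - 1))))
```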
Basic properties of f-divergences
Monotonicity under Markov kernels essentially characterizes f-divergences: under suitable regularity and decomposability conditions, a divergence that is non-increasing in every Markov process equals $D_f$ for some convex function $f$. For example, Bregman divergences in general do not have this property and can increase in Markov processes.
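A numerical illustration of this contrast, with a Markov kernel and distributions chosen here for the example: the KL divergence (an f-divergence) cannot increase under the kernel, while the squared Euclidean distance (a Bregman divergence) does increase.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

# Column-stochastic Markov kernel: state 1 -> output 1, states 2 and 3 -> output 2.
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
Kp, Kq = K @ p, K @ q                             # push-forward distributions

kl = lambda a, b: float(np.sum(a * np.log(a / b)))
sq = lambda a, b: float(np.sum((a - b) ** 2))     # a Bregman divergence (squared Euclidean)

print(kl(p, q) >= kl(Kp, Kq))   # True: data-processing inequality for the f-divergence
print(sq(p, q) < sq(Kp, Kq))    # True here: the Bregman divergence increased (0.14 -> 0.18)
```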
Analytic properties
The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances.
Basic variational representation
Let $f^*$ be the convex conjugate of $f$. Let $\operatorname{effdom}(f^*)$ be the effective domain of $f^*$, that is, $\operatorname{effdom}(f^*) = \{y : f^*(y) < \infty\}$. Then we have two variational representations of $D_f$, which we describe below.
Under the above setup,
$$D_f(P\|Q) = \sup_{g : \Omega \to \operatorname{effdom}(f^*)} \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*(g(X))].$$
This is Theorem 7.24 in.
Example applications
Using this theorem on total variation distance, with generator $f(x) = \frac{1}{2}|x - 1|$, its convex conjugate is $f^*(x^*) = x^*$ on $[-\tfrac12, \tfrac12]$, and we obtain
$$TV(P\|Q) = \sup_{|g| \le 1/2} \mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)].$$
For chi-squared divergence, defined by $f(x) = (x - 1)^2$, $f^*(y) = \frac{y^2}{4} + y$, we obtain
$$\chi^2(P\|Q) = \sup_g \mathbb{E}_P[g(X)] - \mathbb{E}_Q\!\left[\frac{g(X)^2}{4} + g(X)\right].$$
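For instance, the total variation representation is tight at the witness $g = \tfrac{1}{2}\operatorname{sign}(p - q)$, while any other feasible $g$ only gives a lower bound; a small numerical sketch with illustrative distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

tv = 0.5 * np.sum(np.abs(p - q))              # exact total variation distance (0.3)

g_opt = 0.5 * np.sign(p - q)                  # optimal witness, |g| <= 1/2
print(np.isclose(np.dot(p, g_opt) - np.dot(q, g_opt), tv))   # True: supremum attained

g = np.random.default_rng(0).uniform(-0.5, 0.5, size=3)      # an arbitrary feasible g
print(np.dot(p, g) - np.dot(q, g) <= tv)                     # True: only a lower bound
```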
Since the variational term is not affine-invariant in $g$, even though the domain over which $g$ varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression.
Replacing $g$ by $a g + b$ and taking the maximum over $a, b \in \mathbb{R}$, we obtain
$$\chi^2(P\|Q) = \sup_g \frac{\left(\mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)]\right)^2}{\operatorname{Var}_Q[g(X)]},$$
which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound.
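A quick numerical check of this leaner expression, with illustrative distributions: the supremum is attained at the likelihood ratio $g = dP/dQ$, and the ratio is unchanged by affine transformations of $g$.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

chi2 = np.sum((p - q) ** 2 / q)               # exact Pearson chi-squared divergence

def ratio(g):
    """(E_P[g] - E_Q[g])^2 / Var_Q[g]."""
    mp, mq = np.dot(p, g), np.dot(q, g)
    return (mp - mq) ** 2 / (np.dot(q, g ** 2) - mq ** 2)

g_opt = p / q                                 # witness g = dP/dQ attains the supremum
print(np.isclose(ratio(g_opt), chi2))               # True
print(np.isclose(ratio(3.0 * g_opt - 7.0), chi2))   # True: affine-invariant in g

g = np.random.default_rng(1).normal(size=3)
print(ratio(g) <= chi2 + 1e-12)               # True: any other g is only a lower bound
```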
For $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$, we have $f_\alpha(x) = \frac{x^\alpha - \alpha x - (1 - \alpha)}{\alpha(\alpha - 1)}$, with range $x \in [0, \infty)$. Its convex conjugate is $f_\alpha^*(y) = \frac{1}{\alpha}\left(x(y)^\alpha - 1\right)$ with range $y \in \left(-\infty, \frac{1}{1 - \alpha}\right)$, where $x(y) = \left((\alpha - 1)y + 1\right)^{\frac{1}{\alpha - 1}}$.
Applying this theorem yields, after substitution with $h = \left((\alpha - 1)g + 1\right)^{\frac{1}{\alpha - 1}}$,
$$D_\alpha(P\|Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to (0, \infty)}\left[\mathbb{E}_Q\!\left[\frac{h(X)^\alpha}{\alpha}\right] + \mathbb{E}_P\!\left[\frac{h(X)^{\alpha - 1}}{1 - \alpha}\right]\right],$$
or, releasing the constraint on $h$,
$$D_\alpha(P\|Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to \mathbb{R}}\left[\mathbb{E}_Q\!\left[\frac{|h(X)|^\alpha}{\alpha}\right] + \mathbb{E}_P\!\left[\frac{|h(X)|^{\alpha - 1}}{1 - \alpha}\right]\right].$$
Setting $\alpha = -1$ recovers the variational representation of the $\chi^2$-divergence obtained above, up to an exchange of the roles of $P$ and $Q$.
The domain over which $h$ varies is not affine-invariant in general, unlike the $\chi^2$-divergence case. The $\chi^2$-divergence is special, since in that case, we can remove the $|\cdot|$ from $|h|$.
For general $\alpha \in (-\infty, 0) \cup (0, 1)$, the domain over which $h$ varies is merely scale-invariant. Similar to above, we can replace $h$ by $a h$, and take the minimum over $a > 0$ to obtain
$$D_\alpha(P\|Q) = \frac{1}{\alpha(1 - \alpha)}\left(1 - \inf_{h : \Omega \to (0, \infty)} \frac{\mathbb{E}_P\!\left[h(X)^{\alpha - 1}\right]^\alpha}{\mathbb{E}_Q\!\left[h(X)^\alpha\right]^{\alpha - 1}}\right).$$
Setting $\alpha = \tfrac12$ (for which $D_{1/2} = 2H^2$), and performing another substitution by $g = \sqrt{h}$, yields two variational representations of the squared Hellinger distance:
$$H^2(P\|Q) = 2 - \inf_{g : \Omega \to (0, \infty)}\left(\mathbb{E}_P\!\left[\frac{1}{g(X)}\right] + \mathbb{E}_Q[g(X)]\right),$$
$$H^2(P\|Q) = 2 - 2\inf_{g : \Omega \to (0, \infty)} \sqrt{\mathbb{E}_P\!\left[\frac{1}{g(X)}\right]\mathbb{E}_Q[g(X)]}.$$
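Both representations can be checked numerically; in the discrete setting, the infimum is attained at $g = \sqrt{dP/dQ}$ (the distributions here are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)   # exact squared Hellinger distance

g = np.sqrt(p / q)                            # optimal witness g = sqrt(dP/dQ)
rep1 = 2 - (np.dot(p, 1 / g) + np.dot(q, g))             # first representation
rep2 = 2 - 2 * np.sqrt(np.dot(p, 1 / g) * np.dot(q, g))  # second representation
print(np.isclose(rep1, h2), np.isclose(rep2, h2))        # True True

g_other = np.array([1.0, 2.0, 0.5])           # any other positive g under-estimates H^2
print(2 - (np.dot(p, 1 / g_other) + np.dot(q, g_other)) <= h2)   # True
```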
Applying this theorem to the KL-divergence, defined by $f(x) = x \ln x$, $f^*(y) = e^{y - 1}$, yields
$$D_{KL}(P\|Q) = \sup_g \mathbb{E}_P[g(X)] - \mathbb{E}_Q\!\left[e^{g(X) - 1}\right].$$
This is strictly less efficient than the Donsker–Varadhan representation
$$D_{KL}(P\|Q) = \sup_g \mathbb{E}_P[g(X)] - \ln \mathbb{E}_Q\!\left[e^{g(X)}\right].$$
This defect is fixed by the next theorem.
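The gap can be seen numerically: for every $g$, the Donsker–Varadhan objective dominates the basic one (because $\ln x \le x/e$), so it yields tighter lower bounds on the KL divergence. A sketch with illustrative distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

kl = np.sum(p * np.log(p / q))                                  # exact KL divergence

basic = lambda g: np.dot(p, g) - np.dot(q, np.exp(g - 1))       # basic objective
dv = lambda g: np.dot(p, g) - np.log(np.dot(q, np.exp(g)))      # Donsker-Varadhan objective

g = np.log(p / q)                       # Donsker-Varadhan witness
print(np.isclose(dv(g), kl))            # True: DV is tight at g = log(dP/dQ)
print(np.isclose(basic(g + 1), kl))     # True: the basic form needs the shifted witness

g_rand = np.random.default_rng(2).normal(size=3)
print(bool(basic(g_rand) <= dv(g_rand) <= kl + 1e-12))   # True: DV dominates for any g
```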
Improved variational representation
Assume the setup in the beginning of this section. This is Theorem 7.25 in.
Example applications
Applying this theorem to the KL-divergence yields the Donsker–Varadhan representation. Attempting to apply this theorem to the general $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$ does not yield a closed-form solution.
Common examples of f-divergences
The following table lists many of the common divergences between probability distributions $P$ and $Q$ and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of $\alpha$-divergence, or linear sums of $\alpha$-divergences. For each f-divergence $D_f$, its generating function is not uniquely defined, but only up to $c \cdot (t - 1)$, where $c$ is any real constant. That is, for any $f$ that generates an f-divergence, we have $D_{f(t)} = D_{f(t) + c \cdot (t - 1)}$. This freedom is not only convenient, but actually necessary.
| Divergence | Corresponding f(t) | Discrete Form |
| --- | --- | --- |
| $\chi^\alpha$-divergence, $\alpha \ge 1$ | $\frac{1}{2}\lvert t - 1\rvert^\alpha$ | $\frac{1}{2}\sum_i \frac{\lvert p_i - q_i\rvert^\alpha}{q_i^{\alpha - 1}}$ |
| Total variation distance | $\frac{1}{2}\lvert t - 1\rvert$ | $\frac{1}{2}\sum_i \lvert p_i - q_i\rvert$ |
| α-divergence | $\begin{cases}\frac{t^\alpha - \alpha t - (1 - \alpha)}{\alpha(\alpha - 1)}, & \alpha \neq 0, 1 \\ t\ln t - t + 1, & \alpha = 1 \\ -\ln t + t - 1, & \alpha = 0\end{cases}$ | - |
| KL-divergence | $t\ln t$ | $\sum_i p_i \ln\frac{p_i}{q_i}$ |
| reverse KL-divergence | $-\ln t$ | $\sum_i q_i \ln\frac{q_i}{p_i}$ |
| Jensen–Shannon divergence | $\frac{1}{2}\left(t\ln t - (t + 1)\ln\frac{t + 1}{2}\right)$ | $\frac{1}{2}\sum_i\left(p_i\ln\frac{2p_i}{p_i + q_i} + q_i\ln\frac{2q_i}{p_i + q_i}\right)$ |
| Jeffreys divergence | $(t - 1)\ln t$ | $\sum_i (p_i - q_i)\ln\frac{p_i}{q_i}$ |
| squared Hellinger distance | $\left(\sqrt{t} - 1\right)^2$ | $\sum_i\left(\sqrt{p_i} - \sqrt{q_i}\right)^2$ |
| Neyman $\chi^2$-divergence | $\frac{(1 - t)^2}{t}$ | $\sum_i \frac{(p_i - q_i)^2}{p_i}$ |
| Pearson $\chi^2$-divergence | $(t - 1)^2$ | $\sum_i \frac{(p_i - q_i)^2}{q_i}$ |
Let $f_\alpha$ be the generator of $\alpha$-divergence, then $f_\alpha$ and $f_{1-\alpha}$ are convex inversions of each other, so $D_\alpha(P\|Q) = D_{1-\alpha}(Q\|P)$. In particular, this shows that the squared Hellinger distance and Jensen–Shannon divergence are symmetric.
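A numerical check of the reflection identity $D_\alpha(P\|Q) = D_{1-\alpha}(Q\|P)$, using the generator $f_\alpha$ from this page and illustrative distributions:

```python
import numpy as np

def f_alpha(t, a):
    """Generator of the alpha-divergence for alpha not in {0, 1}."""
    return (t ** a - a * t - (1 - a)) / (a * (a - 1))

def D(p, q, f):
    """Discrete f-divergence: sum_i q_i * f(p_i / q_i)."""
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
a = 0.3

print(np.isclose(D(p, q, lambda t: f_alpha(t, a)),        # D_alpha(P || Q)
                 D(q, p, lambda t: f_alpha(t, 1 - a))))   # D_{1-alpha}(Q || P) -- True
```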
In the literature, the $\alpha$-divergences are sometimes parametrized as
$$f_\alpha(x) = \begin{cases} \frac{4}{1 - \alpha^2}\left(1 - x^{(1 + \alpha)/2}\right), & \alpha \neq \pm 1, \\ x \ln x, & \alpha = 1, \\ -\ln x, & \alpha = -1, \end{cases}$$
which is equivalent to the parametrization on this page by substituting $\alpha \leftarrow \frac{\alpha + 1}{2}$.
Relations to other statistical divergences
Here, we compare f-divergences with other statistical divergences.
Rényi divergence
The Rényi divergence is a family of divergences defined by
$$R_\alpha(P\|Q) = \frac{1}{\alpha - 1} \ln\left(\mathbb{E}_Q\!\left[\left(\frac{dP}{dQ}\right)^\alpha\right]\right)$$
when $\alpha \in (0, 1) \cup (1, +\infty)$. It is extended to the cases of $\alpha = 0, 1, +\infty$ by taking the limit.
Simple algebra shows that $R_\alpha(P\|Q) = \frac{1}{\alpha - 1}\ln\left(1 + \alpha(\alpha - 1) D_\alpha(P\|Q)\right)$, where $D_\alpha$ is the $\alpha$-divergence defined above.
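A numerical check of this identity with illustrative discrete distributions, using the $\alpha$-divergence generator from this page:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
a = 0.7

renyi = np.log(np.sum(q * (p / q) ** a)) / (a - 1)          # Renyi divergence of order alpha

f_a = lambda t: (t ** a - a * t - (1 - a)) / (a * (a - 1))  # generator of the alpha-divergence
d_alpha = np.sum(q * f_a(p / q))                            # D_alpha(P || Q)

print(np.isclose(renyi, np.log(1 + a * (a - 1) * d_alpha) / (a - 1)))   # True
```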