Mutual information
In probability theory and information theory, the mutual information of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair $(X,Y)$ is from the product of the marginal distributions of $X$ and $Y$. MI is the expected value of the pointwise mutual information.
The quantity was defined and analyzed by Claude Shannon in his landmark paper "A Mathematical Theory of Communication", although he did not call it "mutual information". The term was coined later by Robert Fano. Mutual information is also known as information gain.
Definition
Let $(X,Y)$ be a pair of random variables with values over the space $\mathcal{X}\times\mathcal{Y}$. If their joint distribution is $P_{(X,Y)}$ and the marginal distributions are $P_X$ and $P_Y$, the mutual information is defined as
$$I(X;Y) = D_{\mathrm{KL}}\left(P_{(X,Y)} \,\|\, P_X \otimes P_Y\right)$$
where $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, and $P_X \otimes P_Y$ is the outer product distribution which assigns probability $P_X(x)\cdot P_Y(y)$ to each $(x,y)$.
Expressed in terms of the entropy $H(\cdot)$ and the conditional entropy $H(\cdot\mid\cdot)$ of the random variables $X$ and $Y$, one also has:
$$I(X;Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X)$$
Notice, as per property of the Kullback–Leibler divergence, that $I(X;Y)$ is equal to zero precisely when the joint distribution coincides with the product of the marginals, i.e. when $X$ and $Y$ are independent. $I(X;Y)$ is non-negative. It is a measure of the price for encoding $(X,Y)$ as a pair of independent random variables when in reality they are not.
If the natural logarithm is used, the unit of mutual information is the nat. If the log base 2 is used, the unit of mutual information is the shannon, also known as the bit. If the log base 10 is used, the unit of mutual information is the hartley, also known as the ban or the dit.
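Because the units differ only in the logarithm base, a value in one unit is a constant multiple of the value in another. The following is a minimal sketch of that conversion; the helper name `convert_from_nats` is hypothetical and chosen only for illustration.

```python
import math

def convert_from_nats(value_in_nats: float, unit: str) -> float:
    """Convert an information quantity from nats to nats, bits or hartleys."""
    # Dividing a natural-log quantity by ln(b) re-expresses it in log base b.
    bases = {"nat": math.e, "bit": 2.0, "hartley": 10.0}
    return value_in_nats / math.log(bases[unit])

# ln(2) nats is one bit; ln(10) nats is one hartley.
print(convert_from_nats(math.log(2), "bit"))
print(convert_from_nats(math.log(10), "hartley"))
```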
In terms of PMFs for discrete distributions
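The mutual information of two jointly discrete random variables $X$ and $Y$ is calculated as a double sum:
$$I(X;Y) = \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{(X,Y)}(x,y)\,\log\frac{P_{(X,Y)}(x,y)}{P_X(x)\,P_Y(y)}$$
where $P_{(X,Y)}$ is the joint probability mass function of $X$ and $Y$, and $P_X$ and $P_Y$ are the marginal probability mass functions of $X$ and $Y$ respectively.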
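The double sum can be evaluated directly when the joint probability mass function is available as a table. The sketch below assumes the table is a list of rows with `joint[i][j]` holding the probability of the pair $(x_i, y_j)$; the function name `mutual_information` and the example tables are illustrative, not part of any library.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information of a joint PMF given as a list of rows.

    joint[i][j] is P(X = x_i, Y = y_j); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal PMF of X
    py = [sum(col) for col in zip(*joint)]    # marginal PMF of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                     # convention: 0 log 0 = 0
                total += pxy * math.log(pxy / (px[i] * py[j]), base)
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]    # two independent fair bits
identical = [[0.5, 0.0], [0.0, 0.5]]          # Y is a copy of a fair bit X
print(mutual_information(independent))        # 0.0
print(mutual_information(identical))          # 1.0
```

With the default base 2 the result is in bits; passing `base=math.e` gives nats.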
In terms of PDFs for continuous distributions
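In the case of jointly continuous random variables, the double sum is replaced by a double integral:
$$I(X;Y) = \int_{\mathcal{Y}}\int_{\mathcal{X}} P_{(X,Y)}(x,y)\,\log\frac{P_{(X,Y)}(x,y)}{P_X(x)\,P_Y(y)}\,dx\,dy$$
where $P_{(X,Y)}$ is now the joint probability density function of $X$ and $Y$, and $P_X$ and $P_Y$ are the marginal probability density functions of $X$ and $Y$ respectively.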
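As a worked illustration of the continuous case, consider the standard example of a bivariate normal pair, which is an assumption introduced here and not part of the text above. Writing $h$ for differential entropy and using the closed-form entropies of the normal distribution avoids evaluating the double integral directly:

```latex
% Assumed setting: (X, Y) bivariate normal with variances \sigma_X^2, \sigma_Y^2
% and correlation coefficient \rho, where |\rho| < 1.
\begin{aligned}
h(X) &= \tfrac12 \ln\!\left(2\pi e\,\sigma_X^2\right), \qquad
h(Y) = \tfrac12 \ln\!\left(2\pi e\,\sigma_Y^2\right), \\
h(X,Y) &= \tfrac12 \ln\!\left((2\pi e)^2\,\sigma_X^2\,\sigma_Y^2\,(1-\rho^2)\right), \\
I(X;Y) &= h(X) + h(Y) - h(X,Y) = -\tfrac12 \ln\!\left(1-\rho^2\right)\ \text{nats}.
\end{aligned}
```

The result depends only on $\rho$: it is zero for $\rho = 0$, roughly 0.144 nats for $\rho = 0.5$, and grows without bound as $|\rho| \to 1$.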
Motivation
Intuitively, mutual information measures the information that $X$ and $Y$ share: It measures how much knowing one of these variables reduces uncertainty about the other. For example, if $X$ and $Y$ are independent, then knowing $X$ does not give any information about $Y$ and vice versa, so their mutual information is zero. At the other extreme, if $X$ is a deterministic function of $Y$ and $Y$ is a deterministic function of $X$ then all information conveyed by $X$ is shared with $Y$: knowing $X$ determines the value of $Y$ and vice versa. As a result, the mutual information is the same as the uncertainty contained in $Y$ (or $X$) alone, namely the entropy of $Y$ (or $X$). A very special case of this is when $X$ and $Y$ are the same random variable.
Mutual information is a measure of the inherent dependence expressed in the joint distribution of $X$ and $Y$ relative to the marginal distribution of $X$ and $Y$ under the assumption of independence. Mutual information therefore measures dependence in the following sense: $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent random variables. This is easy to see in one direction: if $X$ and $Y$ are independent, then $P_{(X,Y)}(x,y) = P_X(x)\cdot P_Y(y)$, and therefore:
$$\log\left(\frac{P_{(X,Y)}(x,y)}{P_X(x)\,P_Y(y)}\right) = \log 1 = 0.$$
Moreover, mutual information is nonnegative and symmetric.
Properties
Nonnegativity
Using Jensen's inequality on the definition of mutual information we can show that $I(X;Y)$ is non-negative, i.e.
$$I(X;Y) \geq 0$$
Symmetry
$$I(X;Y) = I(Y;X)$$
The proof is given considering the relationship with entropy, as shown below.
Supermodularity under independence
If $C$ is independent of $(A,B)$, then
$$I(Y;A,B,C) - I(Y;A,B) \geq I(Y;A,C) - I(Y;A)$$
Relation to conditional and joint entropy
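Mutual information can be equivalently expressed as:
$$\begin{aligned}
I(X;Y) &= H(X) - H(X\mid Y) \\
&= H(Y) - H(Y\mid X) \\
&= H(X) + H(Y) - H(X,Y) \\
&= H(X,Y) - H(X\mid Y) - H(Y\mid X)
\end{aligned}$$
where $H(X)$ and $H(Y)$ are the marginal entropies, $H(X\mid Y)$ and $H(Y\mid X)$ are the conditional entropies, and $H(X,Y)$ is the joint entropy of $X$ and $Y$.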
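These identities can be checked numerically on a small example. The sketch below uses a hypothetical 2×2 joint PMF, computes each entropy term separately (the conditional entropies from the conditional distributions, not from the chain rule), and compares every form with the double-sum definition. All values are in bits.

```python
import math

# Hypothetical joint PMF for illustration: joint[i][j] = P(X = x_i, Y = y_j).
joint = [[0.30, 0.10],
         [0.15, 0.45]]
px = [sum(row) for row in joint]          # marginal PMF of X
py = [sum(col) for col in zip(*joint)]    # marginal PMF of Y

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy(p for row in joint for p in row)

# Conditional entropies computed directly from P(x|y) and P(y|x).
h_x_given_y = -sum(p * math.log2(p / py[j])
                   for row in joint for j, p in enumerate(row) if p > 0.0)
h_y_given_x = -sum(p * math.log2(p / px[i])
                   for i, row in enumerate(joint) for p in row if p > 0.0)

# Mutual information from the double-sum definition.
mi_direct = sum(p * math.log2(p / (px[i] * py[j]))
                for i, row in enumerate(joint)
                for j, p in enumerate(row) if p > 0.0)

forms = {
    "H(X) - H(X|Y)": h_x - h_x_given_y,
    "H(Y) - H(Y|X)": h_y - h_y_given_x,
    "H(X) + H(Y) - H(X,Y)": h_x + h_y - h_xy,
    "H(X,Y) - H(X|Y) - H(Y|X)": h_xy - h_x_given_y - h_y_given_x,
}
for name, value in forms.items():
    print(f"{name}: {value:.6f}  (definition: {mi_direct:.6f})")
```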
Notice the analogy to the union, difference, and intersection of two sets: in this respect, all the formulas given above are apparent from the Venn diagram reported at the beginning of the article.
In terms of a communication channel in which the output $Y$ is a noisy version of the input $X$, these relations are summarised in the figure:
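Because $I(X;Y)$ is non-negative, consequently, $H(X) \geq H(X\mid Y)$. Here we give the detailed deduction of $I(X;Y) = H(Y) - H(Y\mid X)$ for the case of jointly discrete random variables:
$$\begin{aligned}
I(X;Y) &= \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} P_{(X,Y)}(x,y)\,\log\frac{P_{(X,Y)}(x,y)}{P_X(x)\,P_Y(y)} \\
&= \sum_{x,y} P_{(X,Y)}(x,y)\,\log\frac{P_{(X,Y)}(x,y)}{P_X(x)} - \sum_{x,y} P_{(X,Y)}(x,y)\,\log P_Y(y) \\
&= \sum_{x,y} P_X(x)\,P_{Y\mid X=x}(y)\,\log P_{Y\mid X=x}(y) - \sum_{x,y} P_{(X,Y)}(x,y)\,\log P_Y(y) \\
&= \sum_{x} P_X(x)\left(\sum_{y} P_{Y\mid X=x}(y)\,\log P_{Y\mid X=x}(y)\right) - \sum_{y}\left(\sum_{x} P_{(X,Y)}(x,y)\right)\log P_Y(y) \\
&= -\sum_{x} P_X(x)\,H(Y\mid X=x) - \sum_{y} P_Y(y)\,\log P_Y(y) \\
&= -H(Y\mid X) + H(Y) \\
&= H(Y) - H(Y\mid X).
\end{aligned}$$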
The proofs of the other identities above are similar. The proof of the general case is similar, with integrals replacing sums.
Intuitively, if entropy $H(Y)$ is regarded as a measure of uncertainty about a random variable, then $H(Y\mid X)$ is a measure of what $X$ does not say about $Y$. This is "the amount of uncertainty remaining about $Y$ after $X$ is known", and thus the right side of the second of these equalities can be read as "the amount of uncertainty in $Y$, minus the amount of uncertainty in $Y$ which remains after $X$ is known", which is equivalent to "the amount of uncertainty in $Y$ which is removed by knowing $X$". This corroborates the intuitive meaning of mutual information as the amount of information that knowing either variable provides about the other.
Note that in the discrete case $H(Y\mid Y) = 0$ and therefore $H(Y) = I(Y;Y)$. Thus $I(Y;Y) \geq I(X;Y)$, and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.
Relation to Kullback–Leibler divergence
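For jointly discrete or jointly continuous pairs $(X,Y)$, mutual information is the Kullback–Leibler divergence from the product of the marginal distributions, $P_X \cdot P_Y$, of the joint distribution $P_{(X,Y)}$, that is,
$$I(X;Y) = D_{\mathrm{KL}}\left(P_{(X,Y)} \,\|\, P_X P_Y\right)$$
Furthermore, let $P_{X\mid Y=y}(x) = P_{(X,Y)}(x,y) / P_Y(y)$ be the conditional mass or density function. Then, we have the identity
$$I(X;Y) = \mathbb{E}_Y\left[D_{\mathrm{KL}}\left(P_{X\mid Y} \,\|\, P_X\right)\right]$$
The proof for jointly discrete random variables is as follows:
$$\begin{aligned}
I(X;Y) &= \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{(X,Y)}(x,y)\,\log\frac{P_{(X,Y)}(x,y)}{P_X(x)\,P_Y(y)} \\
&= \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{X\mid Y=y}(x)\,P_Y(y)\,\log\frac{P_{X\mid Y=y}(x)\,P_Y(y)}{P_X(x)\,P_Y(y)} \\
&= \sum_{y\in\mathcal{Y}} P_Y(y) \sum_{x\in\mathcal{X}} P_{X\mid Y=y}(x)\,\log\frac{P_{X\mid Y=y}(x)}{P_X(x)} \\
&= \sum_{y\in\mathcal{Y}} P_Y(y)\,D_{\mathrm{KL}}\left(P_{X\mid Y=y} \,\|\, P_X\right) \\
&= \mathbb{E}_Y\left[D_{\mathrm{KL}}\left(P_{X\mid Y} \,\|\, P_X\right)\right].
\end{aligned}$$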
Similarly this identity can be established for jointly continuous random variables.
Note that here the Kullback–Leibler divergence involves integration over the values of the random variable $X$ only, and the expression $D_{\mathrm{KL}}(P_{X\mid Y} \,\|\, P_X)$ still denotes a random variable because $Y$ is random. Thus mutual information can also be understood as the expectation over $Y$ of the Kullback–Leibler divergence of the conditional distribution $P_{X\mid Y}$ of $X$ given $Y$ from the univariate distribution $P_X$ of $X$: the more different the distributions $P_{X\mid Y}$ and $P_X$ are on average, the greater the information gain.
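The two readings of mutual information as a Kullback–Leibler divergence can be compared numerically. The following sketch uses a hypothetical 2×2 joint PMF and an illustrative helper `kl_divergence`. It computes the divergence of the joint distribution from the product of the marginals, then the average over $y$ of the divergence of the conditional distribution of $X$ given $Y = y$ from the marginal of $X$.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two PMFs listed over the same finite support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical joint PMF for illustration: joint[i][j] = P(X = x_i, Y = y_j).
joint = [[0.30, 0.10],
         [0.15, 0.45]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Direct form: divergence of the joint from the product of the marginals.
flat_joint = [p for row in joint for p in row]
flat_product = [a * b for a in px for b in py]   # same row-major order
mi_direct = kl_divergence(flat_joint, flat_product)

# Expected form: average over y of D_KL(P_{X|Y=y} || P_X).
mi_expected = 0.0
for j, p_y in enumerate(py):
    conditional = [joint[i][j] / p_y for i in range(len(px))]
    mi_expected += p_y * kl_divergence(conditional, px)

print(mi_direct, mi_expected)   # the two values agree up to rounding
```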
Bayesian estimation of mutual information
If samples from a joint distribution are available, a Bayesian approach can be used to estimate the mutual information of that distribution. The first work to do this also showed how to do Bayesian estimation of many other information-theoretic properties besides mutual information. Subsequent researchers have rederived and extended this analysis, including with a prior specifically tailored to estimation of mutual information per se. More recently, an estimation method accounting for continuous and multivariate outputs $Y$ has also been proposed.
Independence assumptions
The Kullback–Leibler divergence formulation of the mutual information presupposes that one is interested in comparing $P_{(X,Y)}(x,y)$ to the fully factorized outer product $P_X(x)\cdot P_Y(y)$. In many problems, such as non-negative matrix factorization, one is interested in less extreme factorizations; specifically, one wishes to compare $P_{(X,Y)}(x,y)$ to a low-rank matrix approximation in some unknown variable $w$; that is, to what degree one might have
$$P_{(X,Y)}(x,y) \approx \sum_{w} p'(x,w)\,p''(w,y)$$
Alternately, one might be interested in knowing how much more information $P_{(X,Y)}(x,y)$ carries over its factorization. In such a case, the excess information that the full distribution $P_{(X,Y)}(x,y)$ carries over the matrix factorization is given by the Kullback–Leibler divergence
$$I_{\mathrm{LRMA}} = \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{(X,Y)}(x,y)\,\log\left(\frac{P_{(X,Y)}(x,y)}{\sum_{w} p'(x,w)\,p''(w,y)}\right)$$
The conventional definition of the mutual information is recovered in the extreme case that the process $W$ has only one value for $w$.
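The two extremes of this comparison can be illustrated on a small table. In the sketch below, the function name `excess_information`, the argument names `left` and `right` (standing for the two factor matrices), and the example PMF are all assumptions made for illustration. With a single value of $w$ the factors are taken to be the marginals, which reproduces the ordinary mutual information. With two values of $w$ a 2×2 table can be reproduced exactly, so no excess information remains.

```python
import math

def excess_information(joint, left, right):
    """KL divergence (bits) of `joint` from the matrix product of the factors.

    left[x][w] and right[w][y] are the two factor matrices; their product
    is assumed to be strictly positive wherever `joint` is positive.
    """
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:
                approx = sum(left[x][w] * right[w][y] for w in range(len(right)))
                total += pxy * math.log2(pxy / approx)
    return total

# Hypothetical joint PMF for illustration.
joint = [[0.30, 0.10],
         [0.15, 0.45]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# One value of w: the approximation is the outer product of the marginals.
rank_one = excess_information(joint, [[p] for p in px], [py])

# Two values of w: the joint itself times the identity is an exact factorization.
rank_two = excess_information(joint, joint, [[1.0, 0.0], [0.0, 1.0]])

print(rank_one)   # equals I(X;Y) for this table
print(rank_two)   # 0.0
```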
Variations
Several variations on mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.
Metric
Many applications require a metric, that is, a distance measure between pairs of points. The quantity
$$\begin{aligned}
d(X,Y) &= H(X,Y) - I(X;Y) \\
&= H(X) + H(Y) - 2\,I(X;Y) \\
&= H(X\mid Y) + H(Y\mid X) \\
&= 2\,H(X,Y) - H(X) - H(Y)
\end{aligned}$$
satisfies the properties of a metric, where equality $X = Y$ is understood to mean that $X$ can be completely determined from $Y$.
This distance metric is also known as the variation of information.
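If $X, Y$ are discrete random variables then all the entropy terms are non-negative, so $0 \leq d(X,Y) \leq H(X,Y)$ and one can define a normalized distance
$$D(X,Y) = \frac{d(X,Y)}{H(X,Y)} \leq 1.$$
Plugging in the definitions shows that
$$D(X,Y) = 1 - \frac{I(X;Y)}{H(X,Y)}.$$
This is known as the Rajski Distance. In a set-theoretic interpretation of information, this is effectively the Jaccard distance between $X$ and $Y$.
Finally,
$$D'(X,Y) = 1 - \frac{I(X;Y)}{\max\left\{H(X), H(Y)\right\}}$$
is also a metric.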
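The three distances can be computed from the same entropy terms. The sketch below assumes a joint PMF given as a list of rows and returns the variation of information $H(X,Y) - I(X;Y)$, the Rajski distance (the variation of information divided by $H(X,Y)$), and the variant normalized by $\max\{H(X), H(Y)\}$. The function name `information_distances` is illustrative, and the code assumes $H(X,Y) > 0$ so that the normalizations are defined.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def information_distances(joint):
    """Return (variation of information, Rajski distance, max-normalized distance).

    joint[i][j] is P(X = x_i, Y = y_j); assumes H(X,Y) > 0.
    """
    h_x = entropy(sum(row) for row in joint)
    h_y = entropy(sum(col) for col in zip(*joint))
    h_xy = entropy(p for row in joint for p in row)
    mi = h_x + h_y - h_xy
    variation = h_xy - mi
    rajski = variation / h_xy
    max_normalized = 1.0 - mi / max(h_x, h_y)
    return variation, rajski, max_normalized

identical = [[0.5, 0.0], [0.0, 0.5]]          # Y is a copy of a fair bit X
independent = [[0.25, 0.25], [0.25, 0.25]]    # two independent fair bits
print(information_distances(identical))       # (0.0, 0.0, 0.0)
print(information_distances(independent))     # (2.0, 1.0, 1.0)
```

Identical variables are at distance zero under all three measures, while independent variables attain the maximum normalized distance of 1.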
Conditional mutual information
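Sometimes it is useful to express the mutual information of two random variables conditioned on a third.
For jointly discrete random variables this takes the form
$$I(X;Y\mid Z) = \sum_{z\in\mathcal{Z}}\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_Z(z)\,P_{(X,Y)\mid Z}(x,y\mid z)\,\log\frac{P_{(X,Y)\mid Z}(x,y\mid z)}{P_{X\mid Z}(x\mid z)\,P_{Y\mid Z}(y\mid z)},$$
which can be simplified as
$$I(X;Y\mid Z) = \sum_{z\in\mathcal{Z}}\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{(X,Y,Z)}(x,y,z)\,\log\frac{P_{(X,Y,Z)}(x,y,z)\,P_Z(z)}{P_{(X,Z)}(x,z)\,P_{(Y,Z)}(y,z)}.$$
For jointly continuous random variables this takes the form
$$I(X;Y\mid Z) = \int_{\mathcal{Z}}\int_{\mathcal{Y}}\int_{\mathcal{X}} P_Z(z)\,P_{(X,Y)\mid Z}(x,y\mid z)\,\log\frac{P_{(X,Y)\mid Z}(x,y\mid z)}{P_{X\mid Z}(x\mid z)\,P_{Y\mid Z}(y\mid z)}\,dx\,dy\,dz,$$
which can be simplified as
$$I(X;Y\mid Z) = \int_{\mathcal{Z}}\int_{\mathcal{Y}}\int_{\mathcal{X}} P_{(X,Y,Z)}(x,y,z)\,\log\frac{P_{(X,Y,Z)}(x,y,z)\,P_Z(z)}{P_{(X,Z)}(x,z)\,P_{(Y,Z)}(y,z)}\,dx\,dy\,dz.$$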
Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that
$$I(X;Y\mid Z) \geq 0$$
for discrete, jointly distributed random variables $X, Y, Z$. This result has been used as a basic building block for proving other inequalities in information theory.
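Both directions of that change can be seen on two-bit examples. The sketch below evaluates the simplified triple sum for a joint PMF stored as a dictionary keyed by `(x, y, z)`; the function name `conditional_mutual_information` and both example distributions are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def conditional_mutual_information(joint):
    """I(X;Y|Z) in bits for a PMF given as {(x, y, z): probability}."""
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yz = defaultdict(float)
    for (x, y, z), p in joint.items():
        p_z[z] += p            # marginal of Z
        p_xz[x, z] += p        # marginal of (X, Z)
        p_yz[y, z] += p        # marginal of (Y, Z)
    return sum(
        p * math.log2(p * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
        for (x, y, z), p in joint.items()
        if p > 0.0
    )

# X and Y are independent fair bits and Z = X XOR Y, so I(X;Y) = 0,
# yet once Z is known, X determines Y.
xor_joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
print(conditional_mutual_information(xor_joint))    # 1.0

# X = Y = Z is a single fair bit, so I(X;Y) = 1 bit,
# but given Z nothing remains to be learned.
copy_joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
print(conditional_mutual_information(copy_joint))   # 0.0
```

In the first example conditioning raises the mutual information from 0 to 1 bit; in the second it lowers it from 1 bit to 0. In both the conditional value is non-negative.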
Interaction information
Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation and dual total correlation. The expression and study of multivariate higher-degree mutual information was achieved in two seemingly independent works: McGill, who called these functions "interaction information", and Hu Kuo Ting. Interaction information is defined for one variable as follows:
$$I(X_1) = H(X_1)$$
and for $n > 1$,
$$I(X_1;\,\ldots\,;X_n) = I(X_1;\,\ldots\,;X_{n-1}) - I(X_1;\,\ldots\,;X_{n-1}\mid X_n).$$
Some authors reverse the order of the terms on the right-hand side of the preceding equation, which changes the sign when the number of random variables is odd. Note that