Fisher information
In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.
The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized and explored by the statistician Sir Ronald Fisher. The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates. It can also be used in the formulation of test statistics, such as the Wald test.
In Bayesian statistics, the Fisher information plays a role in the derivation of non-informative prior distributions according to Jeffreys' rule. It also appears as the large-sample covariance of the posterior distribution, provided that the prior is sufficiently smooth. The same result is used when approximating the posterior with Laplace's approximation, where the Fisher information appears as the covariance of the fitted Gaussian.
Statistical systems of a scientific nature whose likelihood functions obey shift invariance have been shown to obey maximum Fisher information. The level of the maximum depends upon the nature of the system constraints.
Definition
The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends. Let f(X; θ) be the probability density function (or probability mass function) for X conditioned on the value of θ. It describes the probability that we observe a given outcome of X, given a known value of θ. If f is sharply peaked with respect to changes in θ, it is easy to indicate the "correct" value of θ from the data, or equivalently, the data X provides a lot of information about the parameter θ. If f is flat and spread out, then it would take many samples of X to estimate the actual "true" value of θ that would be obtained using the entire population being sampled. This suggests studying some kind of variance with respect to θ.

Formally, the partial derivative with respect to θ of the natural logarithm of the likelihood function is called the score. Under certain regularity conditions, if θ is the true parameter, it can be shown that the expected value of the score, evaluated at the true parameter value, is 0:

\operatorname{E}\left[\frac{\partial}{\partial\theta} \log f(X;\theta) \,\Big|\, \theta\right] = \int \frac{\partial}{\partial\theta} f(x;\theta)\, dx = \frac{\partial}{\partial\theta} \int f(x;\theta)\, dx = \frac{\partial}{\partial\theta}\, 1 = 0.
The Fisher information is defined to be the variance of the score:

I(\theta) = \operatorname{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2} \,\Big|\, \theta\right] = \int \left(\frac{\partial}{\partial\theta} \log f(x;\theta)\right)^{2} f(x;\theta)\, dx.
Note that I(θ) ≥ 0. A random variable carrying high Fisher information implies that the absolute value of the score is often high. The Fisher information is not a function of a particular observation, as the random variable X has been averaged out.
If log f(X; θ) is twice differentiable with respect to θ, and under certain additional regularity conditions, then the Fisher information may also be written as

I(\theta) = -\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta^{2}} \log f(X;\theta) \,\Big|\, \theta\right].
Begin by taking the second derivative of log f(X; θ):

\frac{\partial^{2}}{\partial\theta^{2}} \log f(X;\theta) = \frac{\frac{\partial^{2}}{\partial\theta^{2}} f(X;\theta)}{f(X;\theta)} - \left(\frac{\frac{\partial}{\partial\theta} f(X;\theta)}{f(X;\theta)}\right)^{2} = \frac{\frac{\partial^{2}}{\partial\theta^{2}} f(X;\theta)}{f(X;\theta)} - \left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2}.
Now take the expectation of each term on both sides:

\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta^{2}} \log f(X;\theta) \,\Big|\, \theta\right] = -\operatorname{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2} \,\Big|\, \theta\right] + \operatorname{E}\left[\frac{\frac{\partial^{2}}{\partial\theta^{2}} f(X;\theta)}{f(X;\theta)} \,\Big|\, \theta\right].
Next, we show that the last term is equal to 0:

\operatorname{E}\left[\frac{\frac{\partial^{2}}{\partial\theta^{2}} f(X;\theta)}{f(X;\theta)} \,\Big|\, \theta\right] = \int \frac{\partial^{2}}{\partial\theta^{2}} f(x;\theta)\, dx = \frac{\partial^{2}}{\partial\theta^{2}} \int f(x;\theta)\, dx = \frac{\partial^{2}}{\partial\theta^{2}}\, 1 = 0.
Therefore,

I(\theta) = \operatorname{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2} \,\Big|\, \theta\right] = -\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta^{2}} \log f(X;\theta) \,\Big|\, \theta\right].
Thus, the Fisher information may be seen as the curvature of the support curve. Near the maximum likelihood estimate, low Fisher information indicates that the maximum appears to be "blunt", that is, there are many points in the neighborhood that provide a similar log-likelihood. Conversely, a high Fisher information indicates that the maximum is "sharp".
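To make the two equivalent expressions concrete, the following is a minimal numerical sketch (not part of the article) under an assumed example model, X ~ Exponential with rate θ, for which I(θ) = 1/θ². Both the variance-of-score form and the negative-expected-curvature form are estimated by Monte Carlo.

```python
# Minimal sketch, assuming X ~ Exponential(rate=theta), f(x; theta) = theta*exp(-theta*x),
# for which the Fisher information is I(theta) = 1/theta^2.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=1_000_000)  # samples from f(.; theta)

# score = d/dtheta log f(X; theta) = 1/theta - X
score = 1.0 / theta - x
# second derivative: d^2/dtheta^2 log f(X; theta) = -1/theta^2 (constant in x for this model)
second_deriv = np.full_like(x, -1.0 / theta**2)

print(np.mean(score))           # ~0: the score has zero mean at the true theta
print(np.var(score))            # ~0.25 = 1/theta^2, variance-of-score form
print(-np.mean(second_deriv))   # ~0.25, negative expected curvature form
```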
Regularity conditions
The regularity conditions are as follows:
- The partial derivative of f with respect to θ exists almost everywhere.
- The integral of f can be differentiated under the integral sign with respect to θ.
- The support of f does not depend on θ.
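The following is a minimal sketch, assuming the classical example X ~ Uniform(0, θ), whose support depends on θ and so violates the third condition; it shows how the zero-mean-score identity used above can then fail. The model choice is purely illustrative and not from the article.

```python
# Minimal sketch, assuming X ~ Uniform(0, theta): the support depends on theta.
# On the support, log f(x; theta) = -log(theta), so the score is -1/theta for every x,
# and its expectation is -1/theta instead of 0.
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0
x = rng.uniform(0.0, theta, size=100_000)

score = np.full_like(x, -1.0 / theta)  # d/dtheta log f(x; theta) = -1/theta on (0, theta)
print(np.mean(score))                  # ~ -1/3, not 0: the zero-mean-score identity fails
```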
In terms of likelihood
Because the likelihood of θ given X is always proportional to the probability f(X; θ), their logarithms necessarily differ by a constant that is independent of θ, and the derivatives of these logarithms with respect to θ are necessarily equal. Thus one can substitute the log-likelihood l(θ; X) for log f(X; θ) in the definitions of Fisher information.
Samples of any size
The value X can represent a single sample drawn from a single distribution or can represent a collection of samples drawn from a collection of distributions. If there are n samples and the corresponding n distributions are statistically independent, then the Fisher information will necessarily be the sum of the single-sample Fisher information values, one for each single sample from its distribution. In particular, if the n distributions are independent and identically distributed, then the Fisher information will necessarily be n times the Fisher information of a single sample from the common distribution. In other words, the Fisher information of i.i.d. observations of a sample of size n from a population is equal to the product of n and the Fisher information of a single observation from the same population.
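A minimal numerical sketch of this additivity, assuming n i.i.d. observations from a Normal(μ, σ²) model with known σ (an illustrative example, not from the article): the variance of the joint score is approximately n times the single-observation information 1/σ².

```python
# Minimal sketch, assuming X_i ~ Normal(mu, sigma^2) with known sigma:
# the joint score is the sum of per-observation scores, so its variance
# (the Fisher information about mu) is n times the single-observation value 1/sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.5, 2.0, 10
reps = 200_000

x = rng.normal(mu, sigma, size=(reps, n))
# per-observation score w.r.t. mu: d/dmu log f(x_i; mu) = (x_i - mu) / sigma^2
joint_score = np.sum((x - mu) / sigma**2, axis=1)

print(np.var(joint_score))   # ~ n / sigma^2 = 2.5
print(n / sigma**2)          # n times the single-sample information
```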
Informal derivation of the Cramér–Rao bound
The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of θ. The following is one method of deriving the Cramér–Rao bound, a result which describes use of the Fisher information.

Informally, we begin by considering an unbiased estimator θ̂(X). Mathematically, "unbiased" means that

\operatorname{E}\left[\hat{\theta}(X) - \theta \,\Big|\, \theta\right] = \int \left(\hat{\theta}(x) - \theta\right) f(x;\theta)\, dx = 0 \quad \text{regardless of the value of } \theta.
This expression is zero independent of θ, so its partial derivative with respect to θ must also be zero. By the product rule, this partial derivative is also equal to

0 = \frac{\partial}{\partial\theta} \int \left(\hat{\theta}(x) - \theta\right) f(x;\theta)\, dx = \int \left(\hat{\theta}(x) - \theta\right) \frac{\partial f}{\partial\theta}\, dx - \int f\, dx.
For each θ, the likelihood function is a probability density function, and therefore ∫ f dx = 1. By using the chain rule on the partial derivative of log f and then dividing and multiplying by f(x; θ), one can verify that

\frac{\partial f}{\partial\theta} = f\, \frac{\partial \log f}{\partial\theta}.
Using these two facts in the above, we get

\int \left(\hat{\theta} - \theta\right) f\, \frac{\partial \log f}{\partial\theta}\, dx = 1.
Factoring the integrand gives

\int \left(\left(\hat{\theta} - \theta\right) \sqrt{f}\right) \left(\sqrt{f}\, \frac{\partial \log f}{\partial\theta}\right) dx = 1.
Squaring the expression in the integral, the Cauchy–Schwarz inequality yields

1 = \left(\int \left[\left(\hat{\theta} - \theta\right) \sqrt{f}\right] \cdot \left[\sqrt{f}\, \frac{\partial \log f}{\partial\theta}\right] dx\right)^{2} \leq \left[\int \left(\hat{\theta} - \theta\right)^{2} f\, dx\right] \cdot \left[\int \left(\frac{\partial \log f}{\partial\theta}\right)^{2} f\, dx\right].
The second bracketed factor is defined to be the Fisher information, while the first bracketed factor is the mean-squared error of the estimator θ̂. Since the estimator is unbiased, its MSE equals its variance. By rearranging, the inequality tells us that

\operatorname{Var}\left(\hat{\theta}\right) \geq \frac{1}{I(\theta)}.
In other words, the precision to which we can estimate θ is fundamentally limited by the Fisher information of the likelihood function.
Alternatively, the same conclusion can be obtained directly from the Cauchy–Schwarz inequality for random variables, Cov(A, B)² ≤ Var(A) Var(B), applied to the random variables θ̂(X) and ∂/∂θ log f(X; θ), and observing that for unbiased estimators we have

\operatorname{Cov}\left(\hat{\theta}(X),\, \frac{\partial}{\partial\theta} \log f(X;\theta)\right) = \int \hat{\theta}(x)\, \frac{\partial \log f(x;\theta)}{\partial\theta}\, f(x;\theta)\, dx = \frac{\partial}{\partial\theta} \int \hat{\theta}(x)\, f(x;\theta)\, dx = \frac{\partial}{\partial\theta} \operatorname{E}\left[\hat{\theta}(X)\right] = \frac{\partial\theta}{\partial\theta} = 1.
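The following Monte Carlo sketch (an assumed example, not from the article) checks both the covariance identity and the bound for n i.i.d. exponential observations with mean μ, where the sample mean is unbiased and attains the bound.

```python
# Minimal sketch, assuming n i.i.d. samples X_i ~ Exponential with mean mu and the
# sample mean as the (unbiased) estimator; here 1/I(mu) = mu^2/n and the bound is attained.
import numpy as np

rng = np.random.default_rng(2)
mu, n, reps = 1.5, 20, 200_000

x = rng.exponential(scale=mu, size=(reps, n))
estimator = x.mean(axis=1)
# joint score w.r.t. mu: sum_i (x_i - mu) / mu^2
score = np.sum(x - mu, axis=1) / mu**2

fisher_info = n / mu**2                        # analytic I_n(mu)
print(np.cov(estimator, score)[0, 1])          # ~1: Cov(estimator, score) = 1 for unbiased estimators
print(np.var(estimator), 1.0 / fisher_info)    # variance ~ mu^2/n = 0.1125, matching the bound
```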
Examples
Single-parameter Bernoulli experiment
A Bernoulli trial is a random variable with two possible outcomes, 0 and 1, with 1 having a probability of θ. The outcome can be thought of as determined by the toss of a biased coin, with the probability of heads being θ and the probability of tails being 1 − θ.

Let X be a Bernoulli trial of one sample from the distribution. The Fisher information contained in X may be calculated to be:

I(\theta) = -\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta^{2}} \log\left(\theta^{X} (1-\theta)^{1-X}\right) \,\Big|\, \theta\right] = -\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta^{2}} \left(X \log\theta + (1-X) \log(1-\theta)\right) \,\Big|\, \theta\right] = \operatorname{E}\left[\frac{X}{\theta^{2}} + \frac{1-X}{(1-\theta)^{2}} \,\Big|\, \theta\right] = \frac{\theta}{\theta^{2}} + \frac{1-\theta}{(1-\theta)^{2}} = \frac{1}{\theta(1-\theta)}.
Because Fisher information is additive, the Fisher information contained in n independent Bernoulli trials is therefore

I(\theta) = \frac{n}{\theta(1-\theta)}.
If x_i is one of the 2^n possible outcome sequences of n independent Bernoulli trials and x_{ij} is the jth outcome of the ith sequence, then the probability of x_i is given by

p(x_i, \theta) = \prod_{j=1}^{n} \theta^{x_{ij}} (1-\theta)^{1-x_{ij}}.

The sample mean of the ith sequence is \mu_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}. The expected value of the sample mean is

\operatorname{E}(\mu) = \sum_{i} p(x_i, \theta)\, \mu_i = \theta,

where the sum is over all 2^n possible outcome sequences. The expected value of the square of the sample mean is

\operatorname{E}(\mu^{2}) = \sum_{i} p(x_i, \theta)\, \mu_i^{2} = \frac{\theta\left(1 + (n-1)\theta\right)}{n},

so the variance in the value of the mean is

\operatorname{E}(\mu^{2}) - \left[\operatorname{E}(\mu)\right]^{2} = \frac{\theta(1-\theta)}{n}.
It is seen that the Fisher information is the reciprocal of the variance of the mean number of successes in n Bernoulli trials. This is generally true. In this case, the Cramér–Rao bound is an equality.
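A small simulation sketch of the claim above, under the same Bernoulli model (the parameter values are illustrative assumptions): the empirical variance of the mean number of successes should be close to the reciprocal of n/(θ(1−θ)).

```python
# Minimal sketch: variance of the sample mean of n Bernoulli(theta) trials
# versus the reciprocal of the Fisher information n / (theta * (1 - theta)).
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 50, 200_000

x = rng.binomial(1, theta, size=(reps, n))
sample_mean = x.mean(axis=1)

fisher_info = n / (theta * (1.0 - theta))
print(np.var(sample_mean))   # ~ theta*(1-theta)/n = 0.0042
print(1.0 / fisher_info)     # reciprocal of the Fisher information, same value
```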
Estimate ''θ'' from ''X'' ~ Bern(√''θ'')
As another toy example, consider a random variable X with possible outcomes 0 and 1, with probabilities 1 − √θ and √θ, respectively, for some θ ∈ [0, 1]. Our goal is to estimate θ from observations of X.

The Fisher information reads in this case

I(\theta) = \operatorname{E}\left[\left(\frac{\partial}{\partial\theta} \log\left((\sqrt{\theta})^{X} (1-\sqrt{\theta})^{1-X}\right)\right)^{2} \,\Big|\, \theta\right] = \frac{1}{4\, \theta^{3/2} \left(1 - \sqrt{\theta}\right)}.

This expression can also be derived directly from the reparametrization formula given below. More generally, for any sufficiently regular function h such that h(θ) ∈ (0, 1), the Fisher information to retrieve θ from X ~ Bern(h(θ)) is similarly computed to be

I(\theta) = \frac{h'(\theta)^{2}}{h(\theta)\left(1 - h(\theta)\right)}.
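A minimal numerical check of the closed form above (the parameter value is an illustrative assumption): the variance of the score under X ~ Bern(√θ) should approach 1/(4 θ^{3/2}(1 − √θ)).

```python
# Minimal sketch, assuming X ~ Bernoulli(sqrt(theta)): Monte Carlo estimate of I(theta)
# via the variance of the score, compared with 1 / (4 * theta**1.5 * (1 - sqrt(theta))).
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 0.4, 2_000_000
p = np.sqrt(theta)

x = rng.binomial(1, p, size=reps)
# score: d/dtheta [ x*log(sqrt(theta)) + (1-x)*log(1-sqrt(theta)) ]
score = x / (2.0 * theta) - (1 - x) / (2.0 * p * (1.0 - p))

print(np.var(score))                               # Monte Carlo estimate of I(theta)
print(1.0 / (4.0 * theta**1.5 * (1.0 - p)))        # closed-form value, ~2.69
```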
Matrix form
When there are N parameters, so that θ is an N × 1 vector θ = (θ_1, θ_2, …, θ_N)^T, the Fisher information takes the form of an N × N matrix. This matrix is called the Fisher information matrix (FIM) and has typical element

\left[I(\theta)\right]_{i,j} = \operatorname{E}\left[\left(\frac{\partial}{\partial\theta_{i}} \log f(X;\theta)\right) \left(\frac{\partial}{\partial\theta_{j}} \log f(X;\theta)\right) \,\Big|\, \theta\right].

The FIM is a positive semidefinite matrix. If it is positive definite, then it defines a Riemannian metric on the N-dimensional parameter space. The topic information geometry uses this to connect Fisher information to differential geometry, and in that context, this metric is known as the Fisher information metric.
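As a concrete sketch of the matrix definition (an assumed two-parameter example, not from the article): for a Normal(μ, σ) model the FIM is diag(1/σ², 2/σ²), and it can be estimated as the expected outer product of the score vector.

```python
# Minimal sketch, assuming X ~ Normal(mu, sigma) parameterized by (mu, sigma):
# estimate the FIM as E[score score^T] and compare with diag(1/sigma^2, 2/sigma^2).
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, reps = 1.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=reps)

# score vector (d/dmu log f, d/dsigma log f) evaluated at the true parameters
score = np.stack([(x - mu) / sigma**2,
                  -1.0 / sigma + (x - mu)**2 / sigma**3])

fim_estimate = score @ score.T / reps            # Monte Carlo E[score score^T]
fim_analytic = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
print(fim_estimate)
print(fim_analytic)
```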
Under certain regularity conditions, the Fisher information matrix may also be written as

\left[I(\theta)\right]_{i,j} = -\operatorname{E}\left[\frac{\partial^{2}}{\partial\theta_{i}\, \partial\theta_{j}} \log f(X;\theta) \,\Big|\, \theta\right].
The result is interesting in several ways:
- It is equal to minus the expected Hessian of the relative entropy.
- It can be used as a Riemannian metric for defining Fisher–Rao geometry when it is positive-definite.
- It can be understood as a metric induced from the Euclidean metric, after appropriate change of variable.
- In its complex-valued form, it is the Fubini–Study metric.
- It is the key part of the proof of Wilks' theorem, which allows confidence region estimates for maximum likelihood estimation without needing the Likelihood Principle.
- In cases where the analytical calculations of the FIM above are difficult, it is possible to form an average of easy Monte Carlo estimates of the Hessian of the negative log-likelihood function as an estimate of the FIM. The estimates may be based on values of the negative log-likelihood function or the gradient of the negative log-likelihood function; no analytical calculation of the Hessian of the negative log-likelihood function is needed.
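A minimal sketch of the Monte Carlo idea in the last bullet, under an assumed Normal(μ, σ) model where the answer diag(1/σ², 2/σ²) is known so the estimate can be checked; the Hessian of the negative log-likelihood is obtained by finite differences rather than analytically. The helper names (neg_log_lik, numerical_hessian) are hypothetical.

```python
# Minimal sketch: average finite-difference Hessians of the negative log-likelihood
# over simulated observations to estimate the FIM, with no analytic Hessian required.
# Assumed model: one observation from Normal(mu, sigma); target FIM = diag(1/sigma^2, 2/sigma^2).
import numpy as np

rng = np.random.default_rng(6)
true_params = np.array([1.0, 2.0])  # (mu, sigma)

def neg_log_lik(params, x):
    mu, sigma = params
    return np.log(sigma) + 0.5 * ((x - mu) / sigma) ** 2

def numerical_hessian(f, params, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at params."""
    k = len(params)
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            pp = params.copy(); pp[i] += eps; pp[j] += eps
            pm = params.copy(); pm[i] += eps; pm[j] -= eps
            mp = params.copy(); mp[i] -= eps; mp[j] += eps
            mm = params.copy(); mm[i] -= eps; mm[j] -= eps
            hess[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps**2)
    return hess

reps = 20_000
fim = np.zeros((2, 2))
for _ in range(reps):
    x = rng.normal(true_params[0], true_params[1])   # one simulated observation
    fim += numerical_hessian(lambda p: neg_log_lik(p, x), true_params)
fim /= reps

print(fim)   # ~ [[0.25, 0], [0, 0.5]]
print(np.diag([1.0 / true_params[1]**2, 2.0 / true_params[1]**2]))  # analytic FIM
```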