Bayesian inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to calculate the probability of a hypothesis given the available evidence, and to update that probability as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".
Introduction to Bayes' rule
Formal explanation
Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},$$

where
- $H$ stands for any hypothesis whose probability may be affected by data. Often there are competing hypotheses, and the task is to determine which is the most probable.
- $P(H)$, the prior probability, is the estimate of the probability of the hypothesis $H$ before the data $E$, the current evidence, is observed.
- $E$, the evidence, corresponds to new data that were not used in computing the prior probability.
- $P(H \mid E)$, the posterior probability, is the probability of $H$ given $E$, i.e., after $E$ is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
- $P(E \mid H)$ is the probability of observing $E$ given $H$ and is called the likelihood. As a function of $H$ with $E$ fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, $E$, while the posterior probability is a function of the hypothesis, $H$.
- $P(E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered and hence does not factor into determining the relative probabilities of different hypotheses.
In cases where $\neg H$, the logical negation of $H$, is a valid likelihood, Bayes' rule can be rewritten as follows:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} = \frac{1}{1 + \left(\frac{1}{P(H)} - 1\right)\frac{P(E \mid \neg H)}{P(E \mid H)}}$$

because

$$P(E) = P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)$$

and

$$P(H) + P(\neg H) = 1.$$

This focuses attention on the term

$$\left(\frac{1}{P(H)} - 1\right)\frac{P(E \mid \neg H)}{P(E \mid H)}.$$

If that term is approximately 1, then the probability of the hypothesis given the evidence, $P(H \mid E)$, is about $1/2$: the hypothesis is about as likely as not. If that term is very small, close to zero, then $P(H \mid E)$ is close to 1, and the hypothesis given the evidence is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (before considering the evidence) is unlikely, then $P(H)$ is small, $\frac{1}{P(H)}$ is much larger than 1, and this term can be approximated as $\frac{P(E \mid \neg H)}{P(E \mid H)\,P(H)}$, so relevant probabilities can be compared directly to each other.
One quick and easy way to remember the equation is to use the rule of multiplication:

$$P(E \cap H) = P(E \mid H)\,P(H) = P(H \mid E)\,P(E).$$
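As a minimal numerical sketch of the rule above (the prior and likelihoods here are made-up values chosen only for illustration, not taken from the text), the posterior for a hypothesis $H$ and its negation $\neg H$ can be computed as follows:

```python
# Minimal sketch of Bayes' rule for a hypothesis H and its negation ¬H.
# All numbers are illustrative assumptions.

p_h = 0.01              # P(H): prior probability of the hypothesis
p_e_given_h = 0.95      # P(E | H): likelihood of the evidence if H is true
p_e_given_not_h = 0.05  # P(E | ¬H): likelihood of the evidence if H is false

# Marginal likelihood: P(E) = P(E|H) P(H) + P(E|¬H) P(¬H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior by Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(H | E) = {p_h_given_e:.4f}")  # ≈ 0.1610
```

Even though the evidence is far more probable under $H$ than under $\neg H$, the small prior keeps the posterior well below $1/2$, illustrating the role of the term discussed above.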
Alternatives to Bayesian updating
Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational. Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote: "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."
Indeed, non-Bayesian updating rules that also avoid Dutch books were developed following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability. The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.
Inference over exclusive and exhaustive possibilities
If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.
General formulation
Suppose a process is generating independent and identically distributed events $E_n$, $n = 1, 2, 3, \ldots$, but the probability distribution is unknown. Let the event space $\Omega$ represent the current state of belief for this process. Each model is represented by event $M_m$. The conditional probabilities $P(E_n \mid M_m)$ are specified to define the models. $P(M_m)$ is the degree of belief in $M_m$. Before the first inference step, $\{P(M_m)\}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate $E \in \{E_n\}$. For each $M \in \{M_m\}$, the prior $P(M)$ is updated to the posterior $P(M \mid E)$. From Bayes' theorem:

$$P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m)\,P(M_m)}\,P(M).$$
Upon observation of further evidence, this procedure may be repeated.
Multiple observations
For a sequence of independent and identically distributed observations $\mathbf{E} = (e_1, \ldots, e_n)$, it can be shown by induction that repeated application of the above is equivalent to

$$P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m)\,P(M_m)}\,P(M),$$

where

$$P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$$
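The following Python sketch illustrates this equivalence for a small, assumed set of exclusive and exhaustive models (hypothetical coin biases) and an assumed sequence of observations; it is an illustration of the update rule, not material from the original text:

```python
import numpy as np

# Belief update over exclusive, exhaustive models M_m (here: assumed coin biases).
biases = np.array([0.3, 0.5, 0.7])   # P(heads | M_m) for each model
prior = np.array([1/3, 1/3, 1/3])    # initial degrees of belief P(M_m), summing to 1
flips = [1, 1, 0, 1, 1]              # assumed observations (1 = heads, 0 = tails)

# Sequential updating: apply Bayes' theorem once per observation.
posterior = prior.copy()
for e in flips:
    likelihood = biases if e == 1 else 1 - biases  # P(e | M_m)
    posterior = likelihood * posterior
    posterior /= posterior.sum()                   # divide by sum_m P(e | M_m) P(M_m)

# Batch updating: multiply the likelihoods of all observations, then normalize once.
batch_likelihood = np.prod(
    [biases if e == 1 else 1 - biases for e in flips], axis=0)
batch_posterior = batch_likelihood * prior
batch_posterior /= batch_posterior.sum()

print(posterior)        # sequential and batch updates agree
print(batch_posterior)
```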
Parametric formulation: motivating the formal description
By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector $\boldsymbol{\theta}$ span the parameter space. Let the initial prior distribution over $\boldsymbol{\theta}$ be $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ is a set of parameters to the prior itself, or hyperparameters. Let $\mathbf{E} = (e_1, \ldots, e_n)$ be a sequence of independent and identically distributed event observations, where all $e_i$ are distributed as $p(e \mid \boldsymbol{\theta})$ for some $\boldsymbol{\theta}$. Bayes' theorem is applied to find the posterior distribution over $\boldsymbol{\theta}$:

$$p(\boldsymbol{\theta} \mid \mathbf{E}, \boldsymbol{\alpha}) = \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})\,p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})}{p(\mathbf{E} \mid \boldsymbol{\alpha})} = \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})\,p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})}{\int p(\mathbf{E} \mid \boldsymbol{\theta}', \boldsymbol{\alpha})\,p(\boldsymbol{\theta}' \mid \boldsymbol{\alpha})\,d\boldsymbol{\theta}'},$$

where

$$p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_k p(e_k \mid \boldsymbol{\theta}).$$
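As a concrete and deliberately simple sketch of this parametric update, assume a Bernoulli likelihood with a conjugate Beta prior, so the posterior is available in closed form; the hyperparameters and data below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Parametric update p(θ | E, α) ∝ p(E | θ) p(θ | α) for a Bernoulli likelihood
# with a Beta(a, b) prior (a conjugate pair). Values are illustrative assumptions.

a, b = 2.0, 2.0                            # hyperparameters α of the Beta prior
data = np.array([1, 0, 1, 1, 1, 0, 1])     # i.i.d. Bernoulli observations

# Conjugacy: posterior is Beta(a + number of successes, b + number of failures).
a_post = a + data.sum()                    # 2 + 5 = 7
b_post = b + len(data) - data.sum()        # 2 + 2 = 4
posterior = stats.beta(a_post, b_post)

print(posterior.mean())                    # posterior mean of θ ≈ 0.636
print(posterior.interval(0.95))            # central 95% credible interval
```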
Formal description of Bayesian inference
Definitions
- $x$, a data point in general. This may in fact be a vector of values.
- $\theta$, the parameter of the data point's distribution, i.e., $x \sim p(x \mid \theta)$. This may be a vector of parameters.
- $\alpha$, the hyperparameter of the parameter distribution, i.e., $\theta \sim p(\theta \mid \alpha)$. This may be a vector of hyperparameters.
- $\mathbf{X}$ is the sample, a set of $n$ observed data points, i.e., $x_1, \ldots, x_n$.
- $\tilde{x}$, a new data point whose distribution is to be predicted.
Bayesian inference
- The prior distribution is the distribution of the parameter before any data is observed, i.e. $p(\theta \mid \alpha)$. The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
- The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf{X} \mid \theta)$. This is also termed the likelihood, especially when viewed as a function of the parameter, sometimes written $\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)$.
- The marginal likelihood is the distribution of the observed data marginalized over the parameter, i.e. $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta)\,p(\theta \mid \alpha)\,d\theta$. It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise. If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
- The posterior distribution is the distribution of the parameter after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference: $p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta)\,p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta)\,p(\theta \mid \alpha)$. This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
- In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution is not obtained in closed form, mainly because the parameter space for $\theta$ can be very high-dimensional, or because the Bayesian model has a hierarchical structure built from the observations $\mathbf{X}$ and the parameter $\theta$. In such situations, we need to resort to approximation techniques; a minimal grid-approximation sketch is given after this list.
- General case: Let $p(y \mid x)$ be the conditional distribution of $y$ given $x$ and let $p(x)$ be the distribution of $x$. The joint distribution is then $p(x, y) = p(y \mid x)\,p(x)$. The conditional distribution of $x$ given $y$ is then determined by

$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(y \mid x)\,p(x)}{\int p(y \mid x')\,p(x')\,dx'}.$$
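The following sketch makes the "posterior is proportional to likelihood times prior" rule concrete by approximating a one-dimensional posterior on a grid; the model (normal likelihood with known variance, normal prior on the mean) and all numbers are assumptions chosen for illustration:

```python
import numpy as np
from scipy import stats

# Grid approximation of a posterior: evaluate likelihood × prior on a grid of
# parameter values, then normalize. Model and data are illustrative assumptions.

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])    # observed sample X
sigma = 1.0                                   # known observation standard deviation

theta_grid = np.linspace(0.0, 10.0, 1001)     # grid over the parameter θ (the mean)
prior = stats.norm(5.0, 2.0).pdf(theta_grid)  # prior density p(θ | α)

# Log-likelihood log p(X | θ) at every grid point (sum over i.i.d. observations)
log_lik = np.array([stats.norm(t, sigma).logpdf(data).sum() for t in theta_grid])

unnormalized = np.exp(log_lik - log_lik.max()) * prior
posterior = unnormalized / np.trapz(unnormalized, theta_grid)  # normalize to a density

post_mean = np.trapz(theta_grid * posterior, theta_grid)
print(post_mean)   # posterior mean of θ, close to the sample mean here
```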
Bayesian prediction
- The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: $p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta)\,p(\theta \mid \mathbf{X}, \alpha)\,d\theta$ (a sampling sketch follows the list below).
- The prior predictive distribution is the distribution of a new data point, marginalized over the prior: $p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta)\,p(\theta \mid \alpha)\,d\theta$.
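A simple Monte Carlo sketch of the posterior predictive distribution, continuing the illustrative Beta–Bernoulli example above (posterior Beta(7, 4)): draw $\theta$ from the posterior, then draw the new data point from $p(\tilde{x} \mid \theta)$; averaging over many draws approximates the integral over the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior predictive by Monte Carlo: p(x̃ | X, α) = ∫ p(x̃ | θ) p(θ | X, α) dθ.
# Hyperparameters are the illustrative posterior values Beta(7, 4) used above.
a_post, b_post = 7.0, 4.0
theta_samples = rng.beta(a_post, b_post, size=100_000)  # draws from the posterior
x_new = rng.binomial(1, theta_samples)                  # one predictive draw per θ

print(x_new.mean())   # ≈ a_post / (a_post + b_post) ≈ 0.636
```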
In some instances, frequentist statistics can also account for parameter uncertainty when making predictions. For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are constructed using a Student's t-distribution. This correctly estimates the variance, because the average of normally distributed random variables is also normally distributed, and because the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.
Both types of predictive distributions have the form of a compound probability distribution. In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters, while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
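For the illustrative Beta–Bernoulli example used above, both predictive distributions are Beta-binomial compound distributions, differing only in whether the prior or the updated hyperparameters are plugged in; a minimal sketch, assuming the same hypothetical hyperparameters as before:

```python
from scipy import stats

# Compound (Beta-binomial) form of the predictive distributions for a Bernoulli
# model with a conjugate Beta prior. Hyperparameters are illustrative assumptions:
# prior Beta(2, 2), posterior Beta(7, 4) after the data used earlier.

n_future = 10  # predict the number of successes in 10 future trials

prior_pred = stats.betabinom(n_future, 2, 2)      # uses the prior hyperparameters
posterior_pred = stats.betabinom(n_future, 7, 4)  # uses the updated hyperparameters

print(prior_pred.mean(), posterior_pred.mean())   # 5.0 vs. ≈ 6.36
```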