Dirichlet process


In probability theory, Dirichlet processes are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.
As an example, a bag of 100 real-world dice is a random probability mass function (pmf): to sample this random pmf, you put your hand in the bag and draw out a die, that is, you draw a pmf. A bag of dice manufactured using a crude process 100 years ago will likely have probabilities that deviate wildly from the uniform pmf, whereas a bag of state-of-the-art dice used by Las Vegas casinos may have barely perceptible imperfections. We can model the randomness of pmfs with the Dirichlet distribution.
The Dirichlet process is specified by a base distribution H and a positive real number α called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The concentration parameter specifies how strong this discretization is: in the limit α → 0, the realizations are all concentrated at a single value, while in the limit α → ∞ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as α increases.
The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
The Dirichlet process was formally introduced by Thomas S. Ferguson in 1973.
It has since been applied in data mining and machine learning, including natural language processing, computer vision and bioinformatics.

Introduction

Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion. Specifically, suppose that the generation of values X_1, X_2, … can be simulated by the following algorithm, with inputs a base distribution H and a positive real number α:
  1. Draw X_1 from the distribution H.
  2. For n > 1:
     a) With probability α / (α + n − 1) draw X_n from H.
     b) With probability n_x / (α + n − 1) set X_n = x, where n_x is the number of previous observations X_j, j < n, such that X_j = x.

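A minimal sketch of this "rich get richer" generative scheme in Python: with probability α/(α + i) a fresh value is drawn from the base distribution, otherwise a past value is repeated with probability proportional to its count. The base distribution (a standard normal) and the value of alpha below are illustrative choices, not prescribed by the process itself.

```python
import random

def dp_predictive_sample(n, alpha, base_sample, seed=0):
    """Generate n values by the predictive ("rich get richer") rule:
    with probability alpha / (alpha + i) draw a fresh value from the
    base distribution, otherwise repeat a past value with probability
    proportional to the number of times it has already occurred."""
    rng = random.Random(seed)
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(base_sample(rng))    # fresh draw from H
        else:
            # uniform choice over past draws = proportional to counts
            draws.append(rng.choice(draws))
    return draws

values = dp_predictive_sample(1000, alpha=2.0,
                              base_sample=lambda r: r.gauss(0.0, 1.0))
```

Even with a continuous base distribution, the 1000 draws contain only a handful of distinct values, illustrating the clustering behaviour discussed below.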
At the same time, another common model for data is that the observations are assumed to be independent and identically distributed according to some distribution. The goal of introducing Dirichlet processes is to be able to describe the procedure outlined above in this i.i.d. model.
The observations in the algorithm are not independent, since we have to consider the previous results when generating the next value. They are, however, exchangeable. This fact can be shown by calculating the joint probability distribution of the observations and noticing that the resulting formula depends only on which values occur among the observations and how many repetitions they each have. Because of this exchangeability, de Finetti's representation theorem applies, and it implies that the observations are conditionally independent given a latent distribution P. This P is itself a random variable and has a distribution; that distribution (over distributions) is called a Dirichlet process. In summary, this means that we get an equivalent procedure to the above algorithm:
  1. Draw a distribution P from the Dirichlet process DP(H, α).
  2. Draw observations X_1, X_2, … independently from P.
In practice, however, drawing a concrete distribution is impossible, since its specification requires an infinite amount of information. This is a common phenomenon in the context of Bayesian non-parametric statistics where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters. The key insight is that in many applications the infinite-dimensional distributions appear only as an intermediary computational device and are not required for either the initial specification of prior beliefs or for the statement of the final inference.

Formal definition

Given a measurable set S, a base probability distribution H and a positive real number α, the Dirichlet process DP(H, α) is a stochastic process whose sample path is a probability distribution over S, such that the following holds. For any measurable finite partition of S, denoted B_1, …, B_n, if

  X ~ DP(H, α),

then

  (X(B_1), …, X(B_n)) ~ Dir(α H(B_1), …, α H(B_n)),

where Dir denotes the Dirichlet distribution and the notation X ~ D means that the random variable X has the distribution D. For example, for a partition of S into a set B and its complement, this says that X(B) ~ Beta(α H(B), α (1 − H(B))), so in particular E[X(B)] = H(B).
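As a numerical sanity check of this definition (not a proof), the marginal property E[X(B)] = H(B) can be verified by simulation. The sketch below approximates draws from DP(H, α) by truncating the stick-breaking construction described later in this article; taking H uniform on [0, 1], B = [0, 0.5) and α = 5 are illustrative choices.

```python
import random

def truncated_dp_sample(alpha, base_sample, rng, levels=150):
    """Approximate one draw from DP(H, alpha) by truncating the
    stick-breaking construction after `levels` atoms; the tiny
    leftover stick mass is simply dropped."""
    atoms, weights, remaining = [], [], 1.0
    for _ in range(levels):
        frac = rng.betavariate(1.0, alpha)   # Beta(1, alpha) break point
        atoms.append(base_sample(rng))       # atom location drawn from H
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return atoms, weights

rng = random.Random(42)
alpha = 5.0
masses = []
for _ in range(1000):
    atoms, weights = truncated_dp_sample(alpha, lambda r: r.random(), rng)
    # total random mass that this draw assigns to B = [0, 0.5)
    masses.append(sum(w for a, w in zip(atoms, weights) if a < 0.5))

mean_mass = sum(masses) / len(masses)   # should be close to H(B) = 0.5
```

The individual masses X(B) vary widely from draw to draw (their variance is H(B)(1 − H(B))/(α + 1)), but their average settles near H(B).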

Alternative views

There are several equivalent views of the Dirichlet process. Besides the formal definition above, the Dirichlet process can be defined implicitly through de Finetti's theorem as described in the first section; this is often called the Chinese restaurant process. A third alternative is the stick-breaking process, which defines the Dirichlet process constructively by writing a distribution sampled from the process as f(x) = Σ_{k=1}^∞ β_k · δ_{x_k}(x), where the x_k are samples from the base distribution H, δ_{x_k} is an indicator function centered on x_k and the β_k are defined by a recursive scheme that repeatedly samples from the beta distribution Beta(1, α).

The Chinese restaurant process

A widely employed metaphor for the Dirichlet process is based on the so-called Chinese restaurant process. The metaphor is as follows:
Imagine a Chinese restaurant into which customers enter. A new customer sits down at a table with a probability proportional to the number of customers already sitting there. Additionally, a customer opens a new table with a probability proportional to the scaling parameter α. After infinitely many customers have entered, one obtains a probability distribution over infinitely many tables to be chosen.
This probability distribution over the tables is a random sample of the probabilities of observations drawn from a Dirichlet process with scaling parameter α.
If one associates draws from the base measure H with every table, the resulting distribution over the sample space S is a random sample of a Dirichlet process.
The Chinese restaurant process is related to the Pólya urn sampling scheme which yields samples from finite Dirichlet distributions.
Because customers sit at a table with a probability proportional to the number of customers already sitting at the table, two properties of the DP can be deduced:
  1. The Dirichlet process exhibits a self-reinforcing property: The more often a given value has been sampled in the past, the more likely it is to be sampled again.
  2. Even if H is a distribution over an uncountable set, there is a nonzero probability that two samples will have exactly the same value, because the probability mass will concentrate on a small number of tables.
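A minimal simulation of this seating process, assuming integer table labels and an illustrative value of the scaling parameter alpha:

```python
import random

def chinese_restaurant(n_customers, alpha, seed=0):
    """Seat customers one by one: with i customers already seated,
    open a new table with probability alpha / (alpha + i), otherwise
    join an occupied table proportionally to its current size."""
    rng = random.Random(seed)
    counts = []                      # counts[t] = customers at table t
    for i in range(n_customers):
        if rng.random() < alpha / (alpha + i):
            counts.append(1)         # open a new table
        else:
            # choose an occupied table proportionally to its size
            r = rng.random() * i
            acc = 0
            for t, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[t] += 1
                    break
    return counts

tables = chinese_restaurant(1000, alpha=1.0)
```

With 1000 customers and alpha = 1, only a few tables are ever opened, and a few of them hold most of the customers, the self-reinforcing property from point 1 above.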

The stick-breaking process

A third approach to the Dirichlet process is the so-called stick-breaking process view. Conceptually, this involves repeatedly breaking off and discarding a random fraction of a "stick" that is initially of length 1. Remember that draws from a Dirichlet process are distributions over a set S. As noted previously, the distribution drawn is discrete with probability 1. In the stick-breaking process view, we explicitly use the discreteness and give the probability mass function of this discrete distribution as:

  f(x) = Σ_{k=1}^∞ β_k · δ_{x_k}(x),

where δ_{x_k} is the indicator function which evaluates to zero everywhere except for δ_{x_k}(x_k) = 1. Since this distribution is random itself, its mass function is parameterized by two sets of random variables: the locations {x_k} and the corresponding probabilities {β_k}. In the following, we present without proof what these random variables are.
The locations x_k are independent and identically distributed according to H, the base distribution of the Dirichlet process. The probabilities β_k are given by a procedure resembling the breaking of a unit-length stick:

  β_k = β'_k · (1 − β'_1) · (1 − β'_2) · … · (1 − β'_{k−1}),

where the β'_k are independent random variables with the beta distribution Beta(1, α). The resemblance to 'stick-breaking' can be seen by considering β_k as the length of a piece of stick. We start with a unit-length stick and in each step we break off a portion of the remaining stick according to β'_k and assign this broken-off piece to β_k. The formula can be understood by noting that after the first k − 1 values have their portions assigned, the length of the remainder of the stick is (1 − β'_1) · … · (1 − β'_{k−1}); this piece is broken according to β'_k and the broken-off piece gets assigned to β_k.
The smaller α is, the less of the stick will be left for subsequent values (on average), yielding more concentrated distributions.
The stick-breaking process is similar to the construction where one samples sequentially from marginal beta distributions in order to generate a sample from a Dirichlet distribution.
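A sketch of that finite-dimensional analogue: to draw from a Dirichlet distribution with parameters (a_1, …, a_k), one breaks the unit stick with successive fractions drawn from Beta(a_i, a_{i+1} + … + a_k). This is a standard construction; the parameter values (2, 3, 5) below are illustrative only.

```python
import random

def dirichlet_via_sequential_betas(alphas, rng):
    """Sample from Dirichlet(alphas) by sequential stick-breaking:
    the i-th fraction of the remaining stick is Beta-distributed with
    parameters (alphas[i], sum of the alphas still to come)."""
    remaining = 1.0
    tail = sum(alphas)
    sample = []
    for a in alphas[:-1]:
        tail -= a                       # concentration mass still to come
        frac = rng.betavariate(a, tail)
        sample.append(remaining * frac)
        remaining *= 1.0 - frac
    sample.append(remaining)            # last piece takes what is left
    return sample

rng = random.Random(7)
draws = [dirichlet_via_sequential_betas([2.0, 3.0, 5.0], rng)
         for _ in range(5000)]
mean_first = sum(d[0] for d in draws) / len(draws)  # should be near 2/10
```

Each draw sums to 1 by construction, and the empirical mean of the first component approaches a_1 / (a_1 + … + a_k), as the Dirichlet distribution requires.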

The Pólya urn scheme

Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Pólya urn scheme, sometimes called the Blackwell–MacQueen sampling scheme. Imagine that we start with an urn filled with α black balls. Then we proceed as follows:
  1. Each time we need an observation, we draw a ball from the urn.
  2. If the ball is black, we generate a new colour uniformly, label a new ball this colour, drop the new ball into the urn along with the ball we drew, and return the colour we generated.
  3. Otherwise, label a new ball with the colour of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the colour we observed.
The resulting distribution over colours is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new colour, we instead pick a random value from a base distribution and use that value to label the new ball, the resulting distribution over labels will be the same as the distribution over the values in a Dirichlet process.
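A sketch of the Blackwell–MacQueen urn, assuming the black balls carry total weight alpha and new colours are represented by fresh integer labels (both illustrative modelling choices):

```python
import random
from collections import Counter

def blackwell_macqueen(n_draws, alpha, seed=0):
    """Urn scheme: with probability alpha / (alpha + n) the (weighted)
    black mass is drawn and a brand-new colour is created; otherwise
    an existing colour is duplicated proportionally to its count."""
    rng = random.Random(seed)
    colours = []          # colour of each non-black ball added so far
    next_colour = 0
    for n in range(n_draws):
        if rng.random() * (alpha + n) < alpha:
            colours.append(next_colour)       # black ball: new colour
            next_colour += 1
        else:
            colours.append(rng.choice(colours))
    return Counter(colours)

counts = blackwell_macqueen(1000, alpha=1.0)
```

Replacing `next_colour` with a draw from a base distribution H, as described in the paragraph above, turns the colour labels into the values of a Dirichlet-process sample.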