Mixture model


In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation.
Mixture models should not be confused with models for compositional data, i.e., data whose components are constrained to sum to a constant value. However, compositional models can be thought of as mixture models, where members of the population are sampled at random. Conversely, mixture models can be thought of as compositional models, where the total size of the population has been normalized to 1.

Structure

General mixture model

A typical finite-dimensional mixture model is a hierarchical model consisting of the following components:
  • N random variables that are observed, each distributed according to a mixture of K components, with the components belonging to the same parametric family of distributions but with different parameters. However, it is also possible to have a finite mixture model where each component belongs to a different parametric family of distributions, for example, a mixture of a multivariate normal distribution and a generalized hyperbolic distribution.
  • N random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution
  • A set of K mixture weights, which are probabilities that sum to 1.
  • A set of K parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are Gaussian distributions, there will be a mean and variance for each component. If the mixture components are categorical distributions, there will be a vector of V probabilities summing to 1.
In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over the variables. In such a case, the weights are typically viewed as a K-dimensional random vector drawn from a Dirichlet distribution, and the parameters will be distributed according to their respective conjugate priors.
Mathematically, a basic parametric mixture model can be described as follows:
  z_i ~ Categorical(φ),   i = 1, …, N
  x_i ~ F(θ_{z_i})
where φ = (φ_1, …, φ_K) is the vector of mixture weights, θ_k is the parameter of component k, and F(θ) is the probability distribution of an observation, parametrized on θ. The marginal distribution of each observation is then
  p(x) = ∑_{k=1}^{K} φ_k f(x; θ_k),
where f(·; θ) denotes the density of F(θ).
In a Bayesian setting, all parameters are associated with random variables, as follows:
  φ ~ Symmetric-Dirichlet_K(β)
  θ_k ~ H(α),   k = 1, …, K
  z_i ~ Categorical(φ),   i = 1, …, N
  x_i ~ F(θ_{z_i})
where α and β are hyperparameters and H is the prior distribution over the component parameters.
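This generative process is easy to simulate. The following is a minimal numpy sketch that assumes Gaussian components for F and uses invented values for K, φ, and θ; none of these numbers come from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (invented) values: K = 3 Gaussian components, N observations.
K, N = 3, 1000
phi = np.array([0.5, 0.3, 0.2])    # mixture weights, must sum to 1
mu = np.array([-2.0, 0.0, 3.0])    # component parameters theta_k: means
sigma = np.array([0.5, 1.0, 0.8])  # component parameters theta_k: std devs

# z_i ~ Categorical(phi): latent component identity of each observation
z = rng.choice(K, size=N, p=phi)

# x_i ~ F(theta_{z_i}): here F is taken to be Gaussian
x = rng.normal(loc=mu[z], scale=sigma[z])
```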
This characterization uses F and H to describe arbitrary distributions over observations and parameters, respectively. Typically H will be the conjugate prior of F. The two most common choices of F are Gaussian (also known as "normal") and categorical. Other common possibilities for the distribution of the mixture components are:
  • Binomial distribution, for the number of "positive occurrences" given a fixed number of total occurrences
  • Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences
  • Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
  • Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
  • Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
  • Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices
  • Multivariate normal distribution, for vectors of correlated outcomes that are individually Gaussian-distributed
  • Multivariate Student's t-distribution, for vectors of heavy-tailed correlated outcomes
  • A vector of Bernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below
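Any of these choices plugs into the same latent structure; only the final sampling step changes. For example, a hypothetical two-component Poisson mixture for event counts (all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-regime event-count model: a "quiet" regime with rate 2
# and a "busy" regime with rate 10.
phi = np.array([0.7, 0.3])           # mixture weights
rates = np.array([2.0, 10.0])        # Poisson rate per component

z = rng.choice(2, size=500, p=phi)   # z_i ~ Categorical(phi)
x = rng.poisson(lam=rates[z])        # x_i ~ Poisson(rate_{z_i})
```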

Specific examples

Gaussian mixture model

A typical non-Bayesian Gaussian mixture model looks like this:
  z_i ~ Categorical(φ),   i = 1, …, N
  x_i ~ N(μ_{z_i}, σ²_{z_i})
with mixture weights φ = (φ_1, …, φ_K) and component means μ_k and variances σ²_k to be estimated.
[Figure: Bayesian Gaussian mixture model using plate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables; filled-in shapes indicate known values. The notation [K] means a vector of size K.]
A Bayesian version of a Gaussian mixture model is as follows:
  φ ~ Symmetric-Dirichlet_K(β)
  μ_k ~ N(μ_0, λσ²_k),   k = 1, …, K
  σ²_k ~ Inverse-Gamma(ν, σ²_0)
  z_i ~ Categorical(φ),   i = 1, …, N
  x_i ~ N(μ_{z_i}, σ²_{z_i})
where β, μ_0, λ, ν and σ²_0 are shared hyperparameters.

Multivariate Gaussian mixture model

A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters, or multivariate normal distributions. In a multivariate distribution (i.e., one modelling a vector x of N random variables) one may model a vector of parameters θ (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by
  p(θ) = ∑_{i=1}^{K} φ_i N(μ_i, Σ_i),
where the ith vector component is characterized by a normal distribution with weight φ_i, mean μ_i and covariance matrix Σ_i. To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution p(x | θ) of the data x conditioned on the parameters θ to be estimated. With this formulation, the posterior distribution p(θ | x) is also a Gaussian mixture model of the form
  p(θ | x) = ∑_{i=1}^{K} φ̃_i N(μ̃_i, Σ̃_i)
with new parameters φ̃_i, μ̃_i and Σ̃_i that are updated using the EM algorithm.
Although EM-based parameter updates are well-established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimations of the random variable may be obtained via one of several estimators, such as the mean or maximum of the posterior distribution.
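As a concrete illustration of the EM updates mentioned above, here is a minimal sketch of the classical algorithm for a one-dimensional Gaussian mixture; the function name, the initialization strategy, and all variable names are my own, and the multivariate case follows the same pattern with covariance matrices in place of variances:

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Minimal EM for a one-dimensional Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Naive initialization: random data points as means. As noted in the
    # text, choosing good initial estimates is itself an active research
    # area (k-means is a common practical starting point).
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    phi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = P(z_n = k | x_n, current params)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(phi))
        log_p -= log_p.max(axis=1, keepdims=True)   # for numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: responsibility-weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        phi = nk / N
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return phi, mu, var
```

Running em_gmm_1d(x, K=3) on data simulated from a three-component Gaussian mixture should approximately recover the weights, means, and variances used to generate it.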
Such Gaussian mixture distributions are useful for assuming patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices, and one Gaussian distribution of the set is fit to each patch in the image. Notably, any distribution of points around a cluster may be accurately modelled given enough Gaussian components, though rarely more than K = 20 components are needed to accurately model a given image distribution or cluster of data.

Categorical mixture model

A typical non-Bayesian mixture model with categorical observations looks like this:
  • K, N: as above
  • φ_{1..K}, φ: as above
  • z_{1..N}, x_{1..N}: as above
  • V: dimension of categorical observations, e.g., size of word vocabulary
  • θ_{k,v}, k = 1..K, v = 1..V: probability for component k of observing item v
  • θ_k: vector of dimension V, composed of θ_{k,1..V}; must sum to 1
The random variables:
  z_i ~ Categorical(φ)
  x_i ~ Categorical(θ_{z_i})
[Figure: Bayesian categorical mixture model using plate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables; filled-in shapes indicate known values. The notation [K] means a vector of size K; likewise [V].]
A typical Bayesian mixture model with categorical observations looks like this:
  • K, N: as above
  • φ_{1..K}, φ: as above
  • z_{1..N}, x_{1..N}: as above
  • V: dimension of categorical observations, e.g., size of word vocabulary
  • θ_{k,v}, k = 1..K, v = 1..V: probability for component k of observing item v
  • θ_k: vector of dimension V, composed of θ_{k,1..V}; must sum to 1
  • α: shared concentration hyperparameter of θ for each component
  • β: concentration hyperparameter of φ
The random variables:
  φ ~ Symmetric-Dirichlet_K(β)
  θ_k ~ Symmetric-Dirichlet_V(α)
  z_i ~ Categorical(φ)
  x_i ~ Categorical(θ_{z_i})
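Sampling from this Bayesian model simply prepends draws of φ and θ from their Dirichlet priors to the non-Bayesian generative process; a minimal sketch with invented sizes and hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative (invented) sizes: K components, vocabulary of V items.
K, V, N = 3, 50, 1000
alpha, beta = 0.1, 1.0   # concentration hyperparameters alpha and beta

# Draw the parameters from their symmetric Dirichlet priors:
phi = rng.dirichlet(np.full(K, beta))             # phi ~ Symmetric-Dirichlet_K(beta)
theta = rng.dirichlet(np.full(V, alpha), size=K)  # theta_k ~ Symmetric-Dirichlet_V(alpha)

# Then proceed exactly as in the non-Bayesian model:
z = rng.choice(K, size=N, p=phi)                      # z_i ~ Categorical(phi)
x = np.array([rng.choice(V, p=theta[k]) for k in z])  # x_i ~ Categorical(theta_{z_i})
```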

Examples

A financial model

Financial returns often behave differently in normal situations and during crisis times, so a mixture model for return data is reasonable. Sometimes the model used is a jump-diffusion model, or a mixture of two normal distributions.

House prices

Assume that we observe the prices of N different houses. Different types of houses in different neighborhoods will have vastly different prices, but the price of a particular type of house in a particular neighborhood will tend to cluster fairly closely around the mean. One possible model of such prices would be to assume that the prices are accurately described by a mixture model with K different components, each distributed as a normal distribution with unknown mean and variance, with each component specifying a particular combination of house type/neighborhood. Fitting this model to observed prices, e.g., using the expectation-maximization algorithm, would tend to cluster the prices according to house type/neighborhood and reveal the spread of prices in each type/neighborhood.
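A minimal sketch of this fitting procedure, using synthetic prices (all cluster centers and spreads invented for illustration) and scikit-learn's GaussianMixture class, which fits by expectation-maximization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Hypothetical prices (in $1000s) for three house-type/neighborhood clusters;
# the cluster centers, spreads, and counts are invented for illustration.
prices = np.concatenate([
    rng.normal(150, 10, 300),   # e.g., small houses, neighborhood A
    rng.normal(320, 25, 200),   # e.g., mid-size houses, neighborhood B
    rng.normal(650, 60, 100),   # e.g., large houses, neighborhood C
]).reshape(-1, 1)

# Fit a K = 3 component mixture by expectation-maximization.
gmm = GaussianMixture(n_components=3, random_state=0).fit(prices)
print(gmm.weights_)           # recovered mixture weights
print(gmm.means_.ravel())     # recovered cluster means
labels = gmm.predict(prices)  # cluster assignment per observed price
```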

Topics in a document

Assume that a document is composed of N different words from a total vocabulary of size V, where each word corresponds to one of K possible topics. The distribution of such words could be modelled as a mixture of K different V-dimensional categorical distributions. A model of this sort is commonly termed a topic model. Note that expectation maximization applied to such a model will typically fail to produce realistic results, due to the excessive number of parameters. Some sorts of additional assumptions are typically necessary to get good results. Typically two sorts of additional components are added to the model:
  1. A prior distribution is placed over the parameters describing the topic distributions, using a Dirichlet distribution with a concentration parameter that is set significantly below 1, so as to encourage sparse distributions.
  2. Some sort of additional constraint is placed over the topic identities of words, to take advantage of natural clustering.
    • For example, a Markov chain could be placed on the topic identities (yielding a hidden Markov model), corresponding to the fact that nearby words belong to similar topics.
    • Another possibility is the latent Dirichlet allocation model, which divides up the words into D different documents and assumes that in each document only a small number of topics occur with any frequency (a sketch of its generative process follows this list).
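To make the latent Dirichlet allocation variant concrete, here is a minimal sketch of its generative process; the document count, vocabulary size, document lengths, and concentration values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative (invented) sizes: D documents, K topics, vocabulary of V words.
D, K, V = 5, 3, 100
alpha, beta = 0.1, 0.01   # sparse Dirichlet concentrations, as in point 1

# Per-topic word distributions, one V-dimensional categorical per topic.
topics = rng.dirichlet(np.full(V, beta), size=K)

documents = []
for _ in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # per-document topic mixture
    n_words = rng.poisson(80)                   # document length (hypothetical)
    z = rng.choice(K, size=n_words, p=theta_d)  # topic identity of each word
    words = np.array([rng.choice(V, p=topics[k]) for k in z])
    documents.append(words)
```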