Minimum mean square error estimator
In statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimation method which minimizes the mean square error (MSE), a common measure of estimator quality, of the fitted values of a dependent variable. In the Bayesian setting, the term MMSE more specifically refers to estimation with a quadratic loss function. In such a case, the MMSE estimator is given by the posterior mean of the parameter to be estimated. Since the posterior mean is cumbersome to calculate, the form of the MMSE estimator is usually constrained to be within a certain class of functions. Linear MMSE estimators are a popular choice since they are easy to use, easy to calculate, and very versatile. They have given rise to many popular estimators such as the Wiener–Kolmogorov filter and the Kalman filter.
Motivation
The term MMSE more specifically refers to estimation in a Bayesian setting with a quadratic cost function. The basic idea behind the Bayesian approach to estimation stems from practical situations where we often have some prior information about the parameter to be estimated. For instance, we may have prior information about the range that the parameter can assume; or we may have an old estimate of the parameter that we want to modify when a new observation is made available; or we may know the statistics of an actual random signal such as speech. This is in contrast to non-Bayesian approaches such as the minimum-variance unbiased estimator (MVUE), where absolutely nothing is assumed to be known about the parameter in advance and which do not account for such situations. In the Bayesian approach, such prior information is captured by the prior probability density function of the parameters; and, based directly on Bayes' theorem, it allows us to make better posterior estimates as more observations become available. Thus, unlike the non-Bayesian approach, where parameters of interest are assumed to be deterministic but unknown constants, the Bayesian estimator seeks to estimate a parameter that is itself a random variable. Furthermore, Bayesian estimation can also deal with situations where the observations in the sequence are not necessarily independent. Thus Bayesian estimation provides yet another alternative to the MVUE. This is useful when the MVUE does not exist or cannot be found.
Definition
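Let $x$ be an $n \times 1$ hidden random vector variable, and let $y$ be an $m \times 1$ known random vector variable (the measurement or observation), both of them not necessarily of the same dimension. An estimator $\hat{x}(y)$ of $x$ is any function of the measurement $y$. The estimation error vector is given by $e = \hat{x} - x$ and its mean squared error (MSE) is given by the trace of the error covariance matrix
$$\operatorname{MSE} = \operatorname{tr}\left\{E\{(\hat{x} - x)(\hat{x} - x)^T\}\right\} = E\{(\hat{x} - x)^T(\hat{x} - x)\},$$
where the expectation $E$ is taken over $x$ conditioned on $y$. When $x$ is a scalar variable, the MSE expression simplifies to $E\{(\hat{x} - x)^2\}$. Note that the MSE can equivalently be defined in other ways, since
$$\operatorname{tr}\left\{E\{e e^T\}\right\} = E\left\{\operatorname{tr}\{e e^T\}\right\} = E\{e^T e\} = \sum_{i=1}^{n} E\{e_i^2\}.$$
The MMSE estimator is then defined as the estimator achieving minimal MSE:
$$\hat{x}_{\mathrm{MMSE}}(y) = \operatorname*{argmin}_{\hat{x}} \operatorname{MSE}.$$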
Properties
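- When the means and variances are finite, the MMSE estimator is uniquely defined and is given by $\hat{x}_{\mathrm{MMSE}}(y) = E\{x \mid y\}$, that is, the conditional expectation of $x$ given the observed value of the measurement $y$.
- The MMSE estimator is unbiased: $E\{\hat{x}_{\mathrm{MMSE}}(y)\} = E\{E\{x \mid y\}\} = E\{x\}$.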
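- The MMSE estimator is asymptotically unbiased and it converges in distribution to the normal distribution: $\sqrt{n}\,(\hat{x}_{\mathrm{MMSE}} - x) \xrightarrow{d} \mathcal{N}\left(0, I^{-1}(x)\right)$, where $n$ here denotes the number of observations and $I(x)$ is the Fisher information of $x$.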
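- The orthogonality principle: When $x$ is a scalar, an estimator constrained to be of a certain form $\hat{x} = g(y)$ is an optimal estimator, i.e. $\hat{x}_{\mathrm{MMSE}} = g^*(y)$, if and only if $E\{(\hat{x}_{\mathrm{MMSE}} - x)\, g(y)\} = 0$ for all $g(y)$ in the closed, linear subspace of admissible estimators.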
- If $x$ and $y$ are jointly Gaussian, then the MMSE estimator is linear, i.e., it has the form $Wy + b$ for a matrix $W$ and a constant $b$. This can be shown directly using Bayes' theorem. As a consequence, in the jointly Gaussian case it is sufficient to find the linear MMSE estimator in order to find the MMSE estimator.
Linear MMSE estimator
One possibility is to abandon the full optimality requirements and seek a technique minimizing the MSE within a particular class of estimators, such as the class of linear estimators. Thus, we postulate that the conditional expectation of $x$ given $y$ is a simple linear function of $y$, $E\{x \mid y\} = Wy + b$, where the measurement $y$ is a random vector, $W$ is a matrix and $b$ is a vector. This can be seen as the first-order Taylor approximation of $E\{x \mid y\}$. The linear MMSE estimator is the estimator achieving minimum MSE among all estimators of such form. That is, it solves the following optimization problem:
$$\min_{W,\, b} \operatorname{MSE} \quad \text{subject to} \quad \hat{x} = Wy + b.$$
One advantage of such a linear MMSE estimator is that it is not necessary to explicitly calculate the posterior probability density function of $x$. Such a linear estimator depends only on the first two moments of $x$ and $y$. So although it may be convenient to assume that $x$ and $y$ are jointly Gaussian, it is not necessary to make this assumption, so long as the assumed distribution has well-defined first and second moments. The form of the linear estimator does not depend on the type of the assumed underlying distribution.
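The expressions for the optimal $b$ and $W$ are given by
$$b = \bar{x} - W\bar{y}, \qquad W = C_{XY} C_Y^{-1},$$
where $\bar{x} = E\{x\}$, $\bar{y} = E\{y\}$, $C_{XY}$ is the cross-covariance matrix between $x$ and $y$, and $C_Y$ is the auto-covariance matrix of $y$.
Thus, the expressions for the linear MMSE estimator, its mean, and its auto-covariance are given by
$$\hat{x} = C_{XY} C_Y^{-1}(y - \bar{y}) + \bar{x}, \qquad E\{\hat{x}\} = \bar{x}, \qquad C_{\hat{X}} = C_{XY} C_Y^{-1} C_{YX},$$
where $C_{YX}$ is the cross-covariance matrix between $y$ and $x$.
Lastly, the error covariance and the minimum mean square error achievable by such an estimator are
$$C_e = C_X - C_{\hat{X}} = C_X - C_{XY} C_Y^{-1} C_{YX}, \qquad \operatorname{LMMSE} = \operatorname{tr}\{C_e\}.$$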
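As a minimal sketch of the closed-form result $W = C_{XY} C_Y^{-1}$, $b = \bar{x} - W\bar{y}$ and $C_e = C_X - W C_{YX}$, the following NumPy snippet computes the linear MMSE estimator from given first and second moments. The function names, the randomly generated joint covariance and the sample measurement are illustrative assumptions, not part of the original text.

```python
import numpy as np

def lmmse_from_moments(x_mean, y_mean, C_xy, C_y):
    """Return (W, b) such that x_hat = W @ y + b (hypothetical helper)."""
    # W solves W C_Y = C_XY. Because C_Y is symmetric, this is the same as
    # C_Y W^T = C_XY^T, which avoids forming an explicit inverse.
    W = np.linalg.solve(C_y, C_xy.T).T
    b = x_mean - W @ y_mean
    return W, b

def lmmse_error_cov(C_x, C_xy, W):
    """Error covariance C_e = C_X - W C_YX, with C_YX = C_XY^T."""
    return C_x - W @ C_xy.T

# Illustrative joint covariance of (x, y), with x in R^2 and y in R^3.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
C_joint = M @ M.T + 5.0 * np.eye(5)   # symmetric positive definite by construction
C_x, C_xy, C_y = C_joint[:2, :2], C_joint[:2, 2:], C_joint[2:, 2:]
x_mean = np.array([1.0, -1.0])
y_mean = np.array([0.5, 0.0, 2.0])

W, b = lmmse_from_moments(x_mean, y_mean, C_xy, C_y)
C_e = lmmse_error_cov(C_x, C_xy, W)
lmmse = np.trace(C_e)                 # minimum MSE within the linear class

y_obs = np.array([1.0, -0.5, 2.5])    # a made-up measurement
x_hat = W @ y_obs + b
```

Only the means and covariances enter the computation, which reflects the point that no distributional assumption beyond finite first and second moments is needed.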
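Let us have the optimal linear MMSE estimator given as $\hat{x} = Wy + b$, where we are required to find the expressions for $W$ and $b$. It is required that the MMSE estimator be unbiased. This means
$$E\{\hat{x}\} = E\{x\}.$$
Plugging the expression for $\hat{x}$ into the above, we get
$$b = \bar{x} - W\bar{y},$$
where $\bar{x} = E\{x\}$ and $\bar{y} = E\{y\}$. Thus we can rewrite the estimator as
$$\hat{x} = W(y - \bar{y}) + \bar{x}$$
and the expression for the estimation error becomes
$$\hat{x} - x = W(y - \bar{y}) - (x - \bar{x}).$$
From the orthogonality principle, we have $E\{(\hat{x} - x)(y - \bar{y})^T\} = 0$, where we take $g(y) = y - \bar{y}$. Here the left-hand-side term is
$$\begin{aligned}
E\{(\hat{x} - x)(y - \bar{y})^T\} &= E\{(W(y - \bar{y}) - (x - \bar{x}))(y - \bar{y})^T\} \\
&= W E\{(y - \bar{y})(y - \bar{y})^T\} - E\{(x - \bar{x})(y - \bar{y})^T\} \\
&= W C_Y - C_{XY}.
\end{aligned}$$
When equated to zero, we obtain the desired expression for $W$ as
$$W = C_{XY} C_Y^{-1}.$$
Here $C_{XY}$ is the cross-covariance matrix between $x$ and $y$, and $C_Y$ is the auto-covariance matrix of $y$. Since $C_{XY} = C_{YX}^T$, the expression can also be rewritten in terms of $C_{YX}$ as
$$W^T = C_Y^{-1} C_{YX}.$$
Thus the full expression for the linear MMSE estimator is
$$\hat{x} = C_{XY} C_Y^{-1}(y - \bar{y}) + \bar{x}.$$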
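Since the estimate $\hat{x}$ is itself a random variable with $E\{\hat{x}\} = \bar{x}$, we can also obtain its auto-covariance as
$$\begin{aligned}
C_{\hat{X}} &= E\{(\hat{x} - \bar{x})(\hat{x} - \bar{x})^T\} \\
&= W E\{(y - \bar{y})(y - \bar{y})^T\} W^T \\
&= W C_Y W^T.
\end{aligned}$$
Putting in the expressions for $W = C_{XY} C_Y^{-1}$ and $W^T = C_Y^{-1} C_{YX}$, we get
$$C_{\hat{X}} = C_{XY} C_Y^{-1} C_{YX}.$$
Lastly, the covariance of the linear MMSE estimation error will then be given by
$$\begin{aligned}
C_e &= E\{(\hat{x} - x)(\hat{x} - x)^T\} \\
&= E\{(\hat{x} - x)(W(y - \bar{y}) - (x - \bar{x}))^T\} \\
&= E\{(\hat{x} - x)(y - \bar{y})^T\} W^T - E\{(\hat{x} - x)(x - \bar{x})^T\} \\
&= -E\{(W(y - \bar{y}) - (x - \bar{x}))(x - \bar{x})^T\} \\
&= C_X - W C_{YX}.
\end{aligned}$$
The first term in the third line is zero due to the orthogonality principle. Since $W = C_{XY} C_Y^{-1}$, we can rewrite $C_e$ in terms of covariance matrices as
$$C_e = C_X - C_{XY} C_Y^{-1} C_{YX}.$$
This we can recognize to be the same as $C_e = C_X - C_{\hat{X}}$. Thus the minimum mean square error achievable by such a linear estimator is
$$\operatorname{LMMSE} = \operatorname{tr}\{C_e\}.$$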
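The orthogonality argument can be checked numerically. The sketch below, which assumes NumPy and uses a made-up non-Gaussian data model, replaces the true moments by sample moments, fits $W = C_{XY} C_Y^{-1}$, and then evaluates the sample versions of $E\{(\hat{x} - x)(y - \bar{y})^T\}$ and $C_e$. All variable names and the data-generating model are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000

# Hypothetical data: a non-Gaussian hidden vector x in R^2 and a noisy
# linear mixture y in R^3. Only first and second moments are used below.
x = rng.standard_normal((N, 2)) ** 3
mix = np.array([[1.0, 0.5, -1.0],
                [0.0, 2.0, 1.0]])
y = x @ mix + 0.5 * rng.standard_normal((N, 3))

# Sample means and (biased, 1/N) sample covariances.
x_bar, y_bar = x.mean(axis=0), y.mean(axis=0)
xc, yc = x - x_bar, y - y_bar
C_x = xc.T @ xc / N
C_y = yc.T @ yc / N
C_xy = xc.T @ yc / N

# W = C_XY C_Y^{-1}, obtained by solving C_Y W^T = C_XY^T.
W = np.linalg.solve(C_y, C_xy.T).T

# Estimates and errors, one row per sample.
x_hat = yc @ W.T + x_bar
err = x_hat - x

cross = err.T @ yc / N        # sample E{(x_hat - x)(y - y_bar)^T}
C_e = err.T @ err / N         # sample error covariance
```

With sample moments used consistently, `cross` vanishes up to floating-point rounding, and `C_e` agrees with `C_x - W @ C_xy.T`, mirroring the two steps of the derivation.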
Univariate case
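For the special case when both $x$ and $y$ are scalars, the above relations simplify to
$$\hat{x} = \frac{\sigma_{XY}}{\sigma_Y^2}(y - \bar{y}) + \bar{x} = \rho \frac{\sigma_X}{\sigma_Y}(y - \bar{y}) + \bar{x},$$
$$\sigma_e^2 = \sigma_X^2 - \frac{\sigma_{XY}^2}{\sigma_Y^2} = (1 - \rho^2)\sigma_X^2,$$
where $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$ is Pearson's correlation coefficient between $x$ and $y$.
The above two equations allow us to interpret the correlation coefficient either as the normalized slope of linear regression
$$\left(\frac{\hat{x} - \bar{x}}{\sigma_X}\right) = \rho \left(\frac{y - \bar{y}}{\sigma_Y}\right)$$
or as the square root of the ratio of two variances
$$\rho^2 = \frac{\sigma_X^2 - \sigma_e^2}{\sigma_X^2} = \frac{\sigma_{\hat{X}}^2}{\sigma_X^2}.$$
When $\rho = 0$, we have $\hat{x} = \bar{x}$ and $\sigma_e^2 = \sigma_X^2$. In this case, no new information is gleaned from the measurement which can decrease the uncertainty in $x$. On the other hand, when $\rho = \pm 1$, we have $\hat{x} = \frac{\sigma_{XY}}{\sigma_Y^2}(y - \bar{y}) + \bar{x}$ and $\sigma_e^2 = 0$. Here $x$ is completely determined by $y$, as given by the equation of a straight line.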
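The scalar relations can be illustrated with a short standard-library Python sketch. The helper name `scalar_lmmse` and the numerical moments are assumptions chosen for illustration; the formulas are the scalar estimator $\hat{x} = \frac{\sigma_{XY}}{\sigma_Y^2}(y - \bar{y}) + \bar{x}$ and the error variance $\sigma_e^2 = (1 - \rho^2)\sigma_X^2$.

```python
import math

def scalar_lmmse(x_mean, y_mean, var_x, var_y, cov_xy):
    """Return (slope, intercept, err_var, rho) for x_hat = slope * y + intercept.

    Hypothetical helper implementing the scalar linear MMSE relations.
    """
    rho = cov_xy / math.sqrt(var_x * var_y)   # Pearson correlation coefficient
    slope = cov_xy / var_y                    # equals rho * sigma_x / sigma_y
    intercept = x_mean - slope * y_mean
    err_var = (1.0 - rho ** 2) * var_x        # error variance of the estimate
    return slope, intercept, err_var, rho

# Made-up moments: sigma_x = 2, sigma_y = 3, covariance 3, so rho = 0.5.
slope, intercept, err_var, rho = scalar_lmmse(
    x_mean=1.0, y_mean=2.0, var_x=4.0, var_y=9.0, cov_xy=3.0
)

# Limiting cases discussed in the text.
uncorrelated = scalar_lmmse(1.0, 2.0, 4.0, 9.0, 0.0)   # rho = 0
perfect = scalar_lmmse(1.0, 2.0, 4.0, 9.0, 6.0)        # rho = 1
```

In the uncorrelated case the slope is zero and the error variance equals the prior variance, while in the perfectly correlated case the error variance is zero.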
Computation
Standard methods such as Gaussian elimination can be used to solve the matrix equation for $W$. A more numerically stable method is provided by the QR decomposition. Since the matrix $C_Y$ is symmetric positive definite, $W$ can be solved for twice as fast with the Cholesky decomposition, while for large sparse systems the conjugate gradient method is more effective. Levinson recursion is a fast method when $C_Y$ is also a Toeplitz matrix. This can happen when $y$ is a wide-sense stationary process. In such stationary cases, these estimators are also referred to as Wiener–Kolmogorov filters.
Linear MMSE estimator for linear observation process
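Let us further model the underlying process of observation as a linear process: $y = Ax + z$, where $A$ is a known matrix and $z$ is a random noise vector with mean $E\{z\} = 0$ and cross-covariance $C_{XZ} = 0$. Here the required mean and covariance matrices will be
$$E\{y\} = A\bar{x}, \qquad C_Y = A C_X A^T + C_Z, \qquad C_{XY} = C_X A^T.$$
Thus the expression for the linear MMSE estimator matrix $W$ further modifies to
$$W = C_X A^T (A C_X A^T + C_Z)^{-1}.$$
Putting everything into the expression for $\hat{x}$, we get
$$\hat{x} = C_X A^T (A C_X A^T + C_Z)^{-1}(y - A\bar{x}) + \bar{x}.$$
Lastly, the error covariance is
$$C_e = C_X - C_{\hat{X}} = C_X - C_X A^T (A C_X A^T + C_Z)^{-1} A C_X.$$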
The significant difference between the estimation problem treated above and those of least squares and the Gauss–Markov estimate is that the number of observations $m$ (i.e. the dimension of $y$) need not be at least as large as the number of unknowns $n$ (i.e. the dimension of $x$). The estimate for the linear observation process exists so long as the m-by-m matrix $(A C_X A^T + C_Z)^{-1}$ exists; this is the case for any m if, for instance, $C_Z$ is positive definite. Physically, the reason for this property is that since $x$ is now a random variable, it is possible to form a meaningful estimate (namely its mean) even with no measurements. Every new measurement simply provides additional information which may modify our original estimate. Another feature of this estimate is that for m < n, there need be no measurement error. Thus, we may have $C_Z = 0$, because as long as $A C_X A^T$ is positive definite, the estimate still exists. Lastly, this technique can handle cases where the noise is correlated.
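The case of fewer observations than unknowns can be sketched in a few lines of NumPy. The example assumes the linear model $y = Ax + z$ with prior mean $\bar{x}$, prior covariance $C_X$ and noise covariance $C_Z$, and uses the estimator $\hat{x} = \bar{x} + C_X A^T (A C_X A^T + C_Z)^{-1}(y - A\bar{x})$. The function name and the numbers are hypothetical, chosen so that a single noise-free measurement (m = 1) is used to estimate three unknowns (n = 3).

```python
import numpy as np

def lmmse_linear_obs(y, A, x_mean, C_x, C_z):
    """Linear MMSE estimate and error covariance for y = A x + z (hypothetical helper)."""
    S = A @ C_x @ A.T + C_z                  # C_Y, an m-by-m matrix
    # W = C_x A^T S^{-1}. S and C_x are symmetric, so W^T solves S W^T = A C_x.
    W = np.linalg.solve(S, A @ C_x).T
    x_hat = x_mean + W @ (y - A @ x_mean)
    C_e = C_x - W @ A @ C_x
    return x_hat, C_e

# One noise-free measurement of the sum of three unknowns (m = 1 < n = 3).
A = np.array([[1.0, 1.0, 1.0]])
x_mean = np.zeros(3)
C_x = np.eye(3)
C_z = np.zeros((1, 1))                       # no measurement error
y = np.array([3.0])

x_hat, C_e = lmmse_linear_obs(y, A, x_mean, C_x, C_z)
```

Here $A C_X A^T = 3$ is positive definite, so the estimate exists even though $C_Z = 0$; with a symmetric prior the measured sum is split equally among the three components. `np.linalg.solve` is a general-purpose choice; a Cholesky-based solve could be substituted when the m-by-m matrix is symmetric positive definite.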
Alternative form
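An alternative form of expression can be obtained by using the matrix identity
$$C_X A^T (A C_X A^T + C_Z)^{-1} = (A^T C_Z^{-1} A + C_X^{-1})^{-1} A^T C_Z^{-1},$$
which can be established by post-multiplying by $(A C_X A^T + C_Z)$ and pre-multiplying by $(A^T C_Z^{-1} A + C_X^{-1})$, to obtain
$$W = (A^T C_Z^{-1} A + C_X^{-1})^{-1} A^T C_Z^{-1}$$
and
$$C_e = (A^T C_Z^{-1} A + C_X^{-1})^{-1}.$$
Since $W$ can now be written in terms of $C_e$ as $W = C_e A^T C_Z^{-1}$, we get a simplified expression for $\hat{x}$ as
$$\hat{x} = C_e A^T C_Z^{-1}(y - A\bar{x}) + \bar{x}.$$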
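In this form the above expression can be easily compared with ridge regression, weighted least squares and the Gauss–Markov estimate. In particular, when $C_X^{-1} = 0$, corresponding to infinite variance of the a priori information concerning $x$, the result $W = (A^T C_Z^{-1} A)^{-1} A^T C_Z^{-1}$ is identical to the weighted linear least squares estimate with $C_Z^{-1}$ as the weight matrix. Moreover, if the components of $z$ are uncorrelated and have equal variance such that $C_Z = \sigma^2 I$, where $I$ is an identity matrix, then $W = (A^T A)^{-1} A^T$ is identical to the ordinary least squares estimate. When a priori information is available as $C_X^{-1} = \lambda I$ and the components of $z$ are uncorrelated and have equal variance $\sigma^2$, we have $W = (A^T A + \sigma^2 \lambda I)^{-1} A^T$, which is identical to the ridge regression solution with regularization parameter $\sigma^2 \lambda$ (simply $\lambda$ when $\sigma^2 = 1$).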
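The equivalence of the two forms of $W$, and the connection to ridge regression, can be checked numerically. The following NumPy sketch assumes $C_Z = \sigma^2 I$ and $C_X^{-1} = \lambda I$ with arbitrarily chosen values, and a random matrix $A$; all names and numbers are illustrative. Explicit inverses are used only to keep the correspondence with the formulas visible.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))      # hypothetical known observation matrix

sigma2, lam = 0.5, 2.0               # noise variance and prior precision (made up)
C_z = sigma2 * np.eye(m)
C_x = np.eye(n) / lam                # so that C_x^{-1} = lam * I

# Covariance form: W = C_x A^T (A C_x A^T + C_z)^{-1}
W_cov = C_x @ A.T @ np.linalg.inv(A @ C_x @ A.T + C_z)

# Alternative form: C_e = (A^T C_z^{-1} A + C_x^{-1})^{-1}, W = C_e A^T C_z^{-1}
C_e = np.linalg.inv(A.T @ np.linalg.inv(C_z) @ A + np.linalg.inv(C_x))
W_alt = C_e @ A.T @ np.linalg.inv(C_z)

# Ridge regression matrix with penalty sigma2 * lam
W_ridge = np.linalg.inv(A.T @ A + sigma2 * lam * np.eye(n)) @ A.T

# Error covariance from the covariance form, for comparison with C_e
C_e_cov = C_x - W_cov @ A @ C_x
```

The covariance form inverts an m-by-m matrix while the alternative form inverts n-by-n matrices, so the cheaper choice depends on whether there are more observations or more unknowns.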