Akaike information criterion


The Akaike information criterion is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.
In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.
The Akaike information criterion is named after the Japanese statistician Hirotugu Akaike, who formulated it. It now forms the basis of a paradigm for the foundations of statistics and is also widely used for statistical inference.

Definition

Suppose that we have a statistical model of some data. Let k be the number of estimated parameters in the model. Let L̂ be the maximized value of the likelihood function for the model. Then the AIC value of the model is the following:

\mathrm{AIC} = 2k - 2\ln(\hat{L})
Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit, but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.
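The following Python sketch illustrates the definition; the data, the two candidate models, and the helper function aic are our own illustrative choices, not part of any standard recipe.

```python
import numpy as np
from scipy import stats

def aic(log_likelihood, k):
    """AIC = 2k - 2 ln(L-hat), where k is the number of estimated parameters."""
    return 2 * k - 2 * log_likelihood

# Illustrative data: 100 positive observations.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)

# Candidate model 1: exponential distribution; its single parameter (the scale)
# has maximum-likelihood estimate equal to the sample mean.
loglik1 = stats.expon.logpdf(data, scale=data.mean()).sum()

# Candidate model 2: normal distribution; its two parameters (mean and standard
# deviation) have maximum-likelihood estimates given by the sample mean and the
# (population) sample standard deviation.
loglik2 = stats.norm.logpdf(data, loc=data.mean(), scale=data.std(ddof=0)).sum()

print("AIC, exponential model:", aic(loglik1, k=1))
print("AIC, normal model:     ", aic(loglik2, k=2))
# The preferred candidate is the one with the smaller AIC value.
```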
Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback–Leibler divergence, D_KL(f ‖ g1); similarly, the information lost from using g2 to represent f could be found by calculating D_KL(f ‖ g2). We would then, generally, choose the candidate model that minimized the information loss.
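For reference, when f and a candidate model g have probability densities, the divergence takes the standard form below (the discrete case replaces the integral by a sum); the smaller the divergence, the less information is lost by using g in place of f:

D_{\mathrm{KL}}(f \parallel g) = \int f(x) \ln\!\frac{f(x)}{g(x)} \, dx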
We cannot choose with certainty, because we do not know f. Akaike showed, however, that we can estimate, via AIC, how much more information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary.
Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals and tests of the model's predictions. For more on this topic, see statistical model validation.

How to use AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true model," i.e. the process that generated the data. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.
Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, AIC3, ..., AICR. Let AICmin be the minimum of those values. Then the quantity exp((AICmin − AICi)/2) can be interpreted as being proportional to the probability that the ith model minimizes the information loss.
As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) ≈ 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) ≈ 0.007 times as probable as the first model to minimize the information loss.
In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368, respectively, and then do statistical inference based on the weighted multimodel.
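As a rough illustration of these calculations, the following Python sketch computes the relative likelihoods for the three AIC values above and then normalizes them into weights for a multimodel average; the variable names are illustrative.

```python
import numpy as np

# AIC values from the example above.
aic_values = np.array([100.0, 102.0, 110.0])

# Relative likelihood of each model: exp((AIC_min - AIC_i) / 2).
rel_likelihood = np.exp((aic_values.min() - aic_values) / 2)
print(rel_likelihood)   # approximately [1.0, 0.368, 0.007]

# Normalizing the relative likelihoods gives weights that could be used
# for a weighted multimodel average over the retained candidates.
weights = rel_likelihood / rel_likelihood.sum()
print(weights)
```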
The quantity exp((AICmin − AICi)/2) is known as the relative likelihood of model i. It is closely related to the likelihood ratio used in the likelihood-ratio test. Indeed, if all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC has no such restriction.

Hypothesis testing

Every statistical hypothesis test can be formulated as a comparison of statistical models. Hence, every statistical hypothesis test can be replicated via AIC. Two examples are briefly described in the subsections below. Details for those examples, and many more examples, are given in the literature on AIC.

Replicating Student's t-test

As an example of a hypothesis test, consider the t-test to compare the means of two normally-distributed populations. The input to the t-test comprises a random sample from each of the two populations.
To formulate the test as a comparison of models, we construct two different models. The first model models the two populations as having potentially different means and standard deviations. The likelihood function for the first model is thus the product of the likelihoods for two distinct normal distributions; so it has four parameters: μ1, σ1, μ2, σ2. To be explicit, the likelihood function is as follows (denoting the two samples by x1, ..., xn and y1, ..., ym):

\mathcal{L}(\mu_1, \sigma_1, \mu_2, \sigma_2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}\right) \cdot \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(y_j - \mu_2)^2}{2\sigma_2^2}\right)
The second model models the two populations as having the same mean and the same standard deviation. The likelihood function for the second model thus sets μ1 = μ2 and σ1 = σ2 in the above equation; so it has only two parameters.
We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model were only 0.01 times as likely as the first model, then we would omit the second model from further consideration; so we would conclude that the two populations have different means.
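A minimal Python sketch of this comparison is shown below, assuming two samples x and y and using the closed-form maximum-likelihood estimates for the normal distribution; the sample data and the helper function normal_loglik are illustrative, not part of any standard procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=40)   # sample from the first population
y = rng.normal(loc=0.8, scale=1.2, size=40)   # sample from the second population

def normal_loglik(sample, mu, sigma):
    """Log-likelihood of a sample under a normal(mu, sigma) model."""
    return stats.norm.logpdf(sample, loc=mu, scale=sigma).sum()

# Model 1: potentially different means and standard deviations (four parameters).
loglik1 = (normal_loglik(x, x.mean(), x.std(ddof=0))
           + normal_loglik(y, y.mean(), y.std(ddof=0)))
aic1 = 2 * 4 - 2 * loglik1

# Model 2: common mean and common standard deviation (two parameters);
# the maximum-likelihood estimates are those of the pooled sample.
pooled = np.concatenate([x, y])
loglik2 = normal_loglik(pooled, pooled.mean(), pooled.std(ddof=0))
aic2 = 2 * 2 - 2 * loglik2

# Relative likelihood of the second model with respect to the better model.
aic_min = min(aic1, aic2)
print("AIC values:", aic1, aic2)
print("relative likelihood of model 2:", np.exp((aic_min - aic2) / 2))
```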
The t-test assumes that the two populations have identical standard deviations; the test tends to be unreliable if the assumption is false and the sizes of the two samples are very different. Comparing the means of the populations via AIC, as in the example above, has the same disadvantage. However, one could create a third model that has a common mean but allows the two populations to have different standard deviations. This third model would have the advantage of not making such an assumption, at the cost of an additional parameter and thus an additional degree of freedom.

Comparing categorical data sets

For another example of a hypothesis test, suppose that we have two populations, and each member of each population is in one of two categories—category #1 or category #2. Each population is binomially distributed. We want to know whether the distributions of the two populations are the same. We are given a random sample from each of the two populations.
Let n1 be the size of the sample from the first population. Let m1 be the number of observations (in that sample) in category #1; so the number of observations in category #2 is n1 − m1. Similarly, let n2 be the size of the sample from the second population. Let m2 be the number of observations in category #1.
Let p be the probability that a randomly-chosen member of the first population is in category #1. Hence, the probability that a randomly-chosen member of the first population is in category #2 is 1 − p. Note that the distribution of the first population has one parameter. Let q be the probability that a randomly-chosen member of the second population is in category #1. Note that the distribution of the second population also has one parameter.
To compare the distributions of the two populations, we construct two different models. The first model models the two populations as having potentially different distributions. The likelihood function for the first model is thus the product of the likelihoods for two distinct binomial distributions; so it has two parameters: p, q. To be explicit, the likelihood function is as follows:

\mathcal{L}(p, q) = \binom{n_1}{m_1} p^{m_1} (1-p)^{n_1 - m_1} \cdot \binom{n_2}{m_2} q^{m_2} (1-q)^{n_2 - m_2}
The second model models the two populations as having the same distribution. The likelihood function for the second model thus sets p = q in the above equation; so the second model has one parameter.
We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model were only 0.01 times as likely as the first model, then we would omit the second model from further consideration; so we would conclude that the two populations have different distributions.
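A minimal Python sketch of this comparison is given below, with made-up counts for illustration; the function binom_loglik and the specific numbers are our own assumptions.

```python
import math

# Illustrative counts: m_i members in category #1 out of a sample of size n_i.
n1, m1 = 120, 45
n2, m2 = 150, 72

def binom_loglik(n, m, p):
    """Log-likelihood of m category-#1 observations in a sample of size n,
    under a binomial model with category-#1 probability p."""
    return (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + m * math.log(p) + (n - m) * math.log(1 - p))

# Model 1: separate probabilities p and q (two parameters);
# the maximum-likelihood estimates are the sample proportions.
p_hat, q_hat = m1 / n1, m2 / n2
aic1 = 2 * 2 - 2 * (binom_loglik(n1, m1, p_hat) + binom_loglik(n2, m2, q_hat))

# Model 2: common probability p = q (one parameter);
# the maximum-likelihood estimate is the pooled sample proportion.
p_pooled = (m1 + m2) / (n1 + n2)
aic2 = 2 * 1 - 2 * (binom_loglik(n1, m1, p_pooled) + binom_loglik(n2, m2, p_pooled))

# Relative likelihood of the second model with respect to the better model.
aic_min = min(aic1, aic2)
print("AIC values:", aic1, aic2)
print("relative likelihood of model 2:", math.exp((aic_min - aic2) / 2))
```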

Foundations of statistics

Statistical inference is generally regarded as comprising hypothesis testing and estimation. Hypothesis testing can be done via AIC, as discussed above. Regarding estimation, there are two types: point estimation and interval estimation. Point estimation can be done within the AIC paradigm: it is provided by maximum likelihood estimation. Interval estimation can also be done within the AIC paradigm: it is provided by likelihood intervals. Hence, statistical inference generally can be done within the AIC paradigm.
The most commonly used paradigms for statistical inference are frequentist inference and Bayesian inference. AIC, though, can be used to do statistical inference without relying on either the frequentist paradigm or the Bayesian paradigm, because AIC can be interpreted without the aid of significance levels or Bayesian priors. In other words, AIC can be used to form a foundation of statistics that is distinct from both frequentism and Bayesianism.