Scoring rule
In decision theory, scoring rules and scoring functions
provide ex post summary measures for evaluating the quality of a prediction or forecast. Each assigns a numeric score to a single prediction given the actual outcome. Depending on the sign convention, this score can be interpreted as a loss or a reward for the forecaster.
Scoring rules assess probabilistic predictions or forecasts, i.e. predictions of the whole probability distribution of the outcome. On the other hand, scoring functions assess point predictions, i.e. predictions of a property or functional of the probability distribution of the outcome. Examples of such a property are the expectation and the median.
[Figure: Calibration plot. A calibration curve allows one to judge how well model predictions are calibrated, by comparing the predicted quantiles to the observed quantiles. Blue is the best calibrated model; see calibration.]

Scoring rules answer the question "how good is a predicted probability distribution given the observation of the actual outcome?" Scoring rules that are proper are proven to have the lowest expected score if the predicted distribution equals the underlying distribution of the target variable. Although scores can differ between individual observations, predicting the "correct" distribution minimizes the expected score.
In the same way, scoring functions answer the question "how good is a point prediction given the observation of the actual outcome?" Scoring functions that are consistent are proven to have the lowest expected score if the point prediction equals the true functional of the underlying distribution of the target variable.
Scoring rules and scoring functions are often used as "cost functions" or "loss functions" of forecasting models. If a sample of forecasts and observations of the outcome is collected, they can be evaluated as the empirical mean of the given sample, often also called the "score". Scores of predictions of different models or forecasters can then be compared to conclude which model or forecaster is best.
For example, consider a probabilistic model that predicts a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. A common interpretation of probabilistic models is that they aim to quantify their own predictive uncertainty. In this example, an observed target variable is compared to the predicted distribution and assigned a score. Training a probabilistic model on a scoring rule should "teach" the model to predict when its uncertainty is low and when it is high, and it should result in calibrated predictions while minimizing the predictive uncertainty.
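A minimal sketch of this idea in Python (the numbers and the helper function are hypothetical): the negatively oriented logarithmic score of a Gaussian forecast is simply the negative log-density of the forecast at the observed value, so both vague and badly located forecasts are penalized.

```python
import math

def gaussian_log_score(mu, sigma, y):
    """Negatively oriented logarithmic score of a Gaussian forecast
    N(mu, sigma^2): the negative log-density evaluated at the
    realized observation y."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Three hypothetical forecasts for the same observation y = 1.0:
print(gaussian_log_score(mu=0.9, sigma=0.5, y=1.0))  # sharp and well placed: ~0.25
print(gaussian_log_score(mu=0.9, sigma=5.0, y=1.0))  # well placed but vague: ~2.53
print(gaussian_log_score(mu=3.0, sigma=0.5, y=1.0))  # sharp but badly placed: ~8.23
```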
Although the example given concerns the probabilistic forecasting of a real-valued target variable, a variety of scoring rules have been designed with different target variables in mind. Scoring rules exist for binary and categorical probabilistic classification, as well as for univariate and multivariate probabilistic regression.
Definitions
Consider a sample space or observation domain $\Omega$, which comprises the potential outcomes of a future observation; a σ-algebra $\mathcal{A}$ of subsets of $\Omega$; and a convex class $\mathcal{F}$ of probability measures on $(\Omega, \mathcal{A})$. A function defined on $\Omega$ and taking values in the extended real line $\overline{\mathbb{R}} = [-\infty, \infty]$ is $\mathcal{F}$-quasi-integrable if it is measurable with respect to $\mathcal{A}$ and is quasi-integrable with respect to all $F \in \mathcal{F}$.
A functional is a potentially set-valued mapping from the class of probability distributions to a Euclidean space, i.e. $\mathrm{T} : \mathcal{F} \to 2^{\mathbb{R}^k}$ with $\mathrm{T}(F) \subseteq \mathbb{R}^k$.
Probabilistic forecast
A probabilistic forecast is any probability measure $F \in \mathcal{F}$, i.e. a distribution of potential future observations.

Point forecast
A point forecast for the functional $\mathrm{T}$ is any value $x \in \mathbb{R}^k$.

Scoring rule
A scoring rule is any extended real-valued function $\mathbf{S} : \mathcal{F} \times \Omega \to \overline{\mathbb{R}}$ such that $\mathbf{S}(F, \cdot)$ is $\mathcal{F}$-quasi-integrable for all $F \in \mathcal{F}$. $\mathbf{S}(F, y)$ represents the loss or penalty when the forecast $F$ is issued and the observation $y$ materializes.

Scoring function
A scoring function is any real-valued function $S : \mathbb{R}^k \times \Omega \to \mathbb{R}$ where $S(x, y)$ represents the loss or penalty when the point forecast $x$ is issued and the observation $y$ materializes.

Orientation / Sign convention
Scoring rules and scoring functions are negatively oriented if smaller values indicate better predictions. Changing the convention can be accomplished by multiplying the score by $-1$. Here we adhere to the negative orientation, hence the association with "loss".

Expected score
We write $\mathbf{S}(F, G)$ for the expected score of a probabilistic prediction $F$ with respect to the underlying distribution $G$:

$\mathbf{S}(F, G) = \mathbb{E}_{Y \sim G}\left[\mathbf{S}(F, Y)\right]$

Similarly, the expected score of a point prediction $x$ with respect to the underlying distribution $G$ is

$S(x, G) = \mathbb{E}_{Y \sim G}\left[S(x, Y)\right]$
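As a sketch of how an expected score can be computed in practice, the expectation can be approximated by Monte Carlo sampling from the underlying distribution; the distributions and sample size below are hypothetical, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.0, size=100_000)  # draws from the "true" G

def expected_log_score(mu):
    """Monte Carlo estimate of the expected negatively oriented
    logarithmic score S(F, G) of the forecast F = N(mu, 1)."""
    return np.mean(-norm.logpdf(y, loc=mu, scale=1.0))

print(expected_log_score(0.5))  # forecast equals G: lowest expected score
print(expected_log_score(0.0))  # misplaced forecast: higher expected score
```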
Sample average score
A way to estimate the expected score is by means of the sample average score. Given a sample of $n$ prediction-observation pairs, i.e. $(F_i, y_i)$, $i = 1, \dots, n$, for probabilistic predictions or $(x_i, y_i)$, $i = 1, \dots, n$, for point predictions, the average score is calculated as
- for scoring rules: $\bar{\mathbf{S}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{S}(F_i, y_i)$
- for scoring functions: $\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S(x_i, y_i)$
By a law of large numbers argument, the sample average scores are consistent estimators of the corresponding expected scores.
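For instance, the sample average score under the squared-error scoring function could be computed as follows (the forecast and observation values are hypothetical):

```python
import numpy as np

# Hypothetical point forecasts x_i and realized observations y_i
x = np.array([1.2, 0.4, 2.1, 1.8])
y = np.array([1.0, 0.5, 2.5, 1.6])

scores = (x - y) ** 2     # S(x_i, y_i) for each prediction-observation pair
print(scores.mean())      # the sample average score, often just "the score"
```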
Properties
Propriety and consistency
Strictly proper scoring rules and strictly consistent scoring functions encourage honest forecasts by maximization of the expected reward: if a forecaster is given a reward of $-\mathbf{S}(F, y)$ when $y$ realizes, then the highest expected reward is obtained by reporting the true probability distribution.

Proper scoring rules
A scoring rule $\mathbf{S}$ is proper relative to $\mathcal{F}$ if its expected score is minimized when the forecast distribution matches the distribution of the observation:

$\mathbf{S}(G, G) \le \mathbf{S}(F, G)$ for all $F, G \in \mathcal{F}$.

It is strictly proper if the above equation holds with equality if and only if $F = G$.
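Propriety can be checked numerically for a binary event: if the event occurs with true probability $q$, the expected Brier score of a forecast $p$ is $q(1-p)^2 + (1-q)p^2$, which is minimized at $p = q$. A minimal sketch, with a hypothetical value of $q$:

```python
import numpy as np

q = 0.3                       # true event probability (hypothetical)
p = np.linspace(0, 1, 101)    # grid of candidate forecasts

# Expected Brier score of each candidate forecast p
expected_brier = q * (1 - p) ** 2 + (1 - q) * p ** 2

print(p[np.argmin(expected_brier)])  # -> 0.3, the honest forecast
```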
Consistent scoring functions
A scoring function $S$ is consistent for the functional $\mathrm{T}$ relative to the class $\mathcal{F}$ if

$S(t, G) \le S(x, G)$ for all $G \in \mathcal{F}$, all $t \in \mathrm{T}(G)$ and all $x \in \mathbb{R}^k$.

It is strictly consistent if it is consistent and equality in the above equation implies that $x \in \mathrm{T}(G)$.
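Consistency can be illustrated in the same spirit: under squared error the expected score is minimized by the mean of the underlying distribution, while under absolute error it is minimized by the median. A sketch with a simulated skewed sample (the distribution and candidate grid are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # skewed target variable

candidates = np.linspace(0.1, 5.0, 491)
avg_sq = [np.mean((c - y) ** 2) for c in candidates]    # squared error scores
avg_abs = [np.mean(np.abs(c - y)) for c in candidates]  # absolute error scores

print(candidates[np.argmin(avg_sq)], y.mean())       # both near the mean (~1.65)
print(candidates[np.argmin(avg_abs)], np.median(y))  # both near the median (~1.0)
```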
Affine transformation
After an affine transformation, a strictly proper scoring rule remains strictly proper and a strictly consistent scoring function remains strictly consistent. That is, if $\mathbf{S}$ is a strictly proper scoring rule, then $a\mathbf{S} + b$ with $a \neq 0$ is also a strictly proper scoring rule, though if $a < 0$ the optimization sense of the scoring rule switches between maximization and minimization. The same statement applies to scoring functions with the obvious changes.

Locality
A proper scoring rule is said to be local if its estimate for the probability of a specific event depends only on the probability of that event. Informally: the optimal solution of the scoring problem "at a specific event" is invariant to all changes in the observation distribution that leave the probability of that event unchanged. All binary scores are local, because the probability assigned to the event that did not occur is fully determined, leaving no degree of freedom to vary over. Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set that is not binary.
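The distinction can be sketched numerically: the logarithmic score at the observed outcome depends only on the probability assigned to that outcome, whereas the quadratic score also reacts when probability mass is shuffled among the outcomes that did not occur. The forecast vectors below are hypothetical:

```python
import numpy as np

def log_score(r, i):
    return np.log(r[i])               # depends on r[i] only: local

def quadratic_score(r, i):
    return 2 * r[i] - np.sum(r ** 2)  # depends on the whole vector: not local

# Same probability for the observed outcome 0, remaining mass split differently
r1 = np.array([0.5, 0.4, 0.1])
r2 = np.array([0.5, 0.25, 0.25])

print(log_score(r1, 0) == log_score(r2, 0))            # True
print(quadratic_score(r1, 0), quadratic_score(r2, 0))  # 0.58 vs 0.625
```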
Decomposition
The expectation value of a proper scoring rule $\mathbf{S}$ can be decomposed into the sum of three components, called uncertainty, reliability, and resolution, which characterize different attributes of probabilistic forecasts:

$\mathbb{E}[\mathbf{S}] = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}$

If a score is proper and negatively oriented, all three terms are nonnegative.
The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency.
The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies.
The resolution component rewards forecasts that sort the events into groups whose observed frequencies differ from the overall average; since it enters the decomposition with a negative sign, higher resolution lowers the expected score. The equations for the individual components depend on the particular scoring rule.
For the Brier score, they are given by

$\mathrm{UNC} = \bar{x}(1 - \bar{x})$

$\mathrm{REL} = \mathbb{E}_f\big[(f - \pi(f))^2\big]$

$\mathrm{RES} = \mathbb{E}_f\big[(\pi(f) - \bar{x})^2\big]$

where $\bar{x}$ is the average probability of occurrence of the binary event $x$, and $\pi(f)$ is the conditional event probability given the forecast $f$, i.e. $\pi(f) = P(x = 1 \mid f)$.
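When the forecasts take only finitely many distinct values, the decomposition can be computed directly by grouping the observations by forecast value; the sketch below (with hypothetical data) verifies that REL − RES + UNC recovers the mean Brier score exactly.

```python
import numpy as np

# Hypothetical binary observations x and forecast probabilities f,
# where f takes a small number of distinct values
f = np.array([0.2, 0.2, 0.2, 0.7, 0.7, 0.7, 0.7, 0.9, 0.9, 0.9])
x = np.array([0,   0,   1,   1,   0,   1,   1,   1,   1,   0  ])

x_bar = x.mean()
unc = x_bar * (1 - x_bar)

rel = res = 0.0
for value in np.unique(f):
    mask = f == value
    pi = x[mask].mean()   # conditional event frequency given this forecast value
    w = mask.mean()       # fraction of cases with this forecast value
    rel += w * (value - pi) ** 2
    res += w * (pi - x_bar) ** 2

brier = np.mean((f - x) ** 2)
print(np.isclose(brier, rel - res + unc))  # -> True
```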
Examples of proper scoring rules
There are infinitely many scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.

Categorical variables
For a categorical response variable with $m$ mutually exclusive events, $y \in \{1, \dots, m\}$, a probabilistic forecaster or algorithm will return a probability vector $\mathbf{r} = (r_1, \dots, r_m)$ with a probability for each of the $m$ outcomes. If the $i$th event materializes, one often abbreviates the score as $\mathbf{S}(\mathbf{r}, i)$.
Logarithmic score
The logarithmic scoring rule is a strictly proper and local scoring rule, here given in its positively oriented form:

$\mathbf{L}(\mathbf{r}, i) = \ln(r_i)$

Its expected value under the true distribution is the negative of the Shannon entropy, which is commonly used as a scoring criterion in Bayesian inference; this scoring rule has strong foundations in information theory. The score is calculated as the logarithm of the probability estimate for the actual outcome. For example, a prediction of 80% that correctly proved true would receive a score of $\ln(0.8) \approx -0.22$. This same prediction also assigns 20% likelihood to the opposite case, and so if the prediction proves false, it would receive a score based on the 20%: $\ln(0.2) \approx -1.6$. The goal of a forecaster is to maximize the score, and $-0.22$ is indeed larger than $-1.6$.
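The arithmetic of this example is just the natural logarithm of the probability assigned to whichever outcome materialized:

```python
import math

# Positively oriented logarithmic score: ln of the probability
# assigned to the realized outcome
print(math.log(0.8))  # the 80% prediction comes true:  ~ -0.22
print(math.log(0.2))  # the 20% complement occurs:      ~ -1.61
```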
If one treats the truth or falsity of the prediction as a binary variable $x$ with value 1 or 0 respectively, and the expressed probability as $p$, then one can write the logarithmic scoring rule as $x \ln(p) + (1 - x)\ln(1 - p)$. Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under affine transformation. That is,

$\mathbf{L}(\mathbf{r}, i) = \log_b(r_i)$

is strictly proper for all $b > 1$.
Brier/Quadratic score
The quadratic scoring rule is a strictly proper scoring rule:

$Q(\mathbf{r}, i) = 2 r_i - \sum_{j=1}^{m} r_j^2$

where $r_i$ is the probability assigned to the correct answer.
The Brier score, originally proposed by Glenn W. Brier in 1950, can be obtained by an affine transform from the quadratic scoring rule.
$B(\mathbf{r}, i) = \sum_{j=1}^{m} (r_j - o_j)^2$

where $o_j = 1$ when the $j$th event is the correct one and $o_j = 0$ otherwise. It can be thought of as a generalization of mean squared error to probabilistic forecasts.
An important difference between these two rules is that a forecaster should strive to maximize the quadratic score yet minimize the Brier score. This is because $B(\mathbf{r}, i) = 1 - Q(\mathbf{r}, i)$: the negative coefficient in the affine transformation between them flips the orientation.
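A small sketch (with a hypothetical forecast vector) makes the relationship concrete:

```python
import numpy as np

def quadratic_score(r, i):
    """Positively oriented quadratic score Q(r, i) = 2*r_i - sum_j r_j^2."""
    return 2 * r[i] - np.sum(r ** 2)

def brier_score(r, i):
    """Negatively oriented Brier score B(r, i) = sum_j (r_j - o_j)^2."""
    o = np.zeros_like(r)
    o[i] = 1.0
    return np.sum((r - o) ** 2)

r = np.array([0.7, 0.2, 0.1])  # hypothetical forecast over three classes
i = 0                          # class 0 materializes
print(brier_score(r, i), 1 - quadratic_score(r, i))  # both 0.14: B = 1 - Q
```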