E-values
In statistical hypothesis testing, e-values quantify the evidence in the data against a null hypothesis. They serve as a more robust alternative to p-values, addressing some shortcomings of the latter.
In contrast to p-values, e-values can deal with optional continuation: e-values of subsequent experiments may simply be multiplied to provide a new, "product" e-value that represents the evidence in the joint experiment. This works even if, as often happens in practice, the decision to perform later experiments may depend in vague, unknown ways on the data observed in earlier experiments, and it is not known beforehand how many trials will be conducted: the product e-value remains a meaningful quantity, leading to tests with Type-I error control. For this reason, e-values and their sequential extension, the e-process, are the fundamental building blocks for anytime-valid statistical methods. Another advantage over p-values is that any weighted average of e-values remains an e-value, even if the individual e-values are arbitrarily dependent. This is one of the reasons why e-values have also turned out to be useful tools in multiple testing.
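Both merging rules follow from short calculations. The display below (a standard argument, written out here for concreteness) treats two e-variables $E_1$ and $E_2$: for the product, $E_2$ is assumed to be an e-variable conditionally on the data $Y_1$ of the first experiment; for the average, the weights $w_1, w_2 \ge 0$ sum to one and $E_1, E_2$ may be arbitrarily dependent:
$$\mathbb{E}_P[E_1 E_2] = \mathbb{E}_P\!\big[E_1\,\mathbb{E}_P[E_2 \mid Y_1]\big] \le \mathbb{E}_P[E_1] \le 1, \qquad \mathbb{E}_P[w_1 E_1 + w_2 E_2] = w_1\,\mathbb{E}_P[E_1] + w_2\,\mathbb{E}_P[E_2] \le w_1 + w_2 = 1.$$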
E-values can be interpreted in a number of different ways: first, an e-value can be interpreted as a rescaling of a test, presenting it on a more appropriate scale that facilitates merging e-values.
Second, the reciprocal of an e-value is a p-value, but not just any p-value: a special p-value for which a rejection "at level $p$" retains a generalized Type-I error guarantee. Third, they are broad generalizations of likelihood ratios and are also related to, yet distinct from, Bayes factors. Fourth, they have an interpretation as bets. Fifth, in a sequential context, they can also be interpreted as increments of nonnegative supermartingales. Interest in e-values has exploded since 2019, when the term 'e-value' was coined and a number of breakthrough results were achieved by several research groups. The first overview article appeared in 2023.
Definition and mathematical background
Let the null hypothesis $H_0$ be given as a set of distributions for data $Y$. Usually $Y = (X_1, \ldots, X_\tau)$, with each $X_i$ a single outcome and $\tau$ a fixed sample size or some stopping time. We shall refer to such $Y$, which represents the full sequence of outcomes of a statistical experiment, as a sample or batch of outcomes; but in some cases $Y$ may also be an unordered bag of outcomes or a single outcome.

An e-variable or e-statistic is a nonnegative random variable $E = E(Y)$ such that under all $P \in H_0$, its expected value is bounded by 1:
$$\mathbb{E}_P[E] \le 1 \quad \text{for all } P \in H_0.$$
The value taken by the e-variable $E$ is called the e-value. In practice, the term e-value is often used when one is really referring to the underlying e-variable.
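As a quick numerical illustration (not part of the standard exposition), the following sketch checks the defining property by simulation for a simple likelihood-ratio e-variable, anticipating the likelihood-ratio interpretation discussed below; the choice of the alternative N(1, 1) is arbitrary.

```python
import numpy as np

# Illustrative sketch: the likelihood ratio of Q = N(1, 1) to the null
# P0 = N(0, 1) is an e-variable, so its average under the null should not
# exceed 1 (here its expectation equals exactly 1).
rng = np.random.default_rng(0)

def e_variable(x):
    # Likelihood ratio q(x) / p0(x) for q = N(1, 1) and p0 = N(0, 1).
    return np.exp(x - 0.5)

x_null = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # data drawn under H0
print(e_variable(x_null).mean())  # approximately 1
```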
Interpretations
As the continuous interpretation of a test
A test for a null hypothesis $H_0$ is traditionally modeled as a function $f$ from the data $Y$ to $\{0, 1\}$, where $1$ denotes rejection of the hypothesis. A test is said to be valid for level $\alpha$ if
$$P(f(Y) = 1) \le \alpha \quad \text{for all } P \in H_0.$$
This is classically conveniently summarized as a function $f$ from the data to $\{0, 1\}$ that satisfies
$$\mathbb{E}_P[f(Y)] \le \alpha \quad \text{for all } P \in H_0.$$
Moreover, this is sometimes generalized to permit external randomization by letting the test take values in $[0, 1]$. Here, its value is interpreted as the probability with which one should subsequently reject the hypothesis.
An issue with modelling a test in this manner is that the traditional decision space $\{0, 1\}$ or $[0, 1]$ does not encode the level $\alpha$ at which the test rejects. This is odd at best, because a rejection at level 1% is a much stronger claim than a rejection at level 10%. A more suitable decision space seems to be $\{0, 1/\alpha\}$.
The e-value can be interpreted as resolving this problem. Indeed, we can rescale $\{0, 1\}$ to $\{0, 1/\alpha\}$ and $[0, 1]$ to $[0, 1/\alpha]$ by rescaling the test by its level:
$$E = \frac{f}{\alpha},$$
where we denote a test on this evidence scale by $E$ to avoid confusion. Such a test is then valid if
$$\mathbb{E}_P[E] \le 1 \quad \text{for all } P \in H_0.$$
That is: it is valid if it is an e-value.
In fact, this reveals that e-values bounded to $[0, 1/\alpha]$ are rescaled randomized tests that are continuously interpreted as evidence against the hypothesis. The standard e-value, which takes values in $[0, \infty)$, appears as a generalization of a level $0$ test.
This interpretation shows that e-values are indeed fundamental to testing: they are equivalent to tests, thinly veiled by a rescaling. From this perspective, it may be surprising that typical e-values look very different from traditional tests: maximizing the objective
$$Q(E = 1/\alpha)$$
for an alternative hypothesis $Q$ would yield traditional Neyman-Pearson style tests. Indeed, this maximizes the probability under $Q$ that $E = 1/\alpha$: the power of the test.
But if we continuously interpret the value of the test as evidence against the hypothesis, then we may also be interested in maximizing different targets such as
$$\mathbb{E}_Q[\log E].$$
This yields tests that are remarkably different from traditional Neyman-Pearson tests, and more suitable when merged through multiplication, as they are positive with probability 1 under the alternative $Q$. From this angle, the main innovation of the e-value compared to traditional testing is to maximize a different power target.
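The difference between the two targets can be made concrete with a small simulation. The sketch below (an illustration under an assumed Gaussian setting, not taken from the literature) compares the e-power $\mathbb{E}_Q[\log E]$ of the rescaled Neyman-Pearson test and of the likelihood-ratio e-variable for $H_0: N(0,1)$ against $Q = N(1,1)$, based on a single observation.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: e-power E_Q[log E] of two e-variables for H0: N(0, 1)
# against the alternative Q = N(1, 1), estimated by Monte Carlo.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=1.0, size=1_000_000)  # data drawn under Q
alpha = 0.05

# Rescaled Neyman-Pearson test: E = (1/alpha) * 1{X > z_{1-alpha}}.
# Under H0 it satisfies E_P[E] = P(X > z_{1-alpha}) / alpha = 1.
np_e = (x > norm.ppf(1 - alpha)).astype(float) / alpha

# Likelihood-ratio e-variable: E = q(X) / p0(X) = exp(X - 1/2).
lr_e = np.exp(x - 0.5)

# The rescaled test is 0 with positive probability under Q, so its e-power
# is minus infinity; the likelihood ratio has strictly positive e-power.
with np.errstate(divide="ignore"):
    print("e-power, rescaled NP test:", np.mean(np.log(np_e)))  # -inf
print("e-power, likelihood ratio:", np.mean(np.log(lr_e)))      # about 0.5
```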
As p-values with a stronger data-dependent-level Type-I error guarantee
For any e-variable $E$, any $\alpha \in (0, 1]$ and all $P \in H_0$, it holds by Markov's inequality that
$$P(E \ge 1/\alpha) \le \alpha\, \mathbb{E}_P[E] \le \alpha.$$
This means that $1/E$ is a valid p-value. Moreover, the e-value based test with significance level $\alpha$, which rejects $H_0$ if $E \ge 1/\alpha$, has a Type-I error bounded by $\alpha$. But, whereas with standard p-values the inequality above is usually an equality or near-equality, this is not the case with e-variables. This makes e-value-based tests more conservative than those based on standard p-values.
In exchange for this conservativeness, the p-value $1/E$ comes with a stronger guarantee. In particular, for every possibly data-dependent significance level $\tilde\alpha > 0$, we have
$$\mathbb{E}_P\!\left[\frac{\mathbf{1}\{p \le \tilde\alpha\}}{\tilde\alpha}\right] \le 1 \quad \text{for all } P \in H_0$$
if and only if $p = 1/E$ for some e-variable $E$. This means that a p-value satisfies this guarantee if and only if it is the reciprocal of an e-variable.
The interpretation of this guarantee is that, on average, the relative Type-I error distortion caused by rejecting at a data-dependent level $\tilde\alpha$ is controlled, for every choice of $\tilde\alpha$. Traditional p-values only satisfy this guarantee for data-independent or pre-specified levels.
This stronger guarantee is also called the post-hoc Type-I error, as it allows one to choose the significance level after observing the data: post-hoc. A p-value that satisfies this guarantee is also called a post-hoc p-value. As $p$ is a post-hoc p-value if and only if $p = 1/E$ for some e-value $E$, it is possible to view this as an alternative definition of an e-value.
Under this post-hoc Type-I error, the problem of choosing the significance level vanishes: we can simply choose the smallest data-dependent level at which we reject the hypothesis by setting it equal to the post-hoc p-value: $\tilde\alpha = p = 1/E$. Indeed, at this data-dependent level we have
$$\mathbb{E}_P\!\left[\frac{\mathbf{1}\{p \le \tilde\alpha\}}{\tilde\alpha}\right] = \mathbb{E}_P\!\left[\frac{1}{p}\right] = \mathbb{E}_P[E] \le 1,$$
since $E$ is an e-variable.
As a consequence, we can truly reject at level $p = 1/E$ and still retain the post-hoc Type-I error guarantee. For a traditional p-value, rejecting at level $p$ comes with no such guarantee.
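The following simulation sketch (an assumed Gaussian example, not from the literature) illustrates the contrast: under the null, the post-hoc quantity $\mathbb{E}_P[\mathbf{1}\{p \le \tilde\alpha\}/\tilde\alpha]$ at the data-dependent level $\tilde\alpha = p$ reduces to $\mathbb{E}_P[1/p]$, which is unbounded for a standard uniform p-value but at most 1 for the reciprocal of an e-variable.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: under H0 = N(0, 1), compare E_P[1/p] for a standard
# one-sided p-value and for the post-hoc p-value 1/E, where E = exp(X - 1/2)
# is the likelihood-ratio e-variable against the alternative N(1, 1).
rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)        # data drawn under the null

p_standard = norm.sf(x)               # uniformly distributed under H0
p_posthoc = np.exp(0.5 - x)           # 1/E for the e-variable E = exp(X - 1/2)

# At the data-dependent level equal to the p-value itself, the post-hoc
# quantity reduces to the sample mean of 1/p.
print(np.mean(1.0 / p_standard))      # large and unstable: E_P[1/p] is infinite
print(np.mean(1.0 / p_posthoc))       # close to 1: it equals E_P[E] <= 1
```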
Moreover, a post-hoc p-value inherits the optional continuation and merging properties of e-values. But instead of an arithmetic weighted average, it is a weighted harmonic average of post-hoc p-values that is still a post-hoc p-value.
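To see why the harmonic average is the natural merging rule here, suppose $p_1, \ldots, p_K$ are post-hoc p-values with $p_k = 1/E_k$ for e-variables $E_k$, and let the weights $w_1, \ldots, w_K \ge 0$ sum to one. Then
$$\left(\sum_{k=1}^K \frac{w_k}{p_k}\right)^{-1} = \frac{1}{\sum_{k=1}^K w_k E_k},$$
which is again the reciprocal of an e-variable, since a weighted arithmetic average of e-variables is an e-variable.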
As generalizations of likelihood ratios
Let $H_0 = \{P_0\}$ be a simple null hypothesis. Let $Q$ be any other distribution on $Y$, and let
$$E = \frac{q(Y)}{p_0(Y)}$$
be their likelihood ratio, with $q$ and $p_0$ the densities of $Q$ and $P_0$. Then $E$ is an e-variable. Conversely, any e-variable relative to a simple null $P_0$ can be written as a likelihood ratio with respect to some distribution $Q$. Thus, when the null is simple, e-variables coincide with likelihood ratios. E-variables exist for general composite nulls as well though, and they may then be thought of as generalizations of likelihood ratios. The two main ways of constructing e-variables, universal inference (UI) and the reverse information projection (RIPr), both lead to expressions that are variations of likelihood ratios as well.
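That the likelihood ratio is an e-variable follows from a one-line computation (assuming, for simplicity, that $p_0 > 0$ wherever $q > 0$):
$$\mathbb{E}_{P_0}\!\left[\frac{q(Y)}{p_0(Y)}\right] = \int \frac{q(y)}{p_0(y)}\, p_0(y)\, dy = \int q(y)\, dy = 1.$$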
Two other standard generalizations of the likelihood ratio are (a) the generalized likelihood ratio as used in the standard, classical likelihood ratio test and (b) the Bayes factor. Importantly, neither (a) nor (b) are e-variables in general: generalized likelihood ratios in sense (a) are not e-variables unless the alternative is simple; Bayes factors are e-variables if the null is simple. To see this, note that, if $H_1 = \{Q_\theta : \theta \in \Theta_1\}$ represents a statistical model and $w_1$ is a prior density on $\Theta_1$, then we can set $Q$ as above to be the Bayes marginal distribution with density
$$q(Y) = \int_{\Theta_1} q_\theta(Y)\, w_1(\theta)\, d\theta,$$
and then $E = q(Y)/p_0(Y)$ is also a Bayes factor of $H_1$ vs. $H_0$. If the null is composite, then some special e-variables can be written as Bayes factors with some very special priors, but most Bayes factors one encounters in practice are not e-variables and many e-variables one encounters in practice are not Bayes factors.
As bets
Suppose you can buy a ticket for 1 monetary unit, with nonnegative pay-off $E$. The statements "$E$ is an e-variable" and "if the null hypothesis is true, you do not expect to gain any money if you engage in this bet" are logically equivalent. This is because $E$ being an e-variable means that the expected gain of buying the ticket is the pay-off minus the cost, i.e. $E - 1$, which has expectation at most $0$. Based on this interpretation, the product e-value for a sequence of tests can be interpreted as the amount of money you have gained by sequentially betting with pay-offs given by the individual e-variables and always re-investing all your gains.

The betting interpretation becomes particularly visible if we rewrite an e-variable as $E = 1 + \lambda U$, where $U$ has expectation $0$ under all $P \in H_0$ and $\lambda$ is chosen so that $E \ge 0$ almost surely. Any e-variable can be written in the form $1 + \lambda U$, although with parametric nulls, writing it as a likelihood ratio is usually mathematically more convenient. The form $1 + \lambda U$, on the other hand, is often more convenient in nonparametric settings. As a prototypical example, consider the case that $Y = (X_1, \ldots, X_n)$ with the $X_i$ taking values in the bounded interval $[0, 1]$. According to $H_0$, the $X_i$ are i.i.d. according to a distribution $P$ with mean $\mu$; no other assumptions about $P$ are made. Then we may first construct a family of e-variables for single outcomes, $E_{i,\lambda} := 1 + \lambda(X_i - \mu)$, for any $\lambda \in [-1/(1-\mu),\, 1/\mu]$. We may then define a new e-variable for the complete data vector by taking the product
$$E := \prod_{i=1}^n E_{i, \breve\lambda_i},$$
where $\breve\lambda_i$ is an estimate for $\lambda$, based only on past data $X_1, \ldots, X_{i-1}$, and designed to make $E$ as large as possible in the "e-power" or "GRO" sense. Waudby-Smith and Ramdas use this approach to construct "nonparametric" confidence intervals for the mean that tend to be significantly narrower than those based on more classical methods such as Chernoff, Hoeffding and Bernstein bounds.
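A minimal simulation sketch of this construction is shown below; the plug-in rule for $\lambda$ (betting a fraction $c$ of the gap between the running mean and $\mu$) is an illustrative choice, not the optimized strategy of Waudby-Smith and Ramdas.

```python
import numpy as np

def product_e_value(x, mu, c=0.5):
    """Betting-style e-value for H0: the X_i are i.i.d. on [0, 1] with mean mu.

    Computes the product of per-outcome e-variables 1 + lambda_i * (x_i - mu),
    where lambda_i is predictable (it depends only on x_1, ..., x_{i-1}).
    The plug-in rule below (bet towards the running mean, scaled by the
    illustrative constant c) is a simple choice, not an optimized GRO strategy.
    """
    e = 1.0
    for i, xi in enumerate(x):
        past_mean = np.mean(x[:i]) if i > 0 else mu   # estimate based on past data only
        lam = c * (past_mean - mu)                    # bet in the direction of the evidence
        lam = np.clip(lam, -1 / (1 - mu), 1 / mu)     # keep every factor nonnegative
        e *= 1 + lam * (xi - mu)
    return e

rng = np.random.default_rng(3)
print(product_e_value(rng.uniform(0, 1, 200), mu=0.5))  # H0 true: typically stays small
print(product_e_value(rng.beta(4, 2, 200), mu=0.5))     # true mean 2/3: tends to grow large
```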