Power (statistics)


In frequentist statistics, power is the probability of detecting a given effect, if that effect actually exists, using a given test in a given context. In typical use, it is a function of the specific test that is used, the sample size, and the effect size.
More formally, in the case of a simple hypothesis test with two hypotheses, the power of the test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. It is commonly denoted by 1 − β, where β is the probability of making a type II error conditional on there being a true effect or association.

Background

Statistical hypothesis testing uses data from samples to assess, or make inferences about, a statistical population. For example, we may measure the yields of samples of two varieties of a crop and use a two-sample test to assess whether the mean values of this yield differ between varieties.
Under a frequentist hypothesis testing framework, this is done by calculating a test statistic for the dataset, which has a known theoretical probability distribution if there is no difference. If the actual value calculated on the sample is sufficiently unlikely to arise under the null hypothesis, we say we identified a statistically significant effect.
The threshold for significance can be set small to ensure there is little chance of falsely detecting a non-existent effect. However, failing to identify a significant effect does not imply there was none. If we insist on being careful to avoid false positives, we may create false negatives instead. It may simply be too much to expect that we will be able to find satisfactorily strong evidence of a very subtle difference even if it exists. Statistical power is an attempt to quantify this issue.
In the case of the comparison of the two crop varieties, it enables us to answer questions like:
  • Is there a big danger of two very different varieties producing samples that just happen to look indistinguishable by pure chance?
  • How much effort do we need to put into this comparison to avoid that danger?
  • How different do these varieties need to be before we can expect to notice a difference?
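As a brief illustration of this setup, here is a minimal sketch in Python of such a two-sample comparison; the yield figures below are invented purely for illustration, and the t-test is just one possible choice of test.
    # A minimal two-sample comparison of crop yields (illustrative, invented data):
    # test whether the mean yields of two varieties differ.
    from scipy import stats

    variety_a = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2]   # hypothetical yields, tonnes/ha
    variety_b = [5.4, 5.6, 5.2, 5.7, 5.5, 5.3]

    t_stat, p_value = stats.ttest_ind(variety_a, variety_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # A small p-value is evidence against the null hypothesis of equal means;
    # a large p-value does not show the varieties are equivalent -- with only
    # six plots per variety the test may simply lack the power to detect a
    # modest difference.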

Description

Suppose we are conducting a hypothesis test. We define two hypotheses: the null hypothesis H₀ and the alternative hypothesis H₁. If we design the test such that α is the significance level, then the power of the test is 1 − β, where β is the probability of failing to reject H₀ when H₁ is true.
                 Probability to reject H₀    Probability to not reject H₀
If H₀ is true    α                           1 − α
If H₁ is true    1 − β                       β

To make this more concrete, a typical statistical test would be based on a test statistic t calculated from the sampled data, which has a particular probability distribution under H₀. A desired significance level α would then define a corresponding "rejection region", a set of values t is unlikely to take if H₀ were correct. If we reject H₀ in favor of H₁ only when the sample t takes those values, we would be able to keep the probability of falsely rejecting H₀ within our desired significance level. At the same time, if H₁ defines its own probability distribution for t, the power of the test would be the probability, under H₁, that the sample t falls into our defined rejection region and causes H₀ to be correctly rejected.
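As a numerical sketch, the power of a two-sided one-sample z-test can be computed directly from these two distributions of the test statistic; the effect size, standard deviation and sample size below are assumptions chosen for illustration.
    # Power of a two-sided one-sample z-test, computed from the rejection region.
    # H0: mean = 0; H1: mean = effect.  All numbers are illustrative assumptions.
    import numpy as np
    from scipy.stats import norm

    alpha = 0.05        # significance level
    effect = 0.5        # true mean under H1, in raw units
    sigma = 1.0         # known population standard deviation
    n = 25              # sample size

    crit = norm.ppf(1 - alpha / 2)        # rejection region: |z| > crit under H0
    shift = effect * np.sqrt(n) / sigma   # mean of the test statistic under H1
    power = norm.sf(crit - shift) + norm.cdf(-crit - shift)
    print(f"power = {power:.3f}")         # probability that H0 is correctly rejected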
Statistical power is one minus the type II error probability and is also the sensitivity of the hypothesis testing procedure to detect a true effect. There is usually a trade-off between demanding more stringent tests and trying to have a high probability of rejecting the null under the alternative hypothesis. Statistical power may also be extended to the case where multiple hypotheses are being tested based on an experiment or survey. It is thus also common to refer to the power of a study, evaluating a scientific project in terms of its ability to answer the research questions it seeks to answer.

Applications

The main application of statistical power is "power analysis", a calculation of power, usually done before an experiment is conducted, using data from pilot studies or a literature review. Power analyses can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. For example: "How many times do I need to toss a coin to conclude it is rigged by a certain amount?" If resources and thus sample sizes are fixed, power analyses can also be used to calculate the minimum effect size that is likely to be detected.
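The coin-tossing question can, for instance, be answered by simulation: generate experiments with an assumed degree of bias and count how often the test rejects fairness. The bias, significance level and simulation settings below are assumptions made for illustration.
    # How many tosses are needed to detect a coin biased to 60% heads with a
    # two-sided binomial test at alpha = 0.05?  (Bias and alpha are assumptions.)
    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(0)
    alpha, p_biased, n_sims = 0.05, 0.6, 2000

    def estimated_power(n_tosses):
        # Simulate experiments with the biased coin and count how often the
        # test of the fair-coin null hypothesis rejects at level alpha.
        heads = rng.binomial(n_tosses, p_biased, size=n_sims)
        rejections = sum(binomtest(int(k), n_tosses, p=0.5).pvalue < alpha
                         for k in heads)
        return rejections / n_sims

    for n_tosses in (50, 100, 200, 400):
        print(n_tosses, round(estimated_power(n_tosses), 2))
    # Power rises with the number of tosses; on the order of 200 tosses are
    # needed for about 80% power against this particular bias.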
Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis. An underpowered study is likely to be inconclusive, failing to allow one to choose between hypotheses at the desired significance level, while an overpowered study will spend great expense to be able to report significant effects even if they are tiny and so practically meaningless. If a large number of underpowered studies are done and statistically significant results published, published findings are more likely to be false positives than true results, contributing to a replication crisis. However, excessive demands for power could lead to wasted resources and ethical problems, for example the use of a large number of animal test subjects when a smaller number would have been sufficient. It could also induce researchers seeking funding to overstate their expected effect sizes, or to avoid looking for more subtle interaction effects that cannot be easily detected.
Power analysis is primarily a frequentist statistics tool. In Bayesian statistics, hypothesis testing of the type used in classical power analysis is not done. In the Bayesian framework, one updates his or her prior beliefs using the data obtained in a given study. In principle, a study that would be deemed underpowered from the perspective of hypothesis testing could still be used in such an updating process. However, power remains a useful measure of how much a given experiment size can be expected to refine one's beliefs. A study with low power is unlikely to lead to a large change in beliefs.
In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric test and a nonparametric test of the same hypothesis. Tests may have the same size, and hence the same false positive rates, but different ability to detect true effects. Consideration of their theoretical power properties is a key reason for the common use of likelihood ratio tests.
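As a sketch of such a comparison, a small simulation can estimate the power of the two-sample t-test and of the Mann-Whitney U test against the same true shift; the normal data and the size of the shift are assumptions made for illustration.
    # Estimate, by simulation, the power of a two-sample t-test and of the
    # nonparametric Mann-Whitney U test for the same shift in normal data.
    # The shift, group size and alpha are illustrative assumptions.
    import numpy as np
    from scipy.stats import ttest_ind, mannwhitneyu

    rng = np.random.default_rng(1)
    alpha, n, shift, n_sims = 0.05, 30, 0.5, 2000
    t_hits = u_hits = 0

    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(shift, 1.0, n)     # a true difference in means exists
        t_hits += ttest_ind(x, y).pvalue < alpha
        u_hits += mannwhitneyu(x, y).pvalue < alpha

    print(f"t-test power ~ {t_hits / n_sims:.2f}, "
          f"Mann-Whitney power ~ {u_hits / n_sims:.2f}")
    # Both tests hold the same false positive rate, but under normality the
    # t-test is the (slightly) more powerful of the two; with heavy-tailed
    # data the ranking can reverse.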

Rule of thumb for t-test

Lehr's rule of thumb says that the sample size n (for each group) for the common case of a two-sided two-sample t-test with power 80% and significance level 0.05 should be:
n = 16s² / d²
where s² is an estimate of the population variance and d the to-be-detected difference in the mean values of both samples. This expression can be rearranged, implying for example that 80% power is obtained when looking for a difference in means that exceeds about 4 times the group-wise standard error of the mean.
For a one-sample t-test, 16 is to be replaced with 8. Other values provide an appropriate approximation when the desired power or significance level is different.
However, a full power analysis should always be performed to confirm and refine this estimate.
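As a sketch of such a check, the rule can be compared with an exact power calculation based on the noncentral t distribution; the standard deviation and difference below are invented for illustration.
    # Lehr's rule of thumb versus an exact power calculation for a two-sided
    # two-sample t-test (alpha = 0.05, target power 80%).  The values of s and
    # d are illustrative assumptions.
    import numpy as np
    from scipy.stats import t, nct

    s, d, alpha = 2.0, 1.0, 0.05               # sd estimate, difference to detect
    n_lehr = int(np.ceil(16 * s**2 / d**2))    # Lehr's rule: n per group

    def exact_power(n):
        df = 2 * n - 2
        crit = t.ppf(1 - alpha / 2, df)        # rejection threshold for |t|
        ncp = d / (s * np.sqrt(2 / n))         # noncentrality parameter under H1
        return nct.sf(crit, df, ncp) + nct.cdf(-crit, df, ncp)

    print(n_lehr, round(exact_power(n_lehr), 3))
    # Lehr's rule gives 64 per group here; the exact calculation gives power
    # very close to 0.80 at that sample size.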

Factors influencing power

Statistical power may depend on a number of factors. Some factors may be particular to a specific testing situation, but in normal use, power depends on the following three aspects that can be potentially controlled by the practitioner:
For a given test, the significance criterion determines the desired degree of rigor, specifying how unlikely it is for the null hypothesis of no effect to be rejected if it is in fact true. The most commonly used threshold is a probability of rejection of 0.05, though smaller values like 0.01 or 0.001 are sometimes used. This threshold then implies that the observation must be at least that unlikely to be considered strong enough evidence against the null. Picking a smaller value to tighten the threshold, so as to reduce the chance of a false positive, would also reduce power. Some statistical tests will inherently produce better power, albeit often at the cost of requiring stronger assumptions.
The magnitude of the effect of interest defines what is being looked for by the test. It can be the expected effect size if it exists, as a scientific hypothesis that the researcher has arrived at and wishes to test. Alternatively, in a more practical context it could be determined by the size the effect must be to be useful, for example that which is required to be clinically significant. An effect size can be a direct value of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. If the researcher is looking for a larger effect, then it should be easier to find with a given experimental or analytic setup, and so power is higher.
The nature of the sample underlies the information being used in the test. This will usually involve the sample size and the sample variability, if that is not implicit in the definition of the effect size. More broadly, the precision with which the data are measured can also be an important factor, as well as the design of an experiment or observational study. Ultimately, these factors determine the expected amount of sampling error. A smaller sampling error could be obtained from larger sample sizes, from a less variable population, from more accurate measurements, or from more efficient experimental designs, and such smaller errors would lead to improved power, albeit usually at a cost in resources. How increased sample size translates to higher power is a measure of the efficiency of the test, for example in terms of the sample size required for a given power.
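These influences can be made concrete with a small sketch that recomputes the power of a two-sided two-sample t-test while varying one factor at a time; it relies on the statsmodels library, and the baseline values (standardized effect 0.5, 30 observations per group, α = 0.05) are assumptions chosen for illustration.
    # Power of a two-sided two-sample t-test as each controllable factor is
    # varied in turn.  Baseline values are illustrative assumptions.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    def p(effect, nobs, alpha):
        # effect: standardized effect size; nobs: sample size per group.
        return round(analysis.power(effect_size=effect, nobs1=nobs, alpha=alpha), 2)

    print("stricter alpha:", [p(0.5, 30, a) for a in (0.05, 0.01, 0.001)])
    print("larger effect :", [p(d, 30, 0.05) for d in (0.2, 0.5, 0.8)])
    print("larger sample :", [p(0.5, n, 0.05) for n in (10, 30, 100)])
    # Tightening the significance level lowers power, while larger effect
    # sizes and larger samples raise it.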