Replication crisis
The replication crisis, also known as the reproducibility or replicability crisis, refers to widespread failures to reproduce published scientific results. Because the reproducibility of empirical results is the cornerstone of the scientific method, such failures undermine the credibility of theories and challenge substantial parts of scientific knowledge.
Psychology and medicine have been focal points for replication efforts, with researchers systematically reexamining classic studies to verify their reliability and, when failures emerge, to identify the underlying causes. Data strongly indicates that other natural and social sciences are also affected.
The phrase "replication crisis" was coined in the early 2010s as part of a growing awareness of the problem. Considerations of causes and remedies have given rise to a new scientific discipline known as metascience, which uses methods of empirical research to examine empirical research practice.
Researchers distinguish two forms of reproducibility. Reproducibility in a narrow sense refers to reexamining and validating the analysis of a set of data. The second category, replication, involves repeating an experiment or study with new, independent data to verify the original conclusions.
Background
Replication
Replication has been called "the cornerstone of science". Environmental health scientist Stefan Schmidt opened a 2009 review with a description of replication, but no universal definition of replication or related concepts has been agreed on. Replication types include:
- direct,
- systematic, and
- conceptual.
Replication failures do not necessarily indicate that affected fields lack scientific rigor. Rather, they reflect the normal operation of science: a mechanism by which unsupported hypotheses are eliminated, but one that often functions slowly and inconsistently.
A hypothesis is generally considered supported when the results match the predicted pattern and that pattern is found to be statistically significant. Under the null hypothesis assumption, results are deemed statistically significant when their probability falls below a predetermined threshold; this answers the question of how unlikely such results would be by chance alone if no true effect existed in the statistical population. Equivalently, if the test statistic exceeds the chosen critical value, the results are considered statistically significant. The p-value represents the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. The standard threshold p < 0.05 means accepting a 5% false positive rate. Some fields use smaller thresholds, such as p < 0.01 or p < 0.001, but a smaller chance of a false positive requires either larger sample sizes or accepting a greater chance of a false negative. Although p-value testing is the most commonly used method, it is not the only one.
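To make the 5% false positive rate concrete, the following minimal simulation (a sketch for illustration; the group sizes, the number of simulated studies, and the use of a two-sample t-test are assumptions, not from the article) repeatedly tests two samples drawn from the same distribution, so the null hypothesis is true by construction, and counts how often p < 0.05 is nonetheless obtained.

```python
# Illustrative sketch: when no true effect exists, a p < 0.05 threshold
# flags roughly 5% of studies as "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_studies):
    # Both groups come from the same distribution: the null hypothesis is true.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_studies:.3f}")  # close to 0.05
```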
Statistics
Certain terms commonly used in discussions of the replication crisis have technically precise meanings, which are presented here.
In the most common case, null hypothesis testing, there are two hypotheses, a null hypothesis and an alternative hypothesis. The null hypothesis is typically of the form "X and Y are statistically independent". For example, the null hypothesis might be "taking drug X does not change the 1-year recovery rate from disease Y", and the alternative hypothesis is that it does.
As testing for full statistical independence is difficult, the full null hypothesis is often reduced to a simplified null hypothesis "the effect size is 0", where "effect size" is a real number that is 0 if the full null hypothesis is true, and the larger the effect size, the greater the departure from the null hypothesis. For example, if X is binary, then the effect size might be defined as the change in the expectation of Y upon a change of X:
effect size = E[Y | X = 1] - E[Y | X = 0].
Note that the effect size as defined above might be zero even if X and Y are not independent, such as when their relationship is non-linear or when one variable affects different subgroups oppositely. Since different definitions of "effect size" capture different ways for X and Y to be dependent, there are many different definitions of effect size.
In practice, effect sizes cannot be directly observed, but must be measured by statistical estimators. For example, the above definition of effect size is often measured by Cohen's d estimator. The same effect size might have multiple estimators, as they have tradeoffs between efficiency, bias, variance, etc. This further increases the number of possible statistical quantities that can be computed on a single dataset. When an estimator for an effect size is used for statistical testing, it is called a test statistic.
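As an illustration of how such an estimator works, here is a minimal sketch of Cohen's d computed from two simulated samples using the common pooled-standard-deviation form; the function name and the simulated data are assumptions for illustration, not taken from the article.

```python
# Illustrative computation of Cohen's d as an estimator of the effect size
# defined above (difference in expected outcome between two groups).
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (group_a.mean() - group_b.mean()) / pooled_sd

rng = np.random.default_rng(1)
treated = rng.normal(loc=0.5, scale=1.0, size=50)   # true effect of 0.5 SD
control = rng.normal(loc=0.0, scale=1.0, size=50)
print(f"Estimated Cohen's d: {cohens_d(treated, control):.2f}")
```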
A null hypothesis test is a decision procedure which takes in some data and outputs either H0 or H1. If it outputs H1, this is usually stated as "there is a statistically significant effect" or "the null hypothesis is rejected".
Often, the statistical test is a threshold test, which is structured as follows:
- Gather data.
- Compute a test statistic for the data.
- Compare the test statistic t against a critical value (threshold) t_crit. If t ≥ t_crit, output H1; otherwise output H0. (A minimal code sketch of this procedure follows the list.)
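Below is a minimal sketch of such a threshold test, assuming a one-sided z-style test on a standardized sample mean; the function name, sample data, and choice of test are illustrative assumptions rather than anything prescribed by the article.

```python
# Sketch of a one-sided threshold test on a sample mean.
# The critical value is chosen so that the false positive rate equals alpha.
import numpy as np
from scipy import stats

def one_sided_z_test(data: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True (reject H0) if the test statistic exceeds the critical value."""
    # Steps 1-2: gather data and compute the test statistic.
    t = data.mean() / (data.std(ddof=1) / np.sqrt(len(data)))
    # Step 3: compare against the critical value for the chosen alpha.
    t_crit = stats.norm.ppf(1 - alpha)   # about 1.645 for alpha = 0.05
    return t >= t_crit

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.3, scale=1.0, size=40)  # data with a small true effect
print("Reject H0:", one_sided_z_test(sample))
```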
There are 4 possible outcomes of a null hypothesis test: false negative, true negative, false positive, and true positive. A false negative means that H1 is true but the test outputs H0; a true negative means that H0 is true and the test outputs H0, and so on.
|  | Probability to reject H0 | Probability to not reject H0 |
|---|---|---|
| If H0 is true | α | 1 - α |
| If H1 is true | 1 - β | β |
The significance level, false positive rate, or alpha level, denoted α, is the probability of finding the alternative to be true when the null hypothesis is true: α = Pr(test outputs H1 | H0 is true). For example, when the test is a one-sided threshold test, α = Pr(t ≥ t_crit | data ~ H0), where "data ~ H0" means the data is sampled from the distribution specified by the null hypothesis.
Statistical power, or the true positive rate, is the probability of finding the alternative to be true when the alternative hypothesis is true: 1 - β = Pr(test outputs H1 | H1 is true), where β is also called the false negative rate. For example, when the test is a one-sided threshold test, 1 - β = Pr(t ≥ t_crit | data ~ H1).
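Power can also be estimated by simulation: generate data under the alternative hypothesis many times and record how often the test statistic clears the critical value. The sketch below assumes a one-sided z-style test, a true effect of 0.5 standard deviations, and 30 observations per study; all of these values are illustrative assumptions, not from the article.

```python
# Illustrative estimate of statistical power: the fraction of simulated studies
# (with a true effect present) in which the test statistic clears the critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, effect_size, n = 0.05, 0.5, 30
t_crit = stats.norm.ppf(1 - alpha)

n_sims, rejections = 10_000, 0
for _ in range(n_sims):
    data = rng.normal(loc=effect_size, scale=1.0, size=n)   # H1 is true
    t = data.mean() / (data.std(ddof=1) / np.sqrt(n))
    if t >= t_crit:
        rejections += 1

print(f"Estimated power (1 - beta): {rejections / n_sims:.2f}")
```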
Given a statistical test and a data set, the corresponding p-value is the probability that the test statistic is at least as extreme as the one observed, conditional on H0: for a one-sided threshold test, p = Pr(t ≥ t_observed | data ~ H0). If the null hypothesis is true, then the p-value is distributed uniformly on [0, 1]. Otherwise, it is typically peaked at 0 and roughly exponential, though the precise shape of the p-value distribution depends on what the alternative hypothesis is.
Because p-values are distributed uniformly on [0, 1] under the null hypothesis, researchers can set any significance level α by computing the p-value and then outputting H1 if p ≤ α. This is usually stated as "the null hypothesis is rejected at significance level α", or "p < α", as in "smoking is correlated with cancer (p < 0.05)".
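This behavior of p-values, uniform under the null and piled up near 0 under the alternative, can be checked with a short simulation. The sketch below assumes a two-sample t-test and a true effect of 0.5 standard deviations for the non-null case; these choices are illustrative assumptions.

```python
# Sketch: p-values are approximately uniform under the null hypothesis and
# concentrate near 0 when a true effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, n = 5_000, 30

def simulate_pvalues(true_effect: float) -> np.ndarray:
    """Simulate many two-sample t-tests with the given true group difference."""
    pvals = []
    for _ in range(n_sims):
        a = rng.normal(loc=true_effect, scale=1.0, size=n)
        b = rng.normal(loc=0.0, scale=1.0, size=n)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return np.array(pvals)

for effect in (0.0, 0.5):
    p = simulate_pvalues(effect)
    # effect=0.0 -> share near 0.05 (uniform); effect=0.5 -> much higher (peaked near 0)
    print(f"effect={effect}: share of p-values below 0.05 = {(p < 0.05).mean():.2f}")
```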
History
The replication crisis dates to a number of events in the early 2010s. Felipe Romero identified four precursors to the crisis:
- Social priming failures: In the early 2010s, two direct replication attempts failed to reproduce results from social psychologist John Bargh's much-cited "elderly-walking" study. This experiment was part of a series of three studies that had been widely cited throughout the years, was regularly taught in university courses, and had inspired many conceptual replications. These replication failures triggered intense disagreement between replication researchers and the original authors. Notably, many of the conceptual replications of the original studies also failed to replicate in subsequent direct replications.
- Experiments on extrasensory perception: Social psychologist Daryl Bem conducted a series of experiments supposedly providing evidence for the controversial phenomenon of extrasensory perception. Bem faced substantial criticism of his study's methodology. Reanalysis of his data found no evidence for extrasensory perception. The experiment also failed to replicate in subsequent direct replications. According to Romero, what the community found particularly upsetting was that many of the flawed procedures and statistical tools used in Bem's studies were part of common research practice in psychology.
- Biomedical replication failures: Scientists from biotech companies Amgen and Bayer Healthcare reported alarmingly low replication rates of landmark findings in preclinical oncological research.
- P-hacking studies and questionable research practices: Since the late 2000s, a number of studies in metascience showed how commonly adopted practices in many scientific fields, such as exploiting flexibility in data collection and reporting, could greatly increase the probability of false positive results. These studies suggested that a significant proportion of the published literature in several scientific fields could consist of nonreplicable findings.
Although the beginning of the replication crisis can be traced to the early 2010s, some authors point out that concerns about replicability and research practices in the social sciences had been expressed much earlier. Romero notes that authors voiced concerns about the lack of direct replications in psychological research in the late 1960s and early 1970s. He also writes that certain studies in the 1990s were already reporting that journal editors and reviewers are generally biased against publishing replication studies.
In the social sciences, the blog Data Colada has been credited with contributing to the start of the replication crisis.
University of Virginia professor and cognitive psychologist Barbara Spellman has written that many criticisms of research practices and concerns about replicability of research are not new. She reports that between the late 1950s and the 1990s, scholars were already expressing concerns about a possible crisis of replication, a suspiciously high rate of positive findings, questionable research practices, the effects of publication bias, issues with statistical power, and bad standards of reporting.
Spellman also identifies reasons that the reiteration of these criticisms and concerns in recent years led to a full-blown crisis and challenges to the status quo. First, technological improvements facilitated conducting and disseminating replication studies, and analyzing large swaths of literature for systemic problems. Second, the research community's increasing size and diversity made the work of established members more easily scrutinized by other community members unfamiliar with them. According to Spellman, these factors, coupled with increasingly limited resources and misaligned incentives for doing scientific work, led to a crisis in psychology and other fields.
According to Andrew Gelman, the works of Paul Meehl, Jacob Cohen, and Tversky and Kahneman in the 1960s-70s were early warnings of the replication crisis. In discussing the origins of the problem, Kahneman himself noted historical precedents in the failure of subliminal perception and dissonance reduction findings to replicate.
It has been repeatedly pointed out since 1962 that most psychological studies have low statistical power, yet low power has persisted for over 50 years, indicating a structural and persistent problem in psychological research.