Statistical hypothesis test
A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves the calculation of a test statistic. A decision is then made, either by comparing the test statistic to a critical value or, equivalently, by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests are in current use and noteworthy.
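As a minimal illustration of the procedure just described (compute a test statistic, then decide via a p-value), here is a sketch of a two-sided one-sample z-test in Python. The sample values and the known population standard deviation are hypothetical, chosen only to show the mechanics.

```python
import math

def z_test_p_value(sample_mean, pop_mean, pop_sd, n):
    """Two-sided one-sample z-test: returns (test statistic, p-value).

    Assumes the population standard deviation is known and the
    sampling distribution of the sample mean is normal.
    """
    # Test statistic: standardized distance of the sample mean from
    # the value asserted by the null hypothesis.
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    # Standard normal CDF, expressed via the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - cdf)  # two-sided p-value
    return z, p

# Hypothetical data: 25 measurements averaging 102, tested against a
# hypothesized population mean of 100 with known standard deviation 5.
z, p = z_test_p_value(102, 100, 5, 25)
# z = 2.0 and p is about 0.0455, so the null is rejected at the 5% level.
```

Equivalently, one could compare z = 2.0 directly to the 5% two-sided critical value of about 1.96; the two decision rules agree.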
History
While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot, followed by Pierre-Simon Laplace, in analyzing the human sex ratio at birth.
Choice of null hypothesis
It has been argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful:
1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus the null hypothesis in this case is that the birthrates of boys and girls should be equal, given "conventional wisdom".
1900: Karl Pearson develops the chi-squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the Weldon dice throw data.
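Pearson's goodness-of-fit idea can be sketched in a few lines of Python. The counts below are illustrative, not Weldon's actual data: they stand in for 12,000 single-die throws tested against the uniform distribution predicted by a fair die.

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 12,000 throws of a single die; a fair die
# predicts 2,000 of each face (the null hypothesis supplied by theory).
observed = [1950, 2030, 1990, 2010, 2045, 1975]
expected = [2000.0] * 6

stat = chi_squared_statistic(observed, expected)
# With 6 cells there are 5 degrees of freedom; the 5% critical value is
# about 11.07. A statistic of 3.125 falls well below it, so these counts
# would not lead us to reject the hypothesized uniform distribution.
```

Note that here, unlike in later "no effect" defaults, the null distribution is supplied by the theory under test (a fair die), in line with Pearson's original usage.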
1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated. The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".
Modern origins and early controversy
Modern significance testing is largely the product of Karl Pearson, William Sealy Gosset, and Ronald Fisher, while hypothesis testing was developed by Jerzy Neyman and Egon Pearson. Fisher began his career in statistics as a Bayesian, but he soon grew disenchanted with the subjectivity involved and sought to provide a more "objective" approach to inductive inference. Fisher emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher and Neyman–Pearson formulations, methods, and terminology developed in the early 20th century.
Fisher popularized the "significance test". He required a null hypothesis and a sample. His calculations determined whether to reject the null hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a Type II error.
The p-value was devised as an informal, but objective, index meant to help a researcher determine whether to modify future experiments or strengthen one's faith in the null hypothesis. Hypothesis testing was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher.
Neyman and Pearson considered a different problem from Fisher's. They initially considered two simple hypotheses. They calculated two probabilities and typically selected the hypothesis associated with the higher probability. Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.
Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing. Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions, based on models formulated before data are collected, was incompatible with this common scenario faced by scientists, and that attempts to apply the method to scientific research would lead to mass confusion.
The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.
Events intervened: Neyman accepted a position in the University of California, Berkeley in 1938, breaking his partnership with Pearson and separating the disputants. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels.
Null hypothesis significance testing (NHST)
The modern version of hypothesis testing is generally called null hypothesis significance testing and is a hybrid of the Fisher approach with the Neyman–Pearson approach. In 2000, Raymond S. Nickerson wrote an article stating that NHST was "arguably the most widely used method of analysis of data collected in psychological experiments and has been so for about 70 years" and that it was at the same time "very controversial".
This fusion resulted from confusion by writers of statistical textbooks beginning in the 1940s. Great conceptual differences and many caveats, in addition to those mentioned above, were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics, and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than with theirs.
Sometime around 1940, authors of statistical text books began combining the two approaches by using the p-value in place of the test statistic to test against the Neyman–Pearson "significance level".
| # | Fisher's null hypothesis testing | Neyman–Pearson decision theory |
|---|---|---|
| 1 | Set up a statistical null hypothesis. The null need not be a nil hypothesis. | Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis. |
| 2 | Report the exact level of significance. Do not refer to "accepting" or "rejecting" hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available. | If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true. |
| 3 | Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation. | The usefulness of the procedure is limited, among other things, to situations where you have a disjunction of hypotheses and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta. |
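The textbook hybrid that the table contrasts can be sketched in Python: a p-value is computed Fisher-style, then compared to a pre-chosen Neyman–Pearson significance level to reach a binary decision. The function name and example values are illustrative, and a standard normal test statistic is assumed.

```python
import math

def nhst_decision(z, alpha=0.05):
    """Hybrid (NHST) procedure for a standard normal test statistic z:
    compute a two-sided p-value Fisher-style, then make a binary
    reject/retain decision Neyman-Pearson-style by comparing it to a
    significance level alpha chosen before seeing the data."""
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - cdf)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return p, decision

# Illustrative values: z = 2.5 rejects at alpha = 0.05; z = 1.0 does not.
p1, d1 = nhst_decision(2.5)
p2, d2 = nhst_decision(1.0)
```

Note how the sketch mixes the two frameworks: it reports an exact p-value (Fisher's advice in row 2 of the table) yet also converts it into an accept/reject act (the Neyman–Pearson column), which is precisely the conflation the surrounding text describes.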
Philosophy
Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science.
Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.
Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments.
Hypothesis testing is of continuing interest to philosophers.
Education
Statistics is increasingly being taught in schools, with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly. An introductory college statistics class places much emphasis on hypothesis testing, perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis. An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures. Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues.
An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy, or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions that persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics, and emphasizing the controversy in a generally dry subject.
Raymond S. Nickerson commented: