Effect size


In statistics, an effect size is a quantitative measure of the magnitude of a phenomenon. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter for a hypothetical population, or the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, and the risk of a particular event. Effect sizes complement statistical hypothesis testing, and play an important role in statistical power analyses to assess the sample size required for new experiments. Effect size calculations are fundamental to meta-analysis, which aims to provide the combined effect size based on data from multiple studies. The group of data-analysis methods concerning effect sizes is referred to as estimation statistics.
Effect size is an essential component in the evaluation of the strength of a statistical claim, and it is the first item in the MAGIC criteria. The standard deviation of the effect size is of critical importance, as it indicates how much uncertainty is included in the observed measurement. A standard deviation that is too large will make the measurement nearly meaningless. In meta-analysis, which aims to summarize multiple effect sizes into a single estimate, the uncertainty in studies' effect sizes is used to weight the contribution of each study, so larger studies are considered more important than smaller ones. The uncertainty in the effect size is calculated differently for each type of effect size, but generally only requires knowing the study's sample size, or the number of observations in each group.
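As a concrete sketch of this inverse-variance weighting, the snippet below implements a minimal fixed-effect meta-analysis in Python; the effect estimates and standard errors are invented for illustration, not taken from any real study:

```python
import numpy as np

# Hypothetical effect estimates (e.g., standardized mean differences)
# and their standard errors; a smaller standard error typically
# corresponds to a larger study.
effects = np.array([0.42, 0.30, 0.55])
std_errors = np.array([0.08, 0.21, 0.30])

weights = 1.0 / std_errors**2                      # inverse-variance weights
combined = np.sum(weights * effects) / np.sum(weights)
combined_se = np.sqrt(1.0 / np.sum(weights))

print(f"combined effect: {combined:.3f} (SE {combined_se:.3f})")
# The most precise (typically largest) study dominates the weighted average.
```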
Reporting effect sizes or estimates thereof is considered good practice when presenting empirical research findings in many fields. Reporting an effect size facilitates interpretation of the practical importance of a research result, as distinct from its statistical significance. Effect sizes are particularly prominent in social science and medical research, with the latter emphasizing the magnitude of the average treatment effect.
Effect sizes may be measured in relative or absolute terms. In relative effect sizes, two groups are directly compared with each other, as in odds ratios and relative risks. For absolute effect sizes, a larger absolute value always indicates a stronger effect. Many types of measurements can be expressed as either absolute or relative, and these can be used together because they convey different information. A prominent task force in the psychology research community recommended always reporting effect sizes for primary outcomes and, when the units of measurement are meaningful on a practical level, preferring an unstandardized measure (such as a mean difference or regression coefficient) over a standardized one.

Overview

Population and sample effect sizes

As in statistical estimation, the true effect size is distinguished from the observed effect size. For example, to measure the risk of disease in a population (the population effect size) one can measure the risk within a sample of that population (the sample effect size). Conventions for describing true and observed effect sizes follow standard statistical practice: one common approach is to use Greek letters like ρ to denote population parameters and Latin letters like r to denote the corresponding statistic. Alternatively, a "hat" can be placed over the population parameter to denote the statistic, e.g. with ρ̂ being the estimate of the parameter ρ.
As in any statistical setting, effect sizes are estimated with sampling error, and may be biased unless the effect size estimator that is used is appropriate for the manner in which the data were sampled and the manner in which the measurements were made. An example of this is publication bias, which occurs when scientists report results only when the estimated effect sizes are large or are statistically significant. As a result, if many researchers carry out studies with low statistical power, the reported effect sizes will tend to be larger than the true effects, if any. Another example where effect sizes may be distorted is in a multiple-trial experiment, where the effect size calculation is based on the averaged or aggregated response across the trials.
Smaller studies sometimes show different, often larger, effect sizes than larger studies. This phenomenon is known as the small-study effect, which may signal publication bias.

Relationship to test statistics

Sample-based effect sizes are distinguished from test statistics used in hypothesis testing, in that they estimate the strength of, for example, an apparent relationship, rather than assigning a significance level reflecting whether the magnitude of the relationship observed could be due to chance. The effect size does not directly determine the significance level, or vice versa. Given a sufficiently large sample size, a non-null statistical comparison will always show a statistically significant result unless the population effect size is exactly zero. For example, a sample Pearson correlation coefficient of 0.01 becomes statistically significant at the conventional two-tailed α = 0.05 level once the sample size exceeds roughly 38,000 pairs. Reporting only the significant p-value from such an analysis could be misleading if a correlation of 0.01 is too small to be of interest in a particular application.
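To make the arithmetic in this example explicit, here is a small Python sketch (the sample sizes are chosen purely for illustration) that converts a sample correlation r and sample size n into a two-sided p-value via the usual t-statistic with n - 2 degrees of freedom:

```python
import math
from scipy import stats

def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a sample Pearson correlation r based on
    n pairs, via the t-statistic with n - 2 degrees of freedom."""
    t = r * math.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

for n in (1_000, 10_000, 40_000):
    print(n, correlation_p_value(0.01, n))
# The p-value shrinks toward zero as n grows, even though r = 0.01
# remains just as tiny; it first drops below 0.05 near n = 38,000.
```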

Standardized and unstandardized effect sizes

The term effect size can refer to a standardized measure of effect, or to an unstandardized measure. Standardized effect size measures are typically used when:
  • the metrics of variables being studied do not have intrinsic meaning,
  • results from multiple studies are being combined,
  • some or all of the studies use different scales, or
  • it is desired to convey the size of an effect relative to the variability in the population.
In meta-analyses, standardized effect sizes are used as a common measure that can be calculated for different studies and then combined into an overall summary.
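As a hedged illustration of one such standardized measure, the sketch below computes Cohen's d, a raw mean difference rescaled by the pooled standard deviation so that the result is unit-free; the two samples are invented for the example:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: the difference in sample means divided by the
    pooled standard deviation, yielding a scale-free effect size."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = x.size, y.size
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

treatment = [5.1, 4.8, 6.0, 5.5, 5.9]   # made-up scores
control   = [4.2, 4.9, 4.4, 5.0, 4.6]
print(cohens_d(treatment, control))      # unitless, comparable across scales
```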

Interpretation

The interpretation of an effect size as small, medium, or large depends on its substantive context and its operational definition. Jacob Cohen suggested interpretation guidelines that have become near-ubiquitous across many fields. However, Cohen also cautioned:
"The terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation... In the face of this relativity, there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science. This risk is nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better basis for estimating the ES index is available."

Sawilowsky recommended that the rules of thumb for effect sizes should be revised, and expanded the descriptions to include very small, very large, and huge. Funder and Ozer suggested that effect sizes should be interpreted based on benchmarks and consequences of findings, resulting in adjustment of guideline recommendations.
Statistician Russell Lenth noted that for a "medium" effect size, "you'll choose the same n regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects. Clearly, important considerations are being ignored here." Researchers should instead interpret the substantive significance of their results by grounding them in a meaningful context or by quantifying their contribution to knowledge, with Cohen's effect size descriptions serving at most as a starting point. Similarly, a U.S. Department of Education sponsored report argued that the widespread indiscriminate use of Cohen's interpretation guidelines can be inappropriate and misleading. It instead suggested that norms should be based on distributions of effect sizes from comparable studies; a nominally small effect could thus be considered large if it exceeds those of similar studies in the field. See Abelson's paradox and Sawilowsky's paradox for related points.
The table below contains descriptors for various magnitudes of d, r, f, and ω, as initially suggested by Jacob Cohen and later expanded by Sawilowsky and by Funder and Ozer.

Effect size | d          | r          | f          | ω
Very small  | 0.01       | 0.005      | 0.005      |
Small       | 0.20       | 0.10       | 0.10       | 0.10
Medium      | 0.41, 0.50 | 0.20, 0.24 | 0.20, 0.31 | 0.30
Large       | 0.63, 0.80 | 0.30, 0.37 | 0.32, 0.40 | 0.50
Very large  | 0.87, 1.20 | 0.40, 0.51 | 0.44, 0.60 |
Huge        | 2.0        | 0.71       | 1.0        |

Types

About 50 to 100 different measures of effect size are known. Many effect sizes of different types can be converted into one another, since many of them estimate the separation of two distributions and are therefore mathematically related. For example, a correlation coefficient can be converted to a Cohen's d and vice versa, as in the sketch below.
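Under the common simplifying assumption of two equally sized groups, r and Cohen's d are related in closed form by r = d / sqrt(d² + 4); a minimal sketch of the conversion and its inverse:

```python
import math

def d_to_r(d: float) -> float:
    """Cohen's d -> (point-biserial) r, assuming two equally sized groups."""
    return d / math.sqrt(d**2 + 4)

def r_to_d(r: float) -> float:
    """Inverse of d_to_r under the same equal-group-size assumption."""
    return 2 * r / math.sqrt(1 - r**2)

print(d_to_r(0.5))           # ~0.24, matching the "medium" row of the table
print(r_to_d(d_to_r(0.5)))   # round-trips back to 0.5
```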

Correlation family: Effect sizes based on "variance explained"

These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

Pearson r or correlation coefficient

Pearson's correlation coefficient, often denoted r and introduced by Karl Pearson, is widely used as an effect size when paired quantitative data are available, for instance if one were studying the relationship between birth weight and longevity. The correlation coefficient can also be used when the data are binary. Pearson's r can vary in magnitude from −1 to 1, with −1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation, and 0 indicating no linear relation between two variables.
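As a minimal sketch of the definition (the paired data below are hypothetical), r is the covariance of the two variables divided by the product of their standard deviations, which can be checked against NumPy's built-in:

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation: covariance of x and y divided by the
    product of their standard deviations; always lies in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

birth_weight = [2.9, 3.1, 3.4, 3.6, 4.0]   # hypothetical pairs
longevity    = [70, 72, 75, 74, 79]
print(pearson_r(birth_weight, longevity))          # strong positive relation
print(np.corrcoef(birth_weight, longevity)[0, 1])  # agrees with NumPy
```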