Pearson correlation coefficient


In statistics, the Pearson correlation coefficient is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. A key difference is that unlike covariance, this correlation coefficient does not have units, allowing comparison of the strength of the joint association between different pairs of random variables that do not necessarily have the same units. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.

Naming and history

It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s, though the mathematical formula had already been derived and published by Auguste Bravais in 1844. The naming of the coefficient is thus an example of Stigler's Law.

Intuitive explanation

The correlation coefficient can be derived by considering the cosine of the angle between the two vectors formed from the x and y co-ordinate data, once each has been centered by its mean. This expression is therefore a number between −1 and 1, and is equal to ±1 exactly when all the points lie on a straight line.

Definition

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.

For a population

Pearson's correlation coefficient, when applied to a population, is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. Given a pair of random variables (X, Y), the formula for ρ is

$$ \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} $$

where
  • cov is the covariance
  • σ_X is the standard deviation of X
  • σ_Y is the standard deviation of Y.
The formula for cov(X, Y) can be expressed in terms of mean and expectation. Since

$$ \operatorname{cov}(X, Y) = \operatorname{E}[(X - \mu_X)(Y - \mu_Y)], $$

the formula for ρ can also be written as

$$ \rho_{X,Y} = \frac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} $$

where
  • σ_X and σ_Y are defined as above
  • μ_X is the mean of X
  • μ_Y is the mean of Y
  • E is the expectation.
The formula for ρ can be expressed in terms of uncentered moments. Since

$$ \mu_X = \operatorname{E}[X], \qquad \sigma_X^2 = \operatorname{E}[X^2] - \operatorname{E}[X]^2, $$
$$ \operatorname{E}[(X - \mu_X)(Y - \mu_Y)] = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y] $$

(and analogously for Y), the formula for ρ can also be written as

$$ \rho_{X,Y} = \frac{\operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y]}{\sqrt{\operatorname{E}[X^2] - \operatorname{E}[X]^2}\,\sqrt{\operatorname{E}[Y^2] - \operatorname{E}[Y]^2}}. $$

For a sample

Pearson's correlation coefficient, when applied to a sample, is commonly represented by r_xy and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r_xy by substituting estimates of the covariances and variances based on a sample into the formula above. Given paired data {(x_1, y_1), …, (x_n, y_n)} consisting of n pairs, r_xy is defined as

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where
  • n is sample size
  • x_i, y_i are the individual sample points indexed with i
  • x̄ = (1/n) Σ x_i (the sample mean); and analogously for ȳ.
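The definitional formula above translates directly into code. The following is a minimal sketch (the function name pearson_r and the example data are our own illustration, not part of the article):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, from the definitional formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of the mean-adjusted values.
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    den = (math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
           * math.sqrt(sum((yi - y_bar) ** 2 for yi in y)))
    return num / den

# Perfectly linearly related data, so r should be 1 (up to rounding).
print(pearson_r([1, 2, 3, 5, 8], [0.11, 0.12, 0.13, 0.15, 0.18]))
```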
Rearranging gives us this formula for r_xy:

$$ r_{xy} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_i x_i^2 - n\bar{x}^2}\,\sqrt{\sum_i y_i^2 - n\bar{y}^2}} $$

where n, x_i, y_i, x̄, ȳ are defined as above.
Rearranging again gives us this formula for r_xy:

$$ r_{xy} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\sqrt{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}\,\sqrt{n \sum_i y_i^2 - \left(\sum_i y_i\right)^2}} $$

where n, x_i, y_i are defined as above.
This formula suggests a convenient single-pass algorithm for calculating sample correlations, though depending on the numbers involved, it can sometimes be numerically unstable.
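To make the single-pass idea concrete, here is a minimal sketch (our own illustration, not an algorithm prescribed by the text): it accumulates the five running sums that the last formula needs in one loop over the data. The caveat above applies: the subtractions in the final step can cancel catastrophically when the sums are large and nearly equal, so numerically robust implementations prefer shifted or Welford-style updates.

```python
import math

def pearson_r_single_pass(pairs):
    """Single-pass sample correlation from running sums.

    Shown for exposition; can be numerically unstable when the
    accumulated sums are large and nearly cancel.
    """
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:  # one pass over the data
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    num = n * sum_xy - sum_x * sum_y
    den = (math.sqrt(n * sum_xx - sum_x ** 2)
           * math.sqrt(n * sum_yy - sum_y ** 2))
    return num / den

print(pearson_r_single_pass([(1, 0.11), (2, 0.12), (3, 0.13), (5, 0.15), (8, 0.18)]))
```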
An equivalent expression gives the formula for r_xy as the mean of the products of the standard scores as follows:

$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where
  • n, x_i, y_i, x̄, ȳ are defined as above
  • (x_i − x̄)/s_x is the standard score of x_i (and analogously for y_i), with s_x and s_y the sample standard deviations defined below.
Alternative formulae for r_xy are also available. For example, one can use the following formula for r_xy:

$$ r_{xy} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y} $$

where
  • n, x_i, y_i, x̄, ȳ are defined as above and:
  • $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ (the sample standard deviation); and analogously for s_y.
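As a quick consistency check (our own sketch, with arbitrary data), the standard-score formula can be compared against NumPy's built-in np.corrcoef; the two should agree to floating-point precision:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 7.0])
n = len(x)

# Mean of products of standard scores, with n - 1 in both the
# standard deviations (ddof=1) and the leading factor.
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_scores = (z_x * z_y).sum() / (n - 1)

print(np.isclose(r_scores, np.corrcoef(x, y)[0, 1]))  # True
```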

    For jointly Gaussian distributions

If (X, Y) is jointly Gaussian, with mean zero and covariance matrix Σ, then

$$ \Sigma = \begin{pmatrix} \sigma_X^2 & \rho\,\sigma_X \sigma_Y \\ \rho\,\sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix}. $$
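A small simulation sketch (our own, with arbitrary illustration values ρ = 0.6, σ_X = 1, σ_Y = 2) makes the statement concrete: sampling from a zero-mean bivariate Gaussian with this Σ yields a sample correlation close to ρ:

```python
import numpy as np

rho, sigma_x, sigma_y = 0.6, 1.0, 2.0  # arbitrary illustration values
cov = np.array([[sigma_x ** 2,            rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y ** 2]])

rng = np.random.default_rng(0)
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)
print(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])  # close to 0.6
```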

Practical issues

Under heavy noise conditions, extracting the correlation coefficient between two sets of stochastic variables is nontrivial, in particular where canonical correlation analysis reports degraded correlation values because of the heavy noise contributions. A generalization of the approach is given elsewhere.
In case of missing data, Garren derived the maximum likelihood estimator.
Some distributions (the Cauchy distribution, for example) do not have a defined variance, in which case the Pearson correlation coefficient is not defined either.

Mathematical properties

The values of both the sample and population Pearson correlation coefficients are on or between −1 and 1. Correlations equal to +1 or −1 correspond to data points lying exactly on a line, or to a bivariate distribution entirely supported on a line. The Pearson correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. More general linear transformations do change the correlation. In particular, it is useful to note that corr(X, −Y) = −corr(X, Y).
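A brief numerical check of these properties (our own sketch, with arbitrary constants playing the roles of a, b, c, and d):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]
# Location/scale changes with positive scale factors leave r unchanged...
r_affine = np.corrcoef(3.0 + 2.0 * x, -1.0 + 0.5 * y)[0, 1]
# ...while negating one variable flips its sign.
r_neg = np.corrcoef(x, -y)[0, 1]

print(np.isclose(r, r_affine), np.isclose(r, -r_neg))  # True True
```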

Interpretation

The correlation coefficient ranges from −1 to 1. An absolute value of exactly 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line. The sign of the correlation is the sign of that line's slope: a value of +1 implies that all data points lie on a line for which Y increases as X increases, whereas a value of −1 implies a line along which Y decreases as X increases. A value of 0 implies that there is no linear dependency between the variables.
More generally, (X_i − X̄)(Y_i − Ȳ) is positive if and only if X_i and Y_i lie on the same side of their respective means. Thus the correlation coefficient is positive if X_i and Y_i tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative if X_i and Y_i tend to lie on opposite sides of their respective means. Moreover, the stronger either tendency is, the larger is the absolute value of the correlation coefficient.
Rodgers and Nicewander cataloged thirteen ways of interpreting correlation or simple functions of it:
  • Function of raw scores and means
  • Standardized covariance
  • Standardized slope of the regression line
  • Geometric mean of the two regression slopes
  • Square root of the ratio of two variances
  • Mean cross-product of standardized variables
  • Function of the angle between two standardized regression lines
  • Function of the angle between two variable vectors
  • Rescaled variance of the difference between standardized scores
  • Estimated from the balloon rule
  • Related to the bivariate ellipses of isoconcentration
  • Function of test statistics from designed experiments
  • Ratio of two means

    Geometric interpretation

For uncentered data, there is a relation between the correlation coefficient and the angle φ between the two regression lines, y = g_X(x) and x = g_Y(y), obtained by regressing y on x and x on y respectively. One can show that if the standard deviations are equal, then r = sec φ − tan φ, where sec and tan are trigonometric functions.
For centered data, the correlation coefficient can also be viewed as the cosine of the angle θ between the two observed vectors in N-dimensional space.
Both the uncentered and centered correlation coefficients can be determined for a dataset. As an example, suppose five countries are found to have gross national products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five countries are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).
By the usual procedure for finding the angle θ between two vectors (via the dot product), the uncentered correlation coefficient is

$$ \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{2.93}{\sqrt{103}\,\sqrt{0.0983}} \approx 0.9208. $$

This uncentered correlation coefficient is identical with the cosine similarity. The above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01x. The Pearson correlation coefficient must therefore be exactly one. Centering the data (shifting x by x̄ = 3.8 and y by ȳ = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

$$ \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{0.308}{\sqrt{30.8}\,\sqrt{0.00308}} = 1 = \rho_{xy}, $$

as expected.
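Both versions of this calculation are easy to reproduce in a few lines (our own sketch of the worked example above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])       # GNP, billions of dollars
y = np.array([0.11, 0.12, 0.13, 0.15, 0.18])  # poverty fraction

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(x, y))                        # ~0.9208, the uncentered coefficient
print(cosine(x - x.mean(), y - y.mean()))  # 1.0, the Pearson correlation
```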

Interpretation of the size of a correlation

Several authors have offered guidelines for the interpretation of a correlation coefficient. However, all such criteria are in some ways arbitrary. The interpretation of a correlation coefficient depends on the context and purposes. A correlation of 0.8 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.

Inference

Statistical inference based on Pearson's correlation coefficient often focuses on one of the following two aims:
  • One aim is to test the null hypothesis that the true correlation coefficient ρ is equal to 0, based on the value of the sample correlation coefficient r.
  • The other aim is to derive a confidence interval that, on repeated sampling, has a given probability of containing ρ.
Methods of achieving one or both of these aims are discussed below.

Using a permutation test

Permutation tests provide a direct approach to performing hypothesis tests and constructing confidence intervals. A permutation test for Pearson's correlation coefficient involves the following two steps:
  1. Using the original paired data (x_i, y_i), randomly redefine the pairs to create a new data set (x_i, y_i′), where the i′ are a permutation of the set {1, …, n}. The permutation i′ is selected randomly, with equal probabilities placed on all n! possible permutations. This is equivalent to drawing the i′ randomly without replacement from the set {1, …, n}. In bootstrapping, a closely related approach, the i and the i′ are equal and drawn with replacement from {1, …, n};
  2. Construct a correlation coefficient r from the randomized data.
To perform the permutation test, repeat steps (1) and (2) a large number of times. The p-value for the permutation test is the proportion of the r values generated in step (2) that are larger than the Pearson correlation coefficient that was calculated from the original data. Here "larger" can mean either that the value is larger in magnitude, or larger in signed value, depending on whether a two-sided or one-sided test is desired.
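A minimal sketch of this procedure in Python (our own illustration; the resample count, seed, and the two-sided magnitude criterion are choices, not prescriptions from the text):

```python
import numpy as np

def permutation_test_corr(x, y, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for Pearson's correlation."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_resamples):
        # Step 1: randomly re-pair the data by permuting one variable
        # (equivalent to drawing indices without replacement).
        y_perm = rng.permutation(y)
        # Step 2: correlation coefficient of the randomized data.
        r_perm = np.corrcoef(x, y_perm)[0, 1]
        # Two-sided test: "larger" means larger in magnitude.
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return r_obs, count / n_resamples

r, p = permutation_test_corr([1, 2, 3, 5, 8], [0.11, 0.14, 0.12, 0.16, 0.17])
print(f"r = {r:.3f}, p = {p:.3f}")
```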