Median
The median of a set of numbers is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as the "middle" value. The basic feature of the median in describing data, compared to the mean, is that it is not skewed by a small proportion of extreme values, and it can therefore provide a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.
The median is the 2-quantile: it is the value that partitions a set into two equal parts.
Finite set of numbers
The median of a finite list of numbers is the "middle" number when those numbers are listed in order from smallest to greatest. If the data set has an odd number of observations, the middle one is selected. For example, the following list of seven numbers,
1, 3, 3, 6, 7, 8, 9
has a median of 6, which is the fourth value.
If the data set has an even number of observations, there is no distinct middle value and the median is usually defined to be the arithmetic mean of the two middle values. For example, this data set of 8 numbers
1, 2, 3, 4, 5, 6, 8, 9
has a median value of 4.5, that is, (4 + 5)/2.
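In general, with this convention, the median can be defined as follows: for a data set $x$ of $n$ elements, ordered from smallest to greatest,
$$\operatorname{med}(x) = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd,} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even.} \end{cases}$$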
| Type | Description | Example | Result |
| --- | --- | --- | --- |
| Midrange | Midway point between the minimum and the maximum of a data set | 1, 2, 2, 3, 4, 7, 9 | 5 |
| Arithmetic mean | Sum of values of a data set divided by number of values | (1 + 2 + 2 + 3 + 4 + 7 + 9) / 7 | 4 |
| Median | Middle value separating the greater and lesser halves of a data set | 1, 2, 2, 3, 4, 7, 9 | 3 |
| Mode | Most frequent value in a data set | 1, 2, 2, 3, 4, 7, 9 | 2 |
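The four averages in the table can be reproduced with a short Python sketch. The helper names `median` and `midrange` are illustrative choices, not part of any standard; `mean` and `mode` come from Python's `statistics` module. The median helper follows the convention described above (middle value for an odd count, mean of the two middle values for an even count).

```python
from statistics import mean, mode


def median(values):
    """Median under the usual convention: the middle value for an odd count,
    the arithmetic mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def midrange(values):
    """Midway point between the minimum and the maximum."""
    return (min(values) + max(values)) / 2


data = [1, 2, 2, 3, 4, 7, 9]
summary = {
    "midrange": midrange(data),
    "mean": mean(data),
    "median": median(data),
    "mode": mode(data),
}
print(summary)
```

Sorting makes this an O(n log n) computation; selection algorithms can find the median without a full sort, but sorting is usually the simplest choice for small data sets.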
Definition and notation
Formally, a median of a population is any value such that at least half of the population is less than or equal to the proposed median and at least half is greater than or equal to the proposed median. As seen above, medians may not be unique. If each of these two sets contains more than half the population, then some of the population is exactly equal to the unique median.
The median is well-defined for any ordered data and is independent of any distance metric. The median can thus be applied to school classes which are ranked but not numerical, although the result might be halfway between classes if there is an even number of classes.
A geometric median, on the other hand, is defined in any number of dimensions. A related concept, in which the outcome is forced to correspond to a member of the sample, is the medoid.
There is no widely accepted standard notation for the median, but some authors represent the median of a variable $x$ as $\operatorname{med}(x)$, as $\tilde{x}$, as $\mu_{1/2}$, or as $M$. In any of these cases, the use of these or other symbols for the median needs to be explicitly defined when they are introduced.
The median is a special case of other ways of summarizing the typical values associated with a statistical distribution: it is the 2nd quartile, 5th decile, and 50th percentile.
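As a rough check of this relationship, the sketch below compares the median with the middle cut points returned by `statistics.quantiles` (Python 3.8 or later) under its default method. The two samples are illustrative only.

```python
from statistics import median, quantiles

odd_sample = [1, 2, 2, 3, 4, 7, 9]
even_sample = [1, 2, 3, 4, 5, 6, 8, 9]


def middle_cut_points(sample):
    """Return the 2nd quartile, 5th decile and 50th percentile of a sample."""
    quartiles = quantiles(sample, n=4)       # 3 cut points
    deciles = quantiles(sample, n=10)        # 9 cut points
    percentiles = quantiles(sample, n=100)   # 99 cut points
    return quartiles[1], deciles[4], percentiles[49]


for sample in (odd_sample, even_sample):
    print(median(sample), middle_cut_points(sample))
```

Other quantiles depend on the interpolation convention in use, so only the middle cut point is expected to agree with the median in this way.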
Uses
The median can be used as a measure of location when one attaches reduced importance to extreme values, typically because a distribution is skewed, extreme values are not known, or outliers are untrustworthy, i.e., may be measurement or transcription errors. For example, consider the multiset
1, 2, 2, 2, 3, 14.
The median is 2 in this case, as is the mode, and it might be seen as a better indication of the center than the arithmetic mean of 4, which is larger than all but one of the values. However, the widely cited empirical relationship that the mean is shifted "further into the tail" of a distribution than the median is not generally true. At most, one can say that the two statistics cannot be "too far" apart; see below.
As a median is based on the middle data in a set, it is not necessary to know the value of extreme results in order to calculate it. For example, in a psychology test investigating the time needed to solve a problem, if a small number of people failed to solve the problem at all in the given time, a median can still be calculated, as long as those people make up less than half of the sample.
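The situation can be sketched in Python by recording unfinished attempts as infinity. The solving times below are hypothetical values invented for illustration, and using `math.inf` as a marker for "did not finish" is an assumption of this sketch rather than a standard convention.

```python
import math
from statistics import median

# Hypothetical solving times in seconds; math.inf marks participants who
# did not solve the problem within the time limit (true times unknown).
times = [41.0, 58.5, 37.2, math.inf, 49.9, 62.3, math.inf]

typical_time = median(times)           # middle of the sorted values
naive_mean = sum(times) / len(times)   # dominated by the unknown values

print(typical_time, naive_mean)
```

The median stays finite because the unknown values all sort to the upper end and the middle position is occupied by a known time; the arithmetic mean cannot be computed without assigning them a value.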
Because the median is simple to understand and easy to calculate, while also a robust approximation to the mean, the median is a popular summary statistic in descriptive statistics. In this context, there are several choices for a measure of variability: the range, the interquartile range, the mean absolute deviation, and the median absolute deviation.
For practical purposes, different measures of location and dispersion are often compared on the basis of how well the corresponding population values can be estimated from a sample of data. The median, estimated using the sample median, has good properties in this regard. While it is not usually optimal if a given population distribution is assumed, its properties are always reasonably good. For example, a comparison of the efficiency of candidate estimators shows that the sample mean is more statistically efficient when, and only when, the data is uncontaminated by data from heavy-tailed distributions or from mixtures of distributions. Even then, for large normal samples the median has about 64% efficiency compared to the minimum-variance mean (asymptotically 2/π), which is to say the variance of the median will be roughly 57% greater than the variance of the mean (a factor of π/2).
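The efficiency figure can be checked approximately by simulation. The following is a minimal sketch, assuming standard normal data; the function name `variance_ratio`, the sample size, the number of trials and the seed are all arbitrary illustrative choices. An efficiency of about 64% corresponds to a variance ratio of about 1/0.64, so the estimate should land in that neighborhood, subject to simulation noise.

```python
import random
from statistics import fmean, median, pvariance


def variance_ratio(sample_size=101, trials=2000, seed=0):
    """Estimate Var(sample median) / Var(sample mean) for standard normal data."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(fmean(sample))
        medians.append(median(sample))
    # Variability of each estimator across the repeated samples.
    return pvariance(medians) / pvariance(means)


ratio = variance_ratio()
print(f"variance ratio (median / mean): {ratio:.2f}")
```

Replacing `rng.gauss` with a heavy-tailed generator is one way to explore how the comparison changes when the normality assumption fails.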
Probability distributions
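A median of a real-valued random variable $X$ is a real number $m$ that satisfies
$$\operatorname{P}(X < m) \le \tfrac{1}{2} \quad\text{and}\quad \operatorname{P}(X > m) \le \tfrac{1}{2}$$
or, equivalently with the complementary events,
$$\operatorname{P}(X \ge m) \ge \tfrac{1}{2} \quad\text{and}\quad \operatorname{P}(X \le m) \ge \tfrac{1}{2}.$$
Such an $m$ always exists, but need not be uniquely determined. An equivalent phrasing uses the cumulative distribution function $F$ of $X$:
$$\lim_{x \to m^-} F(x) \le \tfrac{1}{2} \le F(m).$$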
[Figure: Mode, median and mean of a probability density function]
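Note that this definition does not require $X$ to have an absolutely continuous distribution (one with a probability density function $f$), nor does it require a discrete one. In the former case, the inequalities can be upgraded to equalities: a median $m$ satisfies
$$\operatorname{P}(X \le m) = \int_{-\infty}^{m} f(x)\,dx = \tfrac{1}{2}$$
and
$$\operatorname{P}(X \ge m) = \int_{m}^{\infty} f(x)\,dx = \tfrac{1}{2}.$$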
Any probability distribution on the real number set has at least one median, but in pathological cases there may be more than one median: if F is constant 1/2 on an interval, then any value of that interval is a median.
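For a discrete distribution, the defining condition (at least half of the probability at or below the candidate, and at least half at or above it) can be tested directly. The sketch below uses exact fractions; the function name `is_median` and the two dice are hypothetical examples chosen to show one distribution with an interval of medians and one with a unique median.

```python
from fractions import Fraction


def is_median(pmf, m):
    """Check P(X <= m) >= 1/2 and P(X >= m) >= 1/2 for a discrete distribution.

    `pmf` maps each possible value to its probability.
    """
    half = Fraction(1, 2)
    at_most = sum(p for x, p in pmf.items() if x <= m)
    at_least = sum(p for x, p in pmf.items() if x >= m)
    return at_most >= half and at_least >= half


# Fair die: the CDF equals 1/2 on the whole interval [3, 4).
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Loaded die: face 3 carries half of the probability on its own.
loaded_die = {face: Fraction(1, 10) for face in range(1, 7)}
loaded_die[3] = Fraction(1, 2)

print([m for m in (2.9, 3, 3.5, 4, 4.1) if is_median(fair_die, m)])
print([m for m in (2, 3, 3.5, 4) if is_median(loaded_die, m)])
```

For the fair die every value from 3 to 4 passes the test, while for the loaded die only 3 does.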
Medians of particular distributions
The medians of certain types of distributions can be easily calculated from their parameters; furthermore, they exist even for some distributions lacking a well-defined mean, such as the Cauchy distribution:
- The median of a symmetric unimodal distribution coincides with the mode.
- The median of a symmetric distribution which possesses a mean μ also takes the value μ.
  - The median of a normal distribution with mean μ and variance σ² is μ. In fact, for a normal distribution, mean = median = mode.
  - The median of a uniform distribution in the interval [a, b] is (a + b)/2, which is also the mean.
- The median of a Cauchy distribution with location parameter $x_0$ and scale parameter $y$ is $x_0$, the location parameter.
- The median of a power law distribution $x^{-a}$, with exponent $a > 1$, is $2^{1/(a-1)} x_{\min}$, where $x_{\min}$ is the minimum value for which the power law holds.
- The median of an exponential distribution with rate parameter λ is the natural logarithm of 2 divided by the rate parameter: $\lambda^{-1} \ln 2$.
- The median of a Weibull distribution with shape parameter $k$ and scale parameter λ is $\lambda (\ln 2)^{1/k}$.
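One way to sanity-check closed-form medians like these is to evaluate each distribution's cumulative distribution function at the claimed median and confirm that the result is 1/2. The sketch below does this with the standard CDFs, assuming the usual parameterizations (for example ln 2 / λ for the exponential and λ (ln 2)^(1/k) for the Weibull). The parameter values are arbitrary, and the function names are illustrative.

```python
import math


def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def uniform_cdf(x, a, b):
    return (x - a) / (b - a)


def cauchy_cdf(x, location, scale):
    return 0.5 + math.atan((x - location) / scale) / math.pi


def power_law_cdf(x, exponent, x_min):
    # Density proportional to x**(-exponent) for x >= x_min, exponent > 1.
    return 1.0 - (x / x_min) ** (1.0 - exponent)


def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x)


def weibull_cdf(x, shape, scale):
    return 1.0 - math.exp(-((x / scale) ** shape))


# CDF evaluated at each claimed median; every entry should be close to 0.5.
checks = {
    "normal": normal_cdf(1.2, mean=1.2, std=3.0),
    "uniform": uniform_cdf((2.0 + 10.0) / 2, a=2.0, b=10.0),
    "cauchy": cauchy_cdf(-4.0, location=-4.0, scale=0.7),
    "power law": power_law_cdf(2 ** (1 / (2.5 - 1)) * 3.0, exponent=2.5, x_min=3.0),
    "exponential": exponential_cdf(math.log(2) / 1.5, rate=1.5),
    "weibull": weibull_cdf(2.0 * math.log(2) ** (1 / 3.0), shape=3.0, scale=2.0),
}
for name, value in checks.items():
    print(f"{name:12s} F(median) = {value:.12f}")
```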
Properties
Optimality property
The mean absolute error of a real variable $c$ with respect to the random variable $X$ is
$$\operatorname{E}\left[\,|X - c|\,\right].$$
Provided that the probability distribution of $X$ is such that the above expectation exists, $m$ is a median of $X$ if and only if $m$ is a minimizer of the mean absolute error with respect to $X$. In particular, if $m$ is a sample median, then it minimizes the arithmetic mean of the absolute deviations. Note, however, that in cases where the sample contains an even number of elements, this minimizer is not unique.
More generally, a median is defined as a minimum of
$$\operatorname{E}\left[\,|X - c| - |X|\,\right],$$
as discussed below in the section on multivariate medians.
This optimization-based definition of the median is useful in statistical data-analysis, for example, in k-medians clustering.
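The optimality property can be illustrated numerically. The sketch below searches a grid of candidate values for the one with the smallest mean absolute error on a small sample; the grid search is purely for illustration (it is not how a median would normally be computed), and the samples and the name `mean_absolute_error` are illustrative choices.

```python
from statistics import median


def mean_absolute_error(sample, c):
    """Arithmetic mean of the absolute deviations of the sample from c."""
    return sum(abs(x - c) for x in sample) / len(sample)


odd_sample = [1, 2, 2, 3, 4, 7, 9]
candidates = [i / 10 for i in range(0, 101)]  # 0.0, 0.1, ..., 10.0
best = min(candidates, key=lambda c: mean_absolute_error(odd_sample, c))
print(best, median(odd_sample))

# With an even number of elements the minimizer is not unique: every value
# between the two middle observations gives the same mean absolute error.
even_sample = [1, 2, 3, 4, 5, 6, 8, 9]
print([mean_absolute_error(even_sample, c) for c in (4, 4.5, 5)])
```

For the odd-sized sample the grid minimum coincides with the sample median, while for the even-sized sample the error is flat across the interval between the two middle values.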
Inequality relating means and medians
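If the distribution has finite variance, then the distance between the median and the mean is bounded by one standard deviation.
This bound was proved by Book and Sher in 1979 for discrete samples, and more generally by Page and Murty in 1982. In a comment on a subsequent proof by O'Cinneide, Mallows in 1991 presented a compact proof that uses Jensen's inequality twice, as follows. Writing $\mu$ for the mean, $m$ for a median, and $\sigma$ for the standard deviation of $X$, and using |·| for the absolute value, we have
$$|\mu - m| = |\operatorname{E}(X - m)| \le \operatorname{E}\left(|X - m|\right) \le \operatorname{E}\left(|X - \mu|\right) \le \sqrt{\operatorname{E}\left((X - \mu)^2\right)} = \sigma.$$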
The first and third inequalities come from Jensen's inequality applied to the absolute-value function and the square function, which are each convex. The second inequality comes from the fact that a median minimizes the absolute deviation function.
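Mallows's proof can be generalized to obtain a multivariate version of the inequality simply by replacing the absolute value with a norm:
$$\|\mu - m\| \le \sqrt{\operatorname{E}\left(\|X - \mu\|^2\right)} = \sqrt{\operatorname{trace}\left(\operatorname{var}(X)\right)},$$
where $m$ is a spatial median, that is, a minimizer of the function $a \mapsto \operatorname{E}\left(\|X - a\|\right)$. The spatial median is unique when the data set's dimension is two or more.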
An alternative proof uses the one-sided Chebyshev inequality; it appears in an inequality on location and scale parameters. This formula also follows directly from Cantelli's inequality.
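The bound can be checked on an empirical distribution, treating a sample as the whole population so that the relevant standard deviation is the population one (`statistics.pstdev`). The following sketch uses a small right-skewed sample chosen for illustration; the helper name `mean_median_gap` is hypothetical.

```python
from statistics import fmean, median, pstdev


def mean_median_gap(sample):
    """Return (|mean - median|, population standard deviation) of a sample."""
    return abs(fmean(sample) - median(sample)), pstdev(sample)


skewed = [1, 2, 2, 2, 3, 14]
gap, sigma = mean_median_gap(skewed)
print(f"|mean - median| = {gap:.3f}, standard deviation = {sigma:.3f}")
```

Even though the single large value pulls the mean well above the median, the gap remains below one standard deviation, as the inequality requires.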