Mathematical statistics
Mathematical statistics is the application of probability theory and other mathematical concepts to statistics, as opposed to techniques for collecting statistical data. Specific mathematical techniques that are commonly used in statistics include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure theory.
Introduction
Statistical data collection is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data often follows the study protocol specified prior to the study being conducted. The data from a study can also be analyzed to consider secondary hypotheses inspired by the initial results, or to suggest new studies. A secondary analysis of the data from a planned study uses tools from data analysis, and the process of doing this is mathematical statistics.Data analysis is divided into:
- descriptive statistics – the part of statistics that describes data, i.e. summarises the data and their typical properties.
- inferential statistics – the part of statistics that draws conclusions from data : For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of a particular model, and with quantifying the involved uncertainty.
Topics
The following are some of the important topics in mathematical statistics:Probability distributions
A probability distribution is a function that assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference. Examples are found in experiments whose sample space is non-numerical, where the distribution would be a categorical distribution; experiments whose sample space is encoded by discrete random variables, where the distribution can be specified by a probability mass function; and experiments with sample spaces encoded by continuous random variables, where the distribution can be specified by a probability density function. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures.A probability distribution can either be univariate or multivariate. A univariate distribution gives the probabilities of a single random variable taking on various alternative values; a multivariate distribution gives the probabilities of a random vector—a set of two or more random variables—taking on various combinations of values. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution.
Special distributions
- Normal distribution, the most common continuous distribution
- Bernoulli distribution, for the outcome of a single Bernoulli trial
- Binomial distribution, for the number of "positive occurrences" given a fixed total number of independent occurrences
- Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
- Geometric distribution, for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution, where the number of successes is one.
- Discrete uniform distribution, for a finite set of values
- Continuous uniform distribution, for continuously distributed values
- Poisson distribution, for the number of occurrences of a Poisson-type event in a given period of time
- Exponential distribution, for the time before the next Poisson-type event occurs
- Gamma distribution, for the time before the next k Poisson-type events occur
- Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples
- Student's t distribution, the distribution of the ratio of a standard normal variable and the square root of a scaled chi squared variable; useful for inference regarding the mean of normally distributed samples with unknown variance
- Beta distribution, for a single probability ; conjugate to the Bernoulli distribution and binomial distribution
Statistical inference
The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.
For the most part, statistical inference makes propositions about populations, using data drawn from the population of interest via some form of random sampling. More generally, data about a random process is obtained from its observed behavior during a finite period of time. Given a parameter or hypothesis about which one wishes to make inference, statistical inference most often uses:
- a statistical model of the random process that is supposed to generate the data, which is known when randomization has been used, and
- a particular realization of the random process; i.e., a set of data.
Regression
Many techniques for carrying out regression analysis have been developed. Familiar methods, such as linear regression, are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
Nonparametric statistics
Nonparametric statistics are values calculated from data in a way that is not based on parameterized families of probability distributions. They include both descriptive and inferential statistics. The typical parameters are the expectations, variance, etc. Unlike parametric statistics, nonparametric statistics make no assumptions about the probability distributions of the variables being assessed.Non-parametric methods are widely used for studying populations that take on a ranked order. The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods result in "ordinal" data.
As non-parametric methods make fewer assumptions, their applicability is much wider than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more robust.
One drawback of non-parametric methods is that since they do not rely on assumptions, they are generally less powerful than their parametric counterparts. Low power non-parametric tests are problematic because a common use of these methods is for when a sample has a low sample size. Many parametric methods are proven to be the most powerful tests through methods such as the Neyman–Pearson lemma and the Likelihood-ratio test.
Another justification for the use of non-parametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, non-parametric methods may be easier to use. Due both to this simplicity and to their greater robustness, non-parametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.
Statistics, mathematics, and mathematical statistics
Mathematical statistics is a key subset of the discipline of statistics. Statistical theorists study and improve statistical procedures with mathematics, and statistical research often raises mathematical questions.Mathematicians and statisticians like Gauss, Laplace, and C. S. Peirce used decision theory with probability distributions and loss functions. The decision-theoretic approach to statistical inference was reinvigorated by Abraham Wald and his successors and makes extensive use of scientific computing, analysis, and optimization; for the design of experiments, statisticians use algebra and combinatorics. But while statistical practice often relies on probability and decision theory, their application can be controversial