Statistics

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.
When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation. Descriptive statistics are most often concerned with two sets of properties of a distribution: central tendency seeks to characterize the distribution's central or typical value, while dispersion characterizes the extent to which members of the distribution depart from its center and from each other. Inferences made using mathematical statistics employ the framework of probability theory, which deals with the analysis of random phenomena.
A standard statistical procedure involves the collection of data leading to a test of the relationship between two statistical data sets, or a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, an alternative to an idealized null hypothesis of no relationship between two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors and Type II errors. Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.
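As a rough illustration of testing a null hypothesis of no relationship between two data sets, the sketch below runs a two-sided permutation test on the difference in means. The permutation test is only one of many possible tests and is chosen here because it needs nothing beyond the Python standard library; the function name, the data, and the resampling count are all illustrative assumptions.

```python
import random
import statistics


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Under the null hypothesis of no relationship, group labels are
    exchangeable, so the pooled values are reshuffled many times and
    the observed difference is compared with the reshuffled ones.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical measurements for two groups (made-up values).
group_a = [1, 2, 3, 4, 5, 6]
group_b = [101, 102, 103, 104, 105, 106]
p_value = permutation_test(group_a, group_b)
print(p_value)
```

A small p-value suggests that the data are hard to reconcile with the null hypothesis; rejecting the null when it is in fact true would be a Type I error, and failing to reject it when it is false would be a Type II error.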
Statistical measurement processes are also prone to error with regard to the data that they generate. Many of these errors are classified as random or systematic, but other types of errors can also occur. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems.

Introduction

Statistics is the discipline that deals with data, facts and figures from which meaningful information is inferred. Data may represent a numerical value, in the form of quantitative data, or a label, as with qualitative data. Statistical methods fall into two groups. Descriptive statistics collects, presents and summarises data; two elementary summaries of data, each called a statistic, are the mean and dispersion. Inferential statistics interprets data from a sample to induce statements and predictions about the population from which the sample was drawn.
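The two elementary summaries mentioned here can be computed with Python's standard `statistics` module. The following is a minimal sketch; the sample values are made up for illustration, and the standard deviation is used as one common measure of dispersion.

```python
import statistics

# Hypothetical quantitative sample (illustrative values only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Central tendency: the arithmetic mean.
mean = statistics.mean(sample)

# Dispersion: standard deviation, treating the data either as a
# sample (n - 1 denominator) or as a complete population (n denominator).
sample_sd = statistics.stdev(sample)
population_sd = statistics.pstdev(sample)

print(mean, sample_sd, population_sd)
```

Which denominator is appropriate depends on whether the data are regarded as a sample from a larger population or as the whole population of interest.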
Statistics is regarded as a body of science or a branch of mathematics. It is based on probability, a branch of mathematics that studies random events. Statistics is considered the science of uncertainty, a view that arises from its methods for coping with measurement and sampling error and for dealing with uncertainties in modelling. Although probability and statistics were once paired together as a single subject, they are conceptually distinct from one another. Probability deduces answers to specific situations from a general theory, whereas statistics induces statements about a population based on a data set. Statistics serves to bridge the gap between probability and applied mathematical fields.
Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is generally concerned with the use of data in the context of uncertainty and decision-making in the face of uncertainty. Statistics is indexed at 62, a subclass of probability theory and stochastic processes, in the Mathematics Subject Classification. Mathematical statistics is covered in the range 276-280 of subclass QA in the Library of Congress Classification.
The word statistics ultimately comes from the Latin word status, meaning "situation" or "condition" in society, which in late Latin adopted the meaning "state". Derived from this, the political scientist Gottfried Achenwall coined the German word Statistik. In 1770, the term entered the English language through German and referred to the study of political arrangements. The term gained its modern meaning in the 1790s in John Sinclair's works. In modern German, the term Statistik is synonymous with mathematical statistics. The term statistic, in singular form, describes both a function of the data and the value that the function returns.

Statistical data

Data collection

Sampling

When full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through statistical models.
To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent to which the chosen sample is actually representative. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.
Sampling theory is part of the mathematical discipline of probability theory. Probability is used in mathematical statistics to study the sampling distributions of sample statistics and, more generally, the properties of statistical procedures. The use of any statistical method is valid when the system or population under consideration satisfies the assumptions of the method. The difference in point of view between classic probability theory and sampling theory is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction—inductively inferring from samples to the parameters of a larger or total population.
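The idea of a sampling distribution can be sketched by simulation. The example below builds a synthetic population with known parameters, draws many random samples from it, and compares the spread of the sample means with the value that probability theory predicts, namely the population standard deviation divided by the square root of the sample size. The population, sample size, and number of repetitions are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic population with roughly known parameters (an assumption
# made for the simulation, not real data).
population = [rng.gauss(50.0, 10.0) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Draw many simple random samples (without replacement) and record
# the mean of each one.
n = 25
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(2_000)
]

# The sample means should centre on mu, with a spread close to
# sigma / sqrt(n).
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
predicted_spread = sigma / n ** 0.5

print(mean_of_means, mu)
print(spread_of_means, predicted_spread)
```

The simulation runs in the deductive direction, from known population parameters to the behaviour of samples. Inference runs the other way: a single observed sample mean, together with an estimate of its spread, is used to make statements about the unknown population mean.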

Experimental and observational studies

A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences in an independent variable on the behavior of the dependent variable is observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements with different levels using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated. While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data, such as natural experiments and observational studies, for which a statistician would use a modified, more structured estimation method that produces consistent estimators.

Experiments

The basic steps of a statistical experiment are:

Planning the research, including finding the number of replicates of the study, using the following information: preliminary estimates regarding the size of treatment effects, alternative hypotheses, and the estimated experimental variability. Consideration of the selection of experimental subjects and the ethics of research is necessary. Statisticians recommend that experiments compare one new treatment with a standard treatment or control, to allow an unbiased estimate of the difference in treatment effects.
Design of experiments, using blocking to reduce the influence of confounding variables, and randomized assignment of treatments to subjects to allow unbiased estimates of treatment effects and experimental error. At this stage, the experimenters and statisticians write the experimental protocol that will guide the performance of the experiment and which specifies the primary analysis of the experimental data.
Performing the experiment following the experimental protocol and analyzing the data following the experimental protocol.
Further examining the data set in secondary analyses, to suggest new hypotheses for future study.
Documenting and presenting the results of the study.
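The design step above combines blocking with randomized assignment. A minimal sketch of that idea follows: subjects are grouped into blocks by a known characteristic, shuffled within each block, and then assigned treatments in rotation so that each block is as balanced as its size allows. The function name, the subject records, and the blocking variable (a hypothetical study site) are illustrative assumptions rather than a prescribed procedure.

```python
import random
from collections import defaultdict


def blocked_assignment(subjects, block_of, treatments, seed=0):
    """Randomly assign treatments to subjects within blocks."""
    rng = random.Random(seed)

    # Group subjects by the blocking variable.
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    assignment = {}
    for members in blocks.values():
        # Randomize the order within the block, then rotate through
        # the treatments so that group sizes differ by at most one.
        rng.shuffle(members)
        for i, subject in enumerate(members):
            assignment[subject] = treatments[i % len(treatments)]
    return assignment


# Hypothetical subjects: (identifier, site), with site as the block.
subjects = [(f"s{i}", "site A" if i < 4 else "site B") for i in range(8)]
plan = blocked_assignment(
    subjects, block_of=lambda s: s[1], treatments=["control", "new"]
)
for subject, treatment in sorted(plan.items()):
    print(subject, treatment)
```

Fixing the seed makes the allocation reproducible, which can help when the assignment procedure is written into the experimental protocol in advance.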

Experiments on human behavior have special concerns. The famous Hawthorne study examined changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked whether the changes in illumination affected productivity. It turned out that productivity indeed improved. However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and of blinding. The Hawthorne effect refers to the finding that an outcome changed due to observation itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed.