Logistic regression
In statistics, a logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative name logit model. A worked example with the formal details is given below.
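In symbols, for a single explanatory variable $x$ (the multivariable case simply adds further terms $\beta_2 x_2, \beta_3 x_3, \dots$), the model states

$$\ln \frac{p}{1-p} = \beta_0 + \beta_1 x, \qquad\text{equivalently}\qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},$$

where the left-hand quantity is the log-odds (logit) of the event and the second form inverts it via the logistic function; this notation matches the worked example below.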
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning or of a patient being healthy, and the logistic model has been the most commonly used model for binary regression since about 1970. Binary variables can be generalized to categorical variables when there are more than two possible values, and binary logistic regression can be generalized to multinomial logistic regression. If the multiple categories are ordered, one can use ordinal logistic regression. The logistic regression model itself simply models the probability of output in terms of input and does not perform statistical classification, though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below the cutoff as the other; this is a common way to make a binary classifier.
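As a minimal sketch of the cutoff idea (the probabilities below are made-up illustrative values, and the 0.5 cutoff is a common but arbitrary choice):

```python
# Minimal sketch: turn modeled probabilities into binary class labels
# via a cutoff. The probabilities are hypothetical model outputs.
probabilities = [0.12, 0.47, 0.53, 0.91]
cutoff = 0.5  # common, but arbitrary, choice

labels = [1 if p > cutoff else 0 for p in probabilities]
print(labels)  # [0, 0, 1, 1]
```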
Analogous linear models for binary variables with a different sigmoid function instead of the logistic function can also be used, most notably the probit model. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the log-odds are the natural parameter of the Bernoulli distribution, and in this sense the logistic function is the "simplest" way to convert a real number into a probability.
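To see the multiplicative effect, note that exponentiating the model above gives the odds directly, so a one-unit increase in $x$ scales the odds by the constant factor $e^{\beta_1}$:

$$\frac{p(x)}{1-p(x)} = e^{\beta_0 + \beta_1 x}, \qquad \frac{p(x+1)/\bigl(1-p(x+1)\bigr)}{p(x)/\bigl(1-p(x)\bigr)} = e^{\beta_1}.$$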
The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). Unlike linear least squares, this does not have a closed-form expression and must be computed numerically. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares plays for scalar responses: it is a simple, well-analyzed baseline model. The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson, beginning in 1944, when he coined the term "logit".
Applications
General
Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score, which is widely used to predict mortality in injured patients, was originally developed by Boyd using logistic regression. Many other medical scales used to assess the severity of a patient's condition have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease, based on observed characteristics of the patient. Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or for any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product. It is also used in marketing applications, such as predicting a customer's propensity to purchase a product or halt a subscription. In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing. Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes. These models help in the development of reliable disaster management plans and safer design for the built environment.
Supervised machine learning
Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks, such as identifying whether an email is spam or diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results. It uses the logistic function to transform a linear combination of input features into a probability between 0 and 1, indicating the likelihood that a given input belongs to one of two predefined categories. With its distinctive S-shaped curve, the logistic function maps any real-valued number into the interval (0, 1), which makes it particularly suitable for binary classification. By estimating the probability that the dependent variable falls in a given category, logistic regression provides a probabilistic framework that supports informed decision-making.
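As a concrete sketch using scikit-learn (the feature values and labels below are made-up stand-ins, e.g. counts of suspicious words per email, with 1 = spam and 0 = not spam):

```python
# Sketch: fitting a logistic regression classifier with scikit-learn.
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [2], [3], [4], [5]]  # one feature per example (made-up)
y = [0, 0, 0, 1, 1, 1]              # binary labels (made-up)

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each input
print(model.predict_proba([[1.5], [3.5]]))
print(model.predict([[1.5], [3.5]]))  # thresholded at 0.5 by default
```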
Example
Problem
As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem were changed so that pass/fail were replaced with a grade of 0–100 (cardinal numbers), then simple regression analysis could be used.
The table shows the number of hours each student spent studying, and whether they passed or failed.
| Hours | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50 |
| Pass | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
We wish to fit a logistic function to the data consisting of the hours studied ($x_k$) and the outcome of the test ($y_k$). The data points are indexed by the subscript $k$, which runs from $k=1$ to $k=20$. The $x$ variable is called the "explanatory variable", and the $y$ variable is called the "categorical variable" consisting of two categories: "pass" or "fail", corresponding to the categorical values 1 and 0 respectively.
Model
The logistic function is of the form

$$p(x) = \frac{1}{1 + e^{-(x-\mu)/s}},$$

where $\mu$ is a location parameter (the midpoint of the curve, where $p(\mu) = 1/2$) and $s$ is a scale parameter. This expression may be rewritten as

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},$$

where $\beta_0 = -\mu/s$ is known as the intercept and $\beta_1 = 1/s$ is the rate parameter: these are the $y$-intercept and slope of the log-odds as a function of $x$. Conversely, $\mu = -\beta_0/\beta_1$ and $s = 1/\beta_1$.
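The following sketch, in plain Python with illustrative parameter values (not the fitted ones), shows that the two parametrizations are equivalent:

```python
import math

def logistic_mu_s(x, mu, s):
    """Logistic function with location mu and scale s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_beta(x, beta0, beta1):
    """Same function in intercept/slope form: beta0 = -mu/s, beta1 = 1/s."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Illustrative values only
mu, s = 2.7, 0.67
beta0, beta1 = -mu / s, 1.0 / s
x = 2.0
print(logistic_mu_s(x, mu, s))         # ~0.26
print(logistic_beta(x, beta0, beta1))  # identical value
```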
Note that this model is actually an oversimplification, as it implies that every student will pass if they study indefinitely.
Fit
The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given $x_k$ and $y_k$, write $p_k = p(x_k)$. The $p_k$ are the probabilities that the corresponding $y_k$ will equal one, and $1 - p_k$ are the probabilities that they will be zero. We wish to find the values of $\beta_0$ and $\beta_1$ which give the "best fit" to the data. (In the case of linear regression, the sum of the squared deviations of the fit from the data points, the squared error loss, is taken as the measure of goodness of fit, and the best fit is obtained when that function is minimized.) The log loss for the $k$-th point is

$$-\ln \ell_k = \begin{cases} -\ln p_k & \text{if } y_k = 1, \\ -\ln(1 - p_k) & \text{if } y_k = 0, \end{cases}$$

where $\ell_k$ is the likelihood of the $k$-th observation under the model.
The log loss can be interpreted as the "surprisal" of the actual outcome relative to the prediction, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in the case of a perfect prediction, and approaches infinity as the prediction gets worse, meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point, in a logistic regression it is not possible to have zero loss at any point, since $y_k$ is either 0 or 1 while $0 < p_k < 1$.
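For instance, if the actual outcome is $y_k = 1$, a confident correct prediction $p_k = 0.9$ incurs a small loss, while a confident wrong prediction $p_k = 0.1$ incurs a much larger one:

$$-\ln 0.9 \approx 0.105, \qquad -\ln 0.1 \approx 2.303.$$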
These can be combined into a single expression:

$$-\ln \ell_k = -y_k \ln p_k - (1 - y_k) \ln(1 - p_k).$$
This expression is more formally known as the cross-entropy of the predicted distribution $(p_k,\ 1-p_k)$ from the actual distribution $(y_k,\ 1-y_k)$, as probability distributions on the two-element space of (pass, fail).
The sum of these, the total loss $-\ln L = \sum_k -\ln \ell_k$, is the overall negative log-likelihood, and the best fit is obtained for those choices of $\beta_0$ and $\beta_1$ for which $-\ln L$ is minimized.
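As a sketch of this computation, using the hours/pass data from the table above and arbitrary candidate parameter values (not the optimum):

```python
import math

# Hours studied and pass (1) / fail (0) outcomes from the table above
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def total_loss(beta0, beta1):
    """Total log loss -ln L for given candidate parameters."""
    total = 0.0
    for x, y in zip(hours, passed):
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Arbitrary candidate values, for illustration only
print(total_loss(-4.0, 1.5))
```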
Alternatively, instead of minimizing the loss, one can maximize its negation, the log-likelihood:

$$\ln L = \sum_{k:\, y_k=1} \ln p_k + \sum_{k:\, y_k=0} \ln(1 - p_k) = \sum_{k} \bigl( y_k \ln p_k + (1 - y_k) \ln(1 - p_k) \bigr),$$
or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

$$L = \prod_{k:\, y_k=1} p_k \; \prod_{k:\, y_k=0} (1 - p_k).$$
This method is known as maximum likelihood estimation.
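A minimal sketch of carrying out this maximization numerically, using scipy's general-purpose optimizer on the negative log-likelihood (the data are from the table above; for this data set the estimates should land near $\beta_0 \approx -4.1$ and $\beta_1 \approx 1.5$):

```python
import numpy as np
from scipy.optimize import minimize

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

def nll(beta):
    """Negative log-likelihood -ln L as a function of (beta0, beta1)."""
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * hours)))
    return -np.sum(passed * np.log(p) + (1 - passed) * np.log(1 - p))

# Minimizing -ln L is the same as maximizing the likelihood
result = minimize(nll, x0=np.zeros(2), method="BFGS")
beta0_hat, beta1_hat = result.x
print(beta0_hat, beta1_hat)  # maximum-likelihood estimates
```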