Linear discriminant analysis
Linear discriminant analysis (LDA), normal discriminant analysis, canonical variates analysis, or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
LDA is closely related to analysis of variance and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable. Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.
LDA is also closely related to principal component analysis and factor analysis in that they both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables must be made.
LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.
Discriminant analysis is used when groups are known a priori. Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification: the act of distributing things into groups, classes or categories of the same type.
History
The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one or multiple continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership.
LDA for two classes
Consider a set of observations $\vec x$ for each sample of an object or event with known class $y$. This set of samples is called the training set in a supervised learning context. The classification problem is then to find a good predictor for the class $y$ of any sample of the same distribution given only an observation $\vec x$.
LDA approaches the problem by assuming that the conditional probability density functions $p(\vec x \mid y=0)$ and $p(\vec x \mid y=1)$ are both the normal distribution with mean and covariance parameters $(\vec\mu_0, \Sigma_0)$ and $(\vec\mu_1, \Sigma_1)$, respectively. Under this assumption, the Bayes-optimal solution is to predict points as being from the second class if the log of the likelihood ratio is bigger than some threshold $T$, so that:
$$\tfrac{1}{2}(\vec x - \vec\mu_0)^{T}\Sigma_0^{-1}(\vec x - \vec\mu_0) + \tfrac{1}{2}\ln|\Sigma_0| - \tfrac{1}{2}(\vec x - \vec\mu_1)^{T}\Sigma_1^{-1}(\vec x - \vec\mu_1) - \tfrac{1}{2}\ln|\Sigma_1| > T$$
Without any further assumptions, the resulting classifier is referred to as quadratic discriminant analysis.
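LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical, so $\Sigma_0 = \Sigma_1 = \Sigma$) and that the covariances have full rank.
In this case, several terms cancel:
$$\vec x^{T}\Sigma_0^{-1}\vec x = \vec x^{T}\Sigma_1^{-1}\vec x, \qquad \ln|\Sigma_0| = \ln|\Sigma_1|, \qquad \vec x^{T}\Sigma_i^{-1}\vec\mu_i = \vec\mu_i^{T}\Sigma_i^{-1}\vec x \quad \text{(because } \Sigma_i \text{ is symmetric)}$$
and the above decision criterion becomes a threshold on the dot product
$$\vec w^{T}\vec x > c$$
for some threshold constant $c$, where
$$\vec w = \Sigma^{-1}(\vec\mu_1 - \vec\mu_0), \qquad c = T + \tfrac{1}{2}\left(\vec\mu_1^{T}\Sigma^{-1}\vec\mu_1 - \vec\mu_0^{T}\Sigma^{-1}\vec\mu_0\right)$$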
This means that the criterion of an input being in a class is purely a function of this linear combination of the known observations.
It is often useful to see this conclusion in geometrical terms: the criterion of an input $\vec x$ being in a class $y$ is purely a function of the projection of the multidimensional-space point $\vec x$ onto the vector $\vec w$. In other words, the observation belongs to $y$ if the corresponding $\vec x$ is located on a certain side of a hyperplane perpendicular to $\vec w$. The location of the plane is defined by the threshold $c$.
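The two-class rule can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the shared covariance is estimated by the pooled sample covariance (one common choice), the function names `fit_two_class_lda` and `predict` are hypothetical, and the synthetic data are made up for the example. It uses the weight vector $\vec w = \Sigma^{-1}(\vec\mu_1 - \vec\mu_0)$ and the threshold $c = T + \tfrac{1}{2}\vec w \cdot (\vec\mu_0 + \vec\mu_1)$, which for $T = 0$ places the hyperplane midway between the projected class means.

```python
import numpy as np

def fit_two_class_lda(X0, X1, T=0.0):
    """Return (w, c) for the rule: predict class 1 when w @ x > c."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # Assumption: estimate the shared covariance by the pooled sample covariance.
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    sigma = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)
    # w = Sigma^{-1} (mu1 - mu0); solve() avoids forming the inverse explicitly.
    w = np.linalg.solve(sigma, mu1 - mu0)
    # c = T + 1/2 * w . (mu0 + mu1)
    c = T + 0.5 * w @ (mu0 + mu1)
    return w, c

def predict(X, w, c):
    """Project each row of X onto w and compare with the threshold c."""
    return (X @ w > c).astype(int)

# Synthetic two-class data with a common covariance (illustrative values only).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([3.0, 3.0], cov, size=200)

w, c = fit_two_class_lda(X0, X1)
correct = np.concatenate([predict(X0, w, c) == 0, predict(X1, w, c) == 1])
accuracy = correct.mean()
print("w =", w, "c =", c, "training accuracy =", accuracy)
```

Only the projection `X @ w` enters the decision, which is the geometric statement above: moving a point parallel to the hyperplane does not change its predicted class.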
Assumptions
The assumptions of discriminant analysis are the same as those for MANOVA. The analysis is quite sensitive to outliers, and the size of the smallest group must be larger than the number of predictor variables.
- Multivariate normality: Independent variables are normal for each level of the grouping variable.
- Homogeneity of variance/covariance (homoscedasticity): Variances among group variables are the same across levels of predictors. This can be tested with Box's M statistic. It has been suggested, however, that linear discriminant analysis be used when covariances are equal, and that quadratic discriminant analysis may be used when covariances are not equal.
- Independence: Participants are assumed to be randomly sampled, and a participant's score on one variable is assumed to be independent of scores on that variable for all other participants.
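Two of these checks are easy to compute. The sketch below is an illustration under assumptions, not a complete test procedure: it computes the uncorrected Box's M statistic, $M = (N - k)\ln|S_{pooled}| - \sum_i (n_i - 1)\ln|S_i|$, without the scaling factor or the chi-square reference distribution used to obtain a p-value, and it checks the group-size condition. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def log_det(S):
    """Natural log of the determinant of a positive-definite matrix."""
    return np.linalg.slogdet(S)[1]

def box_m_statistic(groups):
    """Uncorrected Box's M for a list of (n_i, p) arrays, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    covs = [np.cov(g, rowvar=False) for g in groups]
    # Pooled within-group covariance, weighted by degrees of freedom.
    pooled = sum((len(g) - 1) * S for g, S in zip(groups, covs)) / (n_total - k)
    return (n_total - k) * log_det(pooled) - sum(
        (len(g) - 1) * log_det(S) for g, S in zip(groups, covs)
    )

def smallest_group_is_large_enough(groups):
    """Smallest group must have more cases than there are predictors."""
    n_predictors = groups[0].shape[1]
    return min(len(g) for g in groups) > n_predictors

# Illustrative data: a and b share a covariance, c has a much larger one.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 3))
b = rng.normal(0.0, 1.0, size=(200, 3))
c = rng.normal(0.0, 5.0, size=(200, 3))

m_equal = box_m_statistic([a, b])
m_unequal = box_m_statistic([a, c])
print("M (equal covariances):  ", m_equal)
print("M (unequal covariances):", m_unequal)
print("group sizes adequate:", smallest_group_is_large_enough([a, b, c]))
```

A value of M near zero is consistent with equal covariances; a large value suggests that quadratic discriminant analysis may be the better fit.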
Discriminant functions
Discriminant analysis works by creating one or more linear combinations of predictors, creating a new latent variable for each function. These functions are called discriminant functions. The number of functions possible is either $N_g - 1$, where $N_g$ is the number of groups, or $p$ (the number of predictors), whichever is smaller. The first function created maximizes the differences between groups on that function. The second function maximizes differences on that function, but also must not be correlated with the previous function. This continues with subsequent functions with the requirement that the new function not be correlated with any of the previous functions.
Given group $j$, with $\mathbb{R}_j$ sets of sample space, there is a discriminant rule such that if $x \in \mathbb{R}_j$, then $x \in j$. Discriminant analysis then finds “good” regions of $\mathbb{R}_j$ to minimize classification error, therefore leading to a high percentage correctly classified in the classification table.
Each function is given a discriminant score to determine how well it predicts group placement.
- Structure Correlation Coefficients: The correlation between each predictor and the discriminant score of each function. This is a zero-order correlation.
- Standardized Coefficients: Each predictor's weight in the linear combination that is the discriminant function. As in a regression equation, these coefficients are partial: each indicates the unique contribution of its predictor in predicting group assignment.
- Functions at Group Centroids: Mean discriminant scores for each grouping variable are given for each function. The farther apart the means are, the less error there will be in classification.
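The construction of the discriminant functions can be sketched as an eigenvalue problem. The example below is a minimal illustration, assuming the functions are taken as eigenvectors of $S_w^{-1} S_b$, where $S_w$ and $S_b$ are the within-group and between-group sums-of-squares-and-cross-products matrices; the function name `discriminant_functions` and the synthetic data are hypothetical. It keeps $\min(N_g - 1, p)$ functions and reports the mean score of each group on each function.

```python
import numpy as np

def discriminant_functions(X, y):
    """Return (eigenvalues, coefficient vectors), largest eigenvalue first."""
    classes = np.unique(y)
    n_features = X.shape[1]
    grand_mean = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))  # within-group scatter
    S_b = np.zeros((n_features, n_features))  # between-group scatter
    for k in classes:
        Xk = X[y == k]
        mean_k = Xk.mean(axis=0)
        centered = Xk - mean_k
        S_w += centered.T @ centered
        d = (mean_k - grand_mean).reshape(-1, 1)
        S_b += len(Xk) * (d @ d.T)
    # Eigenvectors of S_w^{-1} S_b; the eigenvalues are real in theory,
    # so any tiny imaginary parts from round-off are discarded.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    evals, evecs = evals.real, evecs.real
    n_functions = min(len(classes) - 1, n_features)
    order = np.argsort(evals)[::-1][:n_functions]
    return evals[order], evecs[:, order]

# Three groups, four predictors (illustrative values only).
rng = np.random.default_rng(1)
class_means = np.array([[0.0, 0, 0, 0], [3.0, 0, 1, 0], [0.0, 3, 0, 1]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in class_means])
y = np.repeat([0, 1, 2], 50)

eigenvalues, coefficients = discriminant_functions(X, y)
scores = X @ coefficients  # one column of discriminant scores per function
score_correlation = np.corrcoef(scores, rowvar=False)[0, 1]
centroids = np.array([scores[y == k].mean(axis=0) for k in np.unique(y)])
print("eigenvalues:", eigenvalues)
print("correlation between the two functions:", score_correlation)
print("functions at group centroids:\n", centroids)
```

With three groups and four predictors, two functions are returned, and their scores are uncorrelated up to round-off, matching the requirement stated above.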
Discrimination rules
- Maximum likelihood: Assigns $x$ to the group that maximizes the population (group) density.
- Bayes Discriminant Rule: Assigns $x$ to the group that maximizes $\pi_i f_i(x)$, where $\pi_i$ represents the prior probability of that classification, and $f_i(x)$ represents the population density.
- Fisher's linear discriminant rule: Maximizes the ratio between SSbetween and SSwithin, and finds a linear combination of the predictors to predict group.
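The difference between the maximum likelihood rule and the Bayes rule can be seen in a small example. The sketch below assumes Gaussian group densities $f_i(x)$ with known means and covariances; the function names and the numbers are hypothetical. It works with logarithms, so the Bayes rule maximizes $\ln \pi_i + \ln f_i(x)$, which selects the same group as maximizing $\pi_i f_i(x)$.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    diff = x - mean
    _, log_det = np.linalg.slogdet(cov)
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + log_det + mahalanobis_sq)

def maximum_likelihood_rule(x, means, covs):
    """Assign x to the group with the largest density f_i(x)."""
    scores = [gaussian_log_density(x, m, S) for m, S in zip(means, covs)]
    return int(np.argmax(scores))

def bayes_rule(x, priors, means, covs):
    """Assign x to the group with the largest pi_i * f_i(x)."""
    scores = [
        np.log(p) + gaussian_log_density(x, m, S)
        for p, m, S in zip(priors, means, covs)
    ]
    return int(np.argmax(scores))

means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([1.2, 0.0])  # slightly closer to the second group's mean

ml_label = maximum_likelihood_rule(x, means, covs)           # -> 1
bayes_label = bayes_rule(x, [0.9, 0.1], means, covs)         # -> 0
print(ml_label, bayes_label)
```

The point is nearer the second mean, so the maximum likelihood rule picks group 1, but a strong prior in favour of group 0 reverses the decision under the Bayes rule.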
Eigenvalues
The eigenvalue can be viewed as a ratio of SSbetween and SSwithin as in ANOVA when the dependent variable is the discriminant function, and the groups are the levels of the independent variable (IV). This means that the largest eigenvalue is associated with the first function, the second largest with the second, etc.
Effect size
Some suggest the use of eigenvalues as effect size measures; however, this is generally not supported. Instead, the canonical correlation is the preferred measure of effect size. It is similar to the eigenvalue, but is the square root of the ratio of SSbetween and SStotal. It is the correlation between groups and the function.
Another popular measure of effect size is the percent of variance for each function. This is calculated by $(\lambda_x / \sum_i \lambda_i) \times 100$, where $\lambda_x$ is the eigenvalue for the function and $\sum_i \lambda_i$ is the sum of all eigenvalues. This tells us how strong the prediction is for that particular function compared to the others.
Percent correctly classified can also be analyzed as an effect size. The kappa value can describe this while correcting for chance agreement.
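These effect size measures reduce to short formulas. The sketch below is illustrative only: the eigenvalues and the classification table are made-up numbers, the function names are hypothetical, and the kappa value is assumed to be Cohen's kappa. Because an eigenvalue $\lambda$ is SSbetween / SSwithin and SStotal = SSbetween + SSwithin, the canonical correlation is $\sqrt{\lambda / (1 + \lambda)}$.

```python
import math

def canonical_correlation(eigenvalue):
    """sqrt(SS_between / SS_total), with eigenvalue = SS_between / SS_within."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))

def percent_of_variance(eigenvalues):
    """Share of the summed eigenvalues attributable to each function."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

def cohen_kappa(confusion):
    """Chance-corrected agreement for a square classification table."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1.0 - expected)

eigenvalues = [3.0, 1.0]              # hypothetical values for two functions
confusion = [[40, 10], [5, 45]]       # rows: actual group, columns: predicted

print(canonical_correlation(eigenvalues[0]))  # sqrt(0.75), about 0.866
print(percent_of_variance(eigenvalues))       # [75.0, 25.0]
print(cohen_kappa(confusion))                 # about 0.7 (85% correct, 50% expected by chance)
```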
Canonical discriminant analysis for k classes
Canonical discriminant analysis finds axes that best separate the categories. These linear functions are uncorrelated and define, in effect, an optimal k − 1 space through the n-dimensional cloud of data that best separates the k groups. See “Multiclass LDA” for details below.
Because LDA uses canonical variates, it was initially often referred to as the "method of canonical variates" or canonical variates analysis.
Fisher's linear discriminant
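The terms Fisher's linear discriminant and LDA are often used interchangeably, although Fisher's original article actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances.
Suppose two classes of observations have means $\vec\mu_0, \vec\mu_1$ and covariances $\Sigma_0, \Sigma_1$. Then the linear combination of features $\vec w^{T}\vec x$ will have means $\vec w^{T}\vec\mu_i$ and variances $\vec w^{T}\Sigma_i\vec w$ for $i = 0, 1$. Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:
$$S = \frac{\sigma_{between}^{2}}{\sigma_{within}^{2}} = \frac{(\vec w \cdot \vec\mu_1 - \vec w \cdot \vec\mu_0)^{2}}{\vec w^{T}\Sigma_1\vec w + \vec w^{T}\Sigma_0\vec w} = \frac{\left(\vec w \cdot (\vec\mu_1 - \vec\mu_0)\right)^{2}}{\vec w^{T}(\Sigma_0 + \Sigma_1)\vec w}$$
This measure is, in some sense, a measure of the signal-to-noise ratio for the class labelling. It can be shown that the maximum separation occurs when
$$\vec w \propto (\Sigma_0 + \Sigma_1)^{-1}(\vec\mu_1 - \vec\mu_0)$$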
When the assumptions of LDA are satisfied, the above equation is equivalent to LDA.
Note that the vector $\vec w$ is the normal to the discriminant hyperplane. As an example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to $\vec w$.
Generally, the data points to be discriminated are projected onto $\vec w$; then the threshold that best separates the data is chosen from analysis of the one-dimensional distribution. There is no general rule for the threshold. However, if projections of points from both classes exhibit approximately the same distributions, a good choice would be the hyperplane between projections of the two means, $\vec w \cdot \vec\mu_0$ and $\vec w \cdot \vec\mu_1$. In this case the parameter $c$ in the threshold condition $\vec w \cdot \vec x > c$ can be found explicitly:
$$c = \vec w \cdot \tfrac{1}{2}(\vec\mu_0 + \vec\mu_1)$$
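Fisher's criterion can be checked numerically. The sketch below is a minimal illustration with hypothetical function names and synthetic data whose two classes deliberately have different covariances. It computes the direction $\vec w \propto (\Sigma_0 + \Sigma_1)^{-1}(\vec\mu_1 - \vec\mu_0)$ from sample estimates, sets the threshold midway between the projected means, and compares the separation $S$ achieved by $\vec w$ with that of random directions.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Return (w, c): Fisher's direction and the midpoint threshold."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    # w is proportional to (Sigma_0 + Sigma_1)^{-1} (mu_1 - mu_0).
    w = np.linalg.solve(S0 + S1, mu1 - mu0)
    # Threshold halfway between the projections of the two class means.
    c = 0.5 * w @ (mu0 + mu1)
    return w, c

def separation(w, X0, X1):
    """Between-class over within-class variance of the projected data."""
    p0, p1 = X0 @ w, X1 @ w
    return (p1.mean() - p0.mean()) ** 2 / (p1.var(ddof=1) + p0.var(ddof=1))

# Two classes with unequal covariances (illustrative values only).
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]], size=300)
X1 = rng.multivariate_normal([2.0, 1.0], [[3.0, 1.0], [1.0, 1.0]], size=300)

w, c = fisher_direction(X0, X1)
best = separation(w, X0, X1)
random_directions = rng.normal(size=(100, 2))
others = [separation(r, X0, X1) for r in random_directions]
print("separation along w:", best)
print("largest separation among random directions:", max(others))
```

No random direction should exceed the separation along `w`, since `w` maximizes the ratio for the sample means and covariances used here. Rescaling `w` leaves the separation unchanged, which is why the optimum is stated only up to proportionality.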
Otsu's method is related to Fisher's linear discriminant, and was created to binarize the histogram of pixels in a grayscale image by optimally picking the black/white threshold that minimizes the intra-class variance and maximizes the inter-class variance of the grayscale values assigned to the black and white pixel classes.
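A brute-force version of this threshold search can be sketched as follows. It is an illustration under assumptions rather than an optimized implementation: the image is assumed to hold integer gray levels in the range 0 to 255, pixels below the threshold are treated as black, and the function name `otsu_threshold` and the synthetic image are hypothetical. For each candidate threshold it evaluates the between-class variance $\omega_0 \omega_1 (\mu_0 - \mu_1)^2$; maximizing this is equivalent to minimizing the within-class variance because the two sum to the fixed total variance.

```python
import numpy as np

def otsu_threshold(image, n_levels=256):
    """Return t such that pixels < t form one class and pixels >= t the other."""
    hist = np.bincount(image.ravel(), minlength=n_levels).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(n_levels)
    best_t, best_between = 0, -1.0
    for t in range(1, n_levels):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue  # one class would be empty
        mu0 = (levels[:t] * prob[:t]).sum() / w0  # mean gray level, dark class
        mu1 = (levels[t:] * prob[t:]).sum() / w1  # mean gray level, bright class
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t

# Synthetic image: a dark cluster near 50 and a bright cluster near 200.
rng = np.random.default_rng(0)
dark = rng.integers(40, 61, size=500)
bright = rng.integers(190, 211, size=500)
image = np.concatenate([dark, bright]).reshape(25, 40)

threshold = otsu_threshold(image)
binary = image >= threshold
print("threshold:", threshold)
```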