Quantification (machine learning)
In machine learning and data mining, quantification is the task of using supervised learning in order to train models that estimate the relative frequencies of the classes of interest in a sample of unlabelled data items.
For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to class "Positive", and to do the same for classes "Neutral" and "Negative".
Quantification may also be viewed as the task of training predictors that estimate a probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.
It has been shown in multiple research works
that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of Vapnik's principle, which advises solving the problem of interest directly, rather than solving a more general problem as an intermediate step.
In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the "classify and count" method, quantification has evolved as a task in its own right, different from classification.
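To see why "classify and count" is suboptimal, consider the following minimal simulation (the classifier error rates, sample sizes, and function name are illustrative assumptions, not from the source): an imperfect classifier's predicted-positive fraction is a biased estimate of the true prevalence whenever the test prevalence differs from what the classifier's error rates implicitly assume.

```python
import random

random.seed(0)

def classify_and_count(labels, tpr=0.8, fpr=0.2):
    """Simulate an imperfect binary classifier with the given true
    positive rate (tpr) and false positive rate (fpr), and return the
    fraction of items it predicts as positive (the CC estimate)."""
    predicted_pos = 0
    for y in labels:
        p_predict_pos = tpr if y == 1 else fpr
        if random.random() < p_predict_pos:
            predicted_pos += 1
    return predicted_pos / len(labels)

# A test sample whose true positive prevalence (0.9) differs from the
# prevalence the classifier was trained on.
true_prev = 0.9
sample = [1] * 9000 + [0] * 1000

cc_estimate = classify_and_count(sample)
print(f"true prevalence: {true_prev:.2f}, CC estimate: {cc_estimate:.2f}")
```

With tpr = 0.8 and fpr = 0.2, the expected CC estimate is 0.9 × 0.8 + 0.1 × 0.2 = 0.74, well below the true prevalence of 0.9; this systematic bias under prior shift is precisely what adjusted quantification methods aim to correct.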
Quantification tasks
Quantification tasks according to the set of classes
The main variants of quantification, according to the characteristics of the set of classes used, are:
- Binary quantification, corresponding to the case in which there are only two classes and each data item belongs to exactly one of them;
- Single-label multiclass quantification, corresponding to the case in which there are more than two classes and each data item belongs to exactly one of them;
- Multi-label multiclass quantification, corresponding to the case in which there are more than two classes and each data item can belong to zero, one, or several classes at the same time;
- Ordinal quantification, corresponding to the single-label multiclass case in which a total order is defined on the set of classes.
- Regression quantification, a task which stands to 'standard' quantification as regression stands to classification. Strictly speaking, this task is not a quantification task as defined above, but has enough commonalities with other quantification tasks to be considered one of them.
Binary-only methods include the Mixture Model (MM) method, the HDy method, SVM(KLD), and SVM(Q).
Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC), adjusted classify and count (ACC), probabilistic adjusted classify and count (PACC), the Saerens-Latinne-Decaestecker EM-based method (SLD), and KDEy.
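The adjusted classify and count (ACC) correction can be sketched as follows for the binary case. This is an illustrative implementation, not from the source; the function name is hypothetical, and the classifier's tpr and fpr are assumed to have been estimated beforehand, e.g., via cross-validation on the training data.

```python
def adjusted_classify_and_count(cc_prevalence, tpr, fpr):
    """ACC corrects the raw classify-and-count estimate using the
    classifier's true/false positive rates:
        p = (p_cc - fpr) / (tpr - fpr), clipped to [0, 1]."""
    if tpr == fpr:  # degenerate classifier: no correction possible
        return cc_prevalence
    adjusted = (cc_prevalence - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))

# With tpr = 0.8 and fpr = 0.2, a raw CC estimate of 0.74
# is mapped back to approximately 0.90.
print(adjusted_classify_and_count(0.74, tpr=0.8, fpr=0.2))
```

The clipping step is needed because sampling noise in the estimates of tpr and fpr can push the adjusted value outside the [0, 1] range.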
Methods for multi-label quantification include regression-based quantification and label powerset-based quantification.
Methods for the ordinal case include ordinal versions of the above-mentioned ACC, PACC, and SLD methods, and ordinal versions of the above-mentioned HDy method.
Methods for the regression case include "Regress and splice" and "Adjusted regress and sum".
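The Saerens-Latinne-Decaestecker (SLD) EM-based method mentioned above can be sketched compactly in plain Python. This is an illustrative implementation under stated assumptions: the function name and toy posteriors are hypothetical, and the posteriors are assumed to come from a well-calibrated probabilistic classifier.

```python
def sld(posteriors, train_prev, epsilon=1e-6, max_iter=1000):
    """Saerens-Latinne-Decaestecker EM: iteratively rescale the
    classifier's posterior probabilities so that the implied class
    prevalence converges to an estimate of the test prevalence.

    posteriors: per-item lists of class probabilities (rows sum to 1)
    train_prev: class prevalence values observed in the training set
    """
    n_classes = len(train_prev)
    prev = list(train_prev)  # current estimate of test prevalence
    for _ in range(max_iter):
        # E-step: rescale each posterior by the prior ratio, renormalize
        rescaled = []
        for post in posteriors:
            w = [post[c] * prev[c] / train_prev[c] for c in range(n_classes)]
            s = sum(w)
            rescaled.append([v / s for v in w])
        # M-step: new prevalence = mean of the rescaled posteriors
        new_prev = [sum(r[c] for r in rescaled) / len(rescaled)
                    for c in range(n_classes)]
        converged = max(abs(a - b) for a, b in zip(prev, new_prev)) < epsilon
        prev = new_prev
        if converged:
            break
    return prev

# Demo: a test sample that is mostly class 0, while the training
# prevalence was 50/50; SLD pushes the estimate above the raw mean.
posteriors = [[0.9, 0.1]] * 8 + [[0.1, 0.9]] * 2
estimate = sld(posteriors, train_prev=[0.5, 0.5])
print(estimate)
```

Note that the very first E-step leaves the posteriors unchanged (the prior ratio is 1), so the first M-step simply averages the raw posteriors; the subsequent iterations are what distinguish SLD from probabilistic classify and count.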
Quantification tasks according to the type of data
Several subtasks of quantification may be identified according to the type of data involved. Examples of such tasks are:
- Quantification of networked data. This task consists of performing quantification when the datapoints are members of a relation, i.e., are interlinked. As such, this task is a close relative of collective classification.
- Quantification over time. This task consists of performing quantification on sets that become available in a temporal sequence, i.e., as a data stream, and finds application in contexts in which class prevalence values must be monitored over time.
Evaluation measures for quantification
- Absolute Error
- Squared Error
- Relative Absolute Error
- Kullback–Leibler divergence
- Pearson Divergence
- Normalized Match Distance
- Root Normalized Order-Aware Distance
Applications
Quantification is of special interest in fields such as epidemiology, market research, and ecological modelling, since these fields are inherently concerned with aggregate data. However, quantification is also useful as a building block for solving other downstream tasks, such as improving the accuracy of classifiers on out-of-distribution data, allocating resources, measuring classifier bias, and estimating the accuracy of classifiers on out-of-distribution data.
Resources
- LQ 2021: the 1st International Workshop on Learning to Quantify
- LQ 2022: the 2nd International Workshop on Learning to Quantify
- LQ 2023: the 3rd International Workshop on Learning to Quantify
- LQ 2024: the 4th International Workshop on Learning to Quantify
- LQ 2025: the 5th International Workshop on Learning to Quantify
- LeQua 2022: the 1st Data Challenge on Learning to Quantify
- LeQua 2024: the 2nd Data Challenge on Learning to Quantify
- QuaPy: An open-source Python-based software library for quantification
- QuantificationLib: A Python library for quantification and prevalence estimation