Leakage (machine learning)
In statistics and machine learning, leakage refers to the use of information during model training that would not be available at prediction time. This results in overly optimistic performance estimates, as the model appears to perform better during evaluation than it actually would in a production environment.
Leakage is often subtle and indirect, making it difficult to detect and eliminate. It can lead a statistician or modeler to select a suboptimal model, which may be outperformed by a leakage-free alternative.
Leakage modes
Leakage can occur at multiple stages of the machine learning workflow. Broadly, its sources can be divided into two categories: those arising from features and those arising from training examples.
Feature leakage
Feature or column-wise leakage is caused by the inclusion of columns that are a duplicate of the label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for prediction, and so cause leakage if included during training. For example, including a "MonthlySalary" column when predicting "YearlySalary", or "MinutesLate" when predicting "IsLate".
Training example leakage
Row-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:
- Premature featurization: computing features from the full dataset before the cross-validation/train/test split, so that validation and test data influence training
- Duplicate rows between train/validation/test
- Non-independent and identically distributed (non-IID) data, including:
  - Time leakage
  - Group leakage, i.e. not splitting on a grouping column
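The premature-featurization mode in the list above can be illustrated with a minimal sketch: computing normalization statistics before splitting lets the test split influence the training features. The data values are illustrative.

```python
# Sketch: premature featurization. Fitting a scaler on the full dataset
# before splitting leaks test-set statistics into training.
from statistics import mean, stdev

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last point ends up in the test split
train, test = data[:4], data[4:]

# Leaky: statistics computed on train and test together.
leaky_mu, leaky_sd = mean(data), stdev(data)
# Correct: statistics computed on the training split only.
mu, sd = mean(train), stdev(train)

print(f"leaky mean = {leaky_mu}, correct mean = {mu}")
# The leaky mean already "knows" about the outlier sitting in the test split.
```

In practice this is avoided by fitting all preprocessing inside the cross-validation loop, e.g. with a pipeline abstraction.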
Detection
Data leakage in machine learning can be detected through various methods, focusing on performance analysis, feature examination, data auditing, and model behavior analysis. Performance-wise, unusually high accuracy or significant discrepancies between training and test results often indicate leakage. Inconsistent cross-validation outcomes may also signal issues.
Feature examination involves scrutinizing feature importance rankings and ensuring temporal integrity in time series data. A thorough audit of the data pipeline is crucial, reviewing pre-processing steps, feature engineering, and data splitting processes. Detecting duplicate entries across dataset splits is also important.
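The duplicate-entry audit mentioned above can be sketched with set intersection over hashable rows; the rows and split here are hypothetical.

```python
# Sketch: detect rows that appear in both the training and test splits,
# a common source of row-wise leakage.
train_rows = [("alice", 34), ("bob", 29), ("carol", 41)]
test_rows  = [("dave", 52), ("bob", 29)]        # "bob" leaked into both splits

overlap = set(train_rows) & set(test_rows)
print(f"{len(overlap)} duplicated row(s):", sorted(overlap))
```

For large tabular datasets the same check is usually done by hashing rows or with a dataframe library's duplicate detection, but the principle is identical.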
For language models, the Min-K% method can detect whether a given text was present in the pretraining dataset. Given a sentence suspected of being in the pretraining data, it computes the log-likelihood of each token under the model, then averages the lowest K% of these values. If this average exceeds a threshold, the sentence was likely present. The method can be improved by normalizing each token's score against a baseline mean and variance.
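The Min-K% score itself is a small computation; a sketch follows. The per-token log-likelihoods and the threshold below are illustrative, not taken from a real model.

```python
# Sketch of the Min-K% score: average the lowest k-fraction of per-token
# log-likelihoods; a high average suggests the sentence was memorized.
def min_k_score(token_logprobs, k=0.2):
    """Mean of the lowest k-fraction of per-token log-likelihoods."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# A memorized sentence tends to contain no very unlikely tokens...
seen   = [-0.1, -0.3, -0.2, -0.5, -0.4, -0.2, -0.3, -0.1, -0.2, -0.4]
# ...while an unseen one usually contains some surprising tokens.
unseen = [-0.1, -2.9, -0.2, -4.1, -0.4, -3.5, -0.3, -0.1, -5.2, -0.4]

threshold = -1.0   # illustrative decision threshold
for name, lp in [("seen", seen), ("unseen", unseen)]:
    s = min_k_score(lp)
    verdict = "likely in data" if s > threshold else "likely not"
    print(f"{name}: Min-K% score = {s:.2f} -> {verdict}")
```

In a real application the log-likelihoods come from the target model's forward pass, and the threshold is calibrated on texts known to be in or out of the corpus.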
Analyzing model behavior can reveal leakage. Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation. Performance degradation over time when tested on new data may suggest earlier inflated metrics due to leakage.
Advanced techniques include backward feature elimination, where suspicious features are temporarily removed to observe performance changes. Using a separate hold-out dataset for final validation before deployment is advisable.