Concept drift
In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.
Predictive model decay
In machine learning and predictive analytics this drift phenomenon is called concept drift. In machine learning, a common element of a data model is the statistical properties of the actual data, such as its probability distribution. If these deviate from the statistical properties of the training data set, then the learned predictions may become invalid if the drift is not addressed.
Data configuration decay
Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment (infrastructure drift) may invalidate the software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of the data while its structure does not change. In many cases this happens in complicated applications, when many independent developers introduce changes without proper awareness of the effects of their changes on other areas of the software system.
For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., due to changes in the business model, system updates, or switching of the platform on which the system operates.
In the case of cloud computing, infrastructure drift that affects the applications running in the cloud may be caused by updates of the cloud software.
Data drift can have several kinds of detrimental effect on data fidelity. Data corrosion is drifted data passing into the system undetected. Data loss happens when valid data are ignored because they do not conform to the applied schema. Squandering is the phenomenon in which new data fields are introduced upstream in the data processing pipeline, but somewhere downstream these fields are absent.
Inconsistent data
"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input."Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple drift detection is to run checksum regularly. However the remedy may be not so easy.
Examples
The behavior of customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted and a predictive model has been developed that initially works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, meaning that shopping behavior changes with the seasons. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that make up the data set begin to explain the variation in the target variable less accurately; confounding variables may have emerged that simply cannot be accounted for, which causes the model's accuracy to decrease progressively over time. Generally, it is advised to perform health checks as part of post-production analysis and to retrain the model with new assumptions upon signs of concept drift.
Possible remedies
To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test or control charts from statistical process control, that explicitly detects concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers in which one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. By providing information about the time of year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors may arise at any time that influence shopping behavior, and the influence of the known factors or their interactions may change.
Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
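As an illustration of the reactive approach described above, the sketch below monitors the rolling error rate of a deployed classifier on a labelled data stream and refits it on a buffer of recent examples whenever the error rate crosses a threshold. The window size, the threshold, and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not a prescribed recipe.

```python
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression

def reactive_learner(stream, window=500, error_threshold=0.3):
    """Reactive adaptation sketch: monitor the rolling error rate of the
    deployed model and retrain on the most recent examples when it degrades."""
    buffer = deque(maxlen=window)   # most recent labelled examples
    errors = deque(maxlen=window)   # 1 = misclassified, 0 = correct
    model = None
    for x, y in stream:             # stream yields (feature vector, label) pairs
        if model is not None:
            errors.append(int(model.predict([x])[0] != y))
        buffer.append((x, y))
        enough_data = len(buffer) == window
        degraded = len(errors) == window and np.mean(errors) > error_threshold
        if enough_data and (model is None or degraded):
            X, Y = map(np.array, zip(*buffer))
            model = LogisticRegression(max_iter=1000).fit(X, Y)  # replace the model
            errors.clear()          # restart monitoring for the new model
        yield model                 # currently deployed model (None until warmed up)
```

In practice the threshold would be set relative to the model's own baseline error, which is what dedicated detectors such as DDM, listed below, do automatically.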
Remedy methods
- DDM (drift detection method): detects drift by monitoring the model's error rate over time. When the error rate passes a set threshold, the detector enters a warning phase, and if it passes another, higher threshold, it enters a drift phase (a minimal sketch follows this list).
- EDDM (early drift detection method): improves on DDM's detection rate by tracking the average distance between two consecutive errors instead of only the error rate.
- ADWIN (adaptive windowing): dynamically maintains a window of recent data and signals drift when the statistics of the older part of the window differ significantly from those of the more recent part.
- KSWIN (Kolmogorov-Smirnov windowing): detects drift based on the Kolmogorov-Smirnov statistical test.
- Online supervised methods rely on sequential error monitoring to estimate the evolving error rate.
- Window-based methods maintain a "window", a subset of the most recent data in the stream, and check for statistical differences across the window.
Applications in security
Several studies of Windows malware model detection as an evolving data stream and track how performance changes as time passes. They show that classifiers trained on a fixed time window can perform well on nearby data but deteriorate quickly when evaluated on samples collected months or years later, even when large amounts of training data are available. To keep up with this drift, security systems often use sliding or adaptive windows, which restrict training to the most recent portion of the data so that older, less relevant examples are gradually discarded (a sketch is given below). They also employ drift detectors such as ADWIN and KSWIN that monitor error rates or changes in the distribution of recent observations and signal when the statistics of the incoming stream differ significantly from the past, prompting retraining or model replacement.
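A sliding-window retraining loop of the kind described above might look like the sketch below. The array names, the 180-day window, the 30-day step, and the choice of a random-forest detector are hypothetical; the point is that each deployed model sees only the most recent slice of the labelled stream.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sliding_window_retrain(samples, labels, timestamps, window_days=180, step_days=30):
    """Refit the detector every `step_days` on the samples observed during the
    last `window_days`, discarding older examples.

    `samples`, `labels`, and `timestamps` are parallel NumPy arrays sorted by
    time; timestamps are expressed in days since the start of collection.
    """
    models = []
    start, end = timestamps.min(), timestamps.max()
    t = start + window_days
    while t <= end:
        in_window = (timestamps > t - window_days) & (timestamps <= t)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(samples[in_window], labels[in_window])
        models.append((t, clf))        # the model deployed from time t onwards
        t += step_days
    return models
```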
Related problems appear in spam filtering, fraud detection and intrusion detection, where adversaries change content, patterns of activity or network behavior to evade models trained on historical data. In these settings drift can be gradual, as new types of spam or fraud emerge, or abrupt, after a sudden shift in attack techniques. Common strategies to remain effective include updating models with recent labelled examples, using ensembles that give more weight to classifiers trained on newer data, and designing features that are less sensitive to superficial changes in how attacks are carried out.
Research on machine learning for security has also shown that failing to handle concept drift correctly during evaluation can introduce substantial bias. Researchers who studied 30 learning-based security systems found that many detectors were tested only over short time periods or in laboratory settings. This ignores temporal correlations, non-stationarity, and the way attacks change in the real world, which can make these systems appear much more effective than they turn out to be once deployed. The authors point out ways in which evaluations can inadvertently snoop on the future, such as using features from newer malware samples when training models that are supposed to work on older data, or computing normalization statistics or embeddings on the whole dataset. They also warn against relying on old benchmark datasets that do not reflect current threats. The researchers suggest time-aware evaluation protocols that preserve causal ordering, prevent future information from leaking into training, and account for both the temporal and spatial characteristics of security data, so that reported performance better reflects the effects of concept drift in real-world deployments.
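A time-aware evaluation of the kind recommended can be sketched as follows. The split time, the scaling step, and the F1 metric are illustrative; the essential property is that every statistic used for training, including preprocessing, is computed from past data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def time_aware_evaluation(X, y, timestamps, split_time):
    """Train strictly on samples observed before `split_time`, test on later
    ones, and fit preprocessing (here a scaler) on the training portion only,
    so that no future information leaks backwards into the model."""
    train = timestamps < split_time
    test = ~train
    scaler = StandardScaler().fit(X[train])           # statistics from the past only
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X[train]), y[train])
    preds = clf.predict(scaler.transform(X[test]))
    return f1_score(y[test], preds)                   # performance on future data
```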