Concept drift
In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.
Predictive model decay
In machine learning and predictive analytics this drift phenomenon is called concept drift. In machine learning, a common element of a data model is the statistical properties of the actual data, such as its probability distribution. If these deviate from the statistical properties of the training data set, then the learned predictions may become invalid if the drift is not addressed.
Data configuration decay
Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment (infrastructure drift) may invalidate the software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of the data while its structure does not change. In many cases this happens in complicated applications, when many independent developers introduce changes without proper awareness of the effects of their changes on other areas of the software system.
For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., due to changes in the business model, system updates, or switching of the platform on which the system operates.
In the case of cloud computing, infrastructure drift that affects the applications running in the cloud may be caused by updates of the cloud software.
Data drift can have several kinds of detrimental effect on data fidelity. Data corrosion is drifted data passing into the system undetected. Data loss happens when valid data are ignored because they do not conform to the applied schema. Squandering is the phenomenon in which new data fields are introduced upstream in the data processing pipeline, but somewhere downstream these fields are absent.
Inconsistent data
"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input."Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple drift detection is to run checksum regularly. However the remedy may be not so easy.
Examples
The behavior of customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted and a predictive model has been developed that initially works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, meaning that shopping behavior changes with the seasons. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that make up the data set begin to explain the variation in the target variable less accurately; confounding variables may have emerged that simply cannot be accounted for, which causes the model's accuracy to decrease progressively over time. Generally, it is advised to perform health checks as part of post-production analysis and to retrain the model with new assumptions upon signs of concept drift.
Possible remedies
To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test or control charts from statistical process control, that explicitly detects concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers in which one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. By providing information about the time of year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors may arise at any time that influence shopping behavior, and the influence of the known factors or their interactions may change.
Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
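As an illustration of the reactive approach described above, the sketch below monitors the rolling error rate of a deployed classifier on a labelled data stream and refits it on a buffer of recent examples whenever the error rate crosses a threshold. The window size, the threshold, and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not a prescribed recipe.

```python
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression

def reactive_learner(stream, window=500, error_threshold=0.3):
    """Reactive adaptation sketch: monitor the rolling error rate of the
    deployed model and retrain on the most recent examples when it degrades."""
    buffer = deque(maxlen=window)   # most recent labelled examples
    errors = deque(maxlen=window)   # 1 = misclassified, 0 = correct
    model = None
    for x, y in stream:             # stream yields (feature vector, label) pairs
        if model is not None:
            errors.append(int(model.predict([x])[0] != y))
        buffer.append((x, y))
        enough_data = len(buffer) == window
        degraded = len(errors) == window and np.mean(errors) > error_threshold
        if enough_data and (model is None or degraded):
            X, Y = map(np.array, zip(*buffer))
            model = LogisticRegression(max_iter=1000).fit(X, Y)  # replace the model
            errors.clear()          # restart monitoring for the new model
        yield model                 # currently deployed model (None until warmed up)
```

In practice the threshold would be set relative to the model's own baseline error, which is what dedicated detectors such as DDM, listed below, do automatically.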
Remedy methods
- DDM (drift detection method): detects drift by monitoring the model's error rate over time. When the error rate passes a set threshold, the detector enters a warning phase, and if it passes another, higher threshold, it enters a drift phase (a minimal sketch follows this list).
- EDDM (early drift detection method): improves on DDM's detection rate by tracking the average distance between two consecutive errors instead of only the error rate.
- ADWIN (adaptive windowing): dynamically maintains a window of recent data and signals drift when the statistics of the older part of the window differ significantly from those of the more recent part.
- KSWIN (Kolmogorov-Smirnov windowing): detects drift based on the Kolmogorov-Smirnov statistical test.
- Online supervised methods rely on sequential error monitoring to estimate the evolving error rate.
- Window-based methods maintain a "window", a subset of the most recent data in the stream, and check for statistical differences across the window.
Applications in security
Several studies of Windows malware model detection as an evolving data stream and track how performance changes as time passes. They show that classifiers trained on a fixed time window can perform well on nearby data but deteriorate quickly when evaluated on samples collected months or years later, even when large amounts of training data are available. To keep up with this drift, security systems often use sliding or adaptive windows, which restrict training to the most recent portion of the data so that older, less relevant examples are gradually discarded (a sketch is given below). They also employ drift detectors such as ADWIN and KSWIN that monitor error rates or changes in the distribution of recent observations and signal when the statistics of the incoming stream differ significantly from the past, prompting retraining or model replacement.
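A sliding-window retraining loop of the kind described above might look like the sketch below. The array names, the 180-day window, the 30-day step, and the choice of a random-forest detector are hypothetical; the point is that each deployed model sees only the most recent slice of the labelled stream.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sliding_window_retrain(samples, labels, timestamps, window_days=180, step_days=30):
    """Refit the detector every `step_days` on the samples observed during the
    last `window_days`, discarding older examples.

    `samples`, `labels`, and `timestamps` are parallel NumPy arrays sorted by
    time; timestamps are expressed in days since the start of collection.
    """
    models = []
    start, end = timestamps.min(), timestamps.max()
    t = start + window_days
    while t <= end:
        in_window = (timestamps > t - window_days) & (timestamps <= t)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(samples[in_window], labels[in_window])
        models.append((t, clf))        # the model deployed from time t onwards
        t += step_days
    return models
```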
Related problems appear in spam filtering, fraud detection and intrusion detection, where adversaries change content, patterns of activity or network behavior to evade models trained on historical data. In these settings drift can be gradual, as new types of spam or fraud emerge, or abrupt, after a sudden shift in attack techniques. Common strategies to remain effective include updating models with recent labelled examples, using ensembles that give more weight to classifiers trained on newer data, and designing features that are less sensitive to superficial changes in how attacks are carried out.
Research on machine learning for security has also shown that failing to handle concept drift correctly during evaluation can introduce substantial bias. Researchers who studied 30 learning-based security systems found that many detectors were tested only over short time periods or in laboratory settings. This ignores temporal correlations, non-stationarity, and the way attacks change in the real world, which can make these systems appear much more effective than they turn out to be once deployed. The authors point out ways in which evaluations can inadvertently snoop on the future, such as using features from newer malware samples when training models that are supposed to work on older data, or computing normalization statistics or embeddings on the whole dataset. They also warn against relying on old benchmark datasets that do not reflect current threats. The researchers suggest time-aware evaluation protocols that preserve causal ordering, prevent future information from leaking into training, and account for both the temporal and spatial characteristics of security data, so that reported performance better reflects the effects of concept drift in real-world deployments.
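A time-aware evaluation of the kind recommended can be sketched as follows. The split time, the scaling step, and the F1 metric are illustrative; the essential property is that every statistic used for training, including preprocessing, is computed from past data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def time_aware_evaluation(X, y, timestamps, split_time):
    """Train strictly on samples observed before `split_time`, test on later
    ones, and fit preprocessing (here a scaler) on the training portion only,
    so that no future information leaks backwards into the model."""
    train = timestamps < split_time
    test = ~train
    scaler = StandardScaler().fit(X[train])           # statistics from the past only
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X[train]), y[train])
    preds = clf.predict(scaler.transform(X[test]))
    return f1_score(y[test], preds)                   # performance on future data
```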