Statistical vs Analytical Data Drift Analysis

Using machine learning to analyse data drift

Posted by Mohd Shah on Sunday, January 1, 2023

What is data drift and why does it matter?

Data drift analysis occurs towards the end of a machine learning project life cycle and serves as a method to combat model decay in scenarios where the ground truth is delayed for a prolonged period of time. In other words, this technique is useful when the true outcomes of the predictions only become available after a longer period of time, such as weeks, months, or years.

Theoretically, a trained model performs best when the incoming data adheres to the same distribution as the training data. When that distribution shifts, we refer to it as a drift, hence the term data drift. The drift in distribution then leads to inaccurate predictions by the model.

Shift in data distribution

Data drift analysis can be applied to both regression and classification based models, as the analysis does not require the target variable in order to be conducted. However, the issue lies in determining if or when a shift has happened that can be considered significant.

Statistical Analysis of Data Drift

One of the most powerful tools in a statistician's arsenal for determining differences in means is the t-test (when the population variance is unknown) or the Z-test (when the population variance is known). However, there are a couple of limitations to these tests that make them inapplicable in certain scenarios:

  1. the data has to be continuous
  2. the sample distribution has to be normal
  3. the significance level is arbitrarily determined

One of the biggest reasons why the t-test and Z-test are not popularly used in most data drift solutions such as Evidently is largely attributed to point number 2. Data distributions come in many shapes and forms, and expecting them to adhere to normality is not practical.

Thus, other non-parametric tests such as the Kolmogorov-Smirnov (KS) test can be used to determine the p-value of the incoming data against the training data.
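As a minimal sketch of how this looks in practice, SciPy provides the two-sample KS test as scipy.stats.ks_2samp. The arrays and the 0.05 threshold below are purely illustrative, standing in for a single numerical feature taken from the training and incoming data.

    # Minimal sketch of a two-sample KS drift test on one numerical feature.
    # The data here is synthetic and stands in for real training/incoming values.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(42)
    reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # training feature
    current = rng.normal(loc=0.3, scale=1.0, size=1_000)    # incoming feature, slightly shifted

    statistic, p_value = ks_2samp(reference, current)

    alpha = 0.05  # significance level, chosen by the user
    if p_value < alpha:
        print(f"Drift detected (KS statistic={statistic:.3f}, p-value={p_value:.4f})")
    else:
        print(f"No significant drift (KS statistic={statistic:.3f}, p-value={p_value:.4f})")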

How significant is significant?

The issue with using p-value tests to determine drift is that the significance level is usually arbitrarily defined by the user and therefore does not actually tell us how much the data has drifted.

Another problem with using p-values is that the sample size becomes another contributing factor to consider. Given two pairs of training and scoring data with a similar difference in means, the pair with the larger sample size would result in a smaller p-value.

Hence, even the KS test becomes unreliable when dealing with sample sizes larger than 1,000. Having the significance level be an empirical value determined by the end user, and requiring it to be adjusted based on the sample size, makes statistical tests a rather abstract method for data drift monitoring.
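To make the sample-size effect concrete, the sketch below applies the same small mean shift at increasing sample sizes; the numbers are synthetic and only meant to illustrate how the p-value collapses as the sample grows.

    # Illustration (with synthetic data) of how a fixed, modest shift
    # produces ever smaller p-values as the sample size grows.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    shift = 0.05  # the same small shift in the mean for every sample size

    for n in (100, 1_000, 10_000, 100_000):
        reference = rng.normal(0.0, 1.0, size=n)
        current = rng.normal(shift, 1.0, size=n)
        _, p_value = ks_2samp(reference, current)
        print(f"n={n:>7}: p-value={p_value:.4f}")
    # The p-value shrinks with n, so a fixed significance level will flag
    # "drift" on large samples even for a negligible shift.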

How can we be less abstract?

Another method that is gaining some traction in Kaggle competitions for detecting data drift is the Adversarial Validation technique. This method involves using a classification algorithm to separate the training dataset from the incoming dataset.

The training and incoming data has the same distribution

If the two sets come from the same distribution, the classifier should not be able to tell them apart and its ROC-AUC will hover around 0.5. If the model's ROC-AUC is meaningfully above 0.5, we can say with good confidence that there is a difference between the training dataset and the incoming dataset; substantial enough that the machine learning model was able to tell the two sets apart.

Using tree based classifiers such as XGBoost also allows us to determine which features contribute most to the performance of the model using feature importance. The feature importance scores would allow us to see which features have drifted the most and address them accordingly.

This takes a lot of the guesswork out compared to using statistical tests.

Steps for conducting adversarial validation

In order to conduct adversarial validation, the following steps need to be carried out (a minimal code sketch follows the list):

  1. The training dataset and the scoring dataset are labelled 0 and 1 respectively (the order does not matter as long as we are consistent).
  2. The two datasets are then combined and shuffled.
  3. The combined dataset is then split into a training and a testing set.
  4. Train an XGBoost classifier on the training set.
  5. Evaluate the ROC-AUC performance of the model on the test set.
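The sketch below walks through these steps. The two DataFrames, their feature names, and the XGBoost hyperparameters are all illustrative stand-ins, not taken from the original project.

    # Adversarial validation sketch: can a classifier tell the two datasets apart?
    # The DataFrames below are synthetic stand-ins with identical feature columns.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    rng = np.random.default_rng(42)
    train_df = pd.DataFrame({"feature_a": rng.normal(0.0, 1.0, 2_000),
                             "feature_b": rng.normal(5.0, 2.0, 2_000)})
    scoring_df = pd.DataFrame({"feature_a": rng.normal(0.4, 1.0, 2_000),  # drifted
                               "feature_b": rng.normal(5.0, 2.0, 2_000)})

    # 1. Label the datasets: 0 for training data, 1 for scoring data.
    train_df = train_df.assign(is_scoring=0)
    scoring_df = scoring_df.assign(is_scoring=1)

    # 2. Combine and shuffle.
    combined = pd.concat([train_df, scoring_df], ignore_index=True)
    combined = combined.sample(frac=1.0, random_state=42).reset_index(drop=True)

    X = combined.drop(columns="is_scoring")
    y = combined["is_scoring"]

    # 3. Split into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # 4. Train an XGBoost classifier.
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(X_train, y_train)

    # 5. Evaluate ROC-AUC on the held-out test set.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Adversarial validation ROC-AUC: {auc:.3f}")

    # Feature importances highlight the columns most responsible for the drift.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))

An ROC-AUC close to 0.5 suggests the two sets are practically indistinguishable, while a higher value both flags the drift and, through the importances, points to the features driving it.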

Size of the drift

While both statistical tests and adversarial validation are capable of telling us whether there is drift in our dataset, they do not, however, tell us the size of the drift.

Here we can use the Wasserstein distance (Earth Mover's Distance) in order to compare the horizontal distance between the two different distributions.

Wasserstein Distance / Earth Mover Distance / Kantorovich–Rubinstein metric

Wasserstein distance allows us to determine how much our data has shifted, completing the picture of what we are trying to solve.
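SciPy exposes this metric as scipy.stats.wasserstein_distance; the sketch below, again on synthetic stand-in data, reports the drift size in the feature's own units.

    # Measuring the size of the drift with the Wasserstein (Earth Mover's) distance.
    # The arrays are synthetic stand-ins for a single numerical feature.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(7)
    reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training feature
    current = rng.normal(loc=0.5, scale=1.0, size=5_000)    # incoming feature, shifted

    distance = wasserstein_distance(reference, current)
    print(f"Wasserstein distance: {distance:.3f}")
    # The result is expressed in the feature's units, so it says how far the
    # distribution has moved, not just whether it has moved.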

Conclusion

Ultimately, there is no one-size-fits-all technique for data drift monitoring. Multiple methods and techniques are required in order to paint a more holistic picture of our data. While statistical methods such as the t-test and KS test are more widely used, I would advise against relying on them, as their significance levels are empirically set by the user and they only provide a somewhat abstract understanding of the data.
