Replies: 3 comments
-
Hi, yes, Eurybia works on structured data. To understand your 3 cases, can you provide us the code used to generate your data?
Let's name the model features X, the target Y, and their joint distribution P(X, Y). P(X, Y) can be decomposed as P(X, Y) = P(Y|X)P(X), with P(Y|X) the conditional probability of the output given the model features, and P(X) the probability density of the model features.
Data drift: evolution of the production data over time compared to the training or test data from before deployment. In formulas: compare P(X_training) to P(X_production).
Concept drift: a change in P(Y|X_production) compared to P(Y|X_training).
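To make the two formulas concrete, here is a minimal synthetic sketch (variable names, distributions and the labelling rule are illustrative, not taken from this discussion): data drift changes P(X) while keeping P(Y|X) fixed, whereas concept drift keeps P(X) and changes P(Y|X).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training period: X ~ N(0, 1), and Y follows a fixed rule P(Y|X)
x_train = rng.normal(loc=0.0, scale=1.0, size=n)
y_train = (x_train + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Data drift: P(X) shifts in production, but the labelling rule P(Y|X) is unchanged
x_prod_data_drift = rng.normal(loc=1.5, scale=1.0, size=n)
y_prod_data_drift = (x_prod_data_drift + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Concept drift: P(X) is unchanged in production, but the labelling rule P(Y|X) flips
x_prod_concept_drift = rng.normal(loc=0.0, scale=1.0, size=n)
y_prod_concept_drift = (-x_prod_concept_drift + rng.normal(scale=0.5, size=n) > 0).astype(int)
```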
-
Hi Thomas, thank you for your update. I am using our internal data for testing Eurybia, hence I cannot share it, but this is what I am considering.
Thank you for your explanation of the drift concepts. Can you please let me know how Eurybia works in the backend with respect to these different drifts? When the AUC score is generated, it does not show anywhere what kind of drift it is; we can only observe whether there is drift or not based on the value. Is my understanding correct? Also, does Eurybia primarily work on "concept drift"? Thanks and Regards
-
Hi, if you generate data like this, does that correspond to your first two cases?
For data drift, you can read the section "How Eurybia detect data drift" of the README (https://github.com/MAIF/eurybia/blob/master/README.md). For the moment, Eurybia does not deal with concept drift; the AUC score of Eurybia is only for data drift. If you want to detect data drift on review texts (as Eurybia does not yet have specific features for text), you have to do some preprocessing to monitor the changes you want.
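The README section referenced above describes a "datadrift classifier": baseline and current rows are internally labelled 0 and 1, a classifier is trained to tell them apart, and its AUC is the drift score (around 0.5 means the two datasets are indistinguishable, close to 1 means strong drift). Below is a minimal sketch of that general idea using scikit-learn, not Eurybia's actual implementation, together with an example of the kind of preprocessing one might apply to raw review texts; the feature choices and function names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def datadrift_auc(df_baseline: pd.DataFrame, df_current: pd.DataFrame) -> float:
    """Adversarial-validation style drift score: AUC of a classifier that
    tries to separate baseline rows (label 0) from current rows (label 1)."""
    X = pd.concat([df_baseline, df_current], ignore_index=True)
    y = [0] * len(df_baseline) + [1] * len(df_current)  # internal labels only
    proba = cross_val_predict(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X, y, cv=5, method="predict_proba",
    )[:, 1]
    return roc_auc_score(y, proba)


def text_features(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Example preprocessing for raw review texts: derive simple structured
    features (lengths here are only one possible choice)."""
    out = pd.DataFrame()
    out["n_chars"] = df[text_col].str.len()
    out["n_words"] = df[text_col].str.split().str.len()
    return out


# Example usage: drift score between text-derived features of two dataframes
# auc = datadrift_auc(text_features(df_baseline), text_features(df_current))
```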
-
Hi Team,
I am working on drift detection for structured and unstructured data and recently got to know about Eurybia.
I observed that Eurybia works on structured data and not on unstructured data; is my understanding correct?
When we want to test this on a dataset and then later use it in production, let us consider an example:
Suppose a piece of text is present in the production data with label 1. The model is trained and identifies it as being under drift, so we then have to re-train our model on the production data, to be used for the next drift computation. Now if the same text occurs again, it is again labeled 1, but in fact there shouldn't be a drift.
Aren't we inducing bias by forcing the value onto the input and, as a result, introducing an error?
Can you explain how this requirement can be solved using a classification model?
I explained the above scenario based on the following observations:
Case 1:
Data: a dataframe with just 1 column containing review texts of variable length. df_baseline and df_current contain 80% and 20% of the total data, respectively.
Case 2:
Data: the same data as in Case 1, with an additional column where I labeled the data as 0 and 1 based on the 80:20 ratio, so that the first 80% of the data is training/df_baseline with label '0' and the remaining 20% is testing/df_current with label '1'.
Case 3:
Data: the same data as in Case 2, the dataframe having 2 columns (text, label). Here I used the first 80% of the data for both df_baseline and df_current, so both have the same text and the same label ('0'), as sketched below.
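Roughly, the three cases are constructed like this (using an illustrative dataframe in place of our internal data; the column name and sample texts are placeholders):

```python
import pandas as pd

# Placeholder for the internal review data (single text column)
df = pd.DataFrame({
    "review": ["great product", "terrible service", "okay overall",
               "would buy again", "not as described"] * 20
})
split = int(len(df) * 0.8)

# Case 1: one text column, 80/20 split between baseline and current
df_baseline_c1 = df.iloc[:split][["review"]]
df_current_c1 = df.iloc[split:][["review"]]

# Case 2: same split, with an added label: 0 for baseline rows, 1 for current rows
df_baseline_c2 = df_baseline_c1.assign(label=0)
df_current_c2 = df_current_c1.assign(label=1)

# Case 3: baseline and current are both the same first 80% of rows, both labelled 0
df_baseline_c3 = df_baseline_c1.assign(label=0)
df_current_c3 = df_baseline_c1.assign(label=0)
```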
When the data is the same as in Case 3, why is there a drift? How is the model actually working, and aren't we introducing an error by adding the label?
Thanks and Regards
Hrudhay