The DiffPrivLib version of Decision Tree is giving me the wrong accuracy

Thread starter: Alison Krauskopf (Guest)
IBM specifically designed their differential privacy library, DiffPrivLib, to work exactly like scikit-learn for ease of use. The tutorials on their GitHub site state that, with epsilon = infinity (and, presumably, the same random state), DiffPrivLib should produce a model identical to the non-private scikit-learn version. Ref: github.com/IBM/differential-privacy-library/blob/main/notebooks/…

I've confirmed that this works with Gaussian Naive Bayes, but when using an almost identical code framework with the Decision Tree Classifier, I'm getting radically different values. Has anyone else encountered this? Can you see what's wrong with my code?
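For reference, this is roughly the Naive Bayes check I ran (a minimal sketch; it assumes the same X_train/X_test/y_train/y_test that I load in the Decision Tree code below, and diffprivlib's warning about unspecified bounds is expected since I don't pass them explicitly):

Code:
# Sanity check: scikit-learn GaussianNB vs the DiffPrivLib version with epsilon = infinity
import numpy as np
import diffprivlib as dp
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Non-private model
nb = GaussianNB()
nb.fit(X_train, y_train)

# DiffPrivLib model with an infinite privacy budget (no noise should be added)
DPnb = dp.models.GaussianNB(epsilon=np.inf)
DPnb.fit(X_train, y_train)

# With epsilon = infinity the two models should make identical predictions
print(accuracy_score(y_test, nb.predict(X_test)))
print(accuracy_score(y_test, DPnb.predict(X_test)))
print(np.array_equal(nb.predict(X_test), DPnb.predict(X_test)))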

This is the code to run a non-differentially-private Decision Tree using scikit-learn; this works fine.

Code:
# Import necessary packages
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
import diffprivlib as dp

# Read the train and test data into dataframes
train = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)

# X holds the independent variables; y is the dependent variable (fraud label)
X_train = train.drop(['is_fraud'], axis=1)
y_train = train['is_fraud']
X_test = test.drop(['is_fraud'], axis=1)
y_test = test['is_fraud']

# Build a Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Model training
dt.fit(X_train, y_train)

# Predict Output
y_pred_dt = dt.predict(X_test)

# Output metrics
print(confusion_matrix(y_test, y_pred_dt))

Note: I am working with synthetic data, and I accidentally skewed the simulation by adding in aggregates that the decision tree branches on, which is giving me an abnormal amount of accuracy. But that's an issue for another day. For now, my confusion matrix looks like this (perfect):

Code:
[[553574      0]
 [     0   2145]]

Now for the problem area. With epsilon set to infinity and the same random state as above, I 'should' be getting the same values, but I'm not. Here is my code:

Code:
# Build a Decision Tree Classifier
DPdt = dp.models.DecisionTreeClassifier(random_state=42, epsilon=np.inf)

# Model training
DPdt.fit(X_train, y_train)

# Predict Output
y_pred_DPdt = DPdt.predict(X_test)

# Output metrics
print(confusion_matrix(y_test, y_pred_DPdt))

And this is giving me a dramatically different classification:

Code:
[[553574      0]
 [  1787    358]]

Can anyone see what I've done wrong?
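In case it helps with diagnosis, this is the quick comparison I run between the two sets of predictions (a minimal sketch, assuming y_pred_dt and y_pred_DPdt from the code above):

Code:
from sklearn.metrics import accuracy_score

# Count the test rows where the two trees disagree
print((y_pred_dt != y_pred_DPdt).sum())

# Overall accuracy of each model
print(accuracy_score(y_test, y_pred_dt))
print(accuracy_score(y_test, y_pred_DPdt))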