I am developing a Python script to test an algorithm. I have a dataset that I need to split into 80% for training and 20% for testing. However, I want to save the test set for further analysis, ensuring no overlap with previous test sets.
Although my code works well overall, I encountered one issue: the test dataset sometimes contains records that were already selected in previous test runs due to the random selection process.
To clarify with an example:
- On the first run, my dataset {0,1,2,3,4,5,6,7,8,9} is split into a training set {0,1,2,4,5,7,8,9} and a test set {3,6}.
- On the second run, the training set is {0,1,2,3,4,5,7,9} and the test set is {6,8}.
As you can see, record 6 was selected twice for testing, which I want to avoid.
How can I modify the code to ensure that the 20% test set is chosen randomly each time but excludes any records that were previously selected?
Here is the current code:
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("CustomersInfo.csv")
y = df['CustomerRank']
X = df.drop('CustomerRank', axis=1, errors="ignore")
#-------------------------------------------------------------------
# This is the part that needs to be fixed
for RandStat in [11, 22, 33, 44, 55]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RandStat)
    #-------------------------------------------------------------------
    clf = XGBClassifier(random_state=RandStat)
    clf.fit(X_train, y_train)
    fnStoreAnalyse(y_train)  # my own helper, defined elsewhere
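One possible direction, as a minimal sketch rather than a definitive fix: shuffle the row positions once, then cut the shuffled order into consecutive 20% slices, so each run gets a random test set that cannot share records with any earlier one. The function name `disjoint_test_folds` and its parameters are hypothetical and not part of the code above. With five runs at 20% each, this is essentially what scikit-learn's `KFold(n_splits=5, shuffle=True)` does, which may be the simpler choice in practice.

```python
import random

def disjoint_test_folds(n_rows, n_runs=5, test_frac=0.2, seed=11):
    """Yield (train_idx, test_idx) pairs whose test sets never overlap.

    Hypothetical helper for illustration: all names here are assumptions.
    """
    test_size = int(n_rows * test_frac)
    if test_size * n_runs > n_rows:
        raise ValueError("not enough rows for that many disjoint test sets")

    # Shuffle the row positions once; every run takes its own slice.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    for run in range(n_runs):
        test_idx = indices[run * test_size:(run + 1) * test_size]
        held_out = set(test_idx)
        # Everything not held out in this run is used for training.
        train_idx = [i for i in range(n_rows) if i not in held_out]
        yield train_idx, test_idx

# Example with the 10-record dataset from the question.
folds = list(disjoint_test_folds(10))
for train_idx, test_idx in folds:
    print("test:", sorted(test_idx), "train:", train_idx)
```

Inside the loop, the positions could then select rows with `X.iloc[train_idx]`, `X.iloc[test_idx]`, `y.iloc[train_idx]` and `y.iloc[test_idx]` in place of the `train_test_split` call. Note that at 20% per run, at most five disjoint test sets exist before the data is exhausted.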