How to split my dataset into Test and Train without repetition?

I am developing a Python script to test an algorithm. I have a dataset that I need to split into 80% for training and 20% for testing. However, I want to save the test set for further analysis, ensuring no overlap with previous test sets.

Although my code works well overall, I encountered one issue: the test dataset sometimes contains records that were already selected in previous test runs due to the random selection process.

To clarify with an example:

On the first run, my dataset {0,1,2,3,4,5,6,7,8,9} is split into a training set {0,1,2,4,5,7,8,9} and a test set {3,6}.
On the second run, the training set is {0,1,2,3,4,5,7,9} and the test set is {6,8}.

As you can see, the record {6} was selected twice for testing, which I want to avoid.

How can I modify the code to ensure that the 20% test set is chosen randomly each time but excludes any records that were previously selected?

Here is the current code:

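import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("CustomersInfo.csv")
y = df['CustomerRank']                                  # target column
X = df.drop('CustomerRank', axis=1, errors="ignore")    # feature columns

#-------------------------------------------------------------------
# This is the part that needs to be fixed: each seed draws an
# independent random 20%, so test sets from different runs can overlap.
for RandStat in [11, 22, 33, 44, 55]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RandStat)
#-------------------------------------------------------------------

    clf = XGBClassifier(random_state=RandStat)
    clf.fit(X_train, y_train)
    fnStoreAnalyse(y_train)   # my own helper, definition not shown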
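
One possible direction, sketched under assumptions rather than offered as a definitive fix: since five runs of 20% each add up to exactly 100% of the data, a shuffled 5-fold split from scikit-learn's KFold produces five random test sets that never share a record. The stand-in arrays below (record IDs 0 to 9 and a dummy label) are hypothetical and only replace CustomersInfo.csv so the snippet runs on its own; the model fitting step is left as a comment.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 10 records with IDs 0..9 and a dummy label.
# In the real script these would be X and y loaded from the CSV.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)

# 5 shuffled folds: each test fold holds 20% of the records,
# and every record lands in exactly one test fold.
kf = KFold(n_splits=5, shuffle=True, random_state=11)

test_sets = []
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit and evaluate the model here, e.g. clf.fit(X_train, y_train).
    test_sets.append(set(test_idx.tolist()))
    print(f"run {run}: test indices = {sorted(test_idx.tolist())}")

# Sanity check: the test sets are pairwise disjoint and cover the dataset.
all_test = set().union(*test_sets)
assert sum(len(s) for s in test_sets) == len(all_test) == len(X)
```

With a pandas DataFrame, the same index arrays can be applied positionally, for example X.iloc[train_idx] and y.iloc[test_idx]. Note that this scheme yields at most five non-overlapping 20% test sets; after that, every record has been used for testing once.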
