Handling categorical values in scikit-learn pipeline and consistent encoding across datasets with MLflow

I have created a function that builds a scikit-learn pipeline around a machine learning model, but when I print out the target column (before splitting the dataset), it still contains the categorical string values.

Besides that, how can I log the model to MLflow so that in the future it encodes the features/target consistently? For example, say Dataset A has three categories A, B and C, and the model encodes A, B, C as 1, 2, 3. If a new dataset (or the data at prediction time) only contains B and C, the model must still encode them as 2 and 3, not 1 and 2. For now, when I log the model to MLflow, it doesn't show any inputs in the artifact.

P/S: I'm using the Iris dataset.
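
To make the requirement concrete, here is a minimal sketch of the behaviour I'm after, using a LabelEncoder that is fitted once and only re-used with transform afterwards (scikit-learn assigns 0-based codes, unlike the 1/2/3 in the example above):

Code:
from sklearn.preprocessing import LabelEncoder

# Fit once on all categories seen at training time.
encoder = LabelEncoder()
encoder.fit(["A", "B", "C"])
print(encoder.transform(["A", "B", "C"]))  # [0 1 2]

# Later, a dataset that only contains B and C must map to the same codes.
print(encoder.transform(["B", "C"]))       # [1 2]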

Here is my function code:

Code:
# Imports needed by create_pipeline (EncodingMethod is defined elsewhere in my project).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


def create_pipeline(self, train_feature, train_label, encoding_method, model):
    feature_cols = train_feature.columns

    if encoding_method == EncodingMethod.ONE_HOT:
        # One-hot encode every feature column; pass any remaining columns through.
        categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore',
                                                                           sparse=False))])

        preprocessor = ColumnTransformer(transformers=[('category', categorical_transformer, feature_cols)],
                                         remainder='passthrough')

        # Encode the string target labels as integers.
        label_encoder = LabelEncoder()
        train_label = pd.Series(label_encoder.fit_transform(train_label))

    elif encoding_method == EncodingMethod.LABEL:
        categorical_transformer = Pipeline(steps=[('label', LabelEncoder())])

        preprocessor = ColumnTransformer(transformers=[('category', categorical_transformer, feature_cols)],
                                         remainder='passthrough')

        label_encoder = LabelEncoder()
        train_label = pd.Series(label_encoder.fit_transform(train_label))

    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])

    return pipeline
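
Would it be the right direction to also return the fitted label encoder and the encoded target from create_pipeline, so the caller sees the encoded values and can re-use the same mapping later? A sketch of how the tail of the function could change (not what the script below uses):

Code:
    # Sketch only: hand the fitted encoder and the encoded target back to the caller
    # so the same class-to-integer mapping can be applied to future data.
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])

    return pipeline, label_encoder, train_label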

This is my script:

Code:
df = self._dataset.as_dataframe()

train_feature = df[self._train_configs.feature_cols]
train_label = df[self._train_configs.target_col]

self._model = self.create_pipeline(train_feature, train_label, self._train_configs.encoding_method, self._model)
print("\n")
print("This is model")
print(self._model)

X_train, X_test, y_train, y_test = train_test_split(train_feature, train_label, random_state=0, train_size=0.8)
print("\n")
print("This is y_train")
print(y_train.head())

experiment = mlflow.get_experiment_by_name(self.project_id)
with mlflow.start_run(experiment_id=experiment.experiment_id):
    self._model.fit(X=X_train, y=y_train)
    self._model.score(X_test, y_test)

    mlflow.sklearn.log_model(self._model, "pipeline/model")
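
Regarding the artifact not showing any inputs: as far as I can tell, MLflow only displays model inputs when the model is logged with a signature and/or input example, so something like the following sketch may be what is missing (not yet part of the script above):

Code:
from mlflow.models import infer_signature

signature = infer_signature(X_train, self._model.predict(X_train))
mlflow.sklearn.log_model(self._model, "pipeline/model",
                         signature=signature,
                         input_example=X_train.head())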

Output of self._model:

Code:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',  
                                   transformers=[('category',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object'))])),
                ('classifier',
                 RandomForestClassifier(max_depth=3,
                                        max_features=<RandomForestFeatureSelectionStrategy.SQUARE_ROOT: 'sqrt'>,
                                        max_samples=0.1, n_estimators=20))])

Output of y_train:

Code:
137     Iris-virginica
84     Iris-versicolor
27         Iris-setosa
127     Iris-virginica
132     Iris-virginica
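
As the output shows, y_train still contains the raw string labels. A minimal, self-contained sketch of the Python behaviour that would explain this, assuming it is the rebinding of train_label inside create_pipeline that gets lost: rebinding a parameter only changes the local name, not the caller's variable.

Code:
import pandas as pd

def reassign(series):
    # Rebinding the parameter changes only the local name,
    # not the Series the caller passed in.
    series = pd.Series([0, 1, 2])

labels = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
reassign(labels)
print(labels)  # still the original strings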