Scikit-Learn Pipelines with MLflow

The Scikit-Learn pipeline is a powerful tool that, combined with MLflow, automates machine learning experiments and workflows. It chains data preprocessing and model-training steps together, while MLflow tracks how models evolve over time. Merging MLflow with Scikit-Learn pipelines yields a robust and effective method for developing and deploying machine learning models. Pipelines enhance efficiency, reproducibility, scalability, and traceability in machine learning workflows: they save time and effort, handle large datasets and sophisticated models, and make it easier to monitor and improve models over time.

How to Create a Machine Learning Model

  1. Import all prerequisite libraries.
  2. Load the data from a data source.
  3. Establish a pipeline that comprises the data preprocessing, feature engineering, and model-training operations.
  4. Fit the pipeline to the data.
  5. Assess the model’s performance.
  6. Use MLflow to log the model (a sketch of this step appears at the end of this tutorial).

Install Scikit-Learn

Using pip:

pip install scikit-learn

Verify the Installation

After installation, run the following command to confirm that Scikit-Learn is available:

python -c "import sklearn; print(sklearn.__version__)"


Implementation of Scikit-Learn Pipelines in E-Commerce

Let’s use a dataset modeled on a real-world e-commerce platform like Amazon or eBay and create a Scikit-Learn pipeline to forecast customer satisfaction based on the customer’s purchasing history and demographic data. We’ll use a simplified, hypothetical e-commerce dataset in this example with the features listed in the following:

  • Age: The customer’s age, expressed as a numerical value.
  • Gender: The customer’s gender (classified as “Male” or “Female”).
  • Salary: The customer’s annual salary (numeric).
  • Purchase History: The total amount the customer has spent on the e-commerce platform (numeric).
  • Satisfaction: The customer’s satisfaction (target variable: binary, 0 or 1).

Step 1: Generate a Synthetic CSV Dataset

Using Python’s “Pandas” package, create a sample CSV dataset for an e-commerce website with “Age”, “Gender”, “Salary”, “Purchase History”, and “Satisfaction” columns. Use NumPy to generate random values for a modest dataset of 200 rows. The code to create the synthetic CSV dataset is provided here:

Code Snippet:

import pandas as obj_panda
import numpy as obj_numpy

# Set a random seed for reproducibility
obj_numpy.random.seed(42)

# Number of rows in the dataset
ds_rows = 200

# Generate synthetic data for Age (numeric)
cust_age_data = obj_numpy.random.randint(18, 65, ds_rows)

# Generate synthetic data for Gender (categorical: 'Male' or 'Female')
cust_sex_data = obj_numpy.random.choice(['Male', 'Female'], ds_rows)

# Generate synthetic data for Salary (numeric)
cust_salary_data = obj_numpy.random.randint(20000, 100000, ds_rows)

# Generate synthetic data for Purchase History (numeric)
cust_purchase_history_data = obj_numpy.random.randint(100, 10000, ds_rows)

# Generate synthetic data for Satisfaction (target variable: binary - 0 or 1)
cust_satisfaction_data = obj_numpy.random.randint(0, 2, ds_rows)

# Create a DataFrame with the synthetic data
data_frame = obj_panda.DataFrame({
    'Age': cust_age_data,
    'Gender': cust_sex_data,
    'Salary': cust_salary_data,
    'Purchase History': cust_purchase_history_data,
    'Satisfaction': cust_satisfaction_data
})

# Write the dataset to a CSV file in the current directory
data_frame.to_csv('ecommerce_data.csv', index=False)

This generates a CSV file called “ecommerce_data.csv” with synthetic data in the current directory; the end user can modify the number of rows and the value ranges as needed.
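To confirm that the file was written as expected, you can optionally reload it and preview the first few rows. This is a quick check, assuming the script above ran in the same directory:

Code Snippet:

import pandas as obj_panda

# Reload the generated file and preview the first five rows
preview = obj_panda.read_csv('ecommerce_data.csv')
print(preview.head())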

Step 2: Load the Dataset

Assuming that the e-commerce dataset is in a CSV file named “ecommerce_data.csv”, import the data using Pandas and divide it into features (F) and the target variable (t). The Pandas library, imported here as panda_obj, is used for data analysis; it offers numerous functions for importing, modifying, and analyzing data.

The next line of code loads the dataset from the ecommerce_data.csv file into the customer_data variable using the library’s read_csv() function.

The final lines of code separate the features (the independent variables used to predict the target) from the target variable. Pandas removes the “Satisfaction” column from the customer_data DataFrame using the drop() method, separating the target variable from the features. The features are now in the F variable, and the target variable is in the t variable.

Code Snippet:

import pandas as panda_obj

# Load the dataset from ecommerce_data.csv into the customer_data variable
customer_data = panda_obj.read_csv("ecommerce_data.csv")
print("Customer Data Imported Successfully!")

# Split the data into features (F) and target variable (t)
F = customer_data.drop(columns=['Satisfaction'])
t = customer_data['Satisfaction']
print("Data successfully divided into features and targets")

Step 3: Divide the Data into Testing and Training Sets

The functions to divide the data into training and testing sets are available in the sklearn.model_selection module. The second line of code performs the split. The train_test_split() function takes four arguments here: the features, the target variable, the test size, and the random state. The test size is the fraction of the data reserved for testing, and the random state serves as a seed for the random number generator.

This divides the data randomly. Since the test size is set to 0.3 in this instance, 30% of the data is used for testing. The random state is set to 42, which means that although the split is random, the code always produces the same split. The resulting sets are stored in the F_training, F_testing, t_training, and t_testing variables, respectively.

Code Snippet:

from sklearn.model_selection import train_test_split

# Split the features and target into 70% training and 30% testing sets
F_training, F_testing, t_training, t_testing = train_test_split(F, t, test_size=0.3, random_state=42)
print("Splitting data into Training & Testing sets completed!")

Step 4: Create the Scikit-Learn Pipeline

The first lines of code import the Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, and SVC classes from the sklearn library. Next, the preprocessing steps for the numerical and categorical features are specified. The names of the numeric features are listed in the cust_numeric_features list.

The cust_numeric_transformer variable holds a StandardScaler object, which scales the numerical features. The names of the categorical features are listed in the cust_categorical_features list, and the cust_categorical_transformer variable holds a OneHotEncoder object, which encodes the categorical features. The column transformer is created next; it applies the preprocessing methods indicated above to the corresponding feature subsets. Finally, the pipeline itself is defined.

The pipeline is composed of the column transformer and the machine learning model; in this case, the model is a support vector classifier (SVC). With this code, a pipeline that preprocesses different feature subsets in different ways can be established. Machine learning frequently employs this technique, since it improves model performance by preprocessing the data according to each feature type.

Code Snippet:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer

# Numeric features are scaled; the categorical feature is one-hot encoded
cust_numeric_features = ['Age', 'Salary', 'Purchase History']
cust_numeric_transformer = StandardScaler()

cust_categorical_features = ['Gender']
cust_categorical_transformer = OneHotEncoder()

# Apply each transformer to its own subset of columns
data_for_processor = ColumnTransformer(
    transformers=[
        ('AGE->SALARY->PURCHASE_HISTORY', cust_numeric_transformer, cust_numeric_features),
        ('GENDER', cust_categorical_transformer, cust_categorical_features)
    ])

# Chain the preprocessing step with a support vector classifier
model_pipeline = Pipeline([
    ('preprocess', data_for_processor),
    ('model', SVC())])

print("Scikit-Learn pipeline created!")

Step 5: Fit and Evaluate the Pipeline

The accuracy_score() function, which measures how well machine learning models perform, is imported in the first line of code from the “sklearn.metrics” module. The second line of code fits the model_pipeline to the training data. The fit() method accepts two arguments: the features and the target variable. The model_pipeline uses the training data to learn the associations between the features and the target variable.

The next line of code produces predictions for the test set. The predict() function requires only one argument: the features. The relationships that the model_pipeline learned from the training data are used to produce predictions for the test data. The model is assessed in the final lines of code. The accuracy_score() function takes two inputs: the actual values and the predicted values. It returns the accuracy of the model, that is, the proportion of correct predictions.

Code Snippet:

from sklearn.metrics import accuracy_score

# Fit the pipeline to the training data
model_pipeline.fit(F_training, t_training)

# Generate predictions for the test features
t_prediction = model_pipeline.predict(F_testing)

# Evaluate the model
model_accuracy = accuracy_score(t_testing, t_prediction)
print("Accuracy:", model_accuracy)

Step 6: Execute the Code

Running the complete script with Python executes successfully. The accuracy of the model is 0.483, which is close to chance; this is expected here, since the synthetic Satisfaction labels were generated at random.

Conclusion

Scikit-Learn provides numerous machine learning methods, such as decision trees, logistic and linear regression, and random forests. It also offers preprocessing steps for preparing machine learning data, such as scaling and normalization. In addition, Scikit-Learn provides model evaluation and selection utilities for machine learning models, including metrics such as accuracy, precision, and recall.
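For instance, the precision and recall mentioned above could be computed for the pipeline built in this tutorial with the standard sklearn.metrics functions; this brief sketch reuses the t_testing and t_prediction variables from Step 5:

Code Snippet:

from sklearn.metrics import precision_score, recall_score

# Precision: fraction of customers predicted as satisfied who actually were
# Recall: fraction of actually satisfied customers the model identified
print("Precision:", precision_score(t_testing, t_prediction))
print("Recall:", recall_score(t_testing, t_prediction))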

About the author

Kalsoom Bibi

Hello, I am a freelance writer, and I usually write about Linux and other technology-related topics.