
Preprocessing in sklearn

Data preprocessing, a crucial phase in data mining, is the practice of transforming or dropping data before use in order to ensure or improve performance. It involves several procedures, such as exploratory data analysis, removing unnecessary information, and adding necessary information. In this article, we'll discuss data preprocessing and how it is done in sklearn.

What is Data Preprocessing?

Data preprocessing is a critical stage in machine learning that improves data quality so that valuable insights can be extracted from the data. It is the process of preparing raw data (cleaning and organizing it) so that it can be used to build and train machine learning models. Put simply, data preprocessing is a data mining technique that converts raw data into a readable and usable format.

Why Do We Need Data Preprocessing?

Real-world data is frequently inconsistent, erroneous (containing errors or outliers), and incomplete, and it often lacks particular attribute values or trends. This is where data preprocessing comes into play: it helps clean, format, and organize the raw data, making it ready for use by machine learning models.

Data preprocessing deals with the following:

  • Missing data: remove, correct, or impute missing values.
  • Feature engineering: derive new features from the raw data.
  • Data formatting: the data might not be available in the desired format; for instance, a plain text file that we need to convert into a dataframe.
  • Data normalization: not all data is in normalized form, so we scale it to a given range for efficiency (see the sketch after this list).
  • Decomposition: removing redundant data to enhance performance.
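
As a quick illustration of the normalization point above, here is a minimal sketch that scales toy data to the [0, 1] range with sklearn's MinMaxScaler; the array X below is made-up example data, not part of the original walkthrough.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data with columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# scale each column to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(X))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]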

Standards for the Scikit-learn API

Scikit-learn expects the data it processes to meet several requirements:

  • Numeric values (no categorical variables).
  • No values are missing.
  • Every column should contain a different predictor variable.
  • Each row should contain a feature observation.
  • There must be as many labels as there are rows (one label per observation).
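
As a quick sketch of what these requirements look like in practice, the following checks on the iris data (the same dataset loaded below) confirm numeric values, no missing entries, rows as observations, and one label per observation:

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

print(X.shape)            # (150, 4): 150 observations (rows), 4 predictors (columns)
print(y.shape)            # (150,): one label per observation
print(X.dtype)            # float64: numeric, no categorical variables
print(np.isnan(X).sum())  # 0: no missing values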

Implementing Preprocessing in sklearn

Importing the Libraries and Data

# importing the libraries and classes
from sklearn import datasets
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
 
# loading the iris dataset
data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)

Loading First 5 Rows of the Data

df.head()

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Getting Information About Types and Null Values

df.info()

Output

RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
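
As an optional complementary check, pandas can count the missing values per column directly (all zeros here, since the iris data is complete):

# count missing values per column
print(df.isnull().sum())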

Filling Missing Values of the Dataframe Using sklearn

imputer = SimpleImputer(strategy='mean')
df['sepal width (cm)'] = imputer.fit_transform(df[['sepal width (cm)']])

To perform this task on every column, we can iterate over the columns, or pass the whole dataframe at once, as shown below.
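
A minimal sketch of that whole-dataframe option: SimpleImputer imputes each column independently, so the entire frame can be passed in one call (df_imputed is a hypothetical name introduced here for illustration).

# impute every column at once; fit_transform returns a NumPy array,
# so we wrap it back into a dataframe with the original column names
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)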

Scaling the Data Using StandardScaler

scaler = StandardScaler()

# fitting learns the per-column mean and standard deviation
scaler.fit(df)

# transforming the data and showing the first 10 rows
scaler.transform(df)[:10]

Output

array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
       [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
       [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
       [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
       [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
       [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])
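
In practice, these steps are often chained. Here is a minimal sketch using sklearn's Pipeline, under the assumption that imputation should run before scaling; the step names 'impute' and 'scale' are arbitrary labels chosen for this example.

from sklearn.pipeline import Pipeline

# chain imputation and scaling so both run in a single fit/transform call
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
print(pipe.fit_transform(df)[:3])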

One Hot Encoding

# 'ignore' makes categories unseen during fit encode as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
X = [['A', 1], ['B', 3], ['B', 2]]
encoder.fit(X)
print(encoder.categories_)
encoder.transform(X).toarray()

Output

[array(['A', 'B'], dtype=object), array([1, 2, 3], dtype=object)]
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0.]])
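
The handle_unknown='ignore' argument matters at transform time: a category that never appeared during fit is encoded as all zeros instead of raising an error. A quick check with a made-up unseen category 'C':

# 'C' was not seen during fit, so its block encodes as all zeros
print(encoder.transform([['C', 1]]).toarray())
# [[0. 0. 1. 0. 0.]]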

Conclusion

In this article, we discussed preprocessing and its implementation in the sklearn Python library. Data preprocessing is a crucial step in machine learning: it raises the quality of the data and makes it easier to extract useful insights from it. We then walked through the implementation in sklearn: we first retrieved information about the data, including missing values and datatypes, then filled in the missing values, scaled the data, and applied one-hot encoding.

About the author

Simran Kaur

Simran works as a technical writer. She holds an MS in Computer Science from the well-known CS hub, aka Silicon Valley, and is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.