What is Data Preprocessing?
Data preprocessing is a critical stage in machine learning that improves data quality so that valuable insights can be extracted from the data. It is the process of getting raw data ready (cleaning and organizing it) for building and training machine learning models. Put simply, data preprocessing is a data mining technique that converts raw data into a readable, intelligible format.
Why Do We Need Data Preprocessing?
Real-world data is frequently inconsistent, erroneous (containing errors or outliers), and incomplete, often lacking particular attribute values or trends. This is where data preprocessing comes into play: it helps clean, format, and organize the raw data, making it ready for use by machine learning models.
Data preprocessing deals with the following:
- Missing data: remove, correct, or impute missing values.
- Feature engineering: derive new features from the raw data.
- Data formatting: the data might not be available in the desired format; for instance, a plain text file may need to be converted into a dataframe (see the sketch after this list).
- Data normalization: not all data is in normalized form, so we scale it to a given range for efficiency purposes.
- Decomposition: removing redundant data to enhance performance.
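As a small illustration of the formatting step, a comma-separated text file can be read into a dataframe with pandas. This is a minimal sketch; the file name data.csv and its contents are assumptions made for the example:

import pandas as pd

# reading a plain text (CSV) file into a dataframe
df = pd.read_csv('data.csv')  # 'data.csv' is a hypothetical file

# a quick look at the parsed structure
print(df.head())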
Standards for the Scikit-learn API
Scikit-learn estimators place several requirements on the data they process:
- Numeric (continuous) values, with no categorical variables.
- No missing values.
- Each column should contain a different predictor variable (feature).
- Each row should contain an observation.
- There should be as many labels as there are observations, one label per row.
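A minimal sketch of data in this shape, using small illustrative numpy arrays (the values are made up for the example):

import numpy as np

# X: one row per observation, one column per predictor variable, all numeric
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])

# y: one label per observation, so len(y) == len(X)
y = np.array([0, 0, 1])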
Implementing Preprocessing in sklearn
Importing the Libraries and Data
from sklearn import datasets
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# loading the iris dataset into a dataframe
data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
Loading First 5 Rows of the Data
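The output below can be reproduced with the dataframe's head method, which returns the first five rows by default:

# viewing the first five rows of the dataframe
df.head()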
Output
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Getting Information About Types and Null Values
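The summary of datatypes and null counts below comes from the dataframe's info method:

# checking column datatypes and non-null counts
df.info()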
Output
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
Filling Missing Values of the Dataframe Using sklearn

# defining the imputer; by default it replaces NaN values with the column mean
imputer = SimpleImputer(strategy='mean')
df['sepal width (cm)'] = imputer.fit_transform(df[['sepal width (cm)']])
We can iterate over all the columns to perform this task on the entire dataframe, as shown below.
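A minimal sketch of that loop, reusing the imputer defined above (ravel flattens the single-column result so it can be assigned back to the dataframe):

# iterating over the columns and imputing each one
for col in df.columns:
    df[col] = imputer.fit_transform(df[[col]]).ravel()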
Scaling the Data Using StandardScaler

# defining the scaler and fitting it to the data
scaler = StandardScaler()
scaler.fit(df)
# transforming the data and viewing the first ten rows
scaler.transform(df)[:10]
Output
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
       [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
       [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
       [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
       [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
       [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])
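After standard scaling, each feature has zero mean and unit variance across the dataset. Equivalently, scaler.fit_transform(df) performs the fit and transform steps in a single call.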
One-Hot Encoding

# defining the encoder and some sample categorical data
encoder = OneHotEncoder()
X = [['A', 1], ['B', 3], ['B', 2]]
encoder.fit(X)
# the categories learned for each column
print(encoder.categories_)
# encoding the data; toarray converts the sparse result to a dense array
encoder.transform(X).toarray()
Output
array([[1., 0., 1., 0., 0.],
[0., 1., 0., 0., 1.],
[0., 1., 0., 1., 0.]])
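The first two columns of the encoded array represent the letter categories ('A', 'B') and the last three represent the numeric categories (1, 2, 3), so each row contains exactly one 1 in each group.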
Conclusion
In this article, we discussed preprocessing and its implementation in the sklearn Python library. Data preprocessing is a crucial step in machine learning: it raises the quality of the data and facilitates the extraction of useful insights from it. We then walked through the implementation in sklearn: we first retrieved information about the data, including its missing values and datatypes, then filled in the missing values, scaled the data, and applied one-hot encoding.