sklearn Feature Selection

Ideally, you start a machine learning problem with a dataset, perform a thorough EDA on it, and understand the key characteristics of the predictors before training any models on them. However, this is not always possible: when a dataset contains too many variables, you need an automated approach to choose an appropriate subset of them. This article discusses the techniques the sklearn library provides for this.

What is Feature Selection?

Feature selection is the process of reducing the number of predictor variables used in the models you build. For instance, when presented with two models that achieve the same or nearly identical scores but use different numbers of variables, your first inclination should be to select the one with fewer variables: it is less likely to leak information, easier to run, and simpler to understand and train. Tuning the number of predictors comes naturally as part of the model-building process and is a routine component of data science in practice. When there are only a few features, or you have the time to reason about each of them, feature selection can remain a manual procedure; when there are too many variables, or too little time, automated or semi-automated feature selection can expedite the process.

Implementing Feature Selection in sklearn

We first import the data using pandas. The dataset can be found here.

import pandas as pd

df = pd.read_csv('Iris.csv')
df = df.drop('Id', axis=1)  # the Id column carries no predictive information
df.head()


   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa
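Before removing anything, it helps to look at the per-feature variances, since the next section selects by variance. A minimal sketch, using sklearn's bundled iris data so it runs on its own (its column names, e.g. 'petal length (cm)', differ from the CSV's):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild an iris DataFrame from sklearn's bundled copy of the dataset
iris = load_iris(as_frame=True)
df_iris = iris.data

# Per-column sample variances; petal length is by far the most spread out
print(df_iris.var())
```

Petal length has a variance above 3 while every other feature stays well below 2, which is why a threshold of 2 in the next section keeps only that column.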

Variance Threshold Feature Selection

Variance Threshold is a straightforward method for removing features based on the variance we expect each feature to have. Because it considers only the input features and ignores the dependent variable entirely, it is suitable for unsupervised feature elimination. Below is the implementation of the approach:

from sklearn.feature_selection import VarianceThreshold

df = df.select_dtypes('number')  # keep only the numeric columns
selector = VarianceThreshold(2)  # drop features with variance below 2
selector.fit(df)
df.columns[selector.get_support()]


Index(['PetalLengthCm'], dtype='object')
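One caveat: raw variances depend on the units of each feature, so scaling the data first can change which features survive. A sketch of the same idea on min-max scaled data, again using sklearn's bundled iris columns and an illustrative threshold of 0.04 chosen for data in [0, 1]:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

X = load_iris(as_frame=True).data
# Rescale every feature to [0, 1] before comparing variances
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

selector = VarianceThreshold(0.04)  # illustrative threshold for [0, 1] data
selector.fit(X_scaled)
# Sepal width has the smallest scaled variance and is the one dropped
print(X_scaled.columns[selector.get_support()])
```

On the scaled data three features pass the threshold instead of one, which shows that the threshold and the feature scales have to be considered together.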


Univariate Feature Selection

Univariate feature selection is based on univariate statistical tests. 'SelectKBest' combines a univariate statistical test with selection of the K best features according to the statistical relationship between the dependent and independent variables. Below is the implementation of 'SelectKBest'.

X = df.drop('Species' , axis = 1)
y = df['Species']

from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn import preprocessing

# Encode the string species labels as integers
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)

selector = SelectKBest(mutual_info_regression, k=2)
selector.fit(X, y)
X.columns[selector.get_support()]



Index(['PetalLengthCm', 'PetalWidthCm'], dtype='object')
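In practice, 'SelectKBest' is often placed inside a Pipeline so that the selector is fit on the training data only, avoiding leakage from the test set. A minimal sketch, using sklearn's bundled iris data and mutual_info_classif (a reasonable score function here, since the encoded target is categorical):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Feature selection happens inside the pipeline, fit on training data only
pipe = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=2)),
    ('clf', LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The classifier sees only the two selected features, yet still scores well on iris, since the petal measurements carry most of the class information.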


We discussed the feature selection methods in sklearn. Feature selection is important when working with many features: dropping redundant ones leads to a better understanding of the data and better model performance. Sklearn provides the 'feature_selection' module to implement these techniques.

About the author

Simran Kaur

Simran works as a technical writer. A graduate with an MS in Computer Science from the well-known CS hub, aka Silicon Valley, she is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.