What is Feature Selection?
Feature selection means reducing the number of predictor variables used in the models you build. For instance, when presented with two models with the same or nearly identical score, where one uses fewer variables, your first inclination should be to select the smaller model: it is less likely to be leaky, cheaper to run, and simpler to understand and train. Tuning the number of predictors in a model comes naturally as part of the model-building process in practical data science. Selecting features can be a manual procedure when there are only a few of them or you have the time to reason about each one; automated or semi-automated feature selection speeds things up when there are too many variables or too little time.
Implementing Feature Selection in sklearn
We first load the Iris dataset using pandas. The dataset can be found here.
import pandas as pd

df = pd.read_csv('Iris.csv')
df = df.drop('Id', axis=1)  # the Id column carries no predictive information
df.head()
Output
| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Variance Threshold Feature Selection
Variance Threshold is a straightforward method for removing features based on the variance we expect each feature to have. It looks only at the input features and ignores the dependent variable entirely, so it is suited to unsupervised feature elimination. Below is the implementation of the approach (we also keep the numeric feature matrix in X so it can be reused in the next section):
from sklearn.feature_selection import VarianceThreshold

X = df.select_dtypes('number')  # keep only the numeric predictor columns

selector = VarianceThreshold(threshold=2)  # drop any feature whose variance is below 2
selector.fit(X)
X.columns[selector.get_support()]
Output
Index(['PetalLengthCm'], dtype='object')
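To see why only one column survives, you can inspect the per-feature variances the selector computed during fitting. A minimal sketch, assuming the `selector` and `X` fitted above:

# Variance of each numeric column, as computed by the fitted VarianceThreshold
pd.Series(selector.variances_, index=X.columns)

Consistent with the output above, only PetalLengthCm has a variance above the threshold of 2, so it is the only column retained.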
SelectKBest
Univariate feature selection scores each feature with a univariate statistical test. SelectKBest builds on this idea: it computes a statistical measure of the relationship between each independent variable and the dependent variable and keeps the K highest-scoring features. Below is the implementation of SelectKBest.
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn import preprocessing

# Encode the categorical target as integers
y = df['Species']
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)

# Score each feature against the target and keep the two highest-scoring ones
# (mutual_info_classif is the variant designed for a categorical target like Species)
selector = SelectKBest(mutual_info_regression, k=2)
selector.fit(X, y)
X.columns[selector.get_support()]
Output
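Beyond the selected column names, you can look at the score SelectKBest assigned to each feature and apply the selection to the data itself. A minimal sketch, assuming the fitted `selector` and the `X` defined above:

# Mutual-information score per feature (higher means more informative about the target)
pd.Series(selector.scores_, index=X.columns)

# Reduce X to the k=2 selected columns for downstream modeling
X_selected = selector.transform(X)
X_selected.shape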
Conclusion
We discussed the feature selection methods available in sklearn. Feature selection matters when working with many features: dropping redundant ones gives a better understanding of the data and better model performance. Sklearn provides the 'feature_selection' module to implement these techniques.