
Cross Validation in Sklearn

Cross-validation is a model validation technique that assesses how well the results of a statistical analysis generalize to an independent data set. It is used primarily in settings where the main goal is prediction. We use cross-validation to test a model during training and to gauge its ability to generalize. This article discusses how it is implemented in the Python sklearn library.

What is Cross Validation?

Cross-validation is a statistical model evaluation technique that tests how well a model generalizes to unseen data. It estimates the accuracy a model will achieve in actual use, which makes it essential in contexts where the major goal is prediction. Cross-validation tests the model during training and measures its ability to generalize beyond the training data.
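To make the idea concrete, here is a minimal sketch (using sklearn's KFold splitter and a tiny illustrative array) of how a dataset is divided into folds, each fold serving once as the held-out test set:

```python
import numpy as np
from sklearn.model_selection import KFold

# A tiny illustrative dataset: 5 samples, 2 features each
X = np.arange(10).reshape(5, 2)

# Split into 5 folds; each sample is held out exactly once
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train indices={train_idx}, test indices={test_idx}")
```

Each iteration trains on four of the five samples and tests on the remaining one, so every sample contributes to both training and evaluation across the full run.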

How are Test and Train Data Different?

Data used to develop a model, such as the data used to estimate the coefficients of a multiple linear regression, are referred to as training data. Once the model is built, it is evaluated against the test data to determine how well it fits data it has not seen before.

Implementing Cross-Validation in Sklearn

Importing required libraries

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split, cross_val_score

 
Creating the dataset

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_train[:20], y_train[:20])

 
Output

[[6.  3.4 4.5 1.6]
 [4.8 3.1 1.6 0.2]
 [5.8 2.7 5.1 1.9]
 [5.6 2.7 4.2 1.3]
 [5.6 2.9 3.6 1.3]
 [5.5 2.5 4.  1.3]
 [6.1 3.  4.6 1.4]
 [7.2 3.2 6.  1.8]
 [5.3 3.7 1.5 0.2]
 [4.3 3.  1.1 0.1]
 [6.4 2.7 5.3 1.9]
 [5.7 3.  4.2 1.2]
 [5.4 3.4 1.7 0.2]
 [5.7 4.4 1.5 0.4]
 [6.9 3.1 4.9 1.5]
 [4.6 3.1 1.5 0.2]
 [5.9 3.  5.1 1.8]
 [5.1 2.5 3.  1.1]
 [4.6 3.4 1.4 0.3]
 [6.2 2.2 4.5 1.5]] [1 0 2 1 1 1 1 2 0 0 2 1 0 0 1 0 2 1 0 1]

 
Creating the model and finding cross-validation scores

clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

 
Output

[0.96666667 1.         0.96666667 0.96666667 1.        ]
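A common next step is to summarize the five fold scores by their mean and standard deviation, which gives a single accuracy estimate along with its variability. The snippet below is self-contained and repeats the model setup from above:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Same setup as above: linear SVM evaluated with 5-fold cross-validation
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)

# Report the mean score and its standard deviation across folds
print(f"{scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
# 0.98 accuracy with a standard deviation of 0.02
```

A small standard deviation across folds suggests the score is stable and not an artifact of one lucky split.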

 

Conclusion

Cross-validation is especially valuable when the dataset available for training and testing is small. To reduce the risk of overfitting, the dataset is typically partitioned into N random parts of equal size; the model is trained on N-1 parts and evaluated on the remaining part, and the overall score is the average of the metrics across the N train-test runs. In sklearn, this functionality lives in the model_selection module, which provides helpers such as cross_val_score.
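The N-fold procedure described above can be sketched manually, which is roughly what cross_val_score does internally (this version assumes KFold splitting with shuffling; the exact splitter cross_val_score chooses can differ for classification):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

# Partition the data into N=5 shuffled folds of equal size
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train a fresh model on N-1 parts...
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])
    # ...and evaluate it on the remaining part
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

# The overall measure is the average across the N train-test runs
print(f"Mean accuracy over {len(fold_scores)} folds: {np.mean(fold_scores):.2f}")
```

Writing the loop out by hand makes the mechanics explicit; in practice, cross_val_score is the more concise and less error-prone choice.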

About the author

Simran Kaur

Simran works as a technical writer. A graduate with an MS in Computer Science from the well-known CS hub, aka Silicon Valley, she is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are hobbies that interest her.