Train Test Split in Sklearn

The initial dataset used to train a machine learning algorithm is known as the training data. Machine learning algorithms use the training dataset to learn to make predictions or perform a task. The test dataset then assesses how well that training worked. Sklearn (scikit-learn) is a Python-based machine learning toolkit that allows us to split our data into train and test samples using the train_test_split() function, which this article discusses.

What is Train Test Split

The train-test split is a technique for evaluating machine learning models: it establishes how effectively a machine learning algorithm generalizes to unseen data, and it can be employed for both regression and classification problems. The training split typically receives the larger share of the data. In Python, the scikit-learn package has a module called model_selection from which you can import train_test_split. You can control the training and testing sample sizes through the function's train_size and test_size parameters. There is no such thing as a universally perfect split percentage; you should select a split that suits the goals of your project.
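As a minimal sketch, with made-up features and labels, importing the function and choosing a split size looks like this (an 80/20 split is a common starting point, not a rule):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels, for illustration only.
features = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

# test_size=0.2 holds out 20% of the samples for evaluation;
# random_state makes the shuffled split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 8 2
```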

The model must accurately fit the given data using known inputs and outputs. The trained model is then used to make predictions on the remaining data subset; this estimates how it would behave on future data whose output values are not known in advance. The train_test_split() function in the scikit-learn machine learning toolkit implements the train-test split evaluation technique: it accepts a dataset as input and returns the two subsets as output.

Test Set

The test set is a selection of instances drawn from the dataset to gauge the model’s effectiveness. This data is kept separate from the training and tuning data and therefore cannot be used during the training or tuning stages of the learning process; using it there would bias the model toward the test data and inflate the measured performance.

Training Set

The training and test sets must be kept separate. The training phase consumes the training set to identify parameter values that minimize a chosen cost function across the entire training set. Once trained on the training dataset, the model is evaluated on the test dataset. The test set is usually the smaller of the two, but it must remain large enough to yield a reliable performance estimate.

How To Train Data

A model is built using specific data, referred to as “training” data. In a simple linear model, the model formalizes relationships between variables by producing the mathematical equation of a line. The model type determines how it is built; a regression, for example, is fitted differently from other model types.

It’s important to distinguish training data from other data since you’ll often divide your initial dataset into two parts: training data for creating models and test data for evaluating them. Typically, you would do this by having your model predict values in test data (based on the variables in the model) and comparing them to the actual values.
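The predict-and-compare workflow can be sketched as follows. The data here is an artificial, noise-free line (an assumption for illustration), so the model’s predictions should match the actual test values almost exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data following y = 2x + 1, for illustration only.
X = np.arange(20).reshape(-1, 1)
y = 2 * X.ravel() + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # build the model on training data
predictions = model.predict(X_test)               # predict the held-out values

# Compare the predictions to the actual test labels.
print(np.allclose(predictions, y_test))  # True
```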

The Purpose of Splitting Our Data

Overfitting and underfitting are two significant problems we face when evaluating a model on our dataset.

Building a model on data that should not be known at prediction time is a related pitfall called look-ahead bias.

Overfitting occurs when a model adapts so closely to its historical (training) data that it fails to generalize to new data in the future. Underfitting is the opposite: the model fits the past data so loosely that it is rendered useless for prediction. Comparing a model’s performance on the training set against the test set is the standard way to detect both problems.
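The comparison can be sketched with plain NumPy. The data and polynomial degrees below are illustrative assumptions: a very flexible polynomial memorizes the noise in its training data, which shows up as a suspiciously low training error relative to its held-out error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical noisy linear data, for illustration only.
x = rng.uniform(0, 1, 30)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true relationship
flexible = np.polyfit(x_train, y_train, deg=9)  # flexible enough to memorize noise

# The flexible model always fits the training data at least as well...
print(mse(simple, x_train, y_train) >= mse(flexible, x_train, y_train))
# ...but that advantage typically shrinks or reverses on the test data,
# which is the signature of overfitting.
```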

Implementing train_test_split() in sklearn

# importing the necessary methods and libraries
import numpy as np
from sklearn.model_selection import train_test_split

# creating a sample dataset
X, y = np.arange(100).reshape((20, 5)), range(20)
print('Features are', X)
print('Target labels are', list(y))

Output

Features are
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]
 [25 26 27 28 29]
 [30 31 32 33 34]
 [35 36 37 38 39]
 [40 41 42 43 44]
 [45 46 47 48 49]
 [50 51 52 53 54]
 [55 56 57 58 59]
 [60 61 62 63 64]
 [65 66 67 68 69]
 [70 71 72 73 74]
 [75 76 77 78 79]
 [80 81 82 83 84]
 [85 86 87 88 89]
 [90 91 92 93 94]
 [95 96 97 98 99]]
Target labels are
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Splitting the Data

# splitting the data: 33% is held out for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training features are', X_train)
print('Training labels are', y_train)
print('Test features are', X_test)
print('Test labels are', y_test)

Output

Training features are
[[15 16 17 18 19]
 [90 91 92 93 94]
 [80 81 82 83 84]
 [65 66 67 68 69]
 [10 11 12 13 14]
 [45 46 47 48 49]
 [95 96 97 98 99]
 [20 21 22 23 24]
 [60 61 62 63 64]
 [35 36 37 38 39]
 [50 51 52 53 54]
 [70 71 72 73 74]
 [30 31 32 33 34]]
Training labels are
[3, 18, 16, 13, 2, 9, 19, 4, 12, 7, 10, 14, 6]
Test features are
[[ 0  1  2  3  4]
 [85 86 87 88 89]
 [75 76 77 78 79]
 [ 5  6  7  8  9]
 [40 41 42 43 44]
 [25 26 27 28 29]
 [55 56 57 58 59]]
Test labels are
[0, 17, 15, 1, 8, 5, 11]
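Beyond test_size and random_state, train_test_split() accepts other useful parameters. The following sketch, using made-up imbalanced labels, shows shuffle (whether rows are randomized before splitting) and stratify (which preserves class proportions in both subsets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape((20, 5))
# Hypothetical binary labels, imbalanced 3:1, for illustration only.
y = np.array([0] * 15 + [1] * 5)

# shuffle=False keeps the original row order: the last rows become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, shuffle=False)
print(list(y_te))  # [1, 1, 1, 1, 1] -- all minority-class rows leaked into the test set

# stratify=y preserves the 3:1 class ratio in both subsets.
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(sorted(ys_te))  # [0, 0, 0, 1]
```

With an ordered, imbalanced dataset, shuffle=False can leave one class entirely out of the training data, which is why stratified splitting is often preferred for classification.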

Conclusion

We discussed the train_test_split() method of sklearn, which is used to split our initial data into train and test samples. This is required to evaluate our model's performance and ultimately improve it. We also saw how train and test samples differ from each other. Finally, we implemented the train_test_split() method in sklearn.

About the author

Simran Kaur

Simran works as a technical writer. The graduate in MS Computer Science from the well known CS hub, aka Silicon Valley, is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.