
Hugging Face Train and Split Dataset

The Hugging Face datasets library provides a train_test_split method on its Dataset objects, modeled after the train_test_split function that is widely used in scikit-learn for splitting data into training and testing sets. In this article, we explain the parameters of the Hugging Face method and then show how scikit-learn's train_test_split can be used together with a Hugging Face dataset.

The train_test_split method in Hugging Face's datasets library is used to divide a dataset into two subsets: a training subset and a testing subset. This method is commonly employed in machine learning to evaluate the performance of a model on unseen data. The training subset is utilized to train the model, while the testing subset is used to assess its performance and generalization capabilities.

Here is an overview of the parameters of the train_test_split method in Hugging Face's datasets library; a short usage sketch follows the list:

  1. test_size (float or int, optional): This parameter determines the size of the test split.
    • If it is given as a float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
    • If it is given as an integer, it represents the absolute number of test samples.
    • If it is set to None, the value is set to the complement of the train size.
    • If train_size is also None, it defaults to 0.25 (25% of the dataset).
  2. train_size (float or int, optional): This parameter determines the size of the train split. It follows the same rules as test_size.
    • If it is given as a float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split.
    • If it is given as an integer, it represents the absolute number of train samples.
    • If it is set to None, the value is set to the complement of the test size.
  3. shuffle (bool, optional, defaults to True)
    • This parameter determines whether or not to shuffle the data before splitting.
    • If it is set to True, the data will be randomly shuffled before the split.
    • If it is set to False, the data will be split without shuffling.
  4. stratify_by_column (str, optional, defaults to None)
    • This parameter is used for the stratified splitting of data based on a specific column.
    • If it is specified, it should be the column name of the labels or classes.
    • The data will be split in a way that maintains the same distribution of labels or classes in the train and test splits.
  5. seed (int, optional)
    • This parameter allows you to set a seed to initialize the default BitGenerator.
    • If it is set to None, a fresh, unpredictable entropy will be pulled from the operating system.
    • If an integer or array-like integers are passed, they will be used to derive the initial BitGenerator state.
  6. generator (numpy.random.Generator, optional)
    • This parameter allows you to specify a NumPy random generator to compute the permutation of the dataset rows.
    • If it is set to None (the default), np.random.default_rng is used, which is based on NumPy's default BitGenerator (PCG64).
  7. keep_in_memory (bool, defaults to False)
    • This parameter determines whether to keep the split indices in memory instead of writing them to a cache file.
    • If it is set to True, the split indices will be kept in memory during the splitting process.
    • If it is set to False, the split indices will be written to a cache file for later use.
  8. load_from_cache_file (Optional[bool], defaults to True if caching is enabled)
    • This parameter determines whether to use a cache file to load the split indices instead of recomputing them.
    • If it is set to True and a cache file that stores the split indices can be identified, it will be used.
    • If it is set to False, the split indices will be recomputed even if a cache file exists.
    • The default value is True if caching is enabled.
  9. train_indices_cache_file_name (str, optional)
    • This parameter allows you to provide a specific path or name for the cache file that stores the train split indices.
    • If it is specified, the train split indices will be stored in this cache file instead of the automatically generated cache file name.
  10. test_indices_cache_file_name (str, optional)
    • This parameter allows you to provide a specific path or name for the cache file that stores the test split indices.
    • If it is specified, the test split indices will be stored in this cache file instead of the automatically generated cache file name.
  11. writer_batch_size (int, defaults to 1000)
    • This parameter determines the number of rows per write operation for the cache file writer.
    • It is a trade-off between the memory usage and processing speed.
    • Higher values reduce the number of write operations but consume more memory during processing.
    • Lower values consume less temporary memory but may slightly affect the processing speed.
  12. train_new_fingerprint (str, optional, defaults to None)
    • This parameter represents the new fingerprint of the train set after applying a transformation.
    • If it is specified, it provides a new fingerprint for the train set.
    • If it is set to None, the new fingerprint is computed using a hash of the previous fingerprint and the transformation arguments.
  13. test_new_fingerprint (str, optional, defaults to None)
    • This parameter represents the new fingerprint of the test set after applying a transformation.
    • If it is specified, it provides a new fingerprint for the test set.
    • If it is set to None, the new fingerprint is computed using a hash of the previous fingerprint and the transformation arguments.
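
For illustration, here is a minimal sketch of how a few of these parameters are typically combined when splitting a loaded Dataset object. It uses the IMDb dataset that also appears later in this article; adjust the dataset name and the stratification column to match your own data:

from datasets import load_dataset

# Load only the original "train" split of IMDb as a single Dataset object
dataset = load_dataset('imdb', split='train')

# Split into 80% train / 20% test, shuffling with a fixed seed and
# stratifying on the "label" column so that both subsets keep the
# same class distribution.
splits = dataset.train_test_split(
    test_size=0.2,
    shuffle=True,
    seed=42,
    stratify_by_column='label',
)

train_dataset = splits['train']  # 80% of the rows
test_dataset = splits['test']    # 20% of the rows

print(len(train_dataset), len(test_dataset))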

Syntax (scikit-learn's train_test_split):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  • X: This represents the input features or independent variables of your dataset.
  • y: This represents the output or dependent variable that you are trying to predict.
  • test_size: This parameter determines the proportion of the dataset that will be allocated for testing. It can be specified as a float (e.g., 0.2 for 20%) or an integer (e.g., 200 for 200 samples).
  • random_state: This is an optional parameter that allows you to set a seed for the random number generator. It ensures that the split is reproducible, which means that you will obtain the same split if you use the same random state value.

The train_test_split function returns four sets of data:

  • X_train: The training set of input features.
  • X_test: The testing set of input features.
  • y_train: The training set of output labels.
  • y_test: The testing set of output labels.
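
For instance, a minimal sketch with a small made-up dataset shows what these four return values look like and how a float test_size maps to a number of samples (the lists below are purely illustrative):

from sklearn.model_selection import train_test_split

# A tiny made-up dataset: 10 input rows and 10 labels
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# test_size=0.2 keeps 20% of the rows (2 samples) for testing;
# test_size=2 would produce a test set of the same size.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # prints: 8 2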

Example: The following example program is saved as “test.py”.

from sklearn.model_selection import train_test_split
from datasets import load_dataset

# Step 1: Load the dataset
dataset = load_dataset('imdb')
X = dataset['train']['text']
y = dataset['train']['label']

# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True, random_state=42)

# Step 3: Explore the dataset
print("Number of examples in the original dataset:", len(X))
print("Number of examples in the train dataset:", len(X_train))
print("Number of examples in the test dataset:", len(X_test))

# Step 4: Access and print example data
print("\nExample from the train dataset:")
print(X_train[0], y_train[0])
print("\nExample from the test dataset:")
print(X_test[0], y_test[0])

The train_test_split import in this program comes from scikit-learn, not from the Hugging Face datasets library. Please make sure that you have scikit-learn installed in your environment. You can install it using the following command:

pip install scikit-learn

Explanation: First, we import the necessary functions: train_test_split from scikit-learn and load_dataset from the Hugging Face datasets library.

  • We load the IMDb dataset using load_dataset('imdb') and assign it to the dataset variable.
  • To use the train_test_split, we need to separate the input features (X) and the corresponding labels (y). In this case, we assume that the dataset has a split named “train” with “text” as the input features and “label” as the corresponding labels. You may need to adjust the keys based on the structure of your dataset.
  • We then pass the input features (X) and labels (y) to the train_test_split along with other parameters. In this example, we set the test_size to 0.2 which means that 20% of the data will be allocated for testing. The shuffle parameter is set to “True” to randomly shuffle the data before splitting, and the random_state parameter is set to 42 for reproducibility.
  • The train_test_split function returns four sets of data: X_train, X_test, y_train, and y_test. These represent the training and testing subsets of the input features and labels, respectively.
  • We print the number of examples in the original dataset (len(X)), the training dataset (len(X_train)), and the test dataset (len(X_test)). This allows us to verify the splitting process and ensure that the subsets are created correctly.
  • Finally, we access and print an example from the training dataset (X_train[0], y_train[0]) and an example from the test dataset (X_test[0], y_test[0]).

Output: We run the previously saved program using the command “python test.py”.
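
As an alternative, the same split can be performed natively with the datasets library using the Dataset.train_test_split method whose parameters were described earlier. The following is a minimal sketch that mirrors the scikit-learn example above; it returns a DatasetDict with "train" and "test" splits instead of four separate lists:

from datasets import load_dataset

# Load only the original "train" split of IMDb as a Dataset object
dataset = load_dataset('imdb', split='train')

# Split into 80% train / 20% test with a fixed seed for reproducibility
splits = dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)

print("Number of examples in the original dataset:", len(dataset))
print("Number of examples in the train dataset:", len(splits['train']))
print("Number of examples in the test dataset:", len(splits['test']))

# Each example is a dictionary of column values
print("\nExample from the train dataset:")
print(splits['train'][0]['text'], splits['train'][0]['label'])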

Conclusion

The train-test split functionality that is provided by Hugging Face’s datasets library, in combination with scikit-learn’s train_test_split function, offers a convenient and efficient way to divide a dataset into separate training and testing subsets.

By utilizing the train_test_split function, you can control the size of the test set, decide whether to shuffle the data, and set a random seed for reproducibility. This flexibility allows for effective evaluation of machine learning models on unseen data and helps in detecting issues such as overfitting or underfitting.

The parameters of the train_test_split function allow you to control various aspects of the split such as the size of the test set (test_size), shuffling the data (shuffle), and performing a stratified splitting based on specific columns (stratify_by_column). Additionally, you can specify a seed value (seed) for reproducibility and customize the cache file names for storing the split indices (train_indices_cache_file_name and test_indices_cache_file_name).

The functionality that Hugging Face offers makes it easier to prepare your data for model training and evaluation. By having separate training and testing subsets, you can accurately assess your model’s performance on unseen data, detect potential issues like overfitting, and make informed decisions for model improvements.

Overall, the train-test split functionality in Hugging Face’s datasets library, in conjunction with scikit-learn’s train_test_split, provides a powerful toolset for efficient data splitting, model evaluation, and development of robust machine learning solutions.

About the author

Shekhar Pandey