LangChain

How to Split Characters Using LangChain?

Large Language Models (LLMs) are machine learning models trained on huge amounts of data to interact with humans. Training data for LLMs is usually massive and unstructured, written in human languages so the model can understand queries. Splitting data is a vital part of training, as the model must process the data in small chunks, handling each chunk individually.

This guide will explain the process of splitting characters using LangChain.

How to Split Characters Using LangChain?

To split characters using LangChain to build and train Large Language Models, go through the given instructions:

Step 1: Setup Modules

Firstly, install the LangChain framework to get started with the process of splitting characters:

pip install langchain

Step 2: Uploading Data

Upload the data on the Python IDE for applying character split using LangChain:

from google.colab import files

upload = files.upload()

Method 1: Split Text by Character

Open the uploaded file to read its contents before applying LangChain's splitting methods:

with open('text.txt') as f:
    Text = f.read()

Import the CharacterTextSplitter class to split the text into chunks of 100 characters; the separator is set to an empty string, so the splitter breaks the text character by character:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

Simply print the split documents; they are stored in a list, so individual chunks can be accessed by their index numbers:

texts = text_splitter.create_documents([Text])

print(texts[0])

print(texts[1])

print(texts[2])

The document, split into chunks of 100 characters, is displayed using the chunks' index numbers. The user can change the split configuration to make it more useful for their model.
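To build intuition for what chunk_size and chunk_overlap mean, here is a minimal pure-Python sketch of fixed-size character chunking with overlap (an illustration only, not LangChain's actual implementation):

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    # Each chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so consecutive chunks share
    # chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 250 characters of repeating letters as sample data
sample = "".join(chr(97 + i % 26) for i in range(250))
chunks = chunk_text(sample)
print(len(chunks))       # 4 chunks: starting at positions 0, 80, 160, 240
print(len(chunks[0]))    # 100
```

The last 20 characters of one chunk equal the first 20 characters of the next, which is exactly what the overlap setting guarantees: context at a chunk boundary is not lost.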

Method 2: Recursively Split Characters Using LangChain

Another method to split the text/file based on its characters is the RecursiveCharacterTextSplitter class, which can also be imported from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

Configure the RecursiveCharacterTextSplitter() method by setting parameters such as chunk_size, chunk_overlap, and length_function:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

Print the chunks of the text using their index numbers; the following code prints the chunks stored at indexes 0 and 1:

texts = text_splitter.create_documents([Text])

print(texts[0])

print(texts[1])

The user can also print the first two chunks by slicing the list with the index limit in square brackets:

text_splitter.split_text(Text)[:2]

That is all about the process of splitting characters using LangChain to build LLMs.

Conclusion

To split characters using LangChain, start by installing the LangChain framework and then upload the text to apply the character split. After that, use the CharacterTextSplitter() and RecursiveCharacterTextSplitter() classes, which can be imported from the LangChain framework. Configure these splitters to break the file into smaller chunks and print the chunks using their index numbers. This blog has illustrated the process of splitting text based on characters using the LangChain framework.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.