How to Split Text in LangChain?

Data arrives in huge volumes, and it is hard to manage or understand when it is unstructured. Structured data is easier to work with, but most text is available only in unstructured form. LangChain is a framework for building applications powered by language models, which makes it easier to work with natural language text.

This guide will explain the process of splitting text in LangChain.

How to Split Text in LangChain?

Splitting divides a large text into smaller chunks, so that each piece is easier to process and the user can extract meaningful information from it. The following examples show how to split text with the LangChain framework:

Example 1: Text Splitters

Before splitting text in LangChain, it helps to understand how the text splitters work. Start by installing LangChain with the following command:

pip install langchain

The output confirms the successful installation of the LangChain module.

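In newer LangChain releases, the same classes are also published in the separate “langchain-text-splitters” package. Use the following code to import the “RecursiveCharacterTextSplitter” class and the “Language” enumeration from the “text_splitter” module of the “langchain” library: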

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

After that, use the following code to list the programming languages whose code the splitter supports:

[e.value for e in Language]

The output lists all the supported languages.

Use the following code to get the separators that are used to split text written in a given language:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

Separators are the strings or keywords in a document at which the text can be split.

The above code returns the separators for the Python language, and the output lists all of them.
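The idea behind such a separator list can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the “recursive_split” helper and the short separator list are hypothetical, and unlike the real splitter, the sketch drops the separators and does not merge small pieces back together. It tries the first separator and falls back to the next one only for pieces that are still longer than the chunk size:

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator, recursing on oversized pieces.

    Hypothetical helper for illustration only; it is not part of LangChain.
    """
    # A piece that already fits (or cannot be split further) is returned as is.
    if len(text) <= chunk_size or not separators:
        return [text] if text.strip() else []

    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        # Pieces that are still too long are split with the remaining separators.
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


# Assumed separator list: paragraph breaks first, then single spaces.
sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, ["\n\n", " "], chunk_size=12))
```

In this sample, the first paragraph fits within 12 characters and stays whole, while the second paragraph is too long and is broken up at the spaces.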

Example 2: Splitting Code Using LangChain

Start by splitting a piece of code written in the Python language using the following code:

PYTHON_CODE = """
def hello():
    print("Hello, LinuxHint")

# Call the function
hello()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

The above code defines a small Python snippet that contains a hello() function and then calls it to print the “Hello, LinuxHint” message. After that, the “python_splitter” object splits the snippet into smaller chunks of at most 50 characters with no overlap.

The output shows that the snippet has been split into two documents.

Example 3: Split Text by Character

To split a text document by character, first upload the document from the local computer to Google Colaboratory (Colab), a cloud-based notebook environment for the Python language:

from google.colab import files
upload = files.upload()

After executing the above code, click on the “Choose Files” button and upload the document from the local system.

After that, open the uploaded file in the notebook and read its text into a variable:

with open('text.txt') as f:
    Text = f.read()

Import the “CharacterTextSplitter” class from the “text_splitter” module of the “langchain” library and configure the splitter. The following code uses an empty string as the separator, so the text is split at the level of individual characters, and sets the chunk size to 100 characters with an overlap of 20 characters between consecutive chunks:

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

The following code splits the text into chunks and prints the first three of them by index:

texts = text_splitter.create_documents([Text])
print(texts[0])
print(texts[1])
print(texts[2])

The output shows three chunks because the code prints only the indexes 0, 1, and 2.
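The effect of the chunk size and the chunk overlap can be sketched in plain Python. The “chunk_with_overlap” helper below is hypothetical and only approximates the behavior of a character-level split; LangChain's own splitter also handles separators and strips whitespace, which this sketch does not attempt. Each chunk starts “chunk_size - chunk_overlap” characters after the previous one, so consecutive chunks share their boundary characters:

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size chunks whose edges overlap.

    Hypothetical helper for illustration only; it is not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


for chunk in chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, the last character of each chunk is repeated as the first character of the next one.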

Example 4: Metadata of the Split Text

The following code attaches metadata to the chunks of two documents and prints the chunk at index 3 together with its metadata:

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([Text, Text], metadatas=metadatas)
print(documents[3])

The output ends with the metadata of the chunk.

After that, print the chunk at index 3 without any metadata attached to it using this code:

text_splitter.split_text(Text)[3]

The output shows the chunk text only, without its metadata.
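The way each chunk inherits the metadata of its source text can be sketched in plain Python. The “create_chunks_with_metadata” helper and the dictionary layout below are assumptions made for illustration, not LangChain's actual Document class: every chunk cut from a text receives its own copy of the metadata that was passed for that text.

```python
def create_chunks_with_metadata(texts, metadatas, chunk_size):
    """Pair every chunk with a copy of the metadata of its source text.

    Hypothetical helper for illustration only; it is not part of LangChain.
    """
    records = []
    for text, metadata in zip(texts, metadatas):
        for start in range(0, len(text), chunk_size):
            records.append(
                {
                    "page_content": text[start:start + chunk_size],
                    # Copy the dict so chunks do not share one mutable object.
                    "metadata": dict(metadata),
                }
            )
    return records


records = create_chunks_with_metadata(
    ["aaaabbbb", "ccccdddd"],
    [{"document": 1}, {"document": 2}],
    chunk_size=4,
)
print(records[3])
```

Here the record at index 3 is the second chunk of the second text, so it carries the metadata of document 2.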

Note: to learn the process of tokenization in Python, simply refer to this guide.

That’s all about splitting text in LangChain, including splitting code with the separators of the Python language.

Conclusion

To split text in LangChain, install the module in the development environment; this guide uses Google Colaboratory. After that, check which languages the splitter supports and which separators it uses for each language. LangChain can split code written in languages such as Python as well as plain text, whether it is loaded from a file on the local system or defined in the code. This guide has explained both processes, splitting code and splitting text, in LangChain using the Python language.

About the author

Talha Mahmood