LangChain

How to Use MarkdownHeaderTextSplitter in LangChain?

Text splitting is an important aspect of LangChain to build Large Language Models or LLMs to interact efficiently with humans. One of the techniques to split the text into small chunks is using the MarkdownHeaderTextSplitter which is used to split text based on file headers. The user can simply get the file organized by headers and create chunks within specified headers to split the markdown file.

This guide will demonstrate the process of using MarkdownHeaderTextSplitter in LangChain.

How to Use MarkdownHeaderTextSplitter in LangChain?

To use the MarkdownHeaderTextSplitter in LangChain, simply follow this simple and thorough guide:

Install and Import Libraries

To get started with the guide, simply install the LangChain framework. This guide uses this framework to split text using the markdown header:

pip install langchain

 

Import the MarkdownHeaderTextSplitter library from the “langchain.text_splitter” which is a LangChain library:

from langchain.text_splitter import MarkdownHeaderTextSplitter

 

Example 1: Using Markdown Splitter

Configure the “markdown_document” variable with the content using the headers and then split the data based on these headers. Split the headers into three chunks with the number of hash signs as configured in the following code. After that, split the markdown document using the imported library and then print the document on the screen:

markdown_document = "# Foo\n\n    ## Bar\n\nHello Jim here\n\nHey Joe this side\n\n ### Boo \n\n Hello Lance here \n\n ## Baz\n\n Hey Molly this side"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

 

After executing the above code, the split document has displayed in three chunks:

Execute the following command to get the type of document located at the index 0 in the md_header_splits variable:

type(md_header_splits[0])

 

Example 2: Using Multiple Text Splitters

This example uses multiple splitters such as markdown_splitter and Character level splitter. They use the markdown header text splitter. Before using these splitters, simply insert data with headers and set the number of chunks to print their chunks:

markdown_document = "# Intro \n\n    ## History \n\n HTML is a scripting language \n\n HTML is mostly used for building web pages \n\n #### Standardization \n\n HTML use different tags to create a script and build web sites\n\n ## Implementations \n\n HTML is used with different languages like CSS to implement its script\n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks"


headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
splits = text_splitter.split_documents(md_header_splits)
splits

 

The following screenshot displays the number of documents split into small chunks:

That is all about using the MarkdownHeaderTextSplitter in LangChain.

Conclusion

To use the MarkdownHeaderTextSplitter library in LangChain, simply install LangChain modules to get the library which can be used to split text into small chunks. After that, configure the data with the specified header to use the markdown splitter and make smaller documents from it. The second example in this guide explains the use of multiple splitters in the MarkdownHeaderTextSplitter in LangChain.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.