How to Use File Directory Loaders in LangChain?

LangChain is a framework for building applications on top of Large Language Models (LLMs), which answer queries written in natural language. These applications often need to draw on a large pool of data so they can answer a variety of questions from different users. LangChain provides directory loaders that let developers load all matching files from a directory in one step.

This guide will demonstrate the process of using the file directory loaders in LangChain.

To use the file directory loader in LangChain, follow this easy and simple guide:

Prerequisite: Install Modules and Upload Files

First, install the LangChain framework to get started with the process:

pip install langchain

Then, install the OpenAI package to access OpenAI models from LangChain:

pip install openai

The “unstructured” module is also required because DirectoryLoader uses it by default to parse files, including files that contain unstructured data:

pip install unstructured

Import the os and getpass libraries, then store your OpenAI API key in an environment variable to connect to the OpenAI environment:

import os

import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

Upload Files in Directories

Upload the files to a directory (the examples below use /content/Directory). In a notebook interface, the uploaded files can be accessed by clicking on the folder icon in the left panel.
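If no files are at hand, a small test directory can also be created in code. The following is a minimal sketch using only the Python standard library; the file names and contents are hypothetical, and the directory is created in a temporary location rather than at /content/Directory.

```python
import tempfile
from pathlib import Path


def make_sample_directory(root: Path) -> Path:
    """Create a small directory tree with two text files and one Python file."""
    directory = root / "Directory"
    (directory / "notes").mkdir(parents=True, exist_ok=True)
    (directory / "intro.txt").write_text("LangChain loads documents.", encoding="utf-8")
    (directory / "notes" / "todo.txt").write_text("Try DirectoryLoader.", encoding="utf-8")
    (directory / "script.py").write_text("print('hello')\n", encoding="utf-8")
    return directory


# Hypothetical sample data placed in a fresh temporary directory.
sample_dir = make_sample_directory(Path(tempfile.mkdtemp()))
print(sorted(p.name for p in sample_dir.rglob("*") if p.is_file()))
```

The resulting path (str(sample_dir)) can then be used wherever the examples below use '/content/Directory'.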

Example 1: Check the Number of Loaded Documents

Import the DirectoryLoader class from LangChain:

from langchain.document_loaders import DirectoryLoader

Create a DirectoryLoader with the path of the directory and a glob pattern that selects the files to load, and store it in the loader variable:

loader = DirectoryLoader('/content/Directory', glob="**/*.txt")

Execute the loader using the load() function:

docs = loader.load()

Check the number of documents loaded by the loader by getting the length from the “docs” variable:

len(docs)

In this example, the output is 2 because the directory contains only two text files.
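To see which files a pattern such as "**/*.txt" selects, it can help to try the pattern without LangChain first. The sketch below assumes that DirectoryLoader interprets the glob the way Python's pathlib does; the file names are hypothetical.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory (hypothetical file names) to test the pattern against.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
for name in ["a.txt", "sub/b.txt", "c.md"]:
    (root / name).write_text("sample", encoding="utf-8")

# "**/*.txt" matches .txt files in the directory itself and in any subdirectory.
matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
print(matched)  # ['a.txt', 'sub/b.txt']
```

The Markdown file is ignored because its extension does not match the pattern.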

Example 2: Showing a Progress Bar

DirectoryLoader can also display a progress bar while the files load. Set the show_progress parameter to True; this option relies on the tqdm library, which can be installed with pip install tqdm:

loader = DirectoryLoader('/content/Directory', glob="**/*.txt", show_progress=True)

docs = loader.load()

Example 3: Using Multithreading

By default, DirectoryLoader() loads files using a single thread. To speed up the process, the user can enable multithreading:

loader = DirectoryLoader('/content/Directory', glob="**/*.txt", use_multithreading=True)

docs = loader.load()
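The idea behind multithreaded loading can be illustrated with the standard library alone. This is a sketch of the general technique, not LangChain's internal implementation; the read_file helper and the file names are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: Path) -> str:
    """Read one file; this stands in for whatever a loader does per file."""
    return path.read_text(encoding="utf-8")


# Create a few sample files to read.
root = Path(tempfile.mkdtemp())
paths = []
for i in range(8):
    path = root / f"file_{i}.txt"
    path.write_text(f"contents of file {i}", encoding="utf-8")
    paths.append(path)

# map() preserves input order even though the reads run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    contents = list(pool.map(read_file, paths))

print(len(contents))
```

Threads help mostly when loading is dominated by waiting on disk or network I/O, which is typically the case when reading many files.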

Example 4: Change Loader Class

Import the TextLoader class, which loads files as plain text:

from langchain.document_loaders import TextLoader

Configure DirectoryLoader() with TextLoader as the loader class so that each file matched by the glob pattern is read as plain text instead of being parsed by the default loader:

loader = DirectoryLoader('/content/Directory', glob="**/*.txt", loader_cls=TextLoader)

Load the files from the directory and store them in the “docs” variable:

docs = loader.load()

Get the length of the “docs” variable to see how many files have been loaded:

len(docs)

Example 5: Using PythonLoader to Load Files

To load Python source files, import the PythonLoader class:

from langchain.document_loaders import PythonLoader

Pass PythonLoader as the loader class, together with a glob pattern that matches the extension of the files:

loader = DirectoryLoader('/content/Directory', glob="**/*.py", loader_cls=PythonLoader)

Execute the loader and store the loaded files in the docs variable:

docs = loader.load()

Check how many files were loaded. In this example, PythonLoader loads four Python files:

len(docs)

Example 6: Auto-Detecting File Encodings With TextLoader

When loading a large number of text files, some of them may use an encoding that TextLoader cannot read by default. LangChain offers several strategies for handling such files. Start by configuring the loader with TextLoader:

path = '/content/Directory'

loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)

Example 6.1: Using Default Behavior

Execute the loader using the load() function to load the content of the files. With the default behavior, a single file that cannot be decoded causes the whole loading process to fail:

loader.load()

Example 6.2: Using Silent Fail

Another strategy is to enable the silent_errors option, which skips the files that fail to load and continues with the rest:

loader = DirectoryLoader(path, glob="**/*.ipynb", loader_cls=TextLoader, silent_errors=True)

docs = loader.load()

Print the source paths of the files that were loaded successfully:

doc_sources = [doc.metadata['source'] for doc in docs]

doc_sources
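The skip-on-failure behavior can be sketched in plain Python to show what "silent" means here. This is an illustration of the general pattern under assumed file names, not LangChain's own code: one file contains bytes that are not valid UTF-8, and the loop records it and moves on.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data: one readable file and one with invalid UTF-8 bytes.
root = Path(tempfile.mkdtemp())
(root / "good.txt").write_text("readable text", encoding="utf-8")
(root / "bad.txt").write_bytes(b"\xff\xfe\xfa not valid UTF-8")

loaded, skipped = {}, []
for path in sorted(root.glob("*.txt")):
    try:
        loaded[path.name] = path.read_text(encoding="utf-8")
    except UnicodeDecodeError as error:
        # Record the failure and move on instead of aborting the whole run.
        skipped.append(path.name)
        print(f"Skipping {path.name}: {error.reason}")

print(sorted(loaded), skipped)
```

Keeping a list of skipped files makes it possible to inspect them later, which a purely silent skip does not offer.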

Example 6.3: Auto Detect Encoding

Another strategy is to let TextLoader detect the encoding of each file automatically before giving up on it. Pass the autodetect_encoding option to TextLoader through the loader_kwargs parameter:

text_loader_kwargs={'autodetect_encoding': True}

loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)

docs = loader.load()

Print the paths of the loaded files to confirm which files were read:

doc_sources = [doc.metadata['source'] for doc in docs]

doc_sources
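A simplified way to picture encoding detection is to try a list of candidate encodings in order. The sketch below does exactly that with the standard library; the read_with_fallback helper, the candidate list, and the sample file are assumptions for illustration, and LangChain's own detection may work differently (for example, by relying on a detection library).

```python
import tempfile
from pathlib import Path


def read_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")


# Hypothetical legacy file saved in Windows-1252 rather than UTF-8.
sample = Path(tempfile.mkdtemp()) / "legacy.txt"
sample.write_bytes("café".encode("cp1252"))

text, used = read_with_fallback(sample)
print(used, len(text))  # cp1252 4
```

The order of the candidates matters: a permissive encoding placed first would accept almost any bytes and could silently produce wrong characters.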

That is all about using the file directory loaders in LangChain.

Conclusion

To use the file directory loader in LangChain, install the LangChain, OpenAI, and unstructured modules, then point DirectoryLoader() at the directory that contains the files. The loader supports several options, including a progress bar, multithreading, alternative loader classes such as TextLoader and PythonLoader, and strategies for handling files that fail to load. This guide has illustrated each of these methods.
