
How to Use Textract to Extract Text from Files

This article is a guide to using “Textract”, a Python module and command line utility, to extract text from a variety of file formats. It supports over 20 different file formats, and you can use it programmatically in your own Python programs by importing its main module. You may have used similar command line tools for text extraction, but most of them are limited to one or two specific file formats. Textract provides a one-stop solution with a unified interface for all of its supported formats. It can even use Optical Character Recognition (OCR) and speech recognition to extract text from image and audio files, respectively.

Installing Textract in Linux

You can install Textract in Linux using the pip package manager. To install pip in Ubuntu, run the command below:

$ sudo apt install python3-pip

Once pip is installed, run the following command to install the dependencies for Textract:

$ sudo apt install python3-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig python3-testresources

Now use the pip package manager to install Textract in Ubuntu:

$ pip3 install textract
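
Textract relies on external programs for several file formats, so a quick check that they are on your PATH can save some debugging later. The snippet below is a minimal sketch, not an official check: the executable names in REQUIRED_TOOLS are assumptions based on the packages installed above (for example, poppler-utils is assumed to provide pdftotext), and missing_tools is a hypothetical helper name.

```python
import shutil

# Assumed executable names provided by the packages installed above;
# adjust the list to match the formats you actually need.
REQUIRED_TOOLS = ["antiword", "unrtf", "pdftotext", "tesseract", "sox", "ffmpeg"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed tools were found.")
```

If a tool is reported as missing, only the formats that depend on it should be affected.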

In other Linux distributions, you can install pip from the distribution's package manager or by following the official installation instructions available here. Once pip is installed, you can either use the pip command shown above or follow the further installation instructions in the official Textract documentation (only for Linux distributions other than Ubuntu).

Extracting Text from Files

According to the official documentation of Textract, you can use it to extract text from the following file formats:

To extract text from any of these supported files and print the output to stdout in the terminal, run a command in the following format:

$ textract file.pdf

You can replace “file.pdf” with any other file format supported by Textract. Depending on the contents of a file, you should see some output similar to this:

To save the extracted output in another file, run a command in the following format:

$ textract file.pdf -o file.txt

You can replace the file names as needed. The “-o” switch is used to specify the name of the output file where extracted text will be stored.

Textract detects the file type from the file extension and uses the appropriate tool to parse and extract the contents. So to extract text from an image file, you can use the same command and supply a supported image file as the argument. As long as the file type is supported and you specify the filename with its extension correctly on the command line, Textract does the rest. For instance, to extract text from a “PNG” or an “OGG” file, you can simply run these commands:

$ textract file.png -o file.txt
$ textract file.ogg -o file.txt
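
If you have many files to convert, the same command can be driven from a short Python script. The following is a rough sketch that only uses the “-o” switch described above. The helper names build_command and convert_all and the file names in the usage comment are hypothetical, and the output name is assumed to be the input name with a “.txt” extension.

```python
import subprocess
from pathlib import Path


def build_command(source):
    """Build the textract command that writes <name>.txt next to the source file."""
    source = Path(source)
    return ["textract", str(source), "-o", str(source.with_suffix(".txt"))]


def convert_all(sources):
    """Run the textract command line tool once per input file."""
    for source in sources:
        # check=True raises CalledProcessError if textract exits with an error
        subprocess.run(build_command(source), check=True)


# Example usage (hypothetical file names):
# convert_all(["report.pdf", "scan.png", "memo.ogg"])
```

Note that files sharing a base name, such as file.png and file.ogg, would write to the same file.txt, so give the inputs distinct names or adjust the output path.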

To learn more about Textract command line usage, run the following command:

$ textract --help

Using Textract as a Python Module

To use Textract in a Python program, start with the following code sample:

import textract

# process() returns the extracted text as bytes (UTF-8 encoded by default)
text = textract.process("file.png")
print(text.decode("utf-8"))

The first statement imports the main textract module. Next, the “process” function is called with a file name as its argument. Like the command line utility, the process function detects the file type from the file extension and then uses a parser and extractor suitable for that extension.
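
Because extraction depends on the file type and on external tools, it can fail for individual files. The sketch below shows one way to process several files and keep going when one of them fails. The helper name extract_many is hypothetical, and it catches the generic Exception for simplicity, although Textract defines its own exception types in textract.exceptions.

```python
def extract_many(paths, process=None):
    """Return a dict mapping each path to its extracted text, or None on failure."""
    if process is None:
        import textract  # imported lazily; requires textract to be installed
        process = textract.process

    results = {}
    for path in paths:
        try:
            # process() returns bytes, UTF-8 encoded by default
            results[path] = process(path).decode("utf-8")
        except Exception as error:  # broad on purpose; narrow this as needed
            print(f"Could not extract {path}: {error}")
            results[path] = None
    return results


# Example usage (hypothetical file names):
# texts = extract_many(["file.pdf", "file.png"])
```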

You can also manually override the file extension using the “extension” argument. Here is a code sample:

import textract

# Treat the file as OGG regardless of how its name ends
text = textract.process("file.ogg", extension="ogg")
print(text.decode("utf-8"))
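
The “extension” argument is most useful when a file name has no extension or a misleading one. As an illustration, the sketch below guesses the extension from the first bytes of the file and passes it to process. This is not something Textract does itself: SIGNATURES, guess_extension and process_without_extension are hypothetical names, and only three well-known file signatures are covered.

```python
# Leading bytes ("magic numbers") for a few formats that Textract supports
SIGNATURES = {
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"OggS": "ogg",
}


def guess_extension(header):
    """Return an extension for the given leading bytes, or None if unknown."""
    for signature, extension in SIGNATURES.items():
        if header.startswith(signature):
            return extension
    return None


def process_without_extension(path):
    """Extract text from a file whose name lacks a usable extension."""
    import textract  # imported here; requires textract to be installed

    with open(path, "rb") as handle:
        extension = guess_extension(handle.read(8))
    if extension is None:
        raise ValueError(f"Unrecognized file signature: {path}")
    return textract.process(path, extension=extension)
```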

If you want to override the extraction method that Textract picks automatically, you can use the “method” argument (as shown in the code sample below):

import textract

# "pdfminer" is one of the documented extraction methods for PDF files
text = textract.process("file.pdf", method="pdfminer")
print(text.decode("utf-8"))
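
Since different methods can give different results for the same file, one possible pattern is to try several methods in order and keep the first one that returns any text. The following is only a sketch: extract_with_fallback is a hypothetical helper, and the method names in the usage comment are the ones the Textract documentation lists for PDF files, so check them against the version you have installed.

```python
def extract_with_fallback(path, methods, process=None):
    """Try each extraction method in order and return (method, text) for the first non-empty result."""
    if process is None:
        import textract  # imported lazily; requires textract to be installed
        process = textract.process

    last_error = None
    for method in methods:
        try:
            text = process(path, method=method).decode("utf-8")
        except Exception as error:  # remember the failure and try the next method
            last_error = error
            continue
        if text.strip():
            return method, text
    raise RuntimeError(f"No method produced text for {path}") from last_error


# Example usage:
# method, text = extract_with_fallback("file.pdf", ["pdftotext", "pdfminer", "tesseract"])
```

Ordering the faster methods first and OCR last keeps the common case quick, since OCR is typically the slowest option.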

Supported file types and extraction methods are listed here.

To learn more about the Textract Python API and its usage, you can view the API documentation available here.

Conclusion

Textract provides a single unified command line interface and Python API for extracting text from a number of different file types. You can even use it to extract content from media files. It is especially suitable when you don’t want to juggle several different command line utilities for text extraction and would rather use a single API for everything.

About the author

Nitesh Kumar

I am a freelance software developer and content writer who loves Linux, open source software and the free software community.