Installing Tesseract OCR in Linux
Tesseract OCR is available by default on most Linux distributions. You can install it in Ubuntu using the command below:
Detailed instructions for other distributions are available here. Even though Tesseract OCR is available in repositories of many Linux distributions by default, it is recommended to install the latest version from the link mentioned above for improved accuracy and parsing.
Installing Support for Additional Languages in Tesseract OCR
Tesseract OCR includes support for detecting text in over 100 languages. However, you only get support for detecting text in the English language with the default installation in Ubuntu. To add support for parsing additional languages in Ubuntu, run a command in the following format:
The command above will add support for the Hindi language to Tesseract OCR. Sometimes you can get better accuracy and results by installing support for language scripts. For instance, installing and using the tesseract package for Devanagari script “tesseract-ocr-script-deva” gave me much more accurate results than using the “tesseract-ocr-hin” package.
In Ubuntu, you can find correct package names for all languages and scripts by running the command below:
Once you have identified the correct package name to install, replace the string “tesseract-ocr-hin” with it in the first command specified above.
Using Tesseract OCR to Extract Text from Images
Let’s take an example of an image shown below (taken from Wikipedia page for Linux):
To extract text from the image above, you have to run a command in the following format:
Running the command above gives the following output:
In the command above, “capture.png” refers to the image from which you want to extract the text. The captured output is then stored in the “output.txt” file. You can change the language by replacing the “eng” argument with your own choice. To see all valid languages, run the command below:
It will show abbreviation codes for all languages supported by Tesseract OCR on your system. By default, it will only show “eng” as output. However, if you install packages for additional languages as explained above, this command will list more languages that you can use to detect text (as ISO 639 3-letter language codes).
If the image contains text in multiple languages, define primary language first followed by additional languages separated by plus signs.
If you want to store the output as a searchable PDF file, run a command in the following format:
Note that the searchable PDF file won’t contain any editable text. It includes the original image, with an additional layer containing the recognized text superimposed on the image. So while you will be able to accurately search text in the PDF file using any PDF reader, you won’t be able to edit the text.
Another point you should note that the accuracy of text detection increases greatly if the image file is of high quality. Given a choice, always use lossless file formats or PNG files. Using JPG files may not give the best results.
Extracting Text from a Multi-page PDF File
Tesseract OCR natively doesn’t support extracting text from PDF files. However, it is possible to extract text from a multi-page PDF file by converting each page into an image file. Run the command below to convert a PDF file into a set of images:
For each page of the PDF file, you will get a corresponding “output-1.png”, “output-2.png” file, and so on.
Now, to extract text from these images by using a single command, you will have to use a “for loop” in a bash command:
Running the above command will extract text from all “.png” files found in the working directory and store the recognized text in “output-original_filename.txt” files. You can modify the middle part of the command as per your needs.
If you want to combine all text files containing the recognized text, run the command below:
The process for extracting text from a multi-page PDF file into searchable PDF files is nearly the same. You have to supply an extra “pdf” argument to the command:
If you want to combine all searchable PDF files containing the recognized text, run the command below:
Both “pdftoppm” and “pdfunite” are installed by default on the latest stable version of Ubuntu.
Advantages and Disadvantages of Extracting Text in TXT and Searchable PDF Files
If you extract recognized text into TXT files, you will get editable text output. However, any document formatting will be lost (bold, italic characters, and so on). Searchable PDF files will preserve the original formatting, but you will lose text editing capabilities (you can still copy raw text). If you open the searchable PDF file in any PDF editor, you will get embedded image(s) in the file and not raw text output. Converting the searchable PDF files into HTML or EPUB will also give you embedded images.
Tesseract OCR is one of the most widely used OCR engines today. It is a free, open-source and supports over a hundred languages. When using Tesseract OCR, make sure to use high-resolution images and correct language codes in command-line arguments to improve the accuracy of text detection.