Tesseract is free and arguably the best OCR solution on the market. It was developed by Hewlett-Packard in C and C++ between 1985 and 1998, and Google began sponsoring its development in 2006. The system can even identify handwriting; it can be trained to increase its accuracy, and it is among the most developed and complete OCR engines available.
If properly trained, it can beat commercial competitors like ABBYY. If you are looking for a serious OCR solution, Tesseract is the most accurate one, but don’t expect massive throughput: it uses one core per process, which means an 8-core processor (hyperthreading included) will be able to process 8 or 16 images simultaneously.
Tesseract is a great solution, but before committing to it, you should know that its latest versions brought big improvements, some of which mean hard work. While training used to take hours or days, training recent versions may take days, weeks, or even months, especially if you are looking for a multilingual OCR solution.
Installing Tesseract on Debian and Ubuntu:
To install Tesseract on Debian or Ubuntu, use apt to install the tesseract-ocr package.
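The installation step can be sketched as follows, assuming a Debian or Ubuntu system with sudo configured. The function name install_tesseract is only illustrative, and the commands are wrapped in a function so that nothing runs until you call it.

```shell
# Sketch: install Tesseract from the distribution repositories.
# install_tesseract is an illustrative name, not part of Tesseract.
install_tesseract() {
    sudo apt update &&
    sudo apt install -y tesseract-ocr
}
```

After calling install_tesseract, running tesseract --version should print the installed version.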
This will install Tesseract, with its language data under /usr/share/tesseract-ocr/4.00/tessdata.
Note: For other Linux distributions, jump to Install Tesseract from Sources.
By default, only the English language pack is installed. Additional languages are packaged as tesseract-ocr- followed by the language code; the Hebrew language pack, for example, is tesseract-ocr-heb.
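A small sketch of that naming scheme, assuming the Debian and Ubuntu package names. The helper names lang_package and install_language are hypothetical and exist only for this example.

```shell
# Build the Debian/Ubuntu package name for a Tesseract language code.
# lang_package and install_language are illustrative helper names.
lang_package() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Install one language pack (needs root privileges and network access).
install_language() {
    sudo apt install -y "$(lang_package "$1")"
}
```

For Hebrew, the call would be install_language heb.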
To install all available languages, install the tesseract-ocr-all package.
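As a sketch, assuming the tesseract-ocr-all metapackage shipped by Debian and Ubuntu (the function name is illustrative):

```shell
# Install every language pack at once; this downloads a lot of data.
install_all_languages() {
    sudo apt install -y tesseract-ocr-all
}
```

Afterwards, tesseract --list-langs prints the languages Tesseract can use.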
To prepare images for Tesseract, we will need the “convert” command. This command converts between image formats and can also resize, blur, crop, despeckle, dither, draw on, flip, join, and re-sample an image, and much more. It is provided by ImageMagick, which Debian and Ubuntu package as imagemagick.
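A minimal sketch of installing ImageMagick and using convert for a format conversion. It assumes the imagemagick package name on Debian and Ubuntu; install_imagemagick and to_png are illustrative names.

```shell
# Install ImageMagick, which provides the convert command.
install_imagemagick() {
    sudo apt install -y imagemagick
}

# Convert an image to PNG, keeping the base name (scan.jpg -> scan.png).
to_png() {
    convert "$1" "${1%.*}.png"
}
```

For example, to_png scan.jpg would write scan.png next to the original file.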
Now let’s test Tesseract. Find an image containing text and run tesseract on it, passing the image file and a base name for the output.
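A sketch of that test run. The file names are placeholders, and the function name ocr_image is hypothetical; the language defaults to English.

```shell
# Run OCR on one image.
# $1: input image, $2: output base name (Tesseract appends .txt),
# $3: optional language code, eng when omitted.
ocr_image() {
    tesseract "$1" "$2" -l "${3:-eng}"
}
```

Calling ocr_image image.png output would write the recognized text to output.txt.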
Tesseract will extract the text from the image.
When I worked with Tesseract, all we needed was to count the words in documents. As with any other OCR program, you can, and must, train it to understand handwriting.
Advanced text editors let us define which symbols are counted and whether numbers are counted; the same possibility is available with Tesseract.
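A word count like the one described above can be sketched by sending Tesseract's output to standard text tools. Note the assumption: the filtering of numbers here is done after OCR with tr and grep, not with a Tesseract option, and both function names are illustrative.

```shell
# Count all words that Tesseract recognizes in an image.
# $1: image file, $2: optional language code (eng when omitted)
count_words() {
    # "stdout" as the output base prints the text instead of writing a file
    tesseract "$1" stdout -l "${2:-eng}" | wc -w
}

# Same count, but ignoring tokens made only of digits.
count_words_no_numbers() {
    tesseract "$1" stdout -l "${2:-eng}" |
        tr -s '[:space:]' '\n' |
        grep -vc '^[0-9]*$'
}
```

Which tokens to exclude is a design choice; the pattern passed to grep can be adapted to skip other symbols.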
Optimizing Tesseract:
- Size Optimization: According to official sources, Tesseract works best on images with a resolution of 300 DPI. We’ll need to process any image using the -r parameter to enforce this DPI. Increasing the DPI will also increase the processing time.
- Page rotation: If the page wasn’t positioned properly when scanned and ends up rotated, for example by 45° or 180°, Tesseract’s accuracy will decrease. You can use a Python script to detect and fix rotation issues automatically.
- Border Removal: According to Tesseract’s official manual, borders can be erroneously picked up as characters, especially dark borders with varying gradation. Removing borders may be a good step toward maximal accuracy with Tesseract.
- Removing Noise: According to Tesseract sources, noise “is random variation of brightness or color in an image”. We can remove this variation in the binarization step, which reduces the image to pure black and white.
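Several of these steps can be approximated with ImageMagick's convert. The following is only a sketch under stated assumptions: it uses convert's -density, -deskew, -shave, -despeckle, and -threshold options instead of the -r parameter or the Python script mentioned above, and the numeric values (40%, 10x10, 50%) are starting points to tune per scan, not recommendations from Tesseract. The function name is illustrative.

```shell
# Clean up a scan before OCR.
# $1: scanned input image, $2: cleaned output image
preprocess_for_ocr() {
    # -density only tags the file as 300 DPI; it does not add pixels.
    # -deskew straightens small tilts; it will not fix a 180-degree flip.
    # -shave trims 10 pixels from every edge to drop scanner borders.
    # -despeckle and -threshold reduce noise and binarize the image.
    convert "$1" \
        -units PixelsPerInch -density 300 \
        -deskew 40% \
        -shave 10x10 \
        -colorspace Gray \
        -despeckle \
        -threshold 50% \
        "$2"
}
```

A call such as preprocess_for_ocr scan.png clean.png would then be followed by running Tesseract on clean.png.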
Introduction to Tesseract training process:
Previously this article covered Tesseract’s training process, which evolved to a more manual process that deserves a dedicated article. Therefore this section only covers theoretical information on the training process and instructions to install Tesseract training tools and launch them.
According to Tesseract’s official wiki, there are currently three options to train our OCR system:
- “Fine-tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
- Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine-tuning doesn’t work, this is most likely the next best option. If you start with the most similar-looking script, cutting off the top layer could still work for training a completely new language or script.
- Retrain from scratch. Unless you have a very representative and sufficiently large training set for your problem, this is a daunting task. If not, you will likely end up with an over-fitted network that does really well on the training data but not on the actual data.”
Before continuing to Tesseract training instructions, we will need to install additional libraries:
On Debian-based Linux distributions, install the Tesseract development package and training tools with apt. If you are not using a Debian-based Linux distribution, read the instructions to install Tesseract training tools from sources.
After the installation, the training tools are available under /usr/share/tesseract-ocr/.
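A sketch of that step, assuming libtesseract-dev is the development package meant here; that is its name on Debian and Ubuntu, but which package ships the training scripts can differ between releases. The function name is illustrative.

```shell
# Install the Tesseract development files.
# Assumption: libtesseract-dev is the package the text refers to.
install_training_deps() {
    sudo apt install -y libtesseract-dev
}
```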
Before starting to train a language, you need to provide Tesseract with the content from which to learn.
For this, create the langdata directory and its eng subdirectory within Tesseract’s main installation directory, then create the training text file as shown below.
sudo mkdir -p /usr/share/tesseract-ocr/langdata/eng/
sudo nano /usr/share/tesseract-ocr/langdata/eng/eng.training_text
Note: Remember to add content to the eng.training_text file.
Once the training text file has been added, you can start training a language by passing its language code to the training script. English is identified as “eng”.
This process may take a long time, depending on your training text files. This is only an introduction to the Tesseract training process; we will publish a new article focused on the training process alone.
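A sketch of such a training run, assuming Tesseract 4.x and its tesstrain.sh script, called with the --lang, --langdata_dir, and --tessdata_dir options. The script location is an assumption based on the directory mentioned earlier, and the TESSTRAIN variable and function name are hypothetical.

```shell
# Assumed location of the training script; override TESSTRAIN if yours differs.
TESSTRAIN="${TESSTRAIN:-/usr/share/tesseract-ocr/tesstrain.sh}"

# Start training for one language.
# $1: language code, for example eng
train_language() {
    "$TESSTRAIN" \
        --lang "$1" \
        --langdata_dir /usr/share/tesseract-ocr/langdata \
        --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata
}
```

For English, the call would be train_language eng.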
Troubleshooting missing fonts:
In my case, I got an error when trying to train Tesseract: the Arial Bold font was missing. I solved this by installing the missing font.
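One way to get Arial on Debian or Ubuntu is the ttf-mscorefonts-installer package. This is an assumption about the fix, not necessarily the command originally used, and the package requires the contrib (Debian) or multiverse (Ubuntu) repository to be enabled. The function name is illustrative.

```shell
# Install the Microsoft core fonts (which include Arial) and refresh the font cache.
install_arial_fonts() {
    sudo apt install -y ttf-mscorefonts-installer &&
    sudo fc-cache -f
}
```

Running fc-list afterwards and searching its output for Arial confirms whether the font is now visible.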
Install Tesseract from Sources on Linux:
On other Linux distributions, you can get the Tesseract sources with git by cloning the project’s repository.
Once cloned, move into the tesseract directory using cd.
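These two steps can be sketched as follows, assuming the project's GitHub repository as the source. The function name is illustrative.

```shell
# Download the Tesseract sources and enter the source directory.
fetch_tesseract_sources() {
    git clone https://github.com/tesseract-ocr/tesseract.git &&
    cd tesseract
}
```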
Then run the autogen.sh script.
The autogen.sh script generates the configure script; now run configure to prepare the build.
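In order, and run from inside the tesseract source directory, the two steps look like this sketch. It assumes the usual build tools (autoconf, automake, libtool, pkg-config, a C++ compiler) and the Leptonica development files are already installed; the function name is illustrative.

```shell
# Generate the configure script, then configure the build.
configure_tesseract() {
    ./autogen.sh &&
    ./configure
}
```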
Run make to start compiling Tesseract.
Then run make install with root privileges.
Finally, execute ldconfig to update the shared library cache.
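The compile and install steps above, as one sketch (the function name is illustrative; each command stops the chain if it fails):

```shell
# Compile Tesseract, install it system-wide, and refresh the linker cache.
build_and_install() {
    make &&
    sudo make install &&
    sudo ldconfig
}
```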
To compile the training tools, run make training.
Then install them with make training-install.
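As a sketch, again run from the tesseract source directory after the main build has finished (the function name is illustrative):

```shell
# Build the training tools, then install them system-wide.
build_training_tools() {
    make training &&
    sudo make training-install
}
```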
Now you can follow the instructions to get started with the training process.
Conclusion:
As you can see, installing Tesseract on Linux is pretty easy, especially on Debian-based Linux distributions. When I used Tesseract, we managed thousands of potential customers uploading handwritten content, images with text, etc. We used 48-core servers, first with DatabaseByDesign and then with AWS, and we never had a resource problem.
We had an uploader that distinguished between text files, such as Microsoft Office or OpenOffice files, and images or scanned documents. The uploader determined whether the OCR engine or PHP scripts would process each text recognition order.
In my experience, Tesseract is the best OCR solution available on the market, and it’s open-source.
Thank you for reading this tutorial explaining how to install and configure Tesseract OCR on Linux. Keep following us for additional Linux tips and tutorials.