Tesseract: A free OCR solution
Tessereact is considered one of the best OCR solutions available. Since 2006 it is sponsored by Google, previously it was developed by Hewlett Packard in C and C++ between 1985 and 1998. The system is capable to identify even handwriting, it can learn increasing it’s accuracy, and is among the most developed and complete in the market.
It easily beats commercial competitors like ABBY, if you are looking for a serious solution for OCR, Tesseract is the most accurate one, but don’t expect for massive solutions: it uses a core per process, which means a 8 core processor (hyperthreading accepted) will be able to process 8 or 16 images simultaneously.
When I used Tesseract we managed thousands of potential customers uploading handwritten content, images with text, etc. We used 48 core servers, with DatabaseByDesign and then with AWS, we never had a resources problem.
We had an uploader which discriminated between text files like Microsoft Office or Open Office files and images or scanned documents. The uploader determined whatever the OCR or PHP scripts would process an order, in the field of text recognition.
Tesseact is a great solution, but before thinking about it you must know, last Tesseract’s versions brought big improvements, some of them mean hard work. While training could last for hours or days, recent Tesserct’s versions training may be of days, weeks, or even months if you are looking for a multilingual OCR solution.
Installing Tesseract 4 on Debian / Ubuntu:
If you are using a different Linux distribution, you’ll need to copy the last github repository version and copy the .traineddata file into ‘tessdata’ (/usr/share/tesseract-ocr/tessdata or /usr/share/tessdata).
By default Tesseract will install the English language pack, to install additional languages run
for example, to add Hebrew:
You can include all languages by running:
In order for Tesseract to work properly, we will need to use the command “convert” (convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more) provided by Imagemagick:
Lets install imagemagick with apt-get:
Now let’s test Tesseract, find an image containing text and run:
If installed properly, Tesseract will extract the text from the image.
When I worked with Tesseract, all we needed was to word count documents. Like with any other program you can, and must, train it, in Word we can define some symbols which can be counted or not, if to count or not numbers, etc. the same with Tesseract.
We can also train it’s sensibility to specific images.
Size Optimization: According to official sources, the optimal pixels size for an image to be processed successfully by Tesseract is 300DPI. We’ll need to process any image using the -r parameter to enforce this DPI. Increasing the DPI will also increase the processing time.
Page rotation: If when scanned the page wasn’t properly rotated and stays 180° or 45°, Tesseract’s accuracy will decrease, you can use this Python script to automatically detect and fix rotation issues.
Border Removal: According to Tesseract’s official man, borders can erroneously be picked as characters, especially dark borders and where there is gradation variety. Removing borders may be a good step to achieve the maximal accuracy with Tesseract.
Removing Noise: According to Tesseracts, noise “is random variation of brightness or colour in an image”. We can remove it in the binarization step, which means polarizing it’s colors.
While most of tutorials cover only Tesseract’s installation, I will summarize how to train your OCR system, here we can find a tutorial for all versions. In this article I’ll summarize how to train Tesseract 4 which includes a new “neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.”
Before continuing we will need to install additional libraries:
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
And we will install the training tools by running, within the Tesseract directory:
sudo make training-install
According to Tesseract’s official wiki, we have 3 current options to train our OCR system:
- “Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
- Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn’t work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
- Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.
While the above options may sound different, the training steps are actually almost identical, apart from the command line, so it is relatively easy to try it all ways, given the time or hardware to run them in parallel.”
In this tutorial, we will only run the tesstrain.sh script which will call necessary programs to train a specific language.
First of all lets clone all the files within our /usr/share/tesseract-ocr:
Go to /usr/share/tesseract-ocr/tesseract/training (Tesseract’s default installation directory) and run:
$ ./tesstrain.sh --lang heb --langdata_dir /usr/share/tesseract-ocr/langdata --tessdata_dir /usr/share/tesseract-ocr/tessdata
Change “heb” for the language you want to train, and also edit the path to your data.
Within the directory /usr/share/tesseract-ocr/tesseract/training you will find the file language-specific.sh useful to add rules for specific languages.
Tesseract is to me the best OCR solution, but recently it made huge changes from the past versions and many users are complaining about changes or things which are no longer working, I wouldn’t worry since the changes seem to give great results. Tesseract’s community is very active, in case you find problems running tesseract, become part of Tesseract’s community here.