Tesseract (https://code.google.com/p/tesseract-ocr/) is a well-known open-source OCR application that, among other things, features layout analysis and training capabilities. Because Tesseract is a command-line tool, it is easy to integrate into a larger digitisation workflow. This document describes how to create a custom recognition profile for a specific kind of document using a web application called Cutouts (http://wlt.synat.pcss.pl/cutouts) and command-line tools called page-generator (https://github.com/psnc-dl/page-generator).
Requirements and licensing
Overview of training process
First, the training images are processed with Tesseract OCR and uploaded to the Cutouts application. Users then prepare the training material by adjusting the boundaries of the glyphs recognised by Tesseract in the initial step (see Figure 1). As a result, each processed glyph is represented by four small files:
- original, not-binarized image with a glyph,
- binarized version of the image with a glyph,
- final version, which includes binarization and manual correction performed by user, e.g. removal of overlapping glyphs,
- XML file with metadata.
The XML file contains several important pieces of information about the glyph itself and about the original scanned image: the coordinates of the glyph, the size of the original image, the name of the original file, and the Unicode character associated with the glyph. In addition, Cutouts allows the user to record the print quality of a given glyph. Users can mark “noisy” glyphs as unreadable, and this flag can be used to filter out such characters when preparing the training material.
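The filtering described above can be sketched in Python. This is only an illustration: the element and attribute names (`glyph`, `char`, `source`, `box`, `noisy`) are hypothetical, since the actual Cutouts XML schema is not shown in this document, and would need to be adapted to the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout; the real Cutouts schema may differ.
SAMPLE = """<glyph>
  <char>ą</char>
  <source file="page_001.png" width="2480" height="3508"/>
  <box x="120" y="340" width="18" height="25"/>
  <noisy>false</noisy>
</glyph>"""


def parse_glyph(xml_text):
    """Read one glyph metadata document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    src = root.find("source")
    box = root.find("box")
    return {
        "char": root.findtext("char"),
        "file": src.get("file"),
        "page_size": (int(src.get("width")), int(src.get("height"))),
        "box": tuple(int(box.get(k)) for k in ("x", "y", "width", "height")),
        # A missing <noisy> element is treated as "not noisy" (assumption).
        "noisy": root.findtext("noisy", default="false").strip().lower() == "true",
    }


def usable_glyphs(xml_texts):
    """Keep only the glyphs that were not marked as noisy."""
    return [g for g in map(parse_glyph, xml_texts) if not g["noisy"]]
```

Calling `usable_glyphs` on a list of metadata documents returns only the entries whose noisy flag is not set, so unreadable glyphs never reach the training material.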
The next step uses page-generator, which converts the Cutouts output into Tesseract training images (see Figure 2). A simple Bash script is then launched to perform the training, which results in a new recognition profile. This profile can then be uploaded to an OCR service or used in other tools, e.g. the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl), which offers a web-based user interface for OCR and post-correction.
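Tesseract training images are accompanied by box files, in which each line has the form "character left bottom right top page" and coordinates are measured from the bottom-left corner of the image. The sketch below shows how glyph metadata could be turned into such a line. It assumes that the glyph coordinates are stored in pixels with the origin in the top-left corner; the function name `to_box_line` is hypothetical, and the snippet is not taken from page-generator itself.

```python
def to_box_line(char, x, y, width, height, page_height, page=0):
    """Build one Tesseract box-file line for a glyph.

    x, y, width and height are assumed to use a top-left origin,
    so the vertical axis is flipped to match the box-file convention.
    """
    left = x
    right = x + width
    top = page_height - y
    bottom = page_height - (y + height)
    return f"{char} {left} {bottom} {right} {top} {page}"


if __name__ == "__main__":
    # A glyph 18x25 px at (120, 340) on a page that is 3508 px high.
    print(to_box_line("ą", 120, 340, 18, 25, 3508))
```

If the real metadata already uses a bottom-left origin, the vertical flip has to be dropped.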
The two most tedious parts of this process, preparation of the training material and OCR post-correction, are implemented as web-based tools. As a result, the whole process can be accelerated by distributing small units of work among a group of volunteers.
Assuring quality with Cutouts
The nature of crowdsourcing does not guarantee that only skilled and well-trained experts will create the training materials. For that reason, Cutouts also features an audit interface. For each scanned image, the application administrator can check a statistically relevant sample of the material prepared by the volunteers and, based on this sample, decide on the quality of the prepared material. Figure 3 presents the Cutouts audit interface; incorrect elements can be rejected, in which case the editor has to process them again.
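The sampling idea behind the audit can be illustrated with a short Python sketch. The way Cutouts actually selects its sample is not described here, so the sampling fraction, the minimum sample size and the acceptance threshold below are illustrative assumptions, and both function names are hypothetical.

```python
import random


def draw_audit_sample(glyph_ids, fraction=0.1, minimum=5, seed=None):
    """Pick a random subset of glyphs for manual review.

    fraction and minimum are illustrative values, not Cutouts settings.
    """
    ids = list(glyph_ids)
    if not ids:
        return []
    size = min(len(ids), max(minimum, round(len(ids) * fraction)))
    return random.Random(seed).sample(ids, size)


def page_passes(rejected, sample_size, max_reject_rate=0.05):
    """Accept a page if the share of rejected glyphs is small enough."""
    return sample_size > 0 and rejected / sample_size <= max_reject_rate
```

For a page with 200 glyphs, `draw_audit_sample(range(200))` returns 20 distinct glyph identifiers for review. If the auditor rejects too many of them, the page goes back to the editor.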