Cutouts

From DigitWiki


Introduction

Creating searchable versions of historical documents is, in general, a hard problem. Early printed documents differ substantially from modern ones in their letters, fonts and conventions. Contemporary commercial OCR applications were trained to recognise modern documents, which limits their applicability to historical material. One possible solution to this problem is OCR customization.

Tesseract (https://code.google.com/p/tesseract-ocr/) is a well-known open-source OCR application that features, among other things, layout analysis and training capabilities. Because Tesseract is a command-line tool, it is easy to integrate into a larger digitisation workflow. This document describes how to create a custom recognition profile for a specific kind of document using a web application called Cutouts (http://wlt.synat.pcss.pl/cutouts) and a command-line tool called page-generator (https://github.com/psnc-dl/page-generator).

Requirements and licensing

Cutouts is available at http://wlt.synat.pcss.pl/cutouts/ and page-generator is available on GitHub: https://github.com/psnc-dl/page-generator

Usage

Overview of training process

Figure 1. First step in training material preparation, adjusting the boundaries of the glyph.

First, the training images are processed with Tesseract OCR and uploaded to the Cutouts application. Users then prepare the training material by adjusting the boundaries of the glyphs that Tesseract recognised in the initial step (see Figure 1). As a result, each processed glyph is represented by four small files:

  • the original, non-binarized image of the glyph,
  • a binarized version of the image of the glyph,
  • the final version, which includes binarization and manual corrections performed by the user, e.g. removal of overlapping glyphs,
  • XML file with metadata.


The XML file contains several important pieces of information about the glyph itself and about the original scanned image: the coordinates of the glyph, the size of the original image, the name of the original file, and the Unicode character associated with the glyph. In addition, Cutouts allows users to record information about the print quality of a given glyph. A user can mark "noisy" glyphs as unreadable, and this flag can later be used to filter such glyphs out when the training material is prepared.
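The metadata described above can be read and filtered with a few lines of Python. This is only a minimal sketch: the source does not give the actual Cutouts XML schema, so the element and attribute names used here (glyph, char, box, page, noisy) are hypothetical and would need to be adapted to the real files.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical metadata layout; the real Cutouts XML schema may differ.
SAMPLE_XML = """
<glyph noisy="true">
  <char>ſ</char>
  <box x="120" y="340" width="18" height="42"/>
  <page file="page_001.png" width="2480" height="3508"/>
</glyph>
"""


@dataclass
class Glyph:
    char: str          # Unicode character associated with the glyph
    x: int             # glyph coordinates on the original scan
    y: int
    width: int
    height: int
    page_file: str     # name of the original file
    page_width: int    # size of the original image
    page_height: int
    noisy: bool        # True if the user marked the glyph as unreadable


def parse_glyph(xml_text: str) -> Glyph:
    """Parse one glyph metadata document (assumed layout, see above)."""
    root = ET.fromstring(xml_text.strip())
    box = root.find("box")
    page = root.find("page")
    return Glyph(
        char=root.findtext("char"),
        x=int(box.get("x")),
        y=int(box.get("y")),
        width=int(box.get("width")),
        height=int(box.get("height")),
        page_file=page.get("file"),
        page_width=int(page.get("width")),
        page_height=int(page.get("height")),
        noisy=root.get("noisy", "false") == "true",
    )


def usable_glyphs(glyphs):
    """Keep only glyphs that were not marked as noisy."""
    return [g for g in glyphs if not g.noisy]
```

Calling usable_glyphs on the parsed metadata before generating training pages keeps unreadable glyphs out of the training material.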

Figure 2. Example of an original page and a training image created using page-generator.


The next step uses page-generator, which converts the Cutouts output into Tesseract training images (see Figure 2). A simple Bash script is then run to perform the training, which produces a new recognition profile. This profile can then be uploaded to an OCR service or used in other tools, e.g. Virtual Transcription Laboratory (http://wlt.synat.pcss.pl), which offers a web-based user interface for OCR and post-correction.
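Tesseract training images are normally accompanied by a box file that lists, for each glyph, its character and bounding box, measured from the bottom-left corner of the page. The sketch below shows how glyph coordinates could be converted into such lines. It is not taken from page-generator: the function name is hypothetical, and it assumes that the glyph coordinates are given in pixels from the top-left corner of the image.

```python
def to_box_line(char, x, y, width, height, page_height, page=0):
    """Format one Tesseract box-file line: 'char left bottom right top page'.

    Assumes (x, y) is the top-left corner of the glyph in image
    coordinates (origin at the top-left of the page). Box files use an
    origin at the bottom-left, so the vertical axis is flipped.
    """
    left = x
    right = x + width
    top = page_height - y
    bottom = page_height - (y + height)
    return f"{char} {left} {bottom} {right} {top} {page}"


def to_box_file(glyphs, page_height):
    """Join box lines for an iterable of (char, x, y, width, height) tuples."""
    return "\n".join(
        to_box_line(char, x, y, w, h, page_height)
        for char, x, y, w, h in glyphs
    )
```

For a page 100 pixels high, a glyph "a" at x=10, y=20 with width 30 and height 40 yields the line "a 10 40 40 80 0".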

The two most tedious parts of this process, preparation of training material and OCR post-correction, are implemented as web-based tools. As a result, the whole process can be accelerated by distributing small units of work among a group of volunteers.

Assuring quality with Cutouts

Crowdsourcing by its nature does not guarantee that only skilled, well-trained experts will create the training materials. For this reason, Cutouts also features an audit interface. For each scanned image, the application administrator can check a statistically relevant sample of the material prepared by the volunteers and, based on this sample, judge the quality of the prepared material. Figure 3 presents the Cutouts audit interface, in which incorrect elements can be rejected. In that case, the editor has to process them again.
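The source does not say how Cutouts chooses the audited sample. As an illustration only, the sketch below uses a common textbook approach: Cochran's sample-size formula with a finite population correction, followed by a simple random draw. The function names, the 95% confidence level (z = 1.96) and the 5% margin of error are assumptions, not Cutouts settings.

```python
import math
import random


def audit_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Number of glyphs to audit out of `population` prepared glyphs.

    Cochran's formula with finite population correction; p=0.5 is the
    most conservative guess for the share of incorrect glyphs.
    """
    if population <= 0:
        return 0
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def draw_audit_sample(glyph_ids, seed=None):
    """Pick a simple random sample of glyph identifiers for manual review."""
    rng = random.Random(seed)
    glyph_ids = list(glyph_ids)
    return rng.sample(glyph_ids, audit_sample_size(len(glyph_ids)))
```

With these defaults, a page with 1000 prepared glyphs would require auditing 278 of them, while a very small page is audited in full.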

Figure 3. Audit interface for checking the quality of materials prepared by volunteers.