TPDL Tutorial State-of-the-art tools for text digitisation

From DigitWiki
Revision as of 11:47, 4 June 2014 by Admin (Talk | contribs)



The goal of this tutorial (organised by the Succeed project) is to give participants practical experience with a number of state-of-the-art tools for digitisation and text processing which have been developed in recent research projects. The tutorial will focus on hands-on demonstration and on testing the tools in real-life situations, including cases provided by the participants. The learning objectives are:

- Gain practical insight into the most recent developments in text digitisation techniques.

- Identify strengths and usability weaknesses of existing tools.

- Gain a better understanding of the effect of new tools and resources on productivity.

- Discuss the requirements and effects of their integration in the production workflow.

This tutorial will give participants a unique opportunity to gather information about tools created in research projects, to test and evaluate their usability and to find out how to benefit from the usage of these tools. Conversely, researchers will benefit from practitioner comments and suggestions.


To register for the tutorial, please visit

Target audience

The tutorial is intended for librarians, archivists and museum staff involved in text digitisation. Attendees may have basic knowledge of digitisation strategies.

Tentative program

1. Introduction to the digitisation process: The tutorial will start with a general introduction to the digitisation process. The steps needed and the best practices for starting a mass digitisation project will be discussed. In this introduction, we will give an overview of the areas in which we believe significant improvement in the digitisation workflow can be achieved and discuss the evaluation criteria which led to the selection of the tools showcased.

2. Overview of the available tools: After the introductory section, every tool will be presented briefly, followed by a practical demonstration and a discussion of how it can be deployed within the library digitisation infrastructure. Depending on the number and interests of the attendees, the tutorial may be split into two tracks: Image-to-text and Language resources.

3. Hands-on session: The tutorial will continue with an extended hands-on session for those attendees who need further training or testing.


1. Image Enhancement

These tools enhance the quality of scanned documents, both for visual presentation in digital libraries and eBooks and to improve the results of subsequent steps such as segmentation and OCR. We will showcase tools for binarisation and colour reduction, noise and artefact removal, and geometric correction.

Showcased Tools:

  • Scan Tailor [1]

Scan Tailor is an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others.
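To make the binarisation step mentioned above more concrete, here is a minimal sketch of one well-known global thresholding technique, Otsu's method, in plain Python. It is an illustration only and is not taken from Scan Tailor or any other showcased tool; the function names `otsu_threshold` and `binarise` and the representation of a page as a nested list of grey levels (0 to 255) are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    Illustrative implementation of Otsu's method; `pixels` is a flat list
    of integer grey levels (an assumption of this sketch).
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image):
    """Map a 2-D list of grey levels to 0 (ink) or 255 (paper)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


if __name__ == "__main__":
    page = [[10, 12, 200], [11, 210, 205]]  # tiny made-up "scan"
    print(binarise(page))
```

A single global threshold like this tends to struggle with uneven lighting or stained paper, which is one reason dedicated enhancement tools also offer locally adaptive methods.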

2. OCR and Post-correction

OCR is defined as the automatic transcription of the text shown in an image into machine-readable text. In this area, we will discuss and demonstrate the latest developments in both commercial and open source OCR. This section will also include tools for document segmentation. Additionally, approaches to both interactive (e.g. crowdsourcing) and fully automatic post-correction of the digitised text will be demonstrated.

Showcased Tools:

Tesseract is probably the most accurate open source OCR engine available. Cutouts is a web application which allows the preparation of training data for the Tesseract OCR engine to be crowdsourced.

  • Virtual Transcription Laboratory [4]

Virtual Transcription Laboratory is a Virtual Research Environment which works as a crowdsourcing platform for developing high-quality textual representations of digital documents. It gives access to an online OCR service and an easy-to-use transcription editor. Images can be imported from various sources, including direct import from digital libraries.
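As a rough illustration of what fully automatic post-correction can look like, the sketch below replaces OCR tokens that are not in a word list with the closest entry from that list, using the standard-library `difflib` module. This is a deliberately simple, assumed approach for illustration; it is not the method used by any of the showcased tools, and the names `correct_token` and `correct_text`, the sample lexicon and the `cutoff` value are all hypothetical.

```python
import difflib


def correct_token(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry for `token`, or `token` unchanged.

    `lexicon` is a list of lower-case word forms (an assumption of this
    sketch). Corrected tokens are returned in lower case.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return token
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text, lexicon, cutoff=0.8):
    """Apply `correct_token` to every whitespace-separated token."""
    return " ".join(correct_token(t, lexicon, cutoff) for t in text.split())


if __name__ == "__main__":
    words = ["digitisation", "library", "newspaper"]  # made-up lexicon
    print(correct_text("the digitisatlon of a newspapcr", words))
```

A high cutoff keeps short function words such as "the" or "of" from being rewritten, at the cost of missing heavily damaged tokens. Real post-correction systems typically also use context and historical spelling variation.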

3. Logical Structure Analysis

Documents such as newspapers or magazines are a composition of various structural elements such as headings or articles. Here we will present tools that are able to automatically detect and reconstruct these structural elements from scanned documents.

Showcased Tools:

  • Newspaper Segmenter [5]

Award-winning (e.g. ICDAR'09, '11) page and article segmentation for scanned documents featuring complex layouts, e.g. historical newspapers, contemporary magazines and textbooks.

  • Functional Extension Parser [6]

The Functional Extension Parser (FEP) is a Document Understanding Software tool capable of decoding layout elements of books. Based on the output of Optical Character Recognition, layout elements such as page numbers, running titles, headings, and footnotes are detected and annotated.

4. Lexicon-building, Deployment and Enrichment

The purpose of tools in this area is to make digitised text more accessible to users and researchers by applying linguistic resources and language technology. The tutorial will show how lexical resources can be constructed and how they can be exploited in retrieval and OCR.

Showcased Tools:

NERT is a tool that can mark and extract named entities (persons, locations and organizations) from a text file. It uses a supervised learning technique, which means it has to be trained with a manually tagged training file before it is applied to other text. In addition, versions 2.0 and higher of the tool come with a named entity matcher module, which makes it possible to group variants or to assign modern word forms of named entities to old spelling variants. The tool in this package is based on the named entity recognizer from Stanford University, which has been extended for use in IMPACT. Among the extensions are the aforementioned matcher module and a module that reduces spelling variation within the data used, thus leading to improved performance.

  • Corpus Based Lexicon Tool (CoBaLT) [8]

The Corpus Based Lexicon Tool (CoBaLT) is a tool for corpus-based lexicon construction. Users can upload a text dataset (corpus) for use in creating an attestation-based lexicon. The tool is used to manually correct the automatically lemmatized corpus text. Verified lemmatized words, plus the context in which they appear, are stored in the Information Retrieval Lexicon. The tool can handle plain text and various XML formats, including the IMPACT Page XML format and TEI. Important requirements for the tool are that it can quickly process large quantities of data, that it is a web application that can be run from any computer in the local network, that frequent input actions can be performed with the keyboard, and that information is presented in such a way that quick evaluation is possible.
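The idea of an attestation-based lexicon, in which each verified word form is stored under its lemma together with the contexts where it occurs, can be sketched as a small data structure. The code below is an illustrative assumption only: it does not reflect CoBaLT's actual data model or storage format, and the function name `build_lexicon`, the `(word_form, lemma)` input pairs and the `window` parameter are all hypothetical.

```python
from collections import defaultdict


def build_lexicon(verified_tokens, window=2):
    """Group verified word forms by lemma and record their contexts.

    `verified_tokens` is a list of (word_form, lemma) pairs in text order,
    standing in for manually verified lemmatisation output. The result maps
    lemma -> word form -> list of attestations (short context strings).
    """
    lexicon = defaultdict(lambda: defaultdict(list))
    forms = [form for form, _ in verified_tokens]
    for index, (form, lemma) in enumerate(verified_tokens):
        left = forms[max(0, index - window):index]
        right = forms[index + 1:index + 1 + window]
        context = " ".join(left + [form] + right)
        lexicon[lemma][form].append(context)
    return lexicon


if __name__ == "__main__":
    tokens = [
        ("Hee", "he"), ("hath", "have"), ("the", "the"),
        ("booke", "book"), ("and", "and"), ("bookes", "book"),
    ]  # made-up historical spellings
    lexicon = build_lexicon(tokens, window=1)
    for form, attestations in lexicon["book"].items():
        print(form, attestations)
```

Keeping the attestations alongside each form is what lets a lexicon like this support retrieval: a query for a modern lemma can be expanded to every historical spelling that has been verified in context.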


Download the tutorial slides from here [9]


Bob Boelhouwer (Instituut voor Nederlandse Lexicologie - INL)

He is a computational linguist. He holds a PhD (thesis: From letter strings to phonemes: the role of orthographic context in phonological recoding, 1998). His relevant experience includes the integration of lexical resources and the implementation of tools to explore lexical resources.

Adam Dudczak (Poznań Supercomputing and Networking Centre - PSNC)

He holds a Master's degree in Computer Science and is a member of the Digital Libraries Team, a division of PSNC. He is leading the development of the Virtual Transcription Laboratory (VTL, [10]), a portal integrating custom cloud-based OCR with a handy editing interface which allows for crowdsourcing of text correction. Apart from this, Adam is working on the development of e‐learning materials related to digital libraries and digitisation, created during the ACCESS IT and ACCESS IT plus ([11]) projects. These courses include e-learning materials and a dedicated operating system, DigitLab ([12]), which integrates free and widely known tools useful in the digitisation of various kinds of documents (including textual objects). Adam is also an experienced trainer who has worked with various communities in the area of digital libraries and software development.

Sebastian Kirch (Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS)

He received his diploma in information technology from the University of Marburg in 2009, majoring in the field of distributed systems. At Fraunhofer IAIS he is working on the design and implementation of distributed software applications for document analysis and enrichment. Sebastian has been active in several research projects, most recently as project leader in the project “MediaGrid” [13], funded by the BMBF (German Federal Ministry of Education and Research).

Date and Venue

The venue for the tutorials is the main conference hotel (Hotel Excelsior) in Valletta (information on the venue will be posted soon on [14]). All tutorials will be held on the 22nd of September.