Improving the quality of scanned images can serve two different purposes:
- enhance the visual appearance of images when viewed by humans,
- enhance the quality for post-processing steps such as OCR and layout analysis.
Depending on the use case, different tools or settings have to be applied to optimize the image processing result for a particular purpose or material.
Common tools for enhancing the visual appearance of images perform deskewing, contrast enhancement or border adjustment. The overall goal is to transform scanned images so that they show sharp and readable text, clear images and a white background. Additionally, borders and page sizes are usually made uniform across all pages to improve the viewing experience for a set of pages. These requirements can mean applying different processing parameters to different regions of an image: text should be rendered with very high contrast, whereas images or photos require much less.
Image enhancement for post-processing purposes usually involves tools for deskewing, noise removal and binarisation. However, the parameters used for such tools depend very much on the intended use case. For example, if the goal is to improve OCR results by applying image enhancement tools, the optimal parameters might vary between OCR engines. Additionally, parameters might have to be adjusted for different data sets or even for individual pages within a given data set.
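To illustrate why binarisation parameters depend on the material, here is a minimal sketch that derives a global threshold from the pixel values of each image instead of using a fixed one. It uses Otsu's method, which is one common choice and not necessarily what any tool mentioned here implements. The function names `otsu_threshold` and `binarise` and the tiny sample image are assumptions made for this example.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level t (0..255) that maximises the between-class
    variance when pixels <= t form one class and pixels > t the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(image: List[List[int]], threshold: int) -> List[List[int]]:
    """Map dark pixels (ink) to 1 and bright pixels (paper) to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 2x3 greyscale page fragment: bright paper, dark ink.
page = [[250, 245, 30],
        [240, 20, 25]]
t = otsu_threshold([p for row in page for p in row])
print(t, binarise(page, t))
```

Because the threshold is computed from each image's own histogram, a faded page and a high-contrast page end up with different values, which is the kind of per-page adjustment described above.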
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and their arrangement in the correct reading order. The detection and labeling of the different zones (or blocks) embedded in a document as text body, illustrations, math symbols, and tables is called geometric layout analysis. Text zones also play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.
OCR (optical character recognition) is the automatic transcription of the text shown in an image into machine-readable text. In this section we provide training materials for both commercial and open-source tools.
Evaluation of OCR can be done in a number of different ways. From the point of view of reproducibility and objectivity, the preferred way is to develop ground truth (GT) transcriptions for an evaluation dataset and to evaluate by measuring character error rate and/or word error rate. An essential prerequisite for this is a tool which compares the OCR output with the ground truth transcription and collects statistics about the number of errors, both in terms of characters (Character Error Rate, CER) and words (Word Error Rate, WER). Most evaluation tools proceed by aligning ground truth and OCR output, which makes it possible to locate the discrepancies between them precisely. An alternative evaluation measure, which gives a rough indication of word error rate without considering the location of the words, is the bag-of-words error rate. This may be used when complex layouts render alignment problematic.
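The three measures can be sketched as follows, assuming CER and WER are defined as the Levenshtein edit distance divided by the length of the ground truth in characters or words. The bag-of-words variant shown here (the larger of the missing and spurious word counts, divided by the number of ground truth words) is one possible definition chosen for illustration, since evaluation tools differ in detail. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(gt: str, ocr: str) -> float:
    """Character error rate relative to the ground truth length."""
    if not gt:
        raise ValueError("ground truth must not be empty")
    return edit_distance(gt, ocr) / len(gt)


def wer(gt: str, ocr: str) -> float:
    """Word error rate on whitespace-separated tokens."""
    gt_words = gt.split()
    if not gt_words:
        raise ValueError("ground truth must not be empty")
    return edit_distance(gt_words, ocr.split()) / len(gt_words)


def bag_of_words_error(gt: str, ocr: str) -> float:
    """Order-independent error rate (assumed definition, see lead-in)."""
    gt_bag, ocr_bag = Counter(gt.split()), Counter(ocr.split())
    if not gt_bag:
        raise ValueError("ground truth must not be empty")
    missing = sum((gt_bag - ocr_bag).values())   # GT words absent from OCR
    spurious = sum((ocr_bag - gt_bag).values())  # OCR words absent from GT
    return max(missing, spurious) / sum(gt_bag.values())


gt_text = "the quick brown fox"
print(cer(gt_text, "the qu1ck brown fax"))
print(wer(gt_text, "the qu1ck brown fax"))
# Same words in the wrong reading order: WER is high, bag-of-words is zero.
print(wer(gt_text, "brown fox the quick"),
      bag_of_words_error(gt_text, "brown fox the quick"))
```

The last call shows why the bag-of-words measure is useful for complex layouts: a reading-order error inflates WER even though every word was recognised correctly.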
OCR produces its best results from well-printed, modern documents. Historical documents, however, show a range of effects that can reduce recognition accuracy: poor paper quality, poor typesetting, damage or degradation of the original paper source, and text skew or warping due to age or humidity. In addition, content-holding institutions tend to have legacy data: text-based digitised material that was not originally created with OCR in mind.
Such material yields unsatisfactory OCR accuracy, leaving the digital material only partially discoverable and usable at best. IMPACT has therefore created a number of tools and modules that allow institutions and their users to correct and validate OCR text either prior to publication or afterwards (by means of crowdsourcing).
The purpose of tools in this area is to make digitised text more accessible to users and researchers by applying linguistic resources and language technology.
A named entity (NE) is a word or string of words referring to a specific location, person, or organisation (or date, time, etc.). Named Entity Recognition aims to detect such elements in a text and classify them into predefined categories such as persons, organisations and locations. Such information can be used to facilitate search by end users, and also to supplement lexicons for OCR in order to enhance text recognition.
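As a rough illustration of the detect-and-classify step, the sketch below tags entities by looking up word sequences in a small gazetteer (a list of known names). This is a deliberately simple stand-in: practical NER tools generally rely on trained statistical models and context, and nothing here reflects a particular tool. The gazetteer entries, the category labels and the function name `tag_entities` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased name -> category.
GAZETTEER: Dict[str, str] = {
    "amsterdam": "LOCATION",
    "the hague": "LOCATION",
    "rembrandt": "PERSON",
    "royal library": "ORGANISATION",
}

PUNCTUATION = ".,;:!?"


def tag_entities(text: str,
                 gazetteer: Dict[str, str] = GAZETTEER,
                 max_len: int = 3) -> List[Tuple[str, str]]:
    """Return (entity, category) pairs, preferring the longest match
    that starts at each token position."""
    tokens = text.split()
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).strip(PUNCTUATION)
            category = gazetteer.get(candidate.lower())
            if category is not None:
                found.append((candidate, category))
                i += n          # skip past the matched tokens
                break
        else:
            i += 1              # no entity starts at this token
    return found


print(tag_entities("Rembrandt worked in Amsterdam, not in The Hague."))
```

The recognised names could then feed a search index, or be added to an OCR lexicon so that proper names are less likely to be "corrected" into ordinary words.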
This section provides materials for tools serving a variety of purposes.
The Impact Centre of Competence dataset contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment. A carefully selected subset of these images has been reproduced with accompanying "ground truth".
In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image's text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine's accuracy, and how important any deviation from ground truth is in that instance.
A lexicon is a structured, machine-usable repository of relevant linguistic knowledge about words in a language. A lexicon will contain historical variants (orthographical variants, inflected forms) and link them to a corresponding dictionary form in modern spelling (known as a 'modern lemma'). In this way, a user can search for a modern word ('water') and receive results that take into account all historical variants in that language ('wæter', 'weter', 'waterr', 'watre', etc.).
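The lookup described above can be sketched as a mapping from historical variants to a modern lemma, inverted at query time so that a search for the lemma also matches its variants. This is a minimal sketch and does not describe the data model of any actual lexicon. The variant list reuses the 'water' example, while the function names and sample documents are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Historical variant -> modern lemma (the lemma maps to itself).
VARIANT_TO_LEMMA: Dict[str, str] = {
    "water": "water",
    "wæter": "water",
    "weter": "water",
    "waterr": "water",
    "watre": "water",
}


def build_expansion_index(variant_to_lemma: Dict[str, str]) -> Dict[str, Set[str]]:
    """Invert the lexicon: modern lemma -> set of all attested forms."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for variant, lemma in variant_to_lemma.items():
        index[lemma].add(variant)
    return dict(index)


def search(query: str, documents: List[str],
           index: Dict[str, Set[str]]) -> List[int]:
    """Return the positions of documents containing the query word
    or any of its historical variants."""
    forms = index.get(query, {query})   # unknown words match literally
    return [pos for pos, doc in enumerate(documents)
            if forms & set(doc.lower().split())]


# Invented sample documents.
docs = ["Þæt wæter is ceald", "the watre was depe", "no match here"]
index = build_expansion_index(VARIANT_TO_LEMMA)
print(search("water", docs, index))
```

Expanding the query rather than normalising the documents keeps the original historical spelling intact in the stored text.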