Automated identification of press variants in old documents

Date

2023-01

Publisher

De Montfort University

Type

Thesis or dissertation

Abstract

Collation is the comparison of texts, word by word and character by character, to identify variations between them. It has long been used in a variety of application areas, performed by the naked eye, with mechanical machines, and with advanced automated methods that rely on software tools. The main objective of this research is to develop a fully automated system that detects textual variations between copies of the same book. The system has five main steps: pre-processing, segmentation, post-processing, feature extraction, and classification using a K-NN classifier. The post-processing step includes a new technique to solve the character co-existence problem: it counts the black pixels in each object detected in the character image and eliminates objects with a small number of black pixels. This technique eliminates unwanted objects from the character image with an accuracy of more than 90%. The research also addresses the detection of extra lines and extra words that may appear in the texts being compared, proposing a new technique that uses the distance between the first and last black pixel to determine whether there is an additional word. Testing used "The Tragedy of Hamlet" by William Shakespeare. Integrating a K-NN classifier with feature extraction algorithms (Zoning, Template Matching, Crossings, Theta Distribution, Projection Profile) gave higher character-matching accuracy (classifying each character to the correct class), 84% compared with 72% for the Calamari OCR system, and higher accuracy in detecting textual variants, 88% compared with 73% for Calamari.
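
The two pixel-counting techniques described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The function names `remove_small_objects` and `ink_span`, the `min_pixels` threshold, the use of 8-connectivity to define an "object", and the toy `character` image are all assumptions made for illustration. The abstract does not state how the first-to-last black pixel distance is compared between copies, so the sketch only computes that distance.

```python
from collections import deque


def remove_small_objects(image, min_pixels):
    """Return a copy of a binary image (1 = black) without small objects.

    An object is taken to be an 8-connected group of black pixels
    (an assumption; the abstract does not specify connectivity).
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Collect one object with a breadth-first search.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the object if its black-pixel count is below the threshold.
            if len(pixels) < min_pixels:
                for y, x in pixels:
                    cleaned[y][x] = 0
    return cleaned


def ink_span(image):
    """Column distance between the first and last black pixel (0 if none)."""
    black_cols = [x for row in image for x, v in enumerate(row) if v == 1]
    return max(black_cols) - min(black_cols) if black_cols else 0


# Toy character image: an 8-pixel stroke plus one stray pixel at the right edge.
character = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
cleaned = remove_small_objects(character, min_pixels=3)
```

In this toy image the stray pixel is erased because its object has fewer than `min_pixels` black pixels, and the ink span shrinks accordingly. A real pipeline would need a threshold tuned to the scan resolution and typeface.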
