Comprehensive software framework for accurate and flexible document digitisation, information extraction and semantic labelling
About
Driven by the ever-increasing need for accessible information, libraries, document repositories and commercial organisations are making significant efforts to digitise printed and handwritten documents.
Aletheia is the result of several years of research, development and practical use. It can be used for small-scale or bespoke projects, or to create training/test data for large-scale AI applications. The types of data that can be annotated include page structure, objects with content (text, tables, graphics, …), semantic tags, and relations between entities.
The University of Salford has developed an advanced production-quality software system for very accurate and yet efficient (cost-effective) analysis, recognition and annotation of large amounts of scanned documents.
In contrast to existing systems that simply apply the same processes to large numbers of documents, Aletheia helps the user achieve very high precision through an ever-expanding set of automated and semi-automated tools, developed and improved over the years. These tools are based on research by the Pattern Recognition and Image Analysis (PRImA) Lab and on feedback from stakeholders across the world (major content-holding institutions and commercial service providers), several of which have been using the tool in production environments.
Aletheia has also been used frequently by organisations around the world to create training data for building new mass digitisation systems. Data created with Aletheia is stored in dedicated, well-publicised XML formats that can be considered the de facto standard for this type of data.
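As a rough illustration of what working with XML-based annotation data of this kind can look like, the following Python sketch reads region outlines and text content from a small, made-up XML snippet. The element and attribute names (Page, TextRegion, Coords, points, Text) and all sample values are assumptions chosen for illustration only; they are not taken from the Aletheia documentation, so the actual schema should be consulted before adapting this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation file: NOT the real Aletheia schema.
SAMPLE = """<Page imageFilename="page_0001.png">
  <TextRegion id="r1" type="heading">
    <Coords points="100,100 900,110 890,220 95,210"/>
    <Text>Chapter One</Text>
  </TextRegion>
  <TextRegion id="r2" type="paragraph">
    <Coords points="100,300 900,300 900,800 100,800"/>
    <Text>It was a bright cold day.</Text>
  </TextRegion>
</Page>"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(xml_text):
    """Collect id, type, polygon outline and text for each text region."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iter("TextRegion"):
        coords = region.find("Coords")
        regions.append({
            "id": region.get("id"),
            "type": region.get("type"),
            "polygon": parse_points(coords.get("points")),
            "text": region.findtext("Text", default=""),
        })
    return regions


regions = read_regions(SAMPLE)
for r in regions:
    print(r["id"], r["type"], len(r["polygon"]), "points:", r["text"])
```

Storing outlines as polygons rather than bounding boxes is what allows non-rectangular regions, such as those found in complex or historical layouts, to be represented.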
The software is mature, having been developed extensively over successive research and development projects, and is available to license from the University of Salford. There is a version for Windows and a web-based version. Both come with extensive documentation, including video guides.
Key Benefits
While other ground truth software is available, the Aletheia tool is particularly sophisticated, offering advantages over existing tools, including:
- Support for complex document layouts (some systems only support simple rectangular text outlines), particularly applicable to historic texts
- Rich annotation, with a wide variety of data that can be represented (polygons, object attributes, text content, reading order, table structure, named entities and relations, and more)
- Support for a range of image formats (some systems only support JPEG)
- Fast, user-friendly interface (for desktop and web)
- Improved accuracy, compared to other software
Applications
Aletheia can be used:
- As a pre-processing step for existing OCR software, to improve its accuracy and speed (crucial for mass digitisation projects)
- As a tool for institutions to test, evaluate, compare and select the OCR software best suited to their specific application
- As a viewer for page recognition / OCR outputs (for visual inspection and post-correction)