OCR4all 1.0 – Introduction
Motivation and General Idea
- Availability of Solutions: Numerous high-performance open-source solutions for Automatic Text Recognition (ATR) are already available, with new releases emerging continuously.
- Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
- Requirement: There is a need for user-friendly frameworks that facilitate the flexible, integrable, and sustainable combination and application of both existing and future ATR solutions.
- Objective: Our goal is to empower users to perform ATR independently, achieving high-quality results.
- Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.
OCR-D and OCR4all
- OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
- Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
- Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
- Open Source Commitment: All results of the OCR-D project are released as fully open-source software.
- OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
  - Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
  - Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.
System Architecture
- Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
- Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
  - Backend: Built with Java and Spring Boot.
  - Frontend: Developed using the Vue.js ecosystem.
- Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
- Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors.
- Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
- Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
- Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.
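The interplay of a processor interface and the "transcript of records" can be illustrated with a small sketch. It is not OCR4all's actual API or data model: the names `AtrProcessor`, `Step`, `Pipeline`, and `UppercaseProcessor`, as well as the parameter `mode`, are hypothetical and chosen only for illustration. In a real Java deployment, implementations of such an interface could be discovered with `java.util.ServiceLoader`, but whether OCR4all does so is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ReproducibilitySketch {

    /** Hypothetical contract a third-party ATR processor would implement. */
    interface AtrProcessor {
        String id();
        String process(String input, Map<String, String> parameters);
    }

    /** One transcript entry: which processor ran, and with which parameters. */
    record Step(String processorId, Map<String, String> parameters) {}

    /** Runs processors and records every step so that a run can be repeated. */
    static final class Pipeline {
        private final List<Step> transcript = new ArrayList<>();

        String run(AtrProcessor processor, String input, Map<String, String> parameters) {
            // Copy into a sorted, read-only map so the recorded entry cannot change later.
            Map<String, String> recorded = Collections.unmodifiableMap(new TreeMap<>(parameters));
            transcript.add(new Step(processor.id(), recorded));
            return processor.process(input, parameters);
        }

        List<Step> transcript() {
            return List.copyOf(transcript);
        }
    }

    /** Toy processor standing in for a real step; it ignores its parameters. */
    static final class UppercaseProcessor implements AtrProcessor {
        @Override
        public String id() {
            return "demo-uppercase";
        }

        @Override
        public String process(String input, Map<String, String> parameters) {
            return input.toUpperCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline();
        String output = pipeline.run(new UppercaseProcessor(), "abc", Map.of("mode", "demo"));
        System.out.println(output);
        System.out.println(pipeline.transcript());
    }
}
```

Because each entry stores the processor identifier together with a copy of its parameters, replaying the transcript against the same input would repeat the run step by step.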
Modules
Data Management and Processing
- Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
- Data Sharing: Data can be shared with different users or user groups as needed.
Processors and NodeFlow
- Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D processors and external ones.
- Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically.
- NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.
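A workflow built in a graphical editor can be thought of as a directed graph whose nodes are processors and whose edges pass results from one step to the next. The following sketch shows one way to derive an execution order from such a graph using Kahn's topological sort. It is an illustration under assumptions, not NodeFlow's implementation: the class `WorkflowSketch`, its methods, and the step names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WorkflowSketch {
    // Maps each node to the nodes that consume its output.
    // LinkedHashMap keeps insertion order, which makes the result deterministic.
    private final Map<String, List<String>> successors = new LinkedHashMap<>();

    WorkflowSketch addNode(String node) {
        successors.putIfAbsent(node, new ArrayList<>());
        return this;
    }

    WorkflowSketch connect(String from, String to) {
        addNode(from);
        addNode(to);
        successors.get(from).add(to);
        return this;
    }

    /** Returns the nodes in an order where every step runs after its inputs. */
    List<String> executionOrder() {
        // Count incoming edges per node.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : successors.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> targets : successors.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Nodes without incoming edges can run immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : successors.get(node)) {
                // A successor becomes ready once all of its inputs have run.
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != successors.size()) {
            throw new IllegalStateException("Workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowSketch workflow = new WorkflowSketch()
                .connect("binarization", "segmentation")
                .connect("segmentation", "recognition");
        System.out.println(workflow.executionOrder());
    }
}
```

Rejecting cycles up front is a deliberate choice in this sketch: a processing sequence that feeds back into itself has no valid execution order.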
LAREX
- Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
- Visual Workflow Identification: As a visual explanation component, LAREX helps users compare workflow results and identify the most suitable workflow.
Datasets, Training, and Evaluation
- Dataset Creation: Datasets can be created with the option to use tagging and import functionalities.
- Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
- Model Training: Models can be trained on selected datasets or subsets thereof, then used in-app or exported together with the associated training data.
- Model Evaluation: Both trained and imported models can be evaluated on curated datasets to ensure quality and accuracy.
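Model evaluation in ATR is commonly reported as a character error rate (CER): the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The text above does not state which metrics OCR4all computes, so the following is only a sketch of that common metric, with the hypothetical names `CerSketch`, `editDistance`, and `characterErrorRate`.

```java
final class CerSketch {

    /** Levenshtein distance: minimum number of insertions, deletions, and substitutions. */
    static int editDistance(String a, String b) {
        // Only two rows of the dynamic-programming table are kept in memory.
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /**
     * Character error rate relative to the ground truth.
     * Note: this compares UTF-16 code units, not grapheme clusters, which is a
     * simplification for texts with combining characters.
     */
    static double characterErrorRate(String groundTruth, String prediction) {
        if (groundTruth.isEmpty()) {
            throw new IllegalArgumentException("Ground truth must not be empty");
        }
        return (double) editDistance(groundTruth, prediction) / groundTruth.length();
    }

    public static void main(String[] args) {
        // One substituted character out of four.
        System.out.println(characterErrorRate("Haus", "Hans"));
    }
}
```

A lower value means fewer corrections are needed; a CER of 0.25 corresponds to one wrong character in every four.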
Working with OCR4all 1.0
One Tool, Two Modes
Base Mode | Pro Mode |
---|---|
Designed for novice users, with reduced complexity and a strongly guided, linear workflow | Tailored for experienced users who require more exploration and complexity |
Pre-selected solutions for each processing step | Unrestricted access to all processors, parameters, and features |
Pre-filtered parameters and limited access to advanced features | Support for identifying the best workflows and models for specific needs |
INFO
Currently, only Pro Mode is available in the beta release. Base Mode will be added shortly.
Example Use Cases and Application Scenarios
Fully Automatic Mass Full-Text Digitization
- Objective: Maximize throughput with minimal manual effort.
- Users: Libraries and archives processing large volumes of scanned materials.
- Approach: Use Pro Mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.
Flawless Transcription of Source Material
- Objective: Achieve maximum quality, accepting significant manual effort.
- Users: Humanist researchers preparing text for a digital edition.
- Approach: Use Base Mode for iterative transcription with continually improving accuracy.
Building Corpora for Quantitative Applications
- Objective: Maximize quality while minimizing manual effort.
- Users: Researchers constructing corpora for training and evaluating quantitative methods.
- Approach: Manage data and consistently retrain source-specific or mixed models using datasets and tagging functionalities.