OCR4all 1.0 – Introduction

Motivation and General Idea

  • Availability of Solutions: Numerous high-performance open-source solutions for Automatic Text Recognition (ATR) are already available, with new releases emerging continuously.
  • Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
  • Requirement: There is a need for user-friendly frameworks that facilitate the flexible, integrable, and sustainable combination and application of both existing and future ATR solutions.
  • Objective: Our goal is to empower users to perform ATR independently, achieving high-quality results.
  • Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.

OCR-D and OCR4all

  • OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
  • Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
  • Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
  • Open Source Commitment: All results of the OCR-D project are released as fully open source.
  • OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
    • Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
    • Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.

System Architecture

  • Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
  • Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
    • Backend: Built with Java and Spring Boot.
    • Frontend: Developed using the Vue.js ecosystem.
  • Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
  • Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors.
  • Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
  • Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
  • Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.
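
The reproducibility idea above can be illustrated with a minimal sketch. It is not OCR4all's actual implementation (the backend is described as Java and Spring Boot); it only shows, in Python, how a "transcript of records" could store each processor and its parameters so that a run can be serialized and read back. The class names, field names, and processor identifiers are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ProcessorRun:
    """One processing step: which processor ran, its version, its parameters."""
    processor_id: str
    version: str
    parameters: dict


@dataclass
class TranscriptOfRecords:
    """Ordered log of processor runs, intended to be enough to repeat a workflow."""
    runs: list = field(default_factory=list)

    def record(self, processor_id, version, **parameters):
        # Copy the parameters so later changes by the caller do not alter the log.
        self.runs.append(ProcessorRun(processor_id, version, dict(parameters)))

    def to_json(self):
        # sort_keys keeps the serialized form stable for identical content.
        return json.dumps([asdict(run) for run in self.runs], sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text):
        return cls([ProcessorRun(**entry) for entry in json.loads(text)])


if __name__ == "__main__":
    transcript = TranscriptOfRecords()
    # Hypothetical processor identifiers and parameters, for illustration only.
    transcript.record("example-binarizer", "1.0.0", threshold=0.5)
    transcript.record("example-recognizer", "2.1.0", model="historical-print")
    print(transcript.to_json())
```

Storing the version alongside the parameters is a design choice of this sketch: identical parameters can yield different output once a processor is updated, so both are needed to repeat a run.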

Modules

Data Management and Processing

  • Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
  • Data Sharing: Data can be shared with different users or user groups as needed.

Processors and NodeFlow

  • Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D processors and external options.
  • Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically.
  • NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.
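
A workflow assembled in a graphical editor can be thought of as a directed graph of processing steps. The sketch below is an assumption about how such a graph might be put into a valid execution order; it does not reflect NodeFlow's internal data model, and the step names are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a processing step, each value is the set
# of steps whose output it needs. Step names are illustrative only.
WORKFLOW = {
    "binarization": set(),
    "segmentation": {"binarization"},
    "recognition_a": {"segmentation"},
    "recognition_b": {"segmentation"},
    "comparison": {"recognition_a", "recognition_b"},
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError as err:
        # A cycle means some step would have to wait on its own output.
        raise ValueError(f"workflow contains a cycle: {err.args[1]}") from err


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

The two recognition steps do not depend on each other, so their relative order is not fixed by the graph. A scheduler could run them in parallel.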

LAREX

  • Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
  • Visual Workflow Identification: As a visual explanation component, LAREX helps users identify the most suitable workflows.

Datasets, Training, and Evaluation

  • Dataset Creation: Datasets can be created, optionally using the tagging and import functionalities.
  • Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
  • Model Training: Models can be trained on selected datasets or subsets thereof, then used in-app or exported together with the associated training data.
  • Model Evaluation: Both trained and imported models can be evaluated on curated datasets to verify quality and accuracy.
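
The text above does not say which metric is used for model evaluation. A common choice for ATR is the character error rate (CER): the edit distance between the ground truth and the recognized text, divided by the length of the ground truth. The following is a minimal sketch under that assumption. The function names are illustrative and not part of OCR4all.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, prediction: str) -> float:
    """Edit distance normalized by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, prediction) / len(ground_truth)


if __name__ == "__main__":
    # One wrong character in four gives a CER of 0.25.
    print(character_error_rate("abcd", "abxd"))
```

Because insertions count as errors, the CER can exceed 1.0 when the prediction is much longer than the ground truth.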

Working with OCR4all 1.0

One Tool, Two Modes

  • Base Mode:
    • Designed for novice users, with reduced complexity and a strongly guided, linear workflow
    • Pre-selected solutions for each processing step
    • Pre-filtered parameters and limited access to advanced features
  • Pro Mode:
    • Tailored for experienced users who require more exploration and complexity
    • Unrestricted access to all processors, parameters, and features
    • Support for identifying the best workflows and models for specific needs

INFO

Currently, only the pro mode is available in the beta release. The base mode will be added shortly.

Example Use Cases and Application Scenarios

Fully Automatic Mass Full-Text Digitization

  • Objective: Maximize throughput with minimal manual effort.
  • Users: Libraries and archives processing large volumes of scanned materials.
  • Approach: Use the pro mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.

Flawless Transcription of Source Material

  • Objective: Achieve maximum quality, accepting significant manual effort.
  • Users: Humanities researchers preparing text for a digital edition.
  • Approach: Utilize the base mode for iterative transcription with continually improving accuracy.

Building Corpora for Quantitative Applications

  • Objective: Maximize quality while minimizing manual effort.
  • Users: Researchers constructing corpora for training and evaluating quantitative methods.
  • Approach: Manage data and continually retrain source-specific or mixed models using datasets and tagging functionalities.