ocrpara

ocrpara is an open-source software toolkit designed to improve the readability and searchability of OCR output by automatically detecting and reconstructing paragraph boundaries in digitized documents. It processes OCR results produced by engines such as Tesseract or OCRopus and can accept input in hOCR, ALTO, or plain text formats. The aim is to restore document structure, particularly in multi-column layouts, documents with irregular line breaks, or archival material.
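As an illustration of the kind of reconstruction described above, the following is a minimal sketch of heuristic paragraph recovery from plain-text OCR output. It is not ocrpara's actual implementation or API. The function name `reconstruct_paragraphs`, the `short_ratio` parameter, and the two heuristics (blank lines, and short lines ending in terminal punctuation) are assumptions chosen for the example.

```python
# Hypothetical sketch: NOT ocrpara's API. All names and thresholds are
# illustrative assumptions.

TERMINAL = (".", "!", "?", ":")


def reconstruct_paragraphs(text: str, short_ratio: float = 0.75) -> list[str]:
    """Group OCR text lines into paragraphs using two simple heuristics."""
    lines = [ln.strip() for ln in text.splitlines()]
    widths = [len(ln) for ln in lines if ln]
    if not widths:
        return []
    full_width = max(widths)  # longest line serves as a proxy for column width

    paragraphs: list[str] = []
    current = ""
    for ln in lines:
        if not ln:
            # Heuristic 1: a blank line always closes the open paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Rejoin a word split across a line break. This is naive: it also
            # drops the hyphen of a genuine compound that ends a line.
            current = current[:-1] + ln
        elif current:
            current += " " + ln
        else:
            current = ln
        # Heuristic 2: a short line ending in terminal punctuation is
        # treated as the last line of a paragraph.
        if len(ln) < short_ratio * full_width and ln.endswith(TERMINAL):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = (
        "The quick brown fox jumps over the la-\n"
        "zy dog and keeps running through the\n"
        "field.\n"
        "A second paragraph starts here and it\n"
        "ends here.\n"
    )
    for paragraph in reconstruct_paragraphs(sample):
        print(paragraph)
```

A sketch like this uses only line length and punctuation. Input in hOCR or ALTO also carries bounding-box coordinates, which would allow indentation and column position to be taken into account.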

Key features include paragraph segmentation using language-agnostic heuristics enhanced by machine learning models, support for multiple

The architecture is a modular pipeline with input adapters, a layout analysis component, a paragraph segmentation

Applications include digitization projects for libraries and archives, academic publishing, and government or legal document processing.

See also: Optical character recognition, hOCR, ALTO, document layout analysis, paragraph detection.
