ocrpara
ocrpara is an open-source toolkit designed to improve the readability and searchability of OCR output by automatically detecting and reconstructing paragraph boundaries in digitized documents. It processes results produced by OCR engines such as Tesseract or OCRopus and accepts input in hOCR, ALTO, or plain-text format. The aim is to restore document structure, particularly for multi-column layouts, documents with irregular line breaks, and archival material.
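As a rough illustration of the hOCR input mentioned above, the following sketch pulls text lines and their bounding boxes out of an hOCR fragment using only the Python standard library. It is not ocrpara's own adapter: the names HocrLineParser and extract_lines and the sample markup are hypothetical. The only parts taken from the hOCR format are the ocr_line class and the bbox property in the title attribute.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 36 92 580 122; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
# Void elements have no end tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrLineParser(HTMLParser):
    """Hypothetical helper: collects (bbox, text) for each ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines as (bbox tuple or None, text)
        self._depth = 0    # nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse the whitespace between word spans into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lines(hocr):
    """Return a list of (bbox, text) pairs, one per ocr_line element."""
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Made-up sample in the shape that hOCR-producing engines typically emit.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <p class='ocr_par'>
  <span class='ocr_line' title='bbox 36 92 580 122; baseline 0 -5'>
   <span class='ocrx_word' title='bbox 36 92 96 122'>Hello</span>
   <span class='ocrx_word' title='bbox 110 92 200 122'>world</span>
  </span>
  <span class='ocr_line' title='bbox 36 130 400 160'>
   <span class='ocrx_word' title='bbox 36 130 120 160'>again</span>
  </span>
 </p>
</div>
"""
```

Calling extract_lines(SAMPLE) yields one entry per line, each with the line's bounding box and its words joined by single spaces. Line geometry of this kind is what a later stage could use to decide where paragraphs begin and end.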
Key features include paragraph segmentation using language-agnostic heuristics enhanced by machine learning models, support for multiple
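To make the idea of heuristic paragraph segmentation concrete, here is a minimal sketch for plain-text OCR output. It should be read as one possible set of rules, not as ocrpara's actual heuristics, and it involves no machine learning. The function names, the short_ratio parameter, and the three rules (blank line, indented line, short line ending in sentence punctuation) are all assumptions made for illustration.

```python
def join_lines(lines):
    """Join the lines of one paragraph, undoing end-of-line hyphenation."""
    text = lines[0]
    for nxt in lines[1:]:
        if text.endswith("-") and nxt[:1].islower():
            # Assumed rule: a trailing hyphen followed by a lowercase
            # letter is a word split across lines.
            text = text[:-1] + nxt
        else:
            text = text + " " + nxt
    return text


def group_paragraphs(lines, short_ratio=0.75):
    """Group OCR text lines into paragraphs using three simple rules.

    A paragraph ends at a blank line, before an indented line, or after
    a line that is shorter than short_ratio times the longest line and
    ends with sentence-final punctuation.
    """
    widths = [len(line.rstrip()) for line in lines if line.strip()]
    if not widths:
        return []
    full_width = max(widths)  # assumption: the longest line is a full line

    paragraphs, current = [], []

    def flush():
        if current:
            paragraphs.append(join_lines(current))
            current.clear()

    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            flush()                      # blank line closes the paragraph
            continue
        if current and line[0].isspace():
            flush()                      # indentation opens a new paragraph
        current.append(line.strip())
        if len(line) < short_ratio * full_width and line.endswith((".", "!", "?")):
            flush()                      # short final line closes it
    flush()
    return paragraphs
```

The hyphenation rule is deliberately naive: it would also merge genuine compounds such as "well-" followed by "known" on the next line, which is one reason a practical tool would likely need dictionary lookups or a learned model on top of rules like these.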
The architecture is a modular pipeline with input adapters, a layout analysis component, a paragraph segmentation
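A modular pipeline of this kind can be sketched as a chain of interchangeable stages. The example below is a toy under stated assumptions: the stage roles (input adapter, layout analysis, paragraph segmentation) come from the description above, but every name in the code (Line, plain_text_adapter, reading_order, segment_by_gap, run_pipeline) is hypothetical and does not reflect ocrpara's real interfaces.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Line:
    text: str
    x0: int  # left edge, here the indent in characters (stand-in for pixels)
    y0: int  # top edge, here the row index in the source text


def plain_text_adapter(text):
    """Input adapter: turn plain text into Line records, dropping blank rows."""
    lines = []
    for row, raw in enumerate(text.splitlines()):
        stripped = raw.strip()
        if stripped:
            indent = len(raw) - len(raw.lstrip())
            lines.append(Line(stripped, indent, row))
    return lines


def reading_order(lines):
    """Layout analysis (trivial version): sort top to bottom, then left to right."""
    return sorted(lines, key=lambda line: (line.y0, line.x0))


def segment_by_gap(lines):
    """Paragraph segmentation: start a new paragraph after a vertical gap."""
    paragraphs, current, previous = [], [], None
    for line in lines:
        if previous is not None and line.y0 - previous.y0 > 1:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text)
        previous = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def run_pipeline(source, stages):
    """Feed the source through each stage in order and return the final result."""
    return reduce(lambda data, stage: stage(data), stages, source)
```

For example, run_pipeline("First line\nsecond line\n\nNew paragraph", [plain_text_adapter, reading_order, segment_by_gap]) produces two paragraphs. Because each stage is an ordinary function, swapping the adapter for an hOCR or ALTO reader would leave the later stages untouched, which is the main appeal of the modular design.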
Applications include digitization projects for libraries and archives, academic publishing, and government or legal document processing
See also: Optical character recognition, hOCR, ALTO, document layout analysis, paragraph detection.