OCRprocessed

OCRprocessed refers to the state of a document after it has been processed by an optical character recognition (OCR) system. In typical workflows, raw scanned images are fed to OCR software, which outputs OCRprocessed data consisting of machine-readable text, layout information, and metadata that describe the document's structure and provenance. The term distinguishes the processed results from the original image and from intermediate OCR attempts.

Components of OCRprocessed data usually include recognized text with character and word bounding boxes.
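
As an illustration of how recognized text and bounding boxes can be held together, the following is a minimal sketch in Python. The names `BBox`, `Word`, and `line_text_and_bbox` are hypothetical and chosen for this example only; they do not come from any particular OCR tool, and the pixel coordinate convention (left, top, right, bottom) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in pixel coordinates: left, top, right, bottom (assumed convention)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def union(self, other: "BBox") -> "BBox":
        # Smallest box that encloses both boxes.
        return BBox(
            min(self.x0, other.x0),
            min(self.y0, other.y0),
            max(self.x1, other.x1),
            max(self.y1, other.y1),
        )


@dataclass
class Word:
    """One recognized word and where it sits on the page image (hypothetical type)."""
    text: str
    bbox: BBox


def line_text_and_bbox(words: List[Word]) -> Tuple[str, BBox]:
    """Join words into a line of text and compute the box that encloses them."""
    if not words:
        raise ValueError("a line needs at least one word")
    box = words[0].bbox
    for word in words[1:]:
        box = box.union(word.bbox)
    return " ".join(w.text for w in words), box


words = [
    Word("Hello", BBox(10, 20, 60, 40)),
    Word("world", BBox(70, 18, 130, 42)),
]
text, box = line_text_and_bbox(words)
```

Keeping the box next to the text is what lets downstream tools highlight a search hit on the page image or rebuild reading order from positions.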

Formats commonly used to encode OCRprocessed output include hOCR, PAGE XML, ALTO, and JSON.
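
To make one of these formats concrete, the sketch below reads word text and bounding boxes from a small hOCR fragment using only the Python standard library. hOCR stores words as `span` elements with class `ocrx_word` and a `title` attribute carrying a `bbox x0 y0 x1 y1` property. The sample fragment, the class name `HocrWordParser`, and the helper `parse_hocr_words` are made up for this example, and the parser assumes that word spans contain plain text with no nested spans.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs from ocrx_word spans (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None
            self._in_word = True
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans hold no nested spans, so the next </span> closes the word.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def parse_hocr_words(hocr: str):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


SAMPLE = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 36 90 200 120'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 104 90 200 120; x_wconf 91'>world</span>
</span></div>"""

words = parse_hocr_words(SAMPLE)
```

PAGE XML and ALTO are XML schemas with their own element names, so a reader for them would look different; this sketch covers hOCR only.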

Applications for OCRprocessed data span digitization projects, accessibility initiatives, searchable archival repositories, and automated data extraction.

Quality and challenges: OCRprocessed quality is assessed via metrics such as character error rate (CER) and word error rate (WER).
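
Both metrics are commonly defined as the edit (Levenshtein) distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth, counted in characters for character error rate and in words for word error rate. The sketch below follows that common definition; the function names are illustrative, and real evaluations often normalize whitespace, case, or punctuation first, which this example does not do.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"
char_rate = cer(truth, ocr)  # 2 character substitutions over 19 characters
word_rate = wer(truth, ocr)  # 2 of 4 words differ
```

Because both rates divide by the reference length, an output with many spurious insertions can score above 1.0.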
