OCRprocessed
OCRprocessed refers to the state of a document after it has been processed by an optical character recognition (OCR) system. In typical workflows, raw scanned images are fed to OCR software, which outputs OCRprocessed data consisting of machine-readable text, layout information, and metadata that describe the document's structure and provenance. The term distinguishes the processed results from the original image and from intermediate OCR attempts.
Components of OCRprocessed data usually include recognized text with character and word bounding boxes, along with
Formats commonly used to encode OCRprocessed output include hOCR, PAGE XML, ALTO, and JSON. These formats preserve
Applications for OCRprocessed data span digitization projects, accessibility initiatives, searchable archival repositories, and automated data extraction
Quality and challenges: OCRprocessed quality is assessed via metrics such as character error rate and word
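As an illustration of the character error rate mentioned above, the following is a minimal sketch in Python. It assumes the common definition of the metric: the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The function names `edit_distance` and `character_error_rate` are hypothetical and not taken from any particular OCR toolkit, and real evaluations often normalize whitespace or case first, which this sketch does not do.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

For example, `character_error_rate("recognition", "recogniton")` evaluates to 1/11, roughly 0.09, because the OCR output drops one character from an 11-character reference. Under this definition the value can exceed 1.0 when the OCR output contains many spurious characters.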