PAGEXML
PAGEXML (also written PAGE XML) is an XML-based data format designed to encode the layout, content, and metadata of digital page images for optical character recognition (OCR) and document analysis. It provides a portable, interoperable representation for ground-truth annotations and evaluation across digitization projects and research.
The schema represents page structure hierarchically, allowing pages to be subdivided into regions.
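The hierarchy can be illustrated with a minimal sketch that reads a small PAGE XML fragment using Python's standard library. The sample document, its file name, IDs, coordinates, and text are invented for illustration, and the fragment is simplified rather than schema-complete (for instance, it omits the Metadata element). The namespace shown is the 2019-07-15 version of the schema; other files may declare a different version. The helper names `parse_points` and `read_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema version; real files may use another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Invented, simplified sample: one page, one text region, two text lines.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r1">
      <Coords points="100,100 1100,100 1100,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 1100,100 1100,180 100,180"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,200 1100,200 1100,280 100,280"/>
        <TextEquiv><Unicode>Second line of text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Walk Page -> TextRegion -> TextLine and collect each line's text and outline."""
    root = ET.fromstring(xml_text)
    results = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            coords = line.find("pc:Coords", NS)
            polygon = parse_points(coords.get("points")) if coords is not None else []
            results.append((region.get("id"), line.get("id"), text, polygon))
    return results


if __name__ == "__main__":
    for region_id, line_id, text, polygon in read_lines(SAMPLE):
        print(region_id, line_id, text, polygon[0])
```

Each level of the hierarchy carries its own outline polygon in a `Coords` element, so the same traversal pattern applies whether a tool needs region-level layout or line-level transcriptions.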
PAGE XML is widely used in OCR research and historical-document digitization. It supports ground-truth annotation workflows.