PubLayNet
PubLayNet is a large-scale dataset for document layout analysis. It comprises over 330,000 document images taken from scientific papers. Its primary goal is to support research on automatically recovering the structural organization of scientific articles, that is, locating and labeling regions such as titles, paragraphs, figures, tables, and lists.
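To make the idea of labeled regions concrete, the following is a minimal sketch of how region annotations might be tallied per category. It assumes the annotations follow a COCO-style object-detection JSON layout with `images`, `categories`, and `annotations` keys. The record `sample`, its file name, its bounding boxes, and the helper `count_regions_by_category` are hypothetical and purely illustrative; they are not real PubLayNet data or part of any official tooling.

```python
from collections import Counter

# Toy record in a COCO-style layout (hypothetical values, not real PubLayNet data).
# Each bbox is assumed to be [x, y, width, height] in pixels.
sample = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 80, 500, 200]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 300, 500, 150]},
        {"id": 13, "image_id": 1, "category_id": 5, "bbox": [50, 470, 500, 250]},
    ],
}


def count_regions_by_category(coco):
    """Return a dict mapping category name to the number of annotated regions."""
    # Map numeric category ids to readable names.
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    # Count one entry per annotated region.
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


region_counts = count_regions_by_category(sample)
print(region_counts)
```

With a real annotation file, the same function could be applied to the dictionary returned by `json.load`, provided the file uses the layout assumed here.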
The dataset was created by extracting document images from PDF files found in the PubMed Central Open
PubLayNet is particularly useful for developing models that can parse complex documents, which is a crucial