Datasettiintext
Datasettiintext is a publicly available multilingual text corpus intended for natural language processing research and model benchmarking. The project aggregates text from diverse, publicly accessible sources to support tasks such as language modeling, text classification, and cross-lingual transfer.
Content and format: The collection comprises tens of languages with a focus on broad domain coverage including
Acquisition and licensing: Data are gathered from sources that permit redistribution; where possible, licenses are preserved
Access and governance: Datasettiintext is hosted by an open data repository with versioned releases. Access is
Limitations and ethics: The corpus reflects biases inherent in its sources and may underrepresent some languages