textsbut
textsbut is a framework and collection of tools designed for auditing and evaluating text data and language models. It provides standardized methods to assess data provenance, content quality, and the alignment between datasets and their documentation. The project emphasizes reproducibility, transparency, and modularity in NLP experimentation.
The name textsbut combines 'texts' with the contrastive conjunction 'but', signaling its focus on examining exceptions, inconsistencies, and mismatches between datasets and the claims made about them.
The concept emerged from concerns about data integrity in NLP workflows. An early prototype appeared in 2022.
Key components include a data catalog that records dataset provenance, licensing, languages, and annotations; an evaluation layer for assessing content quality and model behavior; and reporting utilities that tie results back to dataset documentation.
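The catalog component described above can be illustrated with a minimal sketch. The `DatasetRecord` class and its field names here are hypothetical illustrations of the kind of provenance metadata the catalog records, not textsbut's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a catalog entry; class and field names are
# illustrative, not part of textsbut's actual API.
@dataclass
class DatasetRecord:
    name: str
    provenance: str                       # where the data came from
    license: str                          # e.g. an SPDX identifier
    languages: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

record = DatasetRecord(
    name="example-corpus",
    provenance="news articles collected in 2021",
    license="CC-BY-4.0",
    languages=["en", "de"],
    annotations=["sentiment"],
)
print(record.license)  # -> CC-BY-4.0
```

A real catalog would persist such records (e.g. as JSON) so that audits can be rerun reproducibly against the same metadata.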
Applications cover dataset auditing, model evaluation, and content moderation workflows. Users can compare datasets against their documentation, flag discrepancies in provenance or licensing records, and run standardized quality checks on text corpora.
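One such documentation check can be sketched as follows; the function name is hypothetical and stands in for whatever comparison routine a given workflow uses. It flags languages observed in a corpus that its documentation does not declare.

```python
# Hypothetical audit check: compare languages observed in a dataset
# against the languages its documentation declares. The function name
# is illustrative, not part of textsbut's actual API.
def undocumented_languages(documented, observed):
    """Return languages present in the data but absent from the docs."""
    return sorted(set(observed) - set(documented))

docs_claim = ["en", "de"]                  # what the documentation says
corpus_langs = ["en", "de", "fr", "en"]    # what the corpus contains
missing = undocumented_languages(docs_claim, corpus_langs)
print(missing)  # -> ['fr']
```

The same set-difference pattern extends to licensing fields, annotation schemes, or any other metadata the catalog records.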
Reception has been mixed: scholars praise the emphasis on provenance and reproducibility, while critics point to the effort required to maintain detailed catalog metadata in practice.