SuperGLUE
SuperGLUE is a benchmark for evaluating natural language understanding (NLU) systems. Released in 2019 as a more challenging successor to GLUE, it tests a model’s general language understanding and reasoning across diverse tasks. It aims to push progress beyond pattern matching on a single task and to better reveal true language understanding in large pre-trained models.
The suite includes eight tasks: BoolQ, CB (CommitmentBank), COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. BoolQ is
Evaluation scores each task with its own metric (often accuracy, with some tasks using other measures such as F1 or exact match).
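As a rough illustration of per-task scoring, the sketch below computes accuracy and macro-averaged F1 on toy labels and then combines task scores into a single number. It assumes the usual aggregation convention (average the metrics within a task, then average across tasks). The function names, the toy labels, and the 0.5 RTE score are hypothetical and are not taken from any official evaluation code.

```python
from statistics import mean


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def overall_score(task_metrics):
    """Average the metrics within each task, then average across tasks.

    task_metrics maps a task name to a dict of {metric name: value}.
    This aggregation rule is an assumption of the sketch.
    """
    return mean(mean(metrics.values()) for metrics in task_metrics.values())


if __name__ == "__main__":
    # Hypothetical three-class labels, in the style of a CB-like task.
    gold = ["entailment", "contradiction", "neutral", "entailment"]
    pred = ["entailment", "neutral", "neutral", "entailment"]

    results = {
        "CB": {"accuracy": accuracy(gold, pred), "f1": macro_f1(gold, pred)},
        "RTE": {"accuracy": 0.5},  # placeholder value, not a real result
    }
    print(results["CB"])
    print(overall_score(results))
```

Averaging within a task first keeps a task that reports two metrics from counting twice in the overall score.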
Impact and limitations: SuperGLUE has spurred progress in model capabilities and the development of more