SuperGLUE

SuperGLUE is a benchmark for evaluating natural language understanding (NLU) systems. Released in 2019 as a more challenging successor to GLUE, it tests general language understanding and reasoning across a diverse set of tasks, aiming to push progress beyond single-task pattern matching and to better reveal genuine language understanding in large pre-trained models.

The suite includes eight tasks: BoolQ, CB (CommitmentBank), COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. BoolQ is a yes/no question-answering task based on a short passage. CB tests whether a speaker is committed to the truth of a clause embedded in a statement. COPA evaluates causal reasoning by selecting plausible causes or effects of a given event. MultiRC provides multi-question reading comprehension over a single passage. ReCoRD is a cloze-style reading-comprehension task requiring commonsense reasoning with external knowledge. RTE is a small natural language inference task. WiC tests whether a word has the same meaning in two sentences. WSC is the Winograd Schema Challenge, a pronoun-resolution task requiring world knowledge.

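For readers who want to inspect the data, the tasks are available through common dataset hubs. Below is a minimal sketch using the Hugging Face `datasets` library; the `super_glue` configuration names are that library's convention, not part of the benchmark itself, and exact loading behavior can vary across library versions.

```python
# Sketch of loading SuperGLUE tasks, assuming the Hugging Face
# `datasets` library (`pip install datasets`).
from datasets import load_dataset

# One configuration name per task, following the library's conventions.
TASKS = ["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

# Each task ships train/validation/test splits; test labels are withheld
# for the leaderboard.
boolq = load_dataset("super_glue", "boolq")
print({split: len(boolq[split]) for split in boolq})

# A BoolQ item pairs a short passage with a yes/no question and a 0/1 label.
example = boolq["train"][0]
print(example["question"], "->", example["label"])
```
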
Evaluation uses per-task metrics appropriate to each task (accuracy for most, with F1 and exact match for some) and reports an overall composite score as the mean of per-task results; for tasks that report two metrics, the two are averaged first. The benchmark provides training, development, and held-out test sets, with a public leaderboard for submissions.

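As a sketch of how the composite is formed, assuming per-task scores are already in hand (the numbers below are placeholders, not reported results):

```python
# Sketch of the SuperGLUE composite: average each task's metrics,
# then take the unweighted mean across tasks. All values here are
# placeholders, not real leaderboard results.
task_scores = {
    "boolq":   [77.0],          # accuracy
    "cb":      [84.0, 76.0],    # accuracy, macro-F1 (averaged together)
    "copa":    [71.0],          # accuracy
    "multirc": [70.0, 24.0],    # F1a, exact match
    "record":  [72.0, 71.0],    # F1, exact match
    "rte":     [72.0],          # accuracy
    "wic":     [70.0],          # accuracy
    "wsc":     [64.0],          # accuracy
}

# First average within a task, then average across the eight tasks.
per_task = {task: sum(m) / len(m) for task, m in task_scores.items()}
overall = sum(per_task.values()) / len(per_task)
print(f"SuperGLUE score: {overall:.1f}")
```
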
Impact and limitations: SuperGLUE has spurred progress in model capabilities and driven the development of more robust NLU systems. Critics note its English-centric scope, potential overfitting to benchmark patterns, and ongoing concerns about whether improvements translate to general, real-world language understanding.
