UDcorpora

UDcorpora is the collection of annotated corpora produced under the Universal Dependencies (UD) project. It provides multilingual corpora annotated with universal part-of-speech tags, lemmas, and dependency relations, all following the UD annotation guidelines. The aim is to enable cross-linguistic comparison and support research in natural language processing, corpus linguistics, and language typology.

Content and format: UDcorpora encompasses a wide range of treebanks across many languages and dialects, including English, Spanish, Chinese, Russian, Arabic, and others. The data are distributed in the CoNLL-U format, with metadata detailing language, dataset source, version, and annotation guidelines. Updates add new languages, annotation revisions, and corrections to maintain consistency and coverage.

Access and licensing: UDcorpora is released under an open license and is freely accessible through the UD website and related repositories. The data are maintained by the UD Consortium and contributing institutions, with versioned releases to ensure reproducibility and traceability of experiments.

Usage and significance: The corpora are used to train and evaluate dependency parsers, perform cross-linguistic analyses, and benchmark parsing systems. UDcorpora serves as a standard resource for comparing parsing performance across languages under a common annotation scheme, supporting both research and teaching activities.

Relation to UD resources: UDcorpora is integral to the UD project, designed to be compatible with UD guidelines and linked to other UD resources such as UD Core and language-specific treebanks.
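
To make the CoNLL-U format concrete, the following is a minimal sketch of a reader written with the Python standard library only. It relies on three properties of the format: each token line has ten tab-separated columns, comment lines begin with "#", and a blank line ends a sentence. The function name read_conllu, the FIELDS list, and the SAMPLE sentence are illustrative assumptions made up for this example; they are not part of any UD tool or treebank.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns of a CoNLL-U token line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[Dict]:
    """Yield one dict per sentence: comment metadata plus token rows.

    Hypothetical helper for illustration; it does not validate the
    annotation itself, only the number of columns per token line.
    """
    meta: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"meta": meta, "tokens": tokens}
            meta, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines of the form "# key = value" carry metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            tokens.append(dict(zip(FIELDS, cols)))
    # Emit the last sentence if the text lacks a trailing blank line.
    if tokens:
        yield {"meta": meta, "tokens": tokens}


# Made-up sample sentence, not taken from any released treebank.
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        print(sentence["meta"].get("sent_id"))
        for tok in sentence["tokens"]:
            print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

The sketch keeps every column as a string, so multiword-token ranges (IDs such as 1-2) and empty nodes (IDs such as 1.1) pass through unchanged. Code that needs integer heads or parsed features would have to convert those values itself.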

See also: Universal Dependencies, CoNLL-U, and language treebanks.