treebank

A treebank is a linguistic resource: a corpus of text that has been annotated with syntactic structure. In most treebanks, sentences are annotated with either constituency parse trees or dependency graphs, often accompanied by part-of-speech tags, named entities, morphological features, and sometimes semantic roles. Treebanks are used to study syntax and to train and evaluate parsing algorithms.

The best-known example is the Penn Treebank (PTB), created at the University of Pennsylvania and released through the Linguistic Data Consortium in the 1990s. Its Wall Street Journal section, annotated with constituency trees, became a standard benchmark for statistical parsers. Other well-known treebanks include the Chinese Treebank, the Prague Dependency Treebank (PDT), and the multilingual Universal Dependencies (UD) collection, which provides cross-linguistic dependency annotations.

Treebanks differ in their annotation schemes, domains, sizes, and licensing. PTB uses bracketed constituency representations; PDT uses rich dependency annotations with functional relations; UD applies a single universal dependency scheme across languages. Treebank projects typically publish annotation guidelines, inter-annotator agreement measures, and revision histories to ensure consistency and reproducibility.

Treebanks are created through manual annotation guided by formal guidelines, often aided by automatic pre-annotation followed by human correction. They are used to train and evaluate parsers, study syntactic phenomena, perform cross-linguistic comparison, and support downstream NLP tasks such as information extraction and machine translation. Availability ranges from open-access corpora to licensed datasets.
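
As an illustration of the bracketed constituency representation that PTB uses, the following is a minimal Python sketch that reads one bracketed tree into nested tuples and recovers its words and part-of-speech tags. The function names (read_tree, leaves, pos_tags) and the example sentence are hypothetical and are not taken from any treebank or library; the sketch also assumes a simplified notation without the trace and function-tag conventions found in real PTB files.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a (label, children) pair.
Tree = Union[str, Tuple[str, list]]


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def _parse(tokens: List[str], pos: int) -> Tuple[Tree, int]:
    """Parse one bracketed node starting at tokens[pos]; return (node, next position)."""
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    pos += 1
    # Some files wrap each sentence in an unlabeled outer bracket: "( (S ...))".
    label = ""
    if pos < len(tokens) and tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children: list = []
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word at the leaf level
            pos += 1
    if pos >= len(tokens):
        raise ValueError("unbalanced brackets")
    return (label, children), pos + 1  # skip the closing ')'


def read_tree(text: str) -> Tree:
    """Read a single bracketed tree from a string."""
    tokens = tokenize(text)
    tree, end = _parse(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of the sentence in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


def pos_tags(tree: Tree) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs from preterminal nodes such as (NN cat)."""
    if isinstance(tree, str):
        return []
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs: List[Tuple[str, str]] = []
    for child in children:
        pairs.extend(pos_tags(child))
    return pairs


# Invented example sentence, for illustration only.
EXAMPLE = "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"

if __name__ == "__main__":
    tree = read_tree(EXAMPLE)
    print(leaves(tree))
    print(pos_tags(tree))
```

Nested tuples keep the sketch free of dependencies; a dependency treebank such as UD would instead be read as one row per token with a head index and a relation label.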