Tagset

A tagset is a defined collection of tags used to label items in a data annotation scheme. In linguistics and natural language processing, a tagset commonly labels words with grammatical categories, such as part of speech, and may also encode additional morphosyntactic features like tense, number, or case. In markup and data representation, the term refers to the set of tags or element names that a language, tool, or standard recognizes.
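As a minimal sketch of the idea of a tagset as a closed collection of labels, the Python snippet below represents a small coarse tagset as a set and checks an annotated sentence against it. The tag names, the `COARSE_TAGSET` constant, and the `invalid_tags` helper are illustrative assumptions, not part of any particular standard.

```python
# Hypothetical coarse tagset; the tag names are illustrative only.
COARSE_TAGSET = frozenset({"NOUN", "VERB", "ADJ", "DET", "PUNCT"})


def invalid_tags(tagged_tokens, tagset):
    """Return the (token, tag) pairs whose tag is not in the tagset."""
    return [(tok, tag) for tok, tag in tagged_tokens if tag not in tagset]


# A sentence annotated as (token, tag) pairs.
sentence = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")]

# An empty result means every label belongs to the tagset.
problems = invalid_tags(sentence, COARSE_TAGSET)
```

A check of this kind is one simple way an annotation tool might reject labels that fall outside the agreed scheme.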

In NLP, tagsets are central to corpus annotation, tagging, parsing, and information extraction. They vary in granularity from coarse schemes that distinguish broad classes (nouns, verbs, adjectives) to fine-grained sets that differentiate subtypes (proper vs. common noun, past vs. present tense verb). Well-known examples include the Penn Treebank tagset, with about 45 POS tags, and the Universal Dependencies tagset, which is designed to be cross-linguistically applicable and compact. The choice of tagset affects tagging accuracy, interoperability, and the transfer of models across languages. Design goals typically balance coverage with consistency and ease of annotation.

Tagset design also involves practical considerations such as annotation guidelines, training costs, and inter-annotator agreement. Researchers often provide conversion mappings between tagsets to enable comparisons or cross-system applications. Beyond linguistic tagging, tagsets appear in markup languages, where a fixed set of element names defines a document’s structure, and in information extraction for labeling named entities, relations, or events.
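The conversion mappings between tagsets mentioned above can be sketched as a simple lookup table. The Python example below maps a handful of Penn Treebank tags to Universal Dependencies part-of-speech tags. The table is deliberately partial and simplified: real conversions often depend on context (for example, some Penn Treebank tags correspond to more than one Universal Dependencies tag), and the `PTB_TO_UPOS` name, the `convert_tags` helper, and the choice of `X` as a fallback are assumptions made for illustration.

```python
# Partial, simplified mapping from Penn Treebank tags to Universal
# Dependencies POS tags. Illustrative only; not a complete conversion.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "DT": "DET", "PRP": "PRON",
    "CC": "CCONJ", "CD": "NUM", ".": "PUNCT",
}


def convert_tags(tagged_tokens, mapping, unknown="X"):
    """Map each (token, tag) pair to the target tagset.

    Tags missing from the mapping fall back to `unknown` (an assumption
    of this sketch, not a rule of either tagset).
    """
    return [(tok, mapping.get(tag, unknown)) for tok, tag in tagged_tokens]


ptb_sentence = [("Mary", "NNP"), ("saw", "VBD"), ("two", "CD"),
                ("dogs", "NNS"), (".", ".")]
coarse = convert_tags(ptb_sentence, PTB_TO_UPOS)
```

The mapping is many-to-one: several fine-grained verb tags collapse into a single `VERB` label, so distinctions such as past vs. present tense are lost in the conversion and cannot be recovered from the coarse tags alone.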