Tagsets

A tagset is a collection of labels used to annotate linguistic data in corpora and natural language processing. Each tag represents a category such as a part of speech, a morphological feature, a named entity class, or a semantic role, and a token may receive one or more tags depending on the scheme. Tagsets enable consistent annotation, data retrieval, and linguistic analysis across corpora and NLP systems.

Tagsets vary in granularity and language focus. Part-of-speech tagsets range from coarse to fine-grained classifications. Morphological tagsets encode features such as tense, number, gender, case, and aspect. Named-entity tagsets distinguish types like person, organization, and location. Some tagsets are language-specific, while others aim for cross-linguistic compatibility. An example of the latter is the Universal Dependencies POS tagset, which provides a common set of tags and morphological features usable across languages.

Notable examples include the Penn Treebank II POS tagset, widely used for English corpora, and the Brown Corpus tagset. Universal Dependencies offers language-agnostic morpho-syntactic features. For named entities, tagsets used in CoNLL and related shared tasks often define categories like PER, LOC, and ORG. Tagging schemes such as BIO or IOB are commonly employed to mark multi-token entities within a sequence.

Design considerations for tagsets include choosing the desired level of granularity, providing precise and publicly available definitions, documenting guidelines, and versioning. Interoperability is enhanced by adopting standard tagsets or mappings between tagsets. Tagsets play a central role in corpus annotation, evaluation of automatic taggers, and cross-language comparative linguistics.
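
As a rough illustration of a mapping between tagsets, the sketch below converts a handful of Penn Treebank POS tags to Universal Dependencies coarse tags. The table is deliberately partial, and the names PTB_TO_UPOS and to_upos are assumptions made for this example, not part of either standard; a real conversion would need the full tag inventory and context-sensitive rules.

```python
# Illustrative and deliberately partial mapping from Penn Treebank POS tags
# to Universal Dependencies coarse (UPOS) tags. The table name and the helper
# below are hypothetical, introduced only for this sketch.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "DT": "DET", "CC": "CCONJ", "CD": "NUM",
    "PRP": "PRON", "UH": "INTJ",
}


def to_upos(tagged, fallback="X"):
    """Map (token, PTB tag) pairs to (token, UPOS tag) pairs.

    Tags missing from the table fall back to "X", the UPOS tag for "other".
    """
    return [(token, PTB_TO_UPOS.get(tag, fallback)) for token, tag in tagged]


sentence = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print(to_upos(sentence))
# [('The', 'DET'), ('dogs', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```

A mapping of this kind is lossy in both directions: several fine-grained tags collapse into one coarse tag, and a context-free lookup cannot, for instance, separate auxiliaries (AUX in Universal Dependencies) from main verbs, since both carry verb tags in the Penn Treebank scheme.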