Tagsets

A tagset is a collection of labels used to annotate linguistic data in corpora and in natural language processing (NLP). Each tag represents a category such as a part of speech, a morphological feature, a named entity class, or a semantic role, and a token may receive one or more tags depending on the scheme. Tagsets enable consistent annotation, data retrieval, and linguistic analysis across corpora and NLP systems.
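As a rough illustration of these ideas, the Python sketch below models a tagset as a fixed set of labels, lets each token carry one tag per annotation layer, and shows how the tagset supports both consistency checking and retrieval. The `Token` class, the layer names `pos` and `entity`, and the helper functions are hypothetical names chosen for this example. The part-of-speech labels follow Penn Treebank conventions (`DT`, `NN`, `NNP`, `VBZ`), while the entity labels are a simplified assumption, and neither inventory is complete.

```python
from dataclasses import dataclass, field

# Illustrative tagsets (assumptions for this sketch, not complete inventories).
POS_TAGS = frozenset({"DT", "NN", "NNP", "VBZ"})
ENTITY_TAGS = frozenset({"PERSON", "O"})


@dataclass
class Token:
    text: str
    # Maps an annotation layer name (e.g. "pos") to the tag on that layer.
    tags: dict = field(default_factory=dict)


def validate(tokens, tagsets):
    """Return (text, layer, tag) triples whose tag is not in the layer's tagset."""
    errors = []
    for tok in tokens:
        for layer, tag in tok.tags.items():
            if tag not in tagsets.get(layer, frozenset()):
                errors.append((tok.text, layer, tag))
    return errors


def find(tokens, layer, tag):
    """Retrieve the text of every token carrying `tag` on `layer`."""
    return [tok.text for tok in tokens if tok.tags.get(layer) == tag]


TAGSETS = {"pos": POS_TAGS, "entity": ENTITY_TAGS}

sentence = [
    Token("Ada", {"pos": "NNP", "entity": "PERSON"}),
    Token("writes", {"pos": "VBZ", "entity": "O"}),
    Token("a", {"pos": "DT", "entity": "O"}),
    Token("program", {"pos": "NN", "entity": "O"}),
]

print(validate(sentence, TAGSETS))   # no out-of-tagset labels expected here
print(find(sentence, "pos", "NN"))   # tokens tagged as singular common nouns
```

Keeping each layer in its own closed set means an annotation that uses an unknown label is caught by `validate` rather than silently entering the corpus, which is one way a tagset enforces consistent annotation.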

Tagsets vary in granularity and language focus. Part-of-speech tagsets range from coarse to fine-grained classifications. Morphological

Notable examples include the Penn Treebank II POS tagset, widely used for English corpora, and the Brown

Design considerations for tagsets include choosing the desired level of granularity, providing precise and publicly available
