Subword

Subword refers to any linguistic or computational unit smaller than a word. In linguistics, subword units include morphemes (the smallest meaningful units, such as prefixes and suffixes) as well as syllables and other segments used in analyses of word formation. In computing and natural language processing, the term usually denotes a unit produced by subword tokenization: a token that is not a full word but a shorter piece of one. Subword modeling helps handle languages with rich morphology, as well as out-of-vocabulary words, by representing text with a fixed vocabulary of subword units rather than a fixed set of whole words.

Subword tokenization methods such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece learn a vocabulary of subword units from data. They typically start with characters or short symbols and iteratively merge frequently co-occurring sequences to form longer units, stopping at a predefined vocabulary size. The result is that many words can be built from combinations of subword units, enabling the model to form representations for unseen words.
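
To make the merge procedure concrete, below is a minimal sketch of BPE vocabulary learning in Python. It is an illustration, not any particular library's implementation: real tokenizers add word-boundary markers, byte-level fallbacks, and much faster pair counting. The function name learn_bpe_merges and the corpus format (a plain list of words) are assumptions made here for clarity.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (illustrative sketch)."""
    # Each distinct word starts as a tuple of single characters,
    # weighted by how often the word appears in the corpus.
    word_freqs = Counter(corpus)
    words = {word: tuple(word) for word in word_freqs}

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for word, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break  # every word is a single symbol; nothing left to merge

        # Merge the most frequent pair into one longer symbol everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        for word, symbols in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            words[word] = tuple(out)
    return merges
```

On a toy corpus such as ["low", "lower", "lowest"], the first two learned merges are ('l', 'o') and then ('lo', 'w'), since those adjacent pairs occur most often; the merge count plays the role of the predefined vocabulary-size budget.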

For example, the word "unhappiness" might be tokenized as ["un", "happiness"] if "happiness" is in the vocabulary, or as ["un", "hap", "pi", "ness"] under a more granular scheme.
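
One simple way to apply such a vocabulary at tokenization time is greedy longest-match-first segmentation, in the spirit of WordPiece (BPE proper instead replays its learned merges in order). The sketch below, with the assumed helper name segment, reproduces the two splittings of "unhappiness" from the example above:

```python
def segment(word, vocab):
    """Split a word into the longest vocabulary entries, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking the window
        # until a vocabulary entry matches (or only one character is left,
        # assuming single characters are always acceptable fallbacks).
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(segment("unhappiness", {"un", "happiness"}))          # ['un', 'happiness']
print(segment("unhappiness", {"un", "hap", "pi", "ness"}))  # ['un', 'hap', 'pi', 'ness']
```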

Advantages and limitations

Benefits include smaller vocabularies, better handling of neologisms and agglutinative languages, and improved robustness in NLP models. Limitations include potential mismatch with linguistic morphology and the need for large, representative training data to learn effective subword units.
