BPEs
BPEs, short for Byte Pair Encodings, are a family of subword tokenization algorithms used in natural language processing. They derive from a data compression technique of the same name and are used to construct a fixed-size vocabulary of subword units by repeatedly merging the most frequent adjacent symbol pairs in a corpus. The process typically starts with a vocabulary of individual characters (or bytes in byte-level variants). The most frequent pair of neighboring symbols is merged to form a new symbol, and this continues until a target vocabulary size is reached. The resulting sequence of merges defines the subword units used to tokenize text.
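To make the training loop concrete, the following is a minimal Python sketch of the merge-learning step on a toy whitespace-split corpus. The function name learn_bpe_merges and the toy corpus are illustrative assumptions; real implementations add word-boundary markers, frequency thresholds, and byte-level handling.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus (illustrative sketch)."""
    # Start with each word as a sequence of individual characters.
    word_freqs = Counter(corpus.split())
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Merge the most frequent pair into a single new symbol everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges("low lower lowest new newer", num_merges=5)
print(merges)  # learned merge rules, most frequent pairs first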
For encoding, new text is first split into individual characters (or bytes) and the learned merges are then applied, typically in the order they were learned, so that frequent character sequences collapse into longer subword units.
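As an illustration of the encoding step, the sketch below applies a hand-picked list of merge rules, in order, to a single word. The merge list and the helper name bpe_encode are hypothetical; production tokenizers layer vocabulary lookups, caching, and byte-level fallbacks on top of this basic loop.

```python
def bpe_encode(word, merges):
    """Encode one word by applying merge rules in their learned order (sketch)."""
    symbols = list(word)
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Replace each occurrence of the pair (left, right) with the merged symbol.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Illustrative merge rules; in practice these come from training as above.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", toy_merges))  # ['low', 'er']
```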
BPEs have become widely used in NLP because they mitigate out-of-vocabulary issues while keeping the vocabulary size bounded: rare or unseen words can still be decomposed into known subword units or individual characters (or bytes).
Limitations include dependence on the training corpus and domain, and the potential creation of overly small or opaque subword units that do not align with meaningful linguistic boundaries.