BPE-based

BPE-based systems rely on byte-pair encoding to represent text as sequences of subword units. The approach arose from the need to balance vocabulary size with the ability to represent rare or unknown words in neural models. Byte-pair encoding was adapted for NLP in 2016 for neural machine translation and has since become a standard part of many tokenizers. In BPE-based tokenization, the model uses a fixed vocabulary of subword units learned from data, enabling the composition of words from these units.

How it works: Start with characters as symbols; repeatedly merge the most frequent adjacent symbol pair in the training corpus; after a predefined number of merges, stop and use the resulting set of symbols as the vocabulary. Word tokens are then segmented into the longest matching subword units from the vocabulary, with rare or unseen words broken into multiple subunits. This yields a compact vocabulary while providing enough granularity to reconstruct most words.
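
A minimal sketch of this procedure in Python, assuming a toy word-frequency corpus and a small merge budget (the corpus, function names, and merge count are illustrative, not taken from any particular tokenizer library):

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a word-frequency dict; return the symbol vocabulary."""
    # Represent each word as a tuple of single-character symbols.
    corpus = {tuple(w): freq for w, freq in word_counts.items()}
    vocab = {ch for word in corpus for ch in word}
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in corpus.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merged = a + b
        vocab.add(merged)
        # Replace every occurrence of the pair with the merged symbol.
        new_corpus = {}
        for syms, freq in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return vocab

def segment(word, vocab):
    """Greedily split a word into the longest matching vocabulary units."""
    units, i = [], 0
    while i < len(word):
        # Take the longest prefix of the remaining text that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                units.append(word[i:j])
                i = j
                break
        else:
            units.append(word[i])   # fall back to a single character
            i += 1
    return units

word_counts = {"lower": 5, "lowest": 3, "newer": 6, "wider": 2}  # toy corpus
vocab = learn_bpe(word_counts, num_merges=10)
print(segment("lowest", vocab))
```

The segmenter follows the longest-match rule described above; many BPE implementations instead replay the learned merges in order at tokenization time, which can produce slightly different splits.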

Advantages and limitations: BPE-based tokenization reduces out-of-vocabulary problems and supports efficient modeling of morphologically rich text by breaking words into meaningful subunits. It is deterministic given the training data and hyperparameters. However, segmentation can vary with different corpora or settings, and the chosen vocabulary size influences model performance. Some languages or domains may require careful corpus selection or alternative schemes such as unigram or WordPiece-based tokenizers.
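
Continuing the toy sketch above, the out-of-vocabulary benefit can be seen by segmenting a word that never appears in the training corpus; it still decomposes into known subword units (the exact split depends on the learned merges):

```python
# "newest" does not occur in the toy corpus, but its pieces do, so the
# segmenter never emits an unknown token, only smaller known units.
print(segment("newest", vocab))
```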

Applications and related concepts: BPE-based tokenizers are widely used in modern large language models and machine translation systems. They are often referred to simply as BPE tokenizers and are contrasted with WordPiece or unigram-based approaches. Some implementations use a byte-level variant that operates on bytes rather than characters, which can simplify handling of arbitrary text and reduce pretokenization concerns.
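
As a rough illustration of the byte-level idea (a simplified sketch, not the exact scheme of any specific tokenizer), the input is first mapped to its UTF-8 bytes, so the base alphabet is always the 256 possible byte values and merges are learned over those:

```python
# Byte-level starting point: every input string, including emoji and rare
# scripts, becomes a sequence drawn from a fixed alphabet of 256 byte values.
text = "naïve 🙂"
byte_symbols = [bytes([b]) for b in text.encode("utf-8")]
print(byte_symbols)  # multi-byte characters appear as several byte symbols
# BPE merges are then learned over these byte symbols exactly as over characters.
```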