Bytepair
Byte pair encoding (BPE) is a data compression and text processing technique that builds a vocabulary of subword units by iteratively merging the most frequent adjacent symbol pairs in a corpus. The process begins with the input text represented as a sequence of basic symbols, typically bytes or characters. In each step, every occurrence of the most frequent pair of adjacent symbols is replaced by a new symbol that represents that pair. This step is repeated until a predefined vocabulary size or number of merges is reached. The resulting vocabulary consists of both the original symbols and the newly created merged symbols, collectively called tokens; text is encoded as a sequence of these tokens.
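The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it works on characters rather than bytes, names each merged symbol by concatenating its two parts, breaks ties in favor of the pair seen first, and stops early when no pair occurs more than once. The function names (count_pairs, merge_pair, train_bpe, encode) are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_pairs(seq):
    """Count every adjacent pair of symbols in the sequence."""
    return Counter(zip(seq, seq[1:]))


def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return (merges, encoded training text)."""
    seq = list(text)  # assumption: start from characters, not bytes
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seq)
        if not pairs:
            break
        # Assumption: ties go to the pair that appeared first in the sequence.
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # assumption: a pair seen only once is not worth merging
        seq = merge_pair(seq, best, best[0] + best[1])
        merges.append(best)
    return merges, seq


def encode(text, merges):
    """Encode new text by replaying the learned merges in training order."""
    seq = list(text)
    for left, right in merges:
        seq = merge_pair(seq, (left, right), left + right)
    return seq


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)   # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(tokens)   # ['aaab', 'd', 'aaab', 'a', 'c']
print(encode("aaab", merges))  # ['aaab']
```

Recounting all pairs after every merge keeps the sketch short but is slow on large corpora; practical implementations typically update the pair counts incrementally. A different tie-breaking rule would yield a different, equally valid merge sequence.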
Origin and usage: The algorithm was introduced by Philip Gage in 1994 as a data compression method.
Variants and notes: BPE can operate on bytes (byte-level BPE) or on characters; byte-level variants help handle
Limitations: The merges BPE learns depend on the training data, so a vocabulary built from one corpus may not generalize to other domains or languages.