BPEs
BPEs, short for Byte Pair Encodings, are a family of subword tokenization algorithms used in natural language processing. They derive from a data compression technique of the same name and are used to construct a fixed-size vocabulary of subword units by repeatedly merging the most frequent adjacent symbol pairs in a corpus. The process typically starts with a vocabulary of individual characters (or bytes in byte-level variants). The most frequent pair of neighboring symbols is merged to form a new symbol, and this continues until a target vocabulary size is reached. The resulting sequence of merges defines the subword units used to tokenize text.
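To make the training loop concrete, the following is a minimal Python sketch of the merge-learning step on a toy whitespace-split corpus. The function name learn_bpe_merges and the toy corpus are illustrative assumptions; real implementations add word-boundary markers, frequency thresholds, and byte-level handling.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus (illustrative sketch)."""
    # Start with each word as a sequence of individual characters.
    word_freqs = Counter(corpus.split())
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Merge the most frequent pair into a single new symbol everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges("low lower lowest new newer", num_merges=5)
print(merges)  # learned merge rules, most frequent pairs first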
For encoding, new text is first split into individual characters (or bytes) and the learned merges are then applied, typically in the order they were learned, so that frequent character sequences collapse into longer subword units.
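As an illustration of the encoding step, the sketch below applies a hand-picked list of merge rules, in order, to a single word. The merge list and the helper name bpe_encode are hypothetical; production tokenizers layer vocabulary lookups, caching, and byte-level fallbacks on top of this basic loop.

```python
def bpe_encode(word, merges):
    """Encode one word by applying merge rules in their learned order (sketch)."""
    symbols = list(word)
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Replace each occurrence of the pair (left, right) with the merged symbol.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Illustrative merge rules; in practice these come from training as above.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", toy_merges))  # ['low', 'er']
```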
BPEs have become widely used in NLP because they mitigate out-of-vocabulary issues while keeping the vocabulary size bounded: rare or unseen words can still be decomposed into known subword units or individual characters (or bytes).
Limitations include dependence on the training corpus and domain, and the potential creation of overly small or opaque subword units that do not align with meaningful linguistic boundaries.