WordPieces
WordPieces are a subword tokenization method used in natural language processing. They represent a compromise between word-level and character-level tokenization. Instead of treating each word as a single unit, WordPieces break down rare or unknown words into smaller, meaningful subword units. This approach allows models to handle a larger vocabulary and better understand out-of-vocabulary words by composing them from known subwords.
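As a concrete illustration, the sketch below uses the Hugging Face transformers library and its pretrained bert-base-uncased vocabulary (both of which are assumptions, not something this article references) to show a rare word being rebuilt from known subword pieces.

```python
# A minimal sketch, assuming the Hugging Face `transformers` package is installed;
# any WordPiece implementation behaves similarly.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare word is decomposed into pieces that do exist in the vocabulary.
# The exact split depends on the learned vocabulary; it may look something
# like ["un", "##afford", "##able"], where "##" marks continuation pieces.
print(tokenizer.tokenize("unaffordable"))
```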
The process of creating a WordPiece vocabulary typically involves a greedy algorithm. It starts with a base vocabulary of individual characters (plus any special tokens) and repeatedly merges the pair of adjacent units that most improves the likelihood of the training corpus; in practice this means picking the pair with the highest score, freq(pair) / (freq(first) × freq(second)), which favors pairs whose parts rarely occur apart. Merging continues until the vocabulary reaches a target size.
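The toy sketch below illustrates one scoring-and-merge iteration under simplifying assumptions: the word frequencies are made up, and the "##" continuation marker used by real implementations is omitted for brevity.

```python
# A toy sketch of one WordPiece training iteration, not a production trainer.
from collections import defaultdict

# Corpus as word -> frequency, each word pre-split into characters.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
splits = {w: list(w) for w in word_freqs}

def best_pair(splits, word_freqs):
    """Return the adjacent pair with the highest WordPiece score:
    freq(pair) / (freq(first) * freq(second))."""
    unit_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for word, pieces in splits.items():
        f = word_freqs[word]
        for piece in pieces:
            unit_freq[piece] += f
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += f
    return max(pair_freq,
               key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))

def merge(pair, splits):
    """Merge every occurrence of `pair` into a single new unit."""
    a, b = pair
    for word, pieces in splits.items():
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        splits[word] = merged
    return splits

# One iteration: pick the best-scoring pair and merge it everywhere.
pair = best_pair(splits, word_freqs)
splits = merge(pair, splits)
print(pair, splits)
```

Repeating this loop and adding each merged unit to the vocabulary yields the final WordPiece inventory.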
When a new piece of text is tokenized using a pre-trained WordPiece model, it is first split into words, typically on whitespace and punctuation. Each word is then segmented greedily, longest match first: the tokenizer repeatedly takes the longest prefix of the remaining characters that exists in the vocabulary, marking non-initial pieces with a continuation prefix (written "##" in BERT's implementation). If no valid segmentation can be found, the whole word is mapped to an unknown token such as [UNK].
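A minimal sketch of this greedy longest-match-first segmentation for a single word follows; the vocabulary and the [UNK] token name are illustrative assumptions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        current = None
        # Find the longest substring starting at `start` that is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no valid segmentation for this word
        pieces.append(current)
        start = end
    return pieces

# Illustrative vocabulary.
vocab = {"un", "##afford", "##able", "afford", "able"}
print(wordpiece_tokenize("unaffordable", vocab))  # ['un', '##afford', '##able']
```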
WordPiece tokenization is widely used in transformer-based language models such as BERT and its successors. Its ability to cover rare and unseen words with a compact, fixed-size vocabulary has made it a practical default in many NLP pipelines.