Pretokenization
Pretokenization is a preprocessing step in natural language processing that converts raw text into an initial, coarse-grained representation before the main tokenization stage. The goal is to standardize input, reduce variability, and make downstream encoding more predictable for models that operate on tokens or subwords.
Pretokenization usually involves breaking text into manageable units while controlling how punctuation, whitespace, and Unicode characters are handled. Typical rules split on whitespace, detach punctuation from adjacent words, and normalize Unicode so that equivalent character sequences share a single representation.
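A minimal sketch of a rule-based pretokenizer, assuming a simplified regular-expression scheme (the pattern below is illustrative, not any particular library's rule set). It keeps a leading space attached to the following word, a convention popularized by GPT-2-style tokenizers.

import re

# Illustrative pattern: runs of letters, runs of digits, runs of other
# symbols, or leftover whitespace; an optional leading space stays glued
# to the unit that follows it.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text: str) -> list[str]:
    """Split raw text into coarse pretokens before subword encoding."""
    return PRETOKEN_PATTERN.findall(text)

print(pretokenize("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'", 's', ' 2024', '.']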
Pretokens feed into subword tokenizers such as BPE or SentencePiece. After pretokenization, the chosen subword model splits each pretoken into vocabulary units independently, so subword merges never cross pretoken boundaries.
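To make that boundary-preserving behavior concrete, here is a hypothetical illustration: the tiny vocabulary and the greedy longest-match rule below are assumptions standing in for a trained BPE merge table, but the key property, that subword pieces never span two pretokens, holds in real systems as well.

# Assumed toy vocabulary; a real tokenizer would learn this from data.
VOCAB = {"low", "er", "lo", "w", "l", "o", "e", "r", " low", " "}

def subword_split(pretoken: str) -> list[str]:
    """Greedy longest-match segmentation applied within one pretoken."""
    pieces, i = [], 0
    while i < len(pretoken):
        for j in range(len(pretoken), i, -1):
            if pretoken[i:j] in VOCAB:
                pieces.append(pretoken[i:j])
                i = j
                break
        else:
            pieces.append(pretoken[i])  # fall back to a single character
            i += 1
    return pieces

# Each pretoken is segmented on its own, never merged with its neighbor.
tokens = [p for pt in ["low", " lower"] for p in subword_split(pt)]
print(tokens)  # ['low', ' low', 'er']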
Trade-offs: pretokenization influences vocabulary size, token granularity, and processing speed. An overly aggressive pretokenization scheme can fragment text into many tiny units, inflating sequence lengths and slowing inference, while an overly permissive one can let long, rare strings crowd the vocabulary.
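A small, assumed comparison of two schemes on the same input shows the granularity effect: a rule that isolates every punctuation character yields roughly three times as many units on punctuation-heavy text as a plain whitespace split.

import re

text = "state-of-the-art (SOTA) results, e.g. on GLUE."

# Aggressive: every non-alphanumeric character becomes its own unit.
aggressive = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
# Permissive: split on whitespace only; punctuation stays attached.
permissive = text.split()

print(len(aggressive), aggressive)  # 19 units: hyphens, parens, periods split off
print(len(permissive), permissive)  # 6 units: 'state-of-the-art', '(SOTA)', ...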