Pretokenization
Pretokenization is a preprocessing step in natural language processing that converts raw text into an initial, coarse-grained representation before the main tokenization stage. The goal is to standardize input, reduce variability, and make downstream encoding more predictable for models that operate on tokens or subwords.
Pretokenization usually involves breaking text into manageable units while controlling how punctuation, whitespace, and Unicode characters are handled. Typical rules split on whitespace, detach punctuation from adjacent words, and normalize Unicode so that equivalent character sequences share a single representation.
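A minimal sketch of a rule-based pretokenizer, assuming a simplified regular-expression scheme (the pattern below is illustrative, not any particular library's rule set). It keeps a leading space attached to the following word, a convention popularized by GPT-2-style tokenizers.

import re

# Illustrative pattern: runs of letters, runs of digits, runs of other
# symbols, or leftover whitespace; an optional leading space stays glued
# to the unit that follows it.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text: str) -> list[str]:
    """Split raw text into coarse pretokens before subword encoding."""
    return PRETOKEN_PATTERN.findall(text)

print(pretokenize("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'", 's', ' 2024', '.']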
Pretokens feed into subword tokenizers such as BPE or SentencePiece. After pretokenization, the chosen subword model splits each pretoken into vocabulary units independently, so subword merges never cross pretoken boundaries.
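To make that boundary-preserving behavior concrete, here is a hypothetical illustration: the tiny vocabulary and the greedy longest-match rule below are assumptions standing in for a trained BPE merge table, but the key property, that subword pieces never span two pretokens, holds in real systems as well.

# Assumed toy vocabulary; a real tokenizer would learn this from data.
VOCAB = {"low", "er", "lo", "w", "l", "o", "e", "r", " low", " "}

def subword_split(pretoken: str) -> list[str]:
    """Greedy longest-match segmentation applied within one pretoken."""
    pieces, i = [], 0
    while i < len(pretoken):
        for j in range(len(pretoken), i, -1):
            if pretoken[i:j] in VOCAB:
                pieces.append(pretoken[i:j])
                i = j
                break
        else:
            pieces.append(pretoken[i])  # fall back to a single character
            i += 1
    return pieces

# Each pretoken is segmented on its own, never merged with its neighbor.
tokens = [p for pt in ["low", " lower"] for p in subword_split(pt)]
print(tokens)  # ['low', ' low', 'er']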
Trade-offs: pretokenization influences vocabulary size, token granularity, and processing speed. An overly aggressive pretokenization scheme can fragment text into many tiny units, inflating sequence lengths and slowing inference, while an overly permissive one can let long, rare strings crowd the vocabulary.
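A small, assumed comparison of two schemes on the same input shows the granularity effect: a rule that isolates every punctuation character yields roughly three times as many units on punctuation-heavy text as a plain whitespace split.

import re

text = "state-of-the-art (SOTA) results, e.g. on GLUE."

# Aggressive: every non-alphanumeric character becomes its own unit.
aggressive = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
# Permissive: split on whitespace only; punctuation stays attached.
permissive = text.split()

print(len(aggressive), aggressive)  # 19 units: hyphens, parens, periods split off
print(len(permissive), permissive)  # 6 units: 'state-of-the-art', '(SOTA)', ...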