pretok
Pretok is a pre-tokenization stage used in natural language processing to transform raw text into a form that downstream tokenizers can handle consistently. It encompasses normalization, segmentation, and encoding considerations that precede the main tokenization step.
The purpose of pretok is to standardize input so that tokenization results are more predictable across languages.
Common operations in pretok include Unicode normalization, case folding, whitespace normalization, and the definition of token boundaries.
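The operations named above can be illustrated with a minimal Python sketch. It is not a reference implementation: the function names `normalize_text` and `pretokenize`, the choice of NFKC as the normalization form, and the boundary rule (runs of word characters, or single punctuation marks) are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode normalization, case folding, and whitespace normalization."""
    # Assumption: NFKC is used here; a pipeline could equally choose NFC, NFD, or NFKD.
    text = unicodedata.normalize("NFKC", text)
    # Case folding is a more aggressive, language-neutral form of lowercasing.
    text = text.casefold()
    # Collapse every run of whitespace to a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def pretokenize(text: str) -> list[str]:
    """Split normalized text into coarse units for a downstream tokenizer."""
    # Assumed boundary rule: a run of word characters, or one non-space symbol.
    return re.findall(r"\w+|[^\w\s]", normalize_text(text))


if __name__ == "__main__":
    print(pretokenize("  Ｈｅｌｌｏ,\tWorld!  "))
```

With these choices, full-width letters are mapped to their ASCII equivalents, the tab and the surrounding spaces are collapsed, and punctuation is separated from the adjacent words.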
Pretok is typically implemented as a separate component in NLP pipelines and sits between raw text and the main tokenizer.
Evaluation and criticisms of pretok focus on trade-offs between consistency and information preservation.
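The trade-off between consistency and information preservation can be made concrete with a small Python sketch. The helper name `aggressive` and the sample strings are illustrative assumptions; the point is only that inputs a reader would consider distinct can become identical after normalization.

```python
import unicodedata


def aggressive(text: str) -> str:
    """Hypothetical 'maximum consistency' setting: NFKC followed by case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


# Distinct inputs collapse to the same string, so the distinction is lost downstream.
pairs = [
    ("US", "us"),   # acronym vs. pronoun: case carried the meaning
    ("x²", "x2"),   # superscript two vs. plain digit: NFKC removes the formatting
]

for left, right in pairs:
    print(repr(left), repr(right), aggressive(left) == aggressive(right))

# A gentler setting (NFC, no case folding) keeps both distinctions intact.
gentle_us = unicodedata.normalize("NFC", "US")
gentle_sq = unicodedata.normalize("NFC", "x²")
```

Whether the aggressive or the gentle setting is preferable depends on whether the downstream task relies on the distinctions that are erased.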
See also: tokenization, text normalization, Unicode normalization, pre-processing.