Out-of-Vocabulary Problems
Out-of-vocabulary (OOV) tokens are tokens that are not present in the vocabulary used by a natural language processing system. In many NLP models, a fixed vocabulary maps words to embeddings or probability distributions. When the system encounters a word or token outside this set, it treats the token as OOV and applies a fallback mechanism, commonly substitution of a generic unknown-token placeholder, so that processing can continue.
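The fixed-vocabulary lookup and its fallback can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and a placeholder symbol named `<unk>`. The function names `build_vocab`, `encode`, and `oov_rate` are hypothetical and chosen only for this illustration. Real systems differ in how they tokenize and which placeholder they use.

```python
from typing import Dict, List

UNK = "<unk>"  # assumed placeholder symbol for out-of-vocabulary words


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct word seen in the corpus."""
    vocab = {UNK: 0}  # reserve id 0 for the unknown-word placeholder
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Map each word to its id, falling back to the UNK id for OOV words."""
    unk_id = vocab[UNK]
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]


def oov_rate(sentence: str, vocab: Dict[str, int]) -> float:
    """Fraction of words in the sentence that are missing from the vocabulary."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(word not in vocab for word in words) / len(words)


vocab = build_vocab(["the cat sat", "the dog ran"])
ids = encode("the zebra sat", vocab)  # "zebra" was never seen, so it maps to UNK
print(ids)
print(oov_rate("the zebra sat", vocab))
```

In this sketch every unseen word collapses onto the same id, so the model can no longer tell one OOV word from another. That loss of information is the cost of this kind of fallback.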
OOVs arise for several reasons. New names and neologisms, technical jargon, multilingual input, misspellings, and morphological
The presence of OOVs can impact performance in language modeling, translation, search, and information retrieval. Common
Mitigation strategies include subword tokenization methods such as byte-pair encoding (BPE), WordPiece, and SentencePiece, which break