Out-of-vocabulary problems
Out-of-vocabulary (OOV) words are terms that do not appear in the vocabulary used by a language model or NLP system during processing. In practice, OOV words cannot be directly mapped to a trained embedding or token index, which can hinder tasks such as language modeling, machine translation, information retrieval, and speech recognition.
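As a simple illustration, the following is a minimal Python sketch of word-to-index lookup with an unknown-token fallback; the toy vocabulary and the made-up word "blorptastic" are hypothetical, not taken from any real system:

    # Minimal sketch: mapping words to embedding indices with an <unk> fallback.
    # The vocabulary below is a toy example, not a trained model's vocabulary.
    vocab = {"<unk>": 0, "the": 1, "model": 2, "translates": 3, "text": 4}

    def to_ids(tokens, vocab, unk_id=0):
        # Words missing from the vocabulary cannot be mapped to a trained
        # embedding index, so they all collapse onto the same <unk> id.
        return [vocab.get(tok, unk_id) for tok in tokens]

    print(to_ids(["the", "model", "translates", "blorptastic", "text"], vocab))
    # -> [1, 2, 3, 0, 4]  ("blorptastic" is out of vocabulary)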
Causes include neologisms and loanwords, proper nouns and brand names, technical terms, spelling variants or errors, and mismatch between the domain of the training data and the text encountered at deployment.
Impact: OOV can lead to degraded predictions, misinterpretations, or loss of information when a system substitutes a generic unknown token (often written <unk>) for the missing word or discards it entirely.
Mitigation strategies: Subword tokenization (for example Byte-Pair Encoding or SentencePiece) breaks words into smaller units, allowing unseen words to be represented as sequences of subword units that are in the vocabulary; a sketch of this idea follows below. Character-level and byte-level models avoid a fixed word vocabulary altogether.
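The following is a minimal sketch of greedy longest-match subword segmentation (WordPiece-style, not the exact BPE or SentencePiece algorithms); the tiny subword vocabulary is an assumption for illustration, whereas real tokenizers learn theirs from data:

    # Minimal sketch: greedy longest-match subword segmentation over a toy
    # subword vocabulary (hypothetical; real tokenizers learn this from data).
    subwords = {"token", "iza", "tion", "un", "believ", "able"}

    def segment(word, subwords):
        pieces, start = [], 0
        while start < len(word):
            # Take the longest known subword starting at this position.
            for end in range(len(word), start, -1):
                if word[start:end] in subwords:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                # No known subword matches: fall back to a single character
                # (a real tokenizer might emit an <unk> piece here instead).
                pieces.append(word[start])
                start += 1
        return pieces

    print(segment("tokenization", subwords))  # ['token', 'iza', 'tion']
    print(segment("unbelievable", subwords))  # ['un', 'believ', 'able']

Because the pieces are in the vocabulary, a word that was never seen during training can still be mapped to trained embeddings piece by piece.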
Evaluation: OOV rate is the proportion of tokens not in the training vocabulary; performance can be measured by comparing metrics such as perplexity, word error rate, or translation quality on text with low and high OOV rates.
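A minimal sketch of computing an OOV rate, assuming whitespace-tokenized text and a toy vocabulary (both illustrative placeholders):

    # Minimal sketch: OOV rate as the share of tokens missing from a vocabulary.
    def oov_rate(tokens, vocab):
        if not tokens:
            return 0.0
        missing = sum(1 for tok in tokens if tok not in vocab)
        return missing / len(tokens)

    vocab = {"the", "cat", "sat", "on", "mat"}
    tokens = "the cat sat on the quxmat".split()
    print(f"OOV rate: {oov_rate(tokens, vocab):.2f}")  # 1 of 6 tokens -> 0.17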
See also: tokenization, vocabulary, natural language processing, embedding, subword.