Overstemmingin
Overstemmingin is a term used in computational linguistics for excessive stemming during text normalization, in which distinct words are reduced to the same base form more aggressively than is linguistically warranted. As a result, semantically different terms are conflated, leading to errors in search, indexing, and downstream NLP tasks. The phenomenon is especially noted in rule-based stemmers and in languages with rich morphology, where broad suffix-stripping rules can merge unrelated words.
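Causes include aggressive suffix removal, inadequate language-specific rules, and rules that fail to distinguish derivational from inflectional morphology, so that derivational suffixes, which often change a word's meaning, are stripped as freely as inflectional ones.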
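The effect of aggressive suffix removal can be illustrated with a minimal Python sketch. The two suffix lists and the helper names (strip_suffix, conflation_groups) are hypothetical and written only for this illustration; they are not the rules of any published stemmer. The sketch assumes that rules are tried in the order listed and that a stem must keep at least three characters.

```python
from collections import defaultdict

# Hypothetical rule sets for illustration only; not taken from any real stemmer.
AGGRESSIVE_SUFFIXES = ("ization", "ational", "ities", "ity", "ing", "ies", "al", "es", "e", "s")
LIGHT_SUFFIXES = ("ies", "es", "s")


def strip_suffix(word, suffixes, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflation_groups(words, suffixes):
    """Map each stem to the input words it absorbs, keeping only stems shared by 2+ words."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_suffix(word, suffixes)].append(word)
    return {stem: members for stem, members in groups.items() if len(members) > 1}


words = ["universe", "university", "universal"]

# The aggressive rules strip "e", "ity", and "al", merging three distinct words.
print(conflation_groups(words, AGGRESSIVE_SUFFIXES))
# {'univers': ['universe', 'university', 'universal']}

# The light rules leave all three words unchanged, so nothing is conflated.
print(conflation_groups(words, LIGHT_SUFFIXES))
# {}
```

Under the aggressive rule set, a query for any one of the three words would match documents containing the other two. The lighter rule set avoids this conflation, although in general it would leave more genuinely related forms unmerged.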
Impact can manifest as lower precision in information retrieval, since irrelevant documents are retrieved, and as
Mitigation strategies include using lemmatization or POS-aware stemming, adopting lighter or language-appropriate stemmers, and evaluating stemming
See also: stemming, lemmatization, Porter stemmer, Paice/Husk stemmer, information retrieval evaluation.