Stemming

Stemming is a text normalization technique used in natural language processing and information retrieval to reduce words to their base or stem form by removing affixes. The goal is to group together related word variants so they can be matched or indexed together, even when they appear in different forms. Stemmers do not attempt to produce linguistically correct lemmas, and the resulting stem may not be a valid word in the language.
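As a rough illustration of affix removal, the sketch below strips a handful of English suffixes. The suffix list, the function name `toy_stem`, and the `min_stem_len` parameter are assumptions made for this example only; they are not taken from any published stemmer, which would use ordered rules with conditions on the remaining stem.

```python
# Hypothetical suffix list, for illustration only. Real stemmers use
# ordered rule sets with conditions on the stem that remains.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered


words = ["connect", "connected", "connecting", "connects"]
stems = [toy_stem(w) for w in words]
print(stems)               # all four variants share the stem "connect"
print(toy_stem("ponies"))  # "poni": a stem need not be a valid word
```

The length check keeps short words such as "sing" from being cut down to a single letter, which is a very simple guard against over-stemming.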

Stemming algorithms can be rule-based or statistical. Early rule-based systems include the Lovins stemmer and the Porter stemmer, both of which apply sequences of affix-stripping rules. Snowball is a newer, language-aware framework that provides language-specific stemmers. Some approaches are designed to be conservative (light stemmers) to reduce over-stemming, while others are more aggressive.

Stemming versus lemmatization: stemming reduces words to a stem that may not be a dictionary form, whereas lemmatization maps words to their dictionary lemmas using linguistic analysis. Stemming is typically faster and simpler to implement, making it well suited for large-scale indexing and search. Lemmatization tends to produce more accurate and readable bases but requires lexical resources and more computation.

Applications and limitations: In information retrieval, stemming can increase recall by treating related forms as equivalent, but it may reduce precision if different words are incorrectly conflated. Stemming is language-dependent and can struggle with irregular forms or morphologically rich languages. In multilingual or mixed-language texts, language detection is often needed before stemming.

Overall, stemming remains a foundational preprocessing step in many IR and NLP pipelines, balancing simplicity and effectiveness for reducing lexical variation.
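The effect of stemming on recall and precision in retrieval can be sketched with a tiny inverted index. Everything here is hypothetical: the four sample documents, the helper names `crude_stem` and `build_index`, and the deliberately crude suffix rules were chosen only to make the effect visible.

```python
def crude_stem(token: str) -> str:
    """Hypothetical, deliberately crude stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs: dict, normalize) -> dict:
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


# Made-up documents for illustration.
docs = {
    1: "connecting cables",
    2: "cable connected",
    3: "evening news",
    4: "new cables",
}

exact_index = build_index(docs, lambda token: token)
stemmed_index = build_index(docs, crude_stem)

# Recall improves: the query "connect" now matches both inflected forms.
exact_connect = exact_index.get("connect", set())
stemmed_connect = stemmed_index.get(crude_stem("connect"), set())

# Precision can drop: "news" is conflated with the unrelated word "new".
exact_new = exact_index.get("new", set())
stemmed_new = stemmed_index.get(crude_stem("new"), set())

print(exact_connect, stemmed_connect)
print(exact_new, stemmed_new)
```

With these rules the stemmed index returns documents 1 and 2 for "connect", where the exact index returns nothing, but it also returns document 3 for "new". A lighter stemmer would accept lower recall to avoid that kind of false match.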