lemmatize

Lemmatization is the process of reducing a word to its lemma, the canonical base form found in a dictionary. It uses vocabulary and morphological analysis to determine the appropriate dictionary form, and it often requires the word’s part of speech to select the correct lemma. This makes lemmatization more linguistically informed than simple stemming, which relies on heuristic affix stripping.
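The difference between part-of-speech-aware dictionary lookup and heuristic affix stripping can be illustrated with a minimal sketch. The lexicon, the POS labels, and the function names `lemmatize` and `naive_stem` are all hypothetical and chosen for illustration; a real lemmatizer would rely on a full lexical resource and morphological rules rather than a handful of hard-coded entries.

```python
# Toy lexicon mapping (surface form, part of speech) -> lemma.
# Entries and POS labels are illustrative assumptions, not a real resource.
LEXICON = {
    ("running", "VERB"): "run",
    ("went", "VERB"): "go",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("studies", "NOUN"): "study",
}


def lemmatize(word, pos):
    """Return the lexicon lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


def naive_stem(word):
    """Strip the first matching suffix without checking any dictionary."""
    word = word.lower()
    for suffix in ("ing", "ies", "es", "s", "ed"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word, pos in [("running", "VERB"), ("saw", "NOUN"), ("saw", "VERB"), ("studies", "NOUN")]:
    print(f"{word} ({pos}): lemma={lemmatize(word, pos)}, stem={naive_stem(word)}")
```

In this sketch the same surface form, saw, maps to different lemmas depending on its part of speech, while the suffix stripper ignores the tag and can return nonwords such as runn or stud.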

In practice, a lemmatizer consults a lexical resource that lists lemma forms for different inflections and may apply rules for irregular forms. With the word and its POS, the algorithm returns the lemma. For example, running as a verb yields run; went as a past tense verb yields go; better as an adjective can yield good.

Lemmatization differs from stemming in that it aims to produce valid dictionary forms rather than arbitrary or truncated stems. Stemmers may produce nonwords or misleading forms, whereas lemmatizers seek a linguistically correct base form based on language rules and lexicons.

Applications of lemmatization include information retrieval, text preprocessing for natural language processing, and any task that benefits from normalizing word forms. It is language-dependent and relies on quality lexicons and POS tagging. In practice, lemmatization improves matching and analysis by treating inflected or derived forms of a word as a single item, aiding downstream tasks such as indexing, clustering, and semantic understanding.
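As one way to picture how treating inflected forms as a single item helps indexing, the sketch below builds a small inverted index keyed by lemma. The `FORM_TO_LEMMA` table, the sample documents, and the helper names are assumptions made for illustration; the sketch also skips POS tagging, which a real pipeline would normally need.

```python
from collections import defaultdict

# Hypothetical form -> lemma table; a real system would use a full lexicon and POS tags.
FORM_TO_LEMMA = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "went": "go",
    "goes": "go",
}


def normalize(token):
    """Lowercase a token and map it to its lemma when the table knows it."""
    token = token.lower()
    return FORM_TO_LEMMA.get(token, token)


def build_index(docs):
    """Map each lemma to the sorted ids of documents containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return {lemma: sorted(ids) for lemma, ids in index.items()}


def search(index, query):
    """Look up a single query word after normalizing it the same way as the documents."""
    return index.get(normalize(query), [])


docs = {1: "she ran home", 2: "he runs daily", 3: "they went out"}
index = build_index(docs)
print(search(index, "running"))  # [1, 2]
print(search(index, "goes"))     # [3]
```

Because the query and the documents are normalized in the same way, a search for running matches documents that contain only ran or runs.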