lemmatiseerde

Lemmatization is the process of reducing words to their lemma, the canonical base form found in dictionaries. In Dutch, lemmatiseerde is an inflected form of the verb lemmatiseren ("to lemmatize") and is used to describe text that has been processed to replace words with their lemmas. A lemmatized text typically uses the dictionary form, such as lopen ("to walk") for every inflected form of that verb, and the singular form for nouns.

In practice, lemmatization combines morphological analysis with lemma dictionaries. Many systems use a pipeline that includes part-of-speech tagging, morphological rules, and lexicons to determine the correct lemma for each token. This is especially important in Dutch, where verbs have multiple tenses and forms, nouns are marked for plural and possessive, and adjectives agree with nouns. Some approaches also handle compounds and clitics, which are common in Dutch.

Lemmatization differs from stemming. Stemming cuts words to a rough stem that may not be a valid word in the language, while lemmatization aims for legitimate lemmas that belong to the language's vocabulary. Lemmatization is language-specific and relies on linguistic resources, so its accuracy can vary across dialects and domains.

Applications of lemmatized text include improved information retrieval, search over inflected languages, text mining, machine translation, and linguistic research. By normalizing word forms, lemmatization can increase consistency and reduce sparsity in corpora. Challenges include handling irregular forms, polysemy, multiword expressions, and domain-specific vocabulary. The term lemmatiseerde often describes data prepared in this way.

Example: the Dutch words loopt, liep, gelopen all map to the lemma lopen in a lemmatized representation.
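The lookup step of such a pipeline, and the contrast with stemming, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny LEXICON, the coarse tags VERB and NOUN, and the functions lemmatize and naive_stem are hypothetical and do not come from any particular library. A real system would use a trained part-of-speech tagger and a full Dutch lexicon.

```python
# Hypothetical toy lexicon: (surface form, coarse POS tag) -> lemma.
# A real system would load a full Dutch lexicon and use a trained tagger.
LEXICON = {
    ("loopt", "VERB"): "lopen",
    ("liep", "VERB"): "lopen",
    ("gelopen", "VERB"): "lopen",
    ("huizen", "NOUN"): "huis",
    ("boeken", "NOUN"): "boek",    # "books"
    ("boeken", "VERB"): "boeken",  # "to book"
}


def lemmatize(token, pos):
    """Return the lexicon lemma for (token, pos), or the lowercased token if unknown."""
    return LEXICON.get((token.lower(), pos), token.lower())


def naive_stem(token):
    """Crude suffix stripping, shown only for contrast with lemmatization."""
    word = token.lower()
    for suffix in ("en", "t"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tokens paired with tags that a part-of-speech tagger is assumed to have produced.
tagged = [("loopt", "VERB"), ("liep", "VERB"), ("gelopen", "VERB"), ("huizen", "NOUN")]

lemmas = [lemmatize(tok, pos) for tok, pos in tagged]
stems = [naive_stem(tok) for tok, _ in tagged]

print(lemmas)  # ['lopen', 'lopen', 'lopen', 'huis']
print(stems)   # ['loop', 'liep', 'gelop', 'huiz']

# Four distinct surface forms collapse to two distinct lemmas.
print(len({tok for tok, _ in tagged}), len(set(lemmas)))  # 4 2
```

In this sketch the three forms of lopen share one lemma, while the suffix stripper yields three different stems, two of which (gelop, huiz) are not Dutch words. The part-of-speech tag also decides between boek and boeken for the ambiguous form boeken, which a lookup on the surface form alone could not do.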