rareword

Rareword is a concept used in linguistics and corpus linguistics to describe lexical items that occur with very low frequency in a given language sample. It is a descriptive label rather than a formal category, applied in studies of vocabulary size, lexical diversity, and language processing. Rareword is related to, but broader than, hapax legomena, which are words that appear only once.

Measurement of rarity relies on frequency data from corpora. Common approaches define rareword by absolute counts (for example, fewer than a few occurrences per million words) or by relative position in a frequency distribution (such as the bottom 5 or 10 percent). Thresholds vary by language, corpus size, and research goals.

Identification typically requires preprocessing steps such as tokenization, normalization, and lemmatization. Researchers use large, representative corpora and apply smoothing or probabilistic models to estimate true rarity, accounting for sampling bias and domain effects.

Applications include natural language processing and information retrieval, where rarewords can challenge language models, spelling, and search. Techniques to handle rarewords include subword models, character-level representations, leveraging external dictionaries, or expanding training data. In linguistic research, studying rareword patterns informs theories of lexical productivity, borrowing, neologisms, and diachronic change.

Examples of rarewords depend on the corpus. In general English corpora, long specialised terms like floccinaucinihilipilification or pneumonoultramicroscopicsilicovolcanoconiosis are often cited as rare. Yet rarity is corpus-specific; a word may be common in a specialized domain but rare in general language data.
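
The frequency-based definitions described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not a standard implementation: the function names, the toy sentence, and the threshold values are all hypothetical. The regular-expression tokenizer is a crude stand-in for real tokenization and lemmatization. The last function uses the simple Good-Turing estimate (number of hapax legomena divided by total tokens) as one possible example of a probabilistic correction; the text does not specify which smoothing method researchers use.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes.

    Assumption: a crude stand-in for proper tokenization and lemmatization.
    """
    return re.findall(r"[a-z']+", text.lower())


def hapax_legomena(counts):
    """Word types that occur exactly once in the sample."""
    return {word for word, count in counts.items() if count == 1}


def rarewords_by_rate(counts, max_per_million):
    """Word types whose rate is below `max_per_million` occurrences
    per million tokens (the absolute-count style of definition)."""
    total = sum(counts.values())
    return {
        word
        for word, count in counts.items()
        if count / total * 1_000_000 < max_per_million
    }


def rarewords_by_rank(counts, bottom_fraction):
    """Word types in the bottom `bottom_fraction` of the frequency
    distribution (the relative-position style of definition).

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts, key=lambda word: (counts[word], word))
    cutoff = int(len(ranked) * bottom_fraction)
    return set(ranked[:cutoff])


def unseen_mass_estimate(counts):
    """Simple Good-Turing estimate of the probability that the next
    token is a word type not yet seen: hapax count / total tokens."""
    total = sum(counts.values())
    return len(hapax_legomena(counts)) / total


# Toy sample (hypothetical); real studies use corpora of millions of words.
sample = "the cat sat on the mat the cat ran"
counts = Counter(tokenize(sample))

hapaxes = hapax_legomena(counts)
rare_by_rate = rarewords_by_rate(counts, max_per_million=150_000)
rare_by_rank = rarewords_by_rank(counts, bottom_fraction=0.5)
unseen = unseen_mass_estimate(counts)
```

On a sample this small the per-million threshold has to be set unrealistically high to select anything, which reflects the point that thresholds depend on corpus size. Running the same functions on a domain-specific corpus and on a general one would be one way to observe that a word's rarity is corpus-specific.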