
Word frequency

Word frequency refers to how often individual words occur in a body of text or across a collection of texts (a corpus). It is a basic statistic used in linguistics, corpus linguistics, and natural language processing to characterize language use, identify common vocabulary, and support modeling tasks such as search and classification.

Measurement types: absolute frequency is the raw count f(w) of a word in a corpus; relative frequency p(w) is f(w) divided by the total number of tokens N. In document-level analysis, term frequency (TF) measures f(w) within a single document; document frequency (DF) counts in how many documents a word appears. TF-IDF combines these to weight a word by its frequency within a document and its rarity across the documents of the corpus.
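
These quantities can be computed directly from token counts; the following is a minimal sketch in plain Python with a hypothetical toy corpus and one common TF-IDF weighting variant.

    import math
    from collections import Counter

    # Hypothetical toy corpus: three already-tokenized documents.
    docs = [["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "barked"],
            ["a", "cat", "and", "a", "dog"]]

    tokens = [t for doc in docs for t in doc]
    f = Counter(tokens)                           # absolute frequency f(w) over the corpus
    N = len(tokens)                               # total number of tokens
    p = {w: c / N for w, c in f.items()}          # relative frequency p(w) = f(w) / N

    tf = [Counter(doc) for doc in docs]           # term frequency within each document
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    n_docs = len(docs)

    # One common TF-IDF variant: tf(w, d) * log(n_docs / df(w)).
    tfidf = [{w: c * math.log(n_docs / df[w]) for w, c in doc_tf.items()}
             for doc_tf in tf]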

Word frequency distributions tend to follow Zipf's law: a small set of words accounts for a large share of tokens, while most words occur rarely. This distribution is language-dependent and domain-dependent.
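
One quick way to see this skew is to measure how much of a corpus its top-k most frequent words cover; a small self-contained sketch on toy data:

    from collections import Counter

    # Sketch: share of all tokens covered by the top-k most frequent words.
    # `tokens` is a hypothetical flat token list; any real corpus works the same way.
    tokens = "the cat sat on the mat the dog barked at the cat".split()
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    for k in (1, 5, 10):
        share = sum(ranked[:k]) / total
        print(f"top {k} most frequent words cover {share:.0%} of tokens")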

Applications: building lexicons and stop word lists; informing language models; improving information retrieval, text classification, and topic modeling; selecting features for machine learning; benchmarking corpora against frequency profiles.
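
For instance, a crude frequency-derived stop word candidate list can be built by taking the k most frequent words; a sketch continuing the first example above (the dict p and the cutoff k are assumptions, not a fixed recipe):

    # Sketch: stop word candidates = the k most frequent words, where `p` is the
    # relative-frequency dict from the first sketch (any {word: frequency}
    # mapping would do) and k is an arbitrary cutoff.
    k = 100
    stop_candidates = sorted(p, key=p.get, reverse=True)[:k]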

Data sources and preprocessing: frequency data come from general-purpose corpora (e.g., COCA), web corpora, or specialized corpora. Preprocessing steps include tokenization, lowercasing, stemming or lemmatization, handling punctuation, and deciding on single- versus multiword expressions.
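
A minimal preprocessing sketch in plain Python, covering tokenization, lowercasing, and punctuation handling only (stemming or lemmatization would need a library such as NLTK or spaCy):

    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase the text and keep runs of letters, digits, and apostrophes;
        punctuation is discarded."""
        return re.findall(r"[a-z0-9']+", text.lower())

    counts = Counter(tokenize("The cat sat on the mat. The dog barked!"))
    print(counts.most_common(3))   # e.g. [('the', 3), ('cat', 1), ('sat', 1)]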

Tools: common software libraries include Python-based NLTK, spaCy, gensim, and scikit-learn, as well as R packages such as quanteda and tm. Frequency data are available as precomputed lists or can be generated from a corpus, and projects may use Google Ngram data for cross-language comparisons.
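
As one illustration, scikit-learn's CountVectorizer can produce corpus-level counts from raw documents (a sketch; get_feature_names_out assumes a recent scikit-learn release):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The cat sat on the mat.", "The dog barked."]   # toy documents
    vec = CountVectorizer(lowercase=True)
    X = vec.fit_transform(docs)              # sparse document-term count matrix
    vocab = vec.get_feature_names_out()      # vocabulary, in column order
    corpus_counts = X.sum(axis=0).A1         # total count of each word across documents
    freq = dict(zip(vocab, corpus_counts))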
