IDFs

IDF, short for inverse document frequency, is a statistic used in information retrieval to assess how informative a term is across a document collection. It helps distinguish terms that are common across many documents from those that are relatively unique to a subset of the corpus.

Calculation and interpretation: For a corpus with N documents, the document frequency df(t) is the number of documents that contain term t. The IDF of t is usually computed as log(N/df(t)), or log((N+1)/(df(t)+1)) to smooth extreme values. Terms that appear in many documents have low IDF, while rare terms have high IDF. This weighting is intended to emphasize discriminative terms in search and analysis.

Relation to tf-idf: IDF is a key component of the tf-idf weighting scheme, where the weight of a term in a particular document is the term frequency (tf) times the IDF. In practice, a high IDF boosts terms that are likely to be informative for a given document, aiding ranking and retrieval. Variants exist, and some systems use the natural log or log base 10 depending on the implementation.

Variants and limitations: Some approaches use adjusted formulas, such as BM25’s IDF component, log((N - df + 0.5)/(df + 0.5)). IDF is corpus-dependent and can be unstable for very small or rapidly changing collections. It does not capture semantic similarity or context beyond frequency, and highly frequent but meaningful terms (like names or technical terms) may receive low IDF despite their usefulness. Effective use often involves combining IDF with other signals and updating it as corpora evolve.
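
The formulas above can be sketched in a few lines of Python. This is a minimal illustration over a made-up toy corpus (the corpus, function names, and tokenization are assumptions for the example, not any particular library's API); it uses the smoothed IDF, the tf × IDF weight, and the BM25 IDF component exactly as given:

```python
import math

# Toy corpus: each document is a list of tokens (illustrative only).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]
N = len(corpus)

def df(term):
    """Document frequency: number of documents containing the term."""
    return sum(1 for doc in corpus if term in doc)

def idf(term):
    """Smoothed IDF: log((N + 1) / (df(t) + 1))."""
    return math.log((N + 1) / (df(term) + 1))

def tf_idf(term, doc):
    """tf-idf weight of a term in one document: raw count times IDF."""
    return doc.count(term) * idf(term)

def bm25_idf(term):
    """BM25's IDF component: log((N - df + 0.5) / (df + 0.5))."""
    d = df(term)
    return math.log((N - d + 0.5) / (d + 0.5))

# "the" appears in 2 of 3 documents, so its IDF is low;
# "mat" appears in only 1, so its IDF is higher.
print(idf("the"), idf("mat"))            # idf("the") < idf("mat")
print(tf_idf("cat", corpus[0]))
print(bm25_idf("the"), bm25_idf("mat"))  # BM25 IDF can go negative
```

Note that the BM25 variant can return a negative value for very common terms (here, "the"), which is one reason practical systems clamp or adjust it; the plain log(N/df(t)) form is always non-negative as long as df(t) ≤ N.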