Query likelihood model

The query likelihood model is a probabilistic information retrieval approach that ranks documents by the likelihood that a document's language model would generate the user's query. In this framework, every document D is associated with a language model P(·|D) over terms, and the query Q is treated as data drawn from that model. Documents are ranked by P(Q|D), often computed as the product of term probabilities across the query, P(Q|D) = ∏_{w∈Q} P(w|D), or, equivalently, by summing the log probabilities of query terms.
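As a minimal sketch of the ranking rule (the `score` helper and its tokenization are illustrative, not from the source), summing log term probabilities under unsmoothed maximum-likelihood estimates might look like:

```python
import math
from collections import Counter

def score(query_terms, doc_terms):
    """log P(Q|D) with maximum-likelihood estimates P(w|D) = tf(w,D) / |D|.

    Without smoothing, a query term absent from the document contributes
    log 0 = -inf, zeroing out the whole document's score.
    """
    tf = Counter(doc_terms)
    n = len(doc_terms)
    return sum(
        math.log(tf[w] / n) if tf[w] else float("-inf")
        for w in query_terms
    )

doc = "the quick brown fox jumps over the lazy dog".split()
score("quick fox".split(), doc)  # 2 * log(1/9), both terms occur once in 9 tokens
score("quick cat".split(), doc)  # -inf: "cat" never occurs in the document
```

The -inf result for any unseen term is exactly the sparsity problem that motivates smoothing.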

Because a document's language model is typically sparse, smoothing is used to combine the document-specific distribution with a background collection model P(w|C). Common smoothing methods include Dirichlet prior and Jelinek–Mercer smoothing. In Dirichlet smoothing, P(w|D) = (tf(w,D) + μ P(w|C)) / (|D| + μ), where tf(w,D) is the term frequency in the document, |D| is the document length, μ is a parameter, and P(w|C) is the term's probability in the collection. Jelinek–Mercer smoothing blends the document and collection models as P(w|D) = (1 − λ) P_ml(w|D) + λ P(w|C), where P_ml(w|D) = tf(w,D) / |D| is the maximum-likelihood estimate. The collection model is typically estimated from the entire corpus.

Practical use involves computing P(w|D) for each query term, assembling P(Q|D), and ranking documents by this likelihood (often using log probabilities for numerical stability).
Strengths include good performance on short queries and robustness to unseen terms due to smoothing.
The approach was introduced as a language-modeling method for information retrieval by Ponte and Croft, and has become a foundational technique in IR research and applications.