MinHash

MinHash is a probabilistic data structure and algorithm for estimating the Jaccard similarity between finite sets. It is widely used in information retrieval and data mining for tasks such as near-duplicate detection and clustering, especially on large collections where exact set comparison is impractical.

The method uses a family of k hash functions h_1, ..., h_k intended to mimic random permutations. For a set S, compute for each hash function h_i the value m_i = min{ h_i(x) : x in S }. The sketch of S is the k-tuple M(S) = (m_1, ..., m_k). A key property is that for any two sets A and B, the probability that the i-th components coincide, i.e., M(A)_i = M(B)_i, equals the Jaccard similarity J(A,B) = |A ∩ B| / |A ∪ B|, under the idealization that each h_i behaves like a random permutation: the element achieving the minimum over A ∪ B is equally likely to be any of its elements, and the two minima agree exactly when that element lies in A ∩ B. Consequently, the similarity between A and B is estimated by the fraction of coordinates where their sketches match.

MinHash sketches enable fast similarity estimation and are often combined with Locality-Sensitive Hashing (LSH) to prune the number of pairwise comparisons in large datasets. This approach underpins scalable near-duplicate detection, document clustering, plagiarism detection, and search applications that rely on identifying similar items without exhaustively comparing all elements.

In practice, documents are represented as sets of features, typically tokens or shingles. The method supports streaming and incremental updates: adding an element only requires updating those hash minima it lowers, although deletions are harder, since a minimum contributed by the removed element must be recomputed. Computation uses O(k) space per set and O(k · |S|) time to build a sketch, with k controlling the trade-off between accuracy and storage; the standard error of the estimate decreases as O(1/√k). There are extensions to weighted sets and to more efficient hashing schemes.

MinHash was introduced by Andrei Z. Broder in 1997 in the context of measuring similarity among web documents. The term "minwise hashing" is often used for the core technique; variants and optimizations have since been developed for large-scale data processing.
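The sketch-and-estimate procedure described above can be illustrated with a minimal Python sketch. Simulating the k hash functions by salting a BLAKE2 hash with the function index is an implementation choice made for this example, not part of the algorithm's definition.

```python
import hashlib


def minhash_sketch(s, k=128):
    """Build a k-coordinate MinHash sketch of a set of strings.

    Each "hash function" h_i is simulated by prefixing the element
    with the index i before hashing (an illustrative choice).
    """
    sketch = []
    for i in range(k):
        # m_i = min{ h_i(x) : x in S }
        m = min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in s
        )
        sketch.append(m)
    return sketch


def estimate_jaccard(ma, mb):
    """Estimate J(A, B) as the fraction of matching coordinates."""
    assert len(ma) == len(mb)
    return sum(a == b for a, b in zip(ma, mb)) / len(ma)


A = {"the", "quick", "brown", "fox", "jumps"}
B = {"the", "quick", "brown", "dog", "jumps"}
# True Jaccard similarity: |A ∩ B| / |A ∪ B| = 4/6 ≈ 0.667
est = estimate_jaccard(minhash_sketch(A), minhash_sketch(B))
```

Production implementations typically replace the cryptographic hash with cheaper universal hash families (e.g., (a·x + b) mod p) for speed.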
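The LSH pruning mentioned above is commonly implemented by "banding" the sketch: the k coordinates are split into b bands of r rows, and two sets become candidate pairs if any band of their sketches matches exactly. The following is a minimal sketch of that idea, using hand-made toy integer sketches; the function names are illustrative.

```python
from collections import defaultdict


def lsh_buckets(sketches, bands, rows):
    """Hash each band of each sketch into a bucket keyed by its contents."""
    buckets = defaultdict(set)
    for doc_id, sketch in sketches.items():
        assert len(sketch) == bands * rows
        for b in range(bands):
            band = tuple(sketch[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Return unordered pairs of IDs that share at least one band bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Toy 4-coordinate sketches, split into 2 bands of 2 rows each.
sketches = {
    "doc1": [1, 2, 3, 4],
    "doc2": [1, 2, 9, 9],  # shares the first band with doc1
    "doc3": [7, 8, 9, 9],  # shares the second band with doc2
}
pairs = candidate_pairs(lsh_buckets(sketches, bands=2, rows=2))
# → {("doc1", "doc2"), ("doc2", "doc3")}
```

With this scheme, a pair whose true Jaccard similarity is J becomes a candidate in at least one band with probability 1 − (1 − J^r)^b, so the choice of b and r tunes the effective similarity threshold.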