MinHash

MinHash is a probabilistic data structure and algorithm for estimating the Jaccard similarity between finite sets. It is widely used in information retrieval and data mining for tasks such as near-duplicate detection and clustering, especially on large collections where exact set comparison is impractical.

The method uses a family of k hash functions h_1, ..., h_k intended to mimic random permutations. For a set S, compute for each hash function h_i the value m_i = min{ h_i(x) : x in S }. The sketch of S is the k-tuple M(S) = (m_1, ..., m_k). A key property is that for any two sets A and B, the probability that the i-th components coincide, i.e., M(A)_i = M(B)_i, equals the Jaccard similarity J(A,B) = |A ∩ B| / |A ∪ B|, under the idealization that each h_i behaves like a random permutation: the element achieving the minimum over A ∪ B is equally likely to be any of its elements, and the two minima agree exactly when that element lies in A ∩ B. Consequently, the similarity between A and B is estimated by the fraction of coordinates where their sketches match.

MinHash sketches enable fast similarity estimation and are often combined with Locality-Sensitive Hashing (LSH) to prune the number of pairwise comparisons in large datasets. This approach underpins scalable near-duplicate detection, document clustering, plagiarism detection, and search applications that rely on identifying similar items without exhaustively comparing all elements.

In practice, documents are represented as sets of features, typically tokens or shingles. The method supports streaming and incremental updates: adding an element only requires updating those hash minima it lowers, although deletions are harder, since a minimum contributed by the removed element must be recomputed. Computation uses O(k) space per set and O(k · |S|) time to build a sketch, with k controlling the trade-off between accuracy and storage; the standard error of the estimate decreases as O(1/√k). There are extensions to weighted sets and to more efficient hashing schemes.

MinHash was introduced by Andrei Z. Broder in 1997 in the context of measuring similarity among web documents. The term "minwise hashing" is often used for the core technique; variants and optimizations have since been developed for large-scale data processing.
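The sketch-and-estimate procedure described above can be illustrated with a minimal Python sketch. Simulating the k hash functions by salting a BLAKE2 hash with the function index is an implementation choice made for this example, not part of the algorithm's definition.

```python
import hashlib


def minhash_sketch(s, k=128):
    """Build a k-coordinate MinHash sketch of a set of strings.

    Each "hash function" h_i is simulated by prefixing the element
    with the index i before hashing (an illustrative choice).
    """
    sketch = []
    for i in range(k):
        # m_i = min{ h_i(x) : x in S }
        m = min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in s
        )
        sketch.append(m)
    return sketch


def estimate_jaccard(ma, mb):
    """Estimate J(A, B) as the fraction of matching coordinates."""
    assert len(ma) == len(mb)
    return sum(a == b for a, b in zip(ma, mb)) / len(ma)


A = {"the", "quick", "brown", "fox", "jumps"}
B = {"the", "quick", "brown", "dog", "jumps"}
# True Jaccard similarity: |A ∩ B| / |A ∪ B| = 4/6 ≈ 0.667
est = estimate_jaccard(minhash_sketch(A), minhash_sketch(B))
```

Production implementations typically replace the cryptographic hash with cheaper universal hash families (e.g., (a·x + b) mod p) for speed.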
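The LSH pruning mentioned above is commonly implemented by "banding" the sketch: the k coordinates are split into b bands of r rows, and two sets become candidate pairs if any band of their sketches matches exactly. The following is a minimal sketch of that idea, using hand-made toy integer sketches; the function names are illustrative.

```python
from collections import defaultdict


def lsh_buckets(sketches, bands, rows):
    """Hash each band of each sketch into a bucket keyed by its contents."""
    buckets = defaultdict(set)
    for doc_id, sketch in sketches.items():
        assert len(sketch) == bands * rows
        for b in range(bands):
            band = tuple(sketch[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Return unordered pairs of IDs that share at least one band bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Toy 4-coordinate sketches, split into 2 bands of 2 rows each.
sketches = {
    "doc1": [1, 2, 3, 4],
    "doc2": [1, 2, 9, 9],  # shares the first band with doc1
    "doc3": [7, 8, 9, 9],  # shares the second band with doc2
}
pairs = candidate_pairs(lsh_buckets(sketches, bands=2, rows=2))
# → {("doc1", "doc2"), ("doc2", "doc3")}
```

With this scheme, a pair whose true Jaccard similarity is J becomes a candidate in at least one band with probability 1 − (1 − J^r)^b, so the choice of b and r tunes the effective similarity threshold.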