Deduplication ranking

Deduplication ranking is the process of producing an ordered list of items from a corpus while removing or suppressing duplicates or near-duplicates. The goal is to improve result quality, diversity, and user satisfaction by avoiding repetitive content and reducing presentation of redundant information. It is applied in search engines, document retrieval, recommender systems, and data curation tasks.
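A minimal sketch of the idea, ranking by score and then suppressing near-duplicates of items already kept; the function name is illustrative, and difflib's character-overlap ratio stands in for a real near-duplicate detector:

```python
from difflib import SequenceMatcher

def dedup_rank(items, scores, sim_threshold=0.9):
    """Order string items by score, dropping any item whose similarity
    to an already-kept item meets the threshold."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    kept = []
    for item, score in ranked:
        if all(SequenceMatcher(None, item, k).ratio() < sim_threshold
               for k, _ in kept):
            kept.append((item, score))
    return [item for item, _ in kept]

docs = ["cheap flights to paris",
        "cheap flights to paris!",   # near-duplicate of the first
        "paris hotel deals"]
print(dedup_rank(docs, [0.9, 0.8, 0.7]))
# → ['cheap flights to paris', 'paris hotel deals']
```

The near-duplicate of the top result is suppressed, so the second slot goes to distinct content instead.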

Two common design choices exist. In deduplication-aware ranking, duplicates are detected and filtered as part of the ranking process, ensuring the final top-k set contains distinct items. In post hoc deduplication, a standard ranking is produced first, and duplicates are suppressed or consolidated when selecting the final results.

Techniques for detecting duplicates include fingerprinting, shingling, MinHash, and SimHash; clustering of similar items; and threshold-based or supervised classification to identify duplicates. Ranking can then be adjusted by assigning a single representative from each duplicate group, penalizing repeated content, or applying diversification objectives such as Maximal Marginal Relevance (MMR) to favor novel items. If diversity is preferred but strict deduplication is not, methods that optimize for both relevance and distinctiveness can be used.

Applications include web search, enterprise search, e-discovery, and multimedia retrieval. Evaluation uses traditional ranking metrics like precision, recall, and NDCG computed over deduplicated results, as well as distinct-aware metrics that measure the coverage of unique content. Challenges include accurate near-duplicate detection, choosing appropriate similarity thresholds, computational overhead, evolving catalogs, and privacy concerns when deduplicating personal data.

See also: deduplication, diversification, near-duplicate detection.
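Maximal Marginal Relevance, mentioned above as a diversification objective, can be sketched as a greedy loop that repeatedly picks the candidate maximizing relevance minus a redundancy penalty. The documents, relevance scores, and token-Jaccard similarity below are illustrative stand-ins, not a fixed API:

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to items already selected)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

def jaccard(a, b):
    # Token-overlap similarity, a stand-in for a real near-duplicate measure.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical relevance scores for three short documents.
docs = {"cheap flights paris": 1.0,
        "cheap paris flights deals": 0.9,
        "rome city guide": 0.6}
print(mmr_select(docs, docs, jaccard, k=2))
# → ['cheap flights paris', 'rome city guide']
```

With lam = 1 this reduces to plain relevance ranking; lowering lam penalizes redundancy harder, which is why the less relevant but distinct "rome city guide" outranks the near-duplicate of the top result here.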