Deduplication ranking
Deduplication ranking is the process of producing an ordered list of items from a corpus while removing or suppressing exact duplicates and near-duplicates. The goal is to improve result quality, diversity, and user satisfaction by keeping repetitive or redundant content out of the results. It is applied in search engines, document retrieval, recommender systems, and data curation tasks.
Two common design choices exist. In deduplication-aware ranking, duplicates are detected and filtered as part of
Techniques for detecting duplicates include fingerprinting, shingling, MinHash, and SimHash; clustering of similar items; and threshold-based
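As an illustration of how shingling and a similarity threshold can be combined with a ranked list, the following is a minimal sketch rather than a reference implementation. It assumes items are plain-text strings with precomputed relevance scores, uses word-level shingles with exact Jaccard similarity (not MinHash or SimHash estimates), and applies a greedy pass that keeps an item only if it is not too similar to any higher-ranked item already kept. The function names (`shingles`, `jaccard`, `dedup_rank`) and the defaults `k=3` and `threshold=0.5` are hypothetical choices made for this example.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_rank(
    items: List[Tuple[str, float]],
    threshold: float = 0.5,  # assumed cutoff; tune per corpus
    k: int = 3,              # assumed shingle size
) -> List[Tuple[str, float]]:
    """Sort items by score (descending) and greedily drop near-duplicates.

    An item is suppressed when its shingle set has Jaccard similarity
    >= threshold with any higher-ranked item that was already kept.
    """
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    kept: List[Tuple[str, float]] = []
    kept_shingles: List[Set[Shingle]] = []
    for text, score in ranked:
        current = shingles(text, k)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of a higher-ranked item
        kept.append((text, score))
        kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    docs = [
        ("the quick brown fox jumps over the lazy dog", 0.9),
        ("the quick brown fox jumps over the lazy cat", 0.8),
        ("an unrelated document about ranking metrics", 0.7),
    ]
    for text, score in dedup_rank(docs):
        print(f"{score:.1f}  {text}")
```

The greedy pass compares each candidate against every kept item, which is quadratic in the worst case. For large corpora, a sketch-based estimate such as MinHash with locality-sensitive hashing is commonly substituted for the exact Jaccard computation.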
Applications include web search, enterprise search, e-discovery, and multimedia retrieval. Evaluation uses traditional ranking metrics like
See also: deduplication, diversification, near-duplicate detection.