UniRef
UniRef, short for UniProt Reference Clusters, is a collection of clustered protein sequence sets produced by the UniProt consortium. It draws on sequences from UniProtKB and selected UniParc records, and reduces redundancy so that large-scale analyses run faster. UniRef provides three levels of clustering: UniRef100, UniRef90, and UniRef50. At each level, sequences that meet the corresponding identity threshold are grouped together, and a single representative sequence is assigned to each cluster. UniRef100 merges identical sequences (100% identity) and sequence fragments of 11 or more residues from any organism into a single entry. UniRef90 is built by clustering UniRef100 sequences so that each member has at least 90% sequence identity to, and 80% overlap with, the cluster's seed sequence, which is the longest sequence in the cluster. UniRef50 is built in the same way from the UniRef90 seed sequences, using a 50% identity threshold. Each cluster includes metadata and a mapping to its member entries in UniProtKB and related resources.
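The idea of grouping sequences around a seed at a given identity threshold can be illustrated with a minimal Python sketch. This is a toy illustration only and is not the procedure UniProt uses to build UniRef: the function names `identity` and `greedy_cluster`, the four example sequences, and the simplified identity measure (an ungapped, left-aligned position-by-position comparison with no overlap criterion) are all assumptions made for the example.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in an ungapped,
    left-aligned comparison, relative to the shorter sequence.
    Real pipelines compute identity from proper alignments."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / n


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy seed-based clustering (hypothetical helper).

    Sequences are visited longest first, so each seed is the longest
    member of its cluster. A sequence joins the first existing seed it
    matches at >= threshold; otherwise it becomes a new seed.
    Returns a mapping from seed id to member ids (seed included).
    """
    clusters: dict[str, list[str]] = {}
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for seed in clusters:
            if identity(seqs[seed], seqs[sid]) >= threshold:
                clusters[seed].append(sid)
                break
        else:
            # No seed was similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Made-up example sequences, not real UniProtKB entries.
toy = {
    "P1": "MKTAYIAKQRQISFVKSHFS",
    "P2": "MKTAYIAKQRQISFVKSHFA",  # differs from P1 at one position
    "P3": "MKTAYIAKQRGGGGGGGGGG",  # shares the first half with P1
    "P4": "WWWWWWWWWWWWWWWWWWWW",  # unrelated
}

for level in (1.0, 0.9, 0.5):
    print(level, greedy_cluster(toy, level))
```

With these toy inputs, lowering the threshold merges more sequences under the same seed, which mirrors how the cluster sets become coarser from UniRef100 to UniRef90 to UniRef50.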
The clusters are generated by hierarchical clustering algorithms applied to the protein sequences in UniProtKB and
Applications include speeding up sequence similarity searches, reducing database size for large-scale analyses, and facilitating functional
Access and formats: UniRef data are publicly available from the UniProt website and FTP site in flat