UniRef

UniRef, short for UniProt Reference Clusters, is a collection of non-redundant protein sequence clusters produced by the UniProt consortium. It combines sequences from UniProtKB and selected UniParc records to reduce redundancy and speed up large-scale analyses. UniRef provides three levels of clustering: UniRef100, UniRef90, and UniRef50. At each level, sequences that meet the corresponding identity threshold are grouped together, and a single representative sequence is assigned to each cluster. UniRef100 merges identical sequences (100% identity) across the source data into a single entry; UniRef90 groups sequences with at least 90% identity to the cluster's seed sequence; UniRef50 applies the same scheme with a 50% identity threshold. Each cluster includes metadata and a mapping to member entries in UniProtKB and related resources.
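The idea of grouping sequences around a seed at a given identity threshold can be sketched with a toy example. This is a minimal illustration, not UniRef's production pipeline: the function names `toy_identity` and `greedy_cluster` are hypothetical, the sequences are made up, and identity is computed by naive position-by-position comparison rather than by a real alignment.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence.

    Assumption: no alignment is performed; real pipelines align first.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Map each seed ID to the IDs assigned to its cluster (seed included)."""
    # Visit the longest sequences first so they become seeds; ties break on ID.
    order = sorted(seqs, key=lambda k: (-len(seqs[k]), k))
    clusters: Dict[str, List[str]] = {}
    for sid in order:
        for seed in clusters:
            if toy_identity(seqs[sid], seqs[seed]) >= threshold:
                clusters[seed].append(sid)
                break
        else:
            # No existing seed is similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Made-up sequences, for illustration only.
example = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQK",  # 9 of 10 positions match P1
    "P3": "MKTAYLLLLL",  # 5 of 10 positions match P1
    "P4": "GGGGGGGGGG",  # no positions match P1
}

for level in (1.0, 0.9, 0.5):
    print(level, greedy_cluster(example, level))
```

Lowering the threshold merges more sequences under fewer seeds, which is why the lower-identity levels are more compact than UniRef100.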

The clusters are generated by hierarchical clustering algorithms applied to the protein sequences in UniProtKB and UniParc, and they provide a compact, non-redundant representation of protein sequence space. The representative sequence serves as a proxy for the whole cluster in many analyses. UniRef identifiers link to the cluster’s metadata and to the list of member sequences, enabling traceability to original UniProt entries.

Applications include speeding up sequence similarity searches, reducing database size for large-scale analyses, and facilitating functional annotation transfer across related proteins. UniRef is widely used in bioinformatics workflows, proteomics pipelines, and comparative genomics studies.

Access and formats: UniRef data are publicly available from the UniProt website and FTP site in flat files and downloadable formats. Users can retrieve representative sequences, access cluster mappings, and download member lists. The resource is updated periodically to align with updates to UniProtKB and UniParc.
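As a sketch of working with downloaded representative sequences, the snippet below parses UniRef-style FASTA headers. It assumes the header layout `>ClusterID ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember`; check the current UniProt documentation before relying on it. The helper names `parse_header` and `read_fasta` are hypothetical, and the sample record is illustrative, with a made-up placeholder sequence.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumed header layout:
# >ClusterID ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember
_HEADER = re.compile(
    r"^>(?P<cluster_id>\S+)\s+(?P<name>.*?)\s+n=(?P<n>\d+)"
    r"\s+Tax=(?P<tax>.*?)\s+TaxID=(?P<taxid>\S+)\s+RepID=(?P<rep_id>\S+)\s*$"
)


def parse_header(line: str) -> Dict[str, object]:
    """Split one UniRef-style FASTA header into its fields."""
    m = _HEADER.match(line.strip())
    if m is None:
        raise ValueError(f"not a UniRef-style header: {line!r}")
    fields: Dict[str, object] = dict(m.groupdict())
    fields["n"] = int(m.group("n"))
    # The clustering level is the part of the cluster ID before the underscore.
    fields["level"] = m.group("cluster_id").split("_", 1)[0]
    return fields


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Illustrative record; the sequence is a made-up placeholder.
sample = """\
>UniRef50_Q9K794 Putative AgrB-like protein n=2 Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD
MKTAYIAKQR
MKTAYIAKQK
"""

for header, sequence in read_fasta(sample.splitlines()):
    info = parse_header(header)
    print(info["cluster_id"], info["level"], info["n"], info["rep_id"], len(sequence))
```

The same `read_fasta` generator can be fed an open file handle, so a large download can be streamed rather than loaded into memory.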