bgzip

bgzip is a compression utility that produces BGZF files, where BGZF stands for Block GZIP Format. It is designed for large genomic data files and is widely used in the HTSlib/Samtools ecosystem. BGZF compresses data in independent blocks, typically up to about 64 kilobytes of uncompressed data per block, and stores them sequentially in a single file. Each block is a gzip-compressed unit, and the collection of blocks is arranged to function as a single compressed file while enabling selective decompression.
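
The block layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the HTSlib implementation: the function names (`bgzf_block`, `bgzf_compress`, `block_offsets`, `read_block`) are hypothetical, the compression level is arbitrary, and the 0xFF00-byte chunk size is an assumption chosen to keep each compressed block under 64 KiB. Each chunk becomes a gzip member whose extra field records the total block size, so a reader can hop from block to block, or decompress one block alone.

```python
import gzip
import struct
import zlib

CHUNK = 0xFF00  # uncompressed bytes per block (assumed; leaves headroom under 64 KiB)


def bgzf_block(payload: bytes) -> bytes:
    """Wrap one chunk as a gzip member carrying a 'BC' extra subfield."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE stream
    cdata = deflater.compress(payload) + deflater.flush()
    total = 18 + len(cdata) + 8  # header + compressed data + CRC32/ISIZE trailer
    assert total <= 0x10000, "block must fit in 64 KiB"
    # Fixed gzip header with FLG=4 (extra field present) and XLEN=6.
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Extra subfield 'B','C' with a 2-byte value: total block size minus one.
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def bgzf_compress(data: bytes) -> bytes:
    """Compress data as a sequence of independent blocks plus an empty end marker."""
    blocks = [bgzf_block(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    blocks.append(bgzf_block(b""))  # empty block marks end of file
    return b"".join(blocks)


def block_offsets(blob: bytes) -> list[int]:
    """Walk the file block by block using the stored block size."""
    offsets, pos = [], 0
    while pos < len(blob):
        offsets.append(pos)
        (bsize,) = struct.unpack_from("<H", blob, pos + 16)
        pos += bsize + 1
    return offsets


def read_block(blob: bytes, offset: int) -> bytes:
    """Decompress only the block that starts at the given byte offset."""
    (bsize,) = struct.unpack_from("<H", blob, offset + 16)
    return gzip.decompress(blob[offset:offset + bsize + 1])
```

Because every block is a complete gzip member, `gzip.decompress(bgzf_compress(data))` should return the original bytes, while `read_block` touches only the one block it is asked for.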

The key feature of BGZF is support for random access to compressed data when combined with an index. In coordinate-sorted data such as VCF files, an index maps genomic coordinates to BGZF block offsets, allowing targeted retrieval of a region without decompressing the entire file. Tabix is commonly used to create such indexes (.tbi) for BGZF files, enabling fast querying of compressed data.

Usage and compatibility are straightforward. To compress a VCF file, one typically runs bgzip file.vcf, producing file.vcf.gz. Decompression can be performed with bgzip -d file.vcf.gz or with standard gzip tools for full-file extraction, though random access requires the BGZF index. To enable region queries, the corresponding index is created with tabix (e.g., tabix -p vcf file.vcf.gz).

BGZF is an industry-wide standard in genomic data processing within the Samtools/HTSlib ecosystem. It provides the familiar gzip compression while adding block-level structure and indexing to support fast, random access to large datasets.
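
The idea of an index that maps coordinates to block offsets can be sketched with a toy example. This is not the .tbi format and not how tabix is implemented. As simplifying assumptions, plain gzip members stand in for BGZF blocks, all records lie on a single chromosome, the position is the second tab-separated field, and the names `build` and `query` are hypothetical.

```python
import bisect
import gzip


def build(records, per_block=1000):
    """Compress sorted (position, line) records into independent gzip members.

    Returns the compressed bytes and a toy index with one entry per block:
    (first position in block, byte offset of block, compressed size).
    """
    blob, index = b"", []
    for i in range(0, len(records), per_block):
        group = records[i:i + per_block]
        member = gzip.compress(b"".join(line for _, line in group))
        index.append((group[0][0], len(blob), len(member)))
        blob += member
    return blob, index


def query(blob, index, start, end):
    """Return lines with start <= position <= end.

    Only blocks that can overlap the requested range are decompressed.
    """
    firsts = [first for first, _, _ in index]
    # Begin one block before the first block whose first position is >= start,
    # since matching records may sit at the tail of that earlier block.
    i = max(bisect.bisect_left(firsts, start) - 1, 0)
    hits = []
    while i < len(index) and index[i][0] <= end:
        _, offset, size = index[i]
        for line in gzip.decompress(blob[offset:offset + size]).splitlines():
            pos = int(line.split(b"\t")[1])
            if start <= pos <= end:
                hits.append(line)
        i += 1
    return hits
```

For example, with records such as `(p, b"chr1\t%d\t.\tA\tG\n" % p)` sorted by `p`, a call like `query(blob, index, 12000, 12050)` decompresses only the block or blocks around that range and leaves the rest of the file untouched.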