Hudi

Apache Hudi is an open-source data management framework for data lakes, designed to provide transactional capabilities on large analytical datasets stored in distributed storage such as HDFS or cloud object stores (for example, S3, GCS, or ABFS). It originated at Uber and is now an Apache Software Foundation top-level project. Hudi supports upserts, deletes, and incremental processing, enabling near-real-time data freshness while maintaining consistency across reads and writes.

A core distinction in Hudi is its two storage types: Copy-on-Write (COW) and Merge-on-Read (MOR). COW rewrites entire data files to reflect updates, offering fast reads, while MOR stores updates as log-structured changes that are merged during reads or compaction, trading some read speed for potentially faster write throughput. Hudi writes data in Parquet (with optional ORC) and maintains a timeline of commits and metadata under the .hoodie directory to provide ACID-like guarantees and facilitate incremental queries.
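
A minimal sketch of choosing between the two storage types at write time, assuming the Hudi Spark bundle is on the classpath; the table name, field names, and bucket path here are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with a Hudi bundle, e.g.
#   --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:<version>
spark = SparkSession.builder.appName("hudi-table-types").getOrCreate()

df = spark.createDataFrame(
    [("id-1", "2024-01-01", 42)], ["uuid", "event_date", "value"]
)

hudi_options = {
    "hoodie.table.name": "events",                        # hypothetical name
    "hoodie.datasource.write.recordkey.field": "uuid",    # record-level key
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "value",  # dedupe tie-breaker
    # COPY_ON_WRITE rewrites base Parquet files on update; MERGE_ON_READ
    # appends log files that are merged at read or compaction time.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-bucket/hudi/events"  # base path; .hoodie/ is created here
)
```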

Key features include record-level upserts and deletes by primary key, incremental views to process only new or modified data, and a commit timeline that enables snapshot isolation. Hudi integrates closely with Apache Spark, with entry points for Spark Structured Streaming, Spark SQL, and batch jobs. It also provides connectors and tooling for querying through engines such as Hive and Presto/Trino.
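
A rough sketch of the incremental view, continuing the hypothetical table above (the begin instant shown is a placeholder; real instants are the commit timestamps recorded on the timeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

base_path = "s3://example-bucket/hudi/events"  # hypothetical base path

# An "incremental" query returns only records committed after the given
# instant on the .hoodie timeline, instead of scanning a full snapshot.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)

incremental_df.createOrReplaceTempView("events_incremental")
spark.sql("SELECT uuid, event_date, value FROM events_incremental").show()
```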

Use cases typically involve data ingestion pipelines for data lakes and lakehouse architectures, where users need reliable upserts, deletes, and efficient incremental consumption on large-scale datasets while leveraging existing Hadoop and cloud storage ecosystems. Hudi is licensed under the Apache License 2.0.
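
One common shape for such an ingestion pipeline, sketched here with a toy rate source and hypothetical paths (any streaming DataFrame would be written the same way), is Spark Structured Streaming feeding a Hudi table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-stream-ingest").getOrCreate()

# Toy source for illustration; a real pipeline would read Kafka, files, etc.
stream_df = (
    spark.readStream.format("rate").load()
    .selectExpr(
        "cast(value as string) as uuid",
        "current_date() as event_date",
        "value",
    )
)

query = (
    stream_df.writeStream.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    .option("hoodie.datasource.write.precombine.field", "value")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events")
    .outputMode("append")
    .start("s3://example-bucket/hudi/events")
)
query.awaitTermination()
```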
