Hudi

Apache Hudi is an open-source data management framework for data lakes, designed to provide transactional capabilities on large analytical datasets stored in distributed storage such as HDFS or cloud object stores (for example, S3, GCS, or ABFS). It originated at Uber and is now an Apache Software Foundation top-level project. Hudi supports upserts, deletes, and incremental processing, enabling near-real-time data freshness while maintaining consistency across reads and writes.

A core distinction in Hudi is its two storage types: Copy-on-Write (COW) and Merge-on-Read (MOR). COW rewrites entire data files to reflect updates, offering fast reads, while MOR stores updates as log-structured changes that are merged during reads or compaction, trading some read speed for potentially faster write throughput. Hudi writes data in Parquet (with optional ORC) and maintains a timeline of commits and metadata under the .hoodie directory to provide ACID-like guarantees and facilitate incremental queries.
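
A minimal sketch of choosing between the two storage types at write time, assuming the Hudi Spark bundle is on the classpath; the table name, field names, and bucket path here are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with a Hudi bundle, e.g.
#   --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:<version>
spark = SparkSession.builder.appName("hudi-table-types").getOrCreate()

df = spark.createDataFrame(
    [("id-1", "2024-01-01", 42)], ["uuid", "event_date", "value"]
)

hudi_options = {
    "hoodie.table.name": "events",                        # hypothetical name
    "hoodie.datasource.write.recordkey.field": "uuid",    # record-level key
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "value",  # dedupe tie-breaker
    # COPY_ON_WRITE rewrites base Parquet files on update; MERGE_ON_READ
    # appends log files that are merged at read or compaction time.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-bucket/hudi/events"  # base path; .hoodie/ is created here
)
```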

Key features include record-level upserts and deletes by primary key, incremental views to process only new or modified data, and a commit timeline that enables snapshot isolation. Hudi integrates closely with Apache Spark, with entry points for Spark Structured Streaming, Spark SQL, and batch jobs. It also provides connectors and tooling for querying through engines such as Hive and Presto/Trino.
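
A rough sketch of the incremental view, continuing the hypothetical table above (the begin instant shown is a placeholder; real instants are the commit timestamps recorded on the timeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

base_path = "s3://example-bucket/hudi/events"  # hypothetical base path

# An "incremental" query returns only records committed after the given
# instant on the .hoodie timeline, instead of scanning a full snapshot.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)

incremental_df.createOrReplaceTempView("events_incremental")
spark.sql("SELECT uuid, event_date, value FROM events_incremental").show()
```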

Use cases typically involve data ingestion pipelines for data lakes and lakehouse architectures, where users need reliable upserts, deletes, and efficient incremental consumption on large-scale datasets while leveraging existing Hadoop and cloud storage ecosystems. Hudi is licensed under the Apache License 2.0.
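
One common shape for such an ingestion pipeline, sketched here with a toy rate source and hypothetical paths (any streaming DataFrame would be written the same way), is Spark Structured Streaming feeding a Hudi table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-stream-ingest").getOrCreate()

# Toy source for illustration; a real pipeline would read Kafka, files, etc.
stream_df = (
    spark.readStream.format("rate").load()
    .selectExpr(
        "cast(value as string) as uuid",
        "current_date() as event_date",
        "value",
    )
)

query = (
    stream_df.writeStream.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    .option("hoodie.datasource.write.precombine.field", "value")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events")
    .outputMode("append")
    .start("s3://example-bucket/hudi/events")
)
query.awaitTermination()
```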
