RDDs

A resilient distributed dataset (RDD) is a fundamental data abstraction in distributed data processing frameworks such as Apache Spark. An RDD represents a fault-tolerant, parallel collection of objects distributed across a cluster. RDDs are immutable: once created, they cannot be changed, only transformed to produce new RDDs.

RDDs are constructed either by loading data from external storage, such as HDFS or local files, or by parallelizing a collection in the driver program. They can be transformed by operations like map, filter, flatMap, and join, or consumed by actions such as collect, count, reduce, and saveAsTextFile. Transformations are lazily evaluated, meaning computation is deferred until an action is invoked. RDDs achieve fault tolerance through lineage: if a partition is lost, it can be recomputed from the original data and the sequence of transformations that created it. Optional checkpointing saves an RDD to stable storage and truncates long lineage graphs, so recovery does not have to replay the full chain of transformations.

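A minimal sketch of these ideas in Scala, assuming a local Spark installation; the application name, master URL, and checkpoint directory below are placeholders rather than values from this article:

    // Minimal sketch (Scala, Spark RDD API). App name, master URL, and
    // checkpoint directory are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Construction: parallelize a driver-side collection
        // (or load external data with sc.textFile("hdfs://...")).
        val numbers = sc.parallelize(1 to 10)

        // Transformations are lazy; nothing has been computed yet.
        val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

        // Optional checkpointing truncates the lineage once the RDD is materialized.
        sc.setCheckpointDir("/tmp/spark-checkpoints")
        squaresOfEvens.checkpoint()

        // Actions trigger evaluation of the whole lineage.
        println(squaresOfEvens.collect().mkString(", "))  // 4, 16, 36, 64, 100
        println(squaresOfEvens.count())                   // 5

        sc.stop()
      }
    }
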
When an action is invoked, Spark builds a directed acyclic graph (DAG) of the transformations and schedules execution across the cluster as a set of stages and tasks.

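One way to inspect that graph is the RDD's toDebugString, which prints the chain of parent RDDs before anything executes. A small sketch, assuming an existing SparkContext named sc and a placeholder input path:

    // Sketch: inspecting the DAG of transformations recorded for an RDD.
    // Assumes an existing SparkContext named sc; the input path is a placeholder.
    val lines   = sc.textFile("hdfs:///data/input.txt")
    val lengths = lines.map(_.length).filter(_ > 0).sortBy(x => x)

    // Nothing has executed yet; toDebugString shows the lineage, with indentation
    // marking the shuffle boundary introduced by sortBy.
    println(lengths.toDebugString)
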
RDDs can be persisted in memory or on disk with various storage levels, enabling reuse across multiple actions. Their elements can be single objects or key-value pairs; RDDs of key-value pairs, known as pair RDDs, support key-based operations such as reduceByKey and join.

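A short sketch of persistence and a pair-RDD aggregation, again assuming an existing SparkContext named sc and using invented data:

    // Sketch: persistence and pair-RDD operations. Assumes an existing SparkContext sc;
    // the data is invented for illustration.
    import org.apache.spark.storage.StorageLevel

    val scores = sc.parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 2), ("bob", 1)))

    // Ask Spark to keep the RDD around; .cache() is shorthand for MEMORY_ONLY.
    scores.persist(StorageLevel.MEMORY_AND_DISK)

    // reduceByKey combines values per key, shuffling only partial sums.
    val totals = scores.reduceByKey(_ + _)       // totals: alice -> 5, bob -> 6

    println(totals.collect().mkString(", "))     // first action: computes and persists scores
    println(scores.count())                      // second action: served from the persisted copy
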
While RDDs provide low-level control and are essential for unstructured data and fine-grained operations, many workloads can be expressed more succinctly and efficiently with higher-level APIs such as DataFrames and Datasets, whose execution plans Spark can optimize automatically. RDDs nevertheless remain a foundational abstraction for certain advanced use cases and low-level data processing tasks.

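For comparison, a sketch of the same per-key aggregation written with both the RDD API and the DataFrame API; the session settings, column names, and data are illustrative:

    // Sketch: the same per-key aggregation with the RDD API and the DataFrame API.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("alice", 3), ("bob", 5), ("alice", 2))

    // RDD version: opaque functions, executed as written.
    val rddTotals = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _)

    // DataFrame version: declarative, so Spark can optimize the plan before running it.
    val dfTotals = pairs.toDF("name", "score").groupBy("name").agg(sum("score"))

    rddTotals.collect().foreach(println)
    dfTotals.show()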