RDD

Resilient Distributed Dataset (RDD) is a core data abstraction in Apache Spark. It represents an immutable, partitioned collection of elements that can be processed in parallel across a cluster. RDDs are fault-tolerant because they retain a lineage graph describing how each dataset was derived, allowing lost partitions to be recomputed from the original data.

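As a minimal sketch of these ideas (assuming an existing SparkContext named sc, as provided by spark-shell), the following creates a partitioned RDD from a local collection and prints the lineage Spark would use to recompute lost partitions:

    // Assumes an existing SparkContext named sc (as in spark-shell).
    val data = Seq(1, 2, 3, 4, 5)

    // Distribute a local collection into an RDD split across 3 partitions.
    val rdd = sc.parallelize(data, numSlices = 3)

    println(rdd.getNumPartitions)  // 3: partitions are processed in parallel
    println(rdd.toDebugString)     // lineage graph used to recompute lost partitions
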
RDDs support two types of operations: transformations and actions. Transformations, such as map, filter, flatMap, union, intersection, and distinct, return a new RDD and are evaluated lazily. Actions, such as count, collect, reduce, first, and take, trigger computation and return a value to the driver or write results to storage. For key-value data, specialized operations like reduceByKey, groupByKey, and combineByKey enable aggregation and joins.

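A short sketch of both kinds of operation, again assuming an existing SparkContext sc (the sample words are illustrative):

    // Transformations build a lineage but execute nothing yet.
    val words   = sc.parallelize(Seq("spark", "rdd", "spark", "api"))
    val lengths = words.map(_.length).filter(_ > 3)

    // Actions trigger the computation and return results to the driver.
    println(lengths.count())                   // 2
    println(lengths.collect().mkString(", "))  // 5, 5

    // Key-value aggregation with reduceByKey.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)  // (spark,2), (rdd,1), (api,1) in some order
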
Dependencies between RDDs are classified as narrow or wide. Narrow dependencies (e.g., map, filter) allow each output partition to be computed from a single parent partition, enabling pipelining. Wide dependencies (e.g., groupByKey, join) involve data movement across executors and incur a shuffle cost.

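A sketch of the two dependency types, assuming an existing SparkContext sc; toDebugString shows where the lineage is split into stages by a shuffle:

    // Narrow dependencies: map and filter run pipelined within each partition.
    val nums   = sc.parallelize(1 to 100, 4)
    val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

    // Wide dependency: reduceByKey regroups data by key across executors (a shuffle).
    val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)

    // The indented blocks in the debug string mark the shuffle boundary.
    println(wide.toDebugString)
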
RDDs can be created from external storage systems (HDFS, local file systems, cloud storage) or from existing RDDs via transformations. They can be persisted in memory or on disk using various storage levels, such as MEMORY_ONLY, MEMORY_AND_DISK, or serialized forms, to optimize iterative computations.

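A sketch of both creation paths and persistence, assuming an existing SparkContext sc; the HDFS path is a placeholder, not a real dataset:

    import org.apache.spark.storage.StorageLevel

    // From external storage (the path is a placeholder).
    val lines = sc.textFile("hdfs:///data/input.txt")

    // From an existing RDD, via a transformation.
    val tokens = lines.flatMap(_.split("\\s+"))

    // Keep the result around for reuse; spill to disk if it does not fit in memory.
    tokens.persist(StorageLevel.MEMORY_AND_DISK)

    println(tokens.count())             // first action computes and persists
    println(tokens.distinct().count())  // later actions reuse the persisted data
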
Language support includes Scala, Java, Python, and R. While Spark later introduced DataFrames and Datasets with optimizations via the Catalyst optimizer, RDDs remain a lower-level, flexible API useful for fine-grained control, custom data types, or algorithms not easily expressed in SQL-like operations.

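As an illustrative sketch of that flexibility (the Reading case class and the calibration logic are hypothetical, and sc is again an existing SparkContext), an RDD of a custom type can be processed with mapPartitions for per-partition control:

    case class Reading(sensor: String, value: Double)

    val readings = sc.parallelize(Seq(
      Reading("a", 1.0), Reading("a", 4.0), Reading("b", 2.5)))

    // mapPartitions exposes the raw iterator for each partition, allowing
    // per-partition setup (parsers, connections) that SQL-style APIs hide.
    val calibrated = readings.mapPartitions { iter =>
      val factor = 10.0  // stand-in for per-partition initialization
      iter.map(r => r.copy(value = r.value * factor))
    }

    println(calibrated.collect().mkString(", "))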