RDD

Resilient Distributed Dataset (RDD) is a core data abstraction in Apache Spark. It represents an immutable, partitioned collection of elements that can be processed in parallel across a cluster. RDDs are fault-tolerant because they retain a lineage graph describing how each dataset was derived, allowing lost partitions to be recomputed from the original data.

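As a minimal sketch of these ideas (assuming an existing SparkContext named sc, as provided by spark-shell), the following creates a partitioned RDD from a local collection and prints the lineage Spark would use to recompute lost partitions:

    // Assumes an existing SparkContext named sc (as in spark-shell).
    val data = Seq(1, 2, 3, 4, 5)

    // Distribute a local collection into an RDD split across 3 partitions.
    val rdd = sc.parallelize(data, numSlices = 3)

    println(rdd.getNumPartitions)  // 3: partitions are processed in parallel
    println(rdd.toDebugString)     // lineage graph used to recompute lost partitions
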
RDDs support two types of operations: transformations and actions. Transformations, such as map, filter, flatMap, union, intersection, and distinct, return a new RDD and are evaluated lazily. Actions, such as count, collect, reduce, first, and take, trigger computation and return a value to the driver or write results to storage. For key-value data, specialized operations like reduceByKey, groupByKey, and combineByKey enable aggregation and joins.

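A short sketch of both kinds of operation, again assuming an existing SparkContext sc (the sample words are illustrative):

    // Transformations build a lineage but execute nothing yet.
    val words   = sc.parallelize(Seq("spark", "rdd", "spark", "api"))
    val lengths = words.map(_.length).filter(_ > 3)

    // Actions trigger the computation and return results to the driver.
    println(lengths.count())                   // 2
    println(lengths.collect().mkString(", "))  // 5, 5

    // Key-value aggregation with reduceByKey.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)  // (spark,2), (rdd,1), (api,1) in some order
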
Dependencies between RDDs are classified as narrow or wide. Narrow dependencies (e.g., map, filter) allow each output partition to be computed from a single parent partition, enabling pipelining. Wide dependencies (e.g., groupByKey, join) involve data movement across executors and incur a shuffle cost.

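A sketch of the two dependency types, assuming an existing SparkContext sc; toDebugString shows where the lineage is split into stages by a shuffle:

    // Narrow dependencies: map and filter run pipelined within each partition.
    val nums   = sc.parallelize(1 to 100, 4)
    val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

    // Wide dependency: reduceByKey regroups data by key across executors (a shuffle).
    val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)

    // The indented blocks in the debug string mark the shuffle boundary.
    println(wide.toDebugString)
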
RDDs can be created from external storage systems (HDFS, local file systems, cloud storage) or from existing RDDs via transformations. They can be persisted in memory or on disk using various storage levels, such as MEMORY_ONLY, MEMORY_AND_DISK, or serialized forms, to optimize iterative computations.

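A sketch of both creation paths and persistence, assuming an existing SparkContext sc; the HDFS path is a placeholder, not a real dataset:

    import org.apache.spark.storage.StorageLevel

    // From external storage (the path is a placeholder).
    val lines = sc.textFile("hdfs:///data/input.txt")

    // From an existing RDD, via a transformation.
    val tokens = lines.flatMap(_.split("\\s+"))

    // Keep the result around for reuse; spill to disk if it does not fit in memory.
    tokens.persist(StorageLevel.MEMORY_AND_DISK)

    println(tokens.count())             // first action computes and persists
    println(tokens.distinct().count())  // later actions reuse the persisted data
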
Language support includes Scala, Java, Python, and R. While Spark later introduced DataFrames and Datasets with optimizations via the Catalyst optimizer, RDDs remain a lower-level, flexible API useful for fine-grained control, custom data types, or algorithms not easily expressed in SQL-like operations.

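As an illustrative sketch of that flexibility (the Reading case class and the calibration logic are hypothetical, and sc is again an existing SparkContext), an RDD of a custom type can be processed with mapPartitions for per-partition control:

    case class Reading(sensor: String, value: Double)

    val readings = sc.parallelize(Seq(
      Reading("a", 1.0), Reading("a", 4.0), Reading("b", 2.5)))

    // mapPartitions exposes the raw iterator for each partition, allowing
    // per-partition setup (parsers, connections) that SQL-style APIs hide.
    val calibrated = readings.mapPartitions { iter =>
      val factor = 10.0  // stand-in for per-partition initialization
      iter.map(r => r.copy(value = r.value * factor))
    }

    println(calibrated.collect().mkString(", "))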