RDDs

A resilient distributed dataset (RDD) is a fundamental data abstraction in distributed data processing frameworks such as Apache Spark. An RDD represents a fault-tolerant, parallel collection of objects distributed across a cluster. RDDs are immutable: once created, they cannot be changed, only transformed to produce new RDDs.

RDDs are constructed either by loading data from external storage, such as HDFS or local files, or by parallelizing a collection in the driver program. They can be transformed by operations like map, filter, flatMap, and join, or consumed by actions such as collect, count, reduce, and saveAsTextFile. Transformations are lazily evaluated, meaning computation is deferred until an action is invoked. RDDs achieve fault tolerance through lineage: if a partition is lost, it can be recomputed from the original data and the sequence of transformations that created it. Optional checkpointing saves an RDD to stable storage and truncates long lineage graphs, so recovery does not have to replay the full chain of transformations.

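A minimal sketch of these ideas in Scala, assuming a local Spark installation; the application name, master URL, and checkpoint directory below are placeholders rather than values from this article:

    // Minimal sketch (Scala, Spark RDD API). App name, master URL, and
    // checkpoint directory are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Construction: parallelize a driver-side collection
        // (or load external data with sc.textFile("hdfs://...")).
        val numbers = sc.parallelize(1 to 10)

        // Transformations are lazy; nothing has been computed yet.
        val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

        // Optional checkpointing truncates the lineage once the RDD is materialized.
        sc.setCheckpointDir("/tmp/spark-checkpoints")
        squaresOfEvens.checkpoint()

        // Actions trigger evaluation of the whole lineage.
        println(squaresOfEvens.collect().mkString(", "))  // 4, 16, 36, 64, 100
        println(squaresOfEvens.count())                   // 5

        sc.stop()
      }
    }
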
When an action is invoked, Spark builds a directed acyclic graph (DAG) of the transformations and schedules execution across the cluster as a set of stages and tasks.

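One way to inspect that graph is the RDD's toDebugString, which prints the chain of parent RDDs before anything executes. A small sketch, assuming an existing SparkContext named sc and a placeholder input path:

    // Sketch: inspecting the DAG of transformations recorded for an RDD.
    // Assumes an existing SparkContext named sc; the input path is a placeholder.
    val lines   = sc.textFile("hdfs:///data/input.txt")
    val lengths = lines.map(_.length).filter(_ > 0).sortBy(x => x)

    // Nothing has executed yet; toDebugString shows the lineage, with indentation
    // marking the shuffle boundary introduced by sortBy.
    println(lengths.toDebugString)
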
RDDs can be persisted in memory or on disk with various storage levels, enabling reuse across multiple actions. Their elements can be single objects or key-value pairs; RDDs of key-value pairs, known as pair RDDs, support key-based operations such as reduceByKey and join.

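A short sketch of persistence and a pair-RDD aggregation, again assuming an existing SparkContext named sc and using invented data:

    // Sketch: persistence and pair-RDD operations. Assumes an existing SparkContext sc;
    // the data is invented for illustration.
    import org.apache.spark.storage.StorageLevel

    val scores = sc.parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 2), ("bob", 1)))

    // Ask Spark to keep the RDD around; .cache() is shorthand for MEMORY_ONLY.
    scores.persist(StorageLevel.MEMORY_AND_DISK)

    // reduceByKey combines values per key, shuffling only partial sums.
    val totals = scores.reduceByKey(_ + _)       // totals: alice -> 5, bob -> 6

    println(totals.collect().mkString(", "))     // first action: computes and persists scores
    println(scores.count())                      // second action: served from the persisted copy
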
While RDDs provide low-level control and are essential for unstructured data and fine-grained operations, many workloads can be expressed more succinctly and efficiently with higher-level APIs such as DataFrames and Datasets, whose execution plans Spark can optimize automatically. RDDs nevertheless remain a foundational abstraction for certain advanced use cases and low-level data processing tasks.

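For comparison, a sketch of the same per-key aggregation written with both the RDD API and the DataFrame API; the session settings, column names, and data are illustrative:

    // Sketch: the same per-key aggregation with the RDD API and the DataFrame API.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("alice", 3), ("bob", 5), ("alice", 2))

    // RDD version: opaque functions, executed as written.
    val rddTotals = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _)

    // DataFrame version: declarative, so Spark can optimize the plan before running it.
    val dfTotals = pairs.toDF("name", "score").groupBy("name").agg(sum("score"))

    rddTotals.collect().foreach(println)
    dfTotals.show()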