Crash resilience

Crash resilience is the ability of a system to continue operating or to recover quickly after an unexpected shutdown, crash, or power loss, while preserving data integrity and minimizing downtime. It involves maintaining crash consistency, durability, and rapid recovery across software and hardware layers.

Key concepts include crash consistency, where a system ensures that state transitions are atomic and recoverable after a crash; durability, which guarantees that committed data survives a failure; and recovery procedures, which reconstruct the latest valid state. Techniques commonly used to achieve crash resilience include write-ahead logging and journaling to record intended changes before they are applied, redundancy and replication to tolerate component failures, and frequent checkpoints or copy-on-write storage to enable fast restoration. Supportive hardware features such as non-volatile memory and battery-backed caches can reduce data loss during power outages.

In practice, crash resilience spans multiple domains. Databases rely on transactional mechanisms, including WALs and recovery algorithms, to reconstruct a consistent state after crashes. File systems use journaling or copy-on-write semantics to protect metadata and data. Distributed systems employ replicated state machines and consensus protocols (for example, Paxos or Raft) to continue operating despite node crashes. Embedded and safety-critical systems use watchdog timers, safe shutdown procedures, and deterministic recovery paths to maintain operation.

Designers assess crash resilience with metrics like recovery time objective (RTO) and recovery point objective (RPO). Testing approaches include chaos engineering and disaster recovery drills. Trade-offs often involve additional latency, storage overhead, and complexity to achieve stronger guarantees of data integrity and availability.
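
As an illustration of the write-ahead logging technique described above, the following Python sketch shows one possible shape of a logged key-value store. It is a minimal example under stated assumptions, not how any particular database implements its log: the class name TinyKV, the one-JSON-record-per-line log format, and the demo function are all hypothetical and chosen only for this example. Each change is appended to the log and forced to stable storage before it is applied in memory, and recovery replays complete records while discarding a torn record left by a crash in the middle of an append.

```python
import json
import os
import tempfile


class TinyKV:
    """Hypothetical in-memory key-value store backed by a write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()
        # Append mode: new records go after the records that survived.
        self.log = open(log_path, "ab")

    def _recover(self):
        """Replay complete records, then cut off any torn tail."""
        if not os.path.exists(self.log_path):
            return
        valid_end = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                # A record without a trailing newline was interrupted mid-write.
                if not raw.endswith(b"\n"):
                    break
                try:
                    record = json.loads(raw)
                except ValueError:
                    break
                self.data[record["key"]] = record["value"]
                valid_end += len(raw)
        # Truncate so that later appends do not follow a partial record.
        with open(self.log_path, "r+b") as f:
            f.truncate(valid_end)

    def put(self, key, value):
        line = json.dumps({"key": key, "value": value}).encode("utf-8") + b"\n"
        # 1. Record the intended change in the log.
        self.log.write(line)
        self.log.flush()
        # 2. Ask the OS to push the record to stable storage.
        os.fsync(self.log.fileno())
        # 3. Only now apply the change to the in-memory state.
        self.data[key] = value

    def close(self):
        self.log.close()


def demo():
    """Write two records, simulate a crash mid-append, then recover."""
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = TinyKV(path)
    store.put("a", 1)
    store.put("b", 2)
    store.close()

    # Simulated crash: a third record was only partially written.
    with open(path, "ab") as f:
        f.write(b'{"key": "c", "val')

    recovered = TinyKV(path)  # replays a and b, drops the torn record
    recovered.put("d", 4)
    recovered.close()

    final = TinyKV(path)
    state = dict(final.data)
    final.close()
    return state


if __name__ == "__main__":
    print(demo())
```

In this sketch, the os.fsync call on every put is where the latency trade-off mentioned above shows up: skipping or batching it would make writes faster but would weaken the durability guarantee. The log also grows without bound, which is the gap that checkpoints are meant to fill in a fuller design.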