Checkpointing

Checkpointing is a fault-tolerance technique used to save the state of a computation or system at predefined points so that execution can be resumed from that point after a failure or interruption. It is widely used in high-performance computing, long-running scientific simulations, databases, operating systems, and embedded systems to avoid complete recomputation and to shorten recovery times following outages.
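
As a minimal sketch of the save-and-resume idea described above, the following Python example periodically writes a small state dictionary to disk and picks up from the last saved state after a simulated failure. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter are all illustrative assumptions, not part of any particular checkpointing library.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over path.

    Writing to a temp file first means a crash during the write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every few steps.

    fail_at (hypothetical) simulates a crash at the given step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the most recent checkpoint rather than from step 0, so only the work done since that checkpoint is repeated.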

A checkpoint typically captures a consistent view of the program or process, including memory contents, processor registers, open file descriptors, and relevant ancillary state. The snapshot is stored on stable storage, such as disks or parallel file systems. Checkpoints can be full, saving the entire state, or incremental, recording only changes since the previous checkpoint. Some systems perform coordinated checkpoints across multiple processes; others rely on local or log-based recovery.

During recovery, the system restores from the last checkpoint and resumes execution. In distributed settings, recovery may require rolling back to a global checkpoint or reconstructing state from logs and checkpoints. Checkpointing introduces runtime overhead from creating and writing snapshots and from any synchronization cost. The optimal frequency balances the cost of checkpointing with the expected work lost in a failure.

Common techniques include synchronous (blocking) and asynchronous (background) checkpointing, multi-level checkpointing that uses faster memory-resident images and slower disk-stored images, and multi-process coordination. Some approaches pair checkpointing with write-ahead logging or data replay to enhance recoverability. In databases, checkpointing interacts with transaction logs to guarantee durability and consistency across commits and crashes.
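
The trade-off between checkpoint cost and expected lost work can be made concrete with a small calculation. The article does not name a formula; the sketch below uses Young's first-order approximation, which is one widely cited choice and assumes the checkpoint cost is small compared with the mean time between failures (MTBF). The function names and the example numbers are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints.

    interval = sqrt(2 * C * M), with C the time to write one checkpoint
    and M the mean time between failures, both in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of run time lost for a given interval.

    First term: time spent writing checkpoints.
    Second term: on average, half an interval is redone per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Example (assumed numbers): 60 s per checkpoint, one failure per day.
best = young_interval(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {best / 60:.0f} minutes")
```

Under this simplified model, checkpointing much more often than the estimate wastes time writing snapshots, while checkpointing much less often increases the work redone after each failure.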