Checkpointing
Checkpointing is a fault-tolerance technique that saves the state of a computation or system at predefined points so that, after a failure or interruption, execution can resume from the most recent saved state rather than from the beginning. It is widely used in high-performance computing, long-running scientific simulations, databases, operating systems, and embedded systems to avoid complete recomputation and to shorten recovery times following outages.
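The save-and-resume cycle described above can be illustrated with a minimal sketch. It assumes an application-level checkpoint of a small loop whose state fits in a JSON file. The names save_checkpoint, load_checkpoint, run_sum, and the fail_at parameter are hypothetical and chosen only for illustration; real systems capture far more state and use more elaborate storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run_sum(n, path, interval=100, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    `fail_at` is a hypothetical hook that simulates a crash at iteration i.
    """
    state = load_checkpoint(path) or {"next_i": 0, "total": 0}
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run is interrupted, calling run_sum again with the same path reloads the last saved state and repeats only the iterations performed after that checkpoint. Writing to a temporary file and renaming it is one common way to keep a crash during the save itself from corrupting the previous checkpoint.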
A checkpoint typically captures a consistent view of the program or process, including memory contents, processor
During recovery, the system restores from the last checkpoint and resumes execution. In distributed settings, recovery
Common techniques include synchronous (blocking) and asynchronous (background) checkpointing, multi-level checkpointing that uses faster memory-resident images