Towards Optimal Multi-Level Checkpointing

6 years ago

Business Description

The idea of multi-level checkpointing is that checkpoints are taken for each level of faults, but with different periods. Intuitively, the less frequent the faults at a level, the longer the checkpointing period for that level: the risk of a failure striking is lower at higher levels, hence the expected re-execution time is lower too, and one can safely checkpoint less frequently, thereby reducing failure-free overhead (checkpointing is useless in the absence of faults).

There are several natural approaches to implementing multi-level checkpointing. The first option is to use independent checkpointing periods for each level. This option raises several difficulties, the most prominent being overlapping checkpoints. Typically, checkpoints at different levels must be taken in sequence (e.g., writing into memory before writing onto disk), so some checkpoints would have to be delayed, which might not be possible in some environments and which would introduce irregular periods.

The second option is to synchronize all checkpoint levels by nesting them inside a periodic pattern that repeats over time, as illustrated in Fig. 1a. In this figure, the pattern has five computational segments, each followed by a level-1 checkpoint. A segment is a chunk of work between two checkpoints, and a pattern consists of segments and checkpoints. The second and fifth level-1 checkpoints are followed by a level-2 checkpoint. Finally, the pattern ends with a level-3 checkpoint. When using patterns, a checkpoint at level ℓ is always preceded by checkpoints at all lower levels 1 to ℓ − 1, which makes good sense in practice (e.g., with two levels, main memory and disk, one writes the data into memory before transferring it to disk).
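The nested pattern described above can be made concrete with a small sketch. The function name `build_pattern` and its input format (one number per segment, giving the highest checkpoint level taken after that segment) are assumptions made for this illustration and do not come from the paper. The sketch only encodes the nesting rule and the Fig. 1a layout as stated in the text.

```python
from collections import Counter


def build_pattern(highest_level_after):
    """Return one pattern as an ordered list of events.

    highest_level_after[i] is the highest checkpoint level taken after
    segment i + 1 (hypothetical input format, chosen for this sketch).
    """
    events = []
    for segment, top in enumerate(highest_level_after, start=1):
        if top < 1:
            raise ValueError("every segment must end with a level-1 checkpoint")
        events.append(("work", segment))
        # Nesting rule: a level-l checkpoint is preceded by levels 1 .. l-1.
        for level in range(1, top + 1):
            events.append(("ckpt", level))
    return events


# Layout described for Fig. 1a: five segments, a level-2 checkpoint after
# the second and fifth segments, and a level-3 checkpoint ending the pattern.
fig1a = build_pattern([1, 2, 1, 1, 3])

# Number of checkpoints taken at each level within one pattern.
per_level = Counter(level for kind, level in fig1a if kind == "ckpt")

for event in fig1a:
    print(event)
print(dict(per_level))
```

Because each higher-level checkpoint is emitted only after the lower-level ones for the same segment, checkpoints never overlap and the period at every level is a whole number of segments, which is the property the independent-period approach lacks.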