The Checkpoint and Resume System enables task runs to suspend execution, save their state (CPU/memory snapshots), and later resume from the saved checkpoint. This mechanism is primarily used for retry scenarios where a task can continue from a checkpoint instead of restarting from the beginning, reducing execution time and cost. The system coordinates checkpoint storage, run state transitions, concurrency management, and re-enqueuing of suspended runs.
For information about the broader run lifecycle, see 4.2. For retry logic and error handling, see 4.8.
The checkpoint system is implemented as one of several subsystems within the RunEngine architecture. It coordinates with the execution snapshot system to track run state, the enqueue system to re-queue suspended runs, and the run queue to manage concurrency.
Component Relationship Diagram
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts1-270 internal-packages/run-engine/src/engine/index.ts76-384
A checkpoint follows this lifecycle: creation during execution, suspension with state storage, and eventual resumption when the run is dequeued again.
Checkpoint Lifecycle Sequence
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts32-248 internal-packages/run-engine/src/engine/systems/checkpointSystem.ts251-313
The checkpoint system uses three primary database tables to track state.
| Table | Key Columns | Purpose |
|---|---|---|
| TaskRun | id, status, lockedById | Stores run metadata and current status (WAITING_TO_RESUME when checkpointed) |
| TaskRunCheckpoint | id, friendlyId, type, location, imageRef, reason | Stores checkpoint metadata and storage location |
| TaskRunExecutionSnapshot | id, runId, executionStatus, checkpointId, isValid | Tracks execution state transitions with optional checkpoint reference |
Checkpoint Data Model
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts163-173 internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts1-35
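The three tables can be sketched as TypeScript shapes. The field names follow the key columns listed above; the field types, the `Row` type names, and the `latestCheckpointSnapshot` helper are assumptions for illustration, not the actual schema.

```typescript
// Hypothetical shapes mirroring the key columns in the table above.
// Field types are assumptions; the real schema may differ.
interface TaskRunRow {
  id: string;
  status: string; // e.g. "WAITING_TO_RESUME" when checkpointed
  lockedById: string | null;
}

interface TaskRunCheckpointRow {
  id: string;
  friendlyId: string;
  type: string;
  location: string;
  imageRef: string | null;
  reason: string | null;
}

interface TaskRunExecutionSnapshotRow {
  id: string;
  runId: string;
  executionStatus: string;
  checkpointId: string | null; // optional checkpoint reference
  isValid: boolean;
}

// Illustrative helper: return the most recent valid snapshot of a run that
// references a checkpoint. Assumes `snapshots` is ordered oldest to newest.
function latestCheckpointSnapshot(
  runId: string,
  snapshots: TaskRunExecutionSnapshotRow[]
): TaskRunExecutionSnapshotRow | undefined {
  for (const snapshot of [...snapshots].reverse()) {
    if (snapshot.runId === runId && snapshot.isValid && snapshot.checkpointId !== null) {
      return snapshot;
    }
  }
  return undefined;
}
```

A snapshot without a `checkpointId`, or one marked invalid, is skipped, which matches the "optional checkpoint reference" role described in the table.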
The CheckpointSystem class is the primary coordinator for checkpoint operations. It is initialized by the RunEngine and works with other subsystems.
Class Structure:
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts15-30
The checkpoint metadata is passed from the worker process after it has stored the actual checkpoint data:
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts1
The checkpoint system behavior is controlled by environment variables:
| Variable | Default | Description |
|---|---|---|
| CHECKPOINT_THRESHOLD_IN_MS | 30000 (30s) | Minimum execution time before checkpoint is considered for retry |
| RUN_ENGINE_RETRY_WARM_START_THRESHOLD_MS | 30000 (30s) | Threshold for enabling warm start from checkpoint on retry |
Sources: apps/webapp/app/env.server.ts405 apps/webapp/app/v3/runEngine.server.ts120
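As a rough illustration of how these two variables and their defaults might be read, the sketch below parses them from a plain key/value record. Only the variable names and the 30000 ms defaults come from the table; `parseMs`, `loadCheckpointThresholds`, and the fallback behavior on malformed values are assumptions, and the webapp's own environment parsing may work differently.

```typescript
// Hypothetical config loader. Only the variable names and defaults are taken
// from the table above; everything else is an assumption for illustration.
interface CheckpointThresholds {
  checkpointThresholdInMs: number;
  retryWarmStartThresholdMs: number;
}

function parseMs(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Fall back on anything that is not a non-negative finite number.
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function loadCheckpointThresholds(
  env: Record<string, string | undefined>
): CheckpointThresholds {
  return {
    checkpointThresholdInMs: parseMs(env["CHECKPOINT_THRESHOLD_IN_MS"], 30000),
    retryWarmStartThresholdMs: parseMs(env["RUN_ENGINE_RETRY_WARM_START_THRESHOLD_MS"], 30000),
  };
}
```

Calling `loadCheckpointThresholds({})` yields both defaults, so an unconfigured deployment behaves as the table describes.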
When a checkpoint is created, the system must validate the current state, update the database, and handle re-enqueuing based on execution status.
createCheckpoint() Method Flow
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts36-248
The system validates snapshots to ensure checkpoints are created at the correct time:
Validation Rules:
- The provided snapshotId matches the current snapshot, OR
- The provided snapshotId matches the previous snapshot AND the current status is QUEUED_EXECUTING

If validation fails, the checkpoint is discarded and an event is emitted:
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts56-114 internal-packages/run-engine/src/engine/statuses.ts
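The two validation rules can be expressed as a single predicate. This is a minimal sketch: the `SnapshotRef` shape, the `previousSnapshotId` field name, and the function name are hypothetical and chosen only to make the rules concrete.

```typescript
// Minimal snapshot shape for this sketch; `previousSnapshotId` is a
// hypothetical field name, and the real snapshot carries more data.
interface SnapshotRef {
  id: string;
  executionStatus: string;
  previousSnapshotId: string | null;
}

function isCheckpointSnapshotValid(
  providedSnapshotId: string,
  current: SnapshotRef
): boolean {
  // Rule 1: the provided id matches the current snapshot.
  if (providedSnapshotId === current.id) return true;

  // Rule 2: the provided id matches the previous snapshot and the
  // current status is QUEUED_EXECUTING.
  return (
    current.executionStatus === "QUEUED_EXECUTING" &&
    current.previousSnapshotId !== null &&
    providedSnapshotId === current.previousSnapshotId
  );
}
```

The second rule covers the case where the run was queued for continuation while the worker was still producing a checkpoint against the older snapshot.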
When a checkpointed run is dequeued and ready to execute again, the continueRunExecution() method coordinates the transition back to executing state.
continueRunExecution() Method Flow
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts254-313
When transitioning from SUSPENDED to EXECUTING, the system notifies the worker process via the event bus:
This allows the worker to immediately react to state changes and begin restoring from the checkpoint.
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts280
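The notification step might look roughly like the sketch below. The event bus here is a small hand-rolled stand-in, and the payload fields and the names `TinyEventBus`, `WorkerNotification`, and `notifyOnResume` are all assumptions; only the SUSPENDED to EXECUTING trigger is taken from the description above.

```typescript
// Hypothetical payload; the real event and its fields may differ.
interface WorkerNotification {
  runId: string;
  snapshotId: string;
  executionStatus: string;
}

type NotificationListener = (payload: WorkerNotification) => void;

// Tiny stand-in for the engine's event bus.
class TinyEventBus {
  private listeners: NotificationListener[] = [];

  on(listener: NotificationListener): void {
    this.listeners.push(listener);
  }

  emit(payload: WorkerNotification): void {
    for (const listener of this.listeners) listener(payload);
  }
}

// Emit a notification only when a run moves from SUSPENDED back to EXECUTING.
// Returns true when a notification was sent.
function notifyOnResume(
  bus: TinyEventBus,
  runId: string,
  previousStatus: string,
  nextSnapshot: { id: string; executionStatus: string }
): boolean {
  if (previousStatus === "SUSPENDED" && nextSnapshot.executionStatus === "EXECUTING") {
    bus.emit({
      runId,
      snapshotId: nextSnapshot.id,
      executionStatus: nextSnapshot.executionStatus,
    });
    return true;
  }
  return false;
}
```

A worker-side listener registered with `bus.on(...)` would receive the new snapshot id and could start restoring from the checkpoint without polling.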
The checkpoint system introduces two new run statuses and works with several execution statuses.
State Machine for Checkpointed Runs
| Run Status | Description |
|---|---|
| WAITING_TO_RESUME | TaskRun is checkpointed and waiting to be resumed. Set when checkpoint is created. |
| EXECUTING | TaskRun is currently executing (may or may not have a checkpoint) |
| COMPLETED_SUCCESSFULLY | TaskRun finished successfully |
| SYSTEM_FAILURE | TaskRun failed |
| Execution Status | Description |
|---|---|
| EXECUTING | Run is actively executing in a worker |
| QUEUED_EXECUTING | Run is executing but was already queued for continuation (edge case) |
| SUSPENDED | Run is suspended with checkpoint, waiting to be re-queued or resumed |
| QUEUED | Run is in the queue waiting to be picked up |
| DEQUEUED | Run has been dequeued and will start soon |
| PENDING_EXECUTING | Run is transitioning to execution |
Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts117-247 internal-packages/run-engine/src/engine/statuses.ts
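The checkpoint-related transitions can be sketched as two small functions. Only transitions that this page states or implies are encoded (EXECUTING to SUSPENDED and QUEUED_EXECUTING to QUEUED on checkpoint, SUSPENDED to EXECUTING on resume). Returning `null` for every other status is an assumption, and the function names are hypothetical.

```typescript
type ExecStatus =
  | "EXECUTING"
  | "QUEUED_EXECUTING"
  | "SUSPENDED"
  | "QUEUED"
  | "DEQUEUED"
  | "PENDING_EXECUTING";

// Execution status after a checkpoint is accepted, or null when this sketch
// does not model a checkpoint transition from the current status.
function statusAfterCheckpoint(current: ExecStatus): ExecStatus | null {
  switch (current) {
    case "EXECUTING":
      return "SUSPENDED"; // run is suspended with its checkpoint
    case "QUEUED_EXECUTING":
      return "QUEUED"; // run was already queued for continuation
    default:
      return null;
  }
}

// Execution status after a checkpointed run is continued.
function statusAfterContinue(current: ExecStatus): ExecStatus | null {
  return current === "SUSPENDED" ? "EXECUTING" : null;
}
```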
The checkpoint system integrates with the retry mechanism to enable "warm start" retries. When a task fails and is eligible for retry, the system checks if enough time has elapsed to use the checkpoint.
Retry Decision with Checkpoint
The threshold is configured via:
- RUN_ENGINE_RETRY_WARM_START_THRESHOLD_MS (default: 30000ms)
- retryWarmStartThresholdMs

When retryWarmStartThresholdMs is set, the RunAttemptSystem will attempt to use an existing checkpoint for retry if the execution duration exceeded the threshold.
Sources: apps/webapp/app/v3/runEngine.server.ts120 internal-packages/run-engine/src/engine/types.ts92 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts70
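Following the description above, the retry decision reduces to a threshold comparison. This is a sketch under stated assumptions: the function name, the `hasCheckpoint` parameter, and the strict "greater than" comparison are illustrative choices, not the RunAttemptSystem's actual logic.

```typescript
// Hypothetical decision helper for warm-start retries.
function shouldUseWarmStart(
  executionDurationMs: number,
  retryWarmStartThresholdMs: number | undefined,
  hasCheckpoint: boolean
): boolean {
  // Warm start is only considered when the threshold is configured.
  if (retryWarmStartThresholdMs === undefined) return false;
  // Without a stored checkpoint there is nothing to resume from.
  if (!hasCheckpoint) return false;
  // Assumption: "exceeded" is read as strictly greater than the threshold.
  return executionDurationMs > retryWarmStartThresholdMs;
}
```

With the default threshold, `shouldUseWarmStart(45000, 30000, true)` is true, while a run that failed after 10 seconds would restart from the beginning.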
A critical aspect of the checkpoint system is proper concurrency management. When a run is checkpointed and suspended, its concurrency slot must be released so other runs can execute.
Concurrency Release Process:
This ensures:
When the run is resumed:
RunQueue

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts195-202 internal-packages/run-engine/src/engine/systems/checkpointSystem.ts233-239
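The effect of releasing a concurrency slot on suspension can be illustrated with an in-memory tracker. `ConcurrencyTracker` and its methods are hypothetical stand-ins and do not reflect the RunQueue's real API or storage.

```typescript
// In-memory stand-in for concurrency tracking (hypothetical API).
class ConcurrencyTracker {
  private active = new Set<string>();
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Try to take a slot for a run; returns false when the limit is reached.
  tryAcquire(runId: string): boolean {
    if (this.active.has(runId)) return true;
    if (this.active.size >= this.limit) return false;
    this.active.add(runId);
    return true;
  }

  // Release the slot, e.g. when the run is checkpointed and suspended.
  release(runId: string): void {
    this.active.delete(runId);
  }

  inUse(): number {
    return this.active.size;
  }
}
```

With a limit of 1, a second run cannot start while the first holds the slot. Once the first run is suspended and `release` is called, the second run can acquire the slot, and the suspended run has to acquire a slot again when it resumes.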
When createCheckpoint() receives an invalid snapshot:
- If the snapshotId doesn't match the current or previous snapshot (when applicable), the checkpoint is discarded
- If the run is already FINISHED, FAILED, or CANCELLED, the checkpoint is rejected

In both cases, the system emits an incomingCheckpointDiscarded event and returns { ok: false, error: string }.
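The two rejection cases and the resulting return value can be sketched as follows. The event name and the `{ ok: false, error: string }` shape come from the text above; the `{ ok: true }` success shape, the payload fields, the error messages, and the function and type names are assumptions.

```typescript
// Failure shape is from the text; the success shape is an assumption.
type CreateCheckpointResult = { ok: true } | { ok: false; error: string };

// Hypothetical payload for the discarded event.
interface DiscardedPayload {
  runId: string;
  reason: string;
}

type EmitDiscarded = (
  eventName: "incomingCheckpointDiscarded",
  payload: DiscardedPayload
) => void;

const TERMINAL_STATUSES = ["FINISHED", "FAILED", "CANCELLED"];

function checkIncomingCheckpoint(
  runId: string,
  currentStatus: string,
  snapshotMatches: boolean,
  emit: EmitDiscarded
): CreateCheckpointResult {
  let reason: string | null = null;
  if (TERMINAL_STATUSES.indexOf(currentStatus) !== -1) {
    reason = `Run is already ${currentStatus}`;
  } else if (!snapshotMatches) {
    reason = "snapshotId does not match the current or previous snapshot";
  }

  if (reason !== null) {
    // Both rejection paths emit the same event and return the same shape.
    emit("incomingCheckpointDiscarded", { runId, reason });
    return { ok: false, error: reason };
  }
  return { ok: true };
}
```

Callers can branch on `result.ok` and read `result.error` only in the failure case, since the union type narrows on the `ok` flag.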
The QUEUED_EXECUTING state represents a special edge case where a run is being continued while still executing. This can happen when:
When a checkpoint is created in QUEUED_EXECUTING state:
- A snapshot with execution status QUEUED is created

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts64-88 internal-packages/run-engine/src/engine/systems/checkpointSystem.ts175-208
The checkpoint system is extensively tested to ensure correct behavior across various scenarios.
Test Coverage:
Key test scenarios:
Sources: internal-packages/run-engine/src/engine/tests/checkpoints.test.ts1-300