This document describes the complete lifecycle of a task run execution, the state machine governing its progression, and the dual-status model that tracks both run-level and execution-level states. For information about how runs are queued and dequeued, see Queue Management. For details on the systems that orchestrate these state transitions, see Run Engine Architecture.
The run engine employs a dual-status tracking system to separate concerns between the overall run state and the granular execution state:
- TaskRunStatus: Stored on the TaskRun database record, it represents the overall lifecycle state of the run. This is the primary status visible to users and used for filtering and reporting.
- TaskRunExecutionStatus: Stored on TaskRunExecutionSnapshot records, it represents the detailed execution state at specific points in time. Each snapshot captures a moment in the run's execution, allowing the system to track progress, recover from failures, and coordinate distributed operations.
This separation allows the engine to maintain a simple, user-facing status (TaskRun.status) while internally tracking more granular execution states through immutable snapshots. Multiple snapshots can exist for a single run, forming a complete audit trail of execution progress.
Sources: internal-packages/database/prisma/schema.prisma, internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:1-450
The TaskRunStatus enum defines the high-level states a run can be in:
| Status | Description | Terminal | Next States |
|---|---|---|---|
| PENDING | Run is created and waiting to be queued | No | QUEUED, EXPIRED, PENDING_VERSION |
| DELAYED | Run is waiting for delayUntil time | No | PENDING, CANCELED |
| PENDING_VERSION | Waiting for a worker deployment to become available | No | QUEUED, SYSTEM_FAILURE |
| QUEUED | Enqueued and waiting to be dequeued by a worker | No | DEQUEUED, EXPIRED |
| DEQUEUED | Removed from queue, preparing to execute | No | EXECUTING |
| EXECUTING | Currently executing | No | WAITING_TO_RESUME, COMPLETED_SUCCESSFULLY, COMPLETED_WITH_ERRORS, CRASHED, SYSTEM_FAILURE, CANCELED |
| WAITING_TO_RESUME | Checkpointed and waiting to resume | No | EXECUTING, CANCELED |
| COMPLETED_SUCCESSFULLY | Completed without errors | Yes | None |
| COMPLETED_WITH_ERRORS | Completed but with errors after all retries | Yes | None |
| SYSTEM_FAILURE | Failed due to system/infrastructure issues | Yes | None |
| CRASHED | Worker crashed (OOM, segfault, etc.) | Yes | None |
| EXPIRED | TTL exceeded before execution | Yes | None |
| TIMED_OUT | Execution exceeded maxDuration | Yes | None |
| CANCELED | Explicitly canceled | Yes | None |
| INTERRUPTED | Interrupted (legacy, no longer used) | Yes | None |
Sources: internal-packages/run-engine/src/engine/statuses.ts:44-61, internal-packages/database/prisma/schema.prisma
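The status table can be read as a small transition map. The following is a minimal sketch that transcribes the Next States column into a lookup; the names RunStatus, NEXT_STATES, isTerminal, and canTransition are hypothetical and are not taken from statuses.ts.

```typescript
// Sketch only: statuses and next states are transcribed from the table above.
// RunStatus, NEXT_STATES, isTerminal and canTransition are hypothetical names.
type RunStatus =
  | "PENDING"
  | "DELAYED"
  | "PENDING_VERSION"
  | "QUEUED"
  | "DEQUEUED"
  | "EXECUTING"
  | "WAITING_TO_RESUME"
  | "COMPLETED_SUCCESSFULLY"
  | "COMPLETED_WITH_ERRORS"
  | "SYSTEM_FAILURE"
  | "CRASHED"
  | "EXPIRED"
  | "TIMED_OUT"
  | "CANCELED"
  | "INTERRUPTED";

const NEXT_STATES: Record<RunStatus, readonly RunStatus[]> = {
  PENDING: ["QUEUED", "EXPIRED", "PENDING_VERSION"],
  DELAYED: ["PENDING", "CANCELED"],
  PENDING_VERSION: ["QUEUED", "SYSTEM_FAILURE"],
  QUEUED: ["DEQUEUED", "EXPIRED"],
  DEQUEUED: ["EXECUTING"],
  EXECUTING: [
    "WAITING_TO_RESUME",
    "COMPLETED_SUCCESSFULLY",
    "COMPLETED_WITH_ERRORS",
    "CRASHED",
    "SYSTEM_FAILURE",
    "CANCELED",
  ],
  WAITING_TO_RESUME: ["EXECUTING", "CANCELED"],
  // Terminal statuses have no outgoing transitions.
  COMPLETED_SUCCESSFULLY: [],
  COMPLETED_WITH_ERRORS: [],
  SYSTEM_FAILURE: [],
  CRASHED: [],
  EXPIRED: [],
  TIMED_OUT: [],
  CANCELED: [],
  INTERRUPTED: [],
};

// A status is terminal when the table lists "None" as its next states.
function isTerminal(status: RunStatus): boolean {
  return NEXT_STATES[status].length === 0;
}

// True when the table lists `to` as a next state of `from`.
function canTransition(from: RunStatus, to: RunStatus): boolean {
  return NEXT_STATES[from].indexOf(to) !== -1;
}
```

A guard like canTransition could be used to reject status updates that the table does not list, though whether the engine enforces transitions this way is not stated in the source.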
The TaskRunExecutionStatus enum defines the detailed execution states tracked in snapshots:
| Status | Description | Dequeuable | Checkpointable | Holds Concurrency |
|---|---|---|---|---|
| RUN_CREATED | Initial snapshot created | No | Yes | No |
| QUEUED | Enqueued in run queue | Yes | Yes | Yes |
| QUEUED_EXECUTING | Queued while still executing (rare) | Yes | Yes | Yes |
| PENDING_EXECUTING | Dequeued, waiting for attempt to start | No | No | Yes |
| EXECUTING | Actively executing code | No | Yes | Yes |
| EXECUTING_WITH_WAITPOINTS | Executing but blocked on waitpoints | No | Yes | No (released) |
| SUSPENDED | Checkpointed and suspended | No | No | No (released) |
| FINISHED | Execution complete | No | No | No |
| PENDING_CANCEL | Cancellation requested | No | No | Yes |
Sources: internal-packages/run-engine/src/engine/statuses.ts:1-62, internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:140-156
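The three boolean columns lend themselves to simple predicate helpers. The sketch below encodes the table as a lookup; ExecutionStatus, TRAITS, isDequeuable, and isCheckpointable are illustrative names chosen here, and the real helpers in statuses.ts may be shaped differently.

```typescript
// Sketch only: values are copied from the table above; all names are hypothetical.
type ExecutionStatus =
  | "RUN_CREATED"
  | "QUEUED"
  | "QUEUED_EXECUTING"
  | "PENDING_EXECUTING"
  | "EXECUTING"
  | "EXECUTING_WITH_WAITPOINTS"
  | "SUSPENDED"
  | "FINISHED"
  | "PENDING_CANCEL";

interface StatusTraits {
  dequeuable: boolean;
  checkpointable: boolean;
  holdsConcurrency: boolean;
}

const TRAITS: Record<ExecutionStatus, StatusTraits> = {
  RUN_CREATED: { dequeuable: false, checkpointable: true, holdsConcurrency: false },
  QUEUED: { dequeuable: true, checkpointable: true, holdsConcurrency: true },
  QUEUED_EXECUTING: { dequeuable: true, checkpointable: true, holdsConcurrency: true },
  PENDING_EXECUTING: { dequeuable: false, checkpointable: false, holdsConcurrency: true },
  EXECUTING: { dequeuable: false, checkpointable: true, holdsConcurrency: true },
  EXECUTING_WITH_WAITPOINTS: { dequeuable: false, checkpointable: true, holdsConcurrency: false },
  SUSPENDED: { dequeuable: false, checkpointable: false, holdsConcurrency: false },
  FINISHED: { dequeuable: false, checkpointable: false, holdsConcurrency: false },
  PENDING_CANCEL: { dequeuable: false, checkpointable: false, holdsConcurrency: true },
};

// Only the two queued statuses may be picked up by a dequeue.
function isDequeuable(status: ExecutionStatus): boolean {
  return TRAITS[status].dequeuable;
}

// Whether a checkpoint may be created while the run is in this status.
function isCheckpointable(status: ExecutionStatus): boolean {
  return TRAITS[status].checkpointable;
}
```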
The following diagram shows all possible state transitions for a task run, including both TaskRunStatus (shown in bold) and TaskRunExecutionStatus (shown in context):
Sources: internal-packages/run-engine/src/engine/index.ts:336-579, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:294-822, internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:105-661
When a run is triggered via RunEngine.trigger(), it begins in one of two initial states:
| Initial State | Condition | Execution Status |
|---|---|---|
| DELAYED | delayUntil parameter provided | RUN_CREATED |
| PENDING | No delay, or after delay expires | RUN_CREATED → QUEUED |
The trigger operation creates:
- TaskRun record with the initial status
- TaskRunExecutionSnapshot with executionStatus: RUN_CREATED
- Waitpoint (type RUN) for parent tracking
- Either an enqueueDelayedRun job via DelayedRunSystem (delayed runs) or a direct call to EnqueueSystem.enqueueRun() (runs without a delay)

Sources: internal-packages/run-engine/src/engine/index.ts:339-579, internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts:100-166
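The choice of initial state can be sketched as a single branch on delayUntil. This is an illustration of the table above, not the body of RunEngine.trigger(); TriggerOptions, InitialState, initialState, and the nextStep labels are hypothetical.

```typescript
// Sketch only: models the initial-state decision described above.
// TriggerOptions, InitialState and initialState are hypothetical names.
interface TriggerOptions {
  delayUntil?: Date;
}

interface InitialState {
  runStatus: "DELAYED" | "PENDING";
  executionStatus: "RUN_CREATED";
  // Label for what happens next; not a real engine API.
  nextStep: "scheduleEnqueueDelayedRun" | "enqueueRun";
}

function initialState(options: TriggerOptions): InitialState {
  if (options.delayUntil) {
    // Delayed runs wait for a scheduled job before being enqueued.
    return {
      runStatus: "DELAYED",
      executionStatus: "RUN_CREATED",
      nextStep: "scheduleEnqueueDelayedRun",
    };
  }
  // Runs without a delay are enqueued straight away.
  return {
    runStatus: "PENDING",
    executionStatus: "RUN_CREATED",
    nextStep: "enqueueRun",
  };
}
```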
The EnqueueSystem transitions runs from initial states to the queue:
Enqueuing performs:
- Snapshot creation with executionStatus: QUEUED
- Timestamp calculation: (queueTimestamp ?? createdAt) - priorityMs
- Insertion into the RunQueue (Redis-backed)
- Scheduling of the expireRun job

Sources: internal-packages/run-engine/src/engine/systems/enqueueSystem.ts:25-104, internal-packages/run-queue/src/index.ts
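The timestamp expression can be worked through with concrete numbers. The sketch below assumes that priorityMs defaults to 0 when absent and that the queue orders runs by ascending timestamp, so a larger priorityMs moves a run earlier; both are assumptions, and EnqueueInput and queueTimestampMs are hypothetical names.

```typescript
// Sketch only: evaluates (queueTimestamp ?? createdAt) - priorityMs.
// Assumptions: priorityMs defaults to 0; lower values are dequeued first.
interface EnqueueInput {
  createdAt: Date;
  queueTimestamp?: Date;
  priorityMs?: number;
}

function queueTimestampMs(run: EnqueueInput): number {
  // Prefer an explicit queue timestamp, otherwise fall back to creation time.
  const base = (run.queueTimestamp ?? run.createdAt).getTime();
  // Subtracting the priority offset shifts the run earlier in time.
  return base - (run.priorityMs ?? 0);
}
```

For example, two runs created at the same instant differ only by their priority offset: a run with priorityMs of 5000 gets a timestamp 5 seconds earlier than one with no priority.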
The DequeueSystem removes runs from the queue and prepares them for execution:
Key validations during dequeue:
- Run is in the QUEUED or QUEUED_EXECUTING state

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:105-661, internal-packages/run-engine/src/engine/systems/dequeueSystem.ts:281-489
The RunAttemptSystem manages the execution lifecycle:
Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:294-628, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:630-822
Waitpoints allow runs to pause execution while waiting for external events. The WaitpointSystem manages these states:
When a run enters EXECUTING_WITH_WAITPOINTS:
- When a waitpoint completes, continueRunIfUnblocked() is triggered

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts:368-496, internal-packages/run-engine/src/engine/systems/waitpointSystem.ts:499-737
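The blocking and unblocking behaviour can be illustrated with a pure-function sketch. It assumes that a run returns to EXECUTING only once every blocking waitpoint has completed, which the name continueRunIfUnblocked() suggests but the source does not spell out. RunState, blockRun, and completeWaitpoint are hypothetical and only loosely mirror the WaitpointSystem methods.

```typescript
// Sketch only: in-memory model of waitpoint blocking.
// Assumption: the run resumes when no pending waitpoints remain.
interface RunState {
  executionStatus: "EXECUTING" | "EXECUTING_WITH_WAITPOINTS";
  pendingWaitpointIds: string[];
}

// Loosely mirrors blockRunWithWaitpoint(): the run now waits on one more waitpoint.
function blockRun(run: RunState, waitpointId: string): RunState {
  return {
    executionStatus: "EXECUTING_WITH_WAITPOINTS",
    pendingWaitpointIds: run.pendingWaitpointIds.concat(waitpointId),
  };
}

// Loosely mirrors continueRunIfUnblocked(): resume only when nothing is pending.
function completeWaitpoint(run: RunState, waitpointId: string): RunState {
  const remaining = run.pendingWaitpointIds.filter((id) => id !== waitpointId);
  return {
    executionStatus: remaining.length === 0 ? "EXECUTING" : "EXECUTING_WITH_WAITPOINTS",
    pendingWaitpointIds: remaining,
  };
}
```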
The CheckpointSystem allows runs to suspend execution and free all resources:
Checkpoint states:
- WAITING_TO_RESUME with SUSPENDED snapshot: Checkpoint exists, run is fully suspended

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts:36-249, internal-packages/run-engine/src/engine/systems/checkpointSystem.ts:254-365
Runs enter terminal states through RunAttemptSystem.completeRunAttempt():
Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:630-667, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:669-822, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts:824-1145
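As a rough illustration of how an attempt result could map to an outcome, the sketch below distinguishes success, retry, and final failure. The retry rule used here (retry while attemptNumber is below maxAttempts) is an assumption made for the example; AttemptOutcome, CompletionDecision, and decideCompletion are hypothetical and do not describe the actual logic of completeRunAttempt().

```typescript
// Sketch only: maps an attempt outcome to a run-level decision.
// Assumption: a failed attempt is retried while attempts remain.
type AttemptOutcome =
  | { kind: "success" }
  | { kind: "failure"; attemptNumber: number; maxAttempts: number };

type CompletionDecision = "COMPLETED_SUCCESSFULLY" | "COMPLETED_WITH_ERRORS" | "RETRY";

function decideCompletion(outcome: AttemptOutcome): CompletionDecision {
  if (outcome.kind === "success") {
    return "COMPLETED_SUCCESSFULLY";
  }
  // COMPLETED_WITH_ERRORS is only reached after all retries are used up.
  return outcome.attemptNumber < outcome.maxAttempts ? "RETRY" : "COMPLETED_WITH_ERRORS";
}
```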
TaskRunExecutionSnapshot records provide an immutable audit trail of run state changes:
Key properties:
- Each snapshot links to its predecessor via previousSnapshotId
- getLatestExecutionSnapshot() finds the most recent valid snapshot
- getExecutionSnapshotsSince() retrieves snapshot history

Snapshot creation flow:

- ExecutionSnapshotSystem.createExecutionSnapshot() is called by any system changing state
- The new snapshot records executionStatus and runStatus
- executionSnapshotCreated event is emitted for realtime updates

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:226-337, internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:95-113
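The snapshot chain behaves like an append-only log linked by previousSnapshotId. The in-memory sketch below illustrates that idea; SnapshotLog and its methods are hypothetical, the isValid flag is an assumed stand-in for whatever "valid" means in the real system, and the actual implementation persists snapshots in the database.

```typescript
// Sketch only: append-only snapshot log linked by previousSnapshotId.
// SnapshotLog, create, latest and since are hypothetical names.
interface Snapshot {
  readonly id: string;
  readonly runId: string;
  readonly previousSnapshotId: string | null;
  readonly executionStatus: string;
  readonly runStatus: string;
  readonly isValid: boolean; // assumed flag, not confirmed by the source
}

class SnapshotLog {
  private readonly snapshots: Snapshot[] = [];
  private nextId = 1;

  // Appends a frozen snapshot that points at the latest valid one for the run.
  create(runId: string, executionStatus: string, runStatus: string, isValid = true): Snapshot {
    const previous = this.latest(runId);
    const snapshot: Snapshot = Object.freeze({
      id: `snap_${this.nextId++}`,
      runId,
      previousSnapshotId: previous ? previous.id : null,
      executionStatus,
      runStatus,
      isValid,
    });
    this.snapshots.push(snapshot);
    return snapshot;
  }

  // Most recent valid snapshot for the run, if any.
  latest(runId: string): Snapshot | undefined {
    for (let i = this.snapshots.length - 1; i >= 0; i--) {
      const candidate = this.snapshots[i];
      if (candidate.runId === runId && candidate.isValid) return candidate;
    }
    return undefined;
  }

  // Snapshots for the run created after the given snapshot id.
  since(runId: string, snapshotId: string): Snapshot[] {
    const forRun = this.snapshots.filter((s) => s.runId === runId);
    const index = forRun.map((s) => s.id).indexOf(snapshotId);
    return index === -1 ? [] : forRun.slice(index + 1);
  }
}
```

Because existing entries are never modified, the log doubles as the audit trail described above: the current state is simply the latest entry.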
Different states consume or release concurrency tokens:
| State | Environment Concurrency | Queue Concurrency | Notes |
|---|---|---|---|
| RUN_CREATED | No | No | Not yet queued |
| QUEUED | Yes | Yes | Holds both tokens |
| PENDING_EXECUTING | Yes | Yes | Still holding while starting |
| EXECUTING | Yes | Yes | Active execution |
| EXECUTING_WITH_WAITPOINTS | No | No | Released during wait |
| SUSPENDED | No | No | Released after checkpoint |
| QUEUED_EXECUTING | Yes | Yes | Rare state during re-queue |
| FINISHED | No | No | Execution complete |
This design allows the system to maximize throughput:
- Runs blocked on waitpoints (EXECUTING_WITH_WAITPOINTS) don't block other runs
- Checkpointed runs (SUSPENDED) completely free resources for other work

The concurrency release happens via:

- RunQueue.releaseAllConcurrency(orgId, runId) for both levels
- Calls from CheckpointSystem and WaitpointSystem
- RunQueue.acknowledgeMessage() on completion

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts:195-203, internal-packages/run-engine/src/engine/systems/waitpointSystem.ts:431-440, internal-packages/run-queue/src/index.ts
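The throughput benefit of releasing tokens during waits can be shown with a toy slot counter. ConcurrencySlots is a hypothetical in-memory stand-in written for this example; it is not the RunQueue API and ignores the split between environment and queue concurrency.

```typescript
// Sketch only: a single-level concurrency limiter to illustrate token release.
// ConcurrencySlots is hypothetical and unrelated to the real RunQueue class.
class ConcurrencySlots {
  private readonly limit: number;
  private holders: string[] = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true when the run holds a slot after the call.
  tryAcquire(runId: string): boolean {
    if (this.holders.indexOf(runId) !== -1) return true;
    if (this.holders.length >= this.limit) return false;
    this.holders.push(runId);
    return true;
  }

  // Called when a run starts waiting, is suspended, or finishes.
  release(runId: string): void {
    this.holders = this.holders.filter((id) => id !== runId);
  }

  inUse(): number {
    return this.holders.length;
  }
}
```

With a limit of one, a second run cannot start while the first is executing, but it can start as soon as the first run releases its slot on entering a wait.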
The heartbeat system detects stalled runs and initiates recovery:
Heartbeat intervals by status (from HeartbeatTimeouts):
| Status | Default Timeout | Behavior on Timeout |
|---|---|---|
| PENDING_EXECUTING | 60,000 ms | Requeue (assume startup failure) |
| PENDING_CANCEL | 60,000 ms | Force cancel |
| EXECUTING | 60,000 ms | Mark as crashed (OOM/segfault) |
| EXECUTING_WITH_WAITPOINTS | 60,000 ms | Usually shouldn't time out |
| SUSPENDED | 600,000 ms | Retry heartbeat with exponential backoff |
The heartbeat mechanism:
- ExecutionSnapshotSystem.createExecutionSnapshot() schedules the initial heartbeat
- ExecutionSnapshotSystem.heartbeatRun() reschedules on a successful heartbeat
- heartbeatRun() is called periodically during execution
- On timeout, handleStalledSnapshot() is triggered

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:308-319, internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts:339-394, internal-packages/run-engine/src/engine/types.ts:89-95, internal-packages/run-engine/src/engine/index.ts:199-204
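Stall detection reduces to comparing the time since the last heartbeat against a per-status timeout. The sketch below uses the default values from the table; DEFAULT_TIMEOUTS_MS, isStalled, and the overrides parameter are hypothetical, and treating a run as stalled only when the elapsed time strictly exceeds the timeout is an assumption.

```typescript
// Sketch only: per-status heartbeat timeouts taken from the table above.
// DEFAULT_TIMEOUTS_MS and isStalled are hypothetical names.
type HeartbeatStatus =
  | "PENDING_EXECUTING"
  | "PENDING_CANCEL"
  | "EXECUTING"
  | "EXECUTING_WITH_WAITPOINTS"
  | "SUSPENDED";

const DEFAULT_TIMEOUTS_MS: Record<HeartbeatStatus, number> = {
  PENDING_EXECUTING: 60000,
  PENDING_CANCEL: 60000,
  EXECUTING: 60000,
  EXECUTING_WITH_WAITPOINTS: 60000,
  SUSPENDED: 600000,
};

// Assumption: a run is stalled once the elapsed time strictly exceeds the timeout.
function isStalled(
  status: HeartbeatStatus,
  lastHeartbeatAtMs: number,
  nowMs: number,
  overrides: Partial<Record<HeartbeatStatus, number>> = {}
): boolean {
  const timeoutMs = overrides[status] ?? DEFAULT_TIMEOUTS_MS[status];
  return nowMs - lastHeartbeatAtMs > timeoutMs;
}
```

The overrides parameter reflects that the table lists default values, which implies they can be configured; how the engine accepts such configuration is not shown here.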
The following table maps each state transition to the responsible code entity:
| Transition | Triggered By | Primary System | Method |
|---|---|---|---|
| Initial → DELAYED | User trigger | RunEngine | trigger() |
| Initial → PENDING | User trigger | RunEngine | trigger() |
| DELAYED → PENDING | Scheduled job | DelayedRunSystem | enqueueDelayedRun() |
| PENDING → QUEUED | System | EnqueueSystem | enqueueRun() |
| QUEUED → PENDING_EXECUTING | Worker pull | DequeueSystem | dequeueFromWorkerQueue() |
| PENDING_EXECUTING → EXECUTING | Worker | RunAttemptSystem | startRunAttempt() |
| EXECUTING → EXECUTING_WITH_WAITPOINTS | Task code | WaitpointSystem | blockRunWithWaitpoint() |
| EXECUTING_WITH_WAITPOINTS → EXECUTING | Waitpoint completion | WaitpointSystem | continueRunIfUnblocked() |
| EXECUTING → SUSPENDED | Task code | CheckpointSystem | createCheckpoint() |
| SUSPENDED → QUEUED | Unblock/resume | EnqueueSystem | enqueueRun() |
| PENDING_EXECUTING → EXECUTING (resume) | Worker | CheckpointSystem | continueRunExecution() |
| EXECUTING → COMPLETED_SUCCESSFULLY | Task completion | RunAttemptSystem | attemptSucceeded() |
| EXECUTING → COMPLETED_WITH_ERRORS | Task error | RunAttemptSystem | attemptFailed() |
| EXECUTING → EXECUTING (retry) | Task error | RunAttemptSystem | attemptFailed() + startRunAttempt() |
| EXECUTING → CRASHED | Heartbeat timeout | RunAttemptSystem | handleStalledSnapshot() |
| EXECUTING → PENDING_CANCEL | User cancellation | RunAttemptSystem | cancelRun() |
| PENDING_CANCEL → CANCELED | Worker ACK | RunAttemptSystem | cancelRun() (finalize) |
| PENDING → EXPIRED | TTL timeout | TtlSystem | expireRun() |
| PENDING → PENDING_VERSION | No deployment | DequeueSystem | #pendingVersion() |
All state transitions occur within a distributed lock acquired via RunLocker to ensure consistency across the cluster.
Sources: internal-packages/run-engine/src/engine/index.ts:337-579, internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts, internal-packages/run-engine/src/engine/systems/dequeueSystem.ts, internal-packages/run-engine/src/engine/systems/waitpointSystem.ts, internal-packages/run-engine/src/engine/systems/checkpointSystem.ts, internal-packages/run-engine/src/engine/systems/ttlSystem.ts, internal-packages/run-engine/src/engine/locking.ts