Task Execution Engine

Purpose and Scope

The Task Execution Engine is the core orchestration system responsible for managing the complete lifecycle of task runs from trigger to completion. It coordinates distributed execution across worker nodes, manages state transitions, enforces concurrency limits, handles retries, and provides specialized features like checkpoints (for resuming long-running tasks) and waitpoints (for blocking until external conditions are met).

This document covers the architecture and core systems of the engine. For information about:

Task definition and SDK triggering APIs, see Trigger.dev SDK
High-level task triggering service implementations, see Task Triggering Services
Database schema for runs and execution snapshots, see Task and Run Models
Deployment and worker infrastructure, see Development and Deployment

Architecture Overview

The Task Execution Engine is implemented by the RunEngine class, which composes 11 specialized subsystems to handle different aspects of task execution.

RunEngine Class Structure

Sources: internal-packages/run-engine/src/engine/index.ts76-384

System Initialization and Configuration

The RunEngine constructor accepts a comprehensive configuration object defining all subsystem parameters:

Configuration Category	Key Options	Purpose
`prisma`	`PrismaClient`, optional `readOnlyPrisma`	Database persistence and read replicas
`worker`	`redis`, `workers`, `tasksPerWorker`, `pollIntervalMs`	Background job processing configuration
`queue`	`redis`, `defaultEnvConcurrency`, `shardCount`	Fair queue selection and hierarchical queue management
`runLock`	`redis`, `duration`, `automaticExtensionThreshold`, `retryConfig`	Distributed locking via Redlock algorithm
`machines`	`defaultMachine`, `machines`, `baseCostInCents`	Machine preset definitions for different execution environments
`heartbeatTimeoutsMs`	`PENDING_EXECUTING`, `EXECUTING`, `SUSPENDED`, etc.	Per-state timeouts for stall detection
`tracer`, `meter`	OpenTelemetry instances	Distributed tracing and metrics instrumentation

Sources: internal-packages/run-engine/src/engine/types.ts23-107 internal-packages/run-engine/src/engine/index.ts106-384
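
As a rough illustration of how these categories fit together, the sketch below declares a simplified options shape. The type names `EngineOptionsSketch` and `RedisSketch`, the Redis connection shape, and the worker and queue values are assumptions made for illustration and are not taken from the source. The lock and heartbeat values mirror the defaults listed later on this page. `prisma`, `machines`, `tracer`, and `meter` are omitted to keep the example dependency-free.

```typescript
// Hypothetical Redis connection shape; the real option type may differ.
type RedisSketch = { host: string; port: number };

// Simplified subset of the constructor options described in the table above.
type EngineOptionsSketch = {
  worker: { redis: RedisSketch; workers: number; tasksPerWorker: number; pollIntervalMs: number };
  queue: { redis: RedisSketch; defaultEnvConcurrency: number; shardCount: number };
  runLock: {
    redis: RedisSketch;
    duration: number;
    automaticExtensionThreshold: number;
    retryConfig: { maxAttempts: number; baseDelay: number; maxDelay: number };
  };
  heartbeatTimeoutsMs: Record<string, number>;
};

const localRedisSketch: RedisSketch = { host: "localhost", port: 6379 };

const engineOptionsSketch: EngineOptionsSketch = {
  // Worker and queue values are assumed, not documented defaults.
  worker: { redis: localRedisSketch, workers: 1, tasksPerWorker: 10, pollIntervalMs: 100 },
  queue: { redis: localRedisSketch, defaultEnvConcurrency: 10, shardCount: 2 },
  // Lock values follow the Retry Configuration table on this page.
  runLock: {
    redis: localRedisSketch,
    duration: 5000,
    automaticExtensionThreshold: 1000,
    retryConfig: { maxAttempts: 10, baseDelay: 100, maxDelay: 3000 },
  },
  // Heartbeat values follow the Stalled Run Detection table on this page.
  heartbeatTimeoutsMs: {
    PENDING_EXECUTING: 60000,
    EXECUTING: 300000,
    SUSPENDED: 600000,
  },
};
```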

Task Run Execution Flow

Complete Execution Pipeline

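A run moves through four stages between trigger and completion: the RunEngine creates it on trigger, the EnqueueSystem places it in the queue hierarchy, the DequeueSystem hands it to a worker, and the RunAttemptSystem starts and completes each attempt.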

Sources: internal-packages/run-engine/src/engine/index.ts389-727 internal-packages/run-engine/src/engine/systems/enqueueSystem.ts25-102 internal-packages/run-engine/src/engine/systems/dequeueSystem.ts105-770 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts294-628

Execution State Machine

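Task runs progress through execution states tracked by TaskRunExecutionSnapshot records. The executionStatus field tracks fine-grained internal state. The states referenced on this page are RUN_CREATED, DELAYED, QUEUED, PENDING_EXECUTING, EXECUTING, EXECUTING_WITH_WAITPOINTS, QUEUED_EXECUTING, SUSPENDED, PENDING_CANCEL, and FINISHED.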

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts226-387 internal-packages/run-engine/src/engine/statuses.ts

Core Execution Systems

ExecutionSnapshotSystem

The ExecutionSnapshotSystem maintains an immutable audit trail of all run state transitions by creating TaskRunExecutionSnapshot records.

Key Responsibilities:

Create new snapshots for every state transition
Link snapshots via previousSnapshotId to form an immutable chain
Validate that only the latest snapshot is used for state mutations
Track completed waitpoints and checkpoints at each state

Snapshot Data Model:

The getLatestExecutionSnapshot() helper retrieves the most recent valid snapshot for a run.

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts226-387 internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts95-113
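
The chain of snapshots linked by previousSnapshotId can be pictured with a small in-memory model. This is a minimal sketch, not the engine's implementation: `SnapshotSketch` and `SnapshotChainSketch` are hypothetical names, the real records live in the database, and the real TaskRunExecutionSnapshot carries many more fields.

```typescript
// Hypothetical, reduced snapshot shape for illustration only.
type SnapshotSketch = {
  id: string;
  runId: string;
  executionStatus: string;
  previousSnapshotId: string | null;
};

class SnapshotChainSketch {
  private readonly byId = new Map<string, SnapshotSketch>();
  private readonly latestByRun = new Map<string, string>();
  private counter = 0;

  // Appends a frozen snapshot that points back at the run's previous one.
  append(runId: string, executionStatus: string): SnapshotSketch {
    const previousSnapshotId: string | null = this.latestByRun.get(runId) ?? null;
    const snapshot: SnapshotSketch = Object.freeze({
      id: `snap_${++this.counter}`,
      runId,
      executionStatus,
      previousSnapshotId,
    });
    this.byId.set(snapshot.id, snapshot);
    this.latestByRun.set(runId, snapshot.id);
    return snapshot;
  }

  // Returns the most recent snapshot for a run, if any.
  latest(runId: string): SnapshotSketch | undefined {
    const id = this.latestByRun.get(runId);
    return id === undefined ? undefined : this.byId.get(id);
  }

  // Walks previousSnapshotId links from newest to oldest.
  history(runId: string): string[] {
    const statuses: string[] = [];
    let cursor = this.latest(runId);
    while (cursor) {
      statuses.push(cursor.executionStatus);
      cursor = cursor.previousSnapshotId ? this.byId.get(cursor.previousSnapshotId) : undefined;
    }
    return statuses;
  }
}
```

Because snapshots are only ever appended, walking the chain backwards reconstructs the full sequence of state transitions for a run.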

EnqueueSystem and DequeueSystem

EnqueueSystem

The EnqueueSystem handles placing runs into the queue hierarchy:

Creates a QUEUED execution snapshot
Calls runQueue.enqueueMessage(orgId, runId, message) to add to queues
Schedules TTL expiration job if configured
Emits runQueued event via EventBus

Sources: internal-packages/run-engine/src/engine/systems/enqueueSystem.ts25-102

DequeueSystem

The DequeueSystem retrieves runs for execution using fair selection:

Calls runQueue.dequeueMessageFromWorkerQueue() to get a fairly-selected run
Acquires distributed lock via runLocker.lock()
Validates run is in dequeueable state (QUEUED, RUN_CREATED)
Locks run to specific BackgroundWorkerTask and version:
- Sets lockedById to task ID
- Sets lockedToVersionId to worker version ID
- Sets taskVersion, sdkVersion, cliVersion from worker metadata
Updates TaskRun to DEQUEUED status
Creates PENDING_EXECUTING snapshot
Returns DequeuedMessage with execution context

The dequeue operation includes comprehensive validation to ensure the run can execute, checking for:

Valid worker deployment with image reference
Matching background worker if filtering by worker ID
Task exists in latest deployment
Queue configuration exists

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts88-770 internal-packages/run-engine/src/engine/systems/dequeueSystem.ts105-578

RunAttemptSystem

The RunAttemptSystem orchestrates the lifecycle of execution attempts and retry logic.

startRunAttempt()

Creates a new attempt and transitions to executing state:

Acquires run lock via runLocker.lock()
Validates snapshot ID matches latest
Increments attemptNumber on TaskRun (starts at 1)
Checks attemptNumber against MAX_TASK_RUN_ATTEMPTS constant
Updates run status to EXECUTING
Creates EXECUTING snapshot
Resolves complete execution context:
- Task metadata from cache (BackgroundWorkerTask)
- Queue configuration from cache (TaskQueue)
- Organization details from cache
- Project details from cache
- Machine preset based on run configuration
- Deployment information if applicable
Returns TaskRunExecution object with all context

The system maintains in-memory caches using @internal/cache with Redis backing to avoid database queries on each attempt start.

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts294-628 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts106-171

completeRunAttempt()

Handles attempt completion with success or failure:

Routes to attemptSucceeded() or attemptFailed() based on completion.ok
Acquires run lock
Validates snapshot ID
Updates run with completion result

For successful attempts (attemptSucceeded()):

Sets status to COMPLETED_SUCCESSFULLY
Stores output in output and outputType fields
Creates FINISHED snapshot
Completes associated waitpoint if run was triggered via triggerAndWait()
Notifies parent run if resumeParentOnCompletion is set
Emits runCompleted event

For failed attempts (attemptFailed()):

Calls retryOutcomeFromCompletion() to determine if retryable
If retryable and under maxAttempts:
- Calculates exponential backoff delay
- Re-enqueues run with delay
- Creates QUEUED snapshot
If not retryable or max attempts reached:
- Sets status to COMPLETED_WITH_ERRORS
- Creates FINISHED snapshot with error
- Handles parent run resumption with error

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts630-667 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts669-728 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts730-1019

Specialized Feature Systems

CheckpointSystem

The CheckpointSystem enables suspending long-running tasks to free resources while preserving execution state.

createCheckpoint()

Creates a checkpoint and suspends execution:

Acquires run lock via runLocker.lock("createCheckpoint")
Validates snapshot is latest or previous with QUEUED_EXECUTING state
Validates run is in checkpointable state:
- EXECUTING
- QUEUED_EXECUTING
Creates TaskRunCheckpoint record with:
- type: "DOCKER" or other checkpoint mechanism
- location: External storage location (e.g., S3 URL)
- imageRef: Container image reference for restoration
- reason: Optional explanation
Updates run status to WAITING_TO_RESUME
Creates SUSPENDED snapshot linking to checkpoint
Releases all concurrency held by run via runQueue.releaseAllConcurrency()
Returns checkpoint information and execution result

Special case for QUEUED_EXECUTING: If run is executing but also queued (rare race condition), the checkpoint causes the run to be re-enqueued rather than suspended.

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts36-249

continueRunExecution()

Resumes execution from a checkpoint:

Acquires run lock
Validates run is in SUSPENDED or QUEUED state
Creates new snapshot to continue from checkpoint
Re-enqueues run for execution via enqueueSystem.enqueueRun()
Returns execution result with continued state

The dequeue operation will detect the checkpoint reference and pass it to the worker for restoration.

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts254-319

WaitpointSystem

The WaitpointSystem enables blocking task execution until external conditions are satisfied.

Waitpoint Types:

Type	Completion Trigger	Use Case
`DATETIME`	Specified timestamp reached	`wait.for(duration)` delays
`MANUAL`	Explicit API call to complete	Human-in-the-loop workflows, external callbacks
`BATCH`	All runs in batch finish	Parent waiting for all batch children

blockRunWithWaitpoint()

Blocks a run on one or more waitpoints:

Acquires run lock via runLocker.lock("blockRunWithWaitpoint")
Gets latest execution snapshot
Uses raw SQL to atomically:
- Create TaskRunWaitpoint junction records
- Create _WaitpointRunConnections many-to-many records
- Count pending (incomplete) waitpoints blocking the run
Determines new execution state:
- If executing and now blocked: EXECUTING_WITH_WAITPOINTS
- If suspended and now blocked: SUSPENDED
Creates new snapshot if state changed
Notifies worker via EventBus to suspend if currently executing
Schedules timeout job if timeout specified
If no pending waitpoints, schedules immediate continueRunIfUnblocked job

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts368-497

completeWaitpoint()

Marks a waitpoint as completed and unblocks affected runs:

Updates Waitpoint status to COMPLETED via updateMany (idempotent)
Stores optional output data
Queries all TaskRunWaitpoint records for this waitpoint
For each affected run:
- Enqueues continueRunIfUnblocked job with 50ms delay
- Emits cachedRunCompleted event if spanIdToComplete present

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts70-172

continueRunIfUnblocked()

Checks if a run is unblocked and continues execution:

Acquires run lock via runLocker.lock("continueRunIfUnblocked")
Queries all TaskRunWaitpoint records for the run
Checks if any waitpoints are still PENDING
If still blocked: returns { status: "blocked" } and exits
If unblocked:
- Gets current execution state
- If EXECUTING_WITH_WAITPOINTS: Creates QUEUED_EXECUTING snapshot and notifies worker
- If SUSPENDED: Re-enqueues run via enqueueSystem.enqueueRun()
- If already in another state: Returns skipped

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts499-709
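
The branching described above can be condensed into a pure decision function. This is a sketch under stated assumptions: `decideContinueSketch`, the `continued` status, and the `action` labels are hypothetical names introduced here; only `{ status: "blocked" }` and the skipped outcome are named in the text. Locking, snapshot creation, and worker notification are left out.

```typescript
type WaitpointStatusSketch = "PENDING" | "COMPLETED";

// Hypothetical result shape; only "blocked" and "skipped" are named in the text above.
type ContinueOutcomeSketch =
  | { status: "blocked" }
  | { status: "continued"; action: "notify_worker" | "re_enqueue" }
  | { status: "skipped" };

function decideContinueSketch(
  executionStatus: string,
  waitpointStatuses: WaitpointStatusSketch[]
): ContinueOutcomeSketch {
  // Any pending waitpoint keeps the run blocked.
  if (waitpointStatuses.some((s) => s === "PENDING")) {
    return { status: "blocked" };
  }
  // Still running on a worker: move to QUEUED_EXECUTING and notify the worker.
  if (executionStatus === "EXECUTING_WITH_WAITPOINTS") {
    return { status: "continued", action: "notify_worker" };
  }
  // Suspended: the run goes back through the queue.
  if (executionStatus === "SUSPENDED") {
    return { status: "continued", action: "re_enqueue" };
  }
  // Any other state means there is nothing to do.
  return { status: "skipped" };
}
```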

DelayedRunSystem

The DelayedRunSystem handles runs scheduled for future execution via the delay option.

scheduleDelayedRunEnqueuing()

When a run is triggered with delayUntil:

Creates TaskRun in DELAYED status
Creates DELAYED snapshot
Enqueues background job: enqueueDelayedRun:${runId}
Job scheduled to fire at delayUntil timestamp

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts56-133

enqueueDelayedRun()

Background job handler that fires at scheduled time:

Acquires run lock
Validates run is still in DELAYED state
Gets latest snapshot
Calls enqueueSystem.enqueueRun() to queue the run
Creates QUEUED snapshot

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts135-191

rescheduleDelayedRun()

Allows changing the delay time before a run executes:

Validates run is in DELAYED state
Updates delayUntil timestamp on TaskRun
Reschedules background job via worker.reschedule()
Creates new DELAYED snapshot with updated description
Emits runDelayRescheduled event

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts26-95

Distributed Locking

RunLocker Implementation

The RunLocker class provides distributed mutual exclusion using the Redlock algorithm over Redis.

Architecture:

Key Features:

Automatic Lock Extension: Locks automatically extend before expiration if operation is still running
Nested Lock Support: Uses AsyncLocalStorage to detect if lock already held in current async context
Exponential Backoff Retry: Configurable retry with jitter to prevent thundering herd
Comprehensive Observability: Emits OpenTelemetry metrics and traces for lock operations

Sources: internal-packages/run-engine/src/engine/locking.ts70-599

lock() Method Signature:

Retry Configuration:

Parameter	Default	Purpose
`maxAttempts`	10	Maximum lock acquisition attempts
`baseDelay`	100ms	Initial retry delay
`maxDelay`	3000ms	Maximum retry delay cap
`backoffMultiplier`	1.8	Exponential backoff factor
`jitterFactor`	0.15	Randomization (±15%) to prevent synchronized retries
`maxTotalWaitTime`	15000ms	Total timeout for all retry attempts
`duration`	5000ms	Lock TTL before automatic expiration
`automaticExtensionThreshold`	1000ms	Start extending when <1s remains

Sources: internal-packages/run-engine/src/engine/locking.ts55-68 internal-packages/run-engine/src/engine/index.ts124-140
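
To make the retry parameters concrete, the following sketch computes the delay before a given retry using the defaults from the table. It is an illustration rather than the RunLocker code: the function name `lockRetryDelaySketch`, the 0-based attempt index, and the order in which the cap and the jitter are applied are all assumptions. The random source is injectable so the result can be made deterministic.

```typescript
type LockRetryConfigSketch = {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitterFactor: number;
  maxTotalWaitTime: number;
};

// Defaults copied from the Retry Configuration table above.
const defaultLockRetrySketch: LockRetryConfigSketch = {
  maxAttempts: 10,
  baseDelay: 100,
  maxDelay: 3000,
  backoffMultiplier: 1.8,
  jitterFactor: 0.15,
  maxTotalWaitTime: 15000,
};

// Delay in ms before retry `attempt` (0-based, an assumption of this sketch).
function lockRetryDelaySketch(
  attempt: number,
  cfg: LockRetryConfigSketch = defaultLockRetrySketch,
  random: () => number = Math.random
): number {
  // Exponential growth, capped at maxDelay.
  const exponential = Math.min(
    cfg.baseDelay * Math.pow(cfg.backoffMultiplier, attempt),
    cfg.maxDelay
  );
  // Symmetric jitter in the range [-jitterFactor, +jitterFactor].
  const jitter = exponential * cfg.jitterFactor * (random() * 2 - 1);
  return Math.max(0, Math.round(exponential + jitter));
}
```

With the jitter neutralised (a random source that always returns 0.5), the first delays grow as 100 ms, 180 ms, 324 ms and eventually level off at the 3000 ms cap.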

Locking Patterns

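All state-changing operations follow the same pattern: acquire the run lock, fetch the latest snapshot and validate it against the snapshot ID supplied by the caller, apply the mutation, create a new snapshot, and only then release the lock.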

This pattern ensures:

Atomicity: Lock prevents concurrent modifications
Consistency: Snapshot validation prevents lost updates
Isolation: Each operation sees a consistent view of run state
Durability: New snapshot persisted before lock release

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts294-628 internal-packages/run-engine/src/engine/systems/checkpointSystem.ts53-90 internal-packages/run-engine/src/engine/systems/waitpointSystem.ts397-496
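
The lock, validate, mutate, snapshot sequence used by the operations on this page can be sketched with an in-memory stand-in. This is a simplified, synchronous model: `InMemoryRunStoreSketch` and its methods are hypothetical, the lock is a local Set rather than Redlock over Redis, and the real operations are asynchronous and persist to the database.

```typescript
type RunStateSketch = { latestSnapshotId: string; executionStatus: string };

class InMemoryRunStoreSketch {
  private readonly runs = new Map<string, RunStateSketch>();
  private readonly heldLocks = new Set<string>();
  private counter = 0;

  // Creates a run with an initial snapshot and returns the snapshot ID.
  seed(runId: string, executionStatus: string): string {
    const id = `snap_${++this.counter}`;
    this.runs.set(runId, { latestSnapshotId: id, executionStatus });
    return id;
  }

  get(runId: string): RunStateSketch | undefined {
    return this.runs.get(runId);
  }

  // Lock -> validate snapshot -> mutate -> record new snapshot -> release.
  transition(runId: string, expectedSnapshotId: string, nextStatus: string): string {
    if (this.heldLocks.has(runId)) throw new Error(`Run ${runId} is already locked`);
    this.heldLocks.add(runId);
    try {
      const run = this.runs.get(runId);
      if (!run) throw new Error(`Unknown run ${runId}`);
      // A caller holding an outdated snapshot ID must not mutate the run.
      if (run.latestSnapshotId !== expectedSnapshotId) {
        throw new Error("Snapshot is stale; refusing to mutate");
      }
      const newSnapshotId = `snap_${++this.counter}`;
      this.runs.set(runId, { latestSnapshotId: newSnapshotId, executionStatus: nextStatus });
      return newSnapshotId;
    } finally {
      // The lock is released whether the operation succeeded or threw.
      this.heldLocks.delete(runId);
    }
  }
}
```

The snapshot check is what rejects a lost update: a second caller that read the run before the first transition completes will present an old snapshot ID and fail instead of overwriting newer state.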

Queue Management

Hierarchical Queue Structure

The RunQueue implements a three-tier hierarchy for fair task distribution:

Queue Tiers:

Master Queue: Single queue per deployment type (e.g., "prod")
Environment Queues: One per runtime environment (e.g., "env:prod-env-123")
Task Queues: One per task within an environment (e.g., "task:email-sender")

Each tier is a Redis sorted set with scores representing queue timestamps for fairness.

Sources: internal-packages/run-engine/src/run-queue/index.ts

FairQueueSelectionStrategy

The FairQueueSelectionStrategy selects which environment queue to dequeue from using weighted scoring:

Score Formula:

score = (concurrencyLimitBias × concurrencyScore) + 
        (availableCapacityBias × capacityScore) + 
        (queueAgeRandomization × random())

Score Components:

Component	Default Weight	Calculation	Purpose
Concurrency Limit Bias	0.75	`available / limit`	Prioritize environments with more capacity headroom
Available Capacity Bias	0.3	`1 - (queueSize / limit)`	Factor in current queue depth
Queue Age Randomization	0.25	`random()`	Prevent starvation via randomness

The strategy takes a snapshot of environment states and can reuse it for a configurable number of dequeues to reduce Redis queries. The default reuse count is 0, which refreshes the snapshot on every dequeue.

Sources: internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts apps/webapp/app/v3/runEngine.server.ts58-67
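
The score formula and default weights above translate into a few lines of arithmetic. The sketch below is illustrative only: `envScoreSketch`, `pickEnvSketch`, and the `EnvStateSketch` fields are hypothetical names, `available` is assumed to mean the concurrency limit minus current concurrency, and the handling of a zero limit is an assumption added to avoid division by zero.

```typescript
type EnvStateSketch = {
  envId: string;
  concurrencyLimit: number;
  currentConcurrency: number;
  queueSize: number;
};

type BiasesSketch = {
  concurrencyLimitBias: number;
  availableCapacityBias: number;
  queueAgeRandomization: number;
};

// Default weights from the Score Components table above.
const defaultBiasesSketch: BiasesSketch = {
  concurrencyLimitBias: 0.75,
  availableCapacityBias: 0.3,
  queueAgeRandomization: 0.25,
};

function envScoreSketch(
  env: EnvStateSketch,
  biases: BiasesSketch = defaultBiasesSketch,
  random: () => number = Math.random
): number {
  // Assumption: with no usable limit, only the random component contributes.
  if (env.concurrencyLimit <= 0) return biases.queueAgeRandomization * random();
  const available = Math.max(0, env.concurrencyLimit - env.currentConcurrency);
  const concurrencyScore = available / env.concurrencyLimit;
  const capacityScore = 1 - env.queueSize / env.concurrencyLimit;
  return (
    biases.concurrencyLimitBias * concurrencyScore +
    biases.availableCapacityBias * capacityScore +
    biases.queueAgeRandomization * random()
  );
}

// Returns the ID of the highest-scoring environment, or undefined for an empty list.
function pickEnvSketch(
  envs: EnvStateSketch[],
  biases: BiasesSketch = defaultBiasesSketch,
  random: () => number = Math.random
): string | undefined {
  let best: { envId: string; score: number } | undefined;
  for (const env of envs) {
    const score = envScoreSketch(env, biases, random);
    if (!best || score > best.score) best = { envId: env.envId, score };
  }
  return best ? best.envId : undefined;
}
```

The random term is what lets an environment with a lower deterministic score win occasionally, which is how the strategy avoids starving busy environments.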

Observability

EventBus

The EventBus is a typed EventEmitter that broadcasts run lifecycle events for monitoring and UI updates.

Key Events:

Event Name	When Emitted	Payload Fields
`runCreated`	New `TaskRun` record created	`runId`, `time`
`runQueued`	Run added to queue	`runId`, `orgId`, `envId`, `queueName`, `time`
`runLocked`	Run locked to worker during dequeue	`run`, `organization`, `project`, `environment`, `time`
`runAttemptStarted`	Attempt begins executing	`run`, `organization`, `project`, `environment`, `time`
`runStatusChanged`	Run `status` field changes	`run`, `organization`, `project`, `environment`, `time`
`runCompleted`	Run reaches final state	`run`, `result`, `time`
`incomingCheckpointDiscarded`	Checkpoint rejected (invalid state)	`run`, `checkpoint`, `snapshot`, `time`

The EventBus enables decoupled real-time features without blocking execution paths. Listeners can subscribe to events for:

Real-time UI updates via Server-Sent Events
Metrics collection for monitoring dashboards
Audit logging for compliance
Webhook notifications

Sources: internal-packages/run-engine/src/engine/eventBus.ts internal-packages/run-engine/src/engine/index.ts712-715
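
The idea of a typed event bus can be shown with a small hand-rolled emitter. This sketch deliberately does not use Node's EventEmitter, so that it stays dependency-free; `TypedBusSketch` and `RunEventsSketch` are hypothetical names, and the event map covers only two of the events from the table with simplified payloads.

```typescript
// Hypothetical subset of the event map, based on the payload fields listed above.
type RunEventsSketch = {
  runCreated: { runId: string; time: Date };
  runQueued: { runId: string; orgId: string; envId: string; queueName: string; time: Date };
};

class TypedBusSketch<Events> {
  // Handlers are stored loosely; the public methods enforce the payload types.
  private readonly handlers = new Map<keyof Events, Array<(payload: any) => void>>();

  // Registers a listener whose payload type is derived from the event name.
  on<K extends keyof Events>(name: K, handler: (payload: Events[K]) => void): void {
    const list = this.handlers.get(name) ?? [];
    list.push(handler);
    this.handlers.set(name, list);
  }

  // Calls every listener registered for the event, in registration order.
  emit<K extends keyof Events>(name: K, payload: Events[K]): void {
    for (const handler of this.handlers.get(name) ?? []) {
      handler(payload);
    }
  }
}

const runBusSketch = new TypedBusSketch<RunEventsSketch>();
const queuedRunIdsSketch: string[] = [];

runBusSketch.on("runQueued", (payload) => {
  // `payload` is inferred as the runQueued shape, so queueName is available here.
  queuedRunIdsSketch.push(`${payload.runId}@${payload.queueName}`);
});

runBusSketch.emit("runQueued", {
  runId: "run_1",
  orgId: "org_1",
  envId: "env_1",
  queueName: "task:email-sender",
  time: new Date(),
});
```

Emitting an event with a missing or misspelled payload field fails at compile time, which is the main benefit of typing the bus.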

OpenTelemetry Integration

All critical operations are instrumented with OpenTelemetry tracing using the startSpan() helper.

Traced Operations:

trigger: Creating new runs
dequeueFromWorkerQueue: Fair queue selection and dequeue
startRunAttempt: Starting execution attempts
completeRunAttempt: Completing attempts
createCheckpoint: Creating checkpoints
blockRunWithWaitpoint: Blocking on waitpoints
Lock acquisition and retries

Semantic Attributes: Spans include semantic attributes following OpenTelemetry conventions:

run_id: Task run identifier
snapshot_id: Execution snapshot identifier
organization_id: Organization identifier
environment_id: Environment identifier
run_engine.lock.type: Lock operation type
run_engine.lock.resources: Locked resource keys

Sources: internal-packages/run-engine/src/engine/index.ts1-5 internal-packages/run-engine/src/engine/locking.ts18-24 apps/webapp/app/v3/tracer.server.ts113-142

Error Handling and Resilience

Stalled Run Detection

The engine monitors for stalled runs using per-state heartbeat timeouts configured via heartbeatTimeoutsMs:

Execution State	Default Timeout	Detection Purpose
`PENDING_EXECUTING`	60s	Worker crashed after dequeue but before attempt start
`PENDING_CANCEL`	60s	Cancellation notification not processed
`EXECUTING`	5 minutes	Worker crashed during normal execution
`EXECUTING_WITH_WAITPOINTS`	5 minutes	Worker crashed while blocked on waitpoints
`SUSPENDED`	10 minutes	Checkpoint restoration never requested
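
A stall check against these timeouts amounts to comparing the age of the last heartbeat with the limit for the current state. The sketch below assumes a hypothetical `isStalledSketch` function and millisecond timestamps; how the engine actually records heartbeats and schedules the check is not shown here.

```typescript
// Default timeouts from the table above, in milliseconds.
const heartbeatTimeoutsMsSketch: Record<string, number> = {
  PENDING_EXECUTING: 60000,
  PENDING_CANCEL: 60000,
  EXECUTING: 300000,
  EXECUTING_WITH_WAITPOINTS: 300000,
  SUSPENDED: 600000,
};

// Returns true when the last heartbeat is older than the timeout for the state.
function isStalledSketch(
  executionStatus: string,
  lastHeartbeatAtMs: number,
  nowMs: number
): boolean {
  const timeout = heartbeatTimeoutsMsSketch[executionStatus];
  // States without a configured timeout are never reported as stalled.
  if (timeout === undefined) return false;
  return nowMs - lastHeartbeatAtMs > timeout;
}
```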

Background Job Handlers:

The Worker processes background jobs for maintenance operations:

Job Type	Purpose	Handler
`heartbeatSnapshot`	Check if snapshot is stalled	`#handleStalledSnapshot()`
`repairSnapshot`	Attempt to recover stalled run	`#handleRepairSnapshot()`
`expireRun`	Handle TTL expiration	`ttlSystem.expireRun()`
`cancelRun`	Process cancellation request	`runAttemptSystem.cancelRun()`
`continueRunIfUnblocked`	Check if waitpoints unblocked	`waitpointSystem.continueRunIfUnblocked()`
`enqueueDelayedRun`	Queue delayed run at scheduled time	`delayedRunSystem.enqueueDelayedRun()`

Sources: internal-packages/run-engine/src/engine/index.ts199-243 internal-packages/run-engine/src/engine/types.ts109-115

Retry and Exponential Backoff

The retryOutcomeFromCompletion() function determines if a failed attempt should be retried:

Retry Decision Logic:

Check if error type is retryable (user errors, some internal errors)
Check if attemptNumber < maxAttempts

Calculate exponential backoff delay:

delay = baseDelay × (factor ^ (attemptNumber - 1))
delay = min(delay, maxDelay)
delay = delay × (1 + jitter × random())

Default Retry Configuration (minTimeoutInMs and maxTimeoutInMs correspond to baseDelay and maxDelay in the formula above):

maxAttempts: 3 (configurable per task or run)
factor: 2 (exponential growth)
minTimeoutInMs: 1000 (1 second minimum)
maxTimeoutInMs: 3600000 (1 hour maximum)
randomize: true (adds jitter)

If retry is determined, the run is re-enqueued with a delayUntil timestamp calculated from the backoff delay.

Sources: internal-packages/run-engine/src/engine/retrying.ts internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts730-1019
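
The decision logic and backoff formula above can be combined into one small function. This is a sketch, not the code of retryOutcomeFromCompletion(): `decideRetrySketch` and its result shape are hypothetical, `minTimeoutInMs` and `maxTimeoutInMs` are assumed to play the roles of baseDelay and maxDelay, and the jitter factor is assumed to be 1 when `randomize` is enabled. The random source is injectable so the outcome can be reproduced.

```typescript
type RetryOptionsSketch = {
  maxAttempts: number;
  factor: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  randomize: boolean;
};

// Defaults from the Default Retry Configuration list above.
const defaultRetrySketch: RetryOptionsSketch = {
  maxAttempts: 3,
  factor: 2,
  minTimeoutInMs: 1000,
  maxTimeoutInMs: 3600000,
  randomize: true,
};

// Hypothetical result shape for this sketch.
type RetryOutcomeSketch = { outcome: "retry"; delayMs: number } | { outcome: "fail" };

function decideRetrySketch(
  attemptNumber: number,
  retryable: boolean,
  opts: RetryOptionsSketch = defaultRetrySketch,
  random: () => number = Math.random
): RetryOutcomeSketch {
  // Stop when the error is not retryable or the attempt budget is used up.
  if (!retryable || attemptNumber >= opts.maxAttempts) {
    return { outcome: "fail" };
  }
  // delay = base * factor^(attemptNumber - 1), capped at the maximum.
  const exponential = opts.minTimeoutInMs * Math.pow(opts.factor, attemptNumber - 1);
  const capped = Math.min(exponential, opts.maxTimeoutInMs);
  // Jitter is applied after the cap, following the order of the formula above.
  const jittered = opts.randomize ? capped * (1 + random()) : capped;
  return { outcome: "retry", delayMs: Math.round(jittered) };
}
```

With the defaults and jitter disabled, a failure on attempt 1 retries after 1000 ms, a failure on attempt 2 after 2000 ms, and a failure on attempt 3 is final.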

Singleton Instantiation

The RunEngine is instantiated as a singleton in the webapp via the singleton() helper.

Configuration Loading:

Environment variables prefixed with RUN_ENGINE_* control engine behavior:

Variable Group	Purpose	Examples
`RUN_ENGINE_WORKER_*`	Worker pool configuration	`WORKER_COUNT`, `WORKER_CONCURRENCY_LIMIT`
`RUN_ENGINE_RUN_QUEUE_*`	Queue and selection strategy	`PARENT_QUEUE_LIMIT`, `CONCURRENCY_LIMIT_BIAS`
`RUN_ENGINE_RUN_LOCK_*`	Lock duration and retry	`LOCK_DURATION`, `LOCK_MAX_RETRIES`
`RUN_ENGINE_TIMEOUT_*`	Per-state stall detection	`TIMEOUT_EXECUTING`, `TIMEOUT_SUSPENDED`
`RUN_ENGINE_*_REDIS_*`	Redis connection per subsystem	`WORKER_REDIS_HOST`, `RUN_QUEUE_REDIS_PORT`

The engine supports separate Redis instances for different subsystems (worker, queue, lock) to enable horizontal scaling and isolation.

Sources: apps/webapp/app/v3/runEngine.server.ts11-192 apps/webapp/app/env.server.ts560-602

Testing

The engine includes comprehensive integration tests using the containerTest() helper from @internal/testcontainers:

Test Categories:

Test File	Coverage
`attemptFailures.test.ts`	Retry logic, error handling, max attempts
`checkpoints.test.ts`	Checkpoint creation, resumption, race conditions
`waitpoints.test.ts`	Waitpoint blocking, completion, timeouts, batch coordination
`locking.test.ts`	Lock acquisition, retry, extension, nested locks

Tests spin up isolated PostgreSQL and Redis containers for each test case, ensuring complete isolation and reproducibility.

Sources: internal-packages/run-engine/src/engine/tests/waitpoints.test.ts1-13 internal-packages/run-engine/src/engine/tests/checkpoints.test.ts1-13 internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts1-8 internal-packages/run-engine/src/engine/tests/locking.test.ts1-8

This page provides an overview of the Task Execution Engine architecture. For detailed information on specific subsystems:

Run Engine Architecture: Deep dive into system composition and resource management. See Run Engine Architecture
Run Lifecycle and State Machine: Complete state transition rules and status mappings. See Run Lifecycle and State Machine
Queue Management: Fair selection strategies, concurrency limits, and queue hierarchy. See Queue Management
Checkpoint and Resume System: Checkpoint storage, restoration, and concurrency release. See Checkpoint and Resume System
Waitpoint System: Waitpoint types, blocking semantics, and unblocking logic. See Waitpoint System
Concurrency Management: Environment, queue, and organization concurrency limits. See Concurrency Management
Worker Execution: ZodWorker, background job catalog, and handler implementations. See Worker Execution
Retry and Error Handling: Retry decision logic, error classification, and backoff calculations. See Retry and Error Handling
Task Triggering Services: High-level triggering APIs and validation. See Task Triggering Services
Observability and Tracing: OpenTelemetry integration and semantic attributes. See Observability and Tracing


Task Execution Engine

Relevant source files

Purpose and Scope

This document covers the architecture and core systems of the engine. For information about:

Task definition and SDK triggering APIs, see Trigger.dev SDK
High-level task triggering service implementations, see Task Triggering Services
Database schema for runs and execution snapshots, see Task and Run Models
Deployment and worker infrastructure, see Development and Deployment

Architecture Overview

The Task Execution Engine is implemented by the RunEngine class, which composes 11 specialized subsystems to handle different aspects of task execution.

RunEngine Class Structure

Sources: internal-packages/run-engine/src/engine/index.ts76-384

System Initialization and Configuration

The RunEngine constructor accepts a comprehensive configuration object defining all subsystem parameters:

Configuration Category	Key Options	Purpose
`prisma`	`PrismaClient`, optional `readOnlyPrisma`	Database persistence and read replicas
`worker`	`redis`, `workers`, `tasksPerWorker`, `pollIntervalMs`	Background job processing configuration
`queue`	`redis`, `defaultEnvConcurrency`, `shardCount`	Fair queue selection and hierarchical queue management
`runLock`	`redis`, `duration`, `automaticExtensionThreshold`, `retryConfig`	Distributed locking via Redlock algorithm
`machines`	`defaultMachine`, `machines`, `baseCostInCents`	Machine preset definitions for different execution environments
`heartbeatTimeoutsMs`	`PENDING_EXECUTING`, `EXECUTING`, `SUSPENDED`, etc.	Per-state timeouts for stall detection
`tracer`, `meter`	OpenTelemetry instances	Distributed tracing and metrics instrumentation

Sources: internal-packages/run-engine/src/engine/types.ts23-107 internal-packages/run-engine/src/engine/index.ts106-384

Task Run Execution Flow

Complete Execution Pipeline

The following diagram illustrates the flow from triggering a task through to completion:

Execution State Machine

Task runs progress through execution states tracked by TaskRunExecutionSnapshot records. The executionStatus field tracks fine-grained internal state:

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts226-387 internal-packages/run-engine/src/engine/statuses.ts

Core Execution Systems

ExecutionSnapshotSystem

The ExecutionSnapshotSystem maintains an immutable audit trail of all run state transitions by creating TaskRunExecutionSnapshot records.

Key Responsibilities:

Create new snapshots for every state transition
Link snapshots via previousSnapshotId to form an immutable chain
Validate that only the latest snapshot is used for state mutations
Track completed waitpoints and checkpoints at each state

Snapshot Data Model:

The getLatestExecutionSnapshot() helper retrieves the most recent valid snapshot for a run.

Sources: internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts226-387 internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts95-113

EnqueueSystem and DequeueSystem

EnqueueSystem

The EnqueueSystem handles placing runs into the queue hierarchy:

Creates a QUEUED execution snapshot
Calls runQueue.enqueueMessage(orgId, runId, message) to add to queues
Schedules TTL expiration job if configured
Emits runQueued event via EventBus

Sources: internal-packages/run-engine/src/engine/systems/enqueueSystem.ts25-102

DequeueSystem

The DequeueSystem retrieves runs for execution using fair selection:

Calls runQueue.dequeueMessageFromWorkerQueue() to get a fairly-selected run
Acquires distributed lock via runLocker.lock()
Validates run is in dequeueable state (QUEUED, RUN_CREATED)
Locks run to specific BackgroundWorkerTask and version:
- Sets lockedById to task ID
- Sets lockedToVersionId to worker version ID
- Sets taskVersion, sdkVersion, cliVersion from worker metadata
Updates TaskRun to DEQUEUED status
Creates PENDING_EXECUTING snapshot
Returns DequeuedMessage with execution context

The dequeue operation includes comprehensive validation to ensure the run can execute, checking for:

Valid worker deployment with image reference
Matching background worker if filtering by worker ID
Task exists in latest deployment
Queue configuration exists

Sources: internal-packages/run-engine/src/engine/systems/dequeueSystem.ts88-770 internal-packages/run-engine/src/engine/systems/dequeueSystem.ts105-578

RunAttemptSystem

The RunAttemptSystem orchestrates the lifecycle of execution attempts and retry logic.

startRunAttempt()

Creates a new attempt and transitions to executing state:

Acquires run lock via runLocker.lock()
Validates snapshot ID matches latest
Increments attemptNumber on TaskRun (starts at 1)
Checks attemptNumber against MAX_TASK_RUN_ATTEMPTS constant
Updates run status to EXECUTING
Creates EXECUTING snapshot
Resolves complete execution context:
- Task metadata from cache (BackgroundWorkerTask)
- Queue configuration from cache (TaskQueue)
- Organization details from cache
- Project details from cache
- Machine preset based on run configuration
- Deployment information if applicable
Returns TaskRunExecution object with all context

The system maintains in-memory caches using @internal/cache with Redis backing to avoid database queries on each attempt start.

Sources: internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts294-628 internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts106-171

completeRunAttempt()

Handles attempt completion with success or failure:

Routes to attemptSucceeded() or attemptFailed() based on completion.ok
Acquires run lock
Validates snapshot ID
Updates run with completion result

For successful attempts (attemptSucceeded()):

Sets status to COMPLETED_SUCCESSFULLY
Stores output in output and outputType fields
Creates FINISHED snapshot
Completes associated waitpoint if run was triggered via triggerAndWait()
Notifies parent run if resumeParentOnCompletion is set
Emits runCompleted event

For failed attempts (attemptFailed()):

Calls retryOutcomeFromCompletion() to determine if retryable
If retryable and under maxAttempts:
- Calculates exponential backoff delay
- Re-enqueues run with delay
- Creates QUEUED snapshot
If not retryable or max attempts reached:
- Sets status to COMPLETED_WITH_ERRORS
- Creates FINISHED snapshot with error
- Handles parent run resumption with error

Specialized Feature Systems

CheckpointSystem

The CheckpointSystem enables suspending long-running tasks to free resources while preserving execution state.

createCheckpoint()

Creates a checkpoint and suspends execution:

Acquires run lock via runLocker.lock("createCheckpoint")
Validates snapshot is latest or previous with QUEUED_EXECUTING state
Validates run is in checkpointable state:
- EXECUTING
- QUEUED_EXECUTING
Creates TaskRunCheckpoint record with:
- type: "DOCKER" or other checkpoint mechanism
- location: External storage location (e.g., S3 URL)
- imageRef: Container image reference for restoration
- reason: Optional explanation
Updates run status to WAITING_TO_RESUME
Creates SUSPENDED snapshot linking to checkpoint
Releases all concurrency held by run via runQueue.releaseAllConcurrency()
Returns checkpoint information and execution result

Special case for QUEUED_EXECUTING: If run is executing but also queued (rare race condition), the checkpoint causes the run to be re-enqueued rather than suspended.

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts36-249

continueRunExecution()

Resumes execution from a checkpoint:

Acquires run lock
Validates run is in SUSPENDED or QUEUED state
Creates new snapshot to continue from checkpoint
Re-enqueues run for execution via enqueueSystem.enqueueRun()
Returns execution result with continued state

The dequeue operation will detect the checkpoint reference and pass it to the worker for restoration.

Sources: internal-packages/run-engine/src/engine/systems/checkpointSystem.ts254-319

WaitpointSystem

The WaitpointSystem enables blocking task execution until external conditions are satisfied.

Waitpoint Types:

Type	Completion Trigger	Use Case
`DATETIME`	Specified timestamp reached	`wait.for(duration)` delays
`MANUAL`	Explicit API call to complete	Human-in-the-loop workflows, external callbacks
`BATCH`	All runs in batch finish	Parent waiting for all batch children

blockRunWithWaitpoint()

Blocks a run on one or more waitpoints:

Acquires run lock via runLocker.lock("blockRunWithWaitpoint")
Gets latest execution snapshot
Uses raw SQL to atomically:
- Create TaskRunWaitpoint junction records
- Create _WaitpointRunConnections many-to-many records
- Count pending (incomplete) waitpoints blocking the run
Determines new execution state:
- If executing and now blocked: EXECUTING_WITH_WAITPOINTS
- If suspended and now blocked: SUSPENDED
Creates new snapshot if state changed
Notifies worker via EventBus to suspend if currently executing
Schedules timeout job if timeout specified
If no pending waitpoints, schedules immediate continueRunIfUnblocked job

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts368-497

completeWaitpoint()

Marks a waitpoint as completed and unblocks affected runs:

Updates Waitpoint status to COMPLETED via updateMany (idempotent)
Stores optional output data
Queries all TaskRunWaitpoint records for this waitpoint
For each affected run:
- Enqueues continueRunIfUnblocked job with 50ms delay
- Emits cachedRunCompleted event if spanIdToComplete present

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts70-172
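To illustrate why a filtered bulk update makes completion idempotent, the sketch below mimics the behavior with an in-memory table. It assumes the update is filtered on the PENDING status; the `Waitpoint` shape and the function body are hypothetical stand-ins, not the Prisma call the engine makes.

```typescript
// Hypothetical in-memory stand-in for the Waitpoint table; field names are assumptions.
type Waitpoint = { id: string; status: "PENDING" | "COMPLETED"; output?: string };

// Mimics an updateMany filtered on status: it reports how many rows changed
// instead of throwing when nothing matches, so repeated calls are harmless.
function completeWaitpoint(
  table: Map<string, Waitpoint>,
  id: string,
  output?: string
): { count: number } {
  const row = table.get(id);
  if (!row || row.status !== "PENDING") {
    return { count: 0 };
  }
  table.set(id, { ...row, status: "COMPLETED", output });
  return { count: 1 };
}
```

A second call for the same waitpoint returns a count of 0 and leaves the stored output untouched, so duplicate completion requests do not overwrite the first result.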

continueRunIfUnblocked()

Checks if a run is unblocked and continues execution:

Acquires run lock via runLocker.lock("continueRunIfUnblocked")
Queries all TaskRunWaitpoint records for the run
Checks if any waitpoints are still PENDING
If still blocked: returns { status: "blocked" } and exits
If unblocked:
- Gets current execution state
- If EXECUTING_WITH_WAITPOINTS: Creates QUEUED_EXECUTING snapshot and notifies worker
- If SUSPENDED: Re-enqueues run via enqueueSystem.enqueueRun()
- If already in another state: Returns skipped

Sources: internal-packages/run-engine/src/engine/systems/waitpointSystem.ts499-709
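The branching above can be summarized as a decision function. This is a sketch under assumptions: only the `blocked` and `skipped` results come from the description above, while the `unblocked` result and its `action` values are hypothetical labels for the two continuation paths.

```typescript
// Hypothetical decision sketch; not the engine's implementation.
type WaitpointStatus = "PENDING" | "COMPLETED";

type ContinueResult =
  | { status: "blocked" }
  | { status: "unblocked"; action: "notify_worker" | "re_enqueue" }
  | { status: "skipped" };

function decideContinuation(
  executionStatus: string,
  waitpoints: WaitpointStatus[]
): ContinueResult {
  // Any waitpoint still pending keeps the run blocked.
  if (waitpoints.some((w) => w === "PENDING")) {
    return { status: "blocked" };
  }
  switch (executionStatus) {
    case "EXECUTING_WITH_WAITPOINTS":
      // The worker is still alive, so it only needs to be told to continue.
      return { status: "unblocked", action: "notify_worker" };
    case "SUSPENDED":
      // No live worker holds the run, so it goes back through the queue.
      return { status: "unblocked", action: "re_enqueue" };
    default:
      return { status: "skipped" };
  }
}
```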

DelayedRunSystem

The DelayedRunSystem handles runs scheduled for future execution via the delay option.

scheduleDelayedRunEnqueuing()

When a run is triggered with delayUntil:

Creates TaskRun in DELAYED status
Creates DELAYED snapshot
Enqueues background job: enqueueDelayedRun:${runId}
Job scheduled to fire at delayUntil timestamp

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts56-133

enqueueDelayedRun()

Background job handler that fires at scheduled time:

Acquires run lock
Validates run is still in DELAYED state
Gets latest snapshot
Calls enqueueSystem.enqueueRun() to queue the run
Creates QUEUED snapshot

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts135-191

rescheduleDelayedRun()

Allows changing the delay time before a run executes:

Validates run is in DELAYED state
Updates delayUntil timestamp on TaskRun
Reschedules background job via worker.reschedule()
Creates new DELAYED snapshot with updated description
Emits runDelayRescheduled event

Sources: internal-packages/run-engine/src/engine/systems/delayedRunSystem.ts26-95
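Because the job ID is derived from the run ID, a delayed run maps to exactly one pending job, which is what makes rescheduling straightforward. The sketch below shows that idea with an in-memory stand-in; `FakeWorker`, `ScheduledJob`, and their fields are hypothetical and do not reflect the real Redis-backed worker API.

```typescript
// In-memory stand-in for the background worker; all names here are illustrative.
interface ScheduledJob {
  id: string;
  job: string;
  payload: { runId: string };
  availableAt: Date;
}

// Deterministic job ID in the format described above.
const delayedRunJobId = (runId: string): string => `enqueueDelayedRun:${runId}`;

class FakeWorker {
  readonly jobs = new Map<string, ScheduledJob>();

  // Enqueuing with an existing ID overwrites the earlier job.
  enqueue(job: ScheduledJob): void {
    this.jobs.set(job.id, job);
  }

  // Moves an existing job to a new time; returns false if the job is unknown.
  reschedule(id: string, availableAt: Date): boolean {
    const existing = this.jobs.get(id);
    if (!existing) return false;
    this.jobs.set(id, { ...existing, availableAt });
    return true;
  }

  // Jobs whose scheduled time has been reached.
  due(now: Date): ScheduledJob[] {
    return Array.from(this.jobs.values()).filter(
      (job) => job.availableAt.getTime() <= now.getTime()
    );
  }
}
```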

Distributed Locking

RunLocker Implementation

The RunLocker class provides distributed mutual exclusion using the Redlock algorithm over Redis.

Key Features:

Automatic Lock Extension: Locks automatically extend before expiration if operation is still running
Nested Lock Support: Uses AsyncLocalStorage to detect if lock already held in current async context
Exponential Backoff Retry: Configurable retry with jitter to prevent thundering herd
Comprehensive Observability: Emits OpenTelemetry metrics and traces for lock operations

Sources: internal-packages/run-engine/src/engine/locking.ts70-599

lock() Method Signature:

Retry Configuration:

Parameter	Default	Purpose
`maxAttempts`	10	Maximum lock acquisition attempts
`baseDelay`	100ms	Initial retry delay
`maxDelay`	3000ms	Maximum retry delay cap
`backoffMultiplier`	1.8	Exponential backoff factor
`jitterFactor`	0.15	Randomization (±15%) to prevent synchronized retries
`maxTotalWaitTime`	15000ms	Total timeout for all retry attempts
`duration`	5000ms	Lock TTL before automatic expiration
`automaticExtensionThreshold`	1000ms	Start extending when <1s remains

Sources: internal-packages/run-engine/src/engine/locking.ts55-68 internal-packages/run-engine/src/engine/index.ts124-140
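As a rough illustration of how the retry parameters in the table interact, the sketch below computes the delay before a given retry. It assumes zero-based attempt numbering and that jitter is applied after the cap; the function name and the injectable `random` parameter are hypothetical, and the `maxTotalWaitTime` bound is not modeled.

```typescript
// Hypothetical sketch of the backoff arithmetic; not the RunLocker implementation.
interface LockRetryConfig {
  maxAttempts: number;
  baseDelay: number; // ms
  maxDelay: number; // ms
  backoffMultiplier: number;
  jitterFactor: number;
}

// Defaults taken from the table above.
const defaultLockRetry: LockRetryConfig = {
  maxAttempts: 10,
  baseDelay: 100,
  maxDelay: 3000,
  backoffMultiplier: 1.8,
  jitterFactor: 0.15,
};

function lockRetryDelayMs(
  attempt: number, // zero-based retry index (assumption)
  config: LockRetryConfig = defaultLockRetry,
  random: () => number = Math.random
): number {
  const exponential = config.baseDelay * Math.pow(config.backoffMultiplier, attempt);
  const capped = Math.min(exponential, config.maxDelay);
  // Map random() from [0, 1) to a symmetric factor in [1 - jitter, 1 + jitter).
  const jitter = 1 + config.jitterFactor * (random() * 2 - 1);
  return Math.round(capped * jitter);
}
```

With the defaults, the un-jittered delay grows from 100ms to 180ms to 324ms and reaches the 3000ms cap by the seventh attempt.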

Locking Patterns

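All state-changing operations follow the same pattern: acquire the run lock, read the latest execution snapshot, validate it against the expected state, persist a new snapshot, and release the lock.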

This pattern ensures:

Atomicity: Lock prevents concurrent modifications
Consistency: Snapshot validation prevents lost updates
Isolation: Each operation sees a consistent view of run state
Durability: New snapshot persisted before lock release
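The lock-then-validate pattern can be sketched with an in-process lock standing in for Redlock. Everything here is an assumption made for illustration: `InProcessLocker`, its `lock(name, resources, routine)` signature, the `Snapshot` shape, and the `transition` helper are hypothetical and only show how snapshot validation inside the lock prevents lost updates.

```typescript
// Hypothetical sketch; the real RunLocker uses Redlock over Redis.
type Snapshot = { id: number; status: string };

class InProcessLocker {
  private tails = new Map<string, Promise<void>>();
  readonly acquired: string[] = [];

  // Serializes routines that share the same resource key.
  lock<T>(name: string, resources: string[], routine: () => Promise<T>): Promise<T> {
    const key = resources.slice().sort().join(",");
    const previous = this.tails.get(key) ?? Promise.resolve();
    const result = previous.then(() => {
      this.acquired.push(name);
      return routine();
    });
    // Keep the chain usable even if the routine rejects.
    this.tails.set(key, result.then(() => undefined, () => undefined));
    return result;
  }
}

const snapshots = new Map<string, Snapshot>();

function transition(
  locker: InProcessLocker,
  runId: string,
  expectedSnapshotId: number,
  nextStatus: string
): Promise<Snapshot> {
  return locker.lock("transition", [runId], async () => {
    const latest = snapshots.get(runId);
    // Snapshot validation: refuse to act on a stale view of the run.
    if (!latest || latest.id !== expectedSnapshotId) {
      throw new Error(`stale snapshot for ${runId}`);
    }
    const next: Snapshot = { id: latest.id + 1, status: nextStatus };
    snapshots.set(runId, next); // written before the lock is released
    return next;
  });
}
```

If two callers race with the same expected snapshot, the lock orders them, the first one wins, and the second fails validation instead of silently overwriting the first caller's state.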

Queue Management

Hierarchical Queue Structure

The RunQueue implements a three-tier hierarchy for fair task distribution:

Queue Tiers:

Master Queue: Single queue per deployment type (e.g., "prod")
Environment Queues: One per runtime environment (e.g., "env:prod-env-123")
Task Queues: One per task within an environment (e.g., "task:email-sender")

Each tier is a Redis sorted set with scores representing queue timestamps for fairness.

Sources: internal-packages/run-engine/src/run-queue/index.ts

FairQueueSelectionStrategy

The FairQueueSelectionStrategy selects which environment queue to dequeue from using weighted scoring:

Score Formula:

score = (concurrencyLimitBias × concurrencyScore) + 
        (availableCapacityBias × capacityScore) + 
        (queueAgeRandomization × random())

Score Components:

Component	Default Weight	Calculation	Purpose
Concurrency Limit Bias	0.75	`available / limit`	Prioritize environments with more capacity headroom
Available Capacity Bias	0.3	`1 - (queueSize / limit)`	Factor in current queue depth
Queue Age Randomization	0.25	`random()`	Prevent starvation via randomness

The strategy maintains a snapshot of environment states and can reuse it for a configurable number of dequeues to reduce Redis queries. The default is 0, which refreshes the snapshot on every dequeue.

Sources: internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts apps/webapp/app/v3/runEngine.server.ts58-67
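The score formula and the component table can be combined into a short sketch. The weights and calculations come from the table above; the `EnvState` shape, the function names, and the injectable `random` parameter are hypothetical, and picking the single highest score is a simplification of whatever selection the real strategy performs.

```typescript
// Hypothetical sketch of the weighted score; not the strategy's implementation.
interface EnvState {
  envId: string;
  limit: number; // concurrency limit
  available: number; // unused concurrency slots
  queueSize: number; // runs waiting in the environment queue
}

// Default weights from the table above.
const defaultWeights = {
  concurrencyLimitBias: 0.75,
  availableCapacityBias: 0.3,
  queueAgeRandomization: 0.25,
};

function scoreEnvironment(
  env: EnvState,
  weights = defaultWeights,
  random: () => number = Math.random
): number {
  const concurrencyScore = env.limit > 0 ? env.available / env.limit : 0;
  const capacityScore = env.limit > 0 ? 1 - env.queueSize / env.limit : 0;
  return (
    weights.concurrencyLimitBias * concurrencyScore +
    weights.availableCapacityBias * capacityScore +
    weights.queueAgeRandomization * random()
  );
}

// Returns the environment with the highest score, or undefined for an empty list.
function pickEnvironment(
  envs: EnvState[],
  weights = defaultWeights,
  random: () => number = Math.random
): string | undefined {
  let best: { envId: string; score: number } | undefined;
  for (const env of envs) {
    const score = scoreEnvironment(env, weights, random);
    if (!best || score > best.score) best = { envId: env.envId, score };
  }
  return best?.envId;
}
```

For example, with the random term at zero, an environment with `limit: 10`, `available: 5`, and `queueSize: 2` scores 0.75 × 0.5 + 0.3 × 0.8 = 0.615.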

Observability

EventBus

The EventBus is a typed EventEmitter that broadcasts run lifecycle events for monitoring and UI updates.

Key Events:

Event Name	When Emitted	Payload Fields
`runCreated`	New `TaskRun` record created	`runId`, `time`
`runQueued`	Run added to queue	`runId`, `orgId`, `envId`, `queueName`, `time`
`runLocked`	Run locked to worker during dequeue	`run`, `organization`, `project`, `environment`, `time`
`runAttemptStarted`	Attempt begins executing	`run`, `organization`, `project`, `environment`, `time`
`runStatusChanged`	Run `status` field changes	`run`, `organization`, `project`, `environment`, `time`
`runCompleted`	Run reaches final state	`run`, `result`, `time`
`incomingCheckpointDiscarded`	Checkpoint rejected (invalid state)	`run`, `checkpoint`, `snapshot`, `time`

The EventBus enables decoupled real-time features without blocking execution paths. Listeners can subscribe to events for:

Real-time UI updates via Server-Sent Events
Metrics collection for monitoring dashboards
Audit logging for compliance
Webhook notifications

Sources: internal-packages/run-engine/src/engine/eventBus.ts internal-packages/run-engine/src/engine/index.ts712-715
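A typed event bus of this kind can be sketched as follows. To stay self-contained, a tiny hand-rolled emitter stands in for Node's EventEmitter; the `TinyEventBus` class and the `EngineEvents` map are hypothetical, with payload fields for two events copied from the table above (the `Date` type for `time` is an assumption).

```typescript
// Hypothetical event map: each key maps to the listener's argument tuple.
type EngineEvents = {
  runCreated: [{ runId: string; time: Date }];
  runQueued: [
    { runId: string; orgId: string; envId: string; queueName: string; time: Date }
  ];
};

// Minimal typed emitter; a stand-in for Node's EventEmitter.
class TinyEventBus<E extends Record<string, unknown[]>> {
  private listeners = new Map<keyof E, unknown[]>();

  on<K extends keyof E>(event: K, listener: (...args: E[K]) => void): this {
    const existing = this.listeners.get(event) ?? [];
    existing.push(listener);
    this.listeners.set(event, existing);
    return this;
  }

  // Calls listeners synchronously; returns whether anyone was listening.
  emit<K extends keyof E>(event: K, ...args: E[K]): boolean {
    const handlers = (this.listeners.get(event) ?? []) as Array<
      (...a: E[K]) => void
    >;
    for (const handler of handlers) handler(...args);
    return handlers.length > 0;
  }
}
```

With this shape, `bus.on("runCreated", (e) => ...)` gives `e` the payload type for that event, and emitting an event with the wrong payload is a compile-time error.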

OpenTelemetry Integration

All critical operations are instrumented with OpenTelemetry tracing using the startSpan() helper.

Traced Operations:

trigger: Creating new runs
dequeueFromWorkerQueue: Fair queue selection and dequeue
startRunAttempt: Starting execution attempts
completeRunAttempt: Completing attempts
createCheckpoint: Creating checkpoints
blockRunWithWaitpoint: Blocking on waitpoints
Lock acquisition and retries

Semantic Attributes: Spans carry attributes that follow OpenTelemetry naming conventions:

run_id: Task run identifier
snapshot_id: Execution snapshot identifier
organization_id: Organization identifier
environment_id: Environment identifier
run_engine.lock.type: Lock operation type
run_engine.lock.resources: Locked resource keys

Sources: internal-packages/run-engine/src/engine/index.ts1-5 internal-packages/run-engine/src/engine/locking.ts18-24 apps/webapp/app/v3/tracer.server.ts113-142

Error Handling and Resilience

Stalled Run Detection

The engine monitors for stalled runs using per-state heartbeat timeouts configured via heartbeatTimeoutsMs:

Execution State	Default Timeout	Detection Purpose
`PENDING_EXECUTING`	60s	Worker crashed after dequeue but before attempt start
`PENDING_CANCEL`	60s	Cancellation notification not processed
`EXECUTING`	5 minutes	Worker crashed during normal execution
`EXECUTING_WITH_WAITPOINTS`	5 minutes	Worker crashed while blocked on waitpoints
`SUSPENDED`	10 minutes	Checkpoint restoration never requested

Background Job Handlers:

The Worker processes background jobs for maintenance operations:

Job Type	Purpose	Handler
`heartbeatSnapshot`	Check if snapshot is stalled	`#handleStalledSnapshot()`
`repairSnapshot`	Attempt to recover stalled run	`#handleRepairSnapshot()`
`expireRun`	Handle TTL expiration	`ttlSystem.expireRun()`
`cancelRun`	Process cancellation request	`runAttemptSystem.cancelRun()`
`continueRunIfUnblocked`	Check if waitpoints unblocked	`waitpointSystem.continueRunIfUnblocked()`
`enqueueDelayedRun`	Queue delayed run at scheduled time	`delayedRunSystem.enqueueDelayedRun()`

Sources: internal-packages/run-engine/src/engine/index.ts199-243 internal-packages/run-engine/src/engine/types.ts109-115
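The per-state timeouts above amount to a lookup plus a comparison. The sketch below uses the default values from the timeout table; the `isStalled` function and its timestamp-based check are hypothetical simplifications, since the engine drives detection through the `heartbeatSnapshot` job rather than a direct clock comparison like this one.

```typescript
// Default timeouts from the table above, in milliseconds.
const heartbeatTimeoutsMs = {
  PENDING_EXECUTING: 60 * 1000,
  PENDING_CANCEL: 60 * 1000,
  EXECUTING: 5 * 60 * 1000,
  EXECUTING_WITH_WAITPOINTS: 5 * 60 * 1000,
  SUSPENDED: 10 * 60 * 1000,
};

type MonitoredStatus = keyof typeof heartbeatTimeoutsMs;

// Hypothetical check: a snapshot is considered stalled once the time since
// its last heartbeat exceeds the timeout for its execution state.
function isStalled(
  status: MonitoredStatus,
  lastHeartbeatAtMs: number,
  nowMs: number
): boolean {
  return nowMs - lastHeartbeatAtMs > heartbeatTimeoutsMs[status];
}
```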

Retry and Exponential Backoff

The retryOutcomeFromCompletion() function determines if a failed attempt should be retried:

Retry Decision Logic:

Check if error type is retryable (user errors, some internal errors)
Check if attemptNumber < maxAttempts

Calculate exponential backoff delay:

delay = baseDelay × (factor ^ (attemptNumber - 1))
delay = min(delay, maxDelay)
delay = delay × (1 + jitter × random())

Default Retry Configuration:

maxAttempts: 3 (configurable per task or run)
factor: 2 (exponential growth)
minTimeoutInMs: 1000 (1 second minimum)
maxTimeoutInMs: 3600000 (1 hour maximum)
randomize: true (adds jitter)

If retry is determined, the run is re-enqueued with a delayUntil timestamp calculated from the backoff delay.

Sources: internal-packages/run-engine/src/engine/retrying.ts internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts730-1019
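The retry decision and delay formula above can be worked through in a short sketch. It follows the formula literally and assumes that `baseDelay` and `maxDelay` correspond to `minTimeoutInMs` and `maxTimeoutInMs`; the function name `sketchRetryOutcome`, the `jitter` value, and the injectable `random` parameter are hypothetical and are not taken from `retryOutcomeFromCompletion()`.

```typescript
// Hypothetical sketch of the retry decision; not the engine's implementation.
interface BackoffOptions {
  maxAttempts: number;
  baseDelay: number; // ms, assumed to map to minTimeoutInMs
  factor: number;
  maxDelay: number; // ms, assumed to map to maxTimeoutInMs
  jitter: number; // illustrative coefficient for the randomized term
}

const exampleOptions: BackoffOptions = {
  maxAttempts: 3,
  baseDelay: 1000,
  factor: 2,
  maxDelay: 3600000,
  jitter: 0.5, // illustrative value only
};

type RetryOutcome = { retry: false } | { retry: true; delayMs: number };

function sketchRetryOutcome(
  attemptNumber: number, // 1-based, as in the formula above
  retryable: boolean,
  options: BackoffOptions = exampleOptions,
  random: () => number = Math.random
): RetryOutcome {
  if (!retryable || attemptNumber >= options.maxAttempts) {
    return { retry: false };
  }
  const exponential = options.baseDelay * Math.pow(options.factor, attemptNumber - 1);
  const capped = Math.min(exponential, options.maxDelay);
  const delayMs = Math.round(capped * (1 + options.jitter * random()));
  return { retry: true, delayMs };
}
```

With the defaults and no jitter, the first failure is retried after 1000ms and the second after 2000ms; the third attempt is the last, so its failure is not retried.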

Singleton Instantiation

The RunEngine is instantiated as a singleton in the webapp via the singleton() helper.
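A helper of this kind typically caches a value on the global object so that module reloads do not create a second instance. The sketch below shows one plausible shape; the signature, the `__singletons` registry key, and the body are assumptions and are not copied from the webapp's helper.

```typescript
// Hypothetical sketch of a singleton helper; not the webapp's implementation.
function singleton<T>(name: string, factory: () => T): T {
  const holder = globalThis as unknown as {
    __singletons?: Record<string, unknown>;
  };
  // Create the registry on first use and reuse it afterwards.
  const registry = (holder.__singletons ??= {});
  if (!(name in registry)) {
    registry[name] = factory();
  }
  return registry[name] as T;
}
```

Calling `singleton("RunEngine", createEngine)` twice with the same name runs the factory once and returns the same object both times (`createEngine` being a placeholder factory).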

Configuration Loading:

Environment variables prefixed with RUN_ENGINE_* control engine behavior:

Variable Group	Purpose	Examples
`RUN_ENGINE_WORKER_*`	Worker pool configuration	`WORKER_COUNT`, `WORKER_CONCURRENCY_LIMIT`
`RUN_ENGINE_RUN_QUEUE_*`	Queue and selection strategy	`PARENT_QUEUE_LIMIT`, `CONCURRENCY_LIMIT_BIAS`
`RUN_ENGINE_RUN_LOCK_*`	Lock duration and retry	`LOCK_DURATION`, `LOCK_MAX_RETRIES`
`RUN_ENGINE_TIMEOUT_*`	Per-state stall detection	`TIMEOUT_EXECUTING`, `TIMEOUT_SUSPENDED`
`RUN_ENGINE_*_REDIS_*`	Redis connection per subsystem	`WORKER_REDIS_HOST`, `RUN_QUEUE_REDIS_PORT`

The engine supports separate Redis instances for different subsystems (worker, queue, lock) to enable horizontal scaling and isolation.

Sources: apps/webapp/app/v3/runEngine.server.ts11-192 apps/webapp/app/env.server.ts560-602

Testing

The engine includes comprehensive integration tests using the containerTest() helper from @internal/testcontainers:

Test Categories:

Test File	Coverage
`attemptFailures.test.ts`	Retry logic, error handling, max attempts
`checkpoints.test.ts`	Checkpoint creation, resumption, race conditions
`waitpoints.test.ts`	Waitpoint blocking, completion, timeouts, batch coordination
`locking.test.ts`	Lock acquisition, retry, extension, nested locks

Tests spin up isolated PostgreSQL and Redis containers for each test case, ensuring complete isolation and reproducibility.

Related Subsystems

This page provides an overview of the Task Execution Engine architecture. For detailed information on specific subsystems:

Run Engine Architecture: Deep dive into system composition and resource management. See Run Engine Architecture
Run Lifecycle and State Machine: Complete state transition rules and status mappings. See Run Lifecycle and State Machine
Queue Management: Fair selection strategies, concurrency limits, and queue hierarchy. See Queue Management
Checkpoint and Resume System: Checkpoint storage, restoration, and concurrency release. See Checkpoint and Resume System
Waitpoint System: Waitpoint types, blocking semantics, and unblocking logic. See Waitpoint System
Concurrency Management: Environment, queue, and organization concurrency limits. See Concurrency Management
Worker Execution: ZodWorker, background job catalog, and handler implementations. See Worker Execution
Retry and Error Handling: Retry decision logic, error classification, and backoff calculations. See Retry and Error Handling
Task Triggering Services: High-level triggering APIs and validation. See Task Triggering Services
Observability and Tracing: OpenTelemetry integration and semantic attributes. See Observability and Tracing
