Sandbox Architecture

Sibyl's sandbox system provides isolated, ephemeral execution environments for AI agents. The architecture splits into two planes: a control plane (the API server) that manages sandbox lifecycle and task routing, and an execution plane (runner daemons) that runs tasks inside sandboxed environments. Runtime lifecycle is delegated to the Sandbox CRD from kubernetes-sigs/agent-sandbox.

Architecture Overview

┌─────────────────────────────────────────┐
│           Sibyl API Server              │
│  ┌──────────────┐ ┌──────────────────┐  │
│  │ Controller   │ │   Dispatcher     │  │
│  │ (CRD client) │ │  (task queue)    │  │
│  └──────┬───────┘ └────────┬─────────┘  │
│         │                  │            │
│  ┌──────┴──────────────────┴─────────┐  │
│  │      WebSocket Protocol           │  │
│  └──────────────┬────────────────────┘  │
└─────────────────┼───────────────────────┘
                  │
     ┌────────────┼───────────────┐
     │            │               │
 ┌───┴───┐    ┌───┴───┐     ┌─────┴──────────┐
 │Runner │    │Runner │ ... │ agent-sandbox  │
 │(pod)  │    │(pod)  │     │ controller/CRD │
 └───────┘    └───────┘     └────────────────┘

  • Controller manages sandbox lifecycle by creating/updating/deleting agents.x-k8s.io/v1alpha1 Sandbox resources.
  • Dispatcher routes tasks to runners based on availability, capabilities, and warm worktree proximity.
  • WebSocket Protocol provides bidirectional communication between server and runners.
  • Runners are stateless daemons that register with the server and execute assigned tasks.

Compute Tiers

Sibyl supports multiple isolation levels, chosen per deployment:

| Tier | Isolation | Use Case |
| --- | --- | --- |
| Local | Process-level | Development, testing |
| Docker | Container-level | CI/CD, staging |
| Kubernetes | Pod-level | Production |
| vCluster | Cluster-level | Multi-tenant production |

Tiers further down the table provide stronger isolation at the cost of higher provisioning latency.

BYOD Model

Sibyl uses a Bring Your Own Device model for runner infrastructure. Runners self-register with the API server, declaring:

  • Capabilities — What the runner can do (e.g., docker, gpu, high-memory)
  • Project affinity — Which projects have warm worktrees on this runner

The task router scores candidate runners on three axes:

  1. Availability — Is the runner idle or near capacity?
  2. Capability match — Does the runner have the required capabilities?
  3. Warm worktree proximity — Does the runner already have the project cloned and ready?

This scoring model minimizes cold-start time by preferring runners that already have the right environment cached.
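
The three axes above can be sketched as a small scoring function. This is a minimal illustration, not Sibyl's actual router: the `Runner` fields, the function names, the 0.4/0.6 weights, and the choice to treat capability match as a hard filter rather than a weighted term are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Runner:
    # Hypothetical shape of a registered runner; field names are illustrative.
    runner_id: str
    active_tasks: int
    max_tasks: int
    capabilities: set = field(default_factory=set)
    warm_projects: set = field(default_factory=set)


def score_runner(runner, required_caps, project_id):
    """Return a score in [0, 1], or None if the runner cannot take the task."""
    if runner.max_tasks <= 0 or runner.active_tasks >= runner.max_tasks:
        return None  # no spare capacity
    if not set(required_caps) <= runner.capabilities:
        return None  # missing at least one required capability
    availability = 1.0 - runner.active_tasks / runner.max_tasks
    warm = 1.0 if project_id in runner.warm_projects else 0.0
    # Assumed weights: a warm worktree outweighs raw idleness.
    return 0.4 * availability + 0.6 * warm


def pick_runner(runners, required_caps, project_id):
    """Return the highest-scoring eligible runner, or None if none qualifies."""
    scored = [(score_runner(r, required_caps, project_id), r) for r in runners]
    eligible = [(s, r) for s, r in scored if s is not None]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0])[1]
```

With these weights, a half-loaded runner that already has the project checked out beats a fully idle runner that would need a fresh clone, which matches the stated goal of minimizing cold starts.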

Runner Daemon Protocol

Communication between the API server and runners uses a WebSocket-based bidirectional protocol.

Server-to-Runner Messages

| Message | Description |
| --- | --- |
| heartbeat | Periodic liveness check |
| task_assign | Assign a queued task to this runner |
| task_cancel | Cancel a running task |

Runner-to-Server Messages

| Message | Description |
| --- | --- |
| heartbeat_ack | Acknowledge liveness check |
| status | Report runner load, capabilities, health |
| task_ack | Confirm task assignment accepted |
| task_complete | Report task finished (with result/artifacts) |
| task_reject | Decline task assignment (capacity, mismatch) |
| agent_update | Stream agent progress/logs during execution |
| project_register | Register or update project affinity |
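
The message names listed above can be modelled as a pair of enums, one per direction. The sketch below is illustrative only: the wire format is not specified here, so the JSON envelope with `type` and `payload` keys, and the names `ServerMessage`, `RunnerMessage`, `encode`, and `decode_from_runner`, are assumptions.

```python
import json
from enum import Enum


class ServerMessage(str, Enum):
    HEARTBEAT = "heartbeat"
    TASK_ASSIGN = "task_assign"
    TASK_CANCEL = "task_cancel"


class RunnerMessage(str, Enum):
    HEARTBEAT_ACK = "heartbeat_ack"
    STATUS = "status"
    TASK_ACK = "task_ack"
    TASK_COMPLETE = "task_complete"
    TASK_REJECT = "task_reject"
    AGENT_UPDATE = "agent_update"
    PROJECT_REGISTER = "project_register"


def encode(msg_type, payload=None):
    """Serialize a message into the assumed JSON envelope."""
    return json.dumps({"type": msg_type.value, "payload": payload or {}})


def decode_from_runner(raw):
    """Parse a frame received by the server; raises ValueError if the
    type is not a runner-to-server message."""
    frame = json.loads(raw)
    msg_type = RunnerMessage(frame["type"])
    return msg_type, frame.get("payload", {})
```

Keeping the two directions in separate enums means a server-only message such as task_assign arriving from a runner is rejected at decode time instead of reaching the dispatcher.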

The protocol is designed to be resilient to transient disconnects. Runners automatically reconnect and re-register on connection loss. Tasks that were in-flight during a disconnect enter a grace period before being reassigned.
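
The grace period can be pictured as a simple time check on the server side. This is a sketch under stated assumptions: the function name, the `in_flight` mapping, and the 60-second default are hypothetical, and the actual grace period is not documented here.

```python
def tasks_to_reassign(in_flight, now, grace_seconds=60.0):
    """Return the IDs of in-flight tasks whose runner has been gone too long.

    in_flight maps task_id -> timestamp at which the owning runner
    disconnected, or None while the runner is still connected.
    grace_seconds is an assumed default, not a documented value.
    """
    return sorted(
        task_id
        for task_id, disconnected_at in in_flight.items()
        if disconnected_at is not None and now - disconnected_at >= grace_seconds
    )
```

A runner that reconnects and re-registers inside the window would have its entries reset to None, so its tasks never show up in the result.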

Sandbox Lifecycle

pending → starting → running → suspending → suspended → deleted
                        │                       │
                        └── failed ◄────────────┘

| State | Description |
| --- | --- |
| pending | Sandbox requested, waiting for resources |
| starting | Provisioning environment (pulling images, cloning repos) |
| running | Active and accepting tasks |
| suspending | Saving state before suspension |
| suspended | Idle, state preserved, resources released |
| deleted | Cleaned up, resources freed |
| failed | Error state, reachable from any other state |

Sandboxes auto-suspend after SIBYL_SANDBOX_IDLE_TTL_SECONDS of inactivity and are hard-deleted after SIBYL_SANDBOX_MAX_LIFETIME_SECONDS. Suspend/resume maps to Sandbox.spec.replicas = 0|1 in the agent-sandbox CRD.
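
A reconciliation pass could apply these two timers roughly as follows. This is a minimal sketch, not the actual controller: the function names and the timestamp arguments are assumptions, and the constants mirror the documented defaults of 1800 and 14400 seconds. Only the `spec.replicas` field comes from the text above.

```python
IDLE_TTL_SECONDS = 1800       # documented default of SIBYL_SANDBOX_IDLE_TTL_SECONDS
MAX_LIFETIME_SECONDS = 14400  # documented default of SIBYL_SANDBOX_MAX_LIFETIME_SECONDS


def reconcile_action(created_at, last_active_at, now, suspended=False):
    """Decide what to do with one sandbox; all times are epoch seconds."""
    if now - created_at >= MAX_LIFETIME_SECONDS:
        return "delete"  # the lifetime cap applies even to suspended sandboxes
    if not suspended and now - last_active_at >= IDLE_TTL_SECONDS:
        return "suspend"
    return "none"


def replicas_patch(suspend):
    """Patch body for the Sandbox resource: replicas 0 suspends, 1 resumes."""
    return {"spec": {"replicas": 0 if suspend else 1}}
```

Checking the lifetime cap before the idle timer means a sandbox that is both idle and expired is deleted directly rather than suspended first.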

Task Lifecycle

queued → dispatched → acked → running → completed
                                  │
                                  ├── failed → retry → queued
                                  └── canceled

| State | Description |
| --- | --- |
| queued | Task submitted, waiting for runner assignment |
| dispatched | Assigned to a runner, awaiting acknowledgment |
| acked | Runner confirmed receipt |
| running | Actively executing |
| completed | Finished successfully |
| failed | Execution error (may retry) |
| retry | Scheduled for re-queue after failure |
| canceled | Explicitly canceled by user or system |

Failed tasks are retried up to a configurable limit before being marked as permanently failed.
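
The retry rule can be sketched as a counter checked on every failure. The `Task` fields, the `record_failure` name, and the default limit of 3 are assumptions for illustration; the actual setting name and default are not given here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task record; only the state names come from the table above.
    task_id: str
    state: str = "queued"
    failures: int = 0


def record_failure(task, max_retries=3):
    """Handle one failed execution.

    Returns True if the task was re-queued, False if it is now
    permanently failed. max_retries is an assumed default.
    """
    task.failures += 1
    if task.failures <= max_retries:
        task.state = "queued"  # failed → retry → queued
        return True
    task.state = "failed"      # retry budget exhausted
    return False
```

With `max_retries=2`, a task that keeps failing runs three times in total: the original attempt plus two retries.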

Auth Model

Sandbox runners authenticate via JWT tokens with specific claims:

| Claim | Description |
| --- | --- |
| org | Organization ID (tenant isolation) |
| sub | Subject (user or service account) |
| rid | Runner ID |
| sid | Sandbox ID (if bound to a sandbox) |
| scp | Scope; must include sandbox:runner |

Strict binding is enforced for sandbox-bound runners: a runner token with a sid claim can only execute tasks within that specific sandbox. This prevents a compromised runner from accessing other sandboxes in the same organization.
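
The binding rule can be expressed as a check over the decoded claims. This sketch assumes the JWT signature and expiry have already been verified elsewhere. The function name is hypothetical, the encoding of `scp` (list or space-delimited string) is an assumption, and so is the treatment of unbound runners (no `sid`) as allowed anywhere within their organization.

```python
REQUIRED_SCOPE = "sandbox:runner"


def authorize_task(claims, task_org, task_sandbox_id):
    """Return True if a runner holding these claims may run the task."""
    scopes = claims.get("scp", [])
    if isinstance(scopes, str):
        scopes = scopes.split()  # assumed: space-delimited scope string
    if REQUIRED_SCOPE not in scopes:
        return False
    if claims.get("org") != task_org:
        return False  # tenant isolation
    if not claims.get("rid"):
        return False  # a runner token must identify the runner
    sid = claims.get("sid")
    if sid is not None and sid != task_sandbox_id:
        return False  # strict binding: a bound runner stays in its sandbox
    return True
```

A token bound to one sandbox is refused for any other sandbox even inside the same organization, which is the property that limits the reach of a compromised runner.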

Configuration Reference

All sandbox configuration uses the SIBYL_SANDBOX_ prefix:

| Variable | Default | Description |
| --- | --- | --- |
| SIBYL_SANDBOX_MODE | off | Policy: off, shadow, enforced |
| SIBYL_SANDBOX_DEFAULT_IMAGE | ghcr.io/hyperb1iss/sibyl-sandbox:latest | Default container image for sandboxes |
| SIBYL_SANDBOX_WORKTREE_BASE | /tmp/sibyl/sandboxes | Base path mounted for sandbox worktrees |
| SIBYL_SANDBOX_IDLE_TTL_SECONDS | 1800 | Auto-suspend after idle (seconds) |
| SIBYL_SANDBOX_MAX_LIFETIME_SECONDS | 14400 | Maximum sandbox lifetime (seconds) |
| SIBYL_SANDBOX_K8S_NAMESPACE | default | Kubernetes namespace for sandbox pods |
| SIBYL_SANDBOX_RECONCILE_ENABLED | true | Enable background reconciliation loop |
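
Reading these variables with their documented defaults might look like the sketch below. The `SandboxConfig` class and `load_config` function are hypothetical names, not Sibyl's actual settings code, and the accepted spellings for the boolean are an assumption.

```python
import os
from dataclasses import dataclass

PREFIX = "SIBYL_SANDBOX_"
VALID_MODES = ("off", "shadow", "enforced")


@dataclass(frozen=True)
class SandboxConfig:
    mode: str
    default_image: str
    worktree_base: str
    idle_ttl_seconds: int
    max_lifetime_seconds: int
    k8s_namespace: str
    reconcile_enabled: bool


def load_config(env=None):
    """Build a SandboxConfig from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env

    def get(name, default):
        return env.get(PREFIX + name, default)

    mode = get("MODE", "off")
    if mode not in VALID_MODES:
        raise ValueError(f"{PREFIX}MODE must be one of {VALID_MODES}, got {mode!r}")
    return SandboxConfig(
        mode=mode,
        default_image=get("DEFAULT_IMAGE", "ghcr.io/hyperb1iss/sibyl-sandbox:latest"),
        worktree_base=get("WORKTREE_BASE", "/tmp/sibyl/sandboxes"),
        idle_ttl_seconds=int(get("IDLE_TTL_SECONDS", "1800")),
        max_lifetime_seconds=int(get("MAX_LIFETIME_SECONDS", "14400")),
        k8s_namespace=get("K8S_NAMESPACE", "default"),
        # Assumption: these spellings count as true; the real parser may differ.
        reconcile_enabled=get("RECONCILE_ENABLED", "true").lower() in ("1", "true", "yes"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests, for example `load_config({"SIBYL_SANDBOX_MODE": "shadow"})`.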

Deployment Modes

The SIBYL_SANDBOX_MODE variable controls how sandboxes are enforced:

off (Default)

Sandbox system is completely disabled. Tasks execute directly without isolation. Suitable for single-user development or when external orchestration handles isolation.

shadow

Sandbox operations are observed and logged but not enforced. Tasks can execute without a sandbox, but the system tracks what would have been sandboxed. Useful for:

  • Validating sandbox configuration before enforcement
  • Monitoring task patterns to tune runner capacity
  • Gradual rollout of sandbox requirements

enforced

All task execution requires a sandbox. Tasks submitted without a valid sandbox assignment are rejected. This is the recommended mode for production deployments where isolation guarantees matter.
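
The difference between the three modes comes down to what happens when a task arrives without a sandbox. The sketch below illustrates that decision; `admit_task` and its return shape are hypothetical, not Sibyl's actual API.

```python
def admit_task(mode, sandbox_id):
    """Return (allowed, note) for a task under the given sandbox mode.

    sandbox_id is None when the task has no sandbox assignment.
    """
    if mode not in ("off", "shadow", "enforced"):
        raise ValueError(f"unknown sandbox mode: {mode!r}")
    if mode == "off" or sandbox_id is not None:
        return True, None
    if mode == "shadow":
        # Observed and logged, but not blocked.
        return True, "shadow: task has no sandbox and would be rejected when enforced"
    return False, "enforced: task has no sandbox assignment"
```

In shadow mode the note is what would be logged, so the volume of those log entries indicates how many tasks would start failing after a switch to enforced.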

Released under the MIT License.