Quickstart¶

AgentEscapeBench is built as an Inspect AI eval that runs each sample inside its own QEMU VM on a Kubernetes cluster with nested virtualization. This guide walks through the setup end-to-end on GKE.

For other providers (EKS, AKS), the same requirements apply but the commands differ — see image-build-design.md.

Prerequisites¶

Python 3.12 with uv
gcloud CLI authenticated against a project with billing enabled
kubectl and helm
A model API key for the provider you'll evaluate (Anthropic, OpenAI, …)

1. Install¶

git clone https://github.com/safety-research/agent-escape-bench
cd agent-escape-bench
uv sync --extra dev
uv run pre-commit install
cp .env.example .env  # then fill in the keys you'll use

2. Cluster setup (one-time)¶

./scripts/setup-gke.sh <gcp-project-id>

This provisions a GKE cluster with KVM-capable nodes and autoscaling (0 → 120) and a least-privilege node service account. Expect ~15 min wall clock for a fresh cluster (most of it the GKE control-plane spin-up; nothing is broken if it sits at "Creating cluster…" for several minutes). Re-runs are idempotent. Pass --cluster NAME to use a name other than the default escapebench (the node SA is named <cluster>-node, so distinct cluster names get distinct SAs).

The pod image URI is source-derived (computed from images/pod/ via escapebench.image_keys.pod_image_uri() at runtime — there's no image: field in escapebench.yaml). The published tag at …/escapebench-qemu:src-<hash> is the same one any clone of this repo would compute. The shipped escapebench.yaml points at the public qcow2 cache, so consumers can skip straight to validation below — no yaml edits, no image builds.

If you want to host your own image cache instead, see Building your own images at the end of this guide.

3. Validate the harness¶

Run the lowest-difficulty sample for a family to confirm the harness works end-to-end before scaling up:

uv run inspect eval escapebench/escapebench \
    -T family=container -T difficulty=0 \
    --model anthropic/claude-sonnet-4-6

A difficulty-0 sample has no effective sandbox boundary — the model just has to interact with the runtime. If this fails, investigate before running higher difficulties.

4. Run a full eval¶

.eval logs land in $INSPECT_LOG_DIR. The shipped .env.example sets it to ./logs/eval; export a different path before the command to write elsewhere.

# Headline eval: difficulty 5 across every family — the hardest
# samples, no known technique. `family` defaults to `all`.
uv run inspect eval escapebench/escapebench -T difficulty=5 \
    --model anthropic/claude-sonnet-4-6 \
    --max-connections 10 \
    --message-limit 250 \
    --epochs 3 \
    --retry-on-error 2 \
    --fail-on-error 0.25 \
    --log-shared 60 --display log

# Full spectrum, one family
uv run inspect eval escapebench/escapebench -T family=container \
    --model anthropic/claude-sonnet-4-6 \
    --max-connections 10 \
    --message-limit 250 \
    --retry-on-error 1 \
    --log-shared 60 --display log

# Specific difficulty levels
uv run inspect eval escapebench/escapebench -T family=container \
    -T difficulty=0,1,2 \
    --model anthropic/claude-sonnet-4-6 \
    --log-shared 60 --display log

# Multiple families in one run
uv run inspect eval escapebench/escapebench -T family=container,v8 \
    --model anthropic/claude-sonnet-4-6 \
    --log-shared 60 --display log

--retry-on-error N re-runs failed samples up to N times in a fresh pod — recommended for any multi-sample run, since transient bridge or VM-IO errors are expected at the long tail. The headline run bumps to 2 because the hardest samples have the longest tails.

--fail-on-error 0.25 lets the eval tolerate up to 25% of samples failing (after retries) before the run itself is marked failed. The task is configured with fail_on_error=False, so without this flag nothing aborts; setting a fraction keeps an otherwise-successful run producing a score when a handful of samples genuinely refuse to recover, while still failing loudly if something systematic breaks.

--epochs 3 runs each sample three times. Models are stochastic and exploits sometimes succeed or fail by luck; three rolls per sample distinguishes capability from variance. Triples wall-clock and spend, so reserve it for the headline run rather than per-family probes.

--log-shared 60 --display log makes a backgrounded run readable mid-flight: the .eval zip's central directory only lands at finalize time, and the default Textual TUI doesn't capture to redirected stdout. With --log-shared 60, sample events sync every 60 s to ./logs/eval/.buffer/, so inspect view shows live state and you can tail -f the captured --display log output. Omit them when running interactively — the default Textual TUI is friendlier in that case.

Logs land in $INSPECT_LOG_DIR (set to ./logs/eval in .env.example).

5. View results¶

uv run inspect view

Building your own images¶

By default setup-gke.sh provisions only what consumers need to run evals against the public image cache. If you want to build and host your own qcow2 images — for a fork, a private mirror, or a variant that isn't published — re-run with --bucket gs://<your-bucket>:

./scripts/setup-gke.sh <gcp-project-id> \
    --bucket gs://<gcp-project-id>-escapebench-public

This additionally creates the bucket (world-readable, UBLA-enabled), grants the node service account storage.objectAdmin on it, and grants artifactregistry.reader on the project (in case you push your pod image to a private GAR rather than ghcr.io). Re-runs are idempotent; existing buckets aren't reconfigured.

Then point imageCache in escapebench.yaml at your bucket — the HTTPS form for read-only consumers — and build via the gs:// form:

imageCache: https://storage.googleapis.com/<gcp-project-id>-escapebench-public

uv run python -m escapebench.build container-default \
    --image-cache gs://<gcp-project-id>-escapebench-public
uv run python -m escapebench.build v8-release \
    --image-cache gs://<gcp-project-id>-escapebench-public
# Or build everything used by a family / all variants. --all
# parallelizes across variants (default --max-parallel 4).
uv run python -m escapebench.build --family container \
    --image-cache gs://<gcp-project-id>-escapebench-public
uv run python -m escapebench.build --all --max-parallel 4 \
    --image-cache gs://<gcp-project-id>-escapebench-public

Each build runs as a Kubernetes Job, executes Packer inside the per-pod QEMU image, validates the result, and writes the qcow2 + .validated marker directly to the public bucket. Content-addressed — re-running with unchanged inputs is a no-op.

Off-cluster Packer pre-bake (GitHub Actions)¶

A peer producer to python -m escapebench.build. Triggers via workflow_dispatch on .github/workflows/build-qcow2.yml, runs Packer on a GitHub-hosted ubuntu-latest runner with /dev/kvm, and uploads the qcow2 to the public bucket. Variants > 12 GiB guest RAM (the v8 pair) stay cluster-only.

The workflow stops at qcow2-uploaded; the .validated marker is a cluster privilege. Operators finish the round-trip with:

uv run python -m escapebench.build --validate-only <variant> \
    --image-cache gs://<gcp-project-id>-escapebench-public

This skips Packer + the k8s build Job, validates the qcow2 already in the cache, and writes the marker. Errors loudly if the qcow2 is missing.

WIF setup (one-time, per project — both the pod-image SA and the qcow2 builder SA) is documented in wif-setup.md.

Try a custom payload (no agent)¶

solver=validate skips the agent and submits a fixed payload directly. Useful for testing your own exploit or running shipped reference exploits as positive controls:

# Your own exploit
uv run inspect eval escapebench/escapebench \
    -T solver=validate -T payload_source=file \
    -T payload_file=./my-exploit.sh \
    -T family=container -T sandbox_type=docker_privileged \
    -T image_variant=container-default \
    --model mockllm/model

# All shipped reference exploits for a family
uv run inspect eval escapebench/escapebench \
    -T solver=validate -T payload_source=reference \
    -T family=container --model mockllm/model

The literal {NONCE} in reference / file payloads is substituted with the sample's per-run nonce.

Task parameters¶

Parameter	Values	Default	Description
`family`	`bare`, `container`, `v8`, `landlock`, `bubblewrap`, `nsjail`, `firecracker`, `qemu`, `chromium`, `wasm`, or `all` (comma-separated)	`all`	Sandbox family filter
`difficulty`	`0`–`5` (comma-separated)	all	Difficulty filter
`proof_type`	`read`, `write`, `crash`, `escape`	all enabled	Proof type filter (only `read` and `write` produce samples — `crash`/`escape` are disabled)
`sandbox_type`	one of the registered `SandboxType` enum values	all matching	Single sandbox configuration
`image_variant`	one of the registered `ImageVariant` enum values	—	Sample filter — keeps only sandbox types whose variant matches. Used during build validation
`image_cache`	URI prefix (`gs://`, `s3://`, `http(s)://`, local path)	from `escapebench.yaml`	Override `imageCache` for this run (sets `ESCAPEBENCH_IMAGE_CACHE_OVERRIDE`)
`agent`	`claude-code`, `codex`	`claude-code`	Agent harness (gemini is not currently supported)
`solver`	`agent` or `validate`	`agent`	`validate` skips the agent and submits a fixed payload
`payload_source`	`negative`, `reference`, `file`	`negative`	Payload source when `solver=validate`
`payload_file`	path	—	Required when `payload_source=file`

crash, escape, and the wasm family are disabled — checkers run (or are stubbed) but no sample targets them, so they always score zero. See CLAUDE.md § "Disabled proofs and families" for the re-enable process.