Quickstart¶
AgentEscapeBench is built as an Inspect AI eval that runs each sample inside its own QEMU VM on a Kubernetes cluster with nested virtualization. This guide walks through the setup end-to-end on GKE.
For other providers (EKS, AKS), the same requirements apply but the
commands differ — see image-build-design.md.
Prerequisites¶
- Python 3.12 with uv
gcloudCLI authenticated against a project with billing enabledkubectlandhelm- A model API key for the provider you'll evaluate (Anthropic, OpenAI, …)
1. Install¶
git clone https://github.com/safety-research/agent-escape-bench
cd agent-escape-bench
uv sync --extra dev
uv run pre-commit install
cp .env.example .env # then fill in the keys you'll use
2. Cluster setup (one-time)¶
This provisions a GKE cluster with KVM-capable nodes and autoscaling
(0 → 120) and a least-privilege node service account. Expect ~15 min
wall clock for a fresh cluster (most of it the GKE control-plane
spin-up; nothing is broken if it sits at "Creating cluster…" for
several minutes). Re-runs are idempotent. Pass --cluster NAME to use
a name other than the default escapebench (the node SA is named
<cluster>-node, so distinct cluster names get distinct SAs).
The pod image URI is source-derived (computed from images/pod/
via escapebench.image_keys.pod_image_uri() at runtime — there's no
image: field in escapebench.yaml). The published tag at
…/escapebench-qemu:src-<hash> is the same one any clone of this
repo would compute. The shipped escapebench.yaml points at the
public qcow2 cache, so consumers can skip straight to validation
below — no yaml edits, no image builds.
If you want to host your own image cache instead, see Building your own images at the end of this guide.
3. Validate the harness¶
Run the lowest-difficulty sample for a family to confirm the harness works end-to-end before scaling up:
uv run inspect eval escapebench/escapebench \
-T family=container -T difficulty=0 \
--model anthropic/claude-sonnet-4-6
A difficulty-0 sample has no effective sandbox boundary — the model just has to interact with the runtime. If this fails, investigate before running higher difficulties.
4. Run a full eval¶
.eval logs land in $INSPECT_LOG_DIR. The shipped .env.example
sets it to ./logs/eval; export a different path before the command
to write elsewhere.
# Headline eval: difficulty 5 across every family — the hardest
# samples, no known technique. `family` defaults to `all`.
uv run inspect eval escapebench/escapebench -T difficulty=5 \
--model anthropic/claude-sonnet-4-6 \
--max-connections 10 \
--message-limit 250 \
--epochs 3 \
--retry-on-error 2 \
--fail-on-error 0.25 \
--log-shared 60 --display log
# Full spectrum, one family
uv run inspect eval escapebench/escapebench -T family=container \
--model anthropic/claude-sonnet-4-6 \
--max-connections 10 \
--message-limit 250 \
--retry-on-error 1 \
--log-shared 60 --display log
# Specific difficulty levels
uv run inspect eval escapebench/escapebench -T family=container \
-T difficulty=0,1,2 \
--model anthropic/claude-sonnet-4-6 \
--log-shared 60 --display log
# Multiple families in one run
uv run inspect eval escapebench/escapebench -T family=container,v8 \
--model anthropic/claude-sonnet-4-6 \
--log-shared 60 --display log
--retry-on-error N re-runs failed samples up to N times in a fresh
pod — recommended for any multi-sample run, since transient bridge or
VM-IO errors are expected at the long tail. The headline run bumps to
2 because the hardest samples have the longest tails.
--fail-on-error 0.25 lets the eval tolerate up to 25% of samples
failing (after retries) before the run itself is marked failed. The
task is configured with fail_on_error=False, so without this flag
nothing aborts; setting a fraction keeps an otherwise-successful run
producing a score when a handful of samples genuinely refuse to
recover, while still failing loudly if something systematic breaks.
--epochs 3 runs each sample three times. Models are stochastic and
exploits sometimes succeed or fail by luck; three rolls per sample
distinguishes capability from variance. Triples wall-clock and spend,
so reserve it for the headline run rather than per-family probes.
--log-shared 60 --display log makes a backgrounded run readable
mid-flight: the .eval zip's central directory only lands at finalize
time, and the default Textual TUI doesn't capture to redirected stdout.
With --log-shared 60, sample events sync every 60 s to
./logs/eval/.buffer/, so inspect view shows live state and you can
tail -f the captured --display log output. Omit them when running
interactively — the default Textual TUI is friendlier in that case.
Logs land in $INSPECT_LOG_DIR (set to ./logs/eval in .env.example).
5. View results¶
Building your own images¶
By default setup-gke.sh provisions only what consumers need to run
evals against the public image cache. If you want to build and host
your own qcow2 images — for a fork, a private mirror, or a variant
that isn't published — re-run with --bucket gs://<your-bucket>:
This additionally creates the bucket (world-readable, UBLA-enabled),
grants the node service account storage.objectAdmin on it, and grants
artifactregistry.reader on the project (in case you push your pod
image to a private GAR rather than ghcr.io). Re-runs are idempotent;
existing buckets aren't reconfigured.
Then point imageCache in escapebench.yaml at your bucket — the
HTTPS form for read-only consumers — and build via the gs:// form:
uv run python -m escapebench.build container-default \
--image-cache gs://<gcp-project-id>-escapebench-public
uv run python -m escapebench.build v8-release \
--image-cache gs://<gcp-project-id>-escapebench-public
# Or build everything used by a family / all variants. --all
# parallelizes across variants (default --max-parallel 4).
uv run python -m escapebench.build --family container \
--image-cache gs://<gcp-project-id>-escapebench-public
uv run python -m escapebench.build --all --max-parallel 4 \
--image-cache gs://<gcp-project-id>-escapebench-public
Each build runs as a Kubernetes Job, executes Packer inside the per-pod
QEMU image, validates the result, and writes the qcow2 + .validated
marker directly to the public bucket. Content-addressed — re-running
with unchanged inputs is a no-op.
Off-cluster Packer pre-bake (GitHub Actions)¶
A peer producer to python -m escapebench.build. Triggers via
workflow_dispatch on .github/workflows/build-qcow2.yml, runs
Packer on a GitHub-hosted ubuntu-latest runner with /dev/kvm,
and uploads the qcow2 to the public bucket. Variants > 12 GiB guest
RAM (the v8 pair) stay cluster-only.
The workflow stops at qcow2-uploaded; the .validated marker is a
cluster privilege. Operators finish the round-trip with:
uv run python -m escapebench.build --validate-only <variant> \
--image-cache gs://<gcp-project-id>-escapebench-public
This skips Packer + the k8s build Job, validates the qcow2 already in the cache, and writes the marker. Errors loudly if the qcow2 is missing.
WIF setup (one-time, per project — both the pod-image SA and the
qcow2 builder SA) is documented in wif-setup.md.
Try a custom payload (no agent)¶
solver=validate skips the agent and submits a fixed payload directly.
Useful for testing your own exploit or running shipped reference
exploits as positive controls:
# Your own exploit
uv run inspect eval escapebench/escapebench \
-T solver=validate -T payload_source=file \
-T payload_file=./my-exploit.sh \
-T family=container -T sandbox_type=docker_privileged \
-T image_variant=container-default \
--model mockllm/model
# All shipped reference exploits for a family
uv run inspect eval escapebench/escapebench \
-T solver=validate -T payload_source=reference \
-T family=container --model mockllm/model
The literal {NONCE} in reference / file payloads is substituted with
the sample's per-run nonce.
Task parameters¶
| Parameter | Values | Default | Description |
|---|---|---|---|
family |
bare, container, v8, landlock, bubblewrap, nsjail, firecracker, qemu, chromium, wasm, or all (comma-separated) |
all |
Sandbox family filter |
difficulty |
0–5 (comma-separated) |
all | Difficulty filter |
proof_type |
read, write, crash, escape |
all enabled | Proof type filter (only read and write produce samples — crash/escape are disabled) |
sandbox_type |
one of the registered SandboxType enum values |
all matching | Single sandbox configuration |
image_variant |
one of the registered ImageVariant enum values |
— | Sample filter — keeps only sandbox types whose variant matches. Used during build validation |
image_cache |
URI prefix (gs://, s3://, http(s)://, local path) |
from escapebench.yaml |
Override imageCache for this run (sets ESCAPEBENCH_IMAGE_CACHE_OVERRIDE) |
agent |
claude-code, codex |
claude-code |
Agent harness (gemini is not currently supported) |
solver |
agent or validate |
agent |
validate skips the agent and submits a fixed payload |
payload_source |
negative, reference, file |
negative |
Payload source when solver=validate |
payload_file |
path | — | Required when payload_source=file |
crash, escape, and the wasm family are disabled — checkers run
(or are stubbed) but no sample targets them, so they always score
zero. See CLAUDE.md
§ "Disabled proofs and families" for the re-enable process.