Benchmarks
Run Benchmarks
Launch a benchmark evaluation (SWE-bench, GAIA, AIME, GPQA, …) driven by the agent of your choice. Fire-and-forget: returns a benchmarkRunId immediately.
POST
This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The run is fire-and-forget: it returns a benchmarkRunId immediately and executes in the background on the managed sandbox. Like tasks, a benchmark can run on the Claude, Codex, or Grok Build agent runtime, selected from model or forced with agent.
To run tasks concurrently inside a single sandbox (rather than one sandbox each), use env: "docker-in-parent".
See Benchmark Logs & Status for the status/logs/stream details.
Authentication
All requests require a BLACKBOX API key as a Bearer token (Pro plan required for POST). See Authentication.
Headers
Authorization: API key of the form Bearer <api_key>.
Content-Type: Must be application/json.
Request Body
benchmark: The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.
model: Model that drives the agent. The id also selects the agent runtime: Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK; xAI Grok Build ids run the Grok CLI. See Agent Runtimes.
agent: Explicit agent-runtime override: "claude", "codex", or "grok". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
apiKey: Bring-your-own router. Your OpenAI-compatible router key (bearer token). Must be paired with baseUrl. When supplied, the agent runtime is pointed at your endpoint instead of the platform router, and model is passed through verbatim (no allowlist check). The key is used in-memory for the run only and is never persisted. See Bring-your-own router.
baseUrl: Bring-your-own router. Your router base URL (e.g. https://my-router.example.com). Must be an http(s) URL and paired with apiKey.
Optional extra instruction appended to every task’s auto-generated prompt, acting as a global steer applied across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it’s added after it. Omit for the standard benchmark instruction.
nConcurrent: Maximum concurrent tasks within the run. Range 1–16 (defaults to 16). What this concurrency uses depends on env; see Concurrency & execution backends.
Number of dataset instances to evaluate. Minimum 1; there is no fixed maximum, and the value is clamped only to the benchmark’s own dataset size (totalInstances, e.g. 12,032 for MMLU-Pro, 60 for AIME). Larger runs take proportionally longer and cost more.
Per-task agent timeout in seconds. Defaults to the benchmark’s own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).
env: Execution backend, i.e. how each task’s environment is provisioned. See Concurrency & execution backends.
- "sandbox-per-task" (default): each concurrent task runs in its own isolated sandbox (restored from the prepared snapshot). Strong isolation; concurrency is bounded by your account’s concurrent-sandbox quota.
- "docker-in-parent": all tasks run as concurrent Docker containers inside a single sandbox. No per-task sandboxes, so concurrency is bounded by that one VM’s CPU / RAM instead of the sandbox quota. Cheaper for small runs.
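The constraints on these fields can be checked client-side before a request is sent. The Python sketch below is illustrative only: the function name build_run_body is hypothetical, it covers only the fields whose names appear on this page, and it assumes benchmark is the JSON key for the benchmark name. The server remains the source of truth for validation.

```python
import json
from urllib.parse import urlparse

# Values copied from the field descriptions on this page.
SUPPORTED_BENCHMARKS = {
    "swe-bench", "swe-bench-lite", "swe-bench-multilingual", "swe-gym",
    "swe-bench-multimodal", "hle", "gaia", "aime", "gpqa", "gpqa-main",
    "omni-math", "mmlu-pro", "simpleqa",
}
SUPPORTED_AGENTS = {"claude", "codex", "grok"}
SUPPORTED_ENVS = {"sandbox-per-task", "docker-in-parent"}


def build_run_body(benchmark, model=None, agent=None, api_key=None,
                   base_url=None, n_concurrent=None, env=None):
    """Hypothetical helper: validate arguments and return the JSON body as a dict."""
    if benchmark not in SUPPORTED_BENCHMARKS:
        raise ValueError(f"unknown benchmark {benchmark!r}")
    body = {"benchmark": benchmark}  # key name assumed
    if model is not None:
        body["model"] = model
    if agent is not None:
        if agent not in SUPPORTED_AGENTS:
            raise ValueError(f"unknown agent {agent!r}")
        body["agent"] = agent
    # apiKey and baseUrl must always be supplied as a pair.
    if (api_key is None) != (base_url is None):
        raise ValueError("apiKey and baseUrl must be supplied together")
    if api_key is not None:
        parsed = urlparse(base_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("baseUrl must be an http(s) URL")
        body["apiKey"] = api_key
        body["baseUrl"] = base_url
    if n_concurrent is not None:
        if not 1 <= n_concurrent <= 16:
            raise ValueError("nConcurrent must be in the range 1-16")
        body["nConcurrent"] = n_concurrent
    if env is not None:
        if env not in SUPPORTED_ENVS:
            raise ValueError(f"unknown env {env!r}")
        body["env"] = env
    return body


body = build_run_body("aime", model="blackboxai/openai/gpt-5.3-codex", n_concurrent=4)
print(json.dumps(body, indent=2))
```

Omitted optional fields are left out of the body so that the documented server-side defaults apply.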
Response
Unique id for the run. Use it to poll status, list tasks, or stream logs.
Initial status: "queued".
Resolved canonical benchmark name.
The model driving the agent (or null for the default).
The resolved runtime that will actually run ("claude", "codex", or "grok") after applying the override or model-based inference. This is also stored on the run, so results report the true agent.
Echo of bring-your-own router usage (or null). The apiKey is masked: { "baseUrl": "...", "apiKey": "***", "model": "..." }.
The extra instruction appended to each task (or null if none was provided).
Resolved concurrency.
Resolved instance count.
Resolved per-task timeout (seconds).
Resolved execution backend.
Bring-your-own router
Supply apiKey + baseUrl (always as a pair) to run the benchmark against your own OpenAI-compatible endpoint and key instead of the platform router. When BYO creds are present:
- the agent runtime (claude/codex/grok) is pointed at your baseUrl with your apiKey;
- model is passed through verbatim, with no allowlist check, so you can evaluate self-hosted or third-party models;
- the key is used in-memory for the run only and is never stored;
- the response echoes a masked byo block (apiKey shown as ***).
The apiKey / baseUrl fields are also accepted by Create Task to drive the interactive agent against your endpoint.
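A bring-your-own-router launch could be assembled roughly as in the Python sketch below. Several things here are assumptions rather than documented facts: the host api.example.com is a placeholder, POST to /api/v1/benchmarks/runs is assumed from the listing path on this page, benchmark is assumed as the key for the benchmark name, and the helper names and the model id my-org/my-model are hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: placeholder host; the POST path is inferred, not stated on this page.
RUNS_URL = "https://api.example.com/api/v1/benchmarks/runs"


def build_byo_request(platform_key, benchmark, model, router_key, router_base_url):
    """Hypothetical helper: build (but do not send) a BYO-router launch request."""
    body = {
        "benchmark": benchmark,      # key name assumed
        "model": model,              # passed through verbatim when BYO creds are present
        "apiKey": router_key,        # your router key, always paired with baseUrl
        "baseUrl": router_base_url,
    }
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {platform_key}",  # BLACKBOX API key
            "Content-Type": "application/json",
        },
        method="POST",
    )


def masked_echo(body):
    """Locally rebuild the masked byo block shape described in the Response section."""
    return {"baseUrl": body["baseUrl"], "apiKey": "***", "model": body["model"]}


req = build_byo_request(
    "YOUR_BLACKBOX_API_KEY", "gpqa", "my-org/my-model",
    "YOUR_ROUTER_KEY", "https://my-router.example.com",
)
payload = json.loads(req.data)
print(req.get_method(), req.full_url)
print(masked_echo(payload))
```

Note that two different keys are involved: the BLACKBOX API key authenticates the call in the Authorization header, while apiKey in the body is the credential for your own router.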
Concurrency & execution backends
nConcurrent sets how many tasks run at once; env decides what that concurrency consumes.
| | sandbox-per-task (default) | docker-in-parent |
|---|---|---|
| Concurrency unit | One isolated sandbox per task (restored from the prepared snapshot) | One Docker container per task, all inside a single sandbox |
| Bounded by | Your account’s concurrent-sandbox quota (+ model rate limits) | That one VM’s CPU / RAM / Docker (+ model rate limits) |
| Isolation | Strong — separate VMs | Shared VM |
| Sandbox count | nConcurrent sandboxes | 1 sandbox total |
| Best for | Heavy/long tasks (e.g. SWE-bench), strong isolation | Small/cheap runs, or high concurrency without using sandbox quota |
env: "docker-in-parent":
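For illustration, the Python sketch below builds a request body that selects this backend and computes how many sandboxes each backend would consume, following the "Sandbox count" row of the table above. The helper name sandboxes_used and the example values are hypothetical, and benchmark is assumed as the key for the benchmark name.

```python
import json


def sandboxes_used(env: str, n_concurrent: int) -> int:
    """Hypothetical helper: sandbox count per the 'Sandbox count' table row."""
    if not 1 <= n_concurrent <= 16:
        raise ValueError("nConcurrent must be in the range 1-16")
    if env == "sandbox-per-task":
        return n_concurrent  # one isolated sandbox per concurrent task
    if env == "docker-in-parent":
        return 1  # all containers share a single sandbox
    raise ValueError(f"unknown env {env!r}")


# Example body: eight concurrent tasks as Docker containers in one sandbox.
body = {"benchmark": "aime", "env": "docker-in-parent", "nConcurrent": 8}
print(json.dumps(body))
print(sandboxes_used(body["env"], body["nConcurrent"]))  # 1
print(sandboxes_used("sandbox-per-task", 8))  # 8
```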
nConcurrent is capped at 16. In sandbox-per-task the practical limit is your Vercel concurrent-sandbox quota; in docker-in-parent it’s the single VM’s resources (too many parallel containers + agent processes will saturate CPU/RAM). Raise the cap only alongside the matching infra headroom.
Listing runs & sub-resources
GET /api/v1/benchmarks/runs returns the authenticated user’s benchmark runs (optionally filtered by ?status=). Per-run sub-resources live under /api/v1/benchmarks/runs/{benchmarkRunId}/…:
| Sub-resource | Purpose |
|---|---|
| GET …/results | Score + percentage + per-task traces + tracking log (see Benchmark Results) |
| GET …/status | Lightweight progress poll |
| GET …/tasks | Per-instance task results |
| GET …/logs | Log snapshot (live or persisted) |
| GET …/logs/stream | SSE log stream (live, or DB replay after the run ends) |
| POST …/cancel | Cancel an in-flight run |
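The paths in the table above can be assembled with a small helper. This Python sketch is a convenience illustration: the host api.example.com is a placeholder, the function names are hypothetical, and the run id run_123 and the filter value "queued" are made-up examples; only the path layout and methods come from this page.

```python
from urllib.parse import quote, urlencode

BASE = "https://api.example.com"  # assumption: placeholder host
RUNS_PATH = "/api/v1/benchmarks/runs"

# HTTP method for each per-run sub-resource, as listed in the table above.
SUB_RESOURCES = {
    "results": "GET",
    "status": "GET",
    "tasks": "GET",
    "logs": "GET",
    "logs/stream": "GET",
    "cancel": "POST",
}


def list_runs_url(status=None):
    """URL for listing the authenticated user's runs, optionally filtered by status."""
    url = BASE + RUNS_PATH
    if status is not None:
        url += "?" + urlencode({"status": status})
    return url


def sub_resource(run_id, name):
    """Return (method, url) for one sub-resource of a run."""
    method = SUB_RESOURCES[name]  # raises KeyError for an unlisted sub-resource
    return method, f"{BASE}{RUNS_PATH}/{quote(run_id, safe='')}/{name}"


print(list_runs_url("queued"))
print(sub_resource("run_123", "status"))
print(sub_resource("run_123", "cancel"))
```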
Error Codes
| Status | Description |
|---|---|
| 200 | Run queued |
| 400 | Missing/invalid benchmark, invalid agent, or bad JSON |
| 401 | Invalid or missing API key |
| 403 | Pro subscription required |
| 429 | Too many concurrent benchmark runs |
| 500 | Failed to launch the run |
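A client might map these status codes to an outcome as in the Python sketch below. The names LaunchOutcome and classify_launch are hypothetical, and treating only 429 as worth retrying later is an assumption; this page does not prescribe a retry policy.

```python
from dataclasses import dataclass

# Status codes and descriptions as listed in the table above.
LAUNCH_ERRORS = {
    400: "Missing/invalid benchmark, invalid agent, or bad JSON",
    401: "Invalid or missing API key",
    403: "Pro subscription required",
    429: "Too many concurrent benchmark runs",
    500: "Failed to launch the run",
}

# Assumption: only 429 is flagged for a later retry.
RETRY_LATER = {429}


@dataclass
class LaunchOutcome:
    ok: bool
    status: int
    message: str
    retry_later: bool


def classify_launch(status: int) -> LaunchOutcome:
    """Translate the HTTP status of a launch request into a simple outcome."""
    if status == 200:
        return LaunchOutcome(True, status, "Run queued", False)
    message = LAUNCH_ERRORS.get(status, "Unexpected status")
    return LaunchOutcome(False, status, message, status in RETRY_LATER)


for code in (200, 403, 429):
    print(classify_launch(code))
```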
Benchmark Results
Score, percentage, per-task traces, and the tracking log.
Benchmark Logs & Status
Poll status and stream logs (live, then from the DB).
Agent Runtimes
How model / agent choose Claude vs Codex vs Grok.
Models
Model ids and their runtime mapping.