POST /api/v1/benchmarks/runs
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/anthropic/claude-sonnet-4.5",
    "limit": 20,
    "nConcurrent": 4
  }'
{
  "benchmarkRunId": "b1c2d3e4-f5a6-7890-bcde-f12345678901",
  "status": "queued",
  "benchmark": "swe-bench-lite",
  "dataset": "princeton-nlp/SWE-bench_Lite",
  "model": "blackboxai/openai/gpt-5.3-codex",
  "agent": "codex",
  "prompt": "Prefer minimal diffs; add a regression test.",
  "nConcurrent": 4,
  "limit": 20,
  "timeout": 1800,
  "env": "sandbox-per-task"
}
This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The run is fire-and-forget — it returns a benchmarkRunId immediately and executes in the background on the managed sandbox. Like tasks, a benchmark can run on the Claude, Codex, or Grok Build agent runtime, selected from model or forced with agent.
After launching, read the score and per-task traces with Benchmark Results, and follow progress with Benchmark Logs & Status.
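The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the function names `build_run_request` and `start_run` are illustrative and not part of any official SDK. It assumes the response is the JSON object shown above, with a `benchmarkRunId` field.

```python
import json
import urllib.request

API_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_run_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a benchmark run."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_run(api_key: str, body: dict) -> str:
    """Send the request and return the benchmarkRunId from the JSON response."""
    with urllib.request.urlopen(build_run_request(api_key, body)) as resp:
        return json.load(resp)["benchmarkRunId"]
```

Because the run is fire-and-forget, `start_run` returns as soon as the run is queued; the returned id is what the results, status, and log endpoints expect.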

Authentication

All requests require a BLACKBOX API key as a Bearer token (Pro plan required for POST). See Authentication.

Headers

Authorization
string
required
API Key of the form Bearer <api_key>.
Content-Type
string
required
Must be application/json.

Request Body

benchmark
string
required
The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.
model
string
Model that drives the agent. The id also selects the agent runtime — Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK; xAI Grok Build ids run the Grok CLI. See Agent Runtimes.
For the Grok Build runtime, pass the model’s full router id, blackboxai/x-ai/grok-build-0.1, not the bare x-ai/grok-build-0.1. The Grok CLI validates -m against the router’s model list and rejects an unprefixed id with "unknown model id".
agent
string
Explicit agent-runtime override — "claude", "codex", or "grok". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
apiKey
string
Bring-your-own router — your OpenAI-compatible router key (bearer token). Must be paired with baseUrl. When supplied, the agent runtime is pointed at your endpoint instead of the platform router, and model is passed through verbatim (no allowlist check). The key is used in-memory for the run only and is never persisted. See Bring-your-own router.
baseUrl
string
Bring-your-own router — your router base URL (e.g. https://my-router.example.com). Must be an http(s) URL and paired with apiKey.
prompt
string
Optional extra instruction appended to every task’s auto-generated prompt — a global steer applied across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it’s added after it. Omit for the standard benchmark instruction.
nConcurrent
number
default:"16"
Maximum concurrent tasks within the run. Range 1 to 16 (defaults to 16). What this concurrency uses depends on env; see Concurrency & execution backends.
limit
number
default:"10"
Number of dataset instances to evaluate. Minimum 1; there is no fixed maximum — it’s clamped only to the benchmark’s own dataset size (totalInstances, e.g. 12,032 for MMLU-Pro, 60 for AIME). Larger runs take proportionally longer and cost more.
timeout
number
Per-task agent timeout in seconds. Defaults to the benchmark’s own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).
env
string
default:"sandbox-per-task"
Execution backend — how each task’s environment is provisioned. See Concurrency & execution backends.
  • "sandbox-per-task" (default) — each concurrent task runs in its own isolated sandbox (restored from the prepared snapshot). Strong isolation; concurrency is bounded by your account’s concurrent-sandbox quota.
  • "docker-in-parent" — all tasks run as concurrent Docker containers inside a single sandbox. No per-task sandboxes, so concurrency is bounded by that one VM’s CPU / RAM instead of the sandbox quota. Cheaper for small runs.
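The constraints listed for the request body can be checked on the client before sending. The sketch below is one possible pre-flight check, not the server's actual validation: `validate_run_body` is a hypothetical helper, and whether the server rejects or clamps an out-of-range nConcurrent is not stated here, so the check is deliberately strict. Skipping the Grok prefix check when BYO credentials are present is also an assumption.

```python
SUPPORTED_BENCHMARKS = {
    "swe-bench", "swe-bench-lite", "swe-bench-multilingual", "swe-gym",
    "swe-bench-multimodal", "hle", "gaia", "aime", "gpqa", "gpqa-main",
    "omni-math", "mmlu-pro", "simpleqa",
}
SUPPORTED_AGENTS = {"claude", "codex", "grok"}
SUPPORTED_ENVS = {"sandbox-per-task", "docker-in-parent"}


def validate_run_body(body: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    if body.get("benchmark") not in SUPPORTED_BENCHMARKS:
        problems.append("benchmark must be one of the supported names")
    if "agent" in body and body["agent"] not in SUPPORTED_AGENTS:
        problems.append("agent must be claude, codex, or grok")
    if "env" in body and body["env"] not in SUPPORTED_ENVS:
        problems.append("env must be sandbox-per-task or docker-in-parent")
    if not 1 <= body.get("nConcurrent", 16) <= 16:
        problems.append("nConcurrent must be between 1 and 16")
    if body.get("limit", 10) < 1:
        problems.append("limit must be at least 1")
    # apiKey and baseUrl are only meaningful as a pair.
    if ("apiKey" in body) != ("baseUrl" in body):
        problems.append("apiKey and baseUrl must be supplied together")
    # Assumption: the prefix rule applies only when using the platform router.
    model = body.get("model") or ""
    if ("grok-build" in model and "apiKey" not in body
            and not model.startswith("blackboxai/")):
        problems.append("Grok Build model ids need the blackboxai/ prefix")
    return problems
```

An empty list means the body passed these local checks; the server remains the authority and can still return 400.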

Response

benchmarkRunId
string
Unique id for the run. Use it to poll status, list tasks, or stream logs.
status
string
Initial status — "queued".
benchmark
string
Resolved canonical benchmark name.
model
string | null
The model driving the agent (or null for the default).
agent
string
The resolved runtime that will actually run — "claude", "codex", or "grok" — after applying the override or model-based inference. This is also stored on the run, so results report the true agent.
byo
object | null
Echo of bring-your-own router usage (or null). The apiKey is masked: { "baseUrl": "...", "apiKey": "***", "model": "..." }.
prompt
string | null
The extra instruction appended to each task (or null if none was provided).
nConcurrent
number
Resolved concurrency.
limit
number
Resolved instance count.
timeout
number
Resolved per-task timeout (seconds).
env
string
Resolved execution backend.
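As an illustration of consuming these fields, the sketch below parses the response into a small dataclass. `RunCreated` and `parse_run_created` are hypothetical names; keys not listed above (such as dataset in the example response) are ignored rather than treated as errors.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunCreated:
    benchmarkRunId: str
    status: str
    benchmark: str
    agent: str
    nConcurrent: int
    limit: int
    timeout: int
    env: str
    model: Optional[str] = None   # null when the default model is used
    prompt: Optional[str] = None  # null when no extra instruction was sent
    byo: Optional[dict] = None    # masked echo of BYO router usage, or null


def parse_run_created(raw: str) -> RunCreated:
    """Parse the JSON response body, keeping only the documented fields."""
    data = json.loads(raw)
    known = {f.name for f in fields(RunCreated)}
    return RunCreated(**{k: v for k, v in data.items() if k in known})
```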

Bring-your-own router

Supply apiKey + baseUrl (always as a pair) to run the benchmark against your own OpenAI-compatible endpoint and key instead of the platform router. When BYO creds are present:
  • the agent runtime (claude/codex/grok) is pointed at your baseUrl with your apiKey;
  • model is passed through verbatim — no allowlist check — so you can evaluate self-hosted or third-party models;
  • the key is used in-memory for the run only and is never stored;
  • the response echoes a masked byo block (apiKey shown as ***).
The same apiKey / baseUrl fields are also accepted by Create Task to drive the interactive agent against your endpoint.
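A request body for a BYO run could be assembled as follows. This is a sketch under stated assumptions: `with_byo_router` and `masked_for_logging` are hypothetical helpers, and the masking mirrors the *** echo described above only for local logging, not for anything the server does.

```python
from urllib.parse import urlparse


def with_byo_router(body: dict, api_key: str, base_url: str) -> dict:
    """Return a copy of body with the apiKey/baseUrl pair attached."""
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("baseUrl must be an http(s) URL")
    if not api_key:
        raise ValueError("apiKey must be supplied together with baseUrl")
    return {**body, "apiKey": api_key, "baseUrl": base_url}


def masked_for_logging(body: dict) -> dict:
    """Return a copy that is safe to log, with the router key masked."""
    if "apiKey" in body:
        return {**body, "apiKey": "***"}
    return dict(body)
```

For example, `with_byo_router({"benchmark": "aime", "model": "my-model"}, key, "https://my-router.example.com")` yields a body whose model is sent as given, since no allowlist check applies to BYO runs.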

Concurrency & execution backends

nConcurrent sets how many tasks run at once; env decides what that concurrency consumes.
| | sandbox-per-task (default) | docker-in-parent |
| --- | --- | --- |
| Concurrency unit | One isolated sandbox per task (restored from the prepared snapshot) | One Docker container per task, all inside a single sandbox |
| Bounded by | Your account’s concurrent-sandbox quota (+ model rate limits) | That one VM’s CPU / RAM / Docker (+ model rate limits) |
| Isolation | Strong: separate VMs | Shared VM |
| Sandbox count | nConcurrent sandboxes | 1 sandbox total |
| Best for | Heavy/long tasks (e.g. SWE-bench), strong isolation | Small/cheap runs, or high concurrency without using sandbox quota |
To run tasks concurrently inside a single sandbox (rather than one sandbox each), use env: "docker-in-parent":
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "aime",
    "model": "blackboxai/x-ai/grok-build-0.1",
    "agent": "grok",
    "limit": 90,
    "nConcurrent": 16,
    "env": "docker-in-parent"
  }'
nConcurrent is capped at 16. In sandbox-per-task the practical limit is your Vercel concurrent-sandbox quota; in docker-in-parent it’s the single VM’s resources (too many parallel containers + agent processes will saturate CPU/RAM). Raise the cap only alongside the matching infra headroom.
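The trade-off between the two backends comes down to simple arithmetic on sandbox count. The sketch below makes it explicit; `sandboxes_needed` and `pick_env` are hypothetical helpers, and `sandbox_quota` stands for whatever concurrent-sandbox quota your account has, which this API call does not report.

```python
def sandboxes_needed(n_concurrent: int, env: str = "sandbox-per-task") -> int:
    """Number of sandboxes a run occupies at peak for the given backend."""
    if not 1 <= n_concurrent <= 16:
        raise ValueError("nConcurrent must be between 1 and 16")
    if env == "sandbox-per-task":
        return n_concurrent  # one isolated sandbox per concurrent task
    if env == "docker-in-parent":
        return 1  # all containers share a single sandbox
    raise ValueError(f"unknown env: {env!r}")


def pick_env(n_concurrent: int, sandbox_quota: int) -> str:
    """Prefer per-task isolation when the quota allows it."""
    if sandboxes_needed(n_concurrent, "sandbox-per-task") <= sandbox_quota:
        return "sandbox-per-task"
    return "docker-in-parent"
```

This heuristic considers only the sandbox quota; for docker-in-parent the real constraint is the single VM's CPU and RAM, which the sketch does not model.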

Listing runs & sub-resources

GET /api/v1/benchmarks/runs returns the authenticated user’s benchmark runs (optionally filtered by ?status=). Per-run sub-resources live under /api/v1/benchmarks/runs/{benchmarkRunId}/…:
| Sub-resource | Purpose |
| --- | --- |
| GET …/results | Score + percentage + per-task traces + tracking log (see Benchmark Results) |
| GET …/status | Lightweight progress poll |
| GET …/tasks | Per-instance task results |
| GET …/logs | Log snapshot (live or persisted) |
| GET …/logs/stream | SSE log stream (live, or DB replay after the run ends) |
| POST …/cancel | Cancel an in-flight run |
See Benchmark Logs & Status for the status/logs/stream details.
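The sub-resource paths follow one pattern, so a client can derive them from the run id. A minimal sketch, with `sub_resource` and `list_runs_url` as hypothetical helper names; it only builds method and URL pairs and makes no assumption about the response bodies.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE = "https://agent.blackbox.ai/api/v1/benchmarks/runs"

# HTTP method for each documented sub-resource.
SUB_RESOURCES = {
    "results": "GET",
    "status": "GET",
    "tasks": "GET",
    "logs": "GET",
    "logs/stream": "GET",
    "cancel": "POST",
}


def sub_resource(run_id: str, name: str) -> tuple[str, str]:
    """Return (method, url) for a sub-resource of one benchmark run."""
    method = SUB_RESOURCES[name]  # KeyError for an undocumented name
    return method, f"{BASE}/{quote(run_id, safe='')}/{name}"


def list_runs_url(status: Optional[str] = None) -> str:
    """URL for listing runs, optionally filtered by ?status=."""
    if status is None:
        return BASE
    return f"{BASE}?{urlencode({'status': status})}"
```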

Error Codes

| Status | Description |
| --- | --- |
| 200 | Run queued |
| 400 | Missing/invalid benchmark, invalid agent, or bad JSON |
| 401 | Invalid or missing API key |
| 403 | Pro subscription required |
| 429 | Too many concurrent benchmark runs |
| 500 | Failed to launch the run |
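A client might map these codes to actions as sketched below. The retry policy is an assumption, not documented behavior: it treats only 429 as worth retrying (after earlier runs finish) and uses arbitrary capped exponential delays. `describe_status`, `should_retry`, and `backoff_delays` are hypothetical names.

```python
STATUS_MEANINGS = {
    200: "Run queued",
    400: "Missing/invalid benchmark, invalid agent, or bad JSON",
    401: "Invalid or missing API key",
    403: "Pro subscription required",
    429: "Too many concurrent benchmark runs",
    500: "Failed to launch the run",
}


def describe_status(code: int) -> str:
    """Human-readable meaning of a status code from this endpoint."""
    return STATUS_MEANINGS.get(code, f"Unexpected status {code}")


def should_retry(code: int) -> bool:
    """Assumption: only 429 clears up on its own once other runs finish."""
    return code == 429


def backoff_delays(attempts: int, base_seconds: float = 30.0,
                   cap_seconds: float = 600.0) -> list[float]:
    """Capped exponential delays (in seconds) between retry attempts."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(attempts)]
```

Codes 400, 401, and 403 indicate a problem with the request or account, so repeating the same request unchanged is unlikely to help.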

Benchmark Results

Score, percentage, per-task traces, and the tracking log.

Benchmark Logs & Status

Poll status and stream logs (live, then from the DB).

Agent Runtimes

How model / agent choose Claude vs Codex vs Grok.

Models

Model ids and their runtime mapping.