Run Benchmarks - BLACKBOX AI

curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/anthropic/claude-sonnet-4.5",
    "limit": 20,
    "nConcurrent": 4
  }'

{
  "benchmarkRunId": "b1c2d3e4-f5a6-7890-bcde-f12345678901",
  "status": "queued",
  "benchmark": "swe-bench-lite",
  "dataset": "princeton-nlp/SWE-bench_Lite",
  "model": "blackboxai/anthropic/claude-sonnet-4.5",
  "agent": "claude",
  "prompt": null,
  "nConcurrent": 4,
  "limit": 20,
  "timeout": 1800,
  "env": "sandbox-per-task"
}

POST /api/v1/benchmarks/runs

This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The run is fire-and-forget: the call returns a benchmarkRunId immediately and the run executes in the background on the managed sandbox. Like tasks, a benchmark can run on the Claude, Codex, or Grok Build agent runtime, which is inferred from model or forced with agent.

After launching, read the score and per-task traces with Benchmark Results, and follow progress with Benchmark Logs & Status.
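
As a rough illustration of the fire-and-forget call, the sketch below builds the same POST request as the curl example using only the Python standard library. The helper names build_run_request and start_run are hypothetical, not part of any BLACKBOX SDK; the URL, headers, and the benchmarkRunId field come from this page.

```python
import json
import urllib.request

RUNS_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_run_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a benchmark run."""
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_run(api_key: str, body: dict) -> str:
    """Send the request and return the benchmarkRunId from the JSON response."""
    with urllib.request.urlopen(build_run_request(api_key, body)) as resp:
        return json.load(resp)["benchmarkRunId"]
```

Calling start_run("YOUR_API_KEY", {"benchmark": "swe-bench-lite", "limit": 20, "nConcurrent": 4}) should return as soon as the run is queued; the run itself continues in the background.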

Authentication

All requests require a BLACKBOX API key sent as a Bearer token; POST requests also require a Pro plan. See Authentication.

Headers

Authorization

string

required

API Key of the form Bearer <api_key>.

Content-Type

string

required

Must be application/json.

Request Body

benchmark

string

required

The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.
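
A client can catch a misspelled benchmark name before spending a request on a 400. The following is a minimal sketch that copies the names listed above; check_benchmark is a hypothetical helper, and the server remains the source of truth if the list changes.

```python
# Benchmark names copied from the list above.
SUPPORTED_BENCHMARKS = frozenset({
    "swe-bench", "swe-bench-lite", "swe-bench-multilingual", "swe-gym",
    "swe-bench-multimodal", "hle", "gaia", "aime", "gpqa", "gpqa-main",
    "omni-math", "mmlu-pro", "simpleqa",
})


def check_benchmark(name: str) -> str:
    """Return name unchanged if it is supported, otherwise raise ValueError."""
    if name not in SUPPORTED_BENCHMARKS:
        supported = ", ".join(sorted(SUPPORTED_BENCHMARKS))
        raise ValueError(f"unknown benchmark {name!r}; supported: {supported}")
    return name
```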

model

string

Model that drives the agent. The id also selects the agent runtime — Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK; xAI Grok Build ids run the Grok CLI. See Agent Runtimes.

For the Grok Build runtime, pass the model’s full router id — blackboxai/x-ai/grok-build-0.1 — not the bare x-ai/grok-build-0.1. The Grok CLI validates -m against the router’s model list and rejects an unprefixed id with "unknown model id".

agent

string

Explicit agent-runtime override — "claude", "codex", or "grok". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
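
The interplay between model and agent can be pictured roughly as follows. This is only a sketch: the substring rules are an assumption made for illustration, since this page does not spell out the exact matching logic, and resolve_agent and grok_router_id are hypothetical names. What it does reflect from the text is that an explicit agent wins, that claude is the fallback, and that Grok Build ids need the blackboxai/ router prefix.

```python
from typing import Optional

VALID_AGENTS = ("claude", "codex", "grok")


def resolve_agent(model: Optional[str] = None, agent: Optional[str] = None) -> str:
    """Pick the runtime: an explicit agent overrides inference from model."""
    if agent is not None:
        if agent not in VALID_AGENTS:
            raise ValueError(f"invalid agent {agent!r}; supported: {VALID_AGENTS}")
        return agent
    model_id = (model or "").lower()
    # Assumed heuristics; the real service may match model ids differently.
    if "grok-build" in model_id:
        return "grok"
    if "openai/" in model_id or "codex" in model_id:
        return "codex"
    return "claude"  # documented default when nothing else applies


def grok_router_id(model: str) -> str:
    """Add the blackboxai/ router prefix that the Grok CLI expects, if missing."""
    return model if model.startswith("blackboxai/") else f"blackboxai/{model}"
```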

apiKey

string

Bring-your-own router — your OpenAI-compatible router key (bearer token). Must be paired with baseUrl. When supplied, the agent runtime is pointed at your endpoint instead of the platform router, and model is passed through verbatim (no allowlist check). The key is used in-memory for the run only and is never persisted. See Bring-your-own router.

baseUrl

string

Bring-your-own router — your router base URL (e.g. https://my-router.example.com). Must be an http(s) URL and paired with apiKey.
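
The two bring-your-own router rules stated above (apiKey and baseUrl travel as a pair, and baseUrl must be an http(s) URL) can be checked on the client before sending. A minimal sketch, with byo_fields as a hypothetical helper name:

```python
from typing import Optional
from urllib.parse import urlparse


def byo_fields(api_key: Optional[str], base_url: Optional[str]) -> dict:
    """Return the BYO part of the request body, or {} when BYO is not used."""
    if api_key is None and base_url is None:
        return {}
    if api_key is None or base_url is None:
        raise ValueError("apiKey and baseUrl must be supplied together")
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("baseUrl must be an http(s) URL")
    return {"apiKey": api_key, "baseUrl": base_url}
```

The returned dict can be merged into the request body alongside benchmark and model.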

prompt

string

Optional extra instruction appended to every task’s auto-generated prompt — a global steer applied across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it’s added after it. Omit for the standard benchmark instruction.

nConcurrent

number

default:"16"

Maximum concurrent tasks within the run. Range 1–16 (defaults to 16). What this concurrency uses depends on env — see Concurrency & execution backends.

limit

number

default:"10"

Number of dataset instances to evaluate. Minimum 1; there is no fixed maximum — it’s clamped only to the benchmark’s own dataset size (totalInstances, e.g. 12,032 for MMLU-Pro, 60 for AIME). Larger runs take proportionally longer and cost more.

timeout

number

Per-task agent timeout in seconds. Defaults to the benchmark’s own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).
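
To make the defaults and the limit clamp concrete, the sketch below resolves nConcurrent, limit, and timeout the way the descriptions above suggest. It is an approximation under stated assumptions: the tables hold only the figures quoted on this page, rejecting an out-of-range nConcurrent (rather than clamping it) is a guess, and resolve_run_options is a hypothetical name.

```python
from typing import Optional

# Only the values quoted in the text; other benchmarks are not listed here.
DEFAULT_TIMEOUTS = {"swe-bench": 1800, "aime": 900, "mmlu-pro": 900, "simpleqa": 900}
DATASET_SIZES = {"mmlu-pro": 12032, "aime": 60}


def resolve_run_options(
    benchmark: str,
    n_concurrent: Optional[int] = None,
    limit: Optional[int] = None,
    timeout: Optional[int] = None,
) -> dict:
    """Apply the documented defaults and clamp limit to the dataset size."""
    n = 16 if n_concurrent is None else n_concurrent
    if not 1 <= n <= 16:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError("nConcurrent must be between 1 and 16")
    count = 10 if limit is None else limit
    if count < 1:
        raise ValueError("limit must be at least 1")
    total = DATASET_SIZES.get(benchmark)
    if total is not None:
        count = min(count, total)  # no fixed maximum other than dataset size
    resolved = {"nConcurrent": n, "limit": count}
    resolved_timeout = timeout if timeout is not None else DEFAULT_TIMEOUTS.get(benchmark)
    if resolved_timeout is not None:
        resolved["timeout"] = resolved_timeout
    return resolved
```

For example, asking for 90 AIME instances resolves to the full 60-instance dataset with the 900-second default timeout.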

env

string

default:"sandbox-per-task"

Execution backend — how each task’s environment is provisioned. See Concurrency & execution backends.

"sandbox-per-task" (default) — each concurrent task runs in its own isolated sandbox (restored from the prepared snapshot). Strong isolation; concurrency is bounded by your account’s concurrent-sandbox quota.
"docker-in-parent" — all tasks run as concurrent Docker containers inside a single sandbox. No per-task sandboxes, so concurrency is bounded by that one VM’s CPU / RAM instead of the sandbox quota. Cheaper for small runs.

Response

benchmarkRunId

string

Unique id for the run. Use it to poll status, list tasks, or stream logs.

status

string

Initial status — "queued".

benchmark

string

Resolved canonical benchmark name.

model

string

The model driving the agent (or null for the default).

agent

string

The resolved runtime that will actually run — "claude", "codex", or "grok" — after applying the override or model-based inference. This is also stored on the run, so results report the true agent.

byo

object | null

Echo of bring-your-own router usage (or null). The apiKey is masked: { "baseUrl": "...", "apiKey": "***", "model": "..." }.

prompt

string | null

The extra instruction appended to each task (or null if none was provided).

nConcurrent

number

Resolved concurrency.

limit

number

Resolved instance count.

timeout

number

Resolved per-task timeout (seconds).

env

string

Resolved execution backend.
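
For callers who prefer typed access to the response, the fields above map naturally onto a small record. The sketch below is one possible shape; the BenchmarkRun class and parse_run function are hypothetical, and the nullable fields (model, prompt, byo) follow the descriptions above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkRun:
    benchmark_run_id: str
    status: str
    benchmark: str
    agent: str
    n_concurrent: int
    limit: int
    timeout: int
    env: str
    model: Optional[str] = None   # null when the default model is used
    prompt: Optional[str] = None  # null when no extra instruction was sent
    byo: Optional[dict] = None    # masked echo of bring-your-own router usage


def parse_run(payload: str) -> BenchmarkRun:
    """Parse the JSON body returned by POST /api/v1/benchmarks/runs."""
    data = json.loads(payload)
    return BenchmarkRun(
        benchmark_run_id=data["benchmarkRunId"],
        status=data["status"],
        benchmark=data["benchmark"],
        agent=data["agent"],
        n_concurrent=data["nConcurrent"],
        limit=data["limit"],
        timeout=data["timeout"],
        env=data["env"],
        model=data.get("model"),
        prompt=data.get("prompt"),
        byo=data.get("byo"),
    )
```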

Bring-your-own router

Supply apiKey + baseUrl (always as a pair) to run the benchmark against your own OpenAI-compatible endpoint and key instead of the platform router. When BYO creds are present:

the agent runtime (claude/codex/grok) is pointed at your baseUrl with your apiKey;
model is passed through verbatim — no allowlist check — so you can evaluate self-hosted or third-party models;
the key is used in-memory for the run only and is never stored;
the response echoes a masked byo block (apiKey shown as ***).

The same apiKey / baseUrl fields are also accepted by Create Task to drive the interactive agent against your endpoint.

Concurrency & execution backends

nConcurrent sets how many tasks run at once; env decides what that concurrency consumes.

	`sandbox-per-task` (default)	`docker-in-parent`
Concurrency unit	One isolated sandbox per task (restored from the prepared snapshot)	One Docker container per task, all inside a single sandbox
Bounded by	Your account’s concurrent-sandbox quota (+ model rate limits)	That one VM’s CPU / RAM / Docker (+ model rate limits)
Isolation	Strong — separate VMs	Shared VM
Sandbox count	`nConcurrent` sandboxes	1 sandbox total
Best for	Heavy/long tasks (e.g. SWE-bench), strong isolation	Small/cheap runs, or high concurrency without using sandbox quota

To run tasks concurrently inside a single sandbox (rather than one sandbox each), use env: "docker-in-parent":

curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "aime",
    "model": "blackboxai/x-ai/grok-build-0.1",
    "agent": "grok",
    "limit": 60,
    "nConcurrent": 16,
    "env": "docker-in-parent"
  }'

nConcurrent is capped at 16. In sandbox-per-task the practical limit is your Vercel concurrent-sandbox quota; in docker-in-parent it’s the single VM’s resources (too many parallel containers + agent processes will saturate CPU/RAM). Raise the cap only alongside the matching infra headroom.
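
The trade-off between the two backends comes down to how many sandboxes a run occupies. The sketch below restates the table as code and adds one possible selection rule; choose_env and its sandbox_quota argument are hypothetical, and real decisions would also weigh isolation needs and VM capacity.

```python
def sandboxes_used(env: str, n_concurrent: int) -> int:
    """Number of sandboxes a run occupies at full concurrency."""
    if env == "sandbox-per-task":
        return n_concurrent  # one isolated sandbox per concurrent task
    if env == "docker-in-parent":
        return 1  # all containers share a single sandbox
    raise ValueError(f"unknown env {env!r}")


def choose_env(n_concurrent: int, sandbox_quota: int) -> str:
    """Hypothetical rule: use docker-in-parent when the quota is too small."""
    if not 1 <= n_concurrent <= 16:
        raise ValueError("nConcurrent must be between 1 and 16")
    if sandbox_quota >= n_concurrent:
        return "sandbox-per-task"
    return "docker-in-parent"
```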

Listing runs & sub-resources

GET /api/v1/benchmarks/runs returns the authenticated user’s benchmark runs (optionally filtered by ?status=). Per-run sub-resources live under /api/v1/benchmarks/runs/{benchmarkRunId}/…:

Sub-resource	Purpose
`GET …/results`	Score + percentage + per-task traces + tracking log — see Benchmark Results
`GET …/status`	Lightweight progress poll
`GET …/tasks`	Per-instance task results
`GET …/logs`	Log snapshot (live or persisted)
`GET …/logs/stream`	SSE log stream (live, or DB replay after the run ends)
`POST …/cancel`	Cancel an in-flight run

See Benchmark Logs & Status for the status/logs/stream details.
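
The sub-resource paths above lend themselves to a small URL helper plus a polling loop. The following is a sketch only: fetch_status and is_done are caller-supplied hooks (hypothetical, not part of the API), because this page names only the initial "queued" status and does not list the terminal ones.

```python
import time
from typing import Callable, Tuple

RUNS_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"

# HTTP method for each per-run sub-resource listed in the table above.
SUB_RESOURCES = {
    "results": "GET",
    "status": "GET",
    "tasks": "GET",
    "logs": "GET",
    "logs/stream": "GET",
    "cancel": "POST",
}


def sub_resource(run_id: str, name: str) -> Tuple[str, str]:
    """Return (method, url) for one sub-resource of a run."""
    return SUB_RESOURCES[name], f"{RUNS_URL}/{run_id}/{name}"


def poll_status(
    run_id: str,
    fetch_status: Callable[[str], str],
    is_done: Callable[[str], bool],
    interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the status sub-resource until is_done(status) is true."""
    _, url = sub_resource(run_id, "status")
    for _ in range(max_polls):
        status = fetch_status(url)
        if is_done(status):
            return status
        sleep(interval)
    raise TimeoutError(f"run {run_id} still not done after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop easy to exercise in tests without waiting in real time.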

Error Codes

Status	Description
200	Run queued
400	Missing/invalid `benchmark`, invalid `agent`, or bad JSON
401	Invalid or missing API key
403	Pro subscription required
429	Too many concurrent benchmark runs
500	Failed to launch the run
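
One way a client might act on these codes is sketched below. The descriptions are taken from the table; treating only 429 as retryable is an assumption made for illustration (the page does not say which failures are transient), and BenchmarkApiError and check_status are hypothetical names.

```python
# Status codes and descriptions as listed in the table above.
STATUS_DESCRIPTIONS = {
    200: "Run queued",
    400: "Missing/invalid benchmark, invalid agent, or bad JSON",
    401: "Invalid or missing API key",
    403: "Pro subscription required",
    429: "Too many concurrent benchmark runs",
    500: "Failed to launch the run",
}

# Assumption: a 429 may clear once earlier runs finish, so it is retryable.
RETRYABLE = frozenset({429})


class BenchmarkApiError(Exception):
    """Raised for any non-200 response; retryable hints at whether to retry."""

    def __init__(self, status: int, description: str, retryable: bool):
        super().__init__(f"{status}: {description}")
        self.status = status
        self.retryable = retryable


def check_status(status: int) -> None:
    """Return None on 200, otherwise raise BenchmarkApiError."""
    if status == 200:
        return
    description = STATUS_DESCRIPTIONS.get(status, "Unexpected status")
    raise BenchmarkApiError(status, description, status in RETRYABLE)
```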

Benchmark Results

Score, percentage, per-task traces, and the tracking log.

Benchmark Logs & Status

Poll status and stream logs (live, then from the DB).

Agent Runtimes

How model / agent choose Claude vs Codex vs Grok.

Models

Model ids and their runtime mapping.
