Bench Command
Run performance benchmarks for a Homeboy component and surface regression deltas against a stored baseline.
Synopsis
homeboy bench <component> [options] [-- <runner-args>]
homeboy bench list <component> [options] [-- <runner-args>]
homeboy bench history <component> [--scenario <id>] [--rig <id>] [--limit 20]
homeboy bench distribution <component> --field <metadata.path> [--scenario <id>] [--rig <id>] [--status <status>] [--limit 20]
homeboy bench compare --from-run <run-id> --to-run <run-id>
Description
The bench command invokes the extension’s bench runner, which measures
one or more scenarios over N iterations and emits a structured JSON
results file. Homeboy parses the results, compares declared numeric
metrics against a saved baseline, and returns a structured report plus
an exit code suitable for CI gates.
bench is a sibling of test, lint, and build under homeboy’s
extension capability model. The runner contract, manifest shape, and
baseline primitive (homeboy.json → baselines.bench) are shared with
the other capabilities.
Arguments
- <component>: Component to benchmark. Auto-detected from the current working directory if omitted. The component must have a linked extension that declares a bench capability.
Options
- --iterations <N>: Iterations per scenario (default 10). Forwarded to the runner via $HOMEBOY_BENCH_ITERATIONS. Extensions may clamp.
- --baseline: Save the current run as the new baseline under homeboy.json → baselines.bench.
- --ignore-baseline: Run without comparing to any saved baseline.
- --ratchet: When scenarios improve, auto-update the saved baseline so the improvement "sticks". Ignored when the run regresses.
- --regression-threshold <PERCENT>: Legacy p95 regression tolerance (default 5.0) used when the runner does not declare metric_policies. A p95 scenario regresses when its current p95_ms exceeds baseline.p95_ms * (1 + threshold/100).
- --shared-state <DIR>: Directory shared across iterations and concurrent runner instances. Forwarded to workloads via $HOMEBOY_BENCH_SHARED_STATE.
- --concurrency <N>: Number of parallel bench runner instances to spawn (default 1). Values greater than 1 require --shared-state.
- --setting <key=value>: Override component settings (may be repeated).
- --setting-json <key=json>: Override component settings with typed JSON values for arrays, objects, numbers, booleans, or null.
- --path <PATH>: Override the component’s local_path for this run.
- --json-summary: Include a compact machine-readable summary in the JSON output envelope (for CI wrappers).
- --report side-by-side: Select the combined side-by-side comparison report for a multi-rig bench envelope. The report is emitted under reports.side_by_side and includes each rig’s status, elapsed time, key metrics, artifact paths/URLs, and failure reason.
- --rig <RIG_ID[,RIG_ID...]>: Pin the run to one or more rigs. A single rig pins the rig and stores its baseline under a rig-scoped key. If that rig declares bench.components, the command fans out across those components under one rig-state snapshot. Multiple rigs (comma-separated) run the same component + workload + iteration count against each rig in sequence and emit a cross-rig comparison envelope. See "Cross-rig comparison" below.
- --rig-concurrency <N>: For multi-rig comparisons, run up to N rigs concurrently. Default 1 preserves sequential CI behavior. Values greater than 1 are opt-in and preserve output ordering by the selected rig order.
- --scenario <SCENARIO_ID>: Run or list only the exact scenario id. May be repeated. Homeboy validates selected ids against discovery before execution and forwards the comma-separated selector to runners via $HOMEBOY_BENCH_SCENARIOS.
- --ci-profile <ID>: Run using env and passthrough args from a single extension-declared CI profile whose only job declares command: "bench". This keeps bench parity on the normal Homeboy bench workflow instead of parsing arbitrary provider YAML into runnable commands.
- --ignore-default-baseline: Skip automatic single-rig expansion when the rig declares bench.default_baseline_rig.
Arguments after -- are passed verbatim to the extension’s bench runner
script.
When --ci-profile <ID> is used, args declared on that profile’s bench job
are forwarded before explicit CLI passthrough arguments, and job env is passed
to the bench runner. Profiles with zero jobs, multiple jobs, or a non-bench
job are rejected for command-native bench reproduction; use homeboy ci run --profile <ID> when you need to execute a multi-job CI profile directly.
homeboy bench is resource-policy aware. If the current machine is already warm
or hot according to homeboy doctor resources, Homeboy prints a stderr warning
before running because the extra load can skew benchmark results. Use global
--force-hot when running under load is intentional:
homeboy --force-hot bench my-component
Observation History
Every homeboy bench run is persisted to the local observation store. The
normal bench output includes hints with the persisted run ID, the observation DB
path, and follow-up commands such as:
homeboy runs show <run-id>
homeboy runs list --kind bench --component <component> --rig <rig>
Omit --rig <rig> from the list command for unpinned bench runs.
homeboy bench history <component> lists persisted benchmark runs from the local observation store. --scenario filters to runs whose stored metadata includes the scenario, and --rig narrows to rig-pinned runs.
homeboy bench distribution <component> --field <metadata.path> summarizes repeated categorical values from persisted benchmark run metadata. It is generic over metadata shape: scalar string, number, and boolean values are counted directly, and arrays are flattened. Use --scenario, --rig, --status, and --limit to narrow the persisted run window before aggregation.
homeboy bench compare --from-run <run-id> --to-run <run-id> compares numeric metrics recorded in two persisted benchmark runs. It matches scenario + metric rows, reports absolute deltas and percent changes, and lists metrics that exist in only one run. The command is read-only and exits successfully for a valid comparison even when the numbers regress.
Scenario Discovery
homeboy bench list <component> asks the extension runner for its scenario
inventory without executing any workload code. The runner receives
$HOMEBOY_BENCH_LIST_ONLY=1 and writes the normal BenchResults envelope
with iterations: 0, empty per-scenario metrics, and optional discovery
metadata such as file, source, default_iterations, and tags.
This is the safe first step for agent-driven or CI-driven perf work: inspect what can be measured before deciding which full bench run is worth paying for.
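As a rough illustration of the list-only handshake, the sketch below shows how a runner written in Python might answer a discovery request. The SCENARIOS inventory and the function names are hypothetical, and it assumes the discovery envelope is written to the same $HOMEBOY_BENCH_RESULTS_FILE used for normal runs.

```python
import json
import os

# Hypothetical inventory; a real runner would discover these from disk.
SCENARIOS = [
    {"id": "reads-heavy", "file": "bench/reads/heavy.php", "tags": ["db"]},
    {"id": "cold-boot", "file": "bench/cold-boot.php", "tags": []},
]

def build_discovery_envelope(component_id, scenarios):
    """Return a BenchResults-shaped dict that carries no measurements."""
    return {
        "component_id": component_id,
        "iterations": 0,
        "scenarios": [
            {"id": s["id"], "file": s["file"], "tags": s["tags"],
             "iterations": 0, "metrics": {}}
            for s in scenarios
        ],
    }

def handle_list_only(env):
    """Write the inventory and return True when list-only mode is requested."""
    if env.get("HOMEBOY_BENCH_LIST_ONLY") != "1":
        return False
    envelope = build_discovery_envelope(
        env.get("HOMEBOY_COMPONENT_ID", "unknown"), SCENARIOS)
    # Assumption: discovery output goes to the normal results file.
    with open(env["HOMEBOY_BENCH_RESULTS_FILE"], "w", encoding="utf-8") as fh:
        json.dump(envelope, fh)
    return True
```

A runner would call handle_list_only(os.environ) before touching any workload code and exit early when it returns True.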
Semantic Gates
Bench runners can attach scenario-level correctness gates to non-timing metrics. Gates are evaluated after metrics are parsed and aggregated. Any failed gate marks the scenario and run failed, even when timing metrics improve, and the failure details are emitted before baseline comparison data in the JSON output.
Supported operators are eq, gte, and lte:
{
  "id": "studio-agent-loop",
  "iterations": 10,
  "metrics": {
    "p95_ms": 1200.0,
    "assistant_message_count": 1,
    "identifies_studio_rate": 1.0
  },
  "gates": [
    { "metric": "assistant_message_count", "op": "gte", "value": 1 },
    { "metric": "identifies_studio_rate", "op": "eq", "value": 1.0 }
  ]
}
Failed gates add gate_results, set the scenario’s passed field to
false, and add top-level gate_failures plus budget_findings to the bench
output.
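The gate semantics above can be sketched in a few lines of Python. This is an illustration of the eq / gte / lte operators only, not Homeboy's implementation; the evaluate_gates name, the shape of each result row, and the treatment of a missing metric as a failure are assumptions.

```python
import operator

# The three operators named above.
GATE_OPS = {"eq": operator.eq, "gte": operator.ge, "lte": operator.le}

def evaluate_gates(scenario):
    """Return (passed, gate_results) for one scenario dict."""
    results = []
    for gate in scenario.get("gates", []):
        actual = scenario["metrics"].get(gate["metric"])
        # Assumption: a gate on a metric the scenario did not emit fails.
        ok = actual is not None and GATE_OPS[gate["op"]](actual, gate["value"])
        results.append({**gate, "actual": actual, "passed": ok})
    return all(r["passed"] for r in results), results

scenario = {
    "id": "studio-agent-loop",
    "metrics": {"p95_ms": 1200.0, "assistant_message_count": 1,
                "identifies_studio_rate": 1.0},
    "gates": [
        {"metric": "assistant_message_count", "op": "gte", "value": 1},
        {"metric": "identifies_studio_rate", "op": "eq", "value": 1.0},
    ],
}
passed, gate_results = evaluate_gates(scenario)
```

With the example scenario every gate holds, so passed is True; lowering identifies_studio_rate to 0.9 would fail the second gate regardless of p95_ms.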
Budget Findings
Benchmark and profile workloads can emit top-level budget_findings for fixed
threshold failures that should report and gate consistently across extensions.
The common shape is:
{
  "category": "budget",
  "code": "rest.max_response_bytes",
  "severity": "error",
  "file": null,
  "context_label": "profile:wordpress-rest",
  "message": "REST response exceeded 250 KB budget",
  "actual": 4378195,
  "expected": 250000,
  "unit": "bytes",
  "subject": "/wp-json/datamachine/v1/pipelines?per_page=100",
  "passed": false
}
severity: "error" or passed: false fails the bench run. Lower severities
are report-only. code is the stable grouping key, subject identifies the
endpoint/resource/phase being measured, and actual / expected / unit are
rendered by homeboy report failure-digest.
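For a wrapper that post-processes bench output, the failure rule can be expressed as a small predicate. The helper names below are hypothetical, and the grouping by code only mirrors the "stable grouping key" wording above rather than any documented report format.

```python
def finding_fails_run(finding):
    """True when a budget finding should fail the bench run."""
    return finding.get("severity") == "error" or finding.get("passed") is False

def failing_subjects_by_code(findings):
    """Group the subjects of failing findings under their stable code."""
    grouped = {}
    for finding in findings:
        if finding_fails_run(finding):
            grouped.setdefault(finding["code"], []).append(finding["subject"])
    return grouped

findings = [
    {"code": "rest.max_response_bytes", "severity": "error", "passed": False,
     "subject": "/wp-json/datamachine/v1/pipelines?per_page=100"},
    {"code": "rest.max_response_bytes", "severity": "warning", "passed": True,
     "subject": "/wp-json/wp/v2/posts"},  # hypothetical report-only finding
]
```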
Rigs can also declare gates for scenario metrics when the workload output is
owned elsewhere. The keys under metric_gates are exact scenario ids:
{
  "bench": {
    "default_component": "studio",
    "metric_gates": {
      "wordpress-is-dead": {
        "native_block_quality_pass": { "equals": 1 },
        "tool_error_count": { "equals": 0 },
        "success_rate": { "equals": 1 }
      }
    }
  }
}
Rig-declared metric gates are merged with workload-emitted gates before
Homeboy evaluates scenario status. Supported rig operators are equals, gte,
and lte.
Invocation Isolation
Every bench child process receives generic invocation-scoped environment variables, independent of runner implementation:
- HOMEBOY_INVOCATION_ID: stable identifier for this child workload invocation.
- HOMEBOY_INVOCATION_STATE_DIR: private state directory for files that should outlive one internal iteration but remain scoped to this invocation.
- HOMEBOY_INVOCATION_ARTIFACT_DIR: private artifact directory for logs, screenshots, traces, and downloaded/build artifacts.
- HOMEBOY_INVOCATION_TMP_DIR: private temporary directory for project copies, browser profiles, wasm caches, and other scratch state.
Runtime path contract
Invocation directories are placed under a short, platform-aware root so any
workload can put a UNIX domain socket or other path-length-sensitive primitive
under HOMEBOY_INVOCATION_STATE_DIR without bespoke defense.
Layout. The invocation owns three sibling directories under the runtime root:
- <root>/<short-id> → HOMEBOY_INVOCATION_STATE_DIR (the leaf the workload owns; downstream sockets bind directly here).
- <root>/<short-id>.a → HOMEBOY_INVOCATION_ARTIFACT_DIR
- <root>/<short-id>.t → HOMEBOY_INVOCATION_TMP_DIR
There is no s/a/t subdir layer underneath the short id. The invocation is
1:1 with a single workload run, so an extra workload-id segment under
STATE_DIR would burn sockaddr_un budget for no isolation gain. Workloads
that need internal subdirs under STATE_DIR can still create them, but they
own the path-length budget at that point.
Root selection. In priority order:
- HOMEBOY_INVOCATION_RUNTIME_DIR env override (tests and unusual host configurations).
- /tmp/hb on every Unix host when /tmp is a writable directory. macOS apps that respect $TMPDIR get per-user isolation under /var/folders/<14>/T/... (~50 bytes), which leaves no realistic sockaddr_un budget. Anchoring to /tmp saves ~35 bytes of headroom and is writable on every standard macOS / Linux configuration.
- Linux fallback: $XDG_RUNTIME_DIR/hb when set and /tmp is unusable.
- macOS fallback: $TMPDIR/hb when /tmp is unusable.
- Generic fallback: ~/.cache/homeboy/inv (or $XDG_CACHE_HOME/homeboy/inv).
Path components use a short opaque id (~10 hex chars) instead of a full UUID v4. The full UUID is retained inside the on-disk invocation lease for traceability, but is never embedded in path components.
Path budget contract. Homeboy guarantees that
HOMEBOY_INVOCATION_STATE_DIR, HOMEBOY_INVOCATION_ARTIFACT_DIR, and
HOMEBOY_INVOCATION_TMP_DIR leave at least:
- 48 bytes of headroom under the macOS sockaddr_un sun_path capacity (104 bytes), and
- 32 bytes of headroom under the Linux capacity (108 bytes).
Workloads can append a realistic workload-relative socket name like
<workload-id>/daemon/daemon.sock (≈40 bytes) directly under STATE_DIR and
bind to it without EINVAL. If $HOME or the configured root are
unusually long, Homeboy fails fast at invocation acquisition with a clear
error naming sockaddr_un, the platform-specific limit, the available
headroom, and the HOMEBOY_INVOCATION_RUNTIME_DIR override — instead of
letting a downstream workload hit the limit at bind time.
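A workload that builds deeper socket paths under STATE_DIR can run the same kind of budget check itself. The sketch below is an assumption-laden illustration: the socket_path_fits helper is hypothetical, it uses the 104 / 108 byte capacities quoted above, and it reserves one byte for a trailing NUL terminator.

```python
import os
import sys

# sun_path capacities quoted in the path budget contract above.
SUN_PATH_MAX = {"darwin": 104, "linux": 108}

def socket_path_fits(state_dir, relative_name, platform=sys.platform):
    """Check that state_dir/relative_name fits in sockaddr_un.sun_path."""
    # Assumption: fall back to the smaller capacity on unknown platforms.
    limit = SUN_PATH_MAX.get(platform, 104)
    full = os.path.join(state_dir, relative_name)
    # Count bytes, not characters, and reserve one byte for the NUL.
    return len(os.fsencode(full)) + 1 <= limit
```

For example, a state dir such as /tmp/hb/0123456789 (a hypothetical short id) combined with wl/daemon/daemon.sock stays well inside both limits, while a deeply nested state dir would be rejected before bind time.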
Rig workload object entries can request optional shared-machine primitives:
{
  "bench_workloads": {
    "nodejs": [
      {
        "path": "${package.root}/bench/playground-server.bench.mjs",
        "port_range_size": 8,
        "named_leases": ["playground-browser-profile"]
      }
    ]
  }
}
When port_range_size is set, Homeboy allocates a non-overlapping local port
range for each child invocation and exports HOMEBOY_INVOCATION_PORT_BASE and
HOMEBOY_INVOCATION_PORT_MAX. Leases are persisted under Homeboy’s local config
directory while the child is running, guarded by a local index lock, and stale
PID leases are pruned on the next allocation.
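On the workload side, consuming the allocated range might look like the sketch below. The invocation_ports helper is hypothetical, and it assumes HOMEBOY_INVOCATION_PORT_MAX is the last usable port (inclusive), which the text above does not state explicitly.

```python
import os

def invocation_ports(env=os.environ):
    """Return the list of ports allocated to this invocation."""
    base = int(env["HOMEBOY_INVOCATION_PORT_BASE"])
    top = int(env["HOMEBOY_INVOCATION_PORT_MAX"])
    # Assumption: PORT_MAX is inclusive.
    return list(range(base, top + 1))

# Example with a hand-built environment instead of the real one.
ports = invocation_ports({
    "HOMEBOY_INVOCATION_PORT_BASE": "9000",
    "HOMEBOY_INVOCATION_PORT_MAX": "9007",
})
```

Under the inclusive assumption, a port_range_size of 8 starting at 9000 yields ports 9000 through 9007, which a workload could hand out to its server, browser, and helper processes.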
named_leases are for truly shared machine-local resources that cannot be
namespaced with dirs or ports. Conflicts fail before the workload starts and
name the held lease and holder invocation when available.
These primitives are intentionally generic enough for WordPress Playground-style benchmarks: one invocation can run browser/server/wasm/node services, consume multiple local ports, keep downloaded or built artifacts in the artifact dir, and use private temp project dirs without Studio-specific namespace code.
Examples
# Benchmark a component with defaults (10 iterations, 5% regression threshold)
homeboy bench my-component
# Inspect the persisted observation from a completed bench run
homeboy runs show <run-id>
# List related persisted bench runs for comparison loops
homeboy runs list --kind bench --component my-component --rig studio-trunk
# List declared scenarios without executing them
homeboy bench list my-component
# 50 iterations, stricter 2% regression threshold
homeboy bench my-component --iterations 50 --regression-threshold 2.0
# Save a new baseline
homeboy bench my-component --baseline
# Run with auto-ratchet on improvement
homeboy bench my-component --ratchet
# Select a single scenario
homeboy bench my-component --scenario hot_path
# Select two scenarios in a cross-rig comparison
homeboy bench studio --rig studio-trunk,studio-branch \
  --scenario studio-agent-runtime \
  --scenario wp-admin-load
# Share warm state across invocations and run four instances in parallel
homeboy bench my-component --shared-state /tmp/homeboy-bench --concurrency 4
# Pin to a single rig — preflight + rig-scoped baseline
homeboy bench studio --rig studio-trunk
# Pin to one rig and run every component declared in bench.components
homeboy bench --rig mdi-substrates --shared-state /tmp/mdi-bench
# Cross-rig comparison: same workload, two rigs, side-by-side report.
# First rig (`studio-trunk`) is the reference; the diff table expresses
# every other rig's metrics as percent deltas vs the reference.
homeboy bench studio --rig studio-trunk,studio-combined-fixes \
  --iterations 10 \
  --report side-by-side
# Opt into parallel cross-rig execution for side-by-side exploratory runs.
homeboy bench studio --rig studio-agent-sdk,studio-bfb \
  --scenario studio-agent-site-build \
  --rig-concurrency 2
# Three-rig comparison to isolate one PR's contribution.
homeboy bench studio \
  --rig trunk,combined-fixes,combined-fixes-without-3120 \
  --iterations 20
Cross-rig comparison
--rig <a>,<b>[,<c>...] runs the same component + workload + iteration
count against each rig and emits a single comparison envelope. By default,
rigs run in sequence for stable CI behavior. Pass --rig-concurrency <N>
to opt into bounded parallel rig execution for exploratory or product-demo
workflows where fresh isolated sites should be built in the same wall-clock
window.
How it runs
For each rig, in input order by default, or in bounded parallel batches when
--rig-concurrency is greater than 1:
- Load the rig spec and run rig check. Failure aborts the entire comparison, because comparing against an unhealthy rig would produce garbage numbers.
- Snapshot rig state (each component’s git SHA + branch) into the per-rig output entry.
- Run bench against the resolved component with the rig pinned.
After every rig finishes, results are aggregated into a
BenchComparisonOutput envelope with comparison: "cross_rig". The
first rig in the list is the reference: per-metric percent deltas
in the diff table express each subsequent rig as (current - reference) / reference * 100.
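The delta formula can be checked with a short Python sketch. The function names and the rounding to two decimals are assumptions chosen to match the sample output later in this document; returning None for a zero reference is also an assumption.

```python
def delta_percent(reference, current):
    """Percent change of current relative to reference."""
    if reference == 0:
        return None  # assumption: an undefined delta is reported as missing
    return round((current - reference) / reference * 100, 2)

def diff_against_reference(reference, others):
    """reference: {metric: value}; others: {rig_id: {metric: value}}."""
    diff = {}
    for metric, ref_value in reference.items():
        for rig_id, metrics in others.items():
            if metric not in metrics:
                continue  # missing metrics are skipped, no synthetic zeros
            diff.setdefault(metric, {})[rig_id] = {
                "reference": ref_value,
                "current": metrics[metric],
                "delta_percent": delta_percent(ref_value, metrics[metric]),
            }
    return diff

example = diff_against_reference(
    {"p95_ms": 31200.0},
    {"studio-combined-fixes": {"p95_ms": 19400.0}},
)
```

Here (19400 - 31200) / 31200 * 100 rounds to -37.82, so a negative delta on a lower-is-better metric such as p95_ms means the non-reference rig is faster.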
Multi-rig comparison envelopes include reports.side_by_side, a compact
report artifact for demo and product harnesses. It references every rig
result and surfaces:
- per-rig status, exit code, and failure reason
- elapsed time when elapsed_ms or duration_ms metrics are present
- flattened key metrics, including grouped metrics like prompt.hash_match
- artifact paths and URLs, including URL-looking artifact paths
Parallel execution preserves the selected rig order in the output envelope, so the reference rig and diff interpretation do not change when concurrency is enabled.
What’s intentionally not done
- No baseline writes. --baseline and --ratchet are rejected on cross-rig invocations. Baselines are per-rig; writing one from a comparison would silently bless one rig over the others. Run homeboy bench --rig <id> --baseline once per rig to ratchet individually.
- No statistical-significance gating. Two rigs with overlapping p95_ms distributions still produce a numeric delta. Treat single-digit percent moves with skepticism.
Rig bench defaults
Rig specs can reduce repeated CLI arguments for common main-vs-branch bench workflows:
{
  "bench": {
    "default_component": "studio",
    "components": ["studio", "playground"],
    "default_baseline_rig": "studio-trunk"
  },
  "bench_workloads": {
    "wordpress": ["${package.root}/bench/studio-admin.php"]
  }
}
- bench.default_component lets homeboy bench --rig <id> omit the positional component. With multiple rigs, every rig must agree on the default unless the component is provided explicitly.
- bench.components lets homeboy bench --rig <id> fan out across a list of components from one rig spec. Scenarios are merged into the standard single-run envelope with :c<component> suffixes (for example cold-boot:cstudio). When --shared-state <dir> is provided, each component gets its own <dir>/<component> subdirectory.
- bench.default_baseline_rig upgrades homeboy bench --rig <candidate> into homeboy bench --rig <baseline>,<candidate> unless the invocation already lists multiple rigs, writes a baseline (--baseline / --ratchet), passes --ignore-default-baseline, or the candidate rig declares a multi-component bench.components matrix.
- bench_workloads supplies rig-owned workload files keyed by extension ID. Paths support ~, ${env.NAME}, ${components.<id>.path}, and ${package.root} expansion. ${package.root} resolves to the installed rig package root, so portable rig packages can keep workload files next to the rig spec without hardcoded machine paths.
Output shape (cross-rig)
{
  "comparison": "cross_rig",
  "passed": true,
  "component": "studio",
  "exit_code": 0,
  "iterations": 10,
  "rigs": [
    {
      "rig_id": "studio-trunk",
      "passed": true,
      "status": "passed",
      "exit_code": 0,
      "artifacts": [
        {
          "scenario_id": "agent_boot",
          "run_index": 0,
          "name": "raw_result",
          "path": "bench-artifacts/agent_boot/run-0/raw-result.json",
          "kind": "json",
          "label": "Raw result"
        }
      ],
      "results": { ... },
      "rig_state": { "rig_id": "studio-trunk", "captured_at": "...", "components": { ... } }
    },
    {
      "rig_id": "studio-combined-fixes",
      "passed": true,
      "status": "passed",
      "exit_code": 0,
      "results": { ... },
      "rig_state": { ... }
    }
  ],
  "diff": {
    "by_scenario": {
      "agent_boot": {
        "p95_ms": {
          "studio-combined-fixes": {
            "reference": 31200.0,
            "current": 19400.0,
            "delta_percent": -37.82
          }
        }
      }
    }
  },
  "hints": [ ... ]
}
The reference rig is omitted from the inner diff.by_scenario.<id>.<metric>
map — its delta against itself would always be zero. A scenario or
metric missing from a non-reference rig is silently skipped (no
synthetic zeros).
Each rig entry also includes an artifacts index when workloads emit
artifact pointers. The full-fidelity data remains nested under
results.scenarios[].artifacts and results.scenarios[].runs[].artifacts,
but the index makes proof artifacts easy to find in cross-rig output.
run_index is zero-based and omitted for scenario-level artifacts that
are not tied to a specific --runs iteration.
Exit code
exit_code is 0 only when every rig passed. The first non-zero rig
exit code wins. passed is true only when every rig passed.
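The aggregation rule is small enough to state as code. This sketch assumes a rig counts as passed exactly when its exit code is 0; the function name is hypothetical.

```python
def comparison_exit_code(rig_exit_codes):
    """Return the first non-zero rig exit code, or 0 when every rig passed."""
    return next((code for code in rig_exit_codes if code != 0), 0)

def comparison_passed(rig_exit_codes):
    """True only when every rig exited with 0 (assumed to mean passed)."""
    return comparison_exit_code(rig_exit_codes) == 0
```

Rig order matters here: with exit codes [0, 2, 1] the envelope reports 2, because the second rig is the first one that failed.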
Baseline Ratchet Semantics
The bench baseline is a list of per-scenario snapshots stored in
homeboy.json under the baselines.bench key. Each snapshot records
{ id, metrics } plus the iteration count at capture time.
On every run without --baseline or --ignore-baseline:
- Each current scenario is matched against the baseline by id.
- If the runner declares metric_policies, only those metrics are compared. Each policy declares whether lower or higher values are better and optional percent/absolute tolerances.
- If a policy declares variance_aware: true, Homeboy compares the metric’s raw sample distributions instead of only the summary value. The summary value still appears under metrics.<name> for reports; the per-iteration samples live under metrics.distributions.<name>.
- If the runner omits metric_policies, Homeboy keeps the historical default: compare p95_ms as lower-is-better with the CLI threshold.
- A scenario improves when any compared metric moves in the better direction.
- Scenarios present in one run but not the other are flagged as new_scenario_ids / removed_scenario_ids. Neither state triggers a regression by itself; they are informational.
- If any scenario regressed, the command exits 1 regardless of the runner’s own exit code.
- If any scenario improved and --ratchet is set, the baseline is overwritten with the current snapshot.
p95 remains the default for legacy latency benchmarks because it is less
sensitive than mean to one-off GC pauses but more sensitive than p99 to
genuine regressions. Runners that care about non-latency signals should
declare metric_policies instead.
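The point-comparison rules above can be sketched as two small functions. They illustrate the documented thresholds only and are not Homeboy's code: the function names are hypothetical, variance-aware tests are out of scope, and treating "no threshold declared" as "any worsening regresses" is an assumption.

```python
def legacy_p95_regressed(baseline_p95, current_p95, threshold_percent=5.0):
    """Legacy rule: current p95 exceeds baseline * (1 + threshold/100)."""
    return current_p95 > baseline_p95 * (1 + threshold_percent / 100)

def policy_regressed(baseline, current, direction, percent=None, absolute=None):
    """Point-delta regression check for one metric under a policy."""
    lower_better = direction in ("lower_is_better", "lower")
    # Movement in the worse direction, expressed as a positive number.
    worse_by = current - baseline if lower_better else baseline - current
    if worse_by <= 0:
        return False
    checks = []
    if percent is not None:
        # Assumption: any worsening from a zero baseline exceeds a percent tolerance.
        checks.append(baseline == 0 or worse_by / abs(baseline) * 100 > percent)
    if absolute is not None:
        checks.append(worse_by > absolute)
    # With both tolerances present, both must be exceeded to regress.
    return all(checks)
```

For instance, requests_per_second dropping from 180.5 to 170.0 under a 5% higher-is-better policy is a regression (about 5.8% worse), while a p95 of 104 ms against a 100 ms baseline stays inside the default 5% legacy tolerance.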
Runner Contract
The extension’s bench script must:
- Read $HOMEBOY_BENCH_ITERATIONS to determine iteration count.
- Write its JSON output to $HOMEBOY_BENCH_RESULTS_FILE.
- Exit with a non-zero status only on runner-level failure (script error, workload crash); regressions are homeboy’s domain.
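A minimal runner that satisfies these three rules might look like the Python sketch below. It is an illustration under stated assumptions: the workloads mapping, the nearest-rank percentile, and the helper names are all hypothetical, and a real extension would usually ship a shell or language-native script instead.

```python
import json
import math
import statistics
import time

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def measure(workload, iterations):
    """Time a callable and return summary metrics in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": percentile(samples, 50),
        "p95_ms": percentile(samples, 95),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }

def build_results(env, workloads):
    """Build the results envelope from env vars and {scenario_id: callable}."""
    iterations = int(env.get("HOMEBOY_BENCH_ITERATIONS", "10"))
    return {
        "component_id": env.get("HOMEBOY_COMPONENT_ID", "unknown"),
        "iterations": iterations,
        "scenarios": [
            {"id": sid, "iterations": iterations, "metrics": measure(fn, iterations)}
            for sid, fn in workloads.items()
        ],
    }

def write_results(env, results):
    """Write the envelope where Homeboy expects to find it."""
    with open(env["HOMEBOY_BENCH_RESULTS_FILE"], "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
```

A script entry point would call write_results(os.environ, build_results(os.environ, workloads)) and let any exception propagate as a non-zero exit, leaving regression decisions to Homeboy.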
JSON output schema
{
  "component_id": "string",
  "iterations": 10,
  "metric_policies": {
    "error_rate": {
      "direction": "lower_is_better",
      "regression_threshold_absolute": 0.01
    },
    "requests_per_second": {
      "direction": "higher_is_better",
      "regression_threshold_percent": 5.0
    },
    "agent_loop_ms": {
      "direction": "lower_is_better",
      "regression_threshold_percent": 10.0,
      "variance_aware": true,
      "min_iterations_for_variance": 20,
      "regression_test": "mann_whitney_u"
    }
  },
  "scenarios": [
    {
      "id": "scenario_slug",
      "file": "tests/bench/some-workload.ext",
      "iterations": 10,
      "metrics": {
        "mean_ms": 120.3,
        "p50_ms": 118.0,
        "p95_ms": 145.0,
        "p99_ms": 160.0,
        "min_ms": 110.0,
        "max_ms": 172.0,
        "error_rate": 0.0,
        "requests_per_second": 180.5,
        "status_500_count": 0,
        "agent_loop_ms": 1200.0,
        "distributions": {
          "agent_loop_ms": [1100.0, 1200.0, 1300.0]
        }
      },
      "memory": { "peak_bytes": 41943040 },
      "artifacts": {
        "raw_result": {
          "path": "bench-artifacts/scenario_slug/raw-result.json",
          "kind": "json",
          "label": "Raw result"
        }
      }
    }
  ]
}
- Top-level keys are strict: unknown top-level fields are rejected to keep the contract honest.
- metrics is an arbitrary map of numeric values. Homeboy core does not attach domain meaning to metric names.
- metric_policies is optional. If omitted, Homeboy compares p95_ms using the legacy lower-is-better latency policy.
- Policy direction accepts lower_is_better / lower and higher_is_better / higher.
- Policy thresholds are optional. regression_threshold_percent compares relative movement; regression_threshold_absolute compares raw numeric movement. If both are present, a metric must exceed both tolerances to regress.
- Policy variance_aware: true requires a matching metrics.distributions.<metric> array on every scenario that emits the metric. If min_iterations_for_variance is set and the sample array is smaller, parsing fails before baseline comparison.
- Policy regression_test accepts point_delta, mann_whitney_u, and kolmogorov_smirnov. point_delta is the legacy summary-value check. Variance-aware metrics default to mann_whitney_u when the field is omitted. Mann-Whitney uses a one-sided 95% normal approximation; Kolmogorov-Smirnov uses the standard 5% two-sample critical value.
- Scenario-level unknown keys are tolerated, so extensions can emit additional metadata (tags, environment info, warmup counts) without breaking parsing.
- Scenario id values must be unique within one bench results envelope. Workload-discovering runners should derive ids from paths relative to the bench root (for example, reads/heavy.php → reads-heavy) instead of file basenames alone.
- memory is optional. Extensions that can’t measure peak memory omit it.
- file is optional but recommended for diagnostics.
- artifacts is optional. Values are local paths plus optional kind and label metadata. Homeboy preserves and indexes these pointers but does not upload, retain, or diff artifact contents.
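The id-derivation advice can be sketched as follows. The helper names are hypothetical, and joining path segments with a hyphen is inferred from the single reads/heavy.php → reads-heavy example rather than from a documented rule.

```python
from collections import Counter
from pathlib import PurePosixPath

def scenario_id(relative_path):
    """Derive an id from a workload path relative to the bench root."""
    without_suffix = PurePosixPath(relative_path).with_suffix("")
    return "-".join(without_suffix.parts)

def duplicate_ids(relative_paths):
    """Return ids that would collide within one results envelope."""
    counts = Counter(scenario_id(p) for p in relative_paths)
    return sorted(sid for sid, n in counts.items() if n > 1)

paths = ["reads/heavy.php", "writes/heavy.php"]
ids = [scenario_id(p) for p in paths]
```

Both files share the basename heavy.php, so basename-only ids would collide, while the path-derived ids reads-heavy and writes-heavy stay unique.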
Artifact kind conventions
kind is an open string, not a closed enum. Homeboy preserves unknown kinds
so extensions can add new evidence types without requiring a core release.
Use these conventional values when they fit:
- json: structured machine-readable data, summaries, or transcripts.
- directory: a directory containing multiple related files.
- playwright-trace: Playwright trace archive, usually a .zip file.
- screenshot: browser or UI screenshot evidence.
- network-log: captured request/response metadata such as HAR or JSONL.
- console-log: browser console output or page-level JavaScript logs.
For browser benchmarks, prefer specific labels such as Playwright trace,
Final screenshot, Network log, or Console log. The JSON result keeps
the full artifact map under each scenario, and Homeboy also promotes artifact
pointers into the top-level artifacts index. Markdown reports render those
labels and paths so users can find trace and screenshot files without opening
the raw JSON payload.
Environment variables injected
Bench scripts receive the standard runner contract plus bench-specific variables:
- HOMEBOY_BENCH_RESULTS_FILE: where to write JSON output.
- HOMEBOY_BENCH_ITERATIONS: iteration count to use.
- HOMEBOY_RUN_DIR: per-run directory (shared with test/lint/build).
- HOMEBOY_EXTENSION_ID, HOMEBOY_COMPONENT_ID, HOMEBOY_COMPONENT_PATH, and the usual execution-context vars.
- HOMEBOY_SETTINGS_JSON: component settings as JSON.
Component Requirements
For a component to be benchmarkable, it must have:
- A linked extension whose manifest declares a bench capability.
- A bench-runner script provided by the extension.
Extension manifest:
{
  "bench": {
    "extension_script": "scripts/bench/bench-runner.sh"
  }
}
Exit Codes
- 0: All scenarios passed, no regressions detected (or no baseline exists yet).
- 1: At least one scenario regressed beyond the threshold, or the runner itself failed.
- Other non-zero: Runner exit code passthrough (extension-specific).