Bench Command
Run performance benchmarks for a Homeboy component and surface regression deltas against a stored baseline.
Synopsis
homeboy bench <component> [options] [-- <runner-args>]
homeboy bench list <component> [options] [-- <runner-args>]
Description
The bench command invokes the extension’s bench runner, which measures
one or more scenarios over N iterations and emits a structured JSON
results file. Homeboy parses the results, compares declared numeric
metrics against a saved baseline, and returns a structured report plus
an exit code suitable for CI gates.
bench is a sibling of test, lint, and build under homeboy’s
extension capability model. The runner contract, manifest shape, and
baseline primitive (homeboy.json → baselines.bench) are shared with
the other capabilities.
Arguments
<component>: Component to benchmark. Auto-detected from the current working directory if omitted. The component must have a linked extension that declares a bench capability.
Options
- --iterations <N>: Iterations per scenario (default 10). Forwarded to the runner via $HOMEBOY_BENCH_ITERATIONS. Extensions may clamp.
- --baseline: Save the current run as the new baseline under homeboy.json → baselines.bench.
- --ignore-baseline: Run without comparing to any saved baseline.
- --ratchet: When scenarios improve, auto-update the saved baseline so the improvement "sticks". Ignored when the run regresses.
- --regression-threshold <PERCENT>: Legacy p95 regression tolerance (default 5.0) used when the runner does not declare metric_policies. A p95 scenario regresses when its current p95_ms exceeds baseline.p95_ms * (1 + threshold/100).
- --shared-state <DIR>: Directory shared across iterations and concurrent runner instances. Forwarded to workloads via $HOMEBOY_BENCH_SHARED_STATE.
- --concurrency <N>: Number of parallel bench runner instances to spawn (default 1). Values greater than 1 require --shared-state.
- --setting <key=value>: Override component settings (may be repeated).
- --setting-json <key=json>: Override component settings with typed JSON values for arrays, objects, numbers, booleans, or null.
- --path <PATH>: Override the component’s local_path for this run.
- --json-summary: Include a compact machine-readable summary in the JSON output envelope (for CI wrappers).
- --rig <RIG_ID[,RIG_ID...]>: Pin the run to one or more rigs. A single rig pins the rig and stores its baseline under a rig-scoped key. If that rig declares bench.components, the command fans out across those components under one rig-state snapshot. Multiple rigs (comma-separated) run the same component + workload + iteration count against each rig in sequence and emit a cross-rig comparison envelope. See "Cross-rig comparison" below.
- --scenario <SCENARIO_ID>: Run or list only the exact scenario id. May be repeated. Homeboy validates selected ids against discovery before execution and forwards the comma-separated selector to runners via $HOMEBOY_BENCH_SCENARIOS.
- --ignore-default-baseline: Skip automatic single-rig expansion when the rig declares bench.default_baseline_rig.
Arguments after -- are passed verbatim to the extension’s bench runner
script.
Scenario Discovery
homeboy bench list <component> asks the extension runner for its scenario
inventory without executing any workload code. The runner receives
$HOMEBOY_BENCH_LIST_ONLY=1 and writes the normal BenchResults envelope
with iterations: 0, empty per-scenario metrics, and optional discovery
metadata such as file, source, default_iterations, and tags.
This is the safe first step for agent-driven or CI-driven perf work: inspect what can be measured before deciding which full bench run is worth paying for.
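As a sketch of how a runner might honor list mode: the function below is illustrative (its name and scenario-dict shape are assumptions, not Homeboy's actual implementation), but it follows the contract described above, emitting the normal envelope with zero iterations and empty metrics when $HOMEBOY_BENCH_LIST_ONLY=1.

```python
import json
import os

def emit_results(scenarios, results_path):
    """Write a BenchResults-style envelope. In list mode, report the
    scenario inventory with zero iterations and no metrics."""
    list_only = os.environ.get("HOMEBOY_BENCH_LIST_ONLY") == "1"
    iterations = 0 if list_only else int(os.environ.get("HOMEBOY_BENCH_ITERATIONS", "10"))
    envelope = {
        "component_id": os.environ.get("HOMEBOY_COMPONENT_ID", "unknown"),
        "iterations": iterations,
        "scenarios": [
            {
                "id": s["id"],
                "file": s.get("file"),  # optional discovery metadata
                "iterations": iterations,
                "metrics": {} if list_only else s["metrics"],
            }
            for s in scenarios
        ],
    }
    with open(results_path, "w") as fh:
        json.dump(envelope, fh)
    return envelope
```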
Semantic Gates
Bench runners can attach scenario-level correctness gates to non-timing metrics. Gates are evaluated after metrics are parsed and aggregated. Any failed gate marks the scenario and run failed, even when timing metrics improve, and the failure details are emitted before baseline comparison data in the JSON output.
Supported operators are eq, gte, and lte:
{
"id": "studio-agent-loop",
"iterations": 10,
"metrics": {
"p95_ms": 1200.0,
"assistant_message_count": 1,
"identifies_studio_rate": 1.0
},
"gates": [
{ "metric": "assistant_message_count", "op": "gte", "value": 1 },
{ "metric": "identifies_studio_rate", "op": "eq", "value": 1.0 }
]
}
Failed gates add gate_results, set the scenario’s passed field to
false, and add top-level gate_failures to the bench output.
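Gate evaluation amounts to a per-gate comparison against the parsed metrics map. A minimal sketch (the function name and return shape are illustrative, not Homeboy's internals):

```python
def evaluate_gates(metrics, gates):
    """Evaluate eq/gte/lte gates against a scenario's metrics map.
    Returns (passed, gate_results); a metric missing from the map
    fails its gate rather than passing silently."""
    ops = {
        "eq": lambda a, b: a == b,
        "gte": lambda a, b: a >= b,
        "lte": lambda a, b: a <= b,
    }
    results = []
    for gate in gates:
        actual = metrics.get(gate["metric"])
        ok = actual is not None and ops[gate["op"]](actual, gate["value"])
        results.append({**gate, "actual": actual, "passed": ok})
    return all(r["passed"] for r in results), results
```

Run against the example scenario above, both gates pass; zero assistant messages would fail the gte gate even if p95_ms improved.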
Examples
# Benchmark a component with defaults (10 iterations, 5% regression threshold)
homeboy bench my-component
# List declared scenarios without executing them
homeboy bench list my-component
# 50 iterations, stricter 2% regression threshold
homeboy bench my-component --iterations 50 --regression-threshold 2.0
# Save a new baseline
homeboy bench my-component --baseline
# Run with auto-ratchet on improvement
homeboy bench my-component --ratchet
# Select a single scenario
homeboy bench my-component --scenario hot_path
# Select two scenarios in a cross-rig comparison
homeboy bench studio --rig studio-trunk,studio-branch \
  --scenario studio-agent-runtime \
  --scenario wp-admin-load
# Share warm state across invocations and run four instances in parallel
homeboy bench my-component --shared-state /tmp/homeboy-bench --concurrency 4
# Pin to a single rig — preflight + rig-scoped baseline
homeboy bench studio --rig studio-trunk
# Pin to one rig and run every component declared in bench.components
homeboy bench --rig mdi-substrates --shared-state /tmp/mdi-bench
# Cross-rig comparison: same workload, two rigs, side-by-side report.
# First rig (`studio-trunk`) is the reference; the diff table expresses
# every other rig's metrics as percent deltas vs the reference.
homeboy bench studio --rig studio-trunk,studio-combined-fixes --iterations 10
# Three-rig comparison to isolate one PR's contribution.
homeboy bench studio \
  --rig trunk,combined-fixes,combined-fixes-without-3120 \
  --iterations 20
Cross-rig comparison
--rig <a>,<b>[,<c>...] runs the same component + workload + iteration
count against each rig in sequence and emits a single comparison
envelope. Useful for "is my fix actually faster than trunk?" — same
question, two rigs differing only in component commit state.
How it runs
For each rig, in input order:
- Load the rig spec and run rig check. Failure aborts the entire comparison — comparing against an unhealthy rig would produce garbage numbers.
- Snapshot rig state (each component’s git SHA + branch) into the per-rig output entry.
- Run bench against the resolved component with the rig pinned.
After every rig finishes, results are aggregated into a
BenchComparisonOutput envelope with comparison: "cross_rig". The
first rig in the list is the reference: per-metric percent deltas
in the diff table express each subsequent rig as (current - reference) / reference * 100.
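The delta computation can be sketched as follows (function and argument names are illustrative; the percent formula and the skip-missing-metrics behavior match the description above and the "What's intentionally not done" notes below):

```python
def cross_rig_deltas(reference, others):
    """Per-metric percent deltas for each non-reference rig:
    (current - reference) / reference * 100. Metrics missing from
    either side are skipped rather than zero-filled."""
    diff = {}
    for rig_id, metrics in others.items():
        for name, current in metrics.items():
            ref = reference.get(name)
            if ref is None or ref == 0:
                continue  # skip missing metrics; avoid divide-by-zero
            diff.setdefault(name, {})[rig_id] = {
                "reference": ref,
                "current": current,
                "delta_percent": round((current - ref) / ref * 100, 2),
            }
    return diff
```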
What’s intentionally not done
- No baseline writes. --baseline and --ratchet are rejected on cross-rig invocations. Baselines are per-rig; writing one from a comparison would silently bless one rig over the others. Run homeboy bench --rig <id> --baseline once per rig to ratchet individually.
- No statistical-significance gating. Two rigs with overlapping p95_ms distributions still produce a numeric delta. Treat single-digit percent moves with skepticism.
Rig bench defaults
Rig specs can reduce repeated CLI arguments for common main-vs-branch bench workflows:
{
"bench": {
"default_component": "studio",
"components": ["studio", "playground"],
"default_baseline_rig": "studio-trunk"
},
"bench_workloads": {
"wordpress": ["${package.root}/bench/studio-admin.php"]
}
}
- bench.default_component lets homeboy bench --rig <id> omit the positional component. With multiple rigs, every rig must agree on the default unless the component is provided explicitly.
- bench.components lets homeboy bench --rig <id> fan out across a list of components from one rig spec. Scenarios are merged into the standard single-run envelope with :c<component> suffixes (for example cold-boot:cstudio). When --shared-state <dir> is provided, each component gets its own <dir>/<component> subdirectory.
- bench.default_baseline_rig upgrades homeboy bench --rig <candidate> into homeboy bench --rig <baseline>,<candidate> unless the invocation already lists multiple rigs, writes a baseline (--baseline/--ratchet), passes --ignore-default-baseline, or the candidate rig declares a multi-component bench.components matrix.
- bench_workloads supplies rig-owned workload files keyed by extension ID. Paths support ~, ${env.NAME}, ${components.<id>.path}, and ${package.root} expansion. ${package.root} resolves to the installed rig package root, so portable rig packages can keep workload files next to the rig spec without hardcoded machine paths.
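The workload-path substitutions can be sketched like this (a hypothetical helper, not Homeboy's actual resolver; it assumes unknown ${...} tokens are left untouched, which is an assumption):

```python
import os
import re

def expand_workload_path(path, package_root, component_paths):
    """Expand the substitutions bench_workloads paths support:
    ~, ${env.NAME}, ${components.<id>.path}, and ${package.root}."""
    def sub(match):
        key = match.group(1)
        if key == "package.root":
            return package_root
        if key.startswith("env."):
            return os.environ.get(key[4:], "")
        comp = re.fullmatch(r"components\.([^.]+)\.path", key)
        if comp:
            return component_paths[comp.group(1)]
        return match.group(0)  # leave unrecognized tokens as-is
    return os.path.expanduser(re.sub(r"\$\{([^}]+)\}", sub, path))
```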
Output shape (cross-rig)
{
"comparison": "cross_rig",
"passed": true,
"component": "studio",
"exit_code": 0,
"iterations": 10,
"rigs": [
{
"rig_id": "studio-trunk",
"passed": true,
"status": "passed",
"exit_code": 0,
"artifacts": [
{
"scenario_id": "agent_boot",
"run_index": 0,
"name": "raw_result",
"path": "bench-artifacts/agent_boot/run-0/raw-result.json",
"kind": "json",
"label": "Raw result"
}
],
"results": { ... },
"rig_state": { "rig_id": "studio-trunk", "captured_at": "...", "components": { ... } }
},
{
"rig_id": "studio-combined-fixes",
"passed": true,
"status": "passed",
"exit_code": 0,
"results": { ... },
"rig_state": { ... }
}
],
"diff": {
"by_scenario": {
"agent_boot": {
"p95_ms": {
"studio-combined-fixes": {
"reference": 31200.0,
"current": 19400.0,
"delta_percent": -37.82
}
}
}
}
},
"hints": [ ... ]
}
The reference rig is omitted from the inner diff.by_scenario.<id>.<metric>
map — its delta against itself would always be zero. A scenario or
metric missing from a non-reference rig is silently skipped (no
synthetic zeros).
Each rig entry also includes an artifacts index when workloads emit
artifact pointers. The full-fidelity data remains nested under
results.scenarios[].artifacts and results.scenarios[].runs[].artifacts,
but the index makes proof artifacts easy to find in cross-rig output.
run_index is zero-based and omitted for scenario-level artifacts that
are not tied to a specific --runs iteration.
Exit code
exit_code is 0 only when every rig passed. The first non-zero rig
exit code wins. passed is true only when every rig passed.
Baseline Ratchet Semantics
The bench baseline is a list of per-scenario snapshots stored in
homeboy.json under the baselines.bench key. Each snapshot records
{ id, metrics } plus the iteration count at capture time.
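A plausible homeboy.json fragment for the stored baseline (the exact key holding the iteration count is an assumption; the doc guarantees only id, metrics, and the count):

```json
{
  "baselines": {
    "bench": [
      {
        "id": "hot_path",
        "iterations": 50,
        "metrics": { "p95_ms": 145.0, "error_rate": 0.0 }
      }
    ]
  }
}
```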
On every run without --baseline or --ignore-baseline:
- Each current scenario is matched against the baseline by id.
- If the runner declares metric_policies, only those metrics are compared. Each policy declares whether lower or higher values are better and optional percent/absolute tolerances.
- If a policy declares variance_aware: true, Homeboy compares the metric’s raw sample distributions instead of only the summary value. The summary value still appears under metrics.<name> for reports; the per-iteration samples live under metrics.distributions.<name>.
- If the runner omits metric_policies, Homeboy keeps the historical default: compare p95_ms as lower-is-better with the CLI threshold.
- A scenario improves when any compared metric moves in the better direction.
- Scenarios present in one run but not the other are flagged as new_scenario_ids/removed_scenario_ids. Neither state triggers a regression by itself — they’re informational.
- If any scenario regressed, the command exits 1 regardless of the runner’s own exit code.
- If any scenario improved and --ratchet is set, the baseline is overwritten with the current snapshot.
p95 remains the default for legacy latency benchmarks because it is less
sensitive than mean to one-off GC pauses but more sensitive than p99 to
genuine regressions. Runners that care about non-latency signals should
declare metric_policies instead.
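The legacy comparison loop above can be sketched as follows (function name and tuple return are illustrative; the match-by-id, 5% p95 threshold, and new/removed bookkeeping follow the rules just described):

```python
def compare_to_baseline(current, baseline, threshold_percent=5.0):
    """Match scenarios by id and apply the legacy p95 policy: regress
    when current p95_ms > baseline p95_ms * (1 + threshold/100).
    Returns (regressed_ids, improved_ids, new_ids, removed_ids)."""
    cur = {s["id"]: s["metrics"]["p95_ms"] for s in current}
    base = {s["id"]: s["metrics"]["p95_ms"] for s in baseline}
    shared = cur.keys() & base.keys()
    limit = 1 + threshold_percent / 100
    regressed = sorted(i for i in shared if cur[i] > base[i] * limit)
    improved = sorted(i for i in shared if cur[i] < base[i])
    new_ids = sorted(cur.keys() - base.keys())
    removed_ids = sorted(base.keys() - cur.keys())
    return regressed, improved, new_ids, removed_ids
```

Any non-empty regressed list would drive the exit-1 behavior; a non-empty improved list with --ratchet would trigger the baseline rewrite.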
Runner Contract
The extension’s bench script must:
- Read $HOMEBOY_BENCH_ITERATIONS to determine iteration count.
- Write its JSON output to $HOMEBOY_BENCH_RESULTS_FILE.
- Exit with a non-zero status only on runner-level failure (script error, workload crash) — regressions are homeboy’s domain.
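A minimal runner honoring this contract might look like the sketch below (a single hard-coded scenario; real runners discover workloads and emit richer metrics):

```python
import json
import os
import statistics
import time

def run_bench(workload, results_file=None, iterations=None):
    """Time `workload` N times per $HOMEBOY_BENCH_ITERATIONS and write
    a results envelope to $HOMEBOY_BENCH_RESULTS_FILE."""
    iterations = iterations or int(os.environ.get("HOMEBOY_BENCH_ITERATIONS", "10"))
    results_file = results_file or os.environ["HOMEBOY_BENCH_RESULTS_FILE"]
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    envelope = {
        "component_id": os.environ.get("HOMEBOY_COMPONENT_ID", "unknown"),
        "iterations": iterations,
        "scenarios": [{
            "id": "default",
            "iterations": iterations,
            "metrics": {
                "mean_ms": statistics.fmean(samples),
                # nearest-rank p95 over the sorted samples
                "p95_ms": samples[min(len(samples) - 1, int(len(samples) * 0.95))],
                "min_ms": samples[0],
                "max_ms": samples[-1],
            },
        }],
    }
    with open(results_file, "w") as fh:
        json.dump(envelope, fh)
    return envelope
```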
JSON output schema
{
"component_id": "string",
"iterations": 10,
"metric_policies": {
"error_rate": {
"direction": "lower_is_better",
"regression_threshold_absolute": 0.01
},
"requests_per_second": {
"direction": "higher_is_better",
"regression_threshold_percent": 5.0
},
"agent_loop_ms": {
"direction": "lower_is_better",
"regression_threshold_percent": 10.0,
"variance_aware": true,
"min_iterations_for_variance": 20,
"regression_test": "mann_whitney_u"
}
},
"scenarios": [
{
"id": "scenario_slug",
"file": "tests/bench/some-workload.ext",
"iterations": 10,
"metrics": {
"mean_ms": 120.3,
"p50_ms": 118.0,
"p95_ms": 145.0,
"p99_ms": 160.0,
"min_ms": 110.0,
"max_ms": 172.0,
"error_rate": 0.0,
"requests_per_second": 180.5,
"status_500_count": 0,
"agent_loop_ms": 1200.0,
"distributions": {
"agent_loop_ms": [1100.0, 1200.0, 1300.0]
}
},
"memory": { "peak_bytes": 41943040 },
"artifacts": {
"raw_result": {
"path": "bench-artifacts/scenario_slug/raw-result.json",
"kind": "json",
"label": "Raw result"
}
}
}
]
}
- Top-level keys are strict — unknown top-level fields are rejected to keep the contract honest.
- metrics is an arbitrary map of numeric values. Homeboy core does not attach domain meaning to metric names.
- metric_policies is optional. If omitted, Homeboy compares p95_ms using the legacy lower-is-better latency policy.
- Policy direction accepts lower_is_better/lower and higher_is_better/higher.
- Policy thresholds are optional. regression_threshold_percent compares relative movement; regression_threshold_absolute compares raw numeric movement. If both are present, a metric must exceed both tolerances to regress.
- Policy variance_aware: true requires a matching metrics.distributions.<metric> array on every scenario that emits the metric. If min_iterations_for_variance is set and the sample array is smaller, parsing fails before baseline comparison.
- Policy regression_test accepts point_delta, mann_whitney_u, and kolmogorov_smirnov. point_delta is the legacy summary-value check. Variance-aware metrics default to mann_whitney_u when the field is omitted. Mann-Whitney uses a one-sided 95% normal approximation; Kolmogorov-Smirnov uses the standard 5% two-sample critical value.
- Scenario-level unknown keys are tolerated, so extensions can emit additional metadata (tags, environment info, warmup counts) without breaking parsing.
- Scenario id values must be unique within one bench results envelope. Workload-discovering runners should derive ids from paths relative to the bench root (for example, reads/heavy.php → reads-heavy) instead of file basenames alone.
- memory is optional. Extensions that can’t measure peak memory omit it.
- file is optional but recommended for diagnostics.
- artifacts is optional. Values are local paths plus optional kind and label metadata. Homeboy preserves and indexes these pointers but does not upload, retain, or diff artifact contents.
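The one-sided Mann-Whitney check over two sample distributions can be sketched as below (an illustrative implementation of the standard normal approximation, z > 1.645 at ~95% one-sided confidence; Homeboy's exact tie handling is not specified here and is an assumption):

```python
import math

def mann_whitney_regressed(baseline, current, alpha_z=1.645):
    """Flag a regression for a lower-is-better metric when current
    samples are stochastically larger (slower) than baseline at ~95%
    one-sided confidence, via the normal approximation of U."""
    n1, n2 = len(baseline), len(current)
    # U counts baseline-beats-current pairs, with ties worth 0.5.
    u = sum(
        1.0 if b < c else 0.5 if b == c else 0.0
        for b in baseline
        for c in current
    )
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return z > alpha_z
```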
Environment variables injected
Bench scripts receive the standard runner contract plus bench-specific variables:
- HOMEBOY_BENCH_RESULTS_FILE — where to write JSON output.
- HOMEBOY_BENCH_ITERATIONS — iteration count to use.
- HOMEBOY_RUN_DIR — per-run directory (shared with test/lint/build).
- HOMEBOY_EXTENSION_ID, HOMEBOY_COMPONENT_ID, HOMEBOY_COMPONENT_PATH, and the usual execution-context vars.
- HOMEBOY_SETTINGS_JSON — component settings as JSON.
Component Requirements
For a component to be benchmarkable, it must have:
- A linked extension whose manifest declares a bench capability.
- A bench-runner script provided by the extension.
Extension manifest:
{
"bench": {
"extension_script": "scripts/bench/bench-runner.sh"
}
}
Exit Codes
- 0 — All scenarios passed, no regressions detected (or no baseline exists yet).
- 1 — At least one scenario regressed beyond the threshold, or the runner itself failed.
- Other non-zero — Runner exit code passthrough (extension-specific).