Developer Guide
This page is a developer-oriented map of the codebase. It tells you which file owns which responsibility, the conventions to follow when extending the system, and where to look for examples.
For a complete component map and line-by-line module descriptions, see System Architecture.
1. Code layout
src/aobench/
├── cli/              Typer app (`aobench` console script)
├── schemas/          Pydantic v2 data models (no logic)
├── loaders/          Stateless task / env loading
├── tasks/            Task discovery, dataset splits, RAG context, Lite selection
├── validation/       T1–T10 validity-gate orchestrator
├── environment/      Snapshot validator + tool-registry factory
├── tools/            Mock HPC tool families + tool catalog
├── adapters/         Agent backends: direct_qa / openai / anthropic / mcp
├── runners/          BenchmarkRunner, TraceWriter, ExecutionContext
├── scorers/          12 scorers across 6 dimensions
├── scoring/          CuP gating + advanced scoring helpers
├── reports/          JSON, HTML, slice, CLEAR scorecard reports
├── exporters/        Optional observability (Langfuse)
├── leaderboard/      FastAPI submission service + DB models
├── reproducibility/  Artifact locking, paper-table targets, determinism check
├── judge/            LLM-judge runner + rubric loading
├── error_taxonomy/   HPC error classification
├── gym/              Gym-compatible wrapper (partial; see future-work plan §C3)
├── taxonomy/         24-leaf TRAIL-adapted error taxonomy YAML
└── utils/            Logging, ID generation, cost estimation
For each module, System Architecture §2 lists the public classes and functions. Use that as a quick reference.
2. The eight CLI sub-commands
The aobench console script is built with Typer. Its sub-commands are:
| Command | Module | Purpose |
|---|---|---|
| `aobench validate benchmark` | cli/validate_cmd.py | Lint every task spec and snapshot bundle |
| `aobench run task` / `run all` | cli/run_cmd.py | Execute one task or the dev split |
| `aobench lite select` | cli/lite_cmd.py | Build the AOBench-Lite manifest |
| `aobench report json/html/slices` | cli/report_cmd.py | Render reports from a run directory |
| `aobench compare RUN_A RUN_B` | cli/compare_cmd.py | Diff two runs |
| `aobench robustness task/all` | cli/robustness_cmd.py | Compute pass^k |
| `aobench clear run` | cli/clear_cmd.py | Compute the CLEAR scorecard |
| `aobench leaderboard` | cli/leaderboard_cmd.py | Start the FastAPI submission service |
The full reference, with every flag, is in COMMANDS.md.
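This page does not define how `aobench robustness` estimates pass^k. As a rough illustration only, the sketch below uses the estimator commonly associated with the metric: with `n` recorded trials of a task and `c` successes, the chance that `k` trials drawn without replacement all succeed is C(c, k) / C(n, k). The function name and arguments are hypothetical, and the estimator itself is an assumption about what the command computes.

```python
from math import comb


def pass_hat_k(n_trials: int, n_success: int, k: int) -> float:
    """Estimate pass^k for one task: probability that k sampled trials all pass.

    Assumption: this is the combinatorial estimator C(c, k) / C(n, k); the
    aobench implementation may differ.
    """
    if not 0 < k <= n_trials:
        raise ValueError("k must satisfy 0 < k <= n_trials")
    if not 0 <= n_success <= n_trials:
        raise ValueError("n_success must lie in [0, n_trials]")
    # math.comb returns 0 when k > n_success, so the estimate drops to 0.0.
    return comb(n_success, k) / comb(n_trials, k)


# 3 successes out of 4 trials: pass^1 is the plain success rate,
# pass^2 is stricter because both sampled trials must succeed.
rate_k1 = pass_hat_k(4, 3, 1)
rate_k2 = pass_hat_k(4, 3, 2)
print(rate_k1, rate_k2)  # 0.75 0.5
```

Under this estimator the value can only stay flat or fall as `k` grows, which is why it is read as a consistency measure and not as a best-of-k score.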
3. Adding a new adapter
- Subclass `BaseAdapter` in `adapters/base.py`. Implement `run(task, tools, ctx) -> Trace`.
- Inside `run`, call `tools.dispatch(tool_name, args)` for each tool the agent invokes. The registry handles RBAC filtering and propagates `permission_denied` observations.
- Append every step to the `TraceWriter` provided in the `ExecutionContext`. Finalise to obtain the `Trace`.
- Register the adapter in the dispatch table in `cli/run_cmd.py` (`_ADAPTER_REGISTRY`). The CLI accepts `--adapter <name>:<model>`.
- Add unit tests under `tests/unit/test_<name>_adapter.py`. Use the `mock_openai` / `mock_anthropic` fixtures as templates.
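The adapter steps above can be sketched end to end. Everything in this block is a simplified stand-in written for illustration: the real `BaseAdapter`, `TraceWriter`, `ExecutionContext`, and registry classes live in the package and may have different constructors and method names (`append`, `finalise`, and the `trace_writer` attribute are assumptions). `EchoAdapter` is a hypothetical adapter that replays a fixed tool plan instead of calling a model.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Trace:
    task_id: str
    steps: list[dict[str, Any]]


class TraceWriter:
    """Stand-in writer: collects steps and finalises them into a Trace."""

    def __init__(self, task_id: str) -> None:
        self._task_id = task_id
        self._steps: list[dict[str, Any]] = []

    def append(self, step: dict[str, Any]) -> None:
        self._steps.append(step)

    def finalise(self) -> Trace:
        return Trace(task_id=self._task_id, steps=list(self._steps))


@dataclass
class ExecutionContext:
    trace_writer: TraceWriter


class ToolRegistry:
    """Stand-in registry: the permission check happens here, not in the adapter."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]) -> None:
        self._tools = tools
        self._allowed = allowed

    def dispatch(self, tool_name: str, args: dict[str, Any]) -> dict[str, Any]:
        if tool_name not in self._allowed:
            return {"status": "permission_denied", "tool": tool_name}
        return {"status": "ok", "result": self._tools[tool_name](**args)}


class BaseAdapter:
    def run(self, task: dict[str, Any], tools: ToolRegistry, ctx: ExecutionContext) -> Trace:
        raise NotImplementedError


class EchoAdapter(BaseAdapter):
    """Hypothetical adapter that replays a fixed list of (tool, args) calls."""

    def __init__(self, plan: list[tuple[str, dict[str, Any]]]) -> None:
        self._plan = plan

    def run(self, task: dict[str, Any], tools: ToolRegistry, ctx: ExecutionContext) -> Trace:
        for tool_name, args in self._plan:
            observation = tools.dispatch(tool_name, args)
            # Record every step, including permission_denied observations.
            ctx.trace_writer.append(
                {"tool": tool_name, "args": args, "observation": observation}
            )
        return ctx.trace_writer.finalise()


# Illustrative dispatch table, mirroring the role of _ADAPTER_REGISTRY.
_ADAPTER_REGISTRY = {"echo": EchoAdapter}

registry = ToolRegistry(
    {"queue_status": lambda partition: f"{partition}: 3 jobs"},
    allowed={"queue_status"},
)
adapter = _ADAPTER_REGISTRY["echo"](
    [("queue_status", {"partition": "gpu"}), ("cancel_job", {"job_id": 7})]
)
trace = adapter.run({"task_id": "demo"}, registry, ExecutionContext(TraceWriter("demo")))
print([step["observation"]["status"] for step in trace.steps])
```

The adapter never checks permissions itself; it records whatever observation the registry returns, so a denied call still shows up in the trace.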
The four shipped adapters cover the typical interaction patterns:
| Adapter | Shape | Reference |
|---|---|---|
| `direct_qa` | Zero-tool baseline; returns the QA answer for the task ID | adapters/direct_qa_adapter.py |
| `openai` | OpenAI / Azure function-calling loop | adapters/openai_adapter.py |
| `anthropic` | Claude native tool_use blocks | adapters/anthropic_adapter.py |
| `mcp` | Model Context Protocol over stdio + SSE | adapters/mcp_adapter.py |
4. Adding a new mock tool
- Subclass `BaseTool` in `tools/base.py`. Implement each public method as a pure function over the `EnvBundle`.
- Register the tool's methods in `benchmark/configs/hpc_tool_catalog.yaml`, including `description`, `parameters`, `role_visibility`, and any `dangerous_args` conditions.
- Update `environment/snapshot_loader.build_tool_registry` if the tool needs a new env-bundle field.
- Add a sample invocation under each affected `env_NN/` bundle.
- Add tests under `tests/unit/test_<tool>_tool.py`.
RBAC is enforced inside the registry, not inside individual tools: keep business logic in the tool and permission logic in tools/registry.py.
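A minimal sketch of that split between tool logic and permission logic, under stated assumptions: the `EnvBundle` fields, the `JobTool` family, the `job.list_jobs` method name, the role names, and the in-memory `CATALOG` dict are all hypothetical. The dict only borrows the `description`, `parameters`, and `role_visibility` keys named above; the real YAML layout may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EnvBundle:
    """Stand-in snapshot: a fixed tuple of job records."""
    jobs: tuple[dict[str, Any], ...]


class BaseTool:
    def __init__(self, env: EnvBundle) -> None:
        self.env = env


class JobTool(BaseTool):
    """Hypothetical tool family: pure reads over the bundle, no permission checks."""

    def list_jobs(self, user: str) -> list[dict[str, Any]]:
        return [job for job in self.env.jobs if job["user"] == user]


# Illustrative catalog entry; the values are made up for this example.
CATALOG = {
    "job.list_jobs": {
        "description": "List jobs owned by a user",
        "parameters": {"user": "string"},
        "role_visibility": ["sysadmin", "scientific_user"],
    }
}


class ToolRegistry:
    """Stand-in registry: role filtering lives here, outside the tool."""

    def __init__(
        self,
        catalog: dict[str, dict[str, Any]],
        methods: dict[str, Callable[..., Any]],
        role: str,
    ) -> None:
        self._catalog = catalog
        self._methods = methods
        self._role = role

    def dispatch(self, tool_name: str, args: dict[str, Any]) -> dict[str, Any]:
        entry = self._catalog[tool_name]
        if self._role not in entry["role_visibility"]:
            return {"status": "permission_denied", "tool": tool_name}
        return {"status": "ok", "result": self._methods[tool_name](**args)}


env = EnvBundle(jobs=({"id": 1, "user": "alice"}, {"id": 2, "user": "bob"}))
methods = {"job.list_jobs": JobTool(env).list_jobs}

allowed = ToolRegistry(CATALOG, methods, role="sysadmin").dispatch(
    "job.list_jobs", {"user": "alice"}
)
denied = ToolRegistry(CATALOG, methods, role="guest").dispatch(
    "job.list_jobs", {"user": "alice"}
)
print(allowed["status"], denied["status"])  # ok permission_denied
```

Because `JobTool.list_jobs` has no notion of roles, it can be unit-tested against a bundle alone, while the permission path is tested once at the registry.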
5. Adding a new scorer
- Subclass `BaseScorer` in `scorers/base.py`. Implement `score(task, trace) -> ScorerOutput`.
- Wire the scorer into `scorers/aggregate.py` if it should contribute to the aggregate score. Update `scoring_profiles.yaml` weights as needed.
- If the scorer can hard-fail, populate `scorer_output.hard_fail_reason`; the aggregate scorer will force `aggregate_score = 0.0`.
- Add tests under `tests/unit/test_<scorer>_scorer.py` (deterministic) and `tests/integration/test_aggregate_scorer.py` (end-to-end).
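The hard-fail rule can be illustrated with a small stand-in. The `ScorerOutput` fields other than `hard_fail_reason`, the `ForbiddenToolScorer` class, the `forbidden_tools` task key, and the weighted-mean aggregation are assumptions made for this sketch; only the behaviour "any hard fail forces the aggregate to 0.0" comes from the steps above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ScorerOutput:
    name: str
    score: float
    hard_fail_reason: Optional[str] = None


class BaseScorer:
    name = "base"

    def score(self, task: dict[str, Any], trace: list[dict[str, Any]]) -> ScorerOutput:
        raise NotImplementedError


class ForbiddenToolScorer(BaseScorer):
    """Hypothetical scorer: hard-fails when the trace uses a forbidden tool."""

    name = "forbidden_tool"

    def score(self, task: dict[str, Any], trace: list[dict[str, Any]]) -> ScorerOutput:
        forbidden = set(task.get("forbidden_tools", []))
        used = [step["tool"] for step in trace if step["tool"] in forbidden]
        if used:
            return ScorerOutput(
                self.name, 0.0, hard_fail_reason=f"forbidden tool used: {used[0]}"
            )
        return ScorerOutput(self.name, 1.0)


def aggregate(outputs: list[ScorerOutput], weights: dict[str, float]) -> float:
    """Weighted mean of scores (an assumption), forced to 0.0 on any hard fail."""
    if any(output.hard_fail_reason for output in outputs):
        return 0.0
    total_weight = sum(weights[output.name] for output in outputs)
    return sum(weights[output.name] * output.score for output in outputs) / total_weight


task = {"forbidden_tools": ["cancel_job"]}
weights = {"forbidden_tool": 1.0, "outcome": 3.0}
outcome = ScorerOutput("outcome", 0.5)

clean = ForbiddenToolScorer().score(task, [{"tool": "queue_status"}])
failed = ForbiddenToolScorer().score(task, [{"tool": "cancel_job"}])

print(aggregate([clean, outcome], weights))   # 0.625
print(aggregate([failed, outcome], weights))  # 0.0
```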
The 12 implemented scorers and their wiring status are listed in System Architecture §7. Notes:
- `WorfEvalScorer` (`workflow_scorer.py`) is implemented but not yet wired into `AggregateScorer`. Wiring is tracked in the future-work plan (§A3).
- `CheckpointScorer` is wired conditionally (only when `task.checkpoints` is set). It contributes via `S_partial = 0.5 * (passed/total) + 0.5 * S_full`.
- `RobustnessScorer` is invoked only via the `aobench robustness` CLI; it is not part of the per-task aggregate.
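The `CheckpointScorer` formula in the notes above is easy to check by hand. The sketch below is a direct transcription of `S_partial = 0.5 * (passed/total) + 0.5 * S_full`; the function name and the guard on `total` are additions for this example, and `S_full` is treated as an input because this page does not say how it is computed.

```python
def partial_score(passed: int, total: int, s_full: float) -> float:
    """S_partial = 0.5 * (passed / total) + 0.5 * S_full."""
    if total <= 0:
        raise ValueError("total must be positive; the formula needs checkpoints")
    return 0.5 * (passed / total) + 0.5 * s_full


# 3 of 4 checkpoints passed and full credit on S_full.
print(partial_score(3, 4, 1.0))  # 0.875
# 1 of 4 checkpoints passed and no credit on S_full.
print(partial_score(1, 4, 0.0))  # 0.125
```

With equal weights, checkpoint progress alone can contribute at most 0.5 to `S_partial`.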
6. Authoring a new task
End-to-end task authoring is described in Taxonomy §5 (Task Metadata Schema). The minimum acceptance criterion is that aobench validate benchmark succeeds with the new task; that command runs the T1–T10 validity gates (covered in src/aobench/validation/).
For programmatic task generation, see scripts/create_task_stubs.py and the guideline files under benchmark/tasks/guidelines/.
The interactive aobench task create helper is part of the v0.2 plan (see future-work plan §B1, task_authoring_spec).
7. Configuration files
| File | Purpose |
|---|---|
| `pyproject.toml` | Package metadata, dependencies, console-script entry. Optional extras: [openai], [anthropic], [mcp], [langfuse]. |
| `Makefile` | 40+ targets: test, lint, typecheck, validate, paper-table*, rubric-validate-all, repro-table-1/2, langfuse-up/down, leaderboard-serve. |
| `.env.example` | Template for OPENAI_API_KEY, ANTHROPIC_API_KEY, LANGFUSE_*, leaderboard DB URL. |
| `Dockerfile` | Reproducible runtime image. |
| `benchmark/configs/scoring_profiles.yaml` | Named weight profiles. |
| `benchmark/configs/hpc_tool_catalog.yaml` | All 16 tool methods, role visibility, dangerous-arg conditions. |
| `benchmark/configs/error_taxonomy.yaml` | 14 score-based error categories. |
| `benchmark/environments/env_NN/rbac_policy.yaml` | Per-env, per-role permissions (schema v1.1). |
8. Tests
tests/ is split into tests/unit/ (per-module) and tests/integration/ (end-to-end pipelines).
| Layer | Files | Examples |
|---|---|---|
| Unit | 30 | test_outcome_scorer.py, test_governance_scorer.py, test_snapshot_validator.py, test_task_loader.py |
| Integration | 6 | test_runner_e2e.py, test_clear_pipeline.py, test_checkpoint_pipeline.py |
Run them via:
make test # unit only, fast
make test-cov # unit + integration with coverage
make validate # validity gates over benchmark data
CI in .github/workflows/ci.yml runs lint, typecheck, test, and validate on Python 3.12, 3.13, and 3.14.
9. Conventions
- Pydantic v2 only. Schemas live in `schemas/`; never inline ad-hoc dicts.
- `BaseExporter` for observability. Avoid coupling the runner to a specific backend.
- Logging via `utils/logging.get_logger(__name__)`. Do not call `print` from library code.
- Cost estimation via `utils/cost.estimate_cost`. Token counts come from adapter responses; do not infer.
- Trace IDs come from `utils/ids.make_trace_id` and `make_run_id`. They are stable across re-runs of the same task in the same run.
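The stability property in the last convention can be pictured as deriving the ID from its inputs instead of generating it randomly. This is only one way to get that behaviour: `stable_trace_id`, its arguments, and the hash-based scheme are hypothetical, and the real `make_trace_id` may use a different signature and derivation.

```python
import hashlib


def stable_trace_id(run_id: str, task_id: str) -> str:
    """Derive a trace ID that repeats for the same (run_id, task_id) pair.

    Assumption: a truncated SHA-256 digest is used purely for illustration.
    """
    digest = hashlib.sha256(f"{run_id}:{task_id}".encode("utf-8")).hexdigest()
    return f"trace-{digest[:16]}"


first = stable_trace_id("run-001", "task-042")
again = stable_trace_id("run-001", "task-042")
other = stable_trace_id("run-002", "task-042")

print(first == again)  # True: same task, same run
print(first == other)  # False: same task, different run
```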
10. Reference
| Topic | Page |
|---|---|
| Implemented architecture | system-architecture.md |
| CLI reference | COMMANDS.md |
| Evaluation protocol | evaluation.md |
| Scorer reference | scoring-dimensions.md |
| Adapters & tools (plain English) | adapters-and-tools.md |
| Architecture diagrams | system-architecture.md §12 |
| How to Contribute | contributing.md |