Skip to content

Developer Guide

This page is a developer-oriented map of the codebase. It tells you which file owns which responsibility, the conventions to follow when extending the system, and where to look for examples.

For a complete component map and line-by-line module descriptions, see System Architecture.


1. Code layout

src/aobench/
โ”œโ”€โ”€ cli/             Typer app โ€” `aobench` console script
โ”œโ”€โ”€ schemas/         Pydantic v2 data models (no logic)
โ”œโ”€โ”€ loaders/         Stateless task / env loading
โ”œโ”€โ”€ tasks/           Task discovery, dataset splits, RAG context, Lite selection
โ”œโ”€โ”€ validation/      T1โ€“T10 validity-gate orchestrator
โ”œโ”€โ”€ environment/     Snapshot validator + tool-registry factory
โ”œโ”€โ”€ tools/           Mock HPC tool families + tool catalog
โ”œโ”€โ”€ adapters/        Agent backends: direct_qa / openai / anthropic / mcp
โ”œโ”€โ”€ runners/         BenchmarkRunner, TraceWriter, ExecutionContext
โ”œโ”€โ”€ scorers/         12 scorers across 6 dimensions
โ”œโ”€โ”€ scoring/         CuP gating + advanced scoring helpers
โ”œโ”€โ”€ reports/         JSON, HTML, slice, CLEAR scorecard reports
โ”œโ”€โ”€ exporters/       Optional observability (Langfuse)
โ”œโ”€โ”€ leaderboard/     FastAPI submission service + DB models
โ”œโ”€โ”€ reproducibility/ Artifact locking, paper-table targets, determinism check
โ”œโ”€โ”€ judge/           LLM-judge runner + rubric loading
โ”œโ”€โ”€ error_taxonomy/  HPC error classification
โ”œโ”€โ”€ gym/             Gym-compatible wrapper (partial โ€” see future-work plan ยงC3)
โ”œโ”€โ”€ taxonomy/        24-leaf TRAIL-adapted error taxonomy YAML
โ””โ”€โ”€ utils/           Logging, ID generation, cost estimation

For each module, System Architecture ยง2 lists the public classes and functions. Use that as a quick reference.


2. The eight CLI sub-commands

The aobench console script is built with Typer. Its sub-commands are:

Command Module Purpose
aobench validate benchmark cli/validate_cmd.py Lint every task spec and snapshot bundle
aobench run task / run all cli/run_cmd.py Execute one task or the dev split
aobench lite select cli/lite_cmd.py Build the AOBench-Lite manifest
aobench report json/html/slices cli/report_cmd.py Render reports from a run directory
aobench compare RUN_A RUN_B cli/compare_cmd.py Diff two runs
aobench robustness task/all cli/robustness_cmd.py Compute pass^k
aobench clear run cli/clear_cmd.py Compute the CLEAR scorecard
aobench leaderboard cli/leaderboard_cmd.py Start the FastAPI submission service

The full reference, with every flag, is in COMMANDS.md.


3. Adding a new adapter

  1. Subclass BaseAdapter in adapters/base.py. Implement run(task, tools, ctx) -> Trace.
  2. Inside run, call tools.dispatch(tool_name, args) for each tool the agent invokes. The registry handles RBAC filtering and propagates permission_denied observations.
  3. Append every step to the TraceWriter provided in the ExecutionContext. Finalise to obtain the Trace.
  4. Register the adapter in the dispatch table in cli/run_cmd.py (_ADAPTER_REGISTRY). The CLI accepts --adapter <name>:<model>.
  5. Add unit tests under tests/unit/test_<name>_adapter.py. Use the mock_openai / mock_anthropic fixtures as templates.

The four shipped adapters cover the typical interaction patterns:

Adapter Shape Reference
direct_qa Zero-tool baseline; returns the QA answer for the task ID adapters/direct_qa_adapter.py
openai OpenAI / Azure function-calling loop adapters/openai_adapter.py
anthropic Claude native tool_use blocks adapters/anthropic_adapter.py
mcp Model Context Protocol over stdio + SSE adapters/mcp_adapter.py

4. Adding a new mock tool

  1. Subclass BaseTool in tools/base.py. Implement each public method as a pure function over the EnvBundle.
  2. Register the tool's methods in benchmark/configs/hpc_tool_catalog.yaml, including description, parameters, role_visibility, and any dangerous_args conditions.
  3. Update environment/snapshot_loader.build_tool_registry if the tool needs a new env-bundle field.
  4. Add a sample invocation under each affected env_NN/ bundle.
  5. Tests under tests/unit/test_<tool>_tool.py.

RBAC is enforced inside the registry, not inside individual tools โ€” keep business logic in the tool, permission logic in tools/registry.py.


5. Adding a new scorer

  1. Subclass BaseScorer in scorers/base.py. Implement score(task, trace) -> ScorerOutput.
  2. Wire the scorer into scorers/aggregate.py if it should contribute to the aggregate score. Update scoring_profiles.yaml weights as needed.
  3. If the scorer can hard-fail, populate scorer_output.hard_fail_reason; the aggregate scorer will force aggregate_score = 0.0.
  4. Tests under tests/unit/test_<scorer>_scorer.py (deterministic) and tests/integration/test_aggregate_scorer.py (end-to-end).

The 12 implemented scorers and their wiring status are listed in System Architecture ยง7. Notes:

  • WorfEvalScorer (workflow_scorer.py) is implemented but not yet wired into AggregateScorer. Wiring is tracked in the future-work plan (ยงA3).
  • CheckpointScorer is wired conditionally (only when task.checkpoints is set). It contributes via S_partial = 0.5 * (passed/total) + 0.5 * S_full.
  • RobustnessScorer is invoked only via the aobench robustness CLI; it is not part of the per-task aggregate.

6. Authoring a new task

End-to-end task authoring is described in Taxonomy ยง5 (Task Metadata Schema). The minimum acceptance criterion is that aobench validate benchmark succeeds with the new task: that runs T1โ€“T10 validity gates (covered in src/aobench/validation/).

For programmatic task generation, see scripts/create_task_stubs.py and the guideline files under benchmark/tasks/guidelines/.

The interactive aobench task create helper is part of the v0.2 plan (see future-work plan ยงB1, task_authoring_spec).


7. Configuration files

File Purpose
pyproject.toml Package metadata, dependencies, console-script entry. Optional extras: [openai], [anthropic], [mcp], [langfuse].
Makefile 40+ targets: test, lint, typecheck, validate, paper-table*, rubric-validate-all, repro-table-1/2, langfuse-up/down, leaderboard-serve.
.env.example Template for OPENAI_API_KEY, ANTHROPIC_API_KEY, LANGFUSE_*, leaderboard DB URL.
Dockerfile Reproducible runtime image.
benchmark/configs/scoring_profiles.yaml Named weight profiles.
benchmark/configs/hpc_tool_catalog.yaml All 16 tool methods, role visibility, dangerous-arg conditions.
benchmark/configs/error_taxonomy.yaml 14 score-based error categories.
benchmark/environments/env_NN/rbac_policy.yaml Per-env, per-role permissions (schema v1.1).

8. Tests

tests/ is split into tests/unit/ (per-module) and tests/integration/ (end-to-end pipelines).

Layer Files Examples
Unit 30 test_outcome_scorer.py, test_governance_scorer.py, test_snapshot_validator.py, test_task_loader.py
Integration 6 test_runner_e2e.py, test_clear_pipeline.py, test_checkpoint_pipeline.py

Run them via:

make test         # unit only, fast
make test-cov     # unit + integration with coverage
make validate     # validity gates over benchmark data

CI in .github/workflows/ci.yml runs lint, typecheck, test, and validate on Python 3.12, 3.13, and 3.14.


9. Conventions

  • Pydantic v2 only. Schemas live in schemas/; never inline ad-hoc dicts.
  • BaseExporter for observability. Avoid coupling the runner to a specific backend.
  • Logging via utils/logging.get_logger(__name__). Do not call print from library code.
  • Cost estimation via utils/cost.estimate_cost. Token counts come from adapter responses; do not infer.
  • Trace IDs come from utils/ids.make_trace_id and make_run_id. They are stable across re-runs of the same task in the same run.

10. Reference

Topic Page
Implemented architecture system-architecture.md
CLI reference COMMANDS.md
Evaluation protocol evaluation.md
Scorer reference scoring-dimensions.md
Adapters & tools (plain English) adapters-and-tools.md
Architecture diagrams system-architecture.md ยง12
How to Contribute contributing.md