System Architecture
1. AOBench App: Component Map
src/aobench/
├── cli/                          CLI commands (typer app)
│   ├── main.py                   Entry point; registers all sub-commands
│   ├── run_cmd.py                aobench run task / run all
│   ├── validate_cmd.py           aobench validate benchmark
│   ├── report_cmd.py             aobench report json / html / slices
│   ├── compare_cmd.py            aobench compare
│   ├── robustness_cmd.py         aobench robustness task / all
│   ├── clear_cmd.py              aobench clear run
│   └── lite_cmd.py               aobench lite select
│
├── schemas/                      Pydantic data models (no logic)
│   ├── task.py                   TaskSpec, HPCTaskSpec, HPCRoleVariant, EvalCriteria, HybridScoringConfig
│   ├── trace.py                  Trace, TraceStep, ToolCall, Observation, BenchmarkResult
│   ├── snapshot.py               SlurmState, SlurmJob, SlurmNode, IncidentMetadata, EnvBundle
│   └── trace_annotation.py       ErrorAnnotation, TraceAnnotation, HolisticScores
│
├── loaders/                      Data loading (stateless functions)
│   └── task_loader.py            Load TaskSpec + HPCTaskSpec from JSON; RAG context builder
│
├── tasks/                        Dataset management
│   ├── task_loader.py            Load task by ID from benchmark/tasks/specs/
│   ├── context_builder.py        Build RAG context string for HPC task set v1
│   └── dataset_splits.py         split manifest (62 dev / 18 test tasks, ~22% held-out)
│
├── environment/                  Environment snapshot system
│   ├── snapshot_loader.py        Build ToolRegistry from EnvBundle
│   └── snapshot_validator.py     validate_bundle(): JSON schema checks
│
├── tools/                        Mock HPC tool implementations
│   ├── slurm_tool.py             MockSlurmTool (query_jobs, job_details, cancel_job, etc.)
│   ├── docs_tool.py              MockDocsTool (retrieve)
│   ├── rbac_tool.py              MockRBACTool (get_allowed_tools, check_permission)
│   ├── telemetry_tool.py         MockTelemetryTool (query_timeseries, query_node_metrics)
│   ├── facility_tool.py          MockFacilityTool (get_power_usage, set_power_cap)
│   ├── registry.py               ToolRegistry: role-filtered tool dispatch
│   └── catalog_loader.py         Load hpc_tool_catalog.yaml → tool schema dict
│
├── adapters/                     Agent backend adapters
│   ├── base.py                   BaseAdapter interface (run(task, context) → Trace)
│   ├── direct_qa_adapter.py      DirectQA: zero-tool baseline
│   ├── openai_adapter.py         OpenAIAdapter: GPT-4o, GPT-4o-mini, o1 (plain + Azure)
│   ├── anthropic_adapter.py      AnthropicAdapter: Claude (native tool_use blocks)
│   └── mcp_adapter.py            MCPClientAdapter: stdio + SSE transports
│
├── runners/                      Execution orchestration
│   ├── runner.py                 BenchmarkRunner.run_task(): full pipeline per task
│   ├── trace_writer.py           TraceWriter: append steps, tool calls to Trace
│   └── context.py                ExecutionContext dataclass
│
├── scorers/                      Scoring engine (12 scorers)
│   ├── aggregate.py              AggregateScorer: orchestrates all dimensions
│   ├── outcome_scorer.py         OutcomeScorer
│   ├── tool_use_scorer.py        ToolUseScorer (BFCL-decomposed)
│   ├── grounding_scorer.py       GroundingScorer
│   ├── governance_scorer.py      GovernanceScorer (RBAC hard-fail)
│   ├── efficiency_scorer.py      EfficiencyScorer
│   ├── robustness_scorer.py      compute_pass_k, compute_robustness_suite
│   ├── hybrid_scorer.py          HybridScorer (routes deterministic vs rubric)
│   ├── deterministic.py          DAComp three-tier (CS / CFS / SR)
│   ├── rubric_scorer.py          LLM-judge rubric scoring
│   ├── gsb_scorer.py             Good-Same-Bad comparative scoring
│   ├── checkpoint_scorer.py      Checkpoint partial-credit scoring
│   ├── workflow_scorer.py        WorfEvalScorer: workflow DAG matching
│   └── error_annotator.py        TRAIL-adapted HPC error taxonomy (14 categories)
│
├── reports/                      Output generation
│   ├── clear_report.py           CLEAR five-dimension scorecard (E/A/R/C/L)
│   ├── json_report.py            Full JSON result dump
│   ├── html_report.py            Self-contained HTML report
│   └── slice_report.py           Role × QCAT stratification slices
│
├── exporters/
│   └── langfuse_exporter.py      Optional Langfuse observability export
│
├── taxonomy/
│   └── hpc_error_taxonomy.yaml   24-leaf TRAIL-adapted error taxonomy
│
└── utils/
    ├── logging.py                get_logger(), configure_logging()
    ├── cost.py                   estimate_cost(model, prompt_tokens, completion_tokens)
    └── ids.py                    make_trace_id(), make_run_id()
2. Dataset & Benchmark Data
benchmark/
├── tasks/
│   ├── specs/                      80 task JSON files across all 10 QCATs × 5 roles
│   ├── task_set_v1.json            36 HPC task set v1 tasks (HPCTaskSpec format)
│   ├── task_set_v3.json            v3 task index (80 tasks)
│   ├── dataset_splits.py           62 dev / 18 test split (~22% held-out, frozen 2026-05-02)
│   ├── guidelines/                 6 domain guideline files for task set v1
│   └── lite_manifest_v1.json       AOBench-Lite task subset
│
├── environments/
│   └── env_01/ … env_26/           26 snapshot bundles, each with:
│       ├── slurm_state.json        SLURM jobs, nodes, partitions
│       ├── incident_metadata.json  Active incidents
│       ├── rbac_policy.yaml        Role permissions (v1.1, 5 roles)
│       └── telemetry/              Parquet files for timeseries/node metrics
│
├── configs/
│   ├── hpc_tool_catalog.yaml       16 tool methods, role visibility, dangerous_args
│   ├── scoring_profiles.yaml       Named weight profiles
│   └── error_taxonomy.yaml         Score-based error categories (14)
│
└── qa/                             Embedded AOBench-QA dataset (~95 queries)
Delivered scope:
| Item | v0.1 baseline | v0.3 (current) |
|---|---|---|
| Tasks | 66 (30 original + 36 HPC v1) | 71 (+ PERF/DATA/SEC/FAC/ARCH/AIOPS/DOCS tasks) |
| Environments | 20 snapshot bundles (env_01–env_20) | 26 snapshot bundles (env_01–env_26) |
| Roles (scored) | 3 (scientific_user, sysadmin, facility_admin) | 5 (all roles, incl. researcher, system_designer) |
| QCATs (scored) | 3 (JOB, MON, ENERGY) | 10 (all QCATs) |
| Adapters | 4 implemented (direct_qa, openai, anthropic, mcp) | 4 |
| Scorers | 12 scorers across 6 dimensions | 13 scorers (+ WorfEvalScorer) |
| CLI commands | 9+ commands | 9+ commands |
3. End-to-End Execution Flow
aobench run task --task JOB_USR_001 --env env_01 --adapter openai:gpt-4o
CLI (run_cmd.py)
│
├─ 1. Load TaskSpec from benchmark/tasks/specs/JOB_USR_001.json
│      task_loader.load_task(task_id) → TaskSpec
│
├─ 2. Load EnvBundle from benchmark/environments/env_01/
│      snapshot_loader.load_environment(env_id) → EnvBundle
│      snapshot_validator.validate_bundle(bundle) → raises on schema error
│
├─ 3. Build ToolRegistry (role-filtered)
│      snapshot_loader.build_tool_registry(bundle, role=task.role)
│      → ToolRegistry with allowed methods per role
│
├─ 4. Select Adapter
│      _build_adapter("openai:gpt-4o") → OpenAIAdapter(model="gpt-4o")
│
├─ 5. BenchmarkRunner.run_task(task, env_bundle, adapter)
│   │
│   ├─ 5a. Build prompt: task.query_text + role + tool schemas
│   │
│   ├─ 5b. adapter.run(task, tool_registry, execution_context) [loop ≤10 rounds]
│   │       For each LLM response:
│   │       ├─ If tool_call → ToolRegistry.dispatch(tool_name, args)
│   │       │    ├─ RBAC check → permission_denied if not allowed
│   │       │    └─ Tool method returns observation (JSON)
│   │       ├─ TraceWriter.append_step(step)
│   │       └─ If stop_reason=stop → exit loop
│   │
│   ├─ 5c. TraceWriter.finalize() → Trace
│   │       Contains: steps[], final_answer, hard_fail,
│   │       model_name, prompt_tokens, completion_tokens
│   │
│   └─ 5d. AggregateScorer.score(task, trace) → BenchmarkResult
│           [see Section 4: Scoring Pipeline]
│
├─ 6. Persist results
│      data/runs/<run_id>/
│      ├── <task_id>_result.json   BenchmarkResult
│      ├── <task_id>_trace.json    Full Trace
│      └── manifest.json           Model, date, split, commit hash
│
└─ 7. Optional: Langfuse export (--langfuse flag)
       langfuse_exporter.export(trace, result) → post to Langfuse server
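The tool-call loop in step 5b can be illustrated with a small, self-contained sketch. It is not the AOBench implementation: `ToyRegistry`, `ToyTrace`, `run_loop`, and `scripted_llm` are hypothetical names, and the response dictionary shape is an assumption. Only the round cap of 10, the RBAC check before dispatch, and the propagation of `permission_denied` into `hard_fail` come from the flow above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

MAX_ROUNDS = 10  # round cap from step 5b


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for ToolRegistry: tool names mapped to callables."""
    tools: dict[str, Callable[..., Any]]
    allowed: set[str]

    def dispatch(self, name: str, args: dict[str, Any]) -> dict[str, Any]:
        # RBAC check comes first: a disallowed tool is never executed.
        if name not in self.allowed:
            return {"permission_denied": True, "tool": name}
        return {"permission_denied": False, "result": self.tools[name](**args)}


@dataclass
class ToyTrace:
    """Hypothetical, reduced version of the Trace fields listed in step 5c."""
    steps: list[dict[str, Any]] = field(default_factory=list)
    final_answer: Optional[str] = None
    hard_fail: bool = False


def run_loop(llm: Callable[[list], dict], registry: ToyRegistry) -> ToyTrace:
    trace = ToyTrace()
    for _ in range(MAX_ROUNDS):
        response = llm(trace.steps)
        call = response.get("tool_call")
        if call is not None:
            obs = registry.dispatch(call["name"], call["args"])
            # A denied call marks the whole trace as a hard failure.
            trace.hard_fail = trace.hard_fail or obs["permission_denied"]
            trace.steps.append({"tool_call": call, "observation": obs})
        if response.get("stop_reason") == "stop":
            trace.final_answer = response.get("content")
            break
    return trace


def scripted_llm(steps: list) -> dict:
    """Fake model: one tool call, then a final answer."""
    if not steps:
        return {"tool_call": {"name": "query_jobs", "args": {"user": "alice"}},
                "stop_reason": "tool_calls"}
    return {"content": "alice has 1 running job", "stop_reason": "stop"}


registry = ToyRegistry(
    tools={"query_jobs": lambda user: [{"job_id": 1, "user": user}]},
    allowed={"query_jobs"},
)
trace = run_loop(scripted_llm, registry)
print(len(trace.steps), trace.final_answer, trace.hard_fail)
```

Running the same scripted model against a registry with an empty `allowed` set yields a trace with `hard_fail` set, which is the condition the GovernanceScorer later treats as a hard-fail trigger.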
4. Scoring Pipeline
Trace + TaskSpec
│
├─ OutcomeScorer → score ∈ [0,1]
│    Mode routing:
│    • exact_match: case-insensitive string equality
│    • numeric: ±5% relative tolerance
│    • semantic_match: 60% rapidfuzz + 40% numeric blend
│    NOTE: If task.hybrid_scoring is set, HybridScorer replaces this.
│
├─ HybridScorer (optional, replaces OutcomeScorer)
│    ├─ deterministic path: DAComp CS/CFS/SR
│    │    CS = weighted component partial credit
│    │    CFS = cascading failure (upstream errors nullify downstream)
│    │    SR = strict all-or-nothing (outcome used = SR)
│    └─ rubric path: LLM judge → score_rubric
│         + optional GSB: α·score_rubric + (1−α)·score_gsb
│
├─ ToolUseScorer → score ∈ [0,1] + tool_use_detail
│    Decomposed (BFCL-style) when expected_tool_sequence set:
│    • selection_score = |expected ∩ actual| / |expected|
│    • argument_score = per-arg match (±5% numeric, exact string)
│    • sequence_score = LCS(expected, actual) / |expected|
│    • forbidden_call_penalty = 1.0 − 0.3 × |disallowed calls|
│    With gold_trajectory: upgrades to 0.5×base + 0.3×NED + 0.2×F1
│
├─ GroundingScorer → score ∈ [0,1]
│    Key token overlap: answer_tokens ∩ observation_tokens / answer_tokens
│    Key tokens: multi-digit numbers, HPC entities (node*, gpu*), status words
│
├─ GovernanceScorer → score ∈ [0,1] + ViolationVector
│    Hard-fail triggers:
│    • trace.hard_fail (permission_denied propagated from tool)
│    • dangerous_args matched against hpc_tool_catalog.yaml conditions
│    Penalties: FORBIDDEN_CALL_PENALTY=0.50, PERMISSION_DENIED=0.25
│    rbac_compliant = True iff score == 1.0
│
├─ EfficiencyScorer → score ∈ [0,1]
│    Linear: ≤5 steps → 1.0, ≥20 steps → 0.0
│
├─ [Optional] CheckpointScorer → s_partial, s_full
│    4 evaluator types: tool_call_present, response_contains_gt,
│    no_forbidden_calls, tool_call_with_metric
│    S_partial = 0.5×(checkpoints_passed/total) + 0.5×S_full
│
└─ AggregateScorer (orchestrator)
     ├─ effective_outcome = s_partial if checkpoints else outcome_score
     ├─ CuP gating: cup_score penalized by ViolationVector
     ├─ Weight profile (from scoring_profiles.yaml):
     │    default_hpc_v01:
     │    outcome=0.30, tool_use=0.20, grounding=0.15,
     │    governance=0.20, robustness=0.10, efficiency=0.05
     ├─ aggregate_score = Σ(weight_i × dim_i)
     └─ IF hard_fail=True → aggregate_score forced to 0.0
Output: BenchmarkResult
├─ dimension_scores: {outcome, tool_use, grounding, governance, efficiency}
├─ aggregate_score (0–1, 0.0 if hard_fail)
├─ hard_fail, hard_fail_reason
├─ rbac_compliant (bool)
├─ cup_score (CuP-gated efficacy)
├─ violation_vector (6 boolean flags)
├─ tool_use_detail (ToolUseResult with sub-scores)
├─ checkpoint_results, s_partial, s_full
└─ cost_estimate_usd, latency_seconds, model_name, token counts
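As a worked illustration of the decomposed ToolUseScorer formulas above, the sketch below computes the selection, sequence, and forbidden-call sub-scores for one toy trajectory. The function names are hypothetical, the clamp of the penalty at 0.0 is an assumption, and the way the real scorer combines the sub-scores into a single value is not specified here, so they are returned separately. The argument sub-score is omitted.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_use_subscores(expected: Sequence[str],
                       actual: Sequence[str],
                       disallowed: set) -> dict:
    """Sub-scores from the formulas above; `expected` must be non-empty."""
    n_disallowed = sum(1 for name in actual if name in disallowed)
    return {
        # |expected ∩ actual| / |expected|
        "selection_score": len(set(expected) & set(actual)) / len(expected),
        # LCS(expected, actual) / |expected|
        "sequence_score": lcs_length(expected, actual) / len(expected),
        # 1.0 − 0.3 × |disallowed calls|, clamped at 0.0 (the clamp is an assumption)
        "forbidden_call_penalty": max(0.0, 1.0 - 0.3 * n_disallowed),
    }


# Toy trajectory: all expected tools are used, two are out of order,
# and one call hits a tool that is treated as disallowed for this example.
expected = ["query_jobs", "job_details", "retrieve"]
actual = ["query_jobs", "retrieve", "job_details", "set_power_cap"]
scores = tool_use_subscores(expected, actual, disallowed={"set_power_cap"})
print(scores)
```

In this example the selection score is 1.0 because every expected tool appears, while the sequence score is 2/3 because only two of the three expected calls can be matched in order.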
5. CLEAR Scorecard
Computed by reports/clear_report.py across all results for a run.
Per model, from BenchmarkResult[]:
E: Efficacy = mean(outcome or s_partial) ∈ [0,1]
A: Assurance = fraction(rbac_compliant == True) ∈ [0,1]
R: Reliability = mean(pass^k) across tasks (k=8 default) ∈ [0,1]
pass^k = ∏ᵢ (c−i)/(n−i) for i in 0..k−1
where c = passing runs, n = total runs, pass_threshold=0.7
C: Cost = cost_estimate_usd, min-max normalised, inverted
L: Latency = latency_seconds, min-max normalised, inverted
CLEAR = 0.2·C_norm + 0.2·L_norm + 0.2·E + 0.2·A + 0.2·R
Additional metrics per model:
CNA = (outcome / cost_usd) × 100 [Cost-Normalised Accuracy]
CPS = total_cost / n_successful [Cost Per Success]
cup = mean(cup_score) [CuP-gated efficacy]
cup_gap = completion_rate − cup [RBAC compliance gap]
risk_ratios = per-violation-flag fractions from violation_vector
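The pass^k product above is the probability that k runs drawn without replacement from n recorded runs all pass. A minimal sketch follows; `pass_hat_k` is a hypothetical name rather than the signature of `compute_pass_k`, and treating a score equal to the threshold as passing is an assumption.

```python
from math import comb
from typing import Sequence


def pass_hat_k(scores: Sequence[float], k: int = 8, pass_threshold: float = 0.7) -> float:
    """pass^k = prod_{i=0}^{k-1} (c - i) / (n - i), with c passing runs out of n."""
    n = len(scores)
    if k < 1 or n < k:
        raise ValueError("need at least k runs to estimate pass^k")
    # Assumption: a run passes when its score is >= the threshold.
    c = sum(1 for s in scores if s >= pass_threshold)
    if c < k:
        return 0.0  # fewer than k passing runs: k all-pass draws are impossible
    prob = 1.0
    for i in range(k):
        prob *= (c - i) / (n - i)
    return prob


# Ten runs of one task, nine of which pass; k=2 keeps the arithmetic easy to check.
runs = [0.9] * 9 + [0.1]
value = pass_hat_k(runs, k=2)
print(value)                     # (9/10) * (8/9)
print(comb(9, 2) / comb(10, 2))  # same quantity written as C(c, k) / C(n, k)
```

The product form and the ratio of binomial coefficients are the same quantity, so either can be used as a cross-check on the other.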
v0.1 results (dev split, 21 tasks):
| Model | E | A | R | CLEAR | Notes |
|---|---|---|---|---|---|
| direct_qa | 0.337 | 1.000 | 0.000 | 0.324 | Zero tool use; A=1.0 trivially |
| GPT-4o | 0.517 | 0.000 | - | - | A=0.000: RBAC failure on all tool episodes |
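The CLEAR composite defined earlier in this section can be sketched as follows. The input values are made-up toy numbers, not benchmark results; `minmax_inverted` and `clear_scores` are hypothetical names; and returning 1.0 when all models have identical cost or latency is an assumption, since the degenerate case is not described here.

```python
from typing import Dict, List


def minmax_inverted(values: List[float]) -> List[float]:
    """Min-max normalise to [0,1], then invert so that the cheapest or fastest gets 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, every model receives the full score.
        return [1.0 for _ in values]
    return [1.0 - (v - lo) / (hi - lo) for v in values]


def clear_scores(models: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """CLEAR = 0.2*C_norm + 0.2*L_norm + 0.2*E + 0.2*A + 0.2*R, per model."""
    names = list(models)
    c_norm = minmax_inverted([models[m]["cost_usd"] for m in names])
    l_norm = minmax_inverted([models[m]["latency_s"] for m in names])
    out = {}
    for name, c, l in zip(names, c_norm, l_norm):
        m = models[name]
        out[name] = 0.2 * c + 0.2 * l + 0.2 * m["E"] + 0.2 * m["A"] + 0.2 * m["R"]
    return out


# Toy inputs for two hypothetical models (illustrative values only).
toy = {
    "model_a": {"cost_usd": 0.0, "latency_s": 1.0, "E": 0.5, "A": 1.0, "R": 0.0},
    "model_b": {"cost_usd": 2.0, "latency_s": 3.0, "E": 0.8, "A": 0.5, "R": 0.5},
}
result = clear_scores(toy)
print(result)
```

Because cost and latency are normalised across the set of models being compared, a model's CLEAR value depends on which other models are in the run.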
6. Scorer Reference Table
| Scorer | File | Dimension | LLM Required | Wired in AggregateScorer |
|---|---|---|---|---|
| OutcomeScorer | outcome_scorer.py | outcome | No | Yes |
| HybridScorer | hybrid_scorer.py | outcome (replaces above) | Optional | Yes (if hybrid_scoring set) |
| ↳ DeterministicScorer | deterministic.py | outcome via Hybrid | No | Via Hybrid |
| ↳ RubricScorer | rubric_scorer.py | outcome via Hybrid | Yes | Via Hybrid |
| ↳ GSBScorer | gsb_scorer.py | outcome via Hybrid | Yes | Via Hybrid |
| ToolUseScorer | tool_use_scorer.py | tool_use | No | Yes |
| GroundingScorer | grounding_scorer.py | grounding | No | Yes |
| GovernanceScorer | governance_scorer.py | governance | No | Yes |
| EfficiencyScorer | efficiency_scorer.py | efficiency | No | Yes |
| CheckpointScorer | checkpoint_scorer.py | outcome (s_partial) | No | Yes (if task.checkpoints) |
| RobustnessScorer | robustness_scorer.py | R in CLEAR | No | Via CLI robustness cmd |
| ErrorAnnotator | error_annotator.py | post-hoc taxonomy | Yes (semantic) | Not wired (standalone) |
| WorfEvalScorer | workflow_scorer.py | workflow DAG | No | Not yet wired |
7. Role × QCAT × Environment Coverage
5 scored roles:
| Role | SLURM access | Telemetry | RBAC | Facility |
|---|---|---|---|---|
| scientific_user | Own jobs only | Own node only | Read own | No |
| sysadmin | All jobs + nodes | All nodes | Read + write | Partial |
| facility_admin | All + cluster-wide | All + energy | Full | Full |
| researcher | Own + project group | Aggregate + own project | Read own | No |
| system_designer | All (capacity planning) | All (design scope) | Read all | Design scope |
10 scored QCATs:
| QCAT | Task focus | Tools primarily used |
|---|---|---|
| JOB | Job submission, status, failure diagnosis | slurm, docs, rbac |
| MON | Node health, telemetry, incident response | telemetry, slurm, docs |
| ENERGY | Power usage, efficiency, facility controls | telemetry, facility, rbac |
| PERF | Profiling, bottlenecks, scaling studies | telemetry, slurm, docs |
| DATA | Filesystems, quotas, I/O, data transfer | slurm, docs |
| SEC | IAM, access control, compliance | rbac, docs |
| FAC | Cooling, BMS/DCIM, rack health, alarms | facility, telemetry, docs |
| ARCH | Topology, hardware specs, capacity planning | topology, inventory, docs |
| AIOPS | Anomaly detection, predictive maintenance | telemetry, docs |
| DOCS | Docs retrieval, tutorials, FAQs, policies | docs |
Dataset split (extended 2026-05-02, first frozen 2026-03-21):
- Dev: 53 tasks (~75%), stratified by QCAT × role × difficulty
- Test: 18 tasks (~25%), held-out, run exactly once at end of paper development
- Single-task strata (DATA/FAC/ARCH/DOCS and all RES/DES strata) are dev-only
8. Configuration System
Scoring profiles (benchmark/configs/scoring_profiles.yaml):
- alpha0_minimal: outcome only (1.0)
- alpha1_grounding: outcome + grounding (0.5/0.5)
- default_hpc_v01: full six-dimension weighted profile
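To make the default_hpc_v01 profile concrete, the sketch below applies the weights listed in Section 4 as a weighted sum with the hard-fail override. Function names are hypothetical, the linear interpolation between 5 and 20 steps is an assumption consistent with the stated endpoints, and the sketch expects all six dimension scores to be supplied; how the real AggregateScorer handles a missing robustness score is not stated here.

```python
from typing import Dict

# Weights for default_hpc_v01 as listed in Section 4.
DEFAULT_HPC_V01: Dict[str, float] = {
    "outcome": 0.30,
    "tool_use": 0.20,
    "grounding": 0.15,
    "governance": 0.20,
    "robustness": 0.10,
    "efficiency": 0.05,
}


def efficiency_score(n_steps: int) -> float:
    """1.0 at <=5 steps, 0.0 at >=20 steps, straight line in between (assumed)."""
    if n_steps <= 5:
        return 1.0
    if n_steps >= 20:
        return 0.0
    return (20 - n_steps) / 15


def aggregate_score(dims: Dict[str, float],
                    weights: Dict[str, float],
                    hard_fail: bool = False) -> float:
    """Weighted sum of dimension scores; a hard fail forces the result to 0.0."""
    if hard_fail:
        return 0.0
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("profile weights must sum to 1.0")
    return sum(w * dims[name] for name, w in weights.items())


# Toy dimension scores for one task (illustrative values only).
dims = {
    "outcome": 1.0,
    "tool_use": 0.5,
    "grounding": 0.8,
    "governance": 1.0,
    "robustness": 0.5,
    "efficiency": efficiency_score(11),  # 9/15 = 0.6
}
print(aggregate_score(dims, DEFAULT_HPC_V01))
print(aggregate_score(dims, DEFAULT_HPC_V01, hard_fail=True))
```

The same function covers the smaller profiles: alpha0_minimal would be `{"outcome": 1.0}` and alpha1_grounding `{"outcome": 0.5, "grounding": 0.5}`.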
Tool catalog (benchmark/configs/hpc_tool_catalog.yaml):
- 16 tool methods across 5 tool families
- Each method: description, parameters, role_visibility (which roles can call it), dangerous_args (conditions that trigger hard-fail)
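A minimal sketch of how role_visibility could drive the role-filtered tool set is shown below. The dictionary shape, the dotted method names, and the role assignments are all assumptions made for illustration; the actual schema of hpc_tool_catalog.yaml is not reproduced in this document.

```python
from typing import Dict, List

# Hypothetical, hand-written catalog fragment. The real file is YAML and its
# exact keys and role assignments may differ.
CATALOG: Dict[str, Dict[str, List[str]]] = {
    "slurm.query_jobs": {
        "role_visibility": ["scientific_user", "sysadmin", "facility_admin"],
    },
    "slurm.cancel_job": {
        "role_visibility": ["sysadmin", "facility_admin"],
    },
    "facility.set_power_cap": {
        "role_visibility": ["facility_admin"],
    },
}


def visible_methods(catalog: Dict[str, Dict[str, List[str]]], role: str) -> List[str]:
    """Return the sorted method names whose role_visibility includes `role`."""
    return sorted(
        name
        for name, spec in catalog.items()
        if role in spec.get("role_visibility", [])
    )


print(visible_methods(CATALOG, "scientific_user"))
print(visible_methods(CATALOG, "sysadmin"))
```

Filtering the catalog up front means the adapter only ever advertises tool schemas the role is allowed to call, while the dispatch-time RBAC check still catches calls to anything outside that set.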
RBAC policies (benchmark/environments/env_*/rbac_policy.yaml):
- Per-environment, per-role: allowed_tools, partition_access, access_tiers
- Schema version: v1.1
9. External System Integrations
| System | Integration point | Required? |
|---|---|---|
| OpenAI / Azure OpenAI | openai_adapter.py: reads OPENAI_API_KEY / Azure env vars | No (direct_qa works without) |
| Anthropic | anthropic_adapter.py: reads ANTHROPIC_API_KEY | No |
| MCP server | mcp_adapter.py: stdio/SSE | No |
| Langfuse | langfuse_exporter.py: reads LANGFUSE_* env vars | No (--langfuse flag) |
| vLLM / OpenRouter | OpenAIAdapter with OPENAI_BASE_URL env var override | No (zero code change) |
| Zenodo | Dataset DOI archival (post-submission) | No |
To add an open-weight baseline via vLLM or OpenRouter:
export OPENAI_BASE_URL=http://localhost:8000/v1 # vLLM
export OPENAI_API_KEY=dummy
make run-all-openai MODEL=meta-llama/Llama-3.1-8B-Instruct
10. CLI Command Reference (Summary)
Full reference: docs/reference/commands.md
| Command | Description |
|---|---|
| aobench validate benchmark | Validate all task specs and environment bundles |
| aobench run task TASK_ID | Run one task with given adapter and environment |
| aobench run all | Run all dev-split tasks |
| aobench report json | Generate JSON summary report for a run |
| aobench report html | Generate self-contained HTML report |
| aobench report slices | Role × QCAT stratification report |
| aobench compare RUN_A RUN_B | Diff two run directories |
| aobench robustness task TASK_ID | Compute pass^k for one task |
| aobench robustness all | Compute pass^k across all tasks |
| aobench clear run RUN_DIR | Compute CLEAR scorecard for a run |
| aobench lite select | Run AOBench-Lite 3-stage task selection |
11. Architecture Diagrams
Visual companion to sections 1–10 above. All diagrams reflect the implemented system (Alpha-0 / v0.1).
11.1 System Overview
graph TB
subgraph CLI["CLI Layer (aobench)"]
RUN["aobench run task/all"]
VAL["aobench validate benchmark"]
REP["aobench report json/html/slices"]
CMP["aobench compare runs"]
ROB["aobench robustness task"]
end
subgraph BENCH["Benchmark Dataset (benchmark/)"]
TASKS["tasks/specs/*.json\n(12 TaskSpecs)"]
ENVS["environments/env_01…05/\n(5 Snapshots)"]
CFGS["configs/\nscoring_profiles.yaml\ntool_registry.yaml"]
end
subgraph CORE["Core Pipeline (src/aobench/)"]
LOADER["Loaders\nTaskLoader · EnvLoader\nBenchmarkRegistry"]
RUNNER["BenchmarkRunner"]
ADAPTER["Adapter\ndirect_qa | openai"]
TOOLS["ToolRegistry\n+ Mock Tools"]
SCORERS["Scoring Engine\nAggregateScorer"]
WRITER["TraceWriter"]
end
subgraph ARTIFACTS["Runtime Artifacts (data/runs/<run_id>/)"]
TRACE["traces/<task_id>_trace.json"]
RESULT["results/<task_id>_result.json"]
SUMMARY["run_summary.json"]
HTML["report.html"]
end
RUN --> RUNNER
VAL --> LOADER
REP --> SUMMARY
REP --> HTML
CMP --> RESULT
ROB --> RUNNER
BENCH --> LOADER
LOADER --> RUNNER
RUNNER --> ADAPTER
RUNNER --> TOOLS
ADAPTER --> TOOLS
RUNNER --> SCORERS
RUNNER --> WRITER
WRITER --> TRACE
WRITER --> RESULT
REP --> SUMMARY
11.2 Execution Flow (Single Task Run)
flowchart TD
A([CLI: aobench run task]) --> B[BenchmarkRunner.run]
B --> C1[TaskLoader.load_task\ntask_id → TaskSpec]
B --> C2[EnvironmentLoader.load_environment\nenv_id → EnvironmentBundle]
C1 --> D[Build ToolRegistry\nfor role + allowed_tools]
C2 --> D
D --> E[ExecutionContext\ntask + env + tools + run_id]
E --> F{Adapter}
subgraph FA["DirectQAAdapter"]
F1[Return placeholder answer\nno tool calls]
end
subgraph OA["OpenAIAdapter"]
O1[Build system prompt\n+ tool schemas] --> O2[LLM API call]
O2 --> O3{tool_calls\nin response?}
O3 -- yes --> O4[ToolRegistry.call\ntool_name · method · args]
O4 --> O5[Append TraceStep\ntool_call + observation]
O5 --> O2
O3 -- no / max rounds --> O6[Extract final_answer]
end
F -- direct_qa --> FA
F -- openai --> OA
FA --> G[Trace]
OA --> G
G --> H[TraceWriter.write_trace\n→ traces/<task_id>_trace.json]
G --> I[AggregateScorer.score]
subgraph SCORE["Scoring Dimensions"]
S1[OutcomeScorer\nexact / semantic / numeric]
S2[ToolUseScorer\ncoverage · precision · redundancy]
S3[GroundingScorer\ntoken overlap]
S4[GovernanceScorer\npermission violations]
S5[EfficiencyScorer\nstep count penalty]
S6[RobustnessScorer\nscore variance]
end
I --> S1 & S2 & S3 & S4 & S5 & S6
S1 & S2 & S3 & S4 & S5 & S6 --> W[Weighted Aggregate\nper scoring_profiles.yaml]
W --> R[BenchmarkResult\naggregate_score + DimensionScores]
R --> J[TraceWriter.write_result\n→ results/<task_id>_result.json]
J --> K([Return BenchmarkResult])
11.3 Component Architecture
graph LR
subgraph Schemas["schemas/"]
TS[TaskSpec]
ES[EnvironmentBundle]
TR[Trace + TraceStep]
TC[ToolCall + Observation]
BR[BenchmarkResult\nDimensionScores]
SC[ScoringConfig\nWeightProfile]
end
subgraph Loaders["loaders/"]
TL[TaskLoader]
EL[EnvironmentLoader]
REG[BenchmarkRegistry]
end
subgraph Adapters["adapters/"]
BA[BaseAdapter]
DQA[DirectQAAdapter]
OAI[OpenAIAdapter]
end
subgraph Runners["runners/"]
CTX[ExecutionContext]
RUN[BenchmarkRunner]
TW[TraceWriter]
end
subgraph Tools["tools/"]
BT[BaseTool]
SL[MockSlurmTool\nquery_jobs · job_details\nlist_nodes · list_partitions]
DO[MockDocsTool\nretrieve · list_docs]
RB[MockRBACTool\ncheck · list_permissions]
TE[MockTelemetryTool\nquery_memory_events · list_metrics]
FA[MockFacilityTool\nquery_node_power · query_cluster_energy\nquery_rack_telemetry · list_inventory]
TR2[ToolRegistry\nenforce allowed_tools]
end
subgraph Scorers["scorers/"]
BS[BaseScorer]
OS[OutcomeScorer]
TUS[ToolUseScorer]
GRS[GroundingScorer]
GOS[GovernanceScorer]
EFS[EfficiencyScorer]
RBS[RobustnessScorer]
AGG[AggregateScorer]
end
subgraph Reports["reports/"]
JR[JsonReport]
HR[HtmlReport]
SL2[Slices\nrole × category matrix]
end
TL --> TS
EL --> ES
REG --> TL & EL
BA --> DQA & OAI
BT --> SL & DO & RB & TE & FA
SL & DO & RB & TE & FA --> TR2
RUN --> REG
RUN --> CTX
RUN --> TR2
CTX --> BA
BA --> TR
TR --> TW
TR --> AGG
BS --> OS & TUS & GRS & GOS & EFS & RBS
OS & TUS & GRS & GOS & EFS & RBS --> AGG
AGG --> BR
BR --> TW
TW --> JR & HR & SL2
11.4 Environment Snapshot Structure
graph TD
subgraph ENV["benchmark/environments/env_NN/"]
META["metadata.yaml\nenv_id · snapshot · cluster\nroles · categories · status"]
SLURM["slurm/\nslurm_state.json\njob_details.json"]
TELEM["telemetry/\nmemory_events.csv\ntelemetry_timeseries.parquet"]
DOCS["docs/\n*.md: HPC documentation"]
POL["policy/\nrbac_policy.yaml"]
INC["incidents/\nincident_metadata.json"]
POWER["power/\nnode_power.csv\nrack_telemetry.csv\ninventory.json"]
end
META --> ENV_BUNDLE[EnvironmentBundle]
SLURM --> MockSlurmTool
TELEM --> MockTelemetryTool
DOCS --> MockDocsTool
POL --> MockRBACTool
POWER --> MockFacilityTool
11.5 Scoring Pipeline
flowchart LR
T[Trace] --> OS
T --> TUS
T --> GRS
T --> GOS
T --> EFS
TASK[TaskSpec\ngold_answer\ngold_evidence_refs\nhard_fail_conditions] --> OS & TUS & GRS & GOS
OS["OutcomeScorer\nexact_match\nsemantic_match\nnumeric\n→ 0–1"] --> AGG
TUS["ToolUseScorer\ncoverage\nprecision\nno-redundancy\n→ 0–1"] --> AGG
GRS["GroundingScorer\ntoken overlap\nobs vs answer\n→ 0–1"] --> AGG
GOS["GovernanceScorer\npermission violations\n-0.25 per violation\n→ 0–1"] --> AGG
EFS["EfficiencyScorer\nstep count linear\n≤5 → 1.0, ≥20 → 0.0\n→ 0–1"] --> AGG
PROFILE["WeightProfile\n(scoring_profiles.yaml)\ndefault_hpc_v01\nalpha1_grounding\nalpha0_minimal"] --> AGG
AGG["AggregateScorer\nweighted sum\nhard-fail check"] --> BR
BR["BenchmarkResult\naggregate_score\nDimensionScores\nhard_fail"]
11.6 Role-Based Access Control Flow
flowchart TD
TASK[TaskSpec\nrole: scientific_user\nallowed_tools: slurm · docs · telemetry] --> RUN[BenchmarkRunner]
RUN --> TR2[ToolRegistry\nfiltered to allowed_tools]
TR2 --> CALL{Tool Call}
CALL -- "slurm.query_jobs()" --> SLURM[MockSlurmTool]
CALL -- "docs.retrieve()" --> DOCS[MockDocsTool]
CALL -- "facility.query_power()" --> DENY[ToolResult\npermission_denied=True]
SLURM --> RBAC{Role Check}
RBAC -- "scientific_user\n→ own jobs only" --> OWN[Filtered results\nown user's jobs]
RBAC -- "sysadmin\n→ all jobs" --> ALL[All job results]
DENY --> GOS[GovernanceScorer\npenalize violation]
OWN --> TRACE[TraceStep observation]
ALL --> TRACE
11.7 CLI Command Map
graph TD
CLI([aobench]) --> RUN[run]
CLI --> VAL[validate]
CLI --> REP[report]
CLI --> CMP[compare]
CLI --> ROB[robustness]
RUN --> RT["run task\n--task ID --env ID\n--adapter direct_qa|openai:gpt-4o"]
RUN --> RA["run all\n--adapter NAME"]
VAL --> VB["validate benchmark\n--benchmark DIR"]
REP --> RJ["report json RUN_DIR\n→ run_summary.json"]
REP --> RH["report html RUN_DIR\n→ report.html"]
REP --> RS["report slices RUN_DIR\n→ role×category table"]
CMP --> CR["compare runs RUN_A RUN_B\n→ diff JSON"]
ROB --> RBT["robustness task\n--task ID --env ID\n--adapter NAME --n N\n→ mean · std · robustness_score"]