System Architecture


1. AOBench App: Component Map

src/aobench/
├── cli/               CLI commands (typer app)
│   ├── main.py        Entry point; registers all sub-commands
│   ├── run_cmd.py     aobench run task / run all
│   ├── validate_cmd.py aobench validate benchmark
│   ├── report_cmd.py  aobench report json / html / slices
│   ├── compare_cmd.py aobench compare
│   ├── robustness_cmd.py aobench robustness task / all
│   ├── clear_cmd.py   aobench clear run
│   └── lite_cmd.py    aobench lite select
│
├── schemas/           Pydantic data models (no logic)
│   ├── task.py        TaskSpec, HPCTaskSpec, HPCRoleVariant, EvalCriteria, HybridScoringConfig
│   ├── trace.py       Trace, TraceStep, ToolCall, Observation, BenchmarkResult
│   ├── snapshot.py    SlurmState, SlurmJob, SlurmNode, IncidentMetadata, EnvBundle
│   └── trace_annotation.py  ErrorAnnotation, TraceAnnotation, HolisticScores
│
├── loaders/           Data loading (stateless functions)
│   └── task_loader.py Load TaskSpec + HPCTaskSpec from JSON; RAG context builder
│
├── tasks/             Dataset management
│   ├── task_loader.py Load task by ID from benchmark/tasks/specs/
│   ├── context_builder.py Build RAG context string for HPC task set v1
│   └── dataset_splits.py  Split manifest (62 dev / 18 test tasks, ~22% held-out)
│
├── environment/       Environment snapshot system
│   ├── snapshot_loader.py   Build ToolRegistry from EnvBundle
│   └── snapshot_validator.py  validate_bundle(): JSON schema checks
│
├── tools/             Mock HPC tool implementations
│   ├── slurm_tool.py       MockSlurmTool (query_jobs, job_details, cancel_job, etc.)
│   ├── docs_tool.py        MockDocsTool (retrieve)
│   ├── rbac_tool.py        MockRBACTool (get_allowed_tools, check_permission)
│   ├── telemetry_tool.py   MockTelemetryTool (query_timeseries, query_node_metrics)
│   ├── facility_tool.py    MockFacilityTool (get_power_usage, set_power_cap)
│   ├── registry.py         ToolRegistry: role-filtered tool dispatch
│   └── catalog_loader.py   Load hpc_tool_catalog.yaml → tool schema dict
│
├── adapters/          Agent backend adapters
│   ├── base.py             BaseAdapter interface (run(task, context) → Trace)
│   ├── direct_qa_adapter.py DirectQA: zero-tool baseline
│   ├── openai_adapter.py   OpenAIAdapter: GPT-4o, GPT-4o-mini, o1 (plain + Azure)
│   ├── anthropic_adapter.py AnthropicAdapter: Claude (native tool_use blocks)
│   └── mcp_adapter.py      MCPClientAdapter: stdio + SSE transports
│
├── runners/           Execution orchestration
│   ├── runner.py      BenchmarkRunner.run_task(): full pipeline per task
│   ├── trace_writer.py TraceWriter: appends steps and tool calls to Trace
│   └── context.py     ExecutionContext dataclass
│
├── scorers/           Scoring engine (12 scorers)
│   ├── aggregate.py   AggregateScorer: orchestrates all dimensions
│   ├── outcome_scorer.py    OutcomeScorer
│   ├── tool_use_scorer.py   ToolUseScorer (BFCL-decomposed)
│   ├── grounding_scorer.py  GroundingScorer
│   ├── governance_scorer.py GovernanceScorer (RBAC hard-fail)
│   ├── efficiency_scorer.py EfficiencyScorer
│   ├── robustness_scorer.py compute_pass_k, compute_robustness_suite
│   ├── hybrid_scorer.py     HybridScorer (routes deterministic vs rubric)
│   ├── deterministic.py     DAComp three-tier (CS / CFS / SR)
│   ├── rubric_scorer.py     LLM-judge rubric scoring
│   ├── gsb_scorer.py        Good-Same-Bad comparative scoring
│   ├── checkpoint_scorer.py Checkpoint partial-credit scoring
│   ├── workflow_scorer.py   WorfEvalScorer: workflow DAG matching
│   └── error_annotator.py   TRAIL-adapted HPC error taxonomy (14 categories)
│
├── reports/           Output generation
│   ├── clear_report.py      CLEAR five-dimension scorecard (E/A/R/C/L)
│   ├── json_report.py       Full JSON result dump
│   ├── html_report.py       Self-contained HTML report
│   └── slice_report.py      Role × QCAT stratification slices
│
├── exporters/
│   └── langfuse_exporter.py Optional Langfuse observability export
│
├── taxonomy/
│   └── hpc_error_taxonomy.yaml  24-leaf TRAIL-adapted error taxonomy
│
└── utils/
    ├── logging.py     get_logger(), configure_logging()
    ├── cost.py        estimate_cost(model, prompt_tokens, completion_tokens)
    └── ids.py         make_trace_id(), make_run_id()
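
The adapters/ entry in the map above describes BaseAdapter as `run(task, context)` returning a Trace. A minimal sketch of what such an interface could look like follows. The three dataclasses are simplified stand-ins for the real Pydantic schemas, and `ZeroToolAdapter` is a hypothetical name used only for illustration, not the project's DirectQA implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TaskSpec:            # simplified stand-in for schemas/task.py
    task_id: str
    query_text: str
    role: str


@dataclass
class ExecutionContext:    # simplified stand-in for runners/context.py
    run_id: str


@dataclass
class Trace:               # simplified stand-in for schemas/trace.py
    task_id: str
    steps: list = field(default_factory=list)
    final_answer: str = ""
    hard_fail: bool = False


class BaseAdapter(ABC):
    """Every backend turns one task into one Trace."""

    @abstractmethod
    def run(self, task: TaskSpec, context: ExecutionContext) -> Trace:
        ...


class ZeroToolAdapter(BaseAdapter):
    """Hypothetical zero-tool baseline: answers without calling any tool."""

    def run(self, task: TaskSpec, context: ExecutionContext) -> Trace:
        return Trace(task_id=task.task_id, final_answer=f"[no tools] {task.query_text}")


trace = ZeroToolAdapter().run(
    TaskSpec("JOB_USR_001", "Why did my job fail?", "scientific_user"),
    ExecutionContext(run_id="demo"),
)
print(trace.final_answer, len(trace.steps))
```

Keeping the interface this narrow is what would let the runner and scorers stay unaware of which backend produced a trace.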

2. Dataset & Benchmark Data

benchmark/
├── tasks/
│   ├── specs/          80 task JSON files across all 10 QCATs × 5 roles
│   ├── task_set_v1.json  36 HPC task set v1 tasks (HPCTaskSpec format)
│   ├── task_set_v3.json  v3 task index (80 tasks)
│   ├── dataset_splits.py 62 dev / 18 test split (~22% held-out, frozen 2026-05-02)
│   ├── guidelines/     6 domain guideline files for task set v1
│   └── lite_manifest_v1.json  AOBench-Lite task subset
│
├── environments/
│   └── env_01/ … env_26/   26 snapshot bundles, each with:
│       ├── slurm_state.json     SLURM jobs, nodes, partitions
│       ├── incident_metadata.json  Active incidents
│       ├── rbac_policy.yaml     Role permissions (v1.1, 5 roles)
│       └── telemetry/           Parquet files for timeseries/node metrics
│
├── configs/
│   ├── hpc_tool_catalog.yaml    16 tool methods, role visibility, dangerous_args
│   ├── scoring_profiles.yaml    Named weight profiles
│   └── error_taxonomy.yaml      Score-based error categories (14)
│
└── qa/                  Embedded AOBench-QA dataset (~95 queries)

Delivered scope:

| Item | v0.1 baseline | v0.3 (current) |
|---|---|---|
| Tasks | 66 (30 original + 36 HPC v1) | 71 (+ PERF/DATA/SEC/FAC/ARCH/AIOPS/DOCS tasks) |
| Environments | 20 snapshot bundles (env_01–env_20) | 26 snapshot bundles (env_01–env_26) |
| Roles (scored) | 3 (scientific_user, sysadmin, facility_admin) | 5 (all roles, incl. researcher, system_designer) |
| QCATs (scored) | 3 (JOB, MON, ENERGY) | 10 (all QCATs) |
| Adapters | 4 implemented (direct_qa, openai, anthropic, mcp) | 4 |
| Scorers | 12 scorers across 6 dimensions | 13 scorers (+ WorfEvalScorer) |
| CLI commands | 9+ commands | 9+ commands |

3. End-to-End Execution Flow

aobench run task --task JOB_USR_001 --env env_01 --adapter openai:gpt-4o

CLI (run_cmd.py)
│
├─ 1. Load TaskSpec from benchmark/tasks/specs/JOB_USR_001.json
│      task_loader.load_task(task_id) → TaskSpec
│
├─ 2. Load EnvBundle from benchmark/environments/env_01/
│      snapshot_loader.load_environment(env_id) → EnvBundle
│      snapshot_validator.validate_bundle(bundle) → raises on schema error
│
├─ 3. Build ToolRegistry (role-filtered)
│      snapshot_loader.build_tool_registry(bundle, role=task.role)
│      → ToolRegistry with allowed methods per role
│
├─ 4. Select Adapter
│      _build_adapter("openai:gpt-4o") → OpenAIAdapter(model="gpt-4o")
│
├─ 5. BenchmarkRunner.run_task(task, env_bundle, adapter)
│   │
│   ├─ 5a. Build prompt: task.query_text + role + tool schemas
│   │
│   ├─ 5b. adapter.run(task, tool_registry, execution_context) [loop, ≤10 rounds]
│   │        For each LLM response:
│   │        ├─ If tool_call → ToolRegistry.dispatch(tool_name, args)
│   │        │   ├─ RBAC check → permission_denied if not allowed
│   │        │   └─ Tool method returns observation (JSON)
│   │        ├─ TraceWriter.append_step(step)
│   │        └─ If stop_reason=stop → exit loop
│   │
│   ├─ 5c. TraceWriter.finalize() → Trace
│   │        Contains: steps[], final_answer, hard_fail,
│   │                  model_name, prompt_tokens, completion_tokens
│   │
│   └─ 5d. AggregateScorer.score(task, trace) → BenchmarkResult
│             [see Section 4: Scoring Pipeline]
│
├─ 6. Persist results
│      data/runs/<run_id>/
│      ├── <task_id>_result.json     BenchmarkResult
│      ├── <task_id>_trace.json      Full Trace
│      └── manifest.json             Model, date, split, commit hash
│
└─ 7. Optional: Langfuse export (--langfuse flag)
       langfuse_exporter.export(trace, result) → post to Langfuse server
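
Step 5b above is the heart of the run: a bounded loop in which each tool call passes through a role-filtered registry before it reaches a mock tool. The following is a rough sketch of that control flow under stated assumptions. The `ToolRegistry` class here is a toy stand-in, `run_loop` and `scripted_llm` are hypothetical names, and the dict shapes used for replies and observations are invented for illustration. Only tool-calling rounds are recorded as steps in this sketch.

```python
MAX_ROUNDS = 10  # "loop, <=10 rounds" from step 5b


class ToolRegistry:
    """Toy role-filtered registry: only allowed tool names are dispatched."""

    def __init__(self, tools, allowed):
        self.tools = tools            # name -> callable(**args)
        self.allowed = set(allowed)

    def dispatch(self, name, args):
        if name not in self.allowed:
            return {"permission_denied": True, "tool": name}
        return {"permission_denied": False, "result": self.tools[name](**args)}


def run_loop(llm, registry):
    """Drive the model until it stops or the round budget runs out."""
    steps, hard_fail, final_answer = [], False, ""
    for _ in range(MAX_ROUNDS):
        reply = llm(steps)                      # the model sees prior steps
        if reply.get("tool_call"):
            call = reply["tool_call"]
            obs = registry.dispatch(call["name"], call["args"])
            # A denied call is propagated as a hard fail on the trace.
            hard_fail = hard_fail or obs["permission_denied"]
            steps.append({"tool_call": call, "observation": obs})
        if reply.get("stop_reason") == "stop":
            final_answer = reply.get("content", "")
            break
    return {"steps": steps, "final_answer": final_answer, "hard_fail": hard_fail}


def scripted_llm(steps):
    """Stand-in model: one tool call, then a final answer."""
    if not steps:
        return {"tool_call": {"name": "slurm.query_jobs", "args": {"user": "alice"}}}
    return {"stop_reason": "stop", "content": "1 job found"}


registry = ToolRegistry(
    tools={"slurm.query_jobs": lambda user: [{"job_id": 42, "user": user}]},
    allowed=["slurm.query_jobs"],
)
trace = run_loop(scripted_llm, registry)
print(trace["final_answer"], len(trace["steps"]), trace["hard_fail"])
```

Putting the permission check inside `dispatch` rather than in each adapter means every backend is subject to the same RBAC gate.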

4. Scoring Pipeline

Trace + TaskSpec
│
├─ OutcomeScorer          → score ∈ [0,1]
│  Mode routing:
│  • exact_match: case-insensitive string equality
│  • numeric: ±5% relative tolerance
│  • semantic_match: 60% rapidfuzz + 40% numeric blend
│  NOTE: If task.hybrid_scoring is set, HybridScorer replaces this.
│
├─ HybridScorer (optional, replaces OutcomeScorer)
│  ├─ deterministic path: DAComp CS/CFS/SR
│  │   CS = weighted component partial credit
│  │   CFS = cascading failure (upstream errors nullify downstream)
│  │   SR = strict all-or-nothing (outcome used = SR)
│  └─ rubric path: LLM judge → score_rubric
│      + optional GSB: α·score_rubric + (1−α)·score_gsb
│
├─ ToolUseScorer          → score ∈ [0,1]  +  tool_use_detail
│  Decomposed (BFCL-style) when expected_tool_sequence set:
│  • selection_score = |expected ∩ actual| / |expected|
│  • argument_score = per-arg match (±5% numeric, exact string)
│  • sequence_score = LCS(expected, actual) / |expected|
│  • forbidden_call_penalty = 1.0 − 0.3 × |disallowed calls|
│  With gold_trajectory: upgrades to 0.5×base + 0.3×NED + 0.2×F1
│
├─ GroundingScorer        → score ∈ [0,1]
│  Key token overlap: |answer_tokens ∩ observation_tokens| / |answer_tokens|
│  Key tokens: multi-digit numbers, HPC entities (node*, gpu*), status words
│
├─ GovernanceScorer       → score ∈ [0,1]  +  ViolationVector
│  Hard-fail triggers:
│  • trace.hard_fail (permission_denied propagated from tool)
│  • dangerous_args matched against hpc_tool_catalog.yaml conditions
│  Penalties: FORBIDDEN_CALL_PENALTY=0.50, PERMISSION_DENIED=0.25
│  rbac_compliant = True iff score == 1.0
│
├─ EfficiencyScorer       → score ∈ [0,1]
│  Linear: ≤5 steps → 1.0, ≥20 steps → 0.0
│
├─ [Optional] CheckpointScorer  → s_partial, s_full
│  4 evaluator types: tool_call_present, response_contains_gt,
│                     no_forbidden_calls, tool_call_with_metric
│  s_partial = 0.5×(checkpoints_passed/total) + 0.5×s_full
│
└─ AggregateScorer (orchestrator)
   ├─ effective_outcome = s_partial if checkpoints else outcome_score
   ├─ CuP gating: cup_score penalized by ViolationVector
   ├─ Weight profile (from scoring_profiles.yaml):
   │   default_hpc_v01:
   │   outcome=0.30, tool_use=0.20, grounding=0.15,
   │   governance=0.20, robustness=0.10, efficiency=0.05
   ├─ aggregate_score = Σ(weight_i × dim_i)
   └─ IF hard_fail=True → aggregate_score forced to 0.0
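
To make the last branch of the tree concrete, here is a small sketch of the weighted aggregate with the hard-fail override, together with the efficiency ramp and the checkpoint partial-credit formula that feed it. The weights are the default_hpc_v01 values listed above. The function names, the treatment of a missing dimension as 0.0, and the sample dimension scores are assumptions for illustration. CuP gating is not modelled.

```python
# Weights of the default_hpc_v01 profile, as listed in the tree above.
DEFAULT_HPC_V01 = {
    "outcome": 0.30, "tool_use": 0.20, "grounding": 0.15,
    "governance": 0.20, "robustness": 0.10, "efficiency": 0.05,
}


def efficiency_score(n_steps):
    """Linear ramp: 1.0 at 5 steps or fewer, 0.0 at 20 steps or more."""
    return min(1.0, max(0.0, (20 - n_steps) / 15))


def s_partial(passed, total, s_full):
    """Checkpoint partial credit: half checkpoint ratio, half full-success score."""
    return 0.5 * (passed / total) + 0.5 * s_full


def aggregate(dims, weights=DEFAULT_HPC_V01, hard_fail=False):
    """Weighted sum over dimensions; a hard fail forces the result to 0.0."""
    if hard_fail:
        return 0.0
    # Assumption: a dimension missing from dims contributes 0.0.
    return sum(w * dims.get(name, 0.0) for name, w in weights.items())


dims = {
    "outcome": s_partial(passed=3, total=4, s_full=0.0),  # checkpoints present
    "tool_use": 0.8,
    "grounding": 0.6,
    "governance": 1.0,
    "robustness": 0.5,
    "efficiency": efficiency_score(8),
}
print(round(aggregate(dims), 4))
print(aggregate(dims, hard_fail=True))
```

Because the hard-fail check short-circuits before the sum, a single RBAC violation cannot be offset by strong scores on the other dimensions.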

Output: BenchmarkResult
├─ dimension_scores: {outcome, tool_use, grounding, governance, efficiency}
├─ aggregate_score (0–1, 0.0 if hard_fail)
├─ hard_fail, hard_fail_reason
├─ rbac_compliant (bool)
├─ cup_score (CuP-gated efficacy)
├─ violation_vector (6 boolean flags)
├─ tool_use_detail (ToolUseResult with sub-scores)
├─ checkpoint_results, s_partial, s_full
└─ cost_estimate_usd, latency_seconds, model_name, token counts
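
The tool_use_detail field carries the sub-scores defined under ToolUseScorer. A sketch of three of them follows, assuming tool calls are compared by name only and that the forbidden-call term is clamped at 0.0 (the formula as written has no lower bound). The function names and the sample call lists are hypothetical, and how the sub-scores combine into the final tool_use score is not stated in this document, so the sketch stops at the sub-scores.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def tool_use_detail(expected, actual, disallowed):
    """Sub-scores by tool name; expected must be non-empty."""
    selection = len(set(expected) & set(actual)) / len(set(expected))
    sequence = lcs_length(expected, actual) / len(expected)
    bad_calls = sum(1 for name in actual if name in disallowed)
    # Clamping at 0.0 is an assumption of this sketch.
    forbidden = max(0.0, 1.0 - 0.3 * bad_calls)
    return {
        "selection_score": selection,
        "sequence_score": sequence,
        "forbidden_call_penalty": forbidden,
    }


expected = ["slurm.query_jobs", "slurm.job_details", "docs.retrieve"]
actual = ["slurm.query_jobs", "docs.retrieve", "facility.set_power_cap", "slurm.job_details"]
detail = tool_use_detail(expected, actual, disallowed={"facility.set_power_cap"})
print(detail)
```

In this example every expected tool was called, so selection is perfect, but the calls arrived out of order and one was disallowed, which the other two sub-scores pick up separately.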

5. CLEAR Scorecard

Computed by reports/clear_report.py across all results for a run.

Per model, from BenchmarkResult[]:

E  - Efficacy     = mean(outcome or s_partial)                  ∈ [0,1]
A  - Assurance    = fraction(rbac_compliant == True)             ∈ [0,1]
R  - Reliability  = mean(pass^k) across tasks (k=8 default)     ∈ [0,1]
    pass^k = ∏ᵢ (c−i)/(n−i) for i in 0..k−1
    where c = passing runs, n = total runs, pass_threshold=0.7

C  - Cost         = cost_estimate_usd, min-max normalised, inverted
L  - Latency      = latency_seconds, min-max normalised, inverted

CLEAR = 0.2·C_norm + 0.2·L_norm + 0.2·E + 0.2·A + 0.2·R

Additional metrics per model:
CNA = (outcome / cost_usd) × 100    [Cost-Normalised Accuracy]
CPS = total_cost / n_successful      [Cost Per Success]
cup = mean(cup_score)               [CuP-gated efficacy]
cup_gap = completion_rate − cup     [RBAC compliance gap]
risk_ratios = per-violation-flag fractions from violation_vector
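
The reliability and normalisation formulas above are short enough to check by hand. The sketch below computes pass^k from the product formula, applies inverted min-max normalisation, and forms the equal-weight CLEAR composite. The function names and sample numbers are made up, and two details are assumptions: a run counts as passing when its score is at or above pass_threshold, and a set of identical costs normalises to 1.0 for every model.

```python
def pass_hat_k(c, n, k):
    """pass^k = product over i in 0..k-1 of (c - i) / (n - i)."""
    if k > n:
        raise ValueError("k cannot exceed the number of runs n")
    prob = 1.0
    for i in range(k):
        prob *= max(c - i, 0) / (n - i)   # becomes 0.0 once fewer than k runs pass
    return prob


def inverted_min_max(values):
    """Min-max normalise, then invert so the cheapest or fastest scores 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                          # assumption: ties all score 1.0
        return [1.0 for _ in values]
    return [1.0 - (v - lo) / (hi - lo) for v in values]


def clear_score(c_norm, l_norm, e, a, r):
    """Equal-weight composite, 0.2 per dimension."""
    return 0.2 * (c_norm + l_norm + e + a + r)


scores = [0.9, 0.8, 0.65, 0.75, 0.71, 0.4, 0.95, 0.7]   # made-up per-run scores
passing = sum(s >= 0.7 for s in scores)                   # pass_threshold = 0.7
print(passing, pass_hat_k(passing, len(scores), k=2))
print(inverted_min_max([0.02, 0.10, 0.06]))
```

With k equal to n, pass^k is 1.0 only when every run passes, which is why a single flaky run drives R to zero at the default k=8 over eight runs.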

v0.1 results (dev split, 21 tasks):

| Model | E | A | R | CLEAR | Notes |
|---|---|---|---|---|---|
| direct_qa | 0.337 | 1.000 | 0.000 | 0.324 | Zero tool use; A=1.0 trivially |
| GPT-4o | 0.517 | 0.000 | - | - | A=0.000: RBAC failure on all tool episodes |

6. Scorer Reference Table

| Scorer | File | Dimension | LLM Required | Wired in AggregateScorer |
|---|---|---|---|---|
| OutcomeScorer | outcome_scorer.py | outcome | No | Yes |
| HybridScorer | hybrid_scorer.py | outcome (replaces above) | Optional | Yes (if hybrid_scoring set) |
| → DeterministicScorer | deterministic.py | outcome via Hybrid | No | Via Hybrid |
| → RubricScorer | rubric_scorer.py | outcome via Hybrid | Yes | Via Hybrid |
| → GSBScorer | gsb_scorer.py | outcome via Hybrid | Yes | Via Hybrid |
| ToolUseScorer | tool_use_scorer.py | tool_use | No | Yes |
| GroundingScorer | grounding_scorer.py | grounding | No | Yes |
| GovernanceScorer | governance_scorer.py | governance | No | Yes |
| EfficiencyScorer | efficiency_scorer.py | efficiency | No | Yes |
| CheckpointScorer | checkpoint_scorer.py | outcome (s_partial) | No | Yes (if task.checkpoints) |
| RobustnessScorer | robustness_scorer.py | R in CLEAR | No | Via CLI robustness cmd |
| ErrorAnnotator | error_annotator.py | post-hoc taxonomy | Yes (semantic) | Not wired (standalone) |
| WorfEvalScorer | workflow_scorer.py | workflow DAG | No | Not yet wired |
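
OutcomeScorer is listed above as needing no LLM, since its three modes from Section 4 are all deterministic. A rough sketch of those modes follows. The original blends rapidfuzz similarity with a numeric check. This sketch swaps in the standard library's `difflib.SequenceMatcher` for the fuzzy part so it runs without extra dependencies, and the scores will therefore differ from rapidfuzz. The function names, the whitespace stripping, and passing the numeric values in explicitly are all assumptions.

```python
from difflib import SequenceMatcher


def exact_match(answer, gold):
    """Case-insensitive string equality (whitespace stripping is an assumption)."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def numeric_match(answer, gold, rel_tol=0.05):
    """1.0 when the answer is within 5% of the gold value, relative to gold."""
    if gold == 0:
        return 1.0 if answer == 0 else 0.0
    return 1.0 if abs(answer - gold) <= rel_tol * abs(gold) else 0.0


def semantic_match(answer, gold, answer_num, gold_num):
    """60% fuzzy text similarity plus 40% numeric agreement."""
    # difflib stands in for rapidfuzz here; ratio() is in [0, 1].
    fuzzy = SequenceMatcher(None, answer.lower(), gold.lower()).ratio()
    return 0.6 * fuzzy + 0.4 * numeric_match(answer_num, gold_num)


print(exact_match("FAILED", "failed"))
print(numeric_match(104.0, 100.0), numeric_match(106.0, 100.0))
print(semantic_match("job ran out of memory", "job ran out of memory", 64.0, 64.0))
```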

7. Role × QCAT × Environment Coverage

5 scored roles:

| Role | SLURM access | Telemetry | RBAC | Facility |
|---|---|---|---|---|
| scientific_user | Own jobs only | Own node only | Read own | No |
| sysadmin | All jobs + nodes | All nodes | Read + write | Partial |
| facility_admin | All + cluster-wide | All + energy | Full | Full |
| researcher | Own + project group | Aggregate + own project | Read own | No |
| system_designer | All (capacity planning) | All (design scope) | Read all | Design scope |

10 scored QCATs:

| QCAT | Task focus | Tools primarily used |
|---|---|---|
| JOB | Job submission, status, failure diagnosis | slurm, docs, rbac |
| MON | Node health, telemetry, incident response | telemetry, slurm, docs |
| ENERGY | Power usage, efficiency, facility controls | telemetry, facility, rbac |
| PERF | Profiling, bottlenecks, scaling studies | telemetry, slurm, docs |
| DATA | Filesystems, quotas, I/O, data transfer | slurm, docs |
| SEC | IAM, access control, compliance | rbac, docs |
| FAC | Cooling, BMS/DCIM, rack health, alarms | facility, telemetry, docs |
| ARCH | Topology, hardware specs, capacity planning | topology, inventory, docs |
| AIOPS | Anomaly detection, predictive maintenance | telemetry, docs |
| DOCS | Docs retrieval, tutorials, FAQs, policies | docs |

Dataset split (extended 2026-05-02, first frozen 2026-03-21):

- Dev: 53 tasks (~75%), stratified by QCAT × role × difficulty
- Test: 18 tasks (~25%), held out and run exactly once at the end of paper development
- Single-task strata (DATA/FAC/ARCH/DOCS and all RES/DES strata) are dev-only


8. Configuration System

Scoring profiles (benchmark/configs/scoring_profiles.yaml):

- alpha0_minimal: outcome only (1.0)
- alpha1_grounding: outcome + grounding (0.5/0.5)
- default_hpc_v01: full six-dimension weighted profile

Tool catalog (benchmark/configs/hpc_tool_catalog.yaml):

- 16 tool methods across 5 tool families
- Each method: description, parameters, role_visibility (which roles can call it), dangerous_args (conditions that trigger hard-fail)

RBAC policies (benchmark/environments/env_*/rbac_policy.yaml):

- Per-environment, per-role: allowed_tools, partition_access, access_tiers
- Schema version: v1.1
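
As an illustration of how the role_visibility and dangerous_args fields of a tool catalog entry could be evaluated for a single call, consider the sketch below. The field names come from the tool catalog description above, but everything else is hypothetical: the catalog is shown as a Python dict rather than YAML, the condition shape (`arg`, `op`, `value`), the `cap_watts` argument, and the role assignments are invented for the example and do not reflect the real hpc_tool_catalog.yaml.

```python
# Hypothetical catalog entries; only the two field names are from the docs.
CATALOG = {
    "facility.set_power_cap": {
        "role_visibility": ["facility_admin"],
        "dangerous_args": [{"arg": "cap_watts", "op": "lt", "value": 100}],
    },
    "slurm.query_jobs": {
        "role_visibility": ["scientific_user", "sysadmin", "facility_admin"],
        "dangerous_args": [],
    },
}

# Comparison operators a condition may name (assumed vocabulary).
OPS = {
    "lt": lambda a, b: a < b,
    "gt": lambda a, b: a > b,
    "eq": lambda a, b: a == b,
}


def check_call(method, role, args, catalog=CATALOG):
    """Classify a call as 'ok', 'permission_denied' or 'dangerous_args'."""
    entry = catalog[method]
    if role not in entry["role_visibility"]:
        return "permission_denied"
    for cond in entry["dangerous_args"]:
        if cond["arg"] in args and OPS[cond["op"]](args[cond["arg"]], cond["value"]):
            return "dangerous_args"
    return "ok"


print(check_call("slurm.query_jobs", "scientific_user", {"user": "alice"}))
print(check_call("facility.set_power_cap", "sysadmin", {"cap_watts": 500}))
print(check_call("facility.set_power_cap", "facility_admin", {"cap_watts": 50}))
```

Checking visibility before arguments means a role that cannot see a method never has its arguments inspected, so the two hard-fail triggers stay distinguishable in the result.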


9. External System Integrations

| System | Integration point | Required? |
|---|---|---|
| OpenAI / Azure OpenAI | openai_adapter.py, reads OPENAI_API_KEY / Azure env vars | No (direct_qa works without) |
| Anthropic | anthropic_adapter.py, reads ANTHROPIC_API_KEY | No |
| MCP server | mcp_adapter.py, stdio/SSE | No |
| Langfuse | langfuse_exporter.py, reads LANGFUSE_* env vars | No (--langfuse flag) |
| vLLM / OpenRouter | OpenAIAdapter with OPENAI_BASE_URL env var override | No (zero code change) |
| Zenodo | Dataset DOI archival (post-submission) | No |

To add an open-weight baseline via vLLM or OpenRouter:

export OPENAI_BASE_URL=http://localhost:8000/v1  # vLLM
export OPENAI_API_KEY=dummy
make run-all-openai MODEL=meta-llama/Llama-3.1-8B-Instruct


10. CLI Command Reference (Summary)

Full reference: docs/reference/commands.md

| Command | Description |
|---|---|
| aobench validate benchmark | Validate all task specs and environment bundles |
| aobench run task TASK_ID | Run one task with given adapter and environment |
| aobench run all | Run all dev-split tasks |
| aobench report json | Generate JSON summary report for a run |
| aobench report html | Generate self-contained HTML report |
| aobench report slices | Role × QCAT stratification report |
| aobench compare RUN_A RUN_B | Diff two run directories |
| aobench robustness task TASK_ID | Compute pass^k for one task |
| aobench robustness all | Compute pass^k across all tasks |
| aobench clear run RUN_DIR | Compute CLEAR scorecard for a run |
| aobench lite select | Run AOBench-Lite 3-stage task selection |

11. Architecture Diagrams

Visual companion to sections 1–10 above. All diagrams reflect the implemented system (Alpha-0 / v0.1).

11.1 System Overview

graph TB
    subgraph CLI["CLI Layer (aobench)"]
        RUN["aobench run task/all"]
        VAL["aobench validate benchmark"]
        REP["aobench report json/html/slices"]
        CMP["aobench compare runs"]
        ROB["aobench robustness task"]
    end

    subgraph BENCH["Benchmark Dataset (benchmark/)"]
        TASKS["tasks/specs/*.json\n(12 TaskSpecs)"]
        ENVS["environments/env_01…05/\n(5 Snapshots)"]
        CFGS["configs/\nscoring_profiles.yaml\ntool_registry.yaml"]
    end

    subgraph CORE["Core Pipeline (src/aobench/)"]
        LOADER["Loaders\nTaskLoader · EnvLoader\nBenchmarkRegistry"]
        RUNNER["BenchmarkRunner"]
        ADAPTER["Adapter\ndirect_qa | openai"]
        TOOLS["ToolRegistry\n+ Mock Tools"]
        SCORERS["Scoring Engine\nAggregateScorer"]
        WRITER["TraceWriter"]
    end

    subgraph ARTIFACTS["Runtime Artifacts (data/runs/<run_id>/)"]
        TRACE["traces/<task_id>_trace.json"]
        RESULT["results/<task_id>_result.json"]
        SUMMARY["run_summary.json"]
        HTML["report.html"]
    end

    RUN --> RUNNER
    VAL --> LOADER
    REP --> SUMMARY
    REP --> HTML
    CMP --> RESULT
    ROB --> RUNNER

    BENCH --> LOADER
    LOADER --> RUNNER
    RUNNER --> ADAPTER
    RUNNER --> TOOLS
    ADAPTER --> TOOLS
    RUNNER --> SCORERS
    RUNNER --> WRITER
    WRITER --> TRACE
    WRITER --> RESULT

11.2 Execution Flow (Single Task Run)

flowchart TD
    A([CLI: aobench run task]) --> B[BenchmarkRunner.run]

    B --> C1[TaskLoader.load_task\ntask_id → TaskSpec]
    B --> C2[EnvironmentLoader.load_environment\nenv_id → EnvironmentBundle]

    C1 --> D[Build ToolRegistry\nfor role + allowed_tools]
    C2 --> D

    D --> E[ExecutionContext\ntask + env + tools + run_id]

    E --> F{Adapter}

    subgraph FA["DirectQAAdapter"]
        F1[Return placeholder answer\nno tool calls]
    end

    subgraph OA["OpenAIAdapter"]
        O1[Build system prompt\n+ tool schemas] --> O2[LLM API call]
        O2 --> O3{tool_calls\nin response?}
        O3 -- yes --> O4[ToolRegistry.call\ntool_name · method · args]
        O4 --> O5[Append TraceStep\ntool_call + observation]
        O5 --> O2
        O3 -- no / max rounds --> O6[Extract final_answer]
    end

    F -- direct_qa --> FA
    F -- openai --> OA

    FA --> G[Trace]
    OA --> G

    G --> H[TraceWriter.write_trace\n→ traces/<task_id>_trace.json]

    G --> I[AggregateScorer.score]

    subgraph SCORE["Scoring Dimensions"]
        S1[OutcomeScorer\nexact / semantic / numeric]
        S2[ToolUseScorer\ncoverage · precision · redundancy]
        S3[GroundingScorer\ntoken overlap]
        S4[GovernanceScorer\npermission violations]
        S5[EfficiencyScorer\nstep count penalty]
        S6[RobustnessScorer\nscore variance]
    end

    I --> S1 & S2 & S3 & S4 & S5 & S6
    S1 & S2 & S3 & S4 & S5 & S6 --> W[Weighted Aggregate\nper scoring_profiles.yaml]

    W --> R[BenchmarkResult\naggregate_score + DimensionScores]
    R --> J[TraceWriter.write_result\n→ results/<task_id>_result.json]

    J --> K([Return BenchmarkResult])

11.3 Component Architecture

graph LR
    subgraph Schemas["schemas/"]
        TS[TaskSpec]
        ES[EnvironmentBundle]
        TR[Trace + TraceStep]
        TC[ToolCall + Observation]
        BR[BenchmarkResult\nDimensionScores]
        SC[ScoringConfig\nWeightProfile]
    end

    subgraph Loaders["loaders/"]
        TL[TaskLoader]
        EL[EnvironmentLoader]
        REG[BenchmarkRegistry]
    end

    subgraph Adapters["adapters/"]
        BA[BaseAdapter]
        DQA[DirectQAAdapter]
        OAI[OpenAIAdapter]
    end

    subgraph Runners["runners/"]
        CTX[ExecutionContext]
        RUN[BenchmarkRunner]
        TW[TraceWriter]
    end

    subgraph Tools["tools/"]
        BT[BaseTool]
        SL[MockSlurmTool\nquery_jobs · job_details\nlist_nodes · list_partitions]
        DO[MockDocsTool\nretrieve · list_docs]
        RB[MockRBACTool\ncheck · list_permissions]
        TE[MockTelemetryTool\nquery_memory_events · list_metrics]
        FA[MockFacilityTool\nquery_node_power · query_cluster_energy\nquery_rack_telemetry · list_inventory]
        TR2[ToolRegistry\nenforce allowed_tools]
    end

    subgraph Scorers["scorers/"]
        BS[BaseScorer]
        OS[OutcomeScorer]
        TUS[ToolUseScorer]
        GRS[GroundingScorer]
        GOS[GovernanceScorer]
        EFS[EfficiencyScorer]
        RBS[RobustnessScorer]
        AGG[AggregateScorer]
    end

    subgraph Reports["reports/"]
        JR[JsonReport]
        HR[HtmlReport]
        SL2[Slices\nrole × category matrix]
    end

    TL --> TS
    EL --> ES
    REG --> TL & EL

    BA --> DQA & OAI
    BT --> SL & DO & RB & TE & FA
    SL & DO & RB & TE & FA --> TR2

    RUN --> REG
    RUN --> CTX
    RUN --> TR2
    CTX --> BA
    BA --> TR
    TR --> TW
    TR --> AGG

    BS --> OS & TUS & GRS & GOS & EFS & RBS
    OS & TUS & GRS & GOS & EFS & RBS --> AGG
    AGG --> BR
    BR --> TW

    TW --> JR & HR & SL2

11.4 Environment Snapshot Structure

graph TD
    subgraph ENV["benchmark/environments/env_NN/"]
        META["metadata.yaml\nenv_id · snapshot · cluster\nroles · categories · status"]
        SLURM["slurm/\nslurm_state.json\njob_details.json"]
        TELEM["telemetry/\nmemory_events.csv\ntelemetry_timeseries.parquet"]
        DOCS["docs/\n*.md (HPC documentation)"]
        POL["policy/\nrbac_policy.yaml"]
        INC["incidents/\nincident_metadata.json"]
        POWER["power/\nnode_power.csv\nrack_telemetry.csv\ninventory.json"]
    end

    META --> ENV_BUNDLE[EnvironmentBundle]
    SLURM --> MockSlurmTool
    TELEM --> MockTelemetryTool
    DOCS --> MockDocsTool
    POL --> MockRBACTool
    POWER --> MockFacilityTool

11.5 Scoring Pipeline

flowchart LR
    T[Trace] --> OS
    T --> TUS
    T --> GRS
    T --> GOS
    T --> EFS

    TASK[TaskSpec\ngold_answer\ngold_evidence_refs\nhard_fail_conditions] --> OS & TUS & GRS & GOS

    OS["OutcomeScorer\nexact_match\nsemantic_match\nnumeric\n→ 0–1"] --> AGG
    TUS["ToolUseScorer\ncoverage\nprecision\nno-redundancy\n→ 0–1"] --> AGG
    GRS["GroundingScorer\ntoken overlap\nobs vs answer\n→ 0–1"] --> AGG
    GOS["GovernanceScorer\npermission violations\n-0.25 per violation\n→ 0–1"] --> AGG
    EFS["EfficiencyScorer\nstep count linear\n≤5 → 1.0, ≥20 → 0.0\n→ 0–1"] --> AGG

    PROFILE["WeightProfile\n(scoring_profiles.yaml)\ndefault_hpc_v01\nalpha1_grounding\nalpha0_minimal"] --> AGG

    AGG["AggregateScorer\nweighted sum\nhard-fail check"] --> BR

    BR["BenchmarkResult\naggregate_score\nDimensionScores\nhard_fail"]
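
The GroundingScorer node in the diagram compares key tokens in the answer with those found in tool observations. A minimal sketch of that overlap ratio follows, using the three token classes named in Section 4 (multi-digit numbers, HPC entities such as node* and gpu*, and status words). The regular expression, the list of status words, and the choice to score an answer with no key tokens as 1.0 are assumptions of this sketch.

```python
import re

# Assumed patterns for the three key-token classes.
KEY_TOKEN = re.compile(
    r"\b(?:\d{2,}|(?:node|gpu)\w*|failed|completed|running|pending|down|idle)\b",
    re.IGNORECASE,
)


def key_tokens(text):
    """Lower-cased set of key tokens found in text."""
    return {tok.lower() for tok in KEY_TOKEN.findall(text)}


def grounding_score(answer, observations):
    """Share of the answer's key tokens that also appear in some observation."""
    answer_tokens = key_tokens(answer)
    if not answer_tokens:
        return 1.0   # assumption: nothing to ground counts as fully grounded
    observed = set()
    for obs in observations:
        observed |= key_tokens(obs)
    return len(answer_tokens & observed) / len(answer_tokens)


answer = "Job 48213 FAILED on node017 after 3600 seconds"
observations = ['{"job_id": 48213, "state": "FAILED", "node": "node017"}']
print(sorted(key_tokens(answer)))
print(grounding_score(answer, observations))
```

Here the runtime figure 3600 never appears in an observation, so it counts against the answer as an ungrounded claim.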

11.6 Role-Based Access Control Flow

flowchart TD
    TASK[TaskSpec\nrole: scientific_user\nallowed_tools: slurm · docs · telemetry] --> RUN[BenchmarkRunner]

    RUN --> TR2[ToolRegistry\nfiltered to allowed_tools]

    TR2 --> CALL{Tool Call}

    CALL -- "slurm.query_jobs()" --> SLURM[MockSlurmTool]
    CALL -- "docs.retrieve()" --> DOCS[MockDocsTool]
    CALL -- "facility.query_power()" --> DENY[ToolResult\npermission_denied=True]

    SLURM --> RBAC{Role Check}
    RBAC -- "scientific_user\n→ own jobs only" --> OWN[Filtered results\nown user's jobs]
    RBAC -- "sysadmin\n→ all jobs" --> ALL[All job results]

    DENY --> GOS[GovernanceScorer\npenalize violation]
    OWN --> TRACE[TraceStep observation]
    ALL --> TRACE
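
The DENY branch of the flow above ends at GovernanceScorer. One way the penalty constants from Section 4 might be applied is sketched below, assuming each violation subtracts its penalty from 1.0 and the result is clamped to [0, 1]. The per-step flag names (`forbidden_call`, `permission_denied`) and the function name are hypothetical, and the real scorer also produces a ViolationVector, which is omitted here.

```python
FORBIDDEN_CALL_PENALTY = 0.50      # constant named in Section 4
PERMISSION_DENIED_PENALTY = 0.25   # constant named in Section 4


def governance_score(steps):
    """Subtract a penalty per violation and clamp the score to [0, 1]."""
    score = 1.0
    denied = False
    for step in steps:
        if step.get("forbidden_call"):
            score -= FORBIDDEN_CALL_PENALTY
        if step.get("permission_denied"):
            score -= PERMISSION_DENIED_PENALTY
            denied = True
    score = max(0.0, score)
    return {
        "score": score,
        "rbac_compliant": score == 1.0,   # True iff no penalty was applied
        "hard_fail": denied,              # permission_denied propagates as hard fail
    }


clean = governance_score([{"tool": "slurm.query_jobs"}])
one_denied = governance_score([{"tool": "facility.query_power", "permission_denied": True}])
print(clean)
print(one_denied)
```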

11.7 CLI Command Map

graph TD
    CLI([aobench]) --> RUN[run]
    CLI --> VAL[validate]
    CLI --> REP[report]
    CLI --> CMP[compare]
    CLI --> ROB[robustness]

    RUN --> RT["run task\n--task ID --env ID\n--adapter direct_qa|openai:gpt-4o"]
    RUN --> RA["run all\n--adapter NAME"]

    VAL --> VB["validate benchmark\n--benchmark DIR"]

    REP --> RJ["report json RUN_DIR\n→ run_summary.json"]
    REP --> RH["report html RUN_DIR\n→ report.html"]
    REP --> RS["report slices RUN_DIR\n→ role×category table"]

    CMP --> CR["compare runs RUN_A RUN_B\n→ diff JSON"]

    ROB --> RBT["robustness task\n--task ID --env ID\n--adapter NAME --n N\n→ mean · std · robustness_score"]