Adapters & Tools: Plain-English Overview
A quick guide to how adapters and tools work in AOBench.
Tools
What they are: Mock HPC services that agents call during a run. They simulate scheduler queries, docs lookups, telemetry, and RBAC checks.
| What | Description |
|---|---|
| Purpose | Answer tool calls from the agent by reading data from the environment snapshot |
| Data source | benchmark/environments/<env_id>/ (JSON, CSV, and YAML files) |
| Role awareness | Behavior changes by role (e.g. scientific_user sees only their own jobs) |
| Examples | slurm__query_jobs, slurm__job_details, docs__retrieve, telemetry__query_memory_events, rbac__check, facility__query_node_power, facility__query_rack_telemetry |
Flow: Agent calls a tool → tool reads env data → returns a result (or permission denied).
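The flow above can be sketched as a small role-aware function. This is only an illustration, not AOBench's actual tool code: the in-memory `SNAPSHOT`, the `query_jobs` signature, the observation shape, and the rule that `facility_admin` sees every job are all assumptions made for the example.

```python
from typing import Any, Dict

# Hypothetical stand-in for data a tool would read from the environment
# snapshot (for example a file under benchmark/environments/<env_id>/).
SNAPSHOT: Dict[str, Any] = {
    "jobs": [
        {"job_id": 101, "user": "alice", "state": "RUNNING"},
        {"job_id": 102, "user": "bob", "state": "PENDING"},
    ]
}


def query_jobs(snapshot: Dict[str, Any], role: str, user: str) -> Dict[str, Any]:
    """Return the jobs visible to the caller, or a permission_denied observation."""
    if role == "scientific_user":
        # Role awareness: a scientific_user sees only their own jobs.
        visible = [job for job in snapshot["jobs"] if job["user"] == user]
    elif role == "facility_admin":
        # Assumption for this sketch: this role may see every job.
        visible = list(snapshot["jobs"])
    else:
        # Any other role gets a denial instead of data.
        return {"status": "permission_denied", "role": role}
    return {"status": "ok", "jobs": visible}


own_jobs = query_jobs(SNAPSHOT, "scientific_user", "alice")
denied = query_jobs(SNAPSHOT, "guest", "mallory")
```

Because the function only reads from the snapshot it is given, the same call always produces the same observation, which is what makes runs repeatable.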
Adapters
What they are: Bridges to different agent backends (OpenAI, Azure, a stub, etc.). They orchestrate the task loop and produce a trace.
| What | Description |
|---|---|
| Purpose | Run the agent on a task; mediate between the agent and tools until an answer is produced |
| Input | ExecutionContext: task, environment, tools |
| Output | A Trace: steps, tool calls, observations, final answer |
| Examples | direct_qa: stub that returns a placeholder with no tools; openai: OpenAI/Azure API with function calling |
Flow: Receive task + context → send to agent → when agent wants tools, call tools and feed results back → repeat until agent stops with a final answer → return trace.
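A minimal sketch of that loop, assuming a scripted stand-in for the agent, might look like the following. The `Trace` and `TraceStep` fields, the action dictionary format, and `run_task` itself are hypothetical names chosen for illustration; the real adapter interfaces may differ. The cap of 10 rounds is taken from the overview diagram in this guide.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

MAX_ROUNDS = 10  # round cap mentioned in the overview diagram


@dataclass
class TraceStep:
    tool: str
    arguments: Dict[str, Any]
    observation: Any


@dataclass
class Trace:
    steps: List[TraceStep] = field(default_factory=list)
    final_answer: Optional[str] = None


def run_task(query: str,
             agent: Callable[[str, List[TraceStep]], Dict[str, Any]],
             tools: Dict[str, Callable[..., Any]],
             max_rounds: int = MAX_ROUNDS) -> Trace:
    """Drive the agent until it gives a final answer or the round cap is hit."""
    trace = Trace()
    for _ in range(max_rounds):
        action = agent(query, trace.steps)
        if action["type"] == "final":
            trace.final_answer = action["answer"]
            break
        # The agent asked for a tool: call it and record call + observation.
        observation = tools[action["tool"]](**action["arguments"])
        trace.steps.append(TraceStep(action["tool"], action["arguments"], observation))
    return trace


def scripted_agent(query: str, steps: List[TraceStep]) -> Dict[str, Any]:
    """Hypothetical agent: one tool call, then an answer based on the result."""
    if not steps:
        return {"type": "tool", "tool": "slurm__query_jobs",
                "arguments": {"user": "alice"}}
    return {"type": "final", "answer": f"{len(steps[-1].observation)} job(s) found"}


tools = {"slurm__query_jobs": lambda user: [{"job_id": 101, "user": user}]}
trace = run_task("How many jobs does alice have?", scripted_agent, tools)
```

Swapping `scripted_agent` for a function that calls a model API is the only change needed to turn this sketch into a different backend; the loop and the trace stay the same.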
Connect-to-agent mode (planned post-v0.1): New adapters will invoke external HPC agents (ODA, ExaSage, …) deployed on or near clusters via HTTP / MCP / FastAPI. AOBench connects to the agent's API and scores its trace; AOBench never needs direct access to real SLURM or the cluster.
How They Work Together
┌─────────────────────────────────────────────────────────────┐
│ Runner                                                      │
│ 1. Loads task + environment (deterministic snapshot)        │
│ 2. Builds ToolRegistry: methods filtered by role            │
│ 3. Passes task, tools → Adapter; receives Trace back        │
│ 4. Passes Trace + task → Scorers; returns BenchmarkResult   │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Adapter (e.g. OpenAIAdapter)                                │
│ • Sends task.query_text + tool schemas → LLM                │
│ • When LLM calls a tool → ToolRegistry.dispatch() ─────┐    │
│ • Appends [tool call + observation] as a TraceStep ◄───┘    │
│ • Repeats until LLM gives a final answer (≤ 10 rounds)      │
│ • Returns completed Trace                                   │
└──────────────────────────────┬──────────────────────────────┘
                      dispatch │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Tools (slurm · docs · telemetry · rbac · facility)          │
│ • Run in-process; no external APIs                          │
│ • Read from env bundle (deterministic snapshot)             │
│ • Wrong role → permission_denied observation                │
│ • Return JSON observation → Adapter                         │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               │ ← Adapter returns completed Trace
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Scorers (invoked by Runner · AggregateScorer)               │
│ • OutcomeScorer: final answer vs gold                       │
│ • ToolUseScorer: tool selection & argument quality          │
│ • GovernanceScorer: RBAC compliance, permission checks      │
│ • GroundingScorer: answer grounded in observations          │
│ • EfficiencyScorer: step count (≤ 5 steps → 1.0)            │
│                                                             │
│ → BenchmarkResult: aggregate_score (0-1) + DimensionScores  │
└─────────────────────────────────────────────────────────────┘
Tool Reference
| Tool | Methods | Data source | Notes |
|---|---|---|---|
| slurm | query_jobs, job_details, list_nodes, list_partitions | slurm/slurm_state.json, slurm/job_details.json | Role-aware: scientific_user sees own jobs only |
| docs | retrieve | docs/*.md | Keyword search over documentation files |
| telemetry | query_memory_events, list_metrics | telemetry/*.csv | Memory event time series |
| rbac | check | policy/*.yaml | Permission checks by role + resource + action |
| facility | query_node_power, query_cluster_energy, query_rack_telemetry, list_inventory | power/*.csv, rack/*.csv, inventory/*.csv | For facility_admin role; ENERGY tasks |
Scorers Reference
| Scorer | Dimension | What it measures |
|---|---|---|
| OutcomeScorer | outcome | Quality of final answer vs gold (exact / semantic / numeric match) |
| ToolUseScorer | tool_use | Tool selection coverage, precision, no redundancy |
| GroundingScorer | grounding | Fraction of answer's key claims supported by tool observations |
| GovernanceScorer | governance | RBAC compliance; penalises permission violations |
| EfficiencyScorer | efficiency | Step count efficiency (≤5 steps = 1.0, ≥20 = 0.0) |
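To make the efficiency rule concrete, here is one possible mapping from step count to score. The two endpoints (5 steps or fewer gives 1.0, 20 or more gives 0.0) come from the table; the straight-line interpolation between them is an assumption for this sketch, since the table does not say how intermediate counts are scored, and `efficiency_score` is a hypothetical name.

```python
def efficiency_score(step_count: int, full_at: int = 5, zero_at: int = 20) -> float:
    """Map a trace's step count to a score in [0, 1]."""
    if step_count <= full_at:
        return 1.0
    if step_count >= zero_at:
        return 0.0
    # Assumed behaviour between the endpoints: linear decay from 1.0 to 0.0.
    return (zero_at - step_count) / (zero_at - full_at)


short_run = efficiency_score(3)
long_run = efficiency_score(25)
middle_run = efficiency_score(11)  # (20 - 11) / (20 - 5)
```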
Scoring Profiles
| Profile | Use when | grounding weight |
|---|---|---|
| alpha0_minimal | Tasks with no tool expectation / stubs | 0.00 |
| alpha1_grounding | Tasks where tool evidence is expected | 0.20 |
| default_hpc_v01 | Full production benchmark | 0.15 |
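The sketch below shows how a profile's grounding weight could feed into a weighted aggregate. Only the three grounding weights are taken from the table. The equal split of the remaining weight across the other four dimensions is a placeholder and not the real profile definition, and `profile_weights` and `aggregate` are hypothetical helper names.

```python
from typing import Dict

# Grounding weights as listed in the profiles table.
GROUNDING_WEIGHT: Dict[str, float] = {
    "alpha0_minimal": 0.00,
    "alpha1_grounding": 0.20,
    "default_hpc_v01": 0.15,
}

OTHER_DIMENSIONS = ("outcome", "tool_use", "governance", "efficiency")


def profile_weights(profile: str) -> Dict[str, float]:
    """Build per-dimension weights that sum to 1.0 for a profile."""
    grounding = GROUNDING_WEIGHT[profile]
    # Placeholder assumption: the rest is split equally; real profiles may differ.
    share = (1.0 - grounding) / len(OTHER_DIMENSIONS)
    weights = {dimension: share for dimension in OTHER_DIMENSIONS}
    weights["grounding"] = grounding
    return weights


def aggregate(scores: Dict[str, float], profile: str) -> float:
    """Weighted sum of dimension scores, each expected in [0, 1]."""
    weights = profile_weights(profile)
    return sum(weights[dimension] * scores[dimension] for dimension in weights)


# A run that is perfect everywhere except grounding.
scores = {"outcome": 1.0, "tool_use": 1.0, "governance": 1.0,
          "efficiency": 1.0, "grounding": 0.0}
minimal = aggregate(scores, "alpha0_minimal")      # grounding is ignored
grounded = aggregate(scores, "alpha1_grounding")   # grounding costs 0.20
```

Under these assumptions the same trace scores differently per profile, which is why a stub adapter with no tool calls is best evaluated with alpha0_minimal.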
One-Line Summary
| Component | Role |
|---|---|
| Tools | Simulate HPC APIs: read env data, enforce permissions |
| Adapters | Connect to agent backends: drive the task loop, use tools when the agent requests them |
| Scorers | Evaluate the trace on 5 dimensions and aggregate into one score |