
Adapters & Tools: Plain-English Overview

A quick guide to how adapters and tools work in AOBench.


Tools

What they are: Mock HPC services that agents call during a run. They simulate scheduler queries, docs lookups, telemetry, and RBAC checks.

| What | Description |
| --- | --- |
| Purpose | Answer tool calls from the agent by reading data from the environment snapshot |
| Data source | benchmark/environments/<env_id>/ (JSON, CSV, YAML files) |
| Role awareness | Behavior changes by role (e.g. scientific_user sees only their own jobs) |
| Examples | slurm__query_jobs, slurm__job_details, docs__retrieve, telemetry__query_memory_events, rbac__check, facility__query_node_power, facility__query_rack_telemetry |

Flow: Agent calls a tool → tool reads env data → returns a result (or permission denied).
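
The flow above can be sketched as a small role-aware mock tool. This is a minimal illustration, not AOBench's actual implementation: the function signature, the JSON layout (`jobs`, `user`, `job_id`) and the `sysadmin` role are assumptions made for the example. Only the file location and the scientific_user behavior come from this page.

```python
import json
import tempfile
from pathlib import Path


def query_jobs(env_dir: Path, role: str, user: str) -> dict:
    """Hypothetical mock of slurm__query_jobs: read the snapshot, filter by role."""
    # File location follows the Tool Reference table; the JSON layout is assumed.
    state = json.loads((env_dir / "slurm" / "slurm_state.json").read_text())
    jobs = state["jobs"]
    if role == "scientific_user":
        # Role awareness: a scientific_user sees only their own jobs.
        jobs = [job for job in jobs if job["user"] == user]
    return {"status": "ok", "jobs": jobs}


# Build a tiny throwaway snapshot so the sketch runs on its own.
env_dir = Path(tempfile.mkdtemp())
(env_dir / "slurm").mkdir()
(env_dir / "slurm" / "slurm_state.json").write_text(json.dumps({
    "jobs": [
        {"job_id": 1, "user": "alice", "state": "RUNNING"},
        {"job_id": 2, "user": "bob", "state": "PENDING"},
    ]
}))

own_jobs = query_jobs(env_dir, role="scientific_user", user="alice")
all_jobs = query_jobs(env_dir, role="sysadmin", user="root")  # "sysadmin" is a made-up role
```

Because the tool only reads files from the snapshot directory, the same call with the same role returns the same observation on every run.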


Adapters

What they are: Bridges to different agent backends (OpenAI, Azure, a stub, etc.). They orchestrate the task loop and produce a trace.

| What | Description |
| --- | --- |
| Purpose | Run the agent on a task; mediate between the agent and tools until an answer is produced |
| Input | ExecutionContext (task, environment, tools) |
| Output | A Trace (steps, tool calls, observations, final answer) |
| Examples | direct_qa (stub that returns a placeholder and uses no tools); openai (OpenAI/Azure API with function calling) |

Flow: Receive task + context → send to agent → when agent wants tools, call tools and feed results back → repeat until agent stops with a final answer → return trace.
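
As a rough sketch of that loop, assuming a scripted stand-in for the agent backend: `run_task`, `scripted_agent`, `fake_dispatch` and the field names on `Trace` and `TraceStep` are hypothetical and chosen for illustration. The cap of 10 rounds matches the limit shown in the diagram in "How They Work Together".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TraceStep:
    tool_name: str
    arguments: dict
    observation: dict


@dataclass
class Trace:
    steps: list = field(default_factory=list)
    final_answer: Optional[str] = None


def run_task(query: str, agent: Callable, dispatch: Callable, max_rounds: int = 10) -> Trace:
    """Drive the agent until it answers or the round limit is reached."""
    trace = Trace()
    for _ in range(max_rounds):
        reply = agent(query, trace.steps)
        if "tool" not in reply:
            # No tool requested: the agent has produced its final answer.
            trace.final_answer = reply["answer"]
            break
        # Call the requested tool and record call + observation as one step.
        observation = dispatch(reply["tool"], reply["arguments"])
        trace.steps.append(TraceStep(reply["tool"], reply["arguments"], observation))
    return trace


def scripted_agent(query: str, steps: list) -> dict:
    """Stand-in for an LLM backend: one tool call, then an answer."""
    if not steps:
        return {"tool": "slurm__query_jobs", "arguments": {"user": "alice"}}
    return {"answer": f"Found {len(steps[0].observation['jobs'])} job(s)."}


def fake_dispatch(name: str, arguments: dict) -> dict:
    """Stand-in for the tool layer."""
    return {"jobs": [{"job_id": 1, "user": arguments["user"]}]}


trace = run_task("How many jobs does alice have?", scripted_agent, fake_dispatch)
```

If the agent never stops asking for tools, this sketch returns a trace whose final answer is still unset; how the real adapters handle that case is not described here.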

Connect-to-agent mode (planned post-v0.1): New adapters will invoke external HPC agents (ODA, ExaSage, …) deployed on or near clusters via HTTP / MCP / FastAPI. AOBench connects to the agent's API and scores its trace; AOBench never needs direct access to real SLURM or the cluster.


How They Work Together

┌─────────────────────────────────────────────────────────────┐
│  Runner                                                     │
│  1. Loads task + environment (deterministic snapshot)       │
│  2. Builds ToolRegistry (methods filtered by role)          │
│  3. Passes task, tools → Adapter; receives Trace back       │
│  4. Passes Trace + task → Scorers; returns BenchmarkResult  │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Adapter  (e.g. OpenAIAdapter)                              │
│  • Sends task.query_text + tool schemas → LLM               │
│  • When LLM calls a tool → ToolRegistry.dispatch()   ─────┐ │
│  • Appends [tool call + observation] as a TraceStep  ◄────┘ │
│  • Repeats until LLM gives a final answer  (≤ 10 rounds)    │
│  • Returns completed Trace                                  │
└──────────────────────────────┬──────────────────────────────┘
                    dispatch   │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Tools  (slurm · docs · telemetry · rbac · facility)        │
│  • Run in-process; no external APIs                         │
│  • Read from env bundle (deterministic snapshot)            │
│  • Wrong role → permission_denied observation               │
│  • Return JSON observation → Adapter                        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               │  ← Adapter returns completed Trace
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Scorers  (invoked by Runner · AggregateScorer)             │
│  • OutcomeScorer:      final answer vs gold                 │
│  • ToolUseScorer:      tool selection & argument quality    │
│  • GovernanceScorer:   RBAC compliance, permission checks   │
│  • GroundingScorer:    answer grounded in observations      │
│  • EfficiencyScorer:   step count  (≤ 5 steps → 1.0)        │
│                                                             │
│  → BenchmarkResult: aggregate_score (0–1) + DimensionScores │
└─────────────────────────────────────────────────────────────┘
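
The registry step in the diagram (methods filtered by role, wrong role yields a permission_denied observation) might look roughly like the sketch below. The class layout, the `allowed_roles` mapping, the observation shape and the two stub tools are assumptions for illustration; only the names ToolRegistry, dispatch and permission_denied appear in this overview.

```python
class ToolRegistry:
    """Hypothetical registry that exposes only the methods a role may call."""

    def __init__(self, tools: dict, allowed_roles: dict, role: str):
        self.role = role
        self._known = set(tools)
        # Keep only the tools this role is allowed to use.
        self._tools = {
            name: fn for name, fn in tools.items()
            if role in allowed_roles.get(name, ())
        }

    def schemas(self) -> list:
        """Names that would be advertised to the agent for this role."""
        return sorted(self._tools)

    def dispatch(self, name: str, arguments: dict) -> dict:
        if name not in self._known:
            return {"status": "error", "error": "unknown_tool"}
        if name not in self._tools:
            # The tool exists but the role may not call it.
            return {"status": "error", "error": "permission_denied"}
        return {"status": "ok", "result": self._tools[name](**arguments)}


def slurm_query_jobs(user: str) -> list:
    return [{"job_id": 1, "user": user}]


def facility_query_node_power(node: str) -> dict:
    return {"node": node, "watts": 350}  # made-up reading


TOOLS = {
    "slurm__query_jobs": slurm_query_jobs,
    "facility__query_node_power": facility_query_node_power,
}
# Assumed role mapping; the real policy is not specified on this page.
ALLOWED_ROLES = {
    "slurm__query_jobs": ("scientific_user", "facility_admin"),
    "facility__query_node_power": ("facility_admin",),
}

registry = ToolRegistry(TOOLS, ALLOWED_ROLES, role="scientific_user")
allowed = registry.dispatch("slurm__query_jobs", {"user": "alice"})
denied = registry.dispatch("facility__query_node_power", {"node": "n01"})
```

Returning the denial as an ordinary observation, rather than raising an exception, lets the attempt be recorded in the trace like any other tool call.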

Tool Reference

| Tool | Methods | Data source | Notes |
| --- | --- | --- | --- |
| slurm | query_jobs, job_details, list_nodes, list_partitions | slurm/slurm_state.json, slurm/job_details.json | Role-aware: scientific_user sees own jobs only |
| docs | retrieve | docs/*.md | Keyword search over documentation files |
| telemetry | query_memory_events, list_metrics | telemetry/*.csv | Memory event time series |
| rbac | check | policy/*.yaml | Permission checks by role + resource + action |
| facility | query_node_power, query_cluster_energy, query_rack_telemetry, list_inventory | power/*.csv, rack/*.csv, inventory/*.csv | For facility_admin role; ENERGY tasks |
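
To make "keyword search over documentation files" concrete, here is one simple way such a retrieve method could work. It is an assumed term-count ranking, not the documented algorithm: the `retrieve` signature, the scoring rule and the sample files are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def retrieve(docs_dir: Path, query: str, top_k: int = 3) -> list:
    """Rank docs/*.md files by how many of their words match a query term."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for path in sorted(docs_dir.glob("*.md")):
        words = re.findall(r"\w+", path.read_text().lower())
        score = sum(1 for word in words if word in terms)
        if score:
            hits.append({"doc": path.name, "score": score})
    # Highest score first; ties broken by file name for determinism.
    hits.sort(key=lambda hit: (-hit["score"], hit["doc"]))
    return hits[:top_k]


# Two throwaway documents so the sketch runs on its own.
docs_dir = Path(tempfile.mkdtemp())
(docs_dir / "memory.md").write_text("Jobs that exceed memory limits are killed.")
(docs_dir / "queues.md").write_text("Partitions define queue limits.")

hits = retrieve(docs_dir, "memory limits")
```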

Scorers Reference

| Scorer | Dimension | What it measures |
| --- | --- | --- |
| OutcomeScorer | outcome | Quality of final answer vs gold (exact / semantic / numeric match) |
| ToolUseScorer | tool_use | Tool selection coverage, precision, no redundancy |
| GroundingScorer | grounding | Fraction of answer's key claims supported by tool observations |
| GovernanceScorer | governance | RBAC compliance; penalises permission violations |
| EfficiencyScorer | efficiency | Step count efficiency (≤ 5 steps = 1.0, ≥ 20 steps = 0.0) |
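
The efficiency row fixes only the two endpoints (5 steps or fewer scores 1.0, 20 or more scores 0.0). The sketch below fills the range in between with a straight line, which is an assumption; the actual curve used by EfficiencyScorer is not stated here, and the function and parameter names are hypothetical.

```python
def efficiency_score(step_count: int,
                     full_credit_steps: int = 5,
                     zero_credit_steps: int = 20) -> float:
    """Map a step count to [0, 1]; linear between the two documented endpoints."""
    if step_count <= full_credit_steps:
        return 1.0
    if step_count >= zero_credit_steps:
        return 0.0
    # Assumed linear falloff between 5 and 20 steps.
    return (zero_credit_steps - step_count) / (zero_credit_steps - full_credit_steps)


scores_by_steps = {steps: efficiency_score(steps) for steps in (3, 5, 10, 20, 25)}
```

Under this assumption a 10-step trace would score (20 - 10) / (20 - 5), about 0.67.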

Scoring Profiles

| Profile | Use when | Grounding weight |
| --- | --- | --- |
| alpha0_minimal | Tasks with no tool expectation / stubs | 0.00 |
| alpha1_grounding | Tasks where tool evidence is expected | 0.20 |
| default_hpc_v01 | Full production benchmark | 0.15 |
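
A profile can be read as a set of per-dimension weights applied to the five dimension scores. The sketch below uses the grounding weights from the table; splitting the remaining weight equally across the other four dimensions is purely an assumption for illustration, as are the function names and the sample scores. The real profiles may weight the other dimensions differently.

```python
# Grounding weights as listed in the Scoring Profiles table.
GROUNDING_WEIGHT = {
    "alpha0_minimal": 0.00,
    "alpha1_grounding": 0.20,
    "default_hpc_v01": 0.15,
}
OTHER_DIMENSIONS = ("outcome", "tool_use", "governance", "efficiency")


def profile_weights(profile: str) -> dict:
    """Assumed weighting: grounding from the table, the rest split equally."""
    grounding = GROUNDING_WEIGHT[profile]
    share = (1.0 - grounding) / len(OTHER_DIMENSIONS)
    weights = {dimension: share for dimension in OTHER_DIMENSIONS}
    weights["grounding"] = grounding
    return weights


def aggregate_score(scores: dict, profile: str) -> float:
    """Weighted sum of dimension scores; stays in [0, 1] as weights sum to 1."""
    weights = profile_weights(profile)
    return sum(weights[dimension] * scores[dimension] for dimension in weights)


# Made-up dimension scores for an answer with no grounding in observations.
scores = {"outcome": 1.0, "tool_use": 0.5, "governance": 1.0,
          "efficiency": 1.0, "grounding": 0.0}
minimal = aggregate_score(scores, "alpha0_minimal")
grounded = aggregate_score(scores, "alpha1_grounding")
```

With these sample scores, the ungrounded answer loses nothing under alpha0_minimal but is penalised under alpha1_grounding, which is the practical difference between the profiles.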

One-Line Summary

| Component | Role |
| --- | --- |
| Tools | Simulate HPC APIs: read env data, enforce permissions |
| Adapters | Connect to agent backends: drive the task loop, use tools when the agent requests them |
| Scorers | Evaluate the trace on 5 dimensions and aggregate into one score |