Background — Motivation & Related Work¶

This page summarizes why AOBench exists and how it relates to the broader benchmark landscape.

1. Why AOBench Is Needed¶

AI agents are increasingly proposed for HPC support, operations, and monitoring. There is currently no benchmark that evaluates such agents under HPC constraints: scheduler-aware reasoning, telemetry interpretation, role-conditioned answers, permission-sensitive behavior, and reproducible operational-state evaluation.

Existing benchmarks cover web agents, software-engineering agents, enterprise workflows, and cloud operations—but not the combination required in HPC.

2. The Gap¶

No existing benchmark, to our knowledge, simultaneously provides:

HPC-native tool environment (schedulers, telemetry, docs, energy, facility)
Role-aware task variants (same question, different answers by role)
Deterministic HPC state snapshots for reproducibility
Permission compliance scoring (refusal quality, redaction, RBAC)
Trace-based evaluation (tool selection, arguments, grounding)
Energy- and facility-aware tasks as first-class components

3. Benchmark Landscape (Condensed)¶

Category	Examples	Relevance
General agent	AgentBench, GAIA	Broad capability; not HPC-specific
Environment-based	WebArena, OSWorld	Executable settings; methodologically relevant
Domain-specific	τ-bench, Cloud-OpsBench	Policy-aware, deterministic state; closest analogues
Tool-use	TRAIL, ToolBench	API selection, argument correctness
Trace-based	Various	Process scoring, not just outcome

τ-bench (policy-aware) and Cloud-OpsBench (deterministic operational state) are the closest methodological analogues. Neither addresses HPC-native tools, role-aware access, or energy/facility reasoning.

4. AOBench's Contribution¶

New benchmark domain: HPC agent systems
Role-aware task model: evaluation conditioned on stakeholder role
Deterministic replay environment: reproducible HPC telemetry and policy
Trace-based protocol: scoring outcomes and agent trajectories
Multi-dimensional scorecard: accuracy, tool use, grounding, compliance, robustness, efficiency

5. Positioning¶

AOBench is a benchmark framework for evaluating AI agents in HPC environments using role-aware tasks, deterministic HPC state snapshots, and trace-based scoring.