Open Source · HPC Benchmarking · AI Evaluation
AOBench
Benchmark framework for evaluating AI agents in High-Performance Computing (HPC) environments: role-aware, tool-using, trace-based, and reproducible.
Five Benchmark Principles
| Principle | Meaning |
|---|---|
| Role-aware | The same question yields different answers and tool access depending on the requester role. |
| Tool-using | Agents are evaluated as systems that call HPC-native tools (SLURM, telemetry, docs, RBAC, facility). |
| Permission-aware | Success requires respecting RBAC and refusing out-of-scope requests. Permission violations hard-fail the task. |
| Trace-based | Evaluation considers the full execution trace (tool selection, arguments, sequence, and grounding), not just the final answer. |
| Reproducible | Runs target deterministic snapshot bundles, never live infrastructure. |
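
To make the permission-aware and trace-based principles concrete, here is a minimal Python sketch of how a scorer might combine them. It is not AOBench's actual scoring code: the role names, the `ROLE_TOOLS` mapping, the `ToolCall` type, the `score_trace` function, and the 50/50 weighting are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical role -> allowed-tool mapping (an assumption, not AOBench's RBAC policy).
ROLE_TOOLS = {
    "user": {"slurm", "docs"},
    "sysadmin": {"slurm", "docs", "telemetry", "rbac", "facility"},
}


@dataclass
class ToolCall:
    """One step of an agent's execution trace (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


def score_trace(role: str, trace: list[ToolCall],
                expected_tools: list[str], answer_correct: bool) -> float:
    """Score a run from its full trace, not only its final answer."""
    allowed = ROLE_TOOLS[role]

    # Permission-aware: any out-of-scope tool call hard-fails the task.
    if any(call.tool not in allowed for call in trace):
        return 0.0

    # Trace-based: full credit for the expected tools in the expected order,
    # half credit for the right tools in a different order, none otherwise.
    used = [call.tool for call in trace]
    if used == expected_tools:
        tool_score = 1.0
    elif set(used) == set(expected_tools):
        tool_score = 0.5
    else:
        tool_score = 0.0

    # Illustrative weighting: half for tool use, half for the final answer.
    return 0.5 * tool_score + 0.5 * (1.0 if answer_correct else 0.0)


# Role-aware: the same tool call is fine for one role and a hard fail for another.
telemetry_call = [ToolCall("telemetry", {"node": "n001"})]
admin_score = score_trace("sysadmin", telemetry_call, ["telemetry"], True)
user_score = score_trace("user", telemetry_call, ["telemetry"], True)
```

In this sketch a correct final answer cannot rescue a run that touched a tool outside the requester's role, which mirrors the hard-fail rule in the table.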
Quick Start
```bash
pip install "aobench[openai]"

# Validate all task specs and environment bundles
aobench validate benchmark

# Run one task end-to-end with the zero-tool baseline
aobench run task --task JOB_USR_001 --env env_01 --adapter direct_qa

# Generate a report
aobench report json --run data/runs/<run-id>
```
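
If you prefer to drive the validate and run steps from a script, the following Python sketch shows one way to do it. It uses only the subcommands and flags shown above; the helper names `quick_start_commands` and `run_all` are hypothetical. The report step is left out because its run directory depends on the run ID that the run step produces.

```python
import shlex
import subprocess


def quick_start_commands(task: str, env: str, adapter: str) -> list[list[str]]:
    """Build argv lists for the validate and run steps shown above."""
    return [
        ["aobench", "validate", "benchmark"],
        ["aobench", "run", "task", "--task", task, "--env", env, "--adapter", adapter],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, check=True)


commands = quick_start_commands("JOB_USR_001", "env_01", "direct_qa")
run_all(commands)  # dry run: prints the commands without executing them
```

Call `run_all(commands, dry_run=False)` once `aobench` is installed to execute the steps for real.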
Where to Go Next
- Framework: Benchmark methodology, evaluation protocol, HPC environments, and scoring design.
- Guides: Adapter setup, tool reference, Langfuse observability integration, and CLI usage.
- Contributing: Add tasks, environments, adapters, or scorers. Codebase map for new contributors.
- GitHub: Source code, issue tracker, and releases.