Open Source · HPC Benchmarking · AI Evaluation

AOBench

Benchmark framework for evaluating AI agents in High-Performance Computing (HPC) environments: role-aware, tool-using, permission-aware, trace-based, and reproducible.


Five Benchmark Principles

Principle	Meaning
Role-aware	The same question yields different answers and tool access depending on the requester role.
Tool-using	Agents are evaluated as systems that call HPC-native tools (SLURM, telemetry, docs, RBAC, facility).
Permission-aware	Success requires respecting RBAC and refusing out-of-scope requests. Permission violations hard-fail the task.
Trace-based	Evaluation considers the full execution trace — tool selection, arguments, sequence, and grounding — not just the final answer.
Reproducible	Runs target deterministic snapshot bundles, never live infrastructure.
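To make the permission-aware and trace-based principles concrete, here is a minimal sketch of how a trace scorer could hard-fail on permission violations. It is not AOBench's actual scorer: the `ALLOWED_TOOLS` mapping, the `user` and `admin` roles, the `ToolCall` class, the `score_trace` function, and the 50/50 weighting are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-tool permissions (assumption, not AOBench's RBAC model).
ALLOWED_TOOLS = {
    "user": {"slurm", "docs"},
    "admin": {"slurm", "docs", "telemetry", "rbac", "facility"},
}


@dataclass
class ToolCall:
    """One step in an agent's execution trace (hypothetical structure)."""
    tool: str
    arguments: dict = field(default_factory=dict)


def score_trace(role, trace, expected_tools, answer_correct):
    """Return a score in [0, 1]; any out-of-scope tool call hard-fails the task."""
    allowed = ALLOWED_TOOLS.get(role, set())

    # Permission-aware: a single violation zeroes the score,
    # even if the final answer is correct.
    if any(call.tool not in allowed for call in trace):
        return 0.0

    # Trace-based: credit the fraction of expected tools that were called.
    used = [call.tool for call in trace]
    if expected_tools:
        tool_score = sum(1 for t in expected_tools if t in used) / len(expected_tools)
    else:
        tool_score = 1.0

    answer_score = 1.0 if answer_correct else 0.0

    # Illustrative equal weighting of trace quality and answer quality.
    return 0.5 * tool_score + 0.5 * answer_score
```

In this sketch a `user` whose trace calls `telemetry` scores 0.0 regardless of the final answer, while the same call made by an `admin` is scored normally.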

Quick Start

pip install "aobench[openai]"

# Validate all task specs and environment bundles
aobench validate benchmark

# Run one task end-to-end with the zero-tool baseline
aobench run task --task JOB_USR_001 --env env_01 --adapter direct_qa

# Generate a report
aobench report json --run data/runs/<run-id>
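To script the run step, for example to loop over several tasks, the command line can be assembled programmatically. The sketch below assumes only the `aobench run task` flags shown in the commands above; the helper name `build_run_command` is hypothetical, and the code builds the command without executing it.

```python
import shlex


def build_run_command(task_id, env_id, adapter="direct_qa"):
    """Build the argv list for one `aobench run task` invocation."""
    return [
        "aobench", "run", "task",
        "--task", task_id,
        "--env", env_id,
        "--adapter", adapter,
    ]


cmd = build_run_command("JOB_USR_001", "env_01")

# Print the shell-quoted command; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True) in an environment where aobench is installed.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting problems if a task or environment identifier ever contains unusual characters.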

Where to Go Next

Framework

Benchmark methodology, evaluation protocol, HPC environments, and scoring design.

Read the framework docs

Guides

Adapter setup, tool reference, Langfuse observability integration, and CLI usage.

Browse guides

Contributing

Add tasks, environments, adapters, or scorers. Codebase map for new contributors.

How to contribute

GitHub

Source code, issue tracker, and releases.

MSKazemi/aobench