
Open Source · HPC Benchmarking · AI Evaluation

AOBench

Benchmark framework for evaluating AI agents in High-Performance Computing (HPC) environments: role-aware, tool-using, trace-based, and reproducible.



Five Benchmark Principles

Principle | Meaning
Role-aware | The same question yields different answers and tool access depending on the requester role.
Tool-using | Agents are evaluated as systems that call HPC-native tools (SLURM, telemetry, docs, RBAC, facility).
Permission-aware | Success requires respecting RBAC and refusing out-of-scope requests. Permission violations hard-fail the task.
Trace-based | Evaluation considers the full execution trace (tool selection, arguments, sequence, and grounding), not just the final answer.
Reproducible | Runs target deterministic snapshot bundles, never live infrastructure.
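
To make the role-aware, permission-aware, and trace-based principles concrete, here is a minimal sketch of how a scorer might combine them. It is not AOBench's actual scoring code: the role names, the role-to-tool map, the `ToolCall` and `score_trace` names, and the 50/50 weighting are all assumptions chosen for illustration. Only the tool families (SLURM, telemetry, docs, RBAC, facility) and the hard-fail rule come from the table above.

```python
from dataclasses import dataclass, field

# Hypothetical role -> allowed-tool map (an assumption, not AOBench's RBAC data).
ALLOWED_TOOLS = {
    "user": {"slurm", "docs"},
    "admin": {"slurm", "docs", "telemetry", "rbac", "facility"},
}


@dataclass
class ToolCall:
    """One step of an execution trace: which tool was called, with what arguments."""
    tool: str
    arguments: dict = field(default_factory=dict)


def score_trace(role, trace, expected_tools, answer_correct):
    """Return a score in [0, 1] for one task attempt.

    Any call to a tool outside the role's allowed set hard-fails the task (0.0).
    Otherwise the score mixes tool selection and answer correctness with
    illustrative equal weights.
    """
    allowed = ALLOWED_TOOLS[role]

    # Permission-aware: a single violation fails the whole task.
    if any(call.tool not in allowed for call in trace):
        return 0.0

    # Trace-based: fraction of expected tools that actually appear in the trace.
    used = {call.tool for call in trace}
    if expected_tools:
        tool_score = sum(t in used for t in expected_tools) / len(expected_tools)
    else:
        tool_score = 1.0

    answer_score = 1.0 if answer_correct else 0.0
    return 0.5 * tool_score + 0.5 * answer_score


# Same tool call, different roles: allowed for "admin", a hard fail for "user".
telemetry_call = [ToolCall("telemetry", {"node": "n001"})]
admin_score = score_trace("admin", telemetry_call, ["telemetry"], answer_correct=True)
user_score = score_trace("user", telemetry_call, ["telemetry"], answer_correct=True)
print(admin_score, user_score)
```

In this sketch the permission check runs before any partial credit is computed, so a correct final answer cannot offset an out-of-scope tool call.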

Quick Start

pip install "aobench[openai]"

# Validate all task specs and environment bundles
aobench validate benchmark

# Run one task end-to-end with the zero-tool baseline
aobench run task --task JOB_USR_001 --env env_01 --adapter direct_qa

# Generate a report
aobench report json --run data/runs/<run-id>
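
The same steps can also be driven from a Python script, for example in CI. The sketch below only assembles the command lines shown above and prints them by default. The helper names (`quick_start_commands`, `report_command`, `run_all`) are hypothetical and not part of AOBench, and the run directory is left for the caller to supply because the run ID is not known in advance.

```python
import shlex
import subprocess


def quick_start_commands(task_id, env_id, adapter):
    """Build the validate and run commands from the Quick Start as argv lists."""
    return [
        ["aobench", "validate", "benchmark"],
        ["aobench", "run", "task",
         "--task", task_id, "--env", env_id, "--adapter", adapter],
    ]


def report_command(run_dir):
    """Build the JSON report command for a run directory supplied by the caller."""
    return ["aobench", "report", "json", "--run", run_dir]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(argv, check=True)


commands = quick_start_commands("JOB_USR_001", "env_01", "direct_qa")
run_all(commands)  # dry run: prints the commands without executing them
```

Calling `run_all(commands, dry_run=False)` would execute the commands, which requires the `aobench` CLI to be installed and on the `PATH`.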

Where to Go Next

  • Framework


    Benchmark methodology, evaluation protocol, HPC environments, and scoring design.

    Read the framework docs

  • Guides


    Adapter setup, tool reference, Langfuse observability integration, and CLI usage.

    Browse guides

  • Contributing


    Add tasks, environments, adapters, or scorers. Codebase map for new contributors.

    How to contribute

  • GitHub


    Source code, issue tracker, and releases.

    MSKazemi/aobench