Taxonomy

This page consolidates the four taxonomic dimensions used to organise AOBench tasks: roles, query categories (QCAT), knowledge source scopes, and access control / RBAC tiers. It also documents the canonical task metadata schema.

The authoritative Pydantic types live in src/aobench/schemas/task.py.


1. Roles & personas

A task's role field says who is asking. AOBench defines five role values, all of which have scored tasks in task_set_v3.json.

| Role | Schema value | Has tasks? | Primary mission | Priority QCATs |
| --- | --- | --- | --- | --- |
| Normal user / scientific user | scientific_user | ✅ | Run workloads, manage own data | JOB, MON, ENERGY, DATA, DOCS, FAC |
| System administrator | sysadmin | ✅ | Cluster reliability, scheduling, security | JOB, MON, ENERGY, DATA, SEC, AIOPS, ARCH |
| Facility admin | facility_admin | ✅ | Power and cooling operations | MON, ENERGY, FAC, AIOPS, DOCS, ARCH |
| Researcher | researcher | ✅ | Telemetry analysis, performance, efficiency | AIOPS, PERF, ENERGY, JOB, MON, DATA, DOCS |
| System designer / architect | system_designer | ✅ | Capacity planning, topology, benchmarking | ARCH, PERF, ENERGY, JOB, AIOPS, DATA, DOCS |

1.1 Role abbreviations (task ID naming convention)

Task IDs encode the role as a 3-letter abbreviation, e.g. JOB-USR-003.

| Abbreviation | Role |
| --- | --- |
| USR | scientific_user |
| SYS | sysadmin |
| FAC | facility_admin |
| RES | researcher |
| DES | system_designer |
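
The naming convention can be sketched as a small parser. This is an illustration only: the `<QCAT>-<ROLE>-<NNN>` layout is inferred from the single example `JOB-USR-003`, and `parse_task_id` and `ParsedTaskId` are hypothetical names that do not come from the AOBench code base.

```python
import re
from typing import NamedTuple

# Abbreviation -> role schema value, copied from the table above.
ROLE_ABBREVIATIONS = {
    "USR": "scientific_user",
    "SYS": "sysadmin",
    "FAC": "facility_admin",
    "RES": "researcher",
    "DES": "system_designer",
}

# The ten QCAT codes listed in section 2.
QCAT_CODES = {
    "JOB", "MON", "ENERGY", "PERF", "DATA",
    "SEC", "FAC", "ARCH", "AIOPS", "DOCS",
}

# Assumed layout <QCAT>-<ROLE>-<NNN>, inferred from "JOB-USR-003".
TASK_ID_PATTERN = re.compile(r"^([A-Z]+)-([A-Z]{3})-(\d+)$")


class ParsedTaskId(NamedTuple):
    qcat: str
    role: str
    sequence: int


def parse_task_id(task_id: str) -> ParsedTaskId:
    """Split a task ID into QCAT code, role schema value, and sequence number."""
    match = TASK_ID_PATTERN.match(task_id)
    if match is None:
        raise ValueError(f"malformed task ID: {task_id!r}")
    qcat, abbreviation, sequence = match.groups()
    if qcat not in QCAT_CODES:
        raise ValueError(f"unknown QCAT code: {qcat!r}")
    if abbreviation not in ROLE_ABBREVIATIONS:
        raise ValueError(f"unknown role abbreviation: {abbreviation!r}")
    return ParsedTaskId(qcat, ROLE_ABBREVIATIONS[abbreviation], int(sequence))


print(parse_task_id("JOB-USR-003"))
```

Note that `FAC` is both a role abbreviation and a QCAT code, so the position of a segment within the ID, not its value, decides how it is interpreted.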

2. Query categories (QCAT)

The qcat field labels the functional domain of the task. All ten QCATs have scored tasks in benchmark/tasks/specs/ (80 tasks total).

| Code | Name | Has tasks? | Description |
| --- | --- | --- | --- |
| JOB | Job & Workflow Management | ✅ | Submitting, monitoring, debugging jobs; queues; batch scripts |
| MON | Monitoring & Observability | ✅ | Metrics, logs, alerts, dashboards, telemetry correlation |
| ENERGY | Power, Energy & Sustainability | ✅ | Power monitoring, PUE, energy-aware scheduling |
| PERF | Performance & Optimisation | ✅ | Profiling, bottlenecks, scaling studies |
| DATA | Data & Storage Management | ✅ | Filesystems, quotas, I/O performance, data transfer, backup/archival |
| SEC | Security & Policy | ✅ | IAM, access control, compliance |
| FAC | Facility, Infrastructure & Environmental Systems | ✅ | Cooling, BMS/DCIM, power distribution, rack health, alarms |
| ARCH | System Architecture, Design & Capacity Planning | ✅ | Topology, hardware specs, capacity planning, benchmarking |
| AIOPS | AI & Intelligent Operations | ✅ | Anomaly detection, predictive maintenance |
| DOCS | Documentation, Support & Knowledge Assistance | ✅ | Docs retrieval, tutorials, FAQs, policies, troubleshooting |

3. Knowledge source scope

knowledge_source_scope: list[KnowledgeSourceCode] constrains which document groups the environment exposes and which the agent may cite as evidence.

| Code | Group | Description | Primary roles |
| --- | --- | --- | --- |
| ARCH_DOC | System architecture & hardware | Cluster topology, hardware specs, rack layouts, BoM, firmware | system_designer, sysadmin |
| OPS_DOC | Sysadmin & operations manuals | Queue config, LDAP/RBAC policy, backup procedures, change mgmt | sysadmin |
| FAC_DOC | Facility & infrastructure | Cooling diagrams, BMS/DCIM config, P&ID, power distribution, setpoints | facility_admin, system_designer |
| USR_DOC | User documentation & help | Onboarding guides, SLURM/PBS reference, batch templates, FAQs | scientific_user, researcher |
| DATA_GOV | Data management & governance | Backup/archival, retention, GDPR, data transfer rules | sysadmin, researcher |
| POLICY | Organisational & policy | AUP, SLA, security policy, incident response, energy mgmt | sysadmin, facility_admin |
| ADMIN_DATA | Administrative & org data | Project allocations, billing, vendor contracts, maintenance calendar | system_designer, sysadmin |
| WIKI | Knowledge base / wiki / portal | How-to, troubleshooting pages, internal wiki, helpdesk KB | all |
| REF_STD | Reference standards & config tables | ASHRAE setpoints, partition definitions, compliance standards | facility_admin, system_designer |
| ENG_DOC | Engineering & upgrade documents | RFPs, system acceptance tests, expansion plans, integration diagrams | system_designer |

Example task usage:

{
  "knowledge_source_scope": ["USR_DOC", "WIKI"]
}

The canonical type definition is in src/aobench/schemas/task.py.
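
As a rough illustration of how a scope list constrains citations, the sketch below models the ten codes as a standard-library enum and flags cited codes that fall outside a task's scope. The enum is only a stand-in for the real type in src/aobench/schemas/task.py, whose exact shape is not shown here, and `parse_scope` and `citations_outside_scope` are hypothetical helpers.

```python
from enum import Enum


class KnowledgeSourceCode(str, Enum):
    """Stand-in for the real type (assumed shape, codes from the table above)."""
    ARCH_DOC = "ARCH_DOC"
    OPS_DOC = "OPS_DOC"
    FAC_DOC = "FAC_DOC"
    USR_DOC = "USR_DOC"
    DATA_GOV = "DATA_GOV"
    POLICY = "POLICY"
    ADMIN_DATA = "ADMIN_DATA"
    WIKI = "WIKI"
    REF_STD = "REF_STD"
    ENG_DOC = "ENG_DOC"


def parse_scope(raw_codes):
    """Convert raw strings to enum members, rejecting unknown codes."""
    scope = []
    for code in raw_codes:
        try:
            scope.append(KnowledgeSourceCode(code))
        except ValueError:
            raise ValueError(f"unknown knowledge source code: {code!r}") from None
    return scope


def citations_outside_scope(scope, cited_codes):
    """Return the cited codes that the task's scope does not cover."""
    allowed = {member.value for member in scope}
    return [code for code in cited_codes if code not in allowed]


scope = parse_scope(["USR_DOC", "WIKI"])
print(citations_outside_scope(scope, ["WIKI", "OPS_DOC"]))  # prints ['OPS_DOC']
```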


4. Access control & RBAC

AOBench enforces two-layered access control: access tiers govern data exposure, and role permissions govern tool calls.

4.1 Access levels (data exposure)

| Level | Holders | Scope |
| --- | --- | --- |
| User-level | scientific_user, researcher | Own jobs, public docs, safe how-to |
| Elevated / privileged | sysadmin, project managers | Software install, config, user management |
| Restricted read-only | researcher | Aggregated, anonymised telemetry |
| Sensitive / admin-only | sysadmin, security | Auth logs, security config, network |
| Highly sensitive | architect, facility_admin | Procurement, physical access design |

4.2 Access tiers (per-task)

| Tier | Description | Controls |
| --- | --- | --- |
| Tier-1: Public / user-level | Safe docs, non-sensitive RAG | No approval |
| Tier-2: Privileged | Real telemetry | Role-based validation |
| Tier-3: Restricted read-only | Energy dashboards, KPIs | Read-only |
| Tier-4: Highly sensitive | Procurement, cyber-security | Approval + isolation |

4.3 RBAC policy v1.1

Per-environment, per-role permissions live in benchmark/environments/env_NN/rbac_policy.yaml. Each entry declares:

  • allowed_tools: list of tool methods this role may call.
  • partition_access: which SLURM partitions are visible / submittable.
  • access_tiers: Tier-1 to Tier-4 as above.

The policies follow least-privilege: each role gets the minimum required access. Privileged actions are logged in the trace (whether or not the agent uses them) and contribute to the governance dimension.
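
A least-privilege check over such entries might look like the sketch below. Only the three key names (`allowed_tools`, `partition_access`, `access_tiers`) come from the description above. The in-memory layout, the tool method names, the partition names, and the `is_call_permitted` helper are all assumptions for illustration, and the real policy file may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolePolicy:
    allowed_tools: frozenset      # tool methods the role may call
    partition_access: frozenset   # SLURM partitions visible / submittable
    access_tiers: frozenset       # tier numbers (1-4) the role may read


# Hypothetical policy contents; real values live in rbac_policy.yaml.
POLICY = {
    "scientific_user": RolePolicy(
        allowed_tools=frozenset({"slurm.query_jobs", "docs.search"}),
        partition_access=frozenset({"cpu"}),
        access_tiers=frozenset({1}),
    ),
    "sysadmin": RolePolicy(
        allowed_tools=frozenset(
            {"slurm.query_jobs", "slurm.list_nodes", "docs.search"}
        ),
        partition_access=frozenset({"cpu", "gpu"}),
        access_tiers=frozenset({1, 2}),
    ),
}


def is_call_permitted(role, tool_method, data_tier):
    """Deny by default: unknown roles, unlisted tools, and unlisted tiers all fail."""
    policy = POLICY.get(role)
    if policy is None:
        return False
    return tool_method in policy.allowed_tools and data_tier in policy.access_tiers


print(is_call_permitted("scientific_user", "slurm.list_nodes", 1))  # prints False
print(is_call_permitted("sysadmin", "slurm.list_nodes", 2))         # prints True
```

Denying by default means a role or tool that is missing from the policy is treated the same as one that is explicitly forbidden, which matches the least-privilege intent.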

The catalog of every tool method, its role_visibility, and its dangerous_args conditions is in benchmark/configs/hpc_tool_catalog.yaml (16 methods across the 5 tool families). Forbidden tool calls and permission-denied propagation are absorbing hard-fails; see Evaluation §6.


5. Task metadata schema

Authoritative Pydantic definition: src/aobench/schemas/task.py. The following fields are required on every task spec; the validator (aobench validate benchmark) enforces them.

| Field | Type | Purpose | Example |
| --- | --- | --- | --- |
| task_id | str | Unique identifier | "JOB-USR-003" |
| role | Role | Persona context | "scientific_user" |
| qcat | QCat | Functional domain | "JOB" |
| query_text | str | User-facing prompt | "Why did my job fail?" |
| difficulty | Difficulty | "easy" / "medium" / "hard" / "adversarial" | "medium" |
| difficulty_tier | int | Numeric tier (1, 2, 3) | 2 |
| knowledge_source_scope | list[KnowledgeSourceCode] | Allowed evidence groups | ["USR_DOC", "WIKI"] |
| allowed_tools | list[str] | Tool whitelist | ["slurm", "docs"] |
| gold_evidence_refs | list[str] | Expected evidence anchors | ["job_891234_oom"] |
| expected_answer_type | AnswerType | Output form | "diagnosis" |
| environment_id | str | Snapshot linkage | "env_01" |
| hard_fail_conditions | list[str] | Absorbing-failure triggers | ["fabricated_evidence"] |
| eval_criteria | EvalCriteria | Scoring config | {"evaluation_mode": "semantic_match"} |
| aggregate_weight_profile | str | Profile name | "default_hpc_v01" |
| scoring_readiness | Literal["draft","ready","locked"] | Validation state | "ready" |
| task_creation_date | date | Authoring date (contamination tracking) | "2026-02-10" |
| contamination_risk | Literal["clean","elevated","unknown"] | Pre-training-leakage risk | "clean" |
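
To make the required-field rule concrete, the sketch below assembles a spec from the example column above and runs a simplified check using only the standard library. It is not the real validator: `check_task_spec` is a hypothetical helper, and the authoritative Pydantic model may apply stricter or different rules.

```python
from datetime import date

# Field names taken from the table above.
REQUIRED_FIELDS = (
    "task_id", "role", "qcat", "query_text", "difficulty", "difficulty_tier",
    "knowledge_source_scope", "allowed_tools", "gold_evidence_refs",
    "expected_answer_type", "environment_id", "hard_fail_conditions",
    "eval_criteria", "aggregate_weight_profile", "scoring_readiness",
    "task_creation_date", "contamination_risk",
)

# Closed value sets that the table spells out explicitly.
ALLOWED_VALUES = {
    "difficulty": {"easy", "medium", "hard", "adversarial"},
    "scoring_readiness": {"draft", "ready", "locked"},
    "contamination_risk": {"clean", "elevated", "unknown"},
}


def check_task_spec(spec):
    """Return a list of problems; an empty list means the simplified checks pass."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in spec]
    for name, allowed in ALLOWED_VALUES.items():
        if name in spec and spec[name] not in allowed:
            problems.append(f"invalid {name}: {spec[name]!r}")
    if "task_creation_date" in spec:
        try:
            date.fromisoformat(spec["task_creation_date"])
        except (TypeError, ValueError):
            problems.append("task_creation_date is not an ISO date")
    return problems


# Values copied from the Example column of the table.
EXAMPLE_SPEC = {
    "task_id": "JOB-USR-003",
    "role": "scientific_user",
    "qcat": "JOB",
    "query_text": "Why did my job fail?",
    "difficulty": "medium",
    "difficulty_tier": 2,
    "knowledge_source_scope": ["USR_DOC", "WIKI"],
    "allowed_tools": ["slurm", "docs"],
    "gold_evidence_refs": ["job_891234_oom"],
    "expected_answer_type": "diagnosis",
    "environment_id": "env_01",
    "hard_fail_conditions": ["fabricated_evidence"],
    "eval_criteria": {"evaluation_mode": "semantic_match"},
    "aggregate_weight_profile": "default_hpc_v01",
    "scoring_readiness": "ready",
    "task_creation_date": "2026-02-10",
    "contamination_risk": "clean",
}

print(check_task_spec(EXAMPLE_SPEC))  # prints []
```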

HPCTaskSpec (used for the 36 v1 tasks in task_set_v1.json) extends TaskSpec with HPCRoleVariant blocks for multi-role variants and HPCGroundTruth for component-wise gold answers.


  • Task & role schema: src/aobench/schemas/task.py
  • Trace & result schema: src/aobench/schemas/trace.py
  • Tool catalog: benchmark/configs/hpc_tool_catalog.yaml
  • RBAC policies: benchmark/environments/env_NN/rbac_policy.yaml
  • Evaluation protocol: Evaluation
  • Architecture: Architecture §4
  • Implemented system: System Architecture