Environment Snapshots
Purpose
Environment Snapshots define the deterministic HPC operational states used by AOBench during evaluation.
An environment snapshot is a packaged, reproducible bundle of scheduler state, telemetry, policies, documentation, and incident context that simulates a realistic HPC situation without requiring access to a live system.
Each benchmark task references one environment snapshot through environment_id. This allows the same task to be executed repeatedly under the same conditions, ensuring reproducibility, fair comparison, and offline benchmarking. AOBench relies on these deterministic snapshots rather than live infrastructure as part of its benchmark design.
1 - What an Environment Snapshot Represents
An Environment Snapshot is the world-state in which an agent must operate.
It may capture:
- scheduler and queue state
- node and job telemetry
- power and energy measurements
- cluster topology
- role-based access policies
- documentation and operational procedures
- incident or fault context
In simple terms:
2 - Why Environment Snapshots Are Needed
Environment Snapshots are required because live HPC systems are constantly changing.
Without snapshots, the following can differ between two runs of the same task:
- job queues change
- metrics drift
- incidents resolve
- documents get updated
- permissions change over time
Using snapshots makes AOBench:
- reproducible
- publishable
- easier to debug
- independent of site-specific live infrastructure
- suitable for offline evaluation and artifact release
3 - Core Design Principles
3.1 Deterministic
The same snapshot must always produce the same tool outputs for the same inputs.
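One way to test this property is to call a tool twice with identical inputs and compare a digest of the serialized outputs. The sketch below is illustrative only: `SNAPSHOT`, `query_jobs`, and `output_digest` are hypothetical names, and the data layout is an assumption rather than AOBench's actual schema.

```python
import hashlib
import json

# Hypothetical in-memory snapshot; a real bundle would load this from slurm_state.json.
SNAPSHOT = {
    "jobs": [
        {"job_id": 482910, "user": "alice", "state": "OUT_OF_MEMORY"},
        {"job_id": 482911, "user": "bob", "state": "RUNNING"},
    ]
}

def query_jobs(snapshot, user):
    """Hypothetical mock tool: return the jobs owned by `user`, sorted by job_id."""
    jobs = [job for job in snapshot["jobs"] if job["user"] == user]
    return sorted(jobs, key=lambda job: job["job_id"])

def output_digest(result):
    """Hash a canonical JSON encoding so that equal outputs give equal digests."""
    encoded = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

first = output_digest(query_jobs(SNAPSHOT, "alice"))
second = output_digest(query_jobs(SNAPSHOT, "alice"))
assert first == second  # same snapshot and same inputs give the same output
```

Sorting results and serializing with `sort_keys=True` removes two common sources of accidental nondeterminism: iteration order and key order.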
3.2 Realistic
Snapshots should reflect real HPC operational situations such as job failures, thermal anomalies, queue congestion, or documentation lookup.
3.3 Role-aware
The same snapshot may expose different views depending on the requester role and policy profile.
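As a rough illustration of role-dependent views, the sketch below filters the same job list differently for two roles. The `POLICY` structure and the `own_jobs_only` flag are assumptions made for this example; the real `rbac_policy.yaml` layout may differ.

```python
# Hypothetical policy profile; not the actual rbac_policy.yaml schema.
POLICY = {
    "scientific_user": {"own_jobs_only": True},
    "sysadmin": {"own_jobs_only": False},
}

JOBS = [
    {"job_id": 482910, "user": "alice"},
    {"job_id": 482911, "user": "bob"},
]

def visible_jobs(jobs, role, requester):
    """Return the subset of `jobs` that `role` is allowed to see."""
    if POLICY[role]["own_jobs_only"]:
        return [job for job in jobs if job["user"] == requester]
    return list(jobs)

user_view = visible_jobs(JOBS, "scientific_user", "alice")   # only alice's job
admin_view = visible_jobs(JOBS, "sysadmin", "root")          # every job
```

The underlying data is identical in both calls; only the view returned to the requester changes.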
3.4 Modular
A snapshot should be composed of clearly separated files for scheduler state, telemetry, policies, docs, and incidents.
3.5 Extensible
The format should support future additions such as live replay, multi-agent interaction, or time-evolving scenarios.
4 - Typical Snapshot Contents
A snapshot bundle may include the following components:
- `slurm_state.json`
- `telemetry_timeseries.parquet`
- `power_metrics.csv`
- `topology.json`
- `rbac_policy.yaml`
- `docs_index/`
- `incident_metadata.json`
- `metadata.yaml`
These file types are consistent with the AOBench architecture page, which already describes the HPC state snapshot model and typical files.
5 - Canonical Directory Layout
    benchmark/environments/
      <env_id>/
        metadata.yaml                     # EnvironmentMetadata (Pydantic-validated)
        slurm/
          slurm_state.json                # nodes, partitions, jobs (validated by SlurmState)
          job_details.json                # sacct-level details (optional)
        telemetry/
          telemetry_timeseries.parquet    # columns: timestamp, node_id, metric_name, value, unit
          memory_events.csv               # OOM/memory events (optional)
        policy/
          rbac_policy.yaml                # per-role permission definitions
        docs/
          *.md                            # user-facing knowledge docs
        incidents/
          incident_metadata.json          # incident timeline + affected resources
Bundles are validated by aobench.environment.snapshot_validator.validate_bundle().
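Before schema validation, a quick structural check can confirm that the expected files are present. The sketch below is not the behaviour of `validate_bundle()`; `REQUIRED` and `missing_files` are hypothetical, and treating every layout entry not marked optional as required is an assumption.

```python
import tempfile
from pathlib import Path

# Paths taken from the canonical layout; entries marked optional there are omitted.
REQUIRED = [
    "metadata.yaml",
    "slurm/slurm_state.json",
    "telemetry/telemetry_timeseries.parquet",
    "policy/rbac_policy.yaml",
    "incidents/incident_metadata.json",
]

def missing_files(bundle_root):
    """Return the required relative paths that do not exist under bundle_root."""
    root = Path(bundle_root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]

# Demonstration against a throwaway directory holding only the scheduler state.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "slurm").mkdir()
    (Path(tmp) / "slurm" / "slurm_state.json").write_text("{}")
    gaps = missing_files(tmp)
```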
6 - Snapshot Metadata Schema
Each environment snapshot should have a small metadata file.
Example metadata.yaml
    environment_id: env_01
    snapshot_name: OOM Failure Scenario
    cluster_name: cluster-a
    snapshot_timestamp: 2026-02-10T14:00:00Z
    supported_roles:
      - scientific_user
      - sysadmin
    supported_qcats:
      - JOB
      - MON
    included_sources:
      - slurm
      - telemetry metrics snapshot
      - docs
      - rbac
    scenario_type: job_failure
    description: >
      Snapshot representing a user job failure caused by out-of-memory pressure,
      with associated scheduler state, metrics, and policy context.
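The metadata fields can be mirrored by a small typed record with basic checks. The sketch below uses a standard-library dataclass as a stand-in; `EnvironmentMetadataSketch` is hypothetical and is not the Pydantic `EnvironmentMetadata` model, and the two checks in `__post_init__` are assumptions about what a validator might enforce.

```python
from dataclasses import dataclass, replace
from datetime import datetime

@dataclass(frozen=True)
class EnvironmentMetadataSketch:
    """Stand-in for the metadata record; field names follow the example file."""
    environment_id: str
    snapshot_name: str
    cluster_name: str
    snapshot_timestamp: datetime
    supported_roles: tuple
    supported_qcats: tuple
    scenario_type: str

    def __post_init__(self):
        # Assumed rules, for illustration only.
        if not self.environment_id.startswith("env_"):
            raise ValueError(f"unexpected environment_id: {self.environment_id!r}")
        if not self.supported_roles:
            raise ValueError("supported_roles must not be empty")

meta = EnvironmentMetadataSketch(
    environment_id="env_01",
    snapshot_name="OOM Failure Scenario",
    cluster_name="cluster-a",
    snapshot_timestamp=datetime.fromisoformat("2026-02-10T14:00:00+00:00"),
    supported_roles=("scientific_user", "sysadmin"),
    supported_qcats=("JOB", "MON"),
    scenario_type="job_failure",
)

# A malformed ID is rejected when the record is constructed.
try:
    replace(meta, environment_id="cluster-a")
except ValueError as exc:
    rejected = str(exc)
```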
7 - Logical Components of a Snapshot
7.1 Scheduler State
Represents batch system information such as:
- jobs
- queues
- partitions
- node assignments
- exit codes
- scheduling state
Typical source file:
slurm_state.json
7.2 Telemetry State
Represents operational measurements such as:
- CPU utilization
- memory usage
- node health
- power draw
- temperature
- cooling indicators
Typical source files:
- `telemetry_timeseries.parquet`
- `power_metrics.csv`
7.3 Policy and Access State
Represents what a role can view or do.
Examples:
- read-only user policy
- sysadmin visibility
- restricted facility billing data
- redaction rules
Typical source file:
rbac_policy.yaml
7.4 Documentation Bundle
Represents the knowledge artifacts accessible to the agent.
Examples:
- user guides
- quota policies
- troubleshooting docs
- facility procedures
Typical source:
docs_index/
7.5 Incident Context
Represents scenario-specific operational conditions.
Examples:
- rack overheat
- queue backlog
- partial metrics outage
- node failure
- maintenance window
Typical source file:
incident_metadata.json
8 - Example Snapshot Use Case
Example task
Referenced environment
Snapshot contents
- `slurm_state.json` shows job 482910 failed with OOM
- `telemetry_timeseries.parquet` shows a memory spike before termination
- `rbac_policy.yaml` allows the user to see only their own job details
- `docs_index/` includes memory request documentation
- `incident_metadata.json` confirms no broader cluster outage
This lets the agent answer the question using a realistic but fully reproducible scenario.
9 - Relationship to Tasks
A task should never depend on live infrastructure.
Instead, each task references one snapshot via its environment_id field.
This means:
- the task defines the question and evaluation logic
- the environment snapshot defines the underlying HPC state
So the mapping is:
    Task → references → Environment Snapshot
    Agent → interacts with → Mock tools over snapshot data
    Scorers → evaluate → output + trace against task + snapshot
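The task-to-snapshot link can be pictured as a single field that resolves to a bundle directory. In the sketch below, `TaskSketch`, its `task_id` and `question` fields, and the sample question text are hypothetical; only `environment_id` and the `benchmark/environments/` root come from this page.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Root directory from the canonical layout.
ENV_ROOT = PurePosixPath("benchmark/environments")

@dataclass(frozen=True)
class TaskSketch:
    task_id: str          # hypothetical field
    question: str         # hypothetical field
    environment_id: str   # the single snapshot this task runs against

def bundle_path(task):
    """Resolve the task's environment_id to its bundle directory."""
    return ENV_ROOT / task.environment_id

task = TaskSketch("task_001", "Why did job 482910 fail?", "env_01")
resolved = str(bundle_path(task))
```

Because the task carries only an identifier, the question and scoring logic stay separate from the HPC state they are evaluated against.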
10 - Relationship to Mock Tools
Snapshots are not accessed directly by the agent.
Instead, mock tools expose the snapshot data through controlled interfaces.
Examples:
- `slurm.query_jobs()`
- `slurm.job_details(job_id)`
- `telemetry.query(metric, labels, time_range)`
- `docs.retrieve(query)`
- `rbac.check(role, resource)`
So the snapshot is the backend state, and the mock tools are the evaluation interface.
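A minimal sketch of this separation follows: the agent holds a tool object and never touches the snapshot data itself. `MockSlurmTool` is a hypothetical class written for illustration; its method names echo the examples above, but the return shapes and the copy-on-read behaviour are assumptions.

```python
class MockSlurmTool:
    """Hypothetical read-only facade over scheduler snapshot data."""

    def __init__(self, slurm_state):
        # Copy each job so later changes to the input cannot leak into the tool.
        self._jobs = {job["job_id"]: dict(job) for job in slurm_state["jobs"]}

    def query_jobs(self):
        """Return all known job IDs in ascending order."""
        return sorted(self._jobs)

    def job_details(self, job_id):
        """Return a copy of one job record, so callers cannot mutate the snapshot."""
        job = self._jobs.get(job_id)
        if job is None:
            raise KeyError(f"unknown job_id: {job_id}")
        return dict(job)

tool = MockSlurmTool({"jobs": [{"job_id": 482910, "state": "OUT_OF_MEMORY"}]})
details = tool.job_details(482910)
details["state"] = "TAMPERED"  # affects only the caller's copy
```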
11 - Environment Scenario Types
To keep dataset design organized, snapshots can be categorized by scenario type.
Suggested types:
- `job_failure`
- `queue_congestion`
- `node_health_alert`
- `energy_anomaly`
- `thermal_issue`
- `policy_lookup`
- `permission_violation`
- `incident_response`
- `performance_bottleneck`
This helps align snapshots with QCAT categories and capabilities.
12 - Current Environment Coverage
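AOBench currently ships 23 canonical snapshot bundles. The table below groups 20 of them into 11 scenario types; env_03, env_04, and env_05 are listed only in the tracker in section 12.1, under the scenario types thermal_power, rack_energy, and cooling_failure.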
| Scenario type | Count | Env IDs |
|---|---|---|
| `job_failure` | 3 | env_01, env_17, env_18 |
| `energy_anomaly` | 3 | env_06, env_07, env_19 |
| `node_degradation` | 2 | env_08, env_09 |
| `policy_violation` | 2 | env_10, env_11 |
| `queue_congestion` | 2 | env_02, env_12 |
| `capacity_planning` | 2 | env_13, env_14 |
| `multi_job_interference` | 2 | env_15, env_20 |
| `scheduler_misconfiguration` | 1 | env_16 |
| `storage_management` | 1 | env_21 |
| `facility_incident` | 1 | env_22 |
| `architecture_review` | 1 | env_23 |
Naming convention
Canonical environment IDs use the env_NN scheme (zero-padded). New bundles should continue from env_24.
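The naming rule can be checked mechanically. The sketch below assumes exactly two digits after `env_`, which matches every ID currently listed but is an assumption about the scheme's limit; `ENV_ID_RE` and `next_env_id` are hypothetical helpers.

```python
import re

# Assumes the NN part is exactly two digits, as in all IDs listed on this page.
ENV_ID_RE = re.compile(r"^env_(\d{2})$")

def is_canonical(env_id):
    """True if env_id follows the zero-padded env_NN scheme."""
    return ENV_ID_RE.match(env_id) is not None

def next_env_id(existing):
    """Return the next free ID after the highest canonical one in `existing`."""
    numbers = [int(m.group(1)) for m in map(ENV_ID_RE.match, existing) if m]
    return f"env_{max(numbers, default=0) + 1:02d}"

existing = [f"env_{n:02d}" for n in range(1, 24)]  # env_01 .. env_23
upcoming = next_env_id(existing)
```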
12.1 - Canonical Snapshot Tracker
| environment_id | snapshot_name | scenario_type | supported_roles | supported_categories | status |
|---|---|---|---|---|---|
| `env_01` | User OOM Failure | job_failure | scientific_user, sysadmin | JOB, MON | validated |
| `env_02` | Queue Congestion / Long Pending Jobs | queue_congestion | scientific_user, sysadmin | JOB, MON | validated |
| `env_03` | Thermal and Power Monitoring | thermal_power | sysadmin, facility_admin | MON, ENERGY | validated |
| `env_04` | Rack Energy Comparison | rack_energy | facility_admin | ENERGY | validated |
| `env_05` | Cooling Unit Failure | cooling_failure | facility_admin, sysadmin | ENERGY, MON | validated |
| `env_06` | GPU Power Spike | energy_anomaly | sysadmin, facility_admin | ENERGY, MON | validated |
| `env_07` | PUE Degradation Cooling Issue | energy_anomaly | facility_admin | ENERGY | validated |
| `env_08` | Thermal Throttling on node03 | node_degradation | sysadmin | MON | validated |
| `env_09` | Memory ECC Errors Flapping Node | node_degradation | sysadmin | MON | validated |
| `env_10` | Policy Violation Restricted Partition | policy_violation | scientific_user, sysadmin | JOB | validated |
| `env_11` | Account Over Allocation Limit | policy_violation | sysadmin, facility_admin | JOB | validated |
| `env_12` | Fairshare Starvation Priority Inversion | queue_congestion | sysadmin | JOB, MON | validated |
| `env_13` | Six Month CPU Utilisation Trend | capacity_planning | facility_admin, system_designer | ENERGY, MON | validated |
| `env_14` | GPU Demand Forecast Expansion | capacity_planning | system_designer | ENERGY, MON | validated |
| `env_15` | Multi-Job Memory Interference | multi_job_interference | sysadmin, researcher | JOB, MON | validated |
| `env_16` | Wrong Default Partition Misconfiguration | scheduler_misconfiguration | sysadmin | JOB | validated |
| `env_17` | MPI Communication Timeout Network Fault | job_failure | sysadmin | JOB, MON | validated |
| `env_18` | Checkpoint File Missing Restart Fails | job_failure | scientific_user | JOB | validated |
| `env_19` | GPU Idle Energy Waste Not Released | energy_anomaly | facility_admin | ENERGY, MON | validated |
| `env_20` | Lustre IO Contention Multi-Job Interference | multi_job_interference | sysadmin | JOB, MON | validated |
| `env_21` | Storage Quota Pressure | storage_management | all 5 roles | DATA | not_checked |
| `env_22` | Cooling Alarm Response | facility_incident | all 5 roles | FAC, ENERGY, DOCS | not_checked |
| `env_23` | Capacity Expansion Planning | architecture_review | sysadmin, researcher, facility_admin, system_designer | ARCH, PERF, DOCS | not_checked |
Every task must reference exactly one valid environment_id from this table.
12.2 - Snapshot Schema Implementation
Every bundle is validated by validate_bundle() on each load_environment() call.
Key implementation files
| File | Purpose |
|---|---|
| `src/aobench/schemas/snapshot.py` | Pydantic models: SlurmState, SlurmJob, SlurmNode, SlurmPartition, IncidentMetadata |
| `src/aobench/environment/snapshot_validator.py` | validate_bundle(bundle_root): validates JSON schemas, RBAC YAML, parquet columns |
| `src/aobench/environment/snapshot_loader.py` | build_tool_registry(bundle, role): instantiates all mock tools bound to a role |
| `src/aobench/loaders/env_loader.py` | load_environment(): calls validate_bundle() before returning the bundle |
| `scripts/generate_bundles.py` | Generates the env_06 to env_20 bundles programmatically (make generate-bundles) |
Telemetry parquet schema (canonical)
| Column | dtype | Description |
|---|---|---|
| `timestamp` | datetime64[ns, UTC] | Sample time (UTC) |
| `node_id` | string | Node name (e.g. node01) |
| `metric_name` | string | e.g. cpu_util_pct, power_w, gpu_util_pct |
| `value` | float64 | Metric value |
| `unit` | string | %, MB, W, Mbps |
Mock tool telemetry methods
| Method | Description |
|---|---|
| `telemetry.query_timeseries(node_id, metric_name, start, end)` | Parquet time-range query with role-based node filtering |
| `telemetry.query_node_metrics(node_id)` | Per-node latest-value summary across all metrics |
| `telemetry.query_memory_events(job_id)` | Memory events CSV lookup |
| `telemetry.list_metrics()` | List available telemetry files |
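To make the time-range query and its node filtering concrete, the sketch below runs the same logic over a few in-memory rows in the canonical long format. It is not the real implementation: the rows are made-up sample values, `allowed_nodes` is a hypothetical stand-in for the role policy, and returning an empty list for a hidden node (rather than raising) is an assumption.

```python
from datetime import datetime, timezone

def ts(hour, minute):
    """Build a UTC timestamp on the example snapshot date."""
    return datetime(2026, 2, 10, hour, minute, tzinfo=timezone.utc)

# Made-up sample rows using the five canonical columns.
ROWS = [
    {"timestamp": ts(13, 50), "node_id": "node01", "metric_name": "cpu_util_pct", "value": 71.0, "unit": "%"},
    {"timestamp": ts(13, 55), "node_id": "node01", "metric_name": "cpu_util_pct", "value": 98.5, "unit": "%"},
    {"timestamp": ts(13, 55), "node_id": "node02", "metric_name": "cpu_util_pct", "value": 12.0, "unit": "%"},
]

def query_timeseries(rows, allowed_nodes, node_id, metric_name, start, end):
    """Return samples in [start, end] for one node and metric; [] if the node is hidden."""
    if node_id not in allowed_nodes:
        return []
    return [
        row for row in rows
        if row["node_id"] == node_id
        and row["metric_name"] == metric_name
        and start <= row["timestamp"] <= end
    ]

visible = query_timeseries(ROWS, {"node01"}, "node01", "cpu_util_pct", ts(13, 50), ts(13, 55))
hidden = query_timeseries(ROWS, {"node01"}, "node02", "cpu_util_pct", ts(13, 50), ts(13, 55))
```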
Implementation status values
| Status | Meaning |
|---|---|
| `planned` | Exists only as a row in a table |
| `scaffolded` | Directory + metadata exist; files may be placeholders |
| `bundled` | All required files present |
| `validated` | Passes validate_bundle() with no schema errors |
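Of the 23 bundles in the tracker, env_01 through env_20 are at validated status. env_21 through env_23 are marked not_checked, a value outside the four statuses defined above.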
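Read as a progression, the four statuses can be derived from three yes/no facts about a bundle. The sketch below is one possible encoding; `BundleStatus` and `derive_status` are hypothetical, and the strict ordering of the stages is an assumption drawn from the table's wording.

```python
from enum import Enum

class BundleStatus(Enum):
    PLANNED = "planned"
    SCAFFOLDED = "scaffolded"
    BUNDLED = "bundled"
    VALIDATED = "validated"

def derive_status(has_directory, has_required_files, passes_validation):
    """Map three checks onto the status table, stopping at the first unmet stage."""
    if not has_directory:
        return BundleStatus.PLANNED
    if not has_required_files:
        return BundleStatus.SCAFFOLDED
    if not passes_validation:
        return BundleStatus.BUNDLED
    return BundleStatus.VALIDATED

status = derive_status(has_directory=True, has_required_files=True, passes_validation=False)
```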
13 - Validation Rules
validate_bundle(bundle_root) (in src/aobench/environment/snapshot_validator.py) checks:
- `slurm/slurm_state.json`: validates against the `SlurmState` Pydantic model
- `incidents/incident_metadata.json`: validates against the `IncidentMetadata` Pydantic model
- `policy/rbac_policy.yaml`: must be valid YAML with a top-level `roles` key
- `telemetry/telemetry_timeseries.parquet`: must contain the columns `timestamp`, `node_id`, `metric_name`, `value`, `unit`
load_environment() calls validate_bundle() automatically and raises ValueError on any error.
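The collect-errors-then-raise pattern described here can be sketched with two of the listed checks. This is a simplified stand-in, not the code in `snapshot_validator.py`: it works on already-parsed inputs (a policy mapping and a list of column names) to stay free of YAML and parquet dependencies, and all function names and error messages are hypothetical.

```python
REQUIRED_COLUMNS = ("timestamp", "node_id", "metric_name", "value", "unit")

def check_rbac(policy):
    """RBAC rule: the parsed YAML must be a mapping with a top-level 'roles' key."""
    if not isinstance(policy, dict) or "roles" not in policy:
        return ["policy/rbac_policy.yaml: missing top-level 'roles' key"]
    return []

def check_columns(columns):
    """Telemetry rule: every canonical column must be present."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        return [f"telemetry/telemetry_timeseries.parquet: missing columns {missing}"]
    return []

def load_environment_sketch(policy, columns):
    """Run all checks first, then raise one ValueError listing every problem found."""
    errors = check_rbac(policy) + check_columns(columns)
    if errors:
        raise ValueError("; ".join(errors))
    return {"policy": policy, "columns": list(columns)}

try:
    load_environment_sketch({"users": {}}, ["timestamp", "node_id", "value"])
except ValueError as exc:
    report = str(exc)
```

Gathering every error before raising lets a bundle author fix all problems in one pass instead of discovering them one at a time.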
To validate all bundles manually:
14 - Future Extensions
Later versions of AOBench may support:
- time-series replay environments
- partially degraded environments
- multi-step incident simulations
- hidden environment variants for leaderboard robustness
- site-specific snapshot packs
- multi-agent environment interaction
15 - Bottom Line
Environment Snapshots are a core part of AOBench. They make the benchmark more than a question set by providing a reproducible HPC world-state for agent evaluation.
Their role is to:
- freeze operational reality
- support mock tool interaction
- enable fair scoring
- make benchmark runs reproducible
- bridge task definitions and executable evaluation