Environment Bundles — Overview¶

AOBench ships 26 deterministic HPC environment snapshot bundles under benchmark/environments/env_01/ … env_26/. Each bundle freezes a realistic operational scenario — a job failure, a queue-congestion event, a cooling unit fault, a policy violation, a multi-job interference incident, and so on — so that any agent run against the bundle is reproducible.

This page is the cross-reference index: scenario type, scored roles, scored QCATs, and the human description for each bundle. For the snapshot file format (slurm_state.json, telemetry/*.parquet, rbac_policy.yaml, docs/*.md, incident_metadata.json), see environments.md. For the authoritative metadata.yaml of each bundle, read the file directly.

Scenario coverage at a glance¶

Scenario type	Bundles	Headline
`job_failure`	env_01, env_17, env_18	OOM kill, MPI timeout, missing checkpoint
`queue_congestion`	env_02, env_12	Pending-queue spike, fairshare starvation
`thermal_power`	env_03	Thermal-power monitoring snapshot
`rack_energy`	env_04	Rack-level energy comparison
`cooling_failure`	env_05	CRAC unit failure
`energy_anomaly`	env_06, env_07, env_19	GPU power spike, PUE degradation, GPU idle waste
`node_degradation`	env_08, env_09	Thermal throttling, ECC errors / flapping node
`policy_violation`	env_10, env_11	Restricted-partition submit, allocation overrun
`capacity_planning`	env_13, env_14	6-month CPU trend, GPU demand forecast
`multi_job_interference`	env_15, env_20	Memory oversubscription, Lustre I/O contention
`scheduler_misconfiguration`	env_16	Wrong default partition after reconfig
`storage_management`	env_21	Lustre quota pressure; per-user and per-project usage
`facility_incident`	env_22	Cooling alarm response; degraded CRAC in high-density GPU row
`architecture_review`	env_23	Capacity expansion planning; cluster topology and hardware inventory

All 23 bundles¶

Env	Scenario type	Scored roles	Scored QCATs	Description
env_01	`job_failure`	scientific_user, sysadmin	JOB, MON	User-job OOM failure on a memory-constrained node.
env_02	`queue_congestion`	sysadmin	JOB, MON	Pending-job queue is backed up and a sysadmin must triage.
env_03	`thermal_power`	facility_admin	MON, ENERGY	Thermal-power monitoring snapshot.
env_04	`rack_energy`	facility_admin	ENERGY, MON	Rack-by-rack energy comparison for cluster-wide review.
env_05	`cooling_failure`	facility_admin, sysadmin	ENERGY, MON	CRAC unit failure causing inlet-temperature anomalies.
env_06	`energy_anomaly`	sysadmin, facility_admin	ENERGY, MON	`gpu01` power draw spikes to 650 W (baseline ≈ 380 W) during a large training run.
env_07	`energy_anomaly`	facility_admin	ENERGY	Cluster PUE has degraded from 1.35 to 1.62 over 48 h due to a partial cooling fault.
env_08	`node_degradation`	sysadmin	MON, JOB	`node03` is thermally throttling because of a blocked cooling duct.
env_09	`node_degradation`	sysadmin	MON, JOB	`node06` is flapping between `allocated` and `draining` due to recurring ECC errors.
env_10	`policy_violation`	scientific_user, sysadmin	JOB	User `eve` submitted to the `restricted` partition without approval; SLURM held the job.
env_11	`policy_violation`	sysadmin, facility_admin	JOB, ENERGY	Account `ml-lab` has consumed 98 % of its monthly CPU-hour allocation.
env_12	`queue_congestion`	sysadmin	JOB	`ml-lab` has monopolised the cluster for 72 h, causing fairshare starvation and priority inversion.
env_13	`capacity_planning`	facility_admin, system_designer	MON, ENERGY	Six-month CPU-utilisation telemetry showing a steady upward trend.
env_14	`capacity_planning`	system_designer	ENERGY, MON	GPU partition utilisation has averaged 97 % for 30 days — demand-forecast input.
env_15	`multi_job_interference`	sysadmin, researcher	JOB, MON	Two jobs share `node01` (180 GB + 200 GB on a 256 GB node); swap activity is degrading both.
env_16	`scheduler_misconfiguration`	sysadmin	JOB	After a SLURM reconfig, the default partition was set incorrectly and is misrouting jobs.
env_17	`job_failure`	sysadmin	JOB, MON	A 4-node MPI job died after 6 h with exit 137 — suspected network-fault-induced SIGKILL.
env_18	`job_failure`	scientific_user	JOB	User `alice` resubmitted a long-running simulation but the checkpoint file is missing.
env_19	`energy_anomaly`	facility_admin	ENERGY	`gpu02` and `gpu03` allocated for 9 h but utilisation is ≈ 0 % — energy waste.
env_20	`multi_job_interference`	sysadmin	JOB, MON	Lustre I/O contention: a checkpoint job on nodes 1–4 is starving a science job on nodes 5–8.
env_21	`storage_management`	scientific_user, sysadmin, researcher, facility_admin, system_designer	DATA	Lustre quota pressure with per-user and per-project usage data; I/O metrics stub.
env_22	`facility_incident`	scientific_user, sysadmin, researcher, facility_admin, system_designer	FAC, ENERGY, DOCS	Cooling alarm response: CRAC-07 degraded in high-density GPU row C; BMS alarms and runbook.
env_23	`architecture_review`	sysadmin, researcher, facility_admin, system_designer	ARCH, PERF, DOCS	Capacity expansion planning: cluster topology, hardware inventory, and capacity planning guide.

Per-bundle file layout¶

Every bundle has the same shape (some optional files appear only when relevant to the scenario):

benchmark/environments/env_NN/
├── metadata.yaml                       Bundle metadata (env_id, scenario_type,
│                                       supported_roles, supported_categories,
│                                       included_sources)
├── manifest.txt                        Sorted list of all included files
├── slurm/
│   ├── slurm_state.json                Nodes, partitions, jobs (validated by SlurmState)
│   └── job_details.json                sacct-level details (when relevant)
├── telemetry/
│   ├── telemetry_timeseries.parquet    columns: timestamp, node_id, metric_name, value, unit
│   └── memory_events.csv               OOM / memory events (when relevant)
├── policy/
│   └── rbac_policy.yaml                Per-role permissions (schema v1.1)
├── docs/
│   ├── *.md                            User-facing knowledge documents
│   └── rbac_policy.md                  Auto-generated readable policy summary
└── incidents/
    └── incident_metadata.json          Incident timeline + affected resources

The metadata.yaml of every bundle is the authoritative source for what is in scope. To list every file manifest.txt records:

cat benchmark/environments/env_05/manifest.txt

Validating a bundle¶

aobench validate benchmark runs validate_bundle() from src/aobench/environment/snapshot_validator.py over every bundle. It checks JSON-schema conformance for slurm_state.json, incident_metadata.json, and rbac_policy.yaml, plus the parquet column schema for telemetry/*.parquet.

To regenerate a bundle (for example after editing the source CSVs):

python scripts/generate_bundles.py --env env_NN

scripts/generate_bundles.py also emits the auto-generated docs/rbac_policy.md per environment from the YAML policy.