PhD · AI Systems · HPC + Cloud

Mohsen Seyedkazemi Ardebili

I build autonomous AI systems that act on infrastructure — not just explain it.

About

Platform engineer and researcher at the University of Bologna specialising in MLOps, AIOps, and Kubernetes-native infrastructure. I architect production-grade LLM agent systems, anomaly detection pipelines, and hybrid HPC/cloud orchestration frameworks for datacenter-scale environments.

Seven years of hands-on systems and network engineering in mission-critical industrial environments (power plants, enterprise OT networks), followed by a PhD in High-Performance Computing, give me a lens most ML researchers lack: I care about uptime, observability, and correctness in production, not just benchmark performance.

My current focus is autonomous infrastructure control: systems that detect failures, reason about root causes, and propose or execute remediation with human approval gates.

7 Years critical infra ops
PhD HPC Systems · UniBo
22 Papers published
179 Scholar citations

Projects

GRAAFE

Graph anomaly anticipation framework for exascale HPC — topology-aware node failure prediction on Tier-0 CINECA supercomputer (Marconi100).

0.91 AUC (1h forecast)
1000+ Production nodes
  • Rack-level Graph Convolutional Networks (GCN)
  • Integrates ExaMon telemetry (IPMI, Ganglia, Nagios)
  • <0.2% CPU overhead, ~116 full-rack inferences/hour
Python · GCN · Prometheus · HPC

HazardNet

Thermal hazard prediction framework for datacenters — multi-modal deep learning for sub-100ms thermal failure forecasting at minimal runtime overhead.

<100 ms Inference latency
F1 0.99 Detection accuracy
  • Temporal Convolutional Networks + LSTM ensemble
  • LIME/SHAP explainability for operator adoption
  • Validated on thousands of production HPC nodes
Python · PyTorch · TCN · LSTM

AOBench

Agent Operations Benchmark — evaluates AI agents on HPC operational tasks with role-aware, permission-enforced, trace-based scoring against deterministic environment snapshots. No live cluster required.

80 Benchmark tasks
26 Snapshot environments
  • 10 HPC task categories × 5 operator roles (SLURM, telemetry, RBAC, energy)
  • Trace-based scoring across 6 dimensions with 12 scorers
  • OpenAI, Anthropic, and MCP adapters — reproducible & publishable results
Python · LLM Eval · HPC · FastAPI · MCP

kube_q

CLI and Python SDK for interacting with KubeIntellect from the terminal — streaming responses, Rich TUI, and pipeline-friendly output formats.

  • Full KubeIntellect API coverage via CLI
  • Streaming output with Rich formatting
  • Scriptable for CI/CD and automation workflows
Python · Rich · CLI

Research

179 citations
h-index 7
22 publications
Google Scholar ↗
EU Projects: DECICE · Graph-Massivizer · EUROPEAN PILOT · REGALE · EPI SGA1 · SEANERGYS
SC'23 Workshops · 2023 · 21 citations

PM100: A Job Power Consumption Dataset of a Large-Scale Production HPC System

Future Generation Computer Systems · 2024

HazardNet: Thermal Hazard Prediction Framework for Datacenters

ACM Computing Frontiers · 2022

Multi-level Anomaly Prediction in Tier-0 Datacenter

Academic Service

Program Committee

PDP 2025 · PDP 2026 · AsHES 2026

Reviewer

IEEE TCAD · FGCS · Journal of Grid Computing · SC · ACM CF · DATE · PDP · AsHES

Supervision

2 PhD co-advisees (ongoing) · 5 MSc theses · Lab of Big Data Architectures, UniBo (2020–2024)

Stack

Platform & Infrastructure

Kubernetes · Helm · Terraform · Azure · Docker · Linux · NGINX

AI / ML

Python · PyTorch · LangGraph · FastAPI · Kubeflow · MLflow · ONNX

HPC

Slurm · Apptainer · MPI · OpenMP · Prefect

Observability

Prometheus · Grafana · Loki · OpenTelemetry · Alertmanager

Get in Touch

Open to research collaborations, open-source contributions, and industry partnerships in AI infrastructure and HPC systems.