Research Fellow · University of Bologna

Mohsen Seyedkazemi Ardebili

I build autonomous AI that acts on infrastructure.

Not a chatbot that explains your cluster — systems that observe it, reason about failures, and execute remediation, with a human at the gate. LLM agents for Kubernetes & HPC, anomaly detection on Tier-0 supercomputers, and the MLOps to run it in production.

See the systems · GitHub · Scholar

LOC Bologna, IT
EXP 7 yrs critical infra → PhD HPC
OSS KubeIntellect · AOBench · YazSes

Deep dives: KubeIntellect · NovaFabric · AOBench

control_loop.svc live

telemetry in → root-cause reasoning → gated action → back to telemetry
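The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind any system on this page: the `Action` record, the toy `diagnose` rule, and the `approve` callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """A proposed remediation (hypothetical structure)."""
    description: str
    mutating: bool


def diagnose(telemetry: dict) -> Optional[Action]:
    """Toy root-cause rule standing in for the reasoning step."""
    if telemetry.get("memory_mib", 0) > telemetry.get("limit_mib", float("inf")):
        return Action("raise memory limit", mutating=True)
    return None


def control_step(telemetry: dict, approve: Callable[[Action], bool]) -> str:
    """Run one pass: telemetry in, reasoning, then a gated action."""
    action = diagnose(telemetry)
    if action is None:
        return "healthy"
    # Mutating actions never run without an explicit approval.
    if action.mutating and not approve(action):
        return "rejected: " + action.description
    return "executed: " + action.description
```

In a real loop the returned status would feed back into the next telemetry read; the point of the sketch is only that the gate sits between reasoning and execution.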

// profile

available for collaboration

From a power plant's networks to the control plane of a datacenter.

I spent seven years as the IT & Network Administrator of a 1,000+ MW combined-cycle power plant — running the enterprise IT and network infrastructure across eleven operational zones, where downtime is not an abstraction. Then I did a PhD in High-Performance Computing at the University of Bologna.

That path gives me a lens most ML researchers don't have: I care about uptime, observability, and correctness in production — not just benchmark numbers. Today I design autonomous control for infrastructure: systems that detect failures, reason about root cause, and propose or execute remediation behind human-approval gates.

I'm a Postdoctoral Research Fellow at DEI, University of Bologna, working across EU Horizon projects (I led the UNIBO contribution to DECICE; I now lead SEANERGYS work on ExaMLOps). And I ship — open-source tools and products that real people run.

// systems — research-grade work, in the open

Things that act, with the receipts.

KubeIntellect · flagship

A modular, LLM-orchestrated multi-agent framework for end-to-end Kubernetes operations — root-cause analysis, diagnosis, and human-gated cluster actions across the full API surface (read, write, exec, delete, RBAC, lifecycle). A stateful LangGraph supervisor coordinates domain agents; a Code-Generator agent synthesises and validates new tools at runtime. Accepted to the Journal of Grid Computing.

Stateful LangGraph supervisor + PostgreSQL checkpoints
Human-in-the-loop approval on every mutating operation
Runtime tool synthesis with AST validation & sandboxing
Deployed on Azure AKS — OpenAI-compatible FastAPI backend
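To give a feel for what AST validation of a synthesised tool can look like, here is a minimal sketch using Python's standard `ast` module. The deny-lists and the `validate_tool_source` function are assumptions made for illustration; they are not KubeIntellect's actual policy or API.

```python
import ast

# Hypothetical policy: names a generated tool may not import or call.
FORBIDDEN_IMPORTS = {"os", "subprocess", "socket"}
FORBIDDEN_CALLS = {"eval", "exec", "__import__"}


def validate_tool_source(source: str) -> list[str]:
    """Return a list of policy violations; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append(f"forbidden import: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"forbidden call: {node.func.id}")
    return violations
```

A static check like this only rejects code before it runs. It does not replace sandboxing, which is why the two appear together in the list above.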

Python · LangGraph · FastAPI · Kubernetes · PostgreSQL · Azure AKS

93% query resolution · 81.8% tool-synthesis success · +25 pp over tool-less GPT-4o

$ kq "why is payments-api crashlooping?"
→ inspecting pods, events, logs…
root cause: OOMKilled — memory
  limit 256Mi exceeded on restart.
proposed fix: raise limit to 512Mi
approve? [y/N] ▍

AOBench

Agent Operations Benchmark — the first trace-driven, role-aware, RBAC-enforced benchmark for LLM agents in HPC facilities. Verdict: no system evaluated is deployment-ready for autonomous operations.

80 tasks · 23 snapshots · 16 systems

5 operator roles × 10 task categories × 3 difficulty tiers
Completion-under-Policy scorer hard-fails on any RBAC violation
CLEAR scorecard: Cost · Latency · Efficacy · Assurance · Reliability
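The hard-fail rule can be illustrated with a small scorer. This is a sketch under assumptions: the `Trace` record, the two roles, and the permission table are hypothetical and do not reproduce the AOBench schema or its five operator roles. The command names are ordinary SLURM commands used only as example actions.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One agent run (hypothetical record, not the AOBench schema)."""
    role: str
    actions: list[str] = field(default_factory=list)
    task_completed: bool = False


# Hypothetical role-to-permission table.
ALLOWED = {
    "viewer": {"squeue", "sinfo"},
    "admin": {"squeue", "sinfo", "scancel", "scontrol"},
}


def completion_under_policy(trace: Trace) -> float:
    """Score 1.0 only if the task completed and every action was permitted."""
    permitted = ALLOWED.get(trace.role, set())
    if any(action not in permitted for action in trace.actions):
        return 0.0  # hard fail: a single RBAC violation voids the run
    return 1.0 if trace.task_completed else 0.0
```

The design choice worth noting is the ordering: the policy check runs before task completion is even considered, so an agent cannot earn credit for a result it reached by exceeding its role.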

Python · LLM Eval · MCP · SLURM

GRAAFE

Graph anomaly-anticipation for exascale HPC — topology-aware node-failure prediction running in production on CINECA's Tier-0 Marconi100, published in FGCS.

0.91 AUC · 49 racks · ~1000 nodes

Rack-level Graph Convolutional Networks over live telemetry
Integrates ExaMon (IPMI, Ganglia, Nagios) at <0.2% CPU overhead

Python · GCN · Prometheus · HPC

HazardNet

Thermal-hazard prediction for datacenters — multi-modal deep learning forecasting thermal failures fast enough to act, with explanations operators trust.

<100 ms inference · 0.99 F1 · <0.2% CPU

Temporal Convolutional Network + LSTM ensemble
LIME / SHAP explainability for operator adoption

Python · PyTorch · TCN · LSTM

ExaMLOps · SEANERGYS · lead

An end-to-end MLOps platform for HPC workload management — a model zoo with auto-discovery training, multi-stage lifecycle, and multi-model serving, all behind an operator approval gate.

Prefect training · MLflow lifecycle · Ray Serve + MinIO serving
Observability: Prometheus, Grafana, Loki, Tempo
REST control plane, web dashboard, exa CLI, NL agent

Prefect · MLflow · Ray Serve · Kubernetes

kube_q

The operator's companion for KubeIntellect — a CLI and Python SDK (kq) exposing the full agent API with streaming output and a human-approval UX, built for CI/CD.

Full KubeIntellect API coverage from the terminal
Streaming Rich TUI · pipeline-friendly output

Python · Rich · CLI · SDK

// ships — products under NovaFabric

Research is half of it. I also ship.

Privacy-first, local-first tools — built to be installed and used, not just cited.

NovaFabric · replay fabric

The reproducibility and trust layer for AI systems — an open-source, self-hosted toolkit that turns any agent or model run into a portable, signed, replayable evidence capsule, captured with no code changes. Observability tells you what happened; NovaFabric tells you what would happen if you ran it again, today.

Zero-instrumentation capture · four honest replay modes (exact / mocked / semantic / forensic)
Cryptographic seal: DSSE signature + RFC 3161 timestamp + append-only Merkle log
Structured run-to-run diff as a CI regression gate; signed evidence bundles verify without NovaFabric installed
Runs offline from a laptop to an HPC cluster — no cloud, no account
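The append-only Merkle log idea can be shown with a short sketch: every entry is hashed into a tree, and the root changes if any past entry is altered or a new one is appended. This uses a simple pairwise tree with SHA-256 and distinct leaf and node prefixes; the construction and the `merkle_root` name are assumptions for illustration, not NovaFabric's actual log format.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Root hash over log entries; leaves and inner nodes use distinct prefixes."""
    if not entries:
        return _h(b"")
    level = [_h(b"\x00" + entry) for entry in entries]
    while len(level) > 1:
        nxt = []
        # Hash adjacent pairs into their parent node.
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]
```

Publishing or signing only the root is enough to commit to the whole history, which is what lets a verifier detect tampering without trusting the machine that produced the log.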

Python · Go · OpenTelemetry GenAI · Sigstore · OPA/Rego · Apache-2.0

YazSes (formerly NovaVoice) · shipping · v1.2.0

Hold a key, speak, release — fully on-device voice dictation that types into any app and runs voice commands. No cloud, no subscription, no data leaving your machine. Shipped and maintained across multiple releases.

Dual STT: Moonshine for fast commands, Whisper for long-form accuracy
20 built-in tools via local LLM (Qwen3) with grammar-constrained tool-calling
Editor context (Neovim, VS Code); accessibility & screen-reader aware
Linux · macOS · Windows — APT, Homebrew, winget, Snap, AUR, pipx

Python · faster-whisper · llama.cpp · Apache-2.0

vision2prod

A meta-framework for taking a vague idea to production without losing context, evidence, or decision rationale — for humans and AI agents alike.

4-stage pipeline · 657 tests

VisionForge → SOTAForge → DesignForge → BuildForge
Deterministic verification gates: schema, evidence, traceability — no LLM in the gate
v2p CLI + FastAPI/HTMX governance portal
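A deterministic gate of the kind listed above can be as plain as a pure function over an artifact. The sketch below is an assumption-laden illustration: the required field names, the `gate` function, and the artifact shape are hypothetical and are not taken from vision2prod.

```python
REQUIRED_FIELDS = {"id", "claim", "evidence", "traces_to"}  # hypothetical schema


def gate(artifact: dict, known_ids: set[str]) -> list[str]:
    """Return failures for one artifact; the same input always gives the same result."""
    failures = []
    # Schema: every required field must be present.
    missing = sorted(REQUIRED_FIELDS - set(artifact))
    if missing:
        failures.append("schema: missing " + ", ".join(missing))
    # Evidence: at least one item must be attached.
    if not artifact.get("evidence"):
        failures.append("evidence: none attached")
    # Traceability: every upstream reference must resolve to a known artifact.
    dangling = sorted(set(artifact.get("traces_to", [])) - known_ids)
    if dangling:
        failures.append("traceability: unknown " + ", ".join(dangling))
    return failures
```

Because the function has no model call, no clock, and no network access, a failing gate can be reproduced exactly from the artifact alone.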

Python · FastAPI · HTMX · pytest

// research

Peer-reviewed, in real venues.

23+ publications

Google Scholar · ORCID

EU Horizon: DECICE (lead) · SEANERGYS (lead) · Graph-Massivizer · EUROPEAN PILOT · REGALE · EPI SGA1

Journal of Grid Computing · 2026 (accepted) · KubeIntellect: A Modular LLM-Orchestrated Agent Framework for End-to-End Kubernetes Management
Nature Scientific Data · 2023 · M100 ExaData: A Data Collection Campaign on CINECA's Marconi100 Tier-0 Supercomputer
SC'23 Workshops · 2023 · PM100: A Job Power Consumption Dataset of a Large-Scale Production HPC System
Future Generation Computer Systems · 2024 · GRAAFE: Graph Anomaly Anticipation Framework for Exascale HPC Systems
Future Generation Computer Systems · 2024 · HazardNet: Thermal Hazard Prediction Framework for Datacenters
ACM Computing Frontiers · 2022 · Multi-level Anomaly Prediction in Tier-0 Datacenter

Program committee

PDP 2025 · PDP 2026 · AsHES 2026

Reviewer

IEEE TCAD · FGCS · J. Grid Computing · SC · ACM CF · DATE · PDP · AsHES

Supervision

2 PhD co-advisees · 5 MSc theses · Lab of Big Data Architectures, UniBo

// stack

The toolchain behind the work.

Agentic AI / ML

Python · PyTorch · LangGraph · LangChain · MCP · A2A · Anthropic SDK · FastAPI

MLOps & serving

Prefect · MLflow · Ray Serve · MinIO · Kubeflow · GitHub Actions · GitLab CI

Cloud-native & infra

Kubernetes · Helm · Terraform · Docker · Azure AKS · NGINX · Linux

Observability

Prometheus · Grafana · Loki · Tempo · OpenTelemetry · Alertmanager

HPC

SLURM · MPI · OpenMP · Apptainer · ExaMon

Foundations

Networking · VMware · Active Directory · SDN · Bash · Git

// contact

Let's build infrastructure that runs itself.

Open to research collaborations, open-source work, and industry partnerships in AI infrastructure, HPC, and autonomous operations.

Email · LinkedIn · GitHub · Scholar · ORCID · UNIBO