
Chat with your Kubernetes cluster in plain English

KubeIntellect is an AI-powered Kubernetes management platform. Describe a problem — a CrashLoopBackOff, a pending pod, an RBAC error — and a multi-agent LLM system diagnoses it, proposes a fix, shows you a dry-run diff, and waits for your approval before touching anything.

License: AGPL v3 · Python 3.12+



Why KubeIntellect?

  • Root Cause Analysis


    Correlates logs, metrics, events, and resource config in parallel — not just a log dump. Surfaces the actual cause, not a list of things to check.

  • Human-in-the-Loop by Design


    Every write operation produces a dry-run diff and pauses for your explicit approval. No unreviewed changes on live clusters — ever.

  • Multi-Agent Orchestration


    14 specialized agents (Logs, RBAC, Metrics, Security, Lifecycle, CodeGenerator, …) routed by a Supervisor LLM via LangGraph StateGraph.

  • Dynamic Tool Generation


    Need a capability that doesn't exist yet? KubeIntellect generates Python tools, sandboxes them, and registers them for reuse — all with your approval.

  • Persistent Memory


    Per-user context: sticky namespace, routing lessons, 30 pre-seeded failure patterns, and preference learning across sessions.

  • LLM-Agnostic


    Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Ollama, LiteLLM — swap providers with a single env var.
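As a rough illustration of what switching providers with one environment variable can look like, here is a minimal sketch. The variable name `LLM_PROVIDER` and the provider keys are assumptions made for this example, not KubeIntellect's documented configuration; check `.env.example` for the real names.

```python
import os
from typing import Mapping

# Hypothetical keys for the providers listed above; the real spellings may differ.
SUPPORTED_PROVIDERS = (
    "azure", "openai", "anthropic", "gemini", "bedrock", "ollama", "litellm",
)


def resolve_provider(env: Mapping[str, str] = os.environ,
                     var: str = "LLM_PROVIDER") -> str:
    """Return the normalized provider key named by a single environment variable."""
    raw = env.get(var, "").strip().lower()
    if not raw:
        raise ValueError(f"{var} is not set")
    if raw not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"unsupported provider {raw!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    return raw
```

Passing the environment as a mapping keeps the function easy to test: `resolve_provider({"LLM_PROVIDER": "ollama"})` returns `"ollama"` without touching the real process environment.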


See It in Action

> Why is my payment-api pod crashing in the prod namespace?

Fetching logs, events, and resource config in parallel...

Root cause: OOMKilled — container hit the 256Mi memory limit.
Last 3 events: BackOff restarts (5m ago, 3m ago, 1m ago).

Recommendation: increase memory limit to 512Mi.
Dry-run diff ready. Confirm to apply? [approve / deny]

> Who has cluster-admin access?

3 service accounts hold cluster-admin:
  • default/tiller           (inactive — consider revoking)
  • kube-system/admin-sa     (system — expected)
  • kubeintellect/core-sa    (this platform — expected)

Tiller is a known attack surface. Show remediation steps?

> I need a tool that shows pods sorted by restart count

No matching tool in registry. Generating...

[HITL] Review generated code before registration? [approve / deny]
[approve]

Tool registered as 'list_pods_by_restart_count'. Running now:
  pod/api-6d4f9b       14 restarts
  pod/worker-2          3 restarts
  pod/scheduler-1       0 restarts

> Scale down all staging deployments to 0 replicas overnight

I'll scale the following (dry-run diff attached):
  api      3 → 0 replicas
  worker   2 → 0 replicas
  cron     1 → 0 replicas

Confirm? [approve / deny]
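To make the "dry-run diff" in the first exchange concrete, the sketch below renders a unified diff for the 256Mi to 512Mi memory change using Python's standard `difflib`. It illustrates the idea only: the manifest fragment, the resource name `deployment/payment-api`, and the function name are assumptions, and KubeIntellect's actual diff format may differ.

```python
import difflib


def render_dry_run_diff(before: str, after: str, resource: str) -> str:
    """Render a unified diff between the live and the proposed manifest text."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{resource} (live)",
        tofile=f"{resource} (proposed)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical manifest fragment matching the OOMKilled example above.
LIVE = """resources:
  limits:
    memory: 256Mi
"""
PROPOSED = LIVE.replace("256Mi", "512Mi")

diff_text = render_dry_run_diff(LIVE, PROPOSED, "deployment/payment-api")
print(diff_text)
```

Nothing is applied here; the diff is only text for a reviewer to approve or deny. An empty string means the proposed manifest is identical to the live one.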

Architecture at a Glance

User query
  → Memory Orchestrator (reflections + failure hints + user prefs + registered tools ≤ 550 tokens)
  → Supervisor LLM     (LangGraph StateGraph routing)
  → Specialized agents (ReAct loops → Kubernetes API)
  → HITL gate          (diff + approval for all write operations)
  → Streaming SSE response

Layer                 Technology
Orchestration         LangGraph StateGraph
API Server            FastAPI + Server-Sent Events
State & Checkpoints   PostgreSQL (LangGraph checkpointer)
Chat History          MongoDB (LibreChat)
Dynamic Tools         PVC + PostgreSQL registry
Observability         Langfuse · Prometheus · Loki
Frontend              LibreChat
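The routing and approval steps in the flow above can be sketched in plain Python. This is a simplified stand-in, not the project's code: keyword matching replaces the Supervisor LLM, no LangGraph API is used, and the keyword-to-agent mapping, `Action`, `route`, and `hitl_gate` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    is_write: bool  # True for anything that would modify the cluster


# Keyword routing stands in for the Supervisor LLM. Agent names come from the
# feature list; which keyword maps to which agent is an assumption.
ROUTES = {"log": "Logs", "rbac": "RBAC", "scale": "Lifecycle"}


def route(query: str) -> str:
    """Pick an agent for the query, or 'unrouted' if no keyword matches."""
    lowered = query.lower()
    for keyword, agent in ROUTES.items():
        if keyword in lowered:
            return agent
    return "unrouted"


def hitl_gate(action: Action, approve: Callable[[Action], bool]) -> str:
    """Let reads through; run writes only after explicit approval."""
    if not action.is_write:
        return "executed"
    return "executed" if approve(action) else "denied"
```

For example, `hitl_gate(Action("scale api 3 -> 0", True), lambda a: False)` returns `"denied"`, while a read-only action is executed without consulting the approver.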

Full Architecture Diagrams


Explore the Documentation

  • Getting Started


    Deploy KubeIntellect locally with Kind or to Azure AKS. Full prerequisites, credentials, and Helm walkthrough.

  • Architecture


    Deep-dive into the multi-agent system, the supervisor routing logic, tool design patterns, and all storage layers.

  • Flowcharts


    Interactive Mermaid diagrams — system overview, supervisor flow, CodeGenerator pipeline, and complete workflow topology.

  • Operations


    Deployment runbooks, known issues, troubleshooting guides, observability stack setup, and backup / restore procedures.

  • Security


    CodeGenerator sandbox (AST + exec timeout + SHA-256), RBAC model, secret handling policy, and GDPR compliance.

  • HITL Workflow


    How human-in-the-loop approval works — breakpoints, checkpoint/resume cycle, and the API contract.

  • Observability


    Langfuse LLM tracing, Prometheus metrics, Loki log aggregation, and self-hosted stack configuration.
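For a feel of what the "AST + SHA-256" part of the CodeGenerator sandbox might involve, here is a minimal sketch that walks generated source for blocked imports and calls and then fingerprints it. The deny lists and function names are assumptions for illustration, the exec timeout is left out, and the Security page remains the authority on what the real sandbox checks.

```python
import ast
import hashlib

# Hypothetical deny lists; the real policy is described in the Security docs.
BLOCKED_MODULES = {"os", "subprocess", "socket"}
BLOCKED_CALLS = {"eval", "exec", "__import__"}


def check_source(source: str) -> list[str]:
    """Return policy violations found by walking the parsed AST."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        violations += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            violations.append(f"blocked call: {node.func.id}")
    return violations


def fingerprint(source: str) -> str:
    """SHA-256 hex digest of the source, usable to detect later tampering."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()
```

A static check like this is easy to evade on its own (for instance through attribute access or dynamic imports), which is presumably why it is paired with an execution timeout and human review before registration.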


Quick Start

# 1. Clone & configure
git clone https://github.com/MSKazemi/kubeintellect
cd kubeintellect
cp .env.example .env       # fill in LLM credentials

# 2. Deploy to local Kind cluster (full setup in one command)
make kind-kubeintellect-clean-deploy

# 3. Access the UI
make port-forward-librechat  # → http://localhost:3080

Fastest path

Run make kind-kubeintellect-clean-deploy — it creates the Kind cluster, generates secrets from .env, builds the image, and deploys via Helm. Total time: ~5 minutes on first run.

See Installation for Azure AKS, N1, or other targets.