Agent Behaviors¶
The KubeIntellect coordinator implements five additive behaviors that shape how it investigates Kubernetes issues. Each is feature-flagged in Configuration → Agent behavior flags.
| Behavior | Default |
|---|---|
| kubectl error interpreter | on |
| Snapshot sufficiency gate | lenient |
| Gather-then-conclude discipline | always on |
| Playbook library | on |
| Visible investigation plan | on |
kubectl error interpreter¶
When kubectl exits non-zero, the tool layer scans stderr for known patterns
(NotFound, Forbidden, connection refused, missing CRD, immutable field, …) and
appends a single-line hint after the original error. The LLM sees both — the
raw error is never replaced.
Why. Stops the agent from looping on errors it could have skipped. Example:
Error from server (NotFound): pods "payments-1" not found
→ Pod may have been rescheduled — re-run `kubectl get pods -n <ns>` to find the new name.
Disable: KUBECTL_ERROR_HINTS_ENABLED=false.
Snapshot sufficiency gate¶
context_fetcher runs at the start of every turn and pre-fetches a cluster
snapshot (pod list + Warning events). C2 adds a soft prompt bias: when the
snapshot is healthy and the user asks a list-shaped, read-only question, the
coordinator is encouraged to answer from the snapshot without an extra
kubectl get pods.
Always falls back to fresh data when:
- The question targets a specific named pod / deployment / service.
- The user asks about logs, metrics, history, "yesterday", "last N hours".
- The coordinator just performed a mutation (verifies via fresh get).
- The query contains now / right now / currently.
- The snapshot is older than SNAPSHOT_FRESHNESS_SECONDS (default 30s).
Why a soft bias, not a hard gate? Pod state changes fast in Kubernetes; a hard gate could return stale answers. The soft bias only fires for clean snapshots and list-shaped questions, with explicit always-fetch escape hatches.
Modes: off (no bias — pre-C2 behavior), lenient (default — bias only
when truly applicable), strict (aggressive bias — opt in for trusted
deployments).
Set: SNAPSHOT_SUFFICIENCY_MODE=off|lenient|strict,
SNAPSHOT_FRESHNESS_SECONDS=30.
Gather-then-conclude discipline¶
A prompt-only directive: when tools are needed, follow PLAN → FETCH → SYNTHESIZE. Emit all independent tool calls in a single response (parallel), then synthesize once. Never interleave partial answers with more tool calls.
Exception: sequential dependencies (e.g. find a pod's name → describe that pod) are allowed; even then, gather everything else in parallel at each step.
This is always on — it is part of the coordinator's core system prompt.
Playbook library¶
For the top recurring Kubernetes failure modes, KubeIntellect ships a YAML
playbook with a deterministic investigation sequence. When context_fetcher
detects a matching pattern in the snapshot, the coordinator's system prompt
includes the playbook(s) inline — guiding it to follow proven steps before
improvising.
Playbooks shipped (10):
CrashLoopBackOffOOMKilledImagePullBackOff/ErrImagePullPendingInsufficientResourcesPendingSchedulingConstraints(taints / affinity / nodeSelector)CreateContainerConfigErrorContainerCreatingStuck(volume / CSI)TerminatingStuck(finalizers)ReadinessProbeFailing(also covers liveness)ServiceUnreachable
Schema (drop a YAML file into app/agent/playbooks/):
name: <unique pattern name>
triggers:
- pod_status_regex: "<regex on STATUS column>"
- event_reason_regex: "<regex on Warning event REASON>"
- event_message_regex: "<regex on Warning event MESSAGE>"
investigation_steps:
- "<imperative step 1>"
- "<imperative step 2>"
expected_evidence:
- "<what to look for>"
recommended_fix_template: |
<multi-line fix template; placeholders welcome>
A playbook matches if any of its triggers matches. The coordinator still has agency — it can deviate when the situation warrants — but the playbook gives it a strong default.
Disable: PLAYBOOKS_ENABLED=false.
Visible investigation plan¶
For queries requiring three or more tool calls, the coordinator writes its plan as the first line of the response:
INVESTIGATION_PLAN:
- Check pod status in default namespace
- Describe the crashing pod
- Query Loki for errors in the last 30m
- Propose a fix
The plan block is parsed out of the message body and emitted as a structured
PlanEvent on the SSE stream. UI clients (kube-q, browsers) can render it as a
checklist; Langfuse traces show it for post-mortem review.
Why. Makes multi-step investigations transparent and gives the agent an anchor to stay on-track. Trivial single-call queries skip the plan (threshold: ≥ 3 steps).
Disable: INVESTIGATION_PLAN_ENABLED=false.
How they compose¶
A typical investigation of a CrashLoopBackOff pod, with all behaviors on:
- Snapshot gate —
context_fetcherbuilds the snapshot, sees the unhealthy pod, setssnapshot_has_issues=trueand matches theCrashLoopBackOffplaybook. - The coordinator system prompt now includes the snapshot, the playbook details (describe → previous logs → events), and the snapshot sufficiency block (which won't fire because issues are present — we always fetch when unhealthy).
- Investigation plan — for a 3+ step query, the coordinator emits
INVESTIGATION_PLAN: …first; UI shows the checklist. - Parallel discipline — coordinator emits all independent tool calls in one response.
- Error interpreter — if any kubectl call returns a known error pattern, the hint is appended before the LLM sees it, avoiding retry loops.
- Final answer references each plan step and proposes a fix from the
playbook's
recommended_fix_template.
Each phase can be flipped independently if you need to roll one back.