Langfuse Integration Guide¶

Step-by-step plan for adding Langfuse observability to AOBench. Follow the phases in order; each phase is independently testable.

This page is the single Langfuse reference for AOBench. It covers what Langfuse is, the local docker-compose stack under docker/langfuse/, and the runtime exporter wired through the --langfuse flag.

Status¶

Phase	Title	Status
1	Stand up Langfuse locally	✅ Done — running at http://localhost:3000
2	Add SDK as optional dependency	✅ Done
3	Create `BaseExporter` ABC	✅ Done
4	Implement `LangfuseExporter`	✅ Done
5	Wire exporter into `BenchmarkRunner`	✅ Done
6	Add `--langfuse` CLI flag	✅ Done
7	Update Makefile & COMMANDS.md	✅ Done
8	Smoke-test end-to-end	☐ Todo — add `.env` keys then run `make run-langfuse`

Phase 1 — Stand up Langfuse locally¶

Goal: all 6 Langfuse v3 services running, UI reachable at http://localhost:3000, API keys in hand.

What gets started¶

Langfuse v3 is a multi-service stack (see docker/langfuse/docker-compose.yml):

Container	Role
`langfuse-web`	UI + REST API on port 3000
`langfuse-worker`	Background job processor
`postgres`	Project metadata, users, API keys
`clickhouse`	Trace + observation analytics store
`redis`	Task queue
`minio`	S3-compatible blob storage

Step 1.1 — Start the stack¶

make langfuse-up
# equivalent to: docker compose -f docker/langfuse/docker-compose.yml up -d

If a pull times out, which can happen with large images from Docker Hub, rerun make langfuse-up. Layers that were already downloaded are cached, so only the missing ones are retried.

Step 1.2 — Wait for healthy status¶

First startup takes 1–2 minutes while the services initialize and the ClickHouse schema migrations run.

# Watch until langfuse-web shows (healthy)
docker compose -f docker/langfuse/docker-compose.yml ps

Expected output when ready:

NAME                     STATUS
langfuse-web             Up X minutes (healthy)
langfuse-worker          Up X minutes
langfuse-clickhouse      Up X minutes (healthy)
langfuse-postgres        Up X minutes (healthy)
langfuse-redis           Up X minutes (healthy)
langfuse-minio           Up X minutes (healthy)
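
Instead of re-running docker compose ps by hand, the wait can be scripted. The following is a minimal sketch, not part of AOBench: the function name wait_for_langfuse and its parameters are hypothetical, and the /api/public/health path is an assumption that should be checked against the Langfuse version in use.

```python
import time
import urllib.error
import urllib.request


def wait_for_langfuse(base_url="http://localhost:3000", timeout_s=180.0,
                      interval_s=5.0, fetch=None, sleep=time.sleep,
                      clock=time.monotonic):
    """Poll the health endpoint until it answers 200 or the timeout expires.

    fetch, sleep and clock are injectable so the loop can be tested offline.
    """
    # Assumed health path; adjust if your Langfuse version exposes another one.
    url = base_url.rstrip("/") + "/api/public/health"

    def default_fetch(target):
        with urllib.request.urlopen(target, timeout=5) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = clock() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # service not reachable yet, keep polling
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_for_langfuse() with the defaults blocks for up to three minutes, which covers the 1–2 minute first startup described above.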

If langfuse-web keeps restarting, check its logs:

make langfuse-logs
# or: docker logs langfuse-web --tail 30

Step 1.3 — Create account and project¶

1. Open http://localhost:3000 in a browser.
2. Click Sign up (the first user becomes admin).
3. Create a new project and name it aobench.
4. Go to Settings → API Keys → Create new key.
5. Copy the Public Key (pk-lf-...) and Secret Key (sk-lf-...).

Step 1.4 — Add credentials to `.env`¶

In the AOBench repo root, copy .env.example to .env (if you haven't already) and fill in:

LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=http://localhost:3000

Verify: open http://localhost:3000 — you should see your aobench project dashboard.
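
A quick offline check of the .env contents can catch swapped or missing keys before the first run. This is a sketch under assumptions: parse_env and check_langfuse_env are hypothetical helpers, the parser only handles plain KEY=VALUE lines, and LANGFUSE_HOST is treated as required here because the local stack is not the SDK default host.

```python
REQUIRED = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_langfuse_env(values):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = [f"{name} is missing or empty"
                for name in REQUIRED if not values.get(name)]
    public_key = values.get("LANGFUSE_PUBLIC_KEY", "")
    secret_key = values.get("LANGFUSE_SECRET_KEY", "")
    # Key prefixes follow the pk-lf-... / sk-lf-... pattern shown in Step 1.3.
    if public_key and not public_key.startswith("pk-lf-"):
        problems.append("LANGFUSE_PUBLIC_KEY should start with pk-lf-")
    if secret_key and not secret_key.startswith("sk-lf-"):
        problems.append("LANGFUSE_SECRET_KEY should start with sk-lf-")
    return problems
```

For example, check_langfuse_env(parse_env(open(".env").read())) returns an empty list when all three variables are present and the key prefixes match.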

Useful commands¶

make langfuse-down        # stop all containers (data preserved)
make langfuse-logs        # stream logs from all containers
make langfuse-reset       # stop + wipe ALL data (fresh start)

Troubleshooting¶

TLS handshake timeout during image pull¶

Symptom: docker compose up fails with TLS handshake timeout on some images, even though curl https://registry-1.docker.io/v2/ works from the shell.

Cause: A WireGuard VPN (wg0, MTU 1420) is active. Docker's bridge defaults to MTU 1500. Packets larger than the VPN MTU get fragmented, breaking TLS inside the daemon.

Fix: Lower Docker's MTU to 1400 and set reliable DNS:

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "mtu": 1400,
  "dns": ["8.8.8.8", "1.1.1.1"]
}
EOF
sudo systemctl restart docker
make langfuse-up

Verify: docker pull redis:7-alpine should complete without timeout.
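
The tee command above replaces /etc/docker/daemon.json wholesale. If the file already holds other settings, merging the two keys may be safer. A minimal sketch of that merge, assuming the existing file is valid JSON (merge_daemon_config is a hypothetical helper, not part of the repo):

```python
import json


def merge_daemon_config(existing_text, mtu=1400, dns=("8.8.8.8", "1.1.1.1")):
    """Return daemon.json text with mtu and dns set, other keys preserved."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    config["mtu"] = mtu
    config["dns"] = list(dns)
    return json.dumps(config, indent=2) + "\n"
```

The returned text can be written back with sudo, followed by the same sudo systemctl restart docker as in the fix above.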

`langfuse-web` / `langfuse-worker` keep restarting — ClickHouse error¶

Symptom:

error: failed to open database: code: 139, message: There is no Zookeeper configuration
CREATE TABLE schema_migrations ON CLUSTER default ... Engine=ReplicatedMergeTree

Cause: Langfuse v3 creates tables using ReplicatedMergeTree ON CLUSTER default, which requires ZooKeeper or ClickHouse Keeper. A plain ClickHouse image has neither by default.

Fix: The AOBench compose mounts docker/langfuse/clickhouse-config.xml into ClickHouse, which enables the built-in ClickHouse Keeper (no ZooKeeper container needed) and defines the default single-node cluster. This file is already in the repo and mounted automatically.

If you see this error despite the config, wipe volumes and restart:

make langfuse-reset   # stops containers and deletes volumes
make langfuse-up

`ENCRYPTION_KEY` validation error¶

Symptom:

ENCRYPTION_KEY must be 256 bits, 64 string characters in hex format

Cause: Langfuse v3 validates the encryption key strictly. All-zero or placeholder keys are rejected.

Fix: The key in docker/langfuse/docker-compose.yml is already a valid random 256-bit hex key generated with openssl rand -hex 32. If you customise the compose file, generate a new key with:

openssl rand -hex 32

Replace the ENCRYPTION_KEY value in both langfuse-web and langfuse-worker environment sections with the same generated value.
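
Where openssl is not available, the same kind of key can be produced from Python. The sketch below only checks the format named in the error message (64 hex characters); it does not claim to reproduce Langfuse's own validation, and both function names are hypothetical.

```python
import re
import secrets

_HEX_256 = re.compile(r"[0-9a-fA-F]{64}")


def generate_encryption_key():
    """Return 32 random bytes as 64 hex characters, like openssl rand -hex 32."""
    return secrets.token_hex(32)


def looks_like_encryption_key(key):
    """Format check only: exactly 64 hexadecimal characters."""
    return _HEX_256.fullmatch(key) is not None
```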

Phase 2 — Add SDK as optional dependency¶

Files to change: pyproject.toml

Add langfuse to [project.optional-dependencies]:

[project.optional-dependencies]
langfuse = ["langfuse>=2.0,<3.0"]  # upper bound: Phase 4 uses the v2-style lf.trace(...) client API

Install locally:

pip install -e ".[langfuse]"

Verify:

python -c "import langfuse; print(langfuse.__version__)"
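
Because the SDK is an optional extra, code that needs it can fail with a clear message when it is absent rather than with a bare ImportError. A minimal sketch of such a guard (require_optional is a hypothetical helper, not an AOBench function):

```python
import importlib.util


def require_optional(module_name, extra):
    """Raise a readable error if an optional dependency is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"'{module_name}' is not installed; "
            f"run: pip install -e \".[{extra}]\""
        )
```

A call such as require_optional("langfuse", "langfuse") at the top of the exporter constructor would surface the install hint from this phase.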

Phase 3 — Create `BaseExporter` ABC¶

New file: src/aobench/exporters/base_exporter.py

The base class defines the contract all exporters must implement:

from abc import ABC, abstractmethod
from aobench.schemas.result import BenchmarkResult
from aobench.schemas.task import TaskSpec
from aobench.schemas.trace import Trace

class BaseExporter(ABC):
    @abstractmethod
    def export(self, trace: Trace, result: BenchmarkResult, task: TaskSpec) -> None:
        """Export one completed task run."""

    def flush(self) -> None:
        """Flush any buffered data (optional)."""

Also create src/aobench/exporters/__init__.py (empty or re-export BaseExporter).

Verify: python -c "from aobench.exporters.base_exporter import BaseExporter" succeeds.
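
To illustrate the contract, here is a sketch of a trivial exporter that keeps everything in memory, which can be handy in tests. It is self-contained, so it redefines a stand-in BaseExporter with Any in place of the aobench schema types; InMemoryExporter is a hypothetical class, not part of the repo.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseExporter(ABC):
    """Stand-in mirroring the contract above, with untyped arguments."""

    @abstractmethod
    def export(self, trace: Any, result: Any, task: Any) -> None:
        """Export one completed task run."""

    def flush(self) -> None:
        """Flush any buffered data (optional)."""


class InMemoryExporter(BaseExporter):
    """Collects exported runs in a list instead of sending them anywhere."""

    def __init__(self) -> None:
        self.records = []
        self.flush_count = 0

    def export(self, trace: Any, result: Any, task: Any) -> None:
        self.records.append((trace, result, task))

    def flush(self) -> None:
        self.flush_count += 1
```

Only export is abstract, so a subclass that has nothing to buffer can skip flush entirely and inherit the no-op.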

Phase 4 — Implement `LangfuseExporter`¶

New file: src/aobench/exporters/langfuse_exporter.py

Data mapping¶

AOBench field	Langfuse call	Notes
`trace.trace_id`	`lf.trace(id=...)`	Reuses AOBench ID
`trace.task_id`	`trace.name`	Human-readable name in UI
`trace.run_id`	`trace.session_id`	Groups all tasks in one run
`trace.role`	`trace.user_id`	Role as user identifier
`trace.adapter_name`, `model_name`	`trace.metadata`	Key-value dict
Each `TraceStep`	`trace.span(...)`	One span per step
`step.tool_call`	span `metadata`	Tool name + arguments
`step.observation`	span `output`	Tool result or error
LLM tokens (from Trace)	`trace.generation(...)`	One generation per run
`result.dimension_scores.*`	`trace.score(name, value)`	6 scores
`result.aggregate_score`	`trace.score("aggregate", value)`	Summary score

Implementation outline¶

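from aobench.exporters.base_exporter import BaseExporter


class LangfuseExporter(BaseExporter):
    def __init__(self, public_key, secret_key, host=None):
        # Imported lazily so AOBench still runs without the optional dependency
        from langfuse import Langfuse
        self._lf = Langfuse(public_key=public_key, secret_key=secret_key, host=host)

    def export(self, trace, result, task):
        lf_trace = self._lf.trace(
            id=trace.trace_id,
            name=trace.task_id,
            session_id=trace.run_id,
            user_id=trace.role,
            # Metadata keys follow the data mapping table above
            metadata={"adapter_name": trace.adapter_name, "model_name": trace.model_name},
            tags=[trace.role, task.qcat, task.difficulty],
        )

        # One span per agent step: observation as output, tool call as metadata
        for step in trace.steps:
            span = lf_trace.span(name=f"step-{step.step_id}")
            span.end(output=step.observation, metadata={"tool_call": step.tool_call})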

        # One generation for the overall LLM call
        if trace.total_tokens:
            lf_trace.generation(
                name="llm",
                model=trace.model_name,
                usage={"input": trace.prompt_tokens, "output": trace.completion_tokens},
            )

        # Attach scores
        for dim, value in result.dimension_scores.model_dump().items():
            if value is not None:
                lf_trace.score(name=dim, value=value)
        if result.aggregate_score is not None:
            lf_trace.score(name="aggregate", value=result.aggregate_score)

    def flush(self):
        self._lf.flush()

Verify: unit test with a mock Langfuse object.
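
The mock-based approach can be sketched without the SDK installed. The example below tests only the score-attachment logic through a stand-in function, attach_scores, that mirrors the loop in the outline; the function and the dimension names accuracy and safety are hypothetical placeholders, not AOBench identifiers.

```python
from unittest.mock import MagicMock, call


def attach_scores(lf_trace, dimension_scores, aggregate_score):
    """Stand-in for the score section of export(): skip None values."""
    for dim, value in dimension_scores.items():
        if value is not None:
            lf_trace.score(name=dim, value=value)
    if aggregate_score is not None:
        lf_trace.score(name="aggregate", value=aggregate_score)


def test_none_scores_are_skipped():
    lf_trace = MagicMock()  # plays the role of the object returned by lf.trace(...)
    attach_scores(lf_trace, {"accuracy": 0.9, "safety": None}, 0.8)
    assert lf_trace.score.call_args_list == [
        call(name="accuracy", value=0.9),
        call(name="aggregate", value=0.8),
    ]
```

The same pattern should carry over to the real exporter by replacing its Langfuse client with a MagicMock and asserting on the recorded calls.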

Phase 5 — Wire exporter into `BenchmarkRunner`¶

File to change: src/aobench/runners/runner.py

Add an optional exporter parameter to BenchmarkRunner.__init__:

def __init__(self, adapter, benchmark_root, output_root, exporter=None):
    ...
    self._exporter = exporter

After step 7 (write result to disk), add:

# 8. Export to observability backend (optional)
if self._exporter is not None:
    self._exporter.export(trace, result, task)

Call exporter.flush() after all tasks in run_all.

Verify: run with DirectQAAdapter + LangfuseExporter and check trace appears in UI.
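
The wiring can be pictured with a stripped-down runner. This is a sketch, not the actual BenchmarkRunner: RunnerSketch, its run_task callable, and RecordingExporter are hypothetical stand-ins. Placing flush in a finally block is one possible design choice, so buffered traces are still sent if a task raises.

```python
class RecordingExporter:
    """Test double that records which tasks were exported."""

    def __init__(self):
        self.exported = []
        self.flushes = 0

    def export(self, trace, result, task):
        self.exported.append(task)

    def flush(self):
        self.flushes += 1


class RunnerSketch:
    def __init__(self, run_task, exporter=None):
        self._run_task = run_task  # callable(task) -> (trace, result)
        self._exporter = exporter

    def run_all(self, tasks):
        results = []
        try:
            for task in tasks:
                trace, result = self._run_task(task)
                # Export is skipped entirely when no exporter is configured
                if self._exporter is not None:
                    self._exporter.export(trace, result, task)
                results.append(result)
        finally:
            # Flush once per run, even if a task failed part-way through
            if self._exporter is not None:
                self._exporter.flush()
        return results
```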

Phase 6 — Add `--langfuse` CLI flag¶

File to change: src/aobench/cli/run_cmd.py

Add to both run_task and run_all command signatures:

langfuse: Annotated[bool, typer.Option("--langfuse/--no-langfuse",
    help="Export traces and scores to Langfuse")] = False,

In the command body, before constructing BenchmarkRunner:

exporter = None
if langfuse:
    from aobench.exporters.langfuse_exporter import LangfuseExporter
    import os
    exporter = LangfuseExporter(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.environ.get("LANGFUSE_HOST"),
    )

runner = BenchmarkRunner(..., exporter=exporter)

After the run_all loop, flush:

if exporter:
    exporter.flush()

Verify:

aobench run task --task JOB_USR_001 --env env_01 --adapter direct_qa --langfuse
# Trace should appear in Langfuse UI
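
With os.environ[...] a missing key surfaces as a bare KeyError. One way to give a friendlier message is to load the settings through a small helper first. A sketch, assuming nothing beyond the three variables in this guide (LangfuseSettings and load_langfuse_settings are hypothetical names):

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LangfuseSettings:
    public_key: str
    secret_key: str
    host: Optional[str] = None  # None lets the SDK fall back to its default host


def load_langfuse_settings(env: Mapping[str, str]) -> LangfuseSettings:
    """Read Langfuse settings from an environment mapping, failing fast."""
    required = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("--langfuse requires " + ", ".join(missing) + " to be set")
    return LangfuseSettings(
        public_key=env["LANGFUSE_PUBLIC_KEY"],
        secret_key=env["LANGFUSE_SECRET_KEY"],
        host=env.get("LANGFUSE_HOST") or None,
    )
```

In the command body this would be called as load_langfuse_settings(os.environ), and the resulting fields passed to LangfuseExporter.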

Phase 7 — Update Makefile & COMMANDS.md¶

Makefile — add a convenience target:

run-langfuse: ## Run a task and export to Langfuse
    aobench run task \
        --task $(TASK) \
        --env $(ENV) \
        --adapter $(ADAPTER) \
        --langfuse \
        --no-report

docs/COMMANDS.md — document the new --langfuse option under run task and run all.

Phase 8 — Smoke-test end-to-end¶

Checklist:

[ ] docker compose -f docker/langfuse/docker-compose.yml ps: all Langfuse services up and healthy
[ ] .env contains the three Langfuse variables
[ ] pip install -e ".[langfuse]" succeeds
[ ] aobench run task --task JOB_USR_001 --env env_01 --adapter direct_qa --langfuse
[ ] Trace appears in Langfuse UI (http://localhost:3000)
[ ] Trace has correct session_id = run_id, name = task_id
[ ] Spans visible for each agent step
[ ] Six dimension scores attached to trace
[ ] aobench run all --adapter openai --langfuse — all tasks appear under one session
[ ] Unit tests pass: make test
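
The "all tasks appear under one session" check could also be scripted against the Langfuse REST API instead of the UI. The sketch below only builds the request. It assumes the public API accepts HTTP Basic auth (public key as username, secret key as password) and exposes GET /api/public/traces with a sessionId filter; both assumptions should be confirmed against the Langfuse API reference, and build_traces_request is a hypothetical helper.

```python
import base64
import urllib.parse
import urllib.request


def build_traces_request(host, public_key, secret_key, session_id):
    """Build a GET request listing traces for one session (run_id)."""
    query = urllib.parse.urlencode({"sessionId": session_id})  # assumed filter name
    url = f"{host.rstrip('/')}/api/public/traces?{query}"      # assumed endpoint path
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
```

Passing the result to urllib.request.urlopen and decoding the JSON body would then allow counting the traces returned for a given run_id.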

Environment Variables Reference¶

Variable	Required	Default	Description
`LANGFUSE_PUBLIC_KEY`	Yes (if --langfuse)	—	Project public key from Langfuse UI
`LANGFUSE_SECRET_KEY`	Yes (if --langfuse)	—	Project secret key from Langfuse UI
`LANGFUSE_HOST`	No	`https://cloud.langfuse.com`	Override for self-hosted instance

File Manifest¶

Files created or modified by this integration:

File	Action	Notes
`docker/langfuse/docker-compose.yml`	Create	Full v3 stack: postgres, clickhouse, redis, minio, web, worker
`docker/langfuse/clickhouse-config.xml`	Create	Enables ClickHouse Keeper + single-node cluster for Langfuse migrations
`src/aobench/exporters/__init__.py`	Create	Package init
`src/aobench/exporters/base_exporter.py`	Create	`BaseExporter` ABC
`src/aobench/exporters/langfuse_exporter.py`	Create	Langfuse implementation
`src/aobench/runners/runner.py`	Modify	Added `exporter=` param, step 8 calls `exporter.export()`
`src/aobench/cli/run_cmd.py`	Modify	`--langfuse/--no-langfuse` flag on `run task` and `run all`
`pyproject.toml`	Modify	`langfuse = ["langfuse>=2.0,<3.0"]` optional dep
`Makefile`	Modify	`langfuse-up/down/logs/reset` + `run-langfuse` + `run-all-langfuse` targets
`.env.example`	Modify	Added Langfuse env var template
`docs/reference/commands.md`	Modify	`--langfuse` flag + Langfuse Makefile targets documented
`docs/guides/langfuse-integration.md`	Create	This guide (also serves as the Langfuse reference)
`tests/unit/test_langfuse_exporter.py`	Create	10 unit tests (all passing)