PostHog: Debugging AI Agents

Learn how to find and fix bugs in AI features by observing agent output, evaluating performance, and surfacing issues in production.

PostHog AI Observability PostHog Evaluations Prompt Management Self-Driving Inbox MCP PostHog's AI Observability Evaluations

Overview

My apps have AI features, which I’m tracking in my observability tool, but they have bugs - how do I find them and fix them?

I’ll show you what our workflow is, at PostHog, to observe our AI features/agents, evaluate their output, surface problems, and fix them. We’ll see how we find issues that we might have never found otherwise, in both our products and agents.

Tech stack

PostHog AI Observability

PostHog AI Observability tracks LLM traces, spans, token costs, and latency directly alongside your core product analytics.

Engineering teams use PostHog AI Observability to debug and optimize LLM-powered features without managing a separate, expensive data silo. The platform captures every prompt, response, and tool call as a standard PostHog event, giving you instant visibility into model latency, errors, and token costs across providers like OpenAI and Anthropic. Because this data lives in your main product analytics suite, you can easily correlate LLM performance with real user behavior: linking high-latency responses directly to drop-offs in your conversion funnels.

https://posthog.com/product/ai-observability

View projects
PostHog Evaluations

Automated LLM-as-a-judge and code-based checks to score and monitor the quality of your generative AI outputs.

PostHog Evaluations gives product engineers a systematic way to grade generative AI outputs directly inside their existing analytics stack. By running automated LLM-as-a-judge templates (covering relevance, helpfulness, jailbreaks, hallucinations, and toxicity) alongside custom code-based Hog checks, teams can continuously score their LLM generations. It eliminates the need for separate, disconnected AI observability platforms by combining automated pass/fail testing with real-world user interaction data. This allows developers to catch bad responses, run targeted human reviews, and verify prompt changes without losing context.

https://posthog.com/docs/ai-engineering/evaluations

View projects
Prompt Management

A centralized, open-source LLMOps platform to version, test, and deploy AI prompts without redeploying application code.

Prompt Management treats prompts as decoupled, version-controlled software assets rather than hardcoded strings. By utilizing dedicated registries like Langfuse or Agenta, engineering teams can collaboratively design, test, and update prompts in production via API or SDK integrations. This setup eliminates the need for full CI/CD deployment cycles for minor phrasing tweaks: developers can hot-swap prompt templates, run parallel A/B tests across multiple LLMs (such as GPT-4o and Claude 3.5 Sonnet), and monitor prompt performance in real time while maintaining strict latency-optimized caching.

https://langfuse.com

View projects
Self-Driving Inbox

An open-source AI personal assistant that automates your email, blocks cold spam, and drafts context-aware replies in your personal style.

Managing email manually is a massive drag on daily productivity. Inbox Zero solves this by acting as an autonomous copilot for your email client: it reads incoming messages, automatically drafts replies based on your calendar availability, and files attachments directly to Google Drive or OneDrive. You write your instructions in plain English (for example: "Archive all cold pitches but flag urgent client issues"), and the system executes them. It is SOC 2 Type II certified, fully open-source, and supports self-hosting for teams that require complete control over their data privacy.

https://getinboxzero.com

View projects
MCP

MCP is the open-source standard for securely connecting AI agents (like LLMs) to external tools, data, and enterprise workflows.

The Model Context Protocol (MCP) functions as a standardized integration layer: think of it as a USB-C port for AI applications. Developed and open-sourced by Anthropic, this protocol allows large language models (LLMs) to access real-time context and execute actions via external tools like GitHub, Jira, or proprietary databases . It uses a simple JSON-RPC interface to define tools, schemas, and endpoints, which enables AI agents to perform complex, state-changing tasks—such as creating a GitHub issue or running a test script—rather than just generating text . MCP is essential for building agentic AI systems that can autonomously pursue goals and operate within defined safety and permission boundaries .

https://modelcontextprotocol.io/

View projects
PostHog's AI Observability

Track and analyze LLM applications with real-time tracing, cost calculation, and latency monitoring directly inside your core product analytics stack.

PostHog's AI Observability gives engineers x-ray vision into LLM applications by capturing traces, generations, and spans directly alongside user behavior data (no separate, siloed tools required). The platform automatically tracks prompt inputs, model outputs, token usage, and API costs across major providers like OpenAI and Anthropic. Because this data lives in PostHog, teams can instantly correlate LLM latency or hallucinations with real-world business impact: linking a slow model response to user drop-off in a conversion funnel, or jumping straight from a failed LLM call to a visual Session Replay of the exact user experience.

https://posthog.com/docs/ai-observability

View projects
Evaluations

DeepEval is the open-source LLM evaluation framework: it functions as a Pytest-like unit testing tool for validating large language model outputs with programmatic rigor.

Evaluations, specifically via the DeepEval framework, provide the necessary structure for systematic LLM testing. This open-source tool integrates directly into your CI/CD pipeline, acting like a specialized Pytest for AI applications. It leverages over 50 research-backed metrics—including G-Eval, RAGAS, and Hallucination checks—to score model performance on specific criteria. Developers define test cases, run the evaluation, and receive concrete metrics to prevent regressions, ensuring model reliability before deployment.

https://deepeval.com/

View projects