Finding problems with your agents in production | Montreal

June 17, 2026 · Montreal

PostHog: Debugging AI Agents

Learn how to find and fix bugs in AI features by observing agent output, evaluating performance, and surfacing issues in production.

Tech stack
  • PostHog AI Observability
    PostHog AI Observability tracks LLM traces, spans, token costs, and latency directly alongside your core product analytics.
    Engineering teams use PostHog AI Observability to debug and optimize LLM-powered features without managing a separate, expensive data silo. The platform captures every prompt, response, and tool call as a standard PostHog event, giving you instant visibility into model latency, errors, and token costs across providers like OpenAI and Anthropic. Because this data lives in your main product analytics suite, you can easily correlate LLM performance with real user behavior: linking high-latency responses directly to drop-offs in your conversion funnels.
  • PostHog Evaluations
    Automated LLM-as-a-judge and code-based checks to score and monitor the quality of your generative AI outputs.
    PostHog Evaluations gives product engineers a systematic way to grade generative AI outputs directly inside their existing analytics stack. By running automated LLM-as-a-judge templates (covering relevance, helpfulness, jailbreaks, hallucinations, and toxicity) alongside custom code-based Hog checks, teams can continuously score their LLM generations. It eliminates the need for separate, disconnected AI observability platforms by combining automated pass/fail testing with real-world user interaction data. This allows developers to catch bad responses, run targeted human reviews, and verify prompt changes without losing context.
  • Prompt Management
    A centralized, open-source LLMOps platform to version, test, and deploy AI prompts without redeploying application code.
    Prompt Management treats prompts as decoupled, version-controlled software assets rather than hardcoded strings. Using a dedicated registry such as Langfuse or Agenta, engineering teams can collaboratively design, test, and update production prompts through API or SDK integrations. Minor phrasing tweaks no longer need a full CI/CD deployment cycle: developers can hot-swap prompt templates, run parallel A/B tests across multiple LLMs (such as GPT-4o and Claude 3.5 Sonnet), and monitor prompt performance in real time, while caching keeps prompt-fetch latency low.
  • Self-Driving Inbox
    An open-source AI personal assistant that automates your email, blocks cold spam, and drafts context-aware replies in your personal style.
    Managing email manually is a massive drag on daily productivity. Inbox Zero solves this by acting as an autonomous copilot for your email client: it reads incoming messages, automatically drafts replies based on your calendar availability, and files attachments directly to Google Drive or OneDrive. You write your instructions in plain English (for example: "Archive all cold pitches but flag urgent client issues"), and the system executes them. It is SOC 2 Type II certified, fully open-source, and supports self-hosting for teams that require complete control over their data privacy.
  • MCP
    MCP is the open-source standard for securely connecting AI agents (like LLMs) to external tools, data, and enterprise workflows.
    The Model Context Protocol (MCP) functions as a standardized integration layer: think of it as a USB-C port for AI applications. Developed and open-sourced by Anthropic, this protocol allows large language models (LLMs) to access real-time context and execute actions via external tools like GitHub, Jira, or proprietary databases. It uses a simple JSON-RPC interface to define tools, schemas, and endpoints, which enables AI agents to perform complex, state-changing tasks (such as creating a GitHub issue or running a test script) rather than just generating text. MCP is essential for building agentic AI systems that can autonomously pursue goals and operate within defined safety and permission boundaries.
  • PostHog's AI Observability
    Track and analyze LLM applications with real-time tracing, cost calculation, and latency monitoring directly inside your core product analytics stack.
    PostHog's AI Observability gives engineers x-ray vision into LLM applications by capturing traces, generations, and spans directly alongside user behavior data (no separate, siloed tools required). The platform automatically tracks prompt inputs, model outputs, token usage, and API costs across major providers like OpenAI and Anthropic. Because this data lives in PostHog, teams can instantly correlate LLM latency or hallucinations with real-world business impact: linking a slow model response to user drop-off in a conversion funnel, or jumping straight from a failed LLM call to a visual Session Replay of the exact user experience.
  • Evaluations
    DeepEval is an open-source LLM evaluation framework: it functions as a Pytest-like unit testing tool for validating large language model outputs with programmatic rigor.
    Evaluations, here via the DeepEval framework, give LLM testing a systematic structure. This open-source tool integrates directly into your CI/CD pipeline and works like a specialized Pytest for AI applications. It offers over 50 research-backed metrics (including G-Eval, RAGAS, and Hallucination checks) to score model performance on specific criteria. Developers define test cases, run the evaluation, and receive concrete metrics that catch regressions before deployment.
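
The evaluation entries above describe code-based pass/fail checks and a "define test cases, run the evaluation, read the metrics" workflow. A minimal sketch of that idea in plain Python follows. It is not PostHog's or DeepEval's API: `EvalCase`, `contains_required_terms`, `within_length`, `CHECKS`, and `run_evaluation` are hypothetical names chosen for illustration, and the checks are simple deterministic rules rather than LLM-as-a-judge scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One recorded generation to grade (hypothetical structure)."""
    prompt: str
    output: str
    required_terms: List[str]
    max_chars: int = 500


# A check takes one case and returns True (pass) or False (fail).
Check = Callable[[EvalCase], bool]


def contains_required_terms(case: EvalCase) -> bool:
    """Pass when every required term appears in the output (case-insensitive)."""
    text = case.output.lower()
    return all(term.lower() in text for term in case.required_terms)


def within_length(case: EvalCase) -> bool:
    """Pass when the output does not exceed the case's character budget."""
    return len(case.output) <= case.max_chars


# Hypothetical registry of named checks to run against every case.
CHECKS: Dict[str, Check] = {
    "required_terms": contains_required_terms,
    "length": within_length,
}


def run_evaluation(cases: List[EvalCase], checks: Dict[str, Check]) -> Dict[str, float]:
    """Return the pass rate (0.0 to 1.0) of each named check across all cases."""
    rates: Dict[str, float] = {}
    for name, check in checks.items():
        passed = sum(1 for case in cases if check(case))
        rates[name] = passed / len(cases) if cases else 0.0
    return rates


if __name__ == "__main__":
    sample_cases = [
        EvalCase(
            prompt="How do I reset my password?",
            output="Open Settings, then choose Reset password.",
            required_terms=["settings", "reset"],
        ),
        EvalCase(
            prompt="What is your refund policy?",
            output="I cannot help with that.",
            required_terms=["refund"],
        ),
    ]
    for check_name, rate in run_evaluation(sample_cases, CHECKS).items():
        print(f"{check_name}: {rate:.0%} passed")
```

Running a suite like this in CI and failing the build when a pass rate drops below a chosen threshold is one way to catch regressions after a prompt change. The threshold and the checks themselves would depend on the feature being tested.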