WebRTC Powered Agents

Explore a live-video WebRTC pipeline for LLM agents and ML models, featuring real-time object detection, segmentation, and depth detection for interactive applications.

LiveKit WebRTC, Gemini Live models, Google Cloud Platform, Python, Next.js, GCP-hosted (and/or self-hostable) Python LiveKit Agents

Overview

A simple live-video WebRTC pipeline you can hook LLM agents and ML models into.

I will show code and basic architecture for:

Setting up a WebRTC pipeline.
Conversational agent interactions (querying video footage, manipulating app state, etc.).
Promptable DINO-based real-time object detection.
Real-time SAM2 segmentation.
Real-time depth detection.
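
One way to picture the items above is a single video track fanned out to several per-frame model stages. This is a minimal sketch, not the repository's actual code: `Frame`, `run_pipeline`, and the three stage functions are hypothetical placeholders, and each stage returns a canned value where a real detection, SAM2, or depth model would run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Frame:
    """One decoded video frame plus whatever the stages attach to it."""
    index: int
    pixels: bytes  # placeholder for real image data
    annotations: Dict[str, Any] = field(default_factory=dict)


# A stage takes a frame and returns the annotation it produced.
Stage = Callable[[Frame], Any]


def detect_objects(frame: Frame) -> List[str]:
    # Hypothetical stand-in for promptable object detection.
    return ["person"]


def segment(frame: Frame) -> str:
    # Hypothetical stand-in for SAM2 segmentation.
    return f"mask-for-frame-{frame.index}"


def estimate_depth(frame: Frame) -> float:
    # Hypothetical stand-in for a depth model; returns a fake mean depth.
    return 1.5


def run_pipeline(frames: List[Frame], stages: Dict[str, Stage]) -> List[Frame]:
    """Run every stage on every frame and store results under the stage name."""
    for frame in frames:
        for name, stage in stages.items():
            frame.annotations[name] = stage(frame)
    return frames


stages = {"objects": detect_objects, "mask": segment, "depth": estimate_depth}
frames = run_pipeline([Frame(0, b""), Frame(1, b"")], stages)
print(frames[1].annotations)
```

Keeping the stages in a dictionary keyed by name means a model can be added or removed without touching the loop that consumes frames.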

Links

https://github.com/kanawish/content/tree/main/260523_WebRTC_powered...
A GitHub repository demonstrating WebRTC-powered agents.

Tech stack

LiveKit WebRTC

LiveKit is an open-source, high-performance WebRTC framework designed to build and scale real-time audio, video, and multimodal AI applications.

LiveKit eliminates the complexity of WebRTC infrastructure by providing a production-ready Selective Forwarding Unit (SFU) written in Go, paired with robust client SDKs for every major platform. Developers use it to power everything from multi-user video conferences to sub-100ms voice interactions with AI agents. By handling critical real-time challenges like dynamic codec switching, automatic bandwidth estimation, and seamless reconnection, LiveKit lets teams focus on building features rather than managing complex media servers.
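
The core idea of a Selective Forwarding Unit is that the server relays each participant's media to the other participants without decoding or mixing it. The toy below illustrates only that idea. It is not LiveKit's implementation or API, `ToySFU` and its methods are hypothetical names, and it ignores everything a real SFU does around codecs, bandwidth estimation, and reconnection.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToySFU:
    """Forwards each published packet to every other participant, unmodified."""

    def __init__(self) -> None:
        # participant id -> list of (sender id, packet) delivered to them
        self.inboxes: Dict[str, List[Tuple[str, bytes]]] = defaultdict(list)
        self.participants: List[str] = []

    def join(self, participant: str) -> None:
        self.participants.append(participant)

    def publish(self, sender: str, packet: bytes) -> int:
        """Forward one packet; return how many subscribers received it."""
        receivers = [p for p in self.participants if p != sender]
        for receiver in receivers:
            self.inboxes[receiver].append((sender, packet))
        return len(receivers)


sfu = ToySFU()
for name in ("alice", "bob", "agent"):
    sfu.join(name)

delivered = sfu.publish("alice", b"frame-0")
print(delivered, sfu.inboxes["agent"])
```

In this picture an AI agent is just another participant: it receives the same forwarded packets as a human viewer would.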

https://livekit.io

Gemini Live models

Gemini Live models power real-time, bidirectional voice and vision interactions through a low-latency, stateful WebSocket connection.

Built for high-speed, conversational AI, Gemini Live models process continuous streams of audio, images, and text to deliver immediate, human-like spoken responses. Operating over a stateful WebSocket protocol, this technology supports key real-time features including user barge-in (interrupting the model mid-sentence), multilingual support across 70 languages, and affective dialog that adapts tone to match user input. Developers can leverage these models via the Gemini Live API to build responsive voice agents, interactive gaming NPCs, and hands-free interfaces for smart devices.
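
Barge-in means the model's speech stops as soon as the user starts talking. The sketch below shows that behavior on the client side using only `asyncio`. It makes no network calls and does not use the Gemini Live API: `speak`, `demo`, and `CHUNKS` are hypothetical names, and the sleeps stand in for audio playback and for the moment the user interrupts.

```python
import asyncio
from typing import List, Tuple

CHUNKS = ["chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5"]


async def speak(chunks: List[str], played: List[str]) -> None:
    """Pretend to play the model's audio one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.1)  # stands in for playback time


async def demo() -> Tuple[List[str], bool]:
    played: List[str] = []
    playback = asyncio.create_task(speak(CHUNKS, played))

    await asyncio.sleep(0.25)  # the user starts talking partway through
    playback.cancel()          # barge-in: stop the model's speech now
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected: playback was interrupted on purpose

    return played, playback.cancelled()


played, was_cancelled = asyncio.run(demo())
print(played, was_cancelled)
```

Only a prefix of the chunks is played before the cancellation lands, which is the effect a listener experiences as the model "stopping mid-sentence".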

https://ai.google.dev/gemini-api/docs/live-api

Google Cloud Platform

GCP delivers Google's global infrastructure (Compute Engine, BigQuery) for secure, scalable cloud solutions and AI/ML innovation.

Google Cloud Platform (GCP) provides the core infrastructure and services for modern digital transformation. The platform leverages Google's global network, spanning 39 regions and 118 zones, to host critical workloads securely. Key services include Compute Engine (IaaS), Google Kubernetes Engine (GKE) for container orchestration, and BigQuery (serverless data warehouse) for petabyte-scale analytics. GCP integrates advanced AI/ML capabilities via Vertex AI, allowing developers to build and deploy models fast. Security is paramount: the platform uses Google's multi-layered security model, protecting data and applications with zero-trust principles. New customers can utilize the free tier and $300 in credits to deploy their next project.

https://cloud.google.com

Python

Python: The high-level, general-purpose language built for readability, powering everything from web backends to advanced machine learning models.

Python is a high-level, general-purpose language that prioritizes clear, readable syntax (via significant indentation), which supports rapid development. Its ecosystem is massive: use it for web development with frameworks like Django and Flask, or for data science with libraries such as Pandas and NumPy. The Python Package Index (PyPI) hosts hundreds of thousands of community-contributed packages, covering tasks from network programming to GUI creation. The language is stewarded by the Python Software Foundation (PSF), with the stable release at Python 3.14.0 as of November 2025.

https://python.org

Next.js

Next.js is the full-stack React framework: it delivers high-performance web applications via hybrid rendering and powerful, Rust-based tooling.

Next.js is a React framework for production that lets you build full-stack web applications with minimal configuration. It supports hybrid rendering (Server-Side Rendering, Static Site Generation, and Incremental Static Regeneration) for speed and SEO. Key features include React Server Components, Server Actions for running server code directly, and the App Router for advanced routing and nested layouts. Developed by Vercel, it uses Rust-based tools such as Turbopack and the Speedy Web Compiler (SWC) for fast builds.

https://nextjs.org/

GCP-hosted (and/or self-hostable) Python LiveKit Agents

Build and deploy ultra-low latency, real-time Python voice and video AI agents on your own infrastructure or GCP using LiveKit's open-source framework.

LiveKit Agents provides a production-grade Python framework for orchestrating real-time, multimodal AI assistants with sub-second latency. By combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) into a unified pipeline, it handles complex tasks like automatic turn detection and instant interruption management out of the box. You can run the entire stack locally during development using the `uv` package manager, and then seamlessly deploy your containerized agents to Google Cloud Platform (using Google Cloud Run or GKE) while connecting back to your self-hosted LiveKit media servers or LiveKit Cloud.
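
The STT, LLM, and TTS chain described above can be sketched as three functions composed into one turn handler. This is an illustration of the data flow only, not the LiveKit Agents API: `make_voice_pipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical names, and the fakes treat audio as UTF-8 bytes so the example runs without any model or service.

```python
from typing import Callable


def fake_stt(audio: bytes) -> str:
    # Hypothetical speech-to-text: pretend the audio bytes are the transcript.
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    # Hypothetical language model: echo the prompt back in a reply.
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    # Hypothetical text-to-speech: pretend the text bytes are audio.
    return text.encode("utf-8")


def make_voice_pipeline(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Compose the three stages into a single audio-in, audio-out handler."""

    def handle_turn(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)
        reply = llm(transcript)
        return tts(reply)

    return handle_turn


handle_turn = make_voice_pipeline(fake_stt, fake_llm, fake_tts)
reply_audio = handle_turn(b"hello")
print(reply_audio)
```

Passing the stages in as arguments mirrors the idea of swapping providers for each step; turn detection and interruption handling, which the framework is described as providing, are left out of this sketch.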

https://github.com/livekit/agents
