Creator Magic

GPT-Realtime-2 Voice Agent: I Build It Live

⏱ 56 min video · 3 min read · 18 Jun 2026

TL;DR

The creator live-builds a voice AI assistant called Jarvis using GPT Realtime 2 (OpenAI's model, released about six weeks earlier) and the Tank framework, then connects it to Zapier MCP for tool access and to Nano Banana 2 (Google Imagen) for real-time image generation. The session is raw and experimental, showing both the promise and the current limitations of voice-driven AI agents.

Key points

GPT Realtime 2 was released by OpenAI roughly six weeks before this stream. It features a 128,000-token context window, GPT-5-class reasoning, and multimodal support (text, image, audio) with function calling.

The Tank framework (creator's own tool) orchestrated Claude Code to scaffold a LAN-hosted Jarvis voice assistant from a single prompt in under 6 minutes, including a Caddy reverse proxy for microphone HTTPS access.

Zapier MCP integration partially worked — the voice agent could detect available Notion actions — but failed to actually create pages, highlighting real-world friction with MCP-connected voice agents.

Swapping to Nano Banana 2 (Google Imagen) for image generation via voice worked impressively: the user described scenes verbally and Jarvis generated images in real time, useful for thumbnail ideation.

Tank now supports Fusion mode — synthesizing answers from multiple models simultaneously (e.g. local Qwen 3, GPT-OSS 120B, and GPT 5.5 via API) — and has added scheduled tasks, keyboard shortcuts, Linux/Proxmox support, and community pull requests.
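The Fusion idea in the last key point (ask several models at once, then merge their answers) can be sketched roughly as below. This is only an illustration of the general fan-out-and-synthesize pattern, not Tank's actual implementation, which the summary does not describe. The names `fan_out`, `build_synthesis_prompt`, and `fuse`, and the idea of passing models as plain callables, are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
ModelFn = Callable[[str], str]


def fan_out(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same prompt to every model concurrently and collect the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


def build_synthesis_prompt(question: str, answers: Dict[str, str]) -> str:
    """Pack the candidate answers into one prompt for a final synthesizer model."""
    lines = [f"Question: {question}", "Candidate answers:"]
    for name in sorted(answers):
        lines.append(f"[{name}] {answers[name]}")
    lines.append("Write one answer that reconciles the candidates.")
    return "\n".join(lines)


def fuse(question: str, models: Dict[str, ModelFn], synthesizer: ModelFn) -> str:
    """Fan the question out, then let the synthesizer merge the results."""
    answers = fan_out(question, models)
    return synthesizer(build_synthesis_prompt(question, answers))
```

In practice each callable would wrap a local or API-hosted model, and the synthesizer would typically be the strongest model available.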

Actionable insights

→ To use browser microphone access on a LAN-hosted app, serve it over HTTPS through a reverse proxy such as Caddy. Modern browsers block microphone permissions on plain HTTP origins other than localhost.

→ When building a voice agent with GPT Realtime 2, add function-calling definitions rather than relying on MCP connections in the playground, where MCP wiring proved unreliable during the stream.

→ For rapid visual brainstorming (e.g. YouTube thumbnails), a voice-to-image pipeline using GPT Realtime 2 plus an image model is already usable today: describe a scene verbally and iterate hands-free.
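The function-calling advice above can be sketched as follows. The event shapes (`session.update`, a `function_call_output` conversation item, `response.create`) follow OpenAI's Realtime API as documented for earlier realtime models; whether GPT Realtime 2 uses exactly the same fields is an assumption, and the current API may require extra session fields. The `generate_image` tool, its schema, and the handler wiring are hypothetical and stand in for whatever the stream actually used.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical tool definition: the name, description and schema are illustrative.
GENERATE_IMAGE_TOOL: Dict[str, Any] = {
    "type": "function",
    "name": "generate_image",
    "description": "Generate an image from a spoken scene description.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {"type": "string", "description": "Scene to draw."},
        },
        "required": ["prompt"],
    },
}


def session_update(tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the client event that registers tools on a realtime session."""
    return {
        "type": "session.update",
        "session": {"tools": tools, "tool_choice": "auto"},
    }


def handle_function_call(
    event: Dict[str, Any],
    handlers: Dict[str, Callable[..., Any]],
) -> List[Dict[str, Any]]:
    """Run the requested tool and return the events to send back to the model.

    Assumes the event carries `name`, `call_id` and a JSON string `arguments`.
    """
    args = json.loads(event["arguments"])
    result = handlers[event["name"]](**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue speaking now that the tool result is available.
        {"type": "response.create"},
    ]
```

A real handler for `generate_image` would call the image model and return something like a URL or file path; the voice agent can then describe or display the result, which is what makes the hands-free thumbnail loop possible.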

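The HTTPS requirement from the first insight comes from the browser's secure-context rule: microphone access is only offered to HTTPS origins and to loopback addresses such as localhost. The check below is a simplified approximation of that rule, useful for seeing why a LAN address like `http://192.168.1.50:3000` needs a proxy in front of it. The function name and the example URLs are made up for illustration, and real browsers apply a more detailed algorithm.

```python
import ipaddress
from urllib.parse import urlsplit


def is_mic_allowed_origin(url: str) -> bool:
    """Approximate whether a browser would treat this origin as a secure context."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Encrypted transports are always acceptable.
    if scheme in ("https", "wss"):
        return True

    # Browsers special-case localhost names even over plain HTTP.
    if host == "localhost" or host.endswith(".localhost"):
        return True

    # Loopback IPs (127.0.0.0/8 and ::1) are also treated as trustworthy.
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False
```

This is why the app works on the machine that hosts it but loses microphone access from a phone or second laptop on the same network until HTTPS is added.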
Notable quotes

“It kept telling me it can't make an image of Mario, but it made an image of Mario. So there's definitely a wrinkle there somewhere.”

“Just in the last day or so, Claude Opus 4.8 has got a lot dumber. Like, it's taking a lot more time, it's making errors and it's having to go around again a couple of times before it gets things correct.”

“It shows you how easy it is even to this very day with GPT 5.5 to jailbreak even inadvertently and get a result.”

Worth watching?

Skip the full video unless you enjoy unedited live-coding streams. The key findings, limitations, and build steps are all captured here, and the video contains significant dead time waiting for Claude to code.

Topics

AI & Tech OpenAI
