GPT-Realtime-2 Voice Agent: I Build It Live
OpenAI
Creator Magic

⏱ 56 min video · 3 min read · 18 Jun 2026
TL;DR
The creator live-builds a voice AI assistant called Jarvis using GPT Realtime 2 (OpenAI's model released ~6 weeks prior) and the Tank framework, connecting it to Zapier MCP for tool access and Imagen via Nano Banana 2 for real-time image generation. The session is raw and experimental, revealing both the promise and current limitations of voice-driven AI agents.
Key points
1. GPT Realtime 2 was released by OpenAI roughly 6 weeks before this stream and features a 128,000 token context window, GPT-5 class reasoning, and multimodal support (text, image, audio) with function calling.
2. The Tank framework (creator's own tool) orchestrated Claude Code to scaffold a LAN-hosted Jarvis voice assistant from a single prompt in under 6 minutes, including a Caddy reverse proxy for microphone HTTPS access.
3. Zapier MCP integration partially worked: the voice agent could detect available Notion actions but failed to actually create pages, highlighting real-world friction with MCP-connected voice agents.
4. Swapping to Nano Banana 2 (Google Imagen) for image generation via voice worked impressively: the user described scenes verbally and Jarvis generated images in real time, useful for thumbnail ideation.
5. Tank now supports Fusion mode (synthesizing answers from multiple models simultaneously, e.g. local Qwen 3, GPT-OSS 120B, and GPT 5.5 via API) and has added scheduled tasks, keyboard shortcuts, Linux/Proxmox support, and community pull requests.
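The Caddy step in point 2 exists because browsers only expose the microphone (getUserMedia) in a secure context: an HTTPS origin, or a localhost-style origin. The Python sketch below approximates that rule so you can see why a plain-HTTP LAN address fails. The function name mic_allowed and the example addresses are hypothetical, and real browsers apply a more detailed policy than this simplified check.

```python
import ipaddress
from urllib.parse import urlsplit


def mic_allowed(origin: str) -> bool:
    """Approximate whether a browser would treat `origin` as a secure context.

    Simplified assumption: HTTPS origins qualify, as do localhost names and
    loopback IP addresses. Everything else (e.g. http://192.168.x.x) does not.
    """
    parts = urlsplit(origin)
    host = (parts.hostname or "").lower()

    # Any HTTPS origin counts, which is what the reverse proxy provides.
    if parts.scheme == "https":
        return True

    # Browsers treat localhost names as trustworthy even over plain HTTP.
    if host == "localhost" or host.endswith(".localhost"):
        return True

    # Loopback IPs (127.0.0.0/8, ::1) are also treated as trustworthy.
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a LAN hostname served over plain HTTP.
        return False
```

Under this check, a LAN address such as http://192.168.1.50:3000 is rejected while the same app served over https:// is accepted, which matches the reason the build put a reverse proxy in front of the assistant.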
Actionable insights
To use browser microphone access on a LAN-hosted app, route it through HTTPS via a reverse proxy like Caddy — plain HTTP blocks mic permissions in modern browsers.
When building a voice agent with GPT Realtime 2, add function-calling definitions rather than relying on MCP connections in the playground, as MCP wiring is unreliable in that environment.
For rapid visual brainstorming (e.g. YouTube thumbnails), a voice-to-image pipeline using GPT Realtime 2 + an image model is already usable today — describe a scene verbally and iterate hands-free.
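As an illustration of the function-calling advice above, here is a minimal Python sketch of a tool definition plus a local dispatcher. It assumes GPT Realtime 2 accepts the JSON Schema style of function tool used by OpenAI's existing function-calling APIs; that is an assumption, not something the video confirms. The names create_note, HANDLERS, and handle_function_call are hypothetical, and the handler is a stand-in that makes no network calls.

```python
import json

# Hypothetical tool definition: a name, a description the model can read,
# and JSON Schema describing the arguments it should supply.
CREATE_NOTE_TOOL = {
    "type": "function",
    "name": "create_note",
    "description": "Create a note with a title and optional body text.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["title"],
    },
}


def create_note(title: str, body: str = "") -> dict:
    # Stand-in for a real integration (e.g. a notes API); it only echoes data.
    return {"status": "created", "title": title, "body_chars": len(body)}


# Map tool names to the local Python functions that implement them.
HANDLERS = {"create_note": create_note}


def handle_function_call(name: str, arguments_json: str) -> str:
    """Run the handler for a model-requested call and return a JSON result."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown function: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the handler signature.
        return json.dumps({"error": str(exc)})
```

The point of the design is that the agent's tools live in your own code: when the model asks for create_note, the app runs the handler and sends the JSON result back, so a failure surfaces as an explicit error rather than a silent no-op.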
Notable quotes

It kept telling me it can't make an image of Mario, but it made an image of Mario. So there's definitely a wrinkle there somewhere.

Just in the last day or so, Claude Opus 4.8 has got a lot dumber. Like, it's taking a lot more time, it's making errors and it's having to go around again a couple of times before it gets things correct.

It shows you how easy it is even to this very day with GPT 5.5 to jailbreak even inadvertently and get a result.

Worth watching?
Skip the full video unless you enjoy unedited live-coding streams — the key findings, limitations, and build steps are all captured here, and the video contains significant dead time waiting for Claude to code.
Topics
AI & Tech · OpenAI
