Apple Pulled This... I Run AI On It
Creator Magic

⏱ 20 min video · 3 min read · 8 May 2026 · Worth watching
TL;DR
Mike Russell sets up an Apple Mac Studio M3 Ultra (256GB unified memory) as a fully local AI inference server to replace cloud-based frontier AI like Claude and ChatGPT for routine agent tasks. He walks through the complete headless server setup, installs Ollama and LM Studio, and benchmarks local models including Qwen3 (30B-A3B), gpt-oss (120B), MiniMax M2, and Gemma 4 (31B) against real production workloads.
Key points
1. Apple quietly removed the 256GB Mac Studio M3 Ultra configuration (the exact machine featured in this video) from its store, with no announcement.
2. The Mac Studio M3 Ultra (256GB unified memory, 819GB/s bandwidth, 60-core GPU) can run 120B-parameter models at near-frontier quality on just 35W of power draw.
3. Qwen3 30B-A3B and gpt-oss 120B performed best in testing; Qwen3 matched Claude Opus quality in the creator's assessment, while MiniMax M2 and Gemma 4 struggled with the coding task.
4. The full headless server setup covers local user creation, auto-login, Screen Sharing, SSH key authentication, Homebrew, Ollama, and LM Studio with MLX models (15-30% faster on Apple Silicon).
5. The machine was successfully integrated into the OpenClaw and Hermes AI agent stacks to serve real agent traffic locally, with no cloud API calls or token costs; a sketch of the integration pattern follows this list.
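Pointing an agent stack at the local machine usually comes down to swapping the API base URL, because Ollama also exposes an OpenAI-compatible endpoint on port 11434. A minimal sketch, assuming the server is reachable as gaia.local and a Qwen3 model has already been pulled (the qwen3:30b tag is an assumption; check your installed tags with ollama list):

# Any OpenAI-compatible client or agent framework can target the local
# server by overriding its base URL; no API key or per-token billing.
curl http://gaia.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3:30b",
        "messages": [
          {"role": "user", "content": "Classify this ticket as bug, feature or question: the app crashes on launch."}
        ]
      }'

The video does not show OpenClaw's or Hermes's exact configuration, so treat this as the general pattern: most agent frameworks only need the base URL and model name changed.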
Actionable insights
Set up the Mac as a headless inference server: disable sleep, enable auto-login, turn off FileVault, enable Screen Sharing and Remote Login, and assign a static hostname like gaia.local for LAN access. The terminal side of this is sketched after this list.
Install Ollama via Homebrew for local inference over an OpenAI-compatible API, and LM Studio for exploring MLX-format models; MLX gives 15-30% faster token speeds on Apple Silicon than GGUF alternatives. Install commands are sketched below.
For routine agent tasks (summarisation, classification, extraction, rewriting), Qwen3 30B-A3B and gpt-oss 120B are strong enough to replace cloud APIs entirely, saving ongoing subscription and token costs.
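Most of the headless configuration can be scripted from Terminal. A minimal sketch, assuming an admin account on current macOS; the gaia hostname follows the video, the you account name is a placeholder, and auto-login plus Screen Sharing still need to be toggled in System Settings (Users & Groups, and General > Sharing) on recent releases:

# Keep the machine awake permanently: it is a server now
sudo pmset -a sleep 0 displaysleep 0 disksleep 0
# Restart automatically after a power failure
sudo pmset -a autorestart 1
# Enable Remote Login (SSH)
sudo systemsetup -setremotelogin on
# Stable LAN identity: the box becomes reachable as gaia.local
sudo scutil --set ComputerName gaia
sudo scutil --set LocalHostName gaia
sudo scutil --set HostName gaia
# Disable FileVault so the machine can reboot unattended
sudo fdesetup disable
# From your main machine, copy your public key over for SSH key auth
ssh-copy-id you@gaia.local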
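Getting Ollama serving and pulling the models tested in the video is a short sequence. A hedged sketch, with model tags as assumptions (confirm current names on ollama.com/library):

# Install Ollama and run it as a background service
brew install ollama
brew services start ollama
# Note: Ollama listens on 127.0.0.1 by default; it needs OLLAMA_HOST=0.0.0.0
# in its environment before other machines on the LAN can reach it
# Pull models (tags are assumptions; check ollama.com/library)
ollama pull qwen3:30b
ollama pull gpt-oss:120b
# Smoke-test a routine task straight from the command line
ollama run qwen3:30b "Summarise in one sentence: Ollama serves local models over an HTTP API."
# Or hit the native HTTP API, e.g. from another machine on the LAN
curl http://gaia.local:11434/api/generate \
  -d '{"model": "qwen3:30b", "prompt": "Say hello.", "stream": false}'

LM Studio is installed separately (there is a Homebrew cask, brew install --cask lm-studio, as an assumption) and is where the MLX-format builds of these models live.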
Notable quotes

256GB of unified memory. 819GB per second of memory bandwidth. This thing can run models that previously required a data center, in your living room, silently, and for about 80W of power.

We have a near frontier level model running on some code right now, and only 35W of power draw. This definitely beats running big Nvidia GPUs on a PC.

For the routine stuff, absolutely — I can run that on Gaia behind me now. For the hard stuff, I will probably keep paying for Anthropic and OpenAI, but for everything in between, Gaia handles it.

Worth watching the full video?
Watch if you are seriously considering building a local AI inference server — the step-by-step setup and live model benchmarks save hours of research, though the key config steps and model verdicts are all captured here.
Topics
AI & Tech · Ollama


Saved you some time? The creator still deserves a like.

Watch on YouTube →