How to Run Gemma 4 Locally With LM Studio and Claude Code
LM Studio's new headless CLI lets you run local models as an API endpoint. Pair it with Claude Code and you get an AI coding agent running entirely on your machine - no cloud billing, no API keys, no data leaving your device.
April 12, 2026
A post on Hacker News this week (254 points) showed a setup that a lot of developers have wanted but few have actually configured: running Google's Gemma 4 model locally through LM Studio, then connecting Claude Code to use that local model instead of Anthropic's cloud API. The result is a capable AI coding agent that runs entirely on your machine with no per-token billing.
Here's what's involved and how to set it up.
What Gemma 4 is
Gemma 4 is Google's open-weight model family, released in early 2026. Open-weight means the model weights are publicly available - you can download and run them locally, unlike closed models such as Claude or GPT-4o, which run only on the vendor's servers.
Gemma 4 comes in several sizes: 1B, 4B, 12B, and 27B parameters. The 12B model fits comfortably on a Mac with 16GB of unified memory and delivers coding performance that's genuinely useful for everyday tasks - not as capable as Claude Sonnet or GPT-4o on hard problems, but solid for autocomplete, code review, simple refactors, and explaining code.
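Those memory claims follow from a common rule of thumb: weight-only footprint is roughly parameters times bits per weight, and 4-bit quantization is a typical default for local runners like LM Studio. A quick sketch (the function name is mine; real usage adds a few GB for the KV cache and runtime overhead):

```python
# Back-of-envelope weight memory for a quantized model:
# parameters (billions) * bits-per-weight / 8 gives billions of bytes ~= GB.
# Ignores the KV cache and runtime overhead, which add a few more GB.
def approx_weight_gb(params_billions: float, bits: int = 4) -> float:
    return params_billions * bits / 8

print(approx_weight_gb(12))  # 6.0 GB - why 12B fits comfortably in 16GB
print(approx_weight_gb(27))  # 13.5 GB - why 32GB is the floor for 27B
```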
The 27B model requires more RAM (at least 32GB) but competes more seriously with cloud models on complex tasks.
What LM Studio's headless CLI does
LM Studio, a desktop app for running local AI models, has been around for a while. What's new is the headless CLI - a command-line interface that runs LM Studio as a background server with no GUI, and exposes the loaded model as an OpenAI-compatible API endpoint on localhost.
This matters because most AI coding tools - Claude Code, Goose, Cursor, and others - can connect to any OpenAI-compatible API endpoint. Once LM Studio is running a model as a local API, these tools can use it as their backend just as easily as they use the real OpenAI API.
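Concretely, "OpenAI-compatible" means the server accepts the standard chat-completions request shape at /v1/chat/completions. A stdlib-only sketch of the request any such client builds (the endpoint and model name are the defaults from this setup, not universal):

```python
import json
import urllib.request

# LM Studio's default local endpoint from this article's setup.
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(prompt: str, model: str = "gemma-4-12b"):
    """Build an OpenAI-style chat-completions request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer local"},  # any key works locally
    )

req = build_chat_request("Explain this regex: ^\\d{4}-\\d{2}$")
# urllib.request.urlopen(req) returns an OpenAI-style JSON response
# once the server is running.
```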
Setting it up
The setup has three steps.
Step 1: Install LM Studio and download Gemma 4. Download LM Studio from lmstudio.ai. Open the app, go to the model search, and download your preferred Gemma 4 variant. The 12B model is a good starting point for most Mac setups. The download is several gigabytes - plan accordingly.
Step 2: Start the LM Studio server. Using LM Studio's lms command-line tool, start the server in headless mode. This exposes an OpenAI-compatible API at http://localhost:1234/v1 by default, and the server keeps running in the background while you work.
lms server start --model gemma-4-12b
Step 3: Point Claude Code at the local endpoint. Claude Code (and most AI coding agents) support connecting to custom API endpoints via environment variables or configuration files. Set the API base URL to your local LM Studio server and set any API key (LM Studio accepts any value since there's no authentication locally).
export ANTHROPIC_BASE_URL=http://localhost:1234/v1
export ANTHROPIC_API_KEY=local
Launch Claude Code and it will now use your local Gemma 4 model as the backend. The interface is identical - you're just running a different model underneath.
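Before launching, it's worth a quick smoke test that the server is actually up. OpenAI-compatible servers, LM Studio included, answer a model listing at /v1/models; this stdlib-only check (the helper name is mine) reports either way instead of erroring:

```python
import urllib.request
import urllib.error

def server_running(base="http://localhost:1234/v1", timeout=2):
    """Return True if an OpenAI-compatible server answers GET /v1/models."""
    try:
        with urllib.request.urlopen(f"{base}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("LM Studio up" if server_running() else "server not reachable")
```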
What to expect
Performance depends heavily on your hardware. On a MacBook Pro M4 Max with 48GB of unified memory, the 27B model runs at 30-40 tokens per second - fast enough for comfortable interactive use. On a MacBook Air M3 with 16GB, the 12B model runs at similar speeds but you're limited to the smaller model.
Quality-wise: expect solid performance on routine coding tasks - writing functions from docstrings, explaining code, simple refactors, finding obvious bugs. Expect weaker performance on complex architectural reasoning, obscure frameworks, and tasks that require deep contextual knowledge across a large codebase.
The practical use case is cost control. If you're a heavy Claude Code user hitting the $200/month ceiling, routing lower-stakes tasks through a local model while reserving cloud API access for harder problems can meaningfully reduce costs without sacrificing output quality where it matters.
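One way to operationalize that split is a thin routing layer in your own tooling. This is purely illustrative - the task labels, thresholds, and model names below are hypothetical, not part of Claude Code or LM Studio:

```python
# Hypothetical routing sketch: send routine task kinds to the local Gemma
# endpoint, reserve the cloud API for everything else.
LOCAL = {"base_url": "http://localhost:1234/v1", "model": "gemma-4-12b"}
CLOUD = {"base_url": "https://api.anthropic.com", "model": "claude-sonnet"}

# Lower-stakes task kinds worth keeping local (illustrative set).
ROUTINE = {"explain", "docstring", "rename", "format", "review"}

def pick_backend(task_kind: str) -> dict:
    """Route routine tasks locally; anything unrecognized goes to the cloud."""
    return LOCAL if task_kind in ROUTINE else CLOUD

print(pick_backend("explain")["model"])       # gemma-4-12b
print(pick_backend("architecture")["model"])  # claude-sonnet
```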
Privacy implications
Everything runs locally. No data leaves your machine - not your prompts, not your code, not the model's responses. For developers working on proprietary codebases with data residency requirements, this setup provides a privacy guarantee that cloud-based AI tools cannot match.
This is also why Goose has been gaining traction among enterprise developers - it supports local model backends with the same architecture, meaning you get autonomous agentic coding without any cloud exposure.
The model landscape for local use
Gemma 4 is not the only option. Other models worth running locally via LM Studio:
Mistral Small 3 - Excellent multilingual performance, strong for code, 22B parameters. Comfortable on 32GB machines.
DeepSeek Coder V3 - Purpose-built for coding tasks, competitive with Gemma 4 on most benchmarks, slightly smaller footprint.
Phi-4 - Microsoft's 14B model that punches above its weight on reasoning tasks. Fast on consumer hardware.
LM Studio handles all of these with the same server setup. The choice of model is a matter of testing on your specific hardware and use cases - benchmarks tell part of the story, but real-world performance on your actual codebase is what matters.