Local Qwen Model Outperforms Claude Opus on Image Generation
A real-world test shows that Qwen 3.6 35B running locally matched or exceeded Claude Opus 4.7's image generation capabilities, suggesting open-source models are closing the gap with frontier AI systems on specific tasks.
April 19, 2026
Simon Willison tested Qwen 3.6 35B running locally against Claude Opus 4.7 on a straightforward request: draw a pelican. The local model won. Not by a tiny margin in some esoteric benchmark. It actually produced a better pelican.
This single comparison exposes something the AI industry doesn't want to admit loudly: frontier models aren't universally better. They're better at specific, valuable things. But on tasks where they haven't been optimized ruthlessly, an open-source model running on your laptop can do the job just fine. Sometimes better.
The Test Nobody's Running Anymore
Most AI comparisons follow a predictable script. Benchmark scores. Token speed. Latency measurements. What they don't do is ask the question Willison asked: which model actually produces something I'd use?
With image generation specifically, frontier models like Claude have been increasingly cautious. They refuse requests. They add watermarks. They hedge outputs. Meanwhile, open-source alternatives like Qwen have been quietly improving their visual capabilities without the same liability concerns hanging over every decision.
A pelican drawing is trivial. But it's exactly the kind of task that maps to real-world use: generating reference images, creating visual mockups, producing assets that don't need photorealism. These tasks don't need the most expensive model in the world.
Why This Actually Changes The Math
Claude Opus costs money per token. Qwen 35B running locally costs electricity and whatever hardware you already own. The cost difference isn't marginal. It's orders of magnitude.
If you're using Claude for image generation requests dozens of times a week, your API bill accumulates. If those requests don't require the absolute frontier of model capability, you're paying for capabilities you never use. The pelican test proves the point: you're not getting better results for the price.
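The "orders of magnitude" claim is easy to sanity-check with back-of-envelope arithmetic. The numbers below are illustrative assumptions, not published rates or measured power draw; plug in your own figures.

```python
# Back-of-envelope cost comparison. All prices, request volumes, and power
# figures here are assumptions for illustration -- substitute your own.

def monthly_api_cost(requests_per_week, tokens_per_request, price_per_million_tokens):
    """Rough monthly API spend for a given request volume at a given token price."""
    tokens_per_month = requests_per_week * 4 * tokens_per_request
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Assumed: 50 generation requests a week, ~2,000 tokens each, at a
# hypothetical $75 per million output tokens for a frontier model.
frontier = monthly_api_cost(50, 2_000, 75.0)

# Local marginal cost is roughly electricity: assume a 300 W GPU running
# 30 seconds per request, at $0.15 per kWh.
local = 50 * 4 * (300 / 1000) * (30 / 3600) * 0.15

print(f"frontier API: ${frontier:.2f}/month, local power: ${local:.2f}/month")
```

Under these assumed numbers the gap is a few hundredfold, which is what "orders of magnitude" means in practice; even if your real rates differ by 5x in either direction, the conclusion survives.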
The practical implication is uncomfortable for Claude's positioning. It's marketed as the best model for everything. But if developers start running local comparisons on their actual workflows, Claude loses the assumed superiority that justifies the cost.
The Privacy Layer Nobody Mentions
Local models have a secondary advantage that compounds the economic argument. Your pelican request doesn't go to Anthropic's servers. It stays on your hardware. No API logging. No data retention policies. No terms-of-service nightmares if your prompt contains something sensitive.
For teams working with confidential information, this isn't a marginal benefit. It's foundational. You can't even use Claude on certain types of documents without legal review, no matter how good it is at the task.
Qwen running locally solves that constraint completely. Your data never leaves the machine.
Where Frontier Models Still Win
This doesn't mean Claude is suddenly obsolete. It means the gap is task-specific and getting narrower.
Claude excels at complex reasoning, long-context understanding, and nuanced language tasks where training data quality compounds capability. If your prompt requires sustained logical inference across multiple paragraphs, Claude still has an edge. If you need a model to understand domain-specific context across 100,000 tokens, the frontier matters.
But for image generation, simple summarization, straightforward coding tasks, and other well-defined problems, open-source models now deliver comparable results. Sometimes better results.
The real shift here is a bifurcation: frontier models get reserved for the 20 percent of use cases that genuinely need frontier capability, while everything else is handled by cheaper, faster, local alternatives. Teams that figure this out first save enormously on infrastructure costs.
What The Industry Is Doing Wrong
Anthropic, OpenAI, and Google all treat their models as general-purpose tools meant to be best at everything. It's a defensible marketing strategy. It's a terrible strategy for actually serving customer needs efficiently.
The companies making smarter decisions right now are the ones building modular systems that route tasks to appropriate models. Use Claude for reasoning. Use Qwen locally for generation. Use a tiny quantized model for classification. The cost per task drops dramatically when you stop treating every request as if it needs the best model available.
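A routing layer like that can be very small. The sketch below is a minimal illustration: the model names are placeholders, and the keyword heuristic stands in for whatever real classifier you would use.

```python
# Minimal sketch of task-based model routing. Model names and the
# complexity heuristic are placeholder assumptions, not real deployments.

ROUTES = {
    "reasoning": "frontier-api-model",       # e.g. a hosted frontier model
    "generation": "local-35b-model",         # e.g. a local Qwen-class model
    "classification": "local-quantized-3b",  # e.g. a tiny quantized model
}

def classify_task(prompt: str) -> str:
    """Crude keyword heuristic standing in for a real task classifier."""
    lowered = prompt.lower()
    if any(word in lowered for word in ("prove", "analyze", "explain why")):
        return "reasoning"
    if any(word in lowered for word in ("draw", "generate", "write a")):
        return "generation"
    return "classification"

def route(prompt: str) -> str:
    """Pick the cheapest model that plausibly handles the task."""
    return ROUTES[classify_task(prompt)]

print(route("Draw a pelican"))               # goes to the local model
print(route("Explain why this proof fails")) # goes to the frontier model
```

The point of the design is that the default path is cheap: only prompts that match a "needs frontier" signal pay frontier prices.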
Willison's pelican test is simple enough that most developers will dismiss it. That's exactly why it matters. Simple tasks are the majority of real-world usage. And on simple tasks, you're often better off going local.
The Uncomfortable Question
If Qwen 35B running on consumer hardware beats Claude Opus at image generation, what other tasks is Claude being overdeployed on? Where else are teams paying premium prices for generic capability when local models would work fine?
The honest answer is probably in your own API logs. Most teams have never actually tested this. They assume frontier is always better. Willison just proved that assumption wrong on at least one important task.
Start looking at your high-volume, low-complexity requests. The pelican test scales.
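One way to start that audit is to group logged requests by task and prompt size. The log format below is invented for illustration; adapt the parsing to whatever your provider or gateway actually records.

```python
# Sketch of mining API logs for high-volume, low-complexity requests.
# The log schema and threshold are assumptions -- adapt to your own logs.
from collections import Counter

sample_log = [
    {"prompt_tokens": 40,   "task": "draw-svg"},
    {"prompt_tokens": 35,   "task": "draw-svg"},
    {"prompt_tokens": 9000, "task": "contract-review"},
    {"prompt_tokens": 50,   "task": "summarize"},
    {"prompt_tokens": 45,   "task": "draw-svg"},
]

SIMPLE_PROMPT_TOKENS = 500  # assumed threshold for "low-complexity"

def local_candidates(log):
    """Count short-prompt tasks: frequent + simple = try a local model first."""
    counts = Counter(
        entry["task"] for entry in log
        if entry["prompt_tokens"] < SIMPLE_PROMPT_TOKENS
    )
    return counts.most_common()

print(local_candidates(sample_log))  # most frequent simple tasks first
```

Tasks that show up at the top of that list are the ones worth re-running through a local model for a side-by-side comparison, pelican-style.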