I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Free One Won.
What I Tested
I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of thing my agents ask for at 3 AM.
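To make that concrete, here's the shape of one task as the harness sees it: a prompt plus the patterns a correct answer should hit. The wording and regexes below are illustrative paraphrases, not verbatim from the harness.

```python
# Illustrative task definition (paraphrased prompt and patterns, not verbatim).
SHELL_ONE_LINER_TASK = {
    "id": "find-large-files",
    "prompt": "Write a single shell command that lists every file larger than "
              "100 MB under the current directory.",
    # Regexes a correct answer is expected to contain.
    "expected_patterns": [r"\bfind\b", r"-type\s+f", r"-size\s+\+"],
}
```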
Each task was scored by pattern matching: does the output contain the right function names, error handling, and edge cases? A match rate of 75%+ counts as a pass, 50-74% as a partial, and anything below as a fail.
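The scoring logic itself is nothing fancy; roughly this (a simplified sketch of the rule above, not the harness verbatim):

```python
import re

def score_response(response: str, expected_patterns: list[str]) -> tuple[float, str]:
    """Score a response by the fraction of expected patterns it contains."""
    hits = sum(1 for pattern in expected_patterns if re.search(pattern, response))
    score = hits / len(expected_patterns)
    if score >= 0.75:
        verdict = "pass"
    elif score >= 0.50:
        verdict = "partial"
    else:
        verdict = "fail"
    return score, verdict

# e.g. score_response(model_output, SHELL_ONE_LINER_TASK["expected_patterns"])
```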
No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
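For reference, every call went through OpenRouter's OpenAI-compatible chat completions endpoint with those fixed settings. A minimal sketch (the model slug in the comment is just an example):

```python
import os
import requests

def run_task(model: str, prompt: str) -> str:
    """Send one task prompt through OpenRouter with the fixed benchmark settings."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # e.g. "google/gemini-2.5-flash"
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 500,   # the 500-token limit
            "temperature": 0.1,  # same temperature for every model
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```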
The Results
| Model | Score | Passes | Cost (total) | Time (total) |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |
DeepSeek returned HTTP 400 on every call — OpenRouter compatibility issue, not a model problem. I excluded it rather than pretend it scored zero.
What Surprised Me
Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total — less than a single GPT-5.5 call. And it was 6x faster than GPT-5.5.
GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.
Claude Sonnet 4 was the most reliable: 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different but still-correct syntax that didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
What This Means for Agent Builders
If you're building agents that generate code:
- Best value: Gemini 2.5 Flash. Free tier exists. 10/10 passes. Fast.
- Most reliable: Claude Sonnet 4. Zero fails. Worth the $0.006/task.
- Avoid for code gen: GPT-5.5. It's brilliant at reasoning — use it for architecture decisions, not shell scripts.
I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.
What I'll Test Next
Error recovery. The four models that actually responded handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list" — it's "can you recover when the filesystem is read-only and the config is missing."
Total cost of this experiment: $0.20.
Full results: workswithagents.dev

