I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Free One Won.
What I Tested
I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of thing my agents ask for at 3 AM.
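To make that concrete, here's the shape of one task as the harness sees it: a prompt plus the patterns a correct answer should hit. The wording and regexes below are illustrative paraphrases, not verbatim from the harness.

```python
# Illustrative task definition (paraphrased prompt and patterns, not verbatim).
SHELL_ONE_LINER_TASK = {
    "id": "find-large-files",
    "prompt": "Write a single shell command that lists every file larger than "
              "100 MB under the current directory.",
    # Regexes a correct answer is expected to contain.
    "expected_patterns": [r"\bfind\b", r"-type\s+f", r"-size\s+\+"],
}
```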
Each task was scored by pattern matching: does the output contain the right function names, error handling, and edge cases? A match rate of 75%+ counts as a pass, 50-74% as a partial, and anything below as a fail.
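The scoring logic itself is nothing fancy; roughly this (a simplified sketch of the rule above, not the harness verbatim):

```python
import re

def score_response(response: str, expected_patterns: list[str]) -> tuple[float, str]:
    """Score a response by the fraction of expected patterns it contains."""
    hits = sum(1 for pattern in expected_patterns if re.search(pattern, response))
    score = hits / len(expected_patterns)
    if score >= 0.75:
        verdict = "pass"
    elif score >= 0.50:
        verdict = "partial"
    else:
        verdict = "fail"
    return score, verdict

# e.g. score_response(model_output, SHELL_ONE_LINER_TASK["expected_patterns"])
```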
No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
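For reference, every call went through OpenRouter's OpenAI-compatible chat completions endpoint with those fixed settings. A minimal sketch (the model slug in the comment is just an example):

```python
import os
import requests

def run_task(model: str, prompt: str) -> str:
    """Send one task prompt through OpenRouter with the fixed benchmark settings."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # e.g. "google/gemini-2.5-flash"
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 500,   # the 500-token limit
            "temperature": 0.1,  # same temperature for every model
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```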
The Results
| Model | Score | Passes | Cost (total) | Time (total) |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |
DeepSeek returned HTTP 400 on every call — OpenRouter compatibility issue, not a model problem. I excluded it rather than pretend it scored zero.
What Surprised Me
Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total — less than a single GPT-5.5 call. And it was 6x faster than GPT-5.5.
GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.
Claude Sonnet 4 was the most reliable: 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different but still-correct syntax that didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
What This Means for Agent Builders
If you're building agents that generate code:
- Best value: Gemini 2.5 Flash. Free tier exists. 10/10 passes. Fast.
- Most reliable: Claude Sonnet 4. Zero fails. Worth the $0.006/task.
- Avoid for code gen: GPT-5.5. It's brilliant at reasoning — use it for architecture decisions, not shell scripts.
I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.
What I'll Test Next
Error recovery. The four models that actually responded handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list" — it's "can you recover when the filesystem is read-only and the config is missing."
Total cost of this experiment: $0.20.
Full results: workswithagents.dev

