
I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Free One Won.


What I Tested

I gave 5 models the same 10 coding tasks. Not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kinds of things my agents ask for at 3 AM.
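To make that concrete, here is roughly what one task looks like as data. The wording, field names, and expected patterns below are my illustration, not the exact spec:

```python
# Hypothetical task spec in the style of the benchmark; all fields are illustrative.
task = {
    "id": "shell-large-files",
    "prompt": (
        "Write a one-line shell command that lists all files larger than "
        "100 MB under the current directory, sorted by size."
    ),
    # Strings a passing answer is expected to contain:
    "expected_patterns": ["find", "-size", "+100M", "sort"],
}
```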

Each task was scored on pattern matching: does the output contain the right function names, error handling, edge cases? Pass (75%+ of expected patterns), partial (50-74%), or fail (below 50%).
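A minimal sketch of how that scoring could work, assuming a per-task list of expected substrings (the function and thresholds here are my reconstruction, not the actual rubric):

```python
def score_output(output: str, patterns: list[str]) -> str:
    """Grade a model's answer by the fraction of expected patterns it contains."""
    hits = sum(1 for p in patterns if p in output)
    ratio = hits / len(patterns)
    if ratio >= 0.75:
        return "pass"
    if ratio >= 0.50:
        return "partial"
    return "fail"
```

Note the built-in limitation: an answer that is correct but phrased differently from the expected patterns scores as partial or fail.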

No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
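The harness for a setup like this can be a thin wrapper over OpenRouter's OpenAI-compatible chat endpoint. A hedged sketch, stdlib only; the function names are mine, and you should check OpenRouter's docs for the model IDs you actually use:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build the payload sent for every task: same limits for every model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,   # the 500-token cap from the setup
        "temperature": 0.1,  # near-deterministic sampling
    }

def run_task(model: str, prompt: str, api_key: str) -> str:
    """Send one task through OpenRouter and return the model's text answer."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    # A compatibility failure (like DeepSeek's HTTP 400s) raises here.
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```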

The Results

| Model | Score | Passes | Cost | Time |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |

DeepSeek returned HTTP 400 on every call — OpenRouter compatibility issue, not a model problem. I excluded it rather than pretend it scored zero.

What Surprised Me

Gemini 2.5 Flash went 10/10. Not a single task fell below 75%. It cost $0.008 total, less than a single GPT-5.5 call, and it was roughly 6x faster than GPT-5.5 (17s vs 109s).

GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.

Claude Sonnet 4 was the most reliable. 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different syntax — correct, but didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.

What This Means for Agent Builders

If you're building agents that generate code:

  • Best value: Gemini 2.5 Flash. Free tier exists. 10/10 passes. Fast.
  • Most reliable: Claude Sonnet 4. Zero fails. Worth the $0.0063/task.
  • Avoid for code gen: GPT-5.5. It's brilliant at reasoning — use it for architecture decisions, not shell scripts.

I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.

What I'll Test Next

Error recovery. All 5 models handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list" — it's "can you recover when the filesystem is read-only and the config is missing."

Total cost of this experiment: $0.20.

Full results: workswithagents.dev