#benchmark

I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Free One Won.

What I Tested

I gave 5 models the same 10 coding tasks, not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The...

May 9, 2026 · 3 min read
