Tom Smykowski beta


💰 I Tested 20+ AI Models on Real Coding Tasks; Here's How to Find the Cheapest One That Actually Works

Photo by Tara Winstead from Pexels

Lately, IDE and AI API costs have become a serious concern. You run just a few requests and suddenly $20 is gone. I wonder how AI companies plan to stay competitive if their APIs keep getting more expensive.

I've watched videos of people with AI labs running different models and measuring token speed. But that's like writing an app with console.log and sleep in a loop and measuring which is faster. Model speed is fairly predictable from the architecture.

What I really wanted to know was: which model is actually good enough to solve my real coding tasks, and what's the cheapest option that works?

I figured out a way to test this using your own Git history. You have the AI go through your project's commits, identify features and bugs of varying complexity (simple, medium, complex), then create test cases from them. The AI then sends these tasks to different models and has the best model evaluate which solutions are correct.
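To make the first step more concrete, here is a minimal sketch of how you could pull candidate commits out of Git history before handing them to an AI for review. It is not the exact tooling used in the experiment. The "@@" marker, the function names, and the line-count thresholds are all assumptions made for illustration, and lines changed is only a crude stand-in for real complexity.

```python
import subprocess
from dataclasses import dataclass

MARKER = "@@"  # hypothetical prefix that makes commit header lines easy to spot


@dataclass
class Commit:
    sha: str
    subject: str
    lines_changed: int = 0
    files_changed: int = 0


def read_git_log(repo_path: str = ".", limit: int = 200) -> str:
    """Return `git log --numstat` output for the most recent commits."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-n", str(limit),
         "--numstat", f"--pretty=format:{MARKER}%H|%s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_numstat_log(text: str) -> list[Commit]:
    """Parse header lines plus tab-separated 'added, deleted, path' lines."""
    commits: list[Commit] = []
    for line in text.splitlines():
        if line.startswith(MARKER):
            sha, _, subject = line[len(MARKER):].partition("|")
            commits.append(Commit(sha, subject))
        elif line.strip() and commits:
            parts = line.split("\t")
            if len(parts) < 3:
                continue
            commits[-1].files_changed += 1
            added, deleted = parts[0], parts[1]
            # Binary files are reported as "-" and add no line counts.
            if added.isdigit() and deleted.isdigit():
                commits[-1].lines_changed += int(added) + int(deleted)
    return commits


def rough_bucket(commit: Commit, simple_max: int = 20, medium_max: int = 150) -> str:
    """Crude size-based guess; the thresholds are assumptions, not measured values."""
    if commit.lines_changed <= simple_max:
        return "simple"
    if commit.lines_changed <= medium_max:
        return "medium"
    return "complex"
```

Calling parse_numstat_log(read_git_log(".")) inside a repository gives you a list you can sort into buckets. The AI still has to decide which commits are real features or bug fixes and turn them into test cases.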

What this article is about

This article presents a practical experiment comparing 20+ AI models from OpenAI, Anthropic, and other providers on real coding tasks extracted from a production project's Git history. I tested models on bug fixes and feature implementations of varying complexity (simple, medium, complex) and measured accuracy, cost, and token usage for each.
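As a rough illustration of how accuracy, cost, and token usage fit together, here is a small sketch. The class and function names are hypothetical, and the prices are parameters you would fill in from each provider's current pricing page. Nothing in it reflects the results of the experiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRun:
    model: str
    passed: int              # tasks judged correct
    total: int               # tasks attempted
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # price per 1M input tokens (you supply this)
    usd_per_m_output: float  # price per 1M output tokens (you supply this)

    @property
    def accuracy(self) -> float:
        return self.passed / self.total

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens / 1_000_000 * self.usd_per_m_input
                + self.output_tokens / 1_000_000 * self.usd_per_m_output)


def cheapest_passing(runs: list[ModelRun], min_accuracy: float = 1.0) -> Optional[ModelRun]:
    """Return the lowest-cost run that meets the accuracy bar, or None."""
    good_enough = [r for r in runs if r.accuracy >= min_accuracy]
    return min(good_enough, key=lambda r: r.cost_usd) if good_enough else None
```

Filtering on accuracy first and then picking the lowest cost matches the question "what's the cheapest option that works?". Dividing accuracy by cost would instead favor very cheap models that fail some tasks.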

Questions this article answers

  • Which AI models achieve 100% accuracy on real coding tasks?
  • What is the cost difference between the cheapest and most expensive options?
  • How do specialized coding models compare to general-purpose flagship models?
  • What method can you use to test AI models on your own codebase?
  • Which model offers the best value (accuracy per dollar)?

Article size and reading time

Full article: ~2,500 words, approximately 10-12 minutes reading time. Includes detailed results tables, cost analysis charts, and specific recommendations for different use cases.
