GPT-5 is overrated but still good
I've been testing the model since its release on various tasks:
- Complex agent workflow for Voice AI: This application has various chains of prompts and advanced techniques to guide users through their meal order. The model performs okay: it still gets about 16% of orders wrong, which is roughly the same as Claude 4 was scoring. 👌
- Edit an existing game: I had previously prompted Claude Sonnet 3-4 times to fix a particular bug with collision detection. You basically tap floating objects, and some pop to reveal surprises. One of the issues was touch sensitivity: on some objects, you have to tap 3-4 times before the object registers as collected. Claude Sonnet never managed to fix it; GPT-5 fixed it in one shot. ✅
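For context, a common cause of this kind of "tap several times" bug is a hit test whose radius matches the sprite exactly, so slightly off-centre taps miss. This is a hypothetical sketch of that failure mode, not the actual game code:

```python
import math

def is_hit(tap_x, tap_y, obj_x, obj_y, radius, slop=0.0):
    """Return True if a tap lands within `radius + slop` of the object centre.

    `slop` pads the hit area; touch UIs typically need a generous pad
    because finger taps are imprecise.
    """
    return math.hypot(tap_x - obj_x, tap_y - obj_y) <= radius + slop

# Strict hit test: a tap 24px off-centre on a 20px-radius object misses...
print(is_hit(124, 100, 100, 100, radius=20))           # False
# ...while a padded hit test registers the same tap.
print(is_hit(124, 100, 100, 100, radius=20, slop=10))  # True
```

Whether GPT-5's actual fix looked anything like this, I can't say; the point is that the bug class is small and well-defined, which makes the one-shot vs. never-fixed contrast notable.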
- General coding tasks: "Randomly select a product from a list of products and build a modal popup, showing the product name, price, discount banner (make this stylish, red, and a ribbon on the top right of the modal). Show other essential info like the price and photo." ~ A simple task, but the modal put the closing "X" on the discount banner, the image didn't display, and the title read "Special discount" instead of the product name. ❌ Claude Sonnet 4 got the modal mostly right in one shot, except the discount ribbon looked like a badge rather than a ribbon, but that's no biggie.
- GPT-5 mini and nano models: classification, attribute extraction, summarization, etc. These models seem to perform better than Gemini Flash / Flash-Lite, although only by a few percentage points. ✅
EDIT: After some time, I reverted from GPT-5 mini back to Gemini 2.5 Flash / 2.0 Flash. Maybe it's because of the hype right now, but the response times are consistently not great. The intelligence difference isn't that large, so Gemini still wins on speed and reliability. I used both the OpenAI API directly and OpenRouter; OpenRouter for Gemini works well.
Conclusion
The biggest selling point of the flagship model is its price at $1.25 (input) / $10.00 (output) per million tokens, which is much cheaper than Sonnet 4 ($3/$15). To be fair, I'm using the Codex CLI with API credits, which works out more expensive than Claude Pro for prolonged usage (I'm not sure at this stage whether the ChatGPT Pro subscription includes Codex CLI usage).
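To put that gap in numbers, here's a back-of-the-envelope comparison using the per-million-token rates above; the 5M-input / 1M-output workload is made up purely for illustration:

```python
# Rates in USD per 1M tokens (input, output), as quoted above.
PRICES = {
    "gpt-5": (1.25, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
}

def api_cost(model, input_tokens_m, output_tokens_m):
    """Cost in USD for a workload given in millions of tokens."""
    in_rate, out_rate = PRICES[model]
    return input_tokens_m * in_rate + output_tokens_m * out_rate

# Hypothetical workload: 5M input tokens, 1M output tokens.
print(api_cost("gpt-5", 5, 1))            # 16.25
print(api_cost("claude-sonnet-4", 5, 1))  # 30.0
```

On raw API rates GPT-5 is roughly half the price for this shape of workload, though as noted, heavy Codex CLI use on credits can still exceed a flat Claude Pro subscription.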
Beyond the price difference, the model seems on par with (or better than) Sonnet 4 in some cases, and worse in others.
I would classify this model as a welcome improvement over previous generations, but nothing groundbreaking. The main benefit is that GPT-5 offers a "mixture of experts" and various intelligence-level settings, which allow better control over cost and speed.
I must also add that working with the Codex CLI sucks. It's very "clunky" and ugly (it keeps asking me Yes/No questions and prints ugly, unformatted text everywhere!), and it's also slower than Claude Code, so Claude still has the edge when it comes to terminal tools. I don't use visual editors like Cursor or Windsurf, so I can't comment on those.