(2025-04-18) Zvim O3 Will Use Its Tools For You

Zvi Mowshowitz: o3 Will Use Its Tools For You. OpenAI has finally introduced us to the full o3 along with o4-mini.
Greg Brockman (OpenAI): Just released o3 and o4-mini! These models feel incredibly smart. We’ve heard from top scientists that they produce useful novel ideas. Excited to see their positive impact on people’s daily lives and humanity’s hardest problems!
Sam Altman: we expect to release o3-pro to the pro tier in a few weeks.
By all accounts, this upgrade is a big deal. They are giving us a modestly more intelligent model, but more importantly they are giving it better access to tools and the ability to discern when to use them, which helps us get more practical value out of it. The tool use, and the ability to string it together and persist, is where o3 shines.

This post covers all things o3: its capabilities and tool use, where to use it and where not to use it, and also the model card and o3’s alignment concerns. o3 is on the cusp of having dangerous capabilities, and while o3 is highly useful, it hallucinates remarkably often by today’s standards and engages in an alarmingly high number of deceptive and hostile behaviors. As usual, I am incorporating a wide variety of representative opinion and feedback, although in this case there was too much to include everything.

What’s In a Name

OpenAI’s naming continues to be a train wreck that doesn’t get better.

My Current Model Use Heuristics

After reading everyone’s reactions, I think all three major models have their uses now.

If I need good tool use including image generation, with or without reasoning, and o3 has the tool in question, or I expect Gemini to be a jerk here, I go to o3.
If I need to dump a giant file or video into context, Gemini 2.5 is still available. I’ll also use it if I need strong reasoning and I’m confident I don’t need tools.
If Claude Sonnet 3.7 is definitely up for the task, especially if I want the read-only Google integration, then great, we get to still relax and use Claude. Deep Research is available as its own thing, if you want it, which I basically don’t.
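The three rules above can be read as a small decision procedure. The following is only a sketch of that reading: the `Task` fields and the `pick_model` function are hypothetical names invented for illustration, and the order in which the rules are checked is an assumption, since the text does not rank them.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Every field is a hypothetical label for a consideration in the list above.
    needs_tools: bool = False              # tool use, including image generation
    o3_has_tool: bool = False              # o3 offers the tool in question
    gemini_likely_to_balk: bool = False    # "I expect Gemini to be a jerk here"
    giant_context: bool = False            # a giant file or video to load
    needs_strong_reasoning: bool = False   # strong reasoning is required
    claude_is_enough: bool = False         # Claude Sonnet 3.7 is definitely up for it


def pick_model(task: Task) -> str:
    """Return a model name following one reading of the heuristics above."""
    # Rule 1: tool use that o3 supports, or Gemini is expected to be difficult.
    if (task.needs_tools and task.o3_has_tool) or task.gemini_likely_to_balk:
        return "o3"
    # Rule 2: giant context, or strong reasoning with no tools needed.
    if task.giant_context:
        return "Gemini 2.5"
    if task.needs_strong_reasoning and not task.needs_tools:
        return "Gemini 2.5"
    # Rule 3: Claude is definitely sufficient.
    if task.claude_is_enough:
        return "Claude Sonnet 3.7"
    # Assumed fallback when no rule fires.
    return "o3"
```

For example, `pick_model(Task(giant_context=True))` returns `"Gemini 2.5"`. The fallback to o3 when nothing else applies is also an assumption of this sketch.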

In practice, I expect to use o3 most of the time right now. I do worry about the hallucination rate, but in the contexts I use AI for I consider myself very good at handling that. If I were worried about that, I’d have Gemini or Claude check its work.

Based on what people say, o3 is an excellent architect and finder of bugs, but you’ll want to use a different model for the coding itself, either Sonnet 3.7, Gemini 2.5 or GPT-4.1 depending on your preferences.

Huh, Upgrades

Use All the Tools

OpenAI: For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images.

Tool use is seamless. It all goes directly into the chain of thought. Intelligent dynamic tool use is a really huge deal. What tools do we have so far?

If you are using the API, you will in the future be able to add your own custom tools. I am very excited and curious to see what happens when people start adding additional tools. If you add Zapier or other ways to access and modify your outside contexts, do you suddenly get a killer executive assistant?
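To make "adding your own custom tool" concrete, here is a minimal sketch of the general pattern: you describe a function to the model, the model emits a call by name with JSON arguments, and your code runs it and returns the result. Nothing here calls a real API. The tool `add_calendar_event`, the `TOOL_SPECS` layout (loosely JSON-Schema-shaped), and the `dispatch` helper are all hypothetical stand-ins, not OpenAI's actual interface.

```python
import json
from typing import Any, Callable


def add_calendar_event(title: str, date: str) -> dict[str, Any]:
    """Hypothetical tool: pretend to create a calendar entry."""
    return {"status": "created", "title": title, "date": date}


# Registry mapping tool names to the Python callables that implement them.
TOOLS: dict[str, Callable[..., Any]] = {
    "add_calendar_event": add_calendar_event,
}

# Description the model would be shown (assumed, JSON-Schema-style layout).
TOOL_SPECS = [
    {
        "name": "add_calendar_event",
        "description": "Create a calendar event on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string", "description": "ISO date"},
            },
            "required": ["title", "date"],
        },
    }
]


def dispatch(tool_call: dict[str, Any]) -> str:
    """Run one model-requested call and return the result as a JSON string."""
    name = tool_call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(tool_call["arguments"])  # arguments arrive as JSON text
    return json.dumps(TOOLS[name](**args))
```

A call such as `dispatch({"name": "add_calendar_event", "arguments": '{"title": "Dentist", "date": "2025-05-01"}'})` returns a JSON string that would be fed back to the model as the tool's output. Unknown tool names come back as an error payload instead of raising, so the model has a chance to recover.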

Search the Web

I love o3’s web browsing, which in my experience is a step upgrade. I’m very happy that spontaneous web browsing is becoming the new standard. It was a big upgrade when Claude got browsing a few weeks ago, and it’s another big upgrade that o3 seems much better than previous models at seeking out information, especially key details, and at avoiding getting fooled. And of course it is now standard that you are given the sources, so you can check.

On Your Marks

The o3 and o4-mini System Card

Tests o3 Aced

Hallucinations

This seems like a pretty big deal? The hallucination rate is vastly higher now. They explain that this is a product of o3 making more claims overall. But that’s terrible. It shouldn’t be making more claims; the new scores are obviously worse.

Instruction Hierarchy

High Praise

Sycophancy

o3 Offers Mundane Utility

It loves to help. Please let it help?
David Shapiro: o3 really really really wants to build stuff. It’s like “bro can you PLEASE just let me code this up for you already???” It’s like an over-eager employee. I’m not complaining. I’m not even kidding. “Come on bro, it’ll only take 10 minutes bro”

o3 Doesn’t Offer Mundane Utility

o4-mini Also Exists

I shouldn’t be dismissing it. It is quite a bit cheaper than o3. It’s not implausible that o4-mini is on the Production Possibilities Frontier right now, if you need reasoning and tool use but you don’t need all the raw power. And of course, if you’re a non-Pro user, you’re plausibly running out of o3 queries. I still haven’t queried o4-mini, not even once. With the speed of o3, there’s seemed to be little point. So I don’t have much to say about it. What feedback I’m seeing is mixed.

Colin Fraser Dumb Model Watch

o3 as Forecaster

Is This AGI?

Some sources say yes!
Tyler Cowen: Basically it wipes the floor with the humans, pretty much across the board.

David Khoo: [Tyler you are] like Garry Kasparov after he lost to Deep Blue, or Lee Sedol after he lost to AlphaGo. The computer has exceeded your ability in the narrow areas you regard yourself strong and you find interesting and worthwhile. We get it. But that’s completely different from artificial GENERAL intelligence.

I find it fascinating to think about Tyler’s implied definition of AGI from this statement. Tyler correctly doesn’t care much about what the AI can’t do, or how you can trick it. What is important is what it can do, and how you can use it. I am, however, not so impressed by these examples, even if they are not cherry-picked, although I suspect this is more a mismatch in interests: for my purposes, Tyler is not asking good questions. This highlights the difference between what Tyler finds interesting and impressive, which draws on his ability to absorb and value unholy quantities of detail (he seems to live for that stuff), and what I find interesting and impressive, which is the logical structure behind things.

Edited: 2025-04-27 00:00:00

Bill Seitz