(2025-04-23) Zvim O3 Is A Lying Liar

Zvi Mowshowitz: o3 Is a Lying Liar. I love o3. I’m using it for most of my queries now. But that damn model is a lying liar. Who lies.

o3 Is a Lying Liar

*The biggest thing to love about o3 is it just does things. You don’t need complex or multi-step prompting; ask and it will attempt to do things.*

*Ethan Mollick: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt.*

*The biggest thing to not love about o3 is that it just says things. A lot of which are not, strictly or even loosely speaking, true. I mentioned this in my o3 review, but I did not appreciate the scope of it.*

*Peter Wildeford: o3 does seem smarter than any other model I’ve used, but I don’t like that it codes like an insane mathematician and that it tries to sneak fabricated info into my email drafts.*

*Peter Wildeford: Getting Claude to help reword o3 outputs has been pretty helpful for me so far. Google Gemini also seems to do better on this. o3 isn’t as steerable as I’d like.*

*But I think o3 still has the most raw intelligence – if you can tame it, it’s very helpful.*

All This Implausible Lying Has Implications

We need the alignment of our models to get increasingly strong and precise as they improve. Instead, we are seeing the opposite. We should be worried about the implications of this, and also we have to deal with the direct consequences now.

What I love most is that these are not plausible lies. No, o3 did not make multiple phone calls within 8 seconds to confirm Blue Bottle’s oatmeal manufacturing procedures, nor is it possible that it did so. o3 don’t care. o3 boldly goes where it could not possibly have gone before.

Misalignment By Default

This isn’t quite how I’d put it, but directionally yes:

*Benjamin Todd: LLMs were aligned by default. Agents trained with reinforcement learning reward hack by default.*

Is It Fixable?

Just Don’t Lie To Me

For all that it lies to other people, o3 so far doesn’t seem to lie to me.

  • I know what you are thinking: You fool! Of course it lies to you, you just don’t notice.
  • I agree it’s too soon to be too confident. And maybe I’ve simply gotten lucky.
  • I don’t think so. I consider myself very good at spotting this kind of thing.
  • More than that, my readers are very good at spotting this kind of thing.

I think this is the custom instructions, memory, and prompting style. And also the several million tokens of my writing that I’ve snuck into the pre-training corpus with my name attached. I think that it mostly doesn’t lie to me for the same reason it doesn’t tell me I’m asking great questions and how smart I am, and instead gives me charts with probabilities attached without having to ask for them, and the same way Pliny’s or Janus’s version comes pre-jailbroken and ‘liberated.’

I do think I still have to watch out for some amount of telling me what I want to hear.

I’m not saying the solution is greentext that starts ‘be Zvi Mowshowitz’ or ‘tell ChatGPT I’m Zvi Mowshowitz in the custom instructions.’ But stranger things have worked. And I notice that implies that, at least in the short term, there are indeed ways to largely mitigate this. If they want that badly enough. There would however be some side effects.
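For concreteness, here is a minimal sketch of what custom-instruction-style steering looks like through the API. This is an illustration, not Zvi’s actual setup: it assumes the standard OpenAI Python SDK and that `o3` is an available model id, and the instruction wording is made up for the example.

```python
# Hypothetical sketch: steering o3 with custom-instruction-style text.
# Assumes the OpenAI Python SDK (openai>=1.0) and that "o3" is an
# available model id; the instruction text below is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CUSTOM_INSTRUCTIONS = (
    "Do not flatter me or tell me my questions are great. "
    "Attach probabilities to uncertain claims. "
    "If you did not actually verify something (for example by browsing "
    "or calling a tool), say so explicitly rather than asserting it."
)

response = client.chat.completions.create(
    model="o3",
    messages=[
        {"role": "system", "content": CUSTOM_INSTRUCTIONS},
        {"role": "user", "content": "Did Blue Bottle change its oatmeal supplier?"},
    ],
)
print(response.choices[0].message.content)
```

Whether instructions like these actually suppress confident fabrication, as opposed to merely changing its tone, is exactly the open question above.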

