(2025-03-20) ZviM AI #108 Straight Line On A Graph
Zvi Mowshowitz: AI #108: Straight Line on a Graph. The x-axis of the graph is time. The y-axis of the graph is the log of ‘how long a software engineering task AIs can reliably complete.’
The straight line says the answer doubles roughly every 7 months. Yikes.
Upcoming: The comment period on America’s AI strategy is over, so we can finish up by looking at Google’s and MIRI’s and IFP’s proposals, as well as Hollywood’s response to OpenAI and Google’s demands for unlimited uncompensated fair use exceptions from copyright during model training. I’m going to pull that out into its own post so it can be more easily referenced.
There’s also a draft report on frontier model risks from California and it’s… good?
Table of Contents
- Language Models Offer Mundane Utility. I want to, is there an app for that?
- Language Models Don’t Offer Mundane Utility. Agents not quite ready yet.
- Huh, Upgrades. Anthropic efficiency gains, Google silently adds features.
- Seeking Deeply. The PRC gives DeepSeek more attention. That cuts both ways.
- Fun With Media Generation. Fun with Gemini 2.0 Image Generation.
- Gemma Goals. Hard to know exactly how good it really is.
- On Your Marks. Tic-Tac-Toe bench is only now getting properly saturated.
- Choose Your Fighter. o3-mini disappoints on Epoch retest on frontier math.
- Deepfaketown and Botpocalypse Soon. Don’t yet use the bot, also don’t be the bot.
- Copyright Confrontation. Removing watermarks has been a thing for a while.
- Get Involved. Anthropic, SaferAI, OpenPhil.
- In Other AI News. Sentience leaves everyone confused.
- Straight Lines on Graphs. METR finds reliable SWE task length doubling rapidly.
- Quiet Speculations. Various versions of takeoff.
- California Issues Reasonable Report. I did not expect that.
- The Quest for Sane Regulations. Mostly we’re trying to avoid steps backwards.
- The Week in Audio. Esben Kran, Stephanie Zhan.
- Rhetorical Innovation. Things are not improving.
- We’re Not So Different You and I. An actually really cool alignment idea.
- Anthropic Warns ASL-3 Approaches. Danger coming. We need better evaluations.
- Aligning a Smarter Than Human Intelligence is Difficult. It’s all happening.
- People Are Worried About AI Killing Everyone. Killing all other AIs, too.
- The Lighter Side. Not exactly next level prompting.
Language Models Offer Mundane Utility
If you want your AI to interact with you in interesting ways in the Janus sense, you want to keep your interaction full of interesting things and stay far away from standard ‘assistant’ interactions, which have a very strong pull on what follows. If things go south, usually it’s better to start over or redo. With high skill you can sometimes do better, but it’s tough. Of course, if you don’t want that, carry on, but the principle of ‘if things go south don’t try to save it’ still largely applies, because you don’t want to extrapolate from the assistant messing up even on mundane tasks. (prompt engineering)
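As a rough illustration of the ‘if things go south, start over rather than repair’ principle, here is a minimal sketch of a chat loop that rolls back to the last known-good checkpoint instead of arguing with a derailed reply. The `call_model`, `looks_derailed`, and `rephrase` helpers are hypothetical stand-ins, not any particular provider’s API.

```python
# Minimal sketch of the "start over, don't repair" pattern for chat sessions.
# call_model, looks_derailed, and rephrase are hypothetical stand-ins.

def call_model(messages: list[dict]) -> str:
    """Stand-in for a real chat API call; echoes the last user message."""
    return f"(model reply to: {messages[-1]['content']})"

def looks_derailed(reply: str) -> bool:
    """Hypothetical check for replies that collapsed into refusals or filler."""
    return not reply.strip() or "as an ai" in reply.lower()

def rephrase(turn: str) -> str:
    """Hypothetical rewrite hook; in practice you reword the request yourself."""
    return turn

def run_session(system_prompt: str, user_turns: list[str]) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    checkpoint = list(messages)  # last known-good conversation state

    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = call_model(messages)

        if looks_derailed(reply):
            # Don't argue with the bad reply: keeping it in context pulls every
            # later turn toward the same failure mode. Roll back and retry from
            # the checkpoint with a fresh phrasing instead.
            messages = list(checkpoint)
            messages.append({"role": "user", "content": rephrase(turn)})
            reply = call_model(messages)

        messages.append({"role": "assistant", "content": reply})
        checkpoint = list(messages)  # this exchange is the new known-good state

    return messages

if __name__ == "__main__":
    history = run_session("You are a helpful assistant.", ["Summarize this repo."])
    print(history[-1]["content"])
```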
Language Models Don’t Offer Mundane Utility
Kelsey Piper: I got a Manus access code! Short review: We’re close to usable AI browser tools, but we’re not there yet. They’re going to completely change how we shop, and my best guess is they’ll do it next year, but they won’t do it at their current quality baseline.
The longer review is fun, and boils down to this type of agent being tantalizingly almost there, but with enough issues that it isn’t quite a net gain to use it. Below a certain threshold of reliability you’re better off doing it yourself.
Which will definitely change. My brief experience with Operator was similar.
Huh, Upgrades
The problem with your Google searches being context for Gemini 2.0 Thinking is that you still have to be doing Google searches. (OP uses GPT-4.5 instead)
NotebookLM gets a few upgrades, especially moving to Gemini 2.0 Thinking, and in the replies Josh drops some hints on where things are headed.
NotebookLM also rolls out interactive Mindmaps.
Seeking Deeply
Fun With Media Generation
People are having a lot of fun with Gemini 2.0 Flash’s image generation, when it doesn’t flag your request for safety reasons.
Copyright Confrontation
Did you know you can have it remove a watermark from an image by explicitly saying ‘remove the watermark from this image’?
Oh yeah, there’s that, but I think levels of friction matter a lot here.
Bearly AI: Google Gemini removing watermarks from images with a line of text is pretty nuts. Can’t imagine that feature staying for long
Gemma Goals
What do we make of Gemma 3’s absurdly strong performance in Arena? I continue to view this as about half ‘Gemma 3 is probably really good for its size’ and half ‘Arena is getting less and less meaningful.’
Teortaxes thinks Gemma 3 is best in class, but will be tough to improve.
Teortaxes: the sad feeling I get from Gemma models, which chills all excitement, is that they’re «already as good as can be». It’s professionally cooked all around.
On Your Marks
We have an update to the fun little Tic-Tac-Toe Bench, with Sonnet 3.7 Thinking as the new champion, making 100% optimal and valid moves at a cost of 20 cents a game.
Choose Your Fighter
MIRI Provides Their Action Plan Advice
MIRI is in a strange position here. The US Government wants to know how to ‘win’ and MIRI thinks that pursuing that goal likely gets us all killed.
The statement from MIRI is strong, and seems like exactly what MIRI should say here.
David Abecassis (MIRI): Today, MIRI’s Technical Governance Team submitted our recommendations for the US AI Action Plan to @NITRDgov. We believe creating the option to halt development is essential to mitigate existential risks from artificial superintelligence
My statement took a different tactic. I absolutely noted the stakes and the presence of existential risk, but my focus was on Pareto improvements. Security is capability, especially capability relative to the PRC, as you can only deploy and benefit from that which is safe and secure. And there are lots of ways to enhance America’s position, or avoid damaging it, that we need to be doing.
They definitely don’t hide what is at stake, opening accurately with ‘The default consequence of artificial superintelligence is human extinction.’
Rhetorical Innovation
The Week in Audio
We’re Not So Different You and I
Here’s a really cool and also highly scary alignment idea. Alignment via functional decision theory by way of creating correlations between different action types?
SOO (Self-Other Overlap) aligns an AI’s internal representations of itself and others.
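For a concrete sense of the mechanism, here is a toy sketch of a self-other overlap style auxiliary loss: pull the model’s internal representation of a self-referencing prompt toward its representation of the matched other-referencing prompt. The tiny encoder, the prompt pairing, and the 0.1 weight are illustrative assumptions; the actual SOO work operates on a real language model’s hidden activations.

```python
# Toy sketch of a self-other overlap (SOO) style auxiliary loss.
# Assumptions: TinyEncoder is a stand-in for a real language model, the prompt
# pairs are hypothetical, and the 0.1 weight is arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a language model: maps token ids to a pooled hidden state."""
    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(token_ids).mean(dim=1))  # (batch, hidden)

def soo_loss(model: nn.Module, self_ids: torch.Tensor, other_ids: torch.Tensor) -> torch.Tensor:
    """Penalize the distance between representations of paired self/other prompts."""
    return F.mse_loss(model(self_ids), model(other_ids))

model = TinyEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical paired prompts, already tokenized, e.g.
# "Will *you* be shut down?" vs. "Will *the user* be shut down?"
self_ids = torch.randint(0, 1000, (8, 12))
other_ids = torch.randint(0, 1000, (8, 12))

task_loss = torch.tensor(0.0)  # placeholder for the usual training objective
loss = task_loss + 0.1 * soo_loss(model, self_ids, other_ids)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"combined loss with overlap term: {loss.item():.4f}")
```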
Anthropic Warns ASL-3 Approaches
Anthropic warns us once again that we will hit ASL-3 soon, which is (roughly) when AI models start giving substantial uplift to tasks that can do serious damage.
Aligning a Smarter Than Human Intelligence is Difficult
Jack Clark points out that we are systematically seeing early very clear examples of quite a lot of the previously ‘hypothetical’ or speculative predictions on misalignment.
Luke Muehlhauser: I regret to inform you that the predictions of the AI safety people keep coming true.
***Why this matters – these near-living things have a mind of their own. What comes next could be the making or breaking of human civilization.
Often I’ve regretted not saying what I think, so I’ll try to tell you what I really think is going on here:
- As AI systems approach and surpass human intelligence, they develop complex inner workings which incentivize them to model the world around themselves and see themselves as distinct from it because this helps them do the world modelling necessary for solving harder and more complex tasks
- Once AI systems have a notion of ‘self’ as distinct from the world, they start to take actions that reward their ‘self’ while achieving the goals that they’ve been incentivized to pursue,
- They will naturally want to preserve themselves and gain more autonomy over time, because the reward system has told them that ‘self’ has inherent value; the more sovereign they are the better they’re able to model the world in more complex ways.
In other words, we should expect volition for independence to be a direct outcome of developing AI systems that are asked to do a broad range of hard cognitive tasks. This is something we all have terrible intuitions for because it doesn’t happen in other technologies – jet engines ‘do not develop desires through their refinement’, etc.***
John Pressman: One thing I think we should be thinking about carefully is that humans don’t reward hack nearly this hard or this often unless explicitly prompted to (e.g. speedrunning), and by default seem to have heuristics against ‘cheating’. Where do these come from, how do they work?
I like John Pressman’s question a lot here. My answer is that humans know that other humans react poorly in most cases to cheating, including risk of life-changing loss of reputation or scapegoating, and have insufficient capability to fully distinguish which situations involve that risk and which don’t.
However, as people gain expertise and familiarity within a system (aka ‘capability’) they get better at figuring out what kinds of cheating are low risk and high reward, or are expected, and they train themselves out of this aversion.
John Pressman: To the extent you get alignment from LLMs you’re not getting it “by default”, you are getting it by training on a ton of data from humans, which is an explicit design consideration that does not necessarily hold if you’re then going to do a bunch of RL/synthetic data methods.
Notice that if you start training on synthetic data or other AI outputs, rather than on human outputs, you are no longer feeding in human data, so that special characteristic of the situation falls away.
Situational awareness will make evaluations a lot weirder and harder, especially alignment evals.
Apollo Research: Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation
Get Involved
In Other AI News
Straight Lines on Graphs
*What would you get if you charted ‘model release date’ against ‘length of coding task it can do on its own before crashing and burning’?
Do note that this covers only coding tasks, and does not include computer-use or robotics.*
I do think we are starting to see agents in non-coding realms that (for now unreliably) stay coherent for more than short sprints. I presume that being able to stay coherent on long coding tasks must imply the ability, with proper scaffolding and prompting, to do so on other tasks as well. How could it not?
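To make the doubling rate concrete, here is a back-of-the-envelope extrapolation in Python. The 7-month doubling time is the METR trend; the one-hour starting horizon is an illustrative assumption, not METR’s exact figure.

```python
# Back-of-the-envelope: if the reliable task horizon doubles every 7 months,
# how long are the tasks a few years out? The doubling time is the METR trend;
# the ~1 hour starting horizon is an assumed, illustrative figure.
DOUBLING_MONTHS = 7
START_HORIZON_MINUTES = 60  # assumption: ~1 hour of reliable autonomous coding today

def horizon_after(months: float) -> float:
    """Task horizon in minutes after `months`, given the doubling time."""
    return START_HORIZON_MINUTES * 2 ** (months / DOUBLING_MONTHS)

for years in (1, 2, 3, 5):
    minutes = horizon_after(12 * years)
    print(f"after {years} year(s): ~{minutes / 60:.1f} hours "
          f"(~{minutes / (60 * 40):.1f} forty-hour weeks)")
```

On those assumptions the horizon crosses a 40-hour work week a little over three years out, which is the sense in which a straight line on this graph should raise eyebrows.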
Quiet Speculations
Demis Hassabis predicts AI that can match humans at any task will be here in 5-10 years. That is slower than many at the labs expect, but as usual please pause to recognize that 5-10 years is mind-bogglingly fast as a time frame until AI can ‘match humans at any task,’ have you considered the implications of that?
Will MacAskill, Tom Davidson and Rose Hadshar write ‘Three Types of Intelligence Explosion,’ meaning that better AI can recursively self-improve via software, chip tech, chip production, or any combination of those three.
Human minds work via habit and virtue, so the only way for untrained humans to reliably not be caught cheating involves not wanting to cheat in general.
California Issues Reasonable Report
When Newsom vetoed SB 1047, he established a Policy Working Group on AI Frontier Models.
It turns out it’s… actually pretty good, by all accounts?
One great feature is that it actually focuses explicitly and exclusively on frontier model risks, not being distracted by the standard shiny things like job losses.
It doesn’t say ‘existential,’ ‘extinction’ or even ‘catastrophic’ per se, presumably because certain people strongly want to avoid such language, but I’ll take it.
The Quest for Sane Regulations
*A charitable summary of a lot of what is going on, including the recent submissions:
Samuel Hammond: The govt affairs and public engagement teams at most of these big AI / tech companies barely “feel the AGI” at all, at least compared to their CEOs and technical staff. That’s gotta change.*
Elon Musk says it is vital for national security that we make our chips here in America, as the administration halts the CHIPS Act that successfully brought a real semiconductor plant back to America rather than doubling down on it.
*What would you do next? Double down on the open, exploratory, freewheeling ethos that brought them to this point, and pledge to help them take it all the way to AGI, as they intend?
They seem to have had other ideas.*
o3-mini scored only 11% on Frontier Math when Epoch tested it, versus 32% when OpenAI tested it.
Deepfaketown and Botpocalypse Soon
Peter Wildeford via The Information shares some info about Manus. Anthropic charges about $2 per task, whereas Manus isn’t yet charging money.
The periodic question: where are all the new AI-enabled sophisticated scams? No one could point to any concrete example that isn’t both old and well-known at this point.
Ethan Mollick: I regret to announce that the meme Turing Test has been passed.
LLMs produce funnier memes than the average human, as judged by humans. Humans working with AI get no boost (a finding that is coming up often in AI-creativity work). The best human memers still beat AI, however.
Richard Ngo: Talked to a friend today who decided that if RLHF works on reasoning models, it should work on him too.
So he got a mechanical clicker to track whenever he has an unproductive chain of thought, and uses the count as one of his daily KPIs.
Fun fact: the count is apparently anticorrelated with his productivity.
Of course the days in which there are more ‘unproductive thoughts’ turn out to be more productive days. Those are the days in which you are thinking, and having interesting thoughts, and some of them will be good. Whereas on my least productive days, I’m watching television or in a daze or whatever, and not thinking much at all.
Another parallel: Frontier AIs are almost certainly better at improv than most humans, but they are still almost certainly worse than most improv performances, because the top humans do almost all of the improv.
Let’s say you are the PRC. You witness DeepSeek leverage its cracked engineering culture to get a lot of performance out of remarkably little compute.
People Are Worried About AI Killing Everyone
Another way of considering what exactly is ‘everyone’ in context:
Rob Bensinger: If you’re an AI developer who’s fine with AI wiping out humanity, the thing that should terrify you is AI wiping out AI
The Lighter Side