(2023-08-24) Udell Test Driven Development With LLMs Never Trust Always Verify

Jon Udell: Test-Driven Development with LLMs: Never Trust, Always Verify. As community lead for Steampipe, I’d long wanted a better way to visualize project activity. The raw information is available in GitHub changelogs, and the logs are written in a consistent style, so in theory it would be straightforward to extract structured data from the logs but — as always — the devil’s in the details. Writing regexes to match patterns in the changelogs was an arduous chore that I’d been putting off. Since LLMs are fundamentally pattern matchers, I figured they could help me get it done easier and faster.

For this exercise, I started with a detailed prompt.

The prompt concludes with this ambitious goal: Write a script to process the data in sample_data.py, and write tests to prove that it produces these outputs.

That was overly ambitious. Although I’m hearing stories of successful whole-program synthesis based on detailed specs, I’ve yet to make it happen.

I rebooted and tried again with a different strategy: Write tests, and ask LLMs to write functions that pass the tests.

I’m not sure why we should even expect LLMs to take detailed specs as input and, in a single shot, emit whole programs as output.

The goal, after all, is to create software that not only works (provably), but can be understood, maintained, and evolved by the same human/machine partnership that creates it. What’s the right way to keep the human in the loop?

For the reboot, I focused on the trickiest part of the problem: the regexes. For each pattern (New tables added, Enhancements, Bug fixes, Contributors), I wanted a function that would match the pattern and pass a test proving it could do that against sample data.
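As a rough illustration of the test-first approach described above, here is a minimal sketch of one function and its test. The changelog excerpt, the italic `_Heading_` layout, and the name `extract_section` are all assumptions made for this example; they are not the actual Steampipe data or the code from the exercise.

```python
import re

# Hypothetical changelog excerpt; the real logs may be formatted differently.
SAMPLE = """\
## v0.2.0 [2023-08-01]

_Enhancements_

- Added `title` column to `example_table`.

_Bug fixes_

- Fixed pagination in `example_table`.
- Fixed a crash when the config file is empty.

_Contributors_

- @someone
"""


def extract_section(changelog: str, heading: str) -> list[str]:
    """Return the bullet items listed under an italic _heading_ line."""
    # Capture the text after the heading, up to the next _Heading_ line
    # or the end of the changelog, whichever comes first.
    pattern = re.compile(
        rf"^_{re.escape(heading)}_\s*\n(.*?)(?=^_[^_\n]+_\s*$|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(changelog)
    if match is None:
        return []
    # Each item is a line that starts with "- ".
    return re.findall(r"^- (.+)$", match.group(1), re.MULTILINE)


def test_bug_fixes():
    # The test is written first; the function must be made to pass it.
    assert extract_section(SAMPLE, "Bug fixes") == [
        "Fixed pagination in `example_table`.",
        "Fixed a crash when the config file is empty.",
    ]


test_bug_fixes()
```

The point of the pairing is that the regex can stay opaque as long as the test, which is easy to read, pins down the expected behavior against sample data.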

For now I’m willing to accept a tradeoff: faster development of regexes that are harder for me to understand, but that I can test.

Iterative Test-Driven Development

ChatGPT with the Code Interpreter plugin is the gold standard, right now, for iterative generation of functions that are constrained to pass tests.

Cody and Copilot share a key advantage over ChatGPT: they are local, they can see your files, and you can converse with them in a way that doesn’t require pasting everything into a prompt window. I expect both will acquire the ability to iterate in an autonomous loop and look forward to seeing how they perform on a level playing field.

Meanwhile, though, ChatGPT-4 with Code Interpreter was the tool of choice for this exercise. Not without difficulties!

Over 100 separate source files are concatenated into a single large file of C-code named “sqlite3.c” and referred to as “the amalgamation”. The amalgamation contains everything an application needs to embed SQLite. This bundling strategy is a good way to work with LLMs.
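A bundling step along these lines could be sketched as follows. This is only an assumption about how one might do it, not the method used in the exercise; the function name `bundle` and the marker-comment format are hypothetical.

```python
from pathlib import Path


def bundle(paths, out_path):
    """Concatenate source files into one file, marking where each one starts."""
    parts = []
    for path in map(Path, paths):
        body = path.read_text(encoding="utf-8").rstrip()
        # A marker comment keeps the original file boundaries visible.
        parts.append(f"# ---- {path.name} ----\n{body}\n")
    bundled = "\n".join(parts)
    Path(out_path).write_text(bundled, encoding="utf-8")
    return bundled
```

Calling `bundle(["sample_data.py", "test_patterns.py"], "bundle.py")` would write both files into `bundle.py`, so the whole context can be handed over in one piece. Only `sample_data.py` is named in the original; the other file names are placeholders.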

Jon: You claim it passes the tests, but it doesn’t. Why did you say it does?

I recommend a variant of “trust but verify”: never trust, always verify.
