(2022-12-16) Shipper I Built An AI Chatbot Based On My Favorite Podcast

Dan Shipper: I Built an AI Chatbot Based On My Favorite Podcast. In the future, any time you look up information you’re going to use a chatbot. This applies to every piece of information you interact with day to day: personal, organizational, and cultural.

I love listening to the Huberman Lab podcast, a neuroscience podcast by Stanford neurobiologist Andrew Huberman.

It was simple to build, and it can already answer questions plausibly well. I can ask questions about topics that the podcast has covered in the past, and it answers them using transcripts of old episodes as an information source.

It still leaves a few things to be desired. For one, it gets things subtly wrong. For another, sometimes it’s not specific enough to answer the question, and I have to ask follow-ups to get the right answer.

The principles behind the Huberman bot are simple:
It ingests and makes searchable all of the transcripts from the Huberman Lab podcast.
When a user asks a question, it searches through all of the transcripts it has available and finds sections that are relevant to the query.
Then, it takes those sections of text and sends them to GPT-3 with a prompt that looks something like the one below (a rough code sketch of this step follows the prompt):
Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know." [ relevant sections of Huberman Lab transcripts ]
Q: What is task bracketing?
A:
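A minimal sketch of that last step, assuming the 2022-era openai Python library and its Completion API; the model name, parameters, and the search_transcripts helper are my assumptions, not code from the piece:

```python
import openai  # the 2022-era openai Python library, which exposed openai.Completion

def answer_question(question: str, relevant_sections: list[str]) -> str:
    """Assemble the prompt described above and send it to GPT-3."""
    context = "\n\n".join(relevant_sections)
    prompt = (
        "Answer the question as truthfully as possible using the provided context, "
        "and if the answer is not contained within the text below, "
        "say \"I don't know.\"\n\n"
        f"{context}\n\n"
        f"Q: {question}\nA:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # the flagship completions model at the time
        prompt=prompt,
        temperature=0,
        max_tokens=300,
    )
    return response["choices"][0]["text"].strip()

# Usage (search_transcripts is a hypothetical retrieval helper; see the search sketch below):
# sections = answer = search_transcripts("What is task bracketing?")
# print(answer_question("What is task bracketing?", sections))
```

Temperature 0 keeps the completion deterministic, which is usually what you want for this kind of factual Q&A.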

I built it mostly using this code example from OpenAI, with a bunch of custom modifications for my use case.

The length of the prompt you can send to the model is capped at 4,000 tokens (a token is roughly equivalent to ¾ of a word), so you’re limited in how much context you can feed it.
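That cap is easy to blow through with transcript text. Here is a sketch of one way to budget it, assuming OpenAI's tiktoken tokenizer (p50k_base is, to my understanding, the encoding used by text-davinci-003):

```python
import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.get_encoding("p50k_base")

def fit_context(sections: list[str], question: str,
                limit: int = 4000, reserve: int = 500) -> list[str]:
    """Greedily keep the most relevant sections until the token budget is spent.

    `reserve` leaves room for the instructions, the question, and the answer itself.
    """
    budget = limit - reserve - len(enc.encode(question))
    kept, used = [], 0
    for section in sections:  # assumed to be sorted most-relevant first
        n = len(enc.encode(section))
        if used + n > budget:
            break
        kept.append(section)
        used += n
    return kept
```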

You have to hope that your search algorithm (in this case a vector-similarity search using OpenAI’s embeddings) found the most relevant pieces of transcript, such that the answer to the question exists in what you’re providing to the model.
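Roughly, that search step looks like the sketch below, assuming numpy and the openai library's Embedding endpoint. The model name here is the newer text-embedding-ada-002 mentioned further down; the chunking and precomputation details are my assumptions:

```python
import numpy as np
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=text)
    return np.array(response["data"][0]["embedding"])

def top_sections(query: str, chunks: list[str],
                 chunk_vectors: np.ndarray, k: int = 5) -> list[str]:
    """Rank transcript chunks by similarity to the query.

    OpenAI's embeddings are (as far as I know) unit-length, so a plain
    dot product is equivalent to cosine similarity.
    """
    query_vector = embed(query)
    scores = chunk_vectors @ query_vector   # one score per chunk
    best = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [chunks[i] for i in best]

# chunk_vectors would be precomputed once over all transcript chunks:
# chunk_vectors = np.array([embed(c) for c in chunks])
```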

This often works, but it fails just as frequently: the bot is subtly wrong a lot, or not specific enough to fully answer the question.

It’s easy to dismiss this technology given these shortcomings. But most of them are immediately solvable.

The answers will get a lot better if I clean up the data used to generate them. Right now, they’re based on raw transcripts of podcast episodes. When humans talk, they don’t tend to talk in crisp sentences, so the answers to a lot of the questions I might ask are spread out across the episode and aren’t clearly spelled out in the transcript. If I cleaned up the transcripts to make sure that, for example, every term was clearly defined in a single paragraph of text, it would make for much better answers.
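One way that cleanup could be automated (my own sketch, not something the article describes) is to run each raw transcript chunk back through the model and have it rewritten as crisp, self-contained paragraphs before embedding:

```python
import openai

CLEANUP_PROMPT = (
    "Rewrite the following podcast transcript excerpt as clear, self-contained "
    "paragraphs. Define every technical term in a single paragraph and remove "
    "filler words. Do not add information that is not in the excerpt.\n\n"
    "Excerpt:\n{excerpt}\n\nRewritten:"
)

def clean_chunk(excerpt: str) -> str:
    """Ask GPT-3 to turn a rambling transcript chunk into crisp prose."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=CLEANUP_PROMPT.format(excerpt=excerpt),
        temperature=0,
        max_tokens=500,
    )
    return response["choices"][0]["text"].strip()
```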

If every time it answered a question it told me its source—e.g., where in the episode I could go to find more information—it wouldn't matter as much if the answer was vague or slightly wrong because I could check its work.
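A sketch of how that could work, assuming each transcript chunk is stored with hypothetical episode and timestamp metadata, and reusing the answer_question helper sketched above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    episode: str    # e.g. the episode title (hypothetical metadata field)
    timestamp: str  # e.g. "01:12:30" (hypothetical metadata field)

def answer_with_sources(question: str, chunks: list[Chunk]) -> str:
    """Append the provenance of every retrieved chunk to the answer."""
    answer = answer_question(question, [c.text for c in chunks])
    sources = "\n".join(f"- {c.episode} @ {c.timestamp}" for c in chunks)
    return f"{answer}\n\nSources:\n{sources}"
```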

In between building this bot and writing this article, OpenAI released a new version of its embeddings model that will significantly improve the results and lower the cost of getting them by 99%.

There are hundreds of thousands of copyrighted text, audio, and video corpora that could be turned into valuable chatbots today.

For a long time I’ve been enamored with the idea that every company should have a librarian: someone who is tasked with writing down tacit knowledge.

Where power settles in this ecosystem

I think power will settle in at least four places:
The operating system layer
The browser layer
The layer of models that are willing to return risky results to users
The copyright layer

