Join our host Ben Lorica and Douwe Kiela, cofounder of Contextual AI and author of the first paper on RAG, to find out why RAG remains as relevant as ever. Regardless of what you call it, retrieval is at the heart of generative AI. Find out why—and how to build effective RAG-based systems.

Points of Interest

0:25: Today's topic is RAG. With frontier models advertising massive context windows, many developers wonder if RAG is becoming obsolete. What's your take?

1:03: We now have a blog post: isragdeadyet.com. If something keeps getting pronounced dead, it will never die. These long context models solve a similar problem to RAG: how to get the relevant information into the language model. But it's wasteful to use the full context all the time. If you want to know who the headmaster is in Harry Potter, do you have to read all the books?

2:04: What will probably work best is RAG plus long context models. The real solution is to use RAG, find as much relevant information as you can, and put it into the language model. The dichotomy between RAG and long context isn't a real thing.

2:48: One of the main issues may be that RAG systems are annoying to build, and long context systems are easy. But if you can make RAG easy too, it's much more efficient.

3:07: The reasoning models make it even worse in terms of cost and latency. And if you're talking about something with a lot of usage, high repetition, it doesn't make sense.

3:39: You've been talking about RAG 2.0, which seems natural: emphasize systems over models. I've long warned people that RAG is a complicated system to build because there are so many knobs to turn. Few developers have the skills to systematically turn those knobs. Can you unpack what RAG 2.0 means for teams building AI applications?

4:22: The language model is only a small part of a much bigger system. If the system doesn't work, you can have an amazing language model and it's not going to get the right answer. If you start from that observation, you can think of RAG as a system where all the model components can be optimized together.

5:40: What you're describing is similar to what other parts of AI are trying to do: an end-to-end system. How early in the pipeline does your vision start?

6:07: We have two core concepts. One is a data store—that's really extraction, where we do layout segmentation. We collate all of that information and chunk it, store it in the data store, and then the agents sit on top of the data store. The agents do a mixture of retrievers, followed by a reranker and a grounded language model. (A rough code sketch of this flow appears after these notes.)

7:02: What about embeddings? Are they automatically chosen? If you go to Hugging Face, there are, like, 10,000 embeddings.

7:15: We save you a lot of that effort. Opinionated orchestration is a way to think about it.

7:31: Two years ago, when RAG started becoming mainstream, a lot of developers focused on chunking. We had rules of thumb and shared stories. This eliminates a lot of that trial and error.

8:06: We basically have two APIs: one for ingestion and one for querying. Querying is contextualized on your data, which we've ingested.

8:25: One thing that's underestimated is document parsing. A lot of people overfocus on embedding and chunking. Try to find a PDF extraction library for Python. There are so many of them, and you can't tell which ones are good. They're all terrible.

8:54: We have our stand-alone component APIs. Our document parser is available separately. Some areas, like finance, have extremely complex layouts. Nothing off the shelf works, so we had to roll our own solution. Since we know this will be used for RAG, we process the document to make it maximally useful. We don't just extract raw information. We also extract the document hierarchy. That is extremely relevant as metadata when you're doing retrieval. (See the second sketch below.)

10:11: There are open source libraries—what drove you to build your own, which I assume also encompasses OCR?
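To make the flow described at 6:07 and 8:06 concrete, here is a minimal, self-contained sketch of a retrieve-then-rerank-then-grounded-generation pipeline with separate ingestion and querying steps. Every name in it (Chunk, ingest, retrieve, rerank, build_grounded_prompt) is a hypothetical stand-in, not Contextual AI's API, and the keyword-overlap scorer is a toy placeholder for a real mixture of retrievers and embeddings.

```python
# Sketch of the retrieve -> rerank -> grounded-generation flow under the
# assumptions above; all names are illustrative, not a vendor API.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Document hierarchy (title, section, ...) kept as metadata for retrieval.
    metadata: dict = field(default_factory=dict)
    score: float = 0.0


def ingest(documents: list[dict]) -> list[Chunk]:
    """Ingestion stand-in: split each document's sections into chunks with metadata."""
    chunks = []
    for doc in documents:
        for section, text in doc["sections"].items():
            chunks.append(Chunk(text=text, metadata={"title": doc["title"], "section": section}))
    return chunks


def retrieve(query: str, store: list[Chunk], k: int = 5) -> list[Chunk]:
    """First-stage retrieval: crude term overlap in place of dense/sparse retrievers."""
    terms = set(query.lower().split())
    for chunk in store:
        chunk.score = float(len(terms & set(chunk.text.lower().split())))
    return sorted(store, key=lambda c: c.score, reverse=True)[:k]


def rerank(query: str, candidates: list[Chunk], k: int = 2) -> list[Chunk]:
    """Second-stage reranker: a real one rescores query-chunk pairs; here we just
    apply a length penalty to the retrieval score for illustration."""
    return sorted(candidates, key=lambda c: c.score / (1 + len(c.text) / 1000), reverse=True)[:k]


def build_grounded_prompt(query: str, evidence: list[Chunk]) -> str:
    """Grounded generation: the language model sees only retrieved evidence, not the full corpus."""
    context = "\n".join(
        f"[{c.metadata.get('title')} / {c.metadata.get('section')}] {c.text}" for c in evidence
    )
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Querying is contextualized on previously ingested data.
store = ingest([{
    "title": "Harry Potter and the Philosopher's Stone",
    "sections": {"Chapter 1": "Albus Dumbledore is the headmaster of Hogwarts."},
}])
query = "Who is the headmaster of Hogwarts?"
print(build_grounded_prompt(query, rerank(query, retrieve(query, store))))
```

In a production RAG 2.0-style system, the retriever, reranker, and generator would be learned components tuned together rather than hand-written heuristics, which is the "systems over models" point made at 3:39 and 4:22.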
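And since the 8:54 discussion turns on extracting document hierarchy rather than just raw text, here is a hedged sketch of carrying that hierarchy as chunk metadata. The ParsedNode shape and the flatten_with_hierarchy helper are assumptions made for illustration, not the output format of Contextual AI's document parser.

```python
# Sketch: walk a parsed layout tree and tag each chunk with its heading path,
# so retrieval can use document structure as metadata. Shapes are hypothetical.
from dataclasses import dataclass


@dataclass
class ParsedNode:
    heading: str
    text: str
    children: list["ParsedNode"]


def flatten_with_hierarchy(node: ParsedNode, path: tuple[str, ...] = ()) -> list[dict]:
    """Emit one chunk per node that has text, carrying the full heading path."""
    path = path + (node.heading,)
    chunks = []
    if node.text.strip():
        chunks.append({"text": node.text, "hierarchy": " > ".join(path)})
    for child in node.children:
        chunks.extend(flatten_with_hierarchy(child, path))
    return chunks


# Toy filing-style example: the heading path disambiguates similar passages.
doc = ParsedNode("Annual Report", "", [
    ParsedNode("Item 7. MD&A", "", [
        ParsedNode("Liquidity", "Management discussion of cash and equivalents.", []),
    ]),
    ParsedNode("Item 8. Financial Statements", "", [
        ParsedNode("Notes", "Note describing restricted cash balances.", []),
    ]),
])
for chunk in flatten_with_hierarchy(doc):
    print(chunk["hierarchy"], "->", chunk["text"])
```

Keeping the heading path with each chunk lets a retriever or reranker tell apart near-identical passages that live in different parts of a filing, the kind of complex layout in finance documents that the conversation points to.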