
Join our host Ben Lorica and Douwe Kiela, cofounder of Contextual AI and author of the first paper on RAG, to find out why RAG remains as relevant as ever. Regardless of what you call it, retrieval is at the heart of generative AI. Find out why—and how to build effective RAG-based systems.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Points of Interest
- 0:00: Introduction to Douwe Kiela, cofounder and CEO of Contextual AI.
- 0:25: Today’s topic is RAG. With frontier models advertising massive context windows, many developers wonder if RAG is becoming obsolete. What’s your take?
- 1:03: We now have a blog post: isragdeadyet.com. If something keeps getting pronounced dead, it will never die. These long context models solve a similar problem to RAG: how to get the relevant information into the language model. But it’s wasteful to use the full context all the time. If you want to know who the headmaster is in Harry Potter, do you have to read all the books?
- 2:04: What will probably work best is RAG plus long context models. The real solution is to use RAG, find as much relevant information as you can, and put it into the language model. The dichotomy between RAG and long context isn’t a real thing.
- 2:48: One of the main issues may be that RAG systems are annoying to build, and long context systems are easy. But if you can make RAG easy too, it’s much more efficient.
- 3:07: Reasoning models make it even worse in terms of cost and latency. And if you're talking about a use case with a lot of usage and high repetition, stuffing the full context in every time doesn't make sense.
- 3:39: You’ve been talking about RAG 2.0, which seems natural: emphasize systems over models. I’ve long warned people that RAG is a complicated system to build because there are so many knobs to turn. Few developers have the skills to systematically turn those knobs. Can you unpack what RAG 2.0 means for teams building AI applications?
- 4:22: The language model is only a small part of a much bigger system. If the system doesn’t work, you can have an amazing language model and it’s not going to get the right answer. If you start from that observation, you can think of RAG as a system where all the model components can be optimized together.
- 5:40: What you’re describing is similar to what other parts of AI are trying to do: an end-to-end system. How early in the pipeline does your vision start?
- 6:07: We have two core concepts. One is a data store; that's really extraction, where we do layout segmentation. We collate all of that information and chunk it, store it in the data store, and then the agents sit on top of the data store. The agents run a mixture of retrievers, followed by a reranker and a grounded language model. (A minimal sketch of this architecture appears after this list.)
- 7:02: What about embeddings? Are they automatically chosen? If you go to Hugging Face, there are, like, 10,000 embedding models.
- 7:15: We save you a lot of that effort. Opinionated orchestration is a way to think about it.
- 7:31: Two years ago, when RAG started becoming mainstream, a lot of developers focused on chunking. We had rules of thumb and shared stories. This eliminates a lot of that trial and error.
- 8:06: We basically have two APIs: one for ingestion and one for querying. Querying is contextualized on your data, which we’ve ingested.
- 8:25: One thing that’s underestimated is document parsing. A lot of people overfocus on embedding and chunking. Try to find a PDF extraction library for Python. There are so many of them, and you can’t tell which ones are good. They’re all terrible.
- 8:54: We have our stand-alone component APIs. Our document parser is available separately. Some areas, like finance, have extremely complex layouts. Nothing off the shelf works, so we had to roll our own solution. Since we know this will be used for RAG, we process the document to make it maximally useful. We don't just extract raw information. We also extract the document hierarchy. That is extremely relevant as metadata when you're doing retrieval. (See the hierarchy-metadata sketch after this list.)
- 10:11: There are open source libraries—what drove you to build your own, which I assume also encompasses OCR?
- 10:45: It encompasses OCR; it has VLMs, complex layout segmentation, different extraction models—it’s a very complex system. Open source systems are good for getting started, but you need to build for production, not for the demo. You need to make it work on a million PDFs. We see a lot of projects die on the way to productization.
- 12:15: It’s not just a question of information extraction; there’s structure inside these documents that you can leverage. A lot of people early on were focused on chunking. My intuition was that extraction was the key.
- 12:48: If your information extraction is bad, you can chunk all you want and it won’t do anything. Then you can embed all you want, but that won’t do anything.
- 13:27: What are you using for scale? Ray?
- 13:32: For scale, we’re just using our own systems. Everything is Kubernetes under the hood.
- 13:52: In the early part of the pipeline, what structures are you looking for? You mention hierarchy. People are also excited about knowledge graphs. Can you extract graphical information?
- 14:12: GraphRAG is an interesting concept. In our experience, it doesn’t make a huge difference if you do GraphRAG the way the original paper proposes, which is essentially data augmentation. With Neo4j, you can generate queries in a query language, which is essentially text-to-SQL.
- 15:08: It presupposes you have a decent knowledge graph.
- 15:17: And that you have a decent text-to-query language model. That's structured retrieval. You have to first turn your unstructured data into structured data. (A rough sketch of that path follows the list.)
- 15:43: I wanted to talk about retrieval itself. Is retrieval still a big deal?
- 16:07: It’s the hard problem. The way we solve it is still a hybrid: a mixture of retrievers. There are different retrieval modalities you can choose. At the first stage, you want to cast a wide net. Then you put that into the reranker, and the rerankers do all the smart stuff. You want to do fast first-stage retrieval, and rerank after that. It makes a big difference to give your reranker instructions. You might want to tell it to prefer recency, to prioritize anything the CEO wrote, or to respect data hierarchies. You need some rules to capture how you want to rank data. (See the two-stage retrieval sketch after this list.)
- 17:56: Your retrieval step is complex. How does it impact latency? And how does it impact explainability and transparency?
- 18:17: You have observability on all of these stages. In terms of latency, it’s not that bad because you narrow the funnel gradually. Latency is one of many parameters.
- 18:52: One of the things a lot of people don’t understand is that RAG does not completely shield you from hallucination. You can give the language model all the relevant information, but the language model might still be opinionated. What’s your solution to hallucination?
- 19:37: A general-purpose language model needs to satisfy many different constraints. It needs to be able to hallucinate—it needs to be able to talk about things that aren’t in the ground-truth context. With RAG you don’t want that. We’ve taken open source base models and trained them to be grounded in the context only. These language models are very good at saying, “I don’t know.” That’s really important. Our model cannot talk about anything it doesn’t have context on. We call it our grounded language model (GLM). (A rough approximation of this behavior is sketched after this list.)
- 20:37: Two things have happened in recent months: reasoning and multimodality.
- 20:54: Both are super important for RAG in general. I’m very happy that multimodality is finally getting the attention it deserves. A lot of data is multimodal: videos, complex layouts. Qualcomm is one of our customers; their data is very complex: circuit diagrams, code, tables. You need to extract the information the right way and make sure the whole pipeline works.
- 22:00: Reasoning: I think people are still underestimating how much of a paradigm shift inference-time compute is. We’re doing a lot of work on domain-agnostic planners and making sure you have agentic capabilities where you can understand what you want to retrieve. RAG becomes one of the tools for the domain-agnostic planner. Retrieval is the way you make systems work on top of your data.
- 22:42: Inference-time compute will be slower and more expensive. Is your system engineered so you only use that when you need to?
- 22:56: We are a platform where people can build their own agents, so you can build what you want. We have “think mode,” where you use the reasoning model, or the standard RAG mode, where it just does RAG with lower latency.
- 23:18: With reasoning models, people seem to become much more relaxed about latency constraints.
- 23:40: You describe a system that’s optimized end to end. That implies that I don’t need to do fine-tuning. You don’t have to, but you can if you want.
- 24:02: What would fine-tuning buy me at this point? If I do fine-tuning, the ROI would be small.
- 24:20: It depends on how much a few extra percent of performance is worth to you. For some of our customers, that can be a huge difference. Fine-tuning versus RAG is another false dichotomy. The answer has always been both. The same is true of MCP and long context.
- 25:17: My suspicion is that with your system I’m going to do less fine-tuning.
- 25:20: Out of the box, our system will be pretty good. But we do help our customers squeeze out max performance.
- 25:37: Those still fit into the same kind of supervised fine-tuning: here are some labeled examples.
- 25:52: We don’t need that many. It’s not labels so much as examples of the behavior you want. We use synthetic data pipelines to get a good enough training set. We’re seeing pretty good gains with that. It’s really about capturing the domain better.
- 26:28: “I don’t need RAG because I have agents.” Aren’t deep research tools just doing what a RAG system is supposed to do?
- 26:51: They’re using RAG under the hood. MCP is just a protocol; you would be doing RAG with MCP.
- 27:25: These deep research tools—the agent is supposed to go out and find relevant sources. In other words, it’s doing what a RAG system is supposed to do, but it’s not called RAG.
- 27:55: I would still call that RAG. The agent is the generator. You’re augmenting the G with the R. If you want to get these systems to work on top of your data, you need retrieval. That’s what RAG is really about.
- 28:33: The main difference is the end product. A lot of people use these to generate a report or slide deck they can edit.
- 28:53: Isn’t the difference just inference-time compute, the ability to do active retrieval as opposed to passive retrieval? You always retrieve. You can make that more active; you can let the model decide when and what it wants to retrieve. But you’re still retrieving. (See the active-retrieval sketch after this list.)
- 29:45: There’s a class of agents that don’t retrieve. They don’t work yet, but that’s the vision of agents going forward.
- 30:11: It’s starting to work. The tool used in that example is retrieval; the other tool is calling an API. What these reasoners are doing is just calling APIs as tools.
- 30:40: At the end of the day, Google’s original vision is what matters: organize all the world’s information.
- 30:48: A key difference between the old approach and the new approach is that we have the G: generative answers. We don’t have to reason over the retrievals ourselves any more.
- 31:19: What parts of your platform are open source?
- 31:27: We’ve open-sourced some of our earlier work, and we’ve published a lot of our research.
- 31:52: One of the topics I’m watching: I think supervised fine-tuning is a solved problem. But reinforcement fine-tuning is still a UX problem. What’s the right way to interact with a domain expert?
- 32:25: Collecting that feedback is very important. We do that as a part of our system. You can train these dynamic query paths using the reinforcement signal.
- 32:52: In the next 6 to 12 months, what would you like to see from the foundation model builders?
- 33:08: It would be nice if longer context actually worked. You will still need RAG. The other thing is VLMs. VLMs are good, but they’re still not great, especially when it comes to fine-grained chart understanding.
- 33:43: With your platform, can you bring your own model, or do you supply the model?
- 33:51: We have our own models for the retrieval and contextualization stack. You can bring your own language model, but our GLM often works better than what you can bring yourself.
- 34:09: Are you seeing adoption of the Chinese models?
- 34:13: Yes and no. DeepSeek was a very important existence proof. We don’t deploy them for production customers.
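Illustrative Sketches
On the data store plus agents architecture (6:07): a minimal sketch of that shape in Python, assuming naive blank-line chunking and caller-supplied retriever, reranker, and generator functions. The class and parameter names are hypothetical stand-ins, not Contextual AI's components.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

class DataStore:
    """Holds extracted, segmented, chunked documents (extraction itself omitted)."""
    def __init__(self):
        self.chunks: list[Chunk] = []

    def ingest(self, text: str, metadata: dict | None = None) -> None:
        # Stand-in for layout segmentation + chunking: split on blank lines.
        for piece in text.split("\n\n"):
            if piece.strip():
                self.chunks.append(Chunk(piece.strip(), dict(metadata or {})))

class Agent:
    """Sits on top of a data store: mixture of retrievers -> reranker -> grounded LM."""
    def __init__(self, store: DataStore, retrievers, reranker, glm):
        self.store = store
        self.retrievers = retrievers   # list of retrieve(query, chunks) -> list[Chunk]
        self.reranker = reranker       # rerank(query, chunks) -> list[Chunk]
        self.glm = glm                 # generate(query, chunks) -> str

    def query(self, question: str) -> str:
        candidates: list[Chunk] = []
        for retrieve in self.retrievers:              # cast a wide net
            candidates.extend(retrieve(question, self.store.chunks))
        ranked = self.reranker(question, candidates)  # do the smart stuff
        return self.glm(question, ranked[:5])         # answer grounded in top chunks
```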
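On document hierarchy as retrieval metadata (8:54): one way to picture it is that each chunk carries its section path, page, and element type, which retrieval can later filter or boost on. The field names and values below are invented for illustration.

```python
# Hypothetical output of a layout-aware parser: chunks keep their place in the
# document hierarchy instead of being flat strings.
chunks = [
    {
        "text": "Net revenue increased 12% year over year, driven by Devices.",
        "metadata": {
            "doc_id": "10-K-2024",
            "section_path": ["Item 7. MD&A", "Results of Operations", "Revenue"],
            "page": 41,
            "element_type": "paragraph",
        },
    },
    {
        "text": "Segment | 2023 | 2024\nDevices | 1.2B | 1.4B",
        "metadata": {
            "doc_id": "10-K-2024",
            "section_path": ["Item 7. MD&A", "Results of Operations", "Segment Results"],
            "page": 43,
            "element_type": "table",
        },
    },
]

# At query time the hierarchy becomes a filter or a ranking signal, e.g. prefer
# anything under "Results of Operations" for a revenue question.
revenue_chunks = [
    c for c in chunks
    if "Results of Operations" in c["metadata"]["section_path"]
]
print(len(revenue_chunks))
```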
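On structured retrieval over a knowledge graph (14:12–15:17): a rough sketch of the text-to-query path, assuming a Neo4j instance running locally and a stubbed `generate_cypher` where a real system would call a text-to-query model. The graph schema and credentials are made up.

```python
from neo4j import GraphDatabase  # pip install neo4j

def generate_cypher(question: str) -> str:
    # Stub for a text-to-query model; a real system would prompt an LLM with the
    # graph schema and the question, then validate the generated query.
    return (
        "MATCH (p:Person)-[:WROTE]->(d:Document) "
        "WHERE d.topic = $topic "
        "RETURN p.name AS author, d.title AS title LIMIT 10"
    )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    records = session.run(
        generate_cypher("Who wrote about circuit diagrams?"),
        topic="circuit diagrams",
    )
    for record in records:
        print(record["author"], "-", record["title"])
driver.close()
```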
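On the two-stage retrieval shape (16:07): a toy sketch where crude lexical overlap stands in for the first-stage mixture of retrievers, and the reranker applies instructions ("prefer recency," "prefer the CEO's documents") as score adjustments. Nothing here reflects how any particular reranker is actually implemented.

```python
from datetime import date

def first_stage(query: str, chunks: list[dict], k: int = 50) -> list[dict]:
    """Cast a wide net: cheap lexical overlap standing in for BM25/dense retrieval."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c["text"].lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def rerank(query: str, candidates: list[dict], instructions: dict) -> list[dict]:
    """Second stage: base relevance plus instruction-driven boosts."""
    terms = set(query.lower().split())

    def score(c: dict) -> float:
        s = float(len(terms & set(c["text"].lower().split())))
        if instructions.get("prefer_recency"):
            age_years = (date.today() - c["metadata"]["date"]).days / 365
            s += max(0.0, 3.0 - age_years)           # newer documents score higher
        if c["metadata"].get("author") in instructions.get("prefer_authors", []):
            s += 3.0                                  # e.g. prioritize the CEO's memos
        return s

    return sorted(candidates, key=score, reverse=True)

# Usage: wide first stage, then instruction-aware reranking of the survivors.
# top = rerank(q, first_stage(q, all_chunks),
#              {"prefer_recency": True, "prefer_authors": ["CEO"]})
```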
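On grounded generation (19:37): the GLM itself is a trained model rather than a prompt, but the contract it enforces—answer only from the provided context or say "I don't know"—can be roughly approximated in ordinary RAG code. A sketch, with `llm` standing in for any completion function:

```python
def grounded_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the context below and cite passage numbers. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def grounded_answer(question: str, passages: list[str], llm) -> str:
    # Prompting is a weaker substitute for a model trained to stay grounded,
    # but the contract is the same: context in, cited answer or "I don't know" out.
    if not passages:
        return "I don't know."
    return llm(grounded_prompt(question, passages))
```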
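On active versus passive retrieval (28:53): active retrieval just means the model decides, inside a loop, when to retrieve and with what query. A minimal sketch where `plan_next_step` is a stub standing in for the reasoning model or planner:

```python
def plan_next_step(question: str, evidence: list[str]) -> dict:
    # Stub for the planner: a real system would ask a reasoning model to choose
    # between retrieving again (and with what query) and answering.
    if not evidence:
        return {"action": "retrieve", "query": question}
    return {"action": "answer"}

def agentic_rag(question: str, retrieve, generate, max_steps: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        step = plan_next_step(question, evidence)
        if step["action"] != "retrieve":
            break
        # Active retrieval: the model chose when to retrieve and what to ask for.
        evidence.extend(retrieve(step["query"]))
    # Still RAG: retrieval augments the generator, whatever we call the loop.
    return generate(question, evidence)
```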