In this episode, Gabriel (Founder) and Caine (CTO, first employee) discuss the history of our search engine, why now is the right time to build a full web search index, and how our scale makes us uniquely positioned to ship, learn, and iterate quickly. Disclaimer: The audio, video (above), and transcript (below) are unedited and may contain minor inaccuracies or transcription errors.

Gabriel: Hello, welcome back to Duck Tales. I haven't been here in a while. I am Gabriel Weinberg, the founder of DuckDuckGo. And I have with me someone who I don't think has been on Duck Tales at all yet, but you should know: Caine Tighe, who I know very well, who's the first employee of DuckDuckGo and now our CTO. Caine.

Caine: Hi, Gabe.

Gabriel: We've been working together for a very long time. And we're here today to talk about something we've both been working on (Caine more than me, but I'm working on it some): our web search index. So, first, some background. DuckDuckGo started as a search engine, as many people know, and it was actually started by me. I was by myself for a few years. And the first thing I did was start crawling the web and building a web index.

Caine: Yeah, for sure.

Gabriel: But I soon realized that that is very expensive, especially as one person. And there were other places to get a web index at the time. What was more interesting was adding value on top of the web index: building other indexes. This was the mid-2000s; there obviously wasn't AI, but there weren't even really many instant answers on search engines.

Caine: I mean, that's what we were working on together at the very, very beginning. You had the knowledge graph; it wasn't called a knowledge graph at the time, but you were doing all the structured content from Wikipedia and otherwise.
We worked on some other smaller indices. And then, fun fact: our backend hiring project is still based on some of the spam and content farm crawlers that you originally wrote. So that lives on, 15 years later.

Gabriel: Yeah. So we were doing lots of indexing and lots of crawling; we just stopped doing a full web index. But just as examples: the code you were talking about, indexing Wikipedia, became our knowledge graph, which powers a lot of answers and which we also used when we started working on AI answers. We've been doing local indexing, local businesses and things like that, for over a decade. Then all sorts of niche indexes that involve some crawling, like lyrics and things like that. So indexing technology is not new to us, despite what some people say; we do lots of search indexing. But we hadn't been doing a full web index until relatively recently, the last few years-ish. Now we are. And so the questions, and why you're here, and we'll talk about it for a few minutes, are the why, the what's going on, and the how. All the main questions, which we're obviously not all going to answer today. But we can start with the why: why are we well positioned to work on this? You're kind of at the center of it, so I think you're a good person to ask.

Caine: Yeah. I think the why now is a mixture of our needs: we want to support our own AI use cases. We have two primary agent-driven products out: Search Assist, which is on the SERP (the search engine results page, duckduckgo.com), and Duck AI, which is our chatbot. Both of those products are hungry for this kind of data. So it just makes sense for that.
Gabriel: Yeah, in particular. You could maybe talk percentages: some percentage of search results now, what is it, like 25%, I think, have Search Assist answers. And then some significant percentage of Duck AI prompts call the web, 15% maybe.

Caine: Yeah, I always do my numbers based on absolutes; absolutes make more sense.

Gabriel: You know what? Bad question. Ignore the numbers; it doesn't matter. A good percentage of queries and Duck AI prompts require web search, and so we need a web index for it, essentially, right?

Caine: Yeah. I mean, on the chatbot side, it's really good to ground. If you're deciding whether or not to ground and you're on the line, you should probably use RAG, retrieval-augmented generation, and go out to a third-party data set. For us, raising the standard of trust online, we want to do that, because the more that you ground, it's known empirically, the better the answers are. So we err on the side of grounding, where maybe not everybody does. So we really need to build our own index in order to accommodate that. That's some of the why now. And again, it's on Search Assist and it's on Duck AI. One of my favorite parts about this whole thing is that we're very used to working for customers, our end users. For the search index, duckduckgo.com itself is the customer. That's a nuanced, unique thing for us, being able to serve ourselves, and it creates a really tight feedback loop internally. So it's been cool to use our own index, and we are live for some amount of the traffic today, and that's just growing day over day for these use cases.
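To make the grounding idea above concrete, here is a minimal sketch of retrieval-augmented generation as Caine describes it: retrieved web snippets get packed into the prompt so the model answers from sources rather than from memory alone. The function names, prompt wording, and confidence threshold are all illustrative assumptions, not DuckDuckGo's actual system.

```python
def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Prepend retrieved web snippets so the model answers from sources."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using only the sources below; cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def should_ground(confidence: float, threshold: float = 0.8) -> bool:
    """Err on the side of grounding: retrieve unless the model is
    clearly confident it can answer from its own knowledge alone.
    The 0.8 cutoff is a made-up placeholder."""
    return confidence < threshold

prompt = build_grounded_prompt(
    "Who founded DuckDuckGo?",
    ["DuckDuckGo was founded by Gabriel Weinberg in 2008."],
)
print(prompt)
```

"Erring on the side of grounding" corresponds to setting the threshold high, so borderline queries trigger retrieval rather than a purely parametric answer.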
Gabriel: That's a good point, because in terms of our being well positioned to do it: being live, maybe we talk about that a little bit, but that creates a feedback loop that a lot of people don't have. We have many, many millions of people using our search engine and now Duck AI, so we're getting constant feedback about the relevancy of the search results we're serving, not to mention the fact that we have almost 20 years of evaluating relevancy ourselves on our own search engine.

Caine: Exactly. Humans are, unsurprisingly and appropriately, more critical of results than agents are. So it creates this higher-fidelity feedback loop, because through our anonymization and whatever else, we can privately understand what is most relevant on the internet for customers and users. And that positions us to be pretty competitive in the space quickly. I think that's interesting and exciting, and the true DuckDuckGo way, as you and I know well: we like to ship stuff. So it's been really cool to already be using our own index in production. It creates that flywheel. We could use buzzwords like reinforcement learning and this, that, and the other thing, but at the end of the day it's really the relationship of consuming your own internal API product. That's the flywheel, and it allows you to establish relative priority really quickly and say: I ran this experiment; we really thought this query set was going to be well suited to our own index; we tried it and it's not working that well; let's move to this other one.
And then it just changes the game for how quickly you can iterate, which has been really exciting. I know the team's really excited about it too, because engineers like to ship things. So that's been cool.

Gabriel: Perfect, I think that's a good intro. But to your point about buzzwords, let's do a few more buzzwords: just give us the broad tech flow, without getting too deep into anything, to give people a sense of how it works.

Caine: Yeah. The way that I think about this is like a little pipeline, or a train, or whatever. You have your frontier, which is the part of the web that you're looking to crawl, because you have to pick what your frontier is. Then you crawl that. All of these components are extremely complicated by themselves. To crawl means you need to crawl politely: some sites want you to crawl, some sites don't, and to be a good, trustworthy netizen you have to respect those things. That's an important part of crawling. It's also important to have the bandwidth and the throughput to crawl at the scale you need. Fortunately for us, we've had a lot of experience with that, so we have that. Then the rendering side: when you fetch content, you have to render it, including JavaScript and everything else. For many pages, the only way to get the content is to literally run the whole webpage; otherwise you get no content. So that's quite an expensive process. We do a naive approach first and then a more complicated rendered approach. Then you have content extraction, the next step, where you think about your title, your description, your headings, metadata, main body, where you extract what the page actually means. And then we're very fortunate in today's day and age to have semantics: semantic search is a big part of the pipeline.
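Two of the pipeline steps above, polite crawling and content extraction, can be sketched in a few lines of standard-library Python. This is a rough illustration, not DuckDuckGo's actual code: the function and class names are invented, and a real pipeline would render JavaScript before extraction, as Caine notes.

```python
from html.parser import HTMLParser
from urllib import robotparser

def allowed_to_crawl(robots_url: str, user_agent: str, page_url: str) -> bool:
    """The 'crawl politely' step: honor the site's robots.txt."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches robots.txt over the network
    return rp.can_fetch(user_agent, page_url)

# Void elements have no closing tag, so they never go on the tag stack.
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}

class ContentExtractor(HTMLParser):
    """Pull title, meta description, and headings out of rendered HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headings = []
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") == "description":
                self.description = d.get("content", "")
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        tag = self._stack[-1]
        if tag == "title":
            self.title += data.strip()
        elif tag in ("h1", "h2", "h3"):
            self.headings.append(data.strip())

page = """<html><head><title>Ducks</title>
<meta name="description" content="All about ducks"></head>
<body><h1>Why ducks?</h1><p>Body text.</p></body></html>"""
extractor = ContentExtractor()
extractor.feed(page)
print(extractor.title, extractor.description, extractor.headings)
```

A production crawler would also honor crawl-delay directives and cap per-host request rates, which is the bandwidth-and-throughput side Caine mentions.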
And what that means is what people are calling embeddings. You calculate embeddings on the extracted content. And then we use a database, which I quite like, called Vespa; it's all ingested into Vespa. In my opinion, your indexing, your ingestion, your features, how you calculate those things, and how they describe the content: that's a big description of your product. Because i