In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in infrastructure software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to educate the next generation of engineers who were not there when those transformational projects were created.

In our first “season,” we start with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. And what better way to kick off the Data Engineering season than with an episode on Hadoop, a project that is synonymous with Big Data? We were incredibly fortunate to host the creators of Hadoop, Doug Cutting and Mike Cafarella, to share with us the untold history of Hadoop, how multiple technical breakthroughs and a little luck came together for them to create the project, and how Hadoop created a vibrant open source ecosystem that led to the next generation of technologies such as Spark.

Timestamps

* Introduction [00:00:00]
* Origin story of Hadoop [00:03:26]
* How Google’s work influenced Hadoop [00:05:47]
* Yahoo’s contribution to Hadoop [00:13:51]
* Major milestones for Hadoop [00:20:06]
* Core components of Hadoop - the why’s and how’s [00:22:44]
* Rise of Spark and how the Hadoop ecosystem reacted to it [00:27:19]
* Hadoop vendors and the tension between Cloudera and Hortonworks [00:31:51]
* Proudest moments for the Hadoop creators [00:33:56]
* Lightning round [00:36:04]

Transcript

Sudip: Welcome to the inaugural episode of the Engineers of Scale podcast. In our first season, we'll cover the projects that have transformed and shaped the data engineering industry. And what's better than starting with Hadoop, the project that is synonymous with Big Data?
Today, I have the great pleasure of hosting Doug Cutting and Mike Cafarella, the creators of Hadoop. And just for the record, Hadoop is an open source software framework for storing enormous amounts of data and for distributed processing of very large data sets. Think hundreds and thousands of petabytes of data on, again, hundreds and thousands of commodity hardware nodes. If you have anything to do with data ever, you certainly know of Hadoop and have either used it or definitely have benefited from it one way or another. In fact, I remember back in 2008, I was working on my second startup, and we were actually processing massive amounts of data from retailers coming from their point of sale systems and inventory. And as we looked around, Hadoop was the only choice we really had. So today I'm incredibly excited to have the two creators of Hadoop, Mike Cafarella and Doug Cutting, with us. Mike and Doug, welcome to the podcast. It is great having you both.

[00:01:02] Doug: It's great to see you. Thank you. Thanks for having us.

[00:01:10] Sudip: If you guys don't mind, I think for our listeners, it'll be great to know what you guys are up to these days. Mike, maybe I'll start with you and then Doug.

[00:01:19] Mike: Sure. I'm a research scientist in the Data Systems group at MIT.

[00:01:27] Doug: I'm a retired guy. I stopped working 18 months ago. My wife ran for public office and it was a good time for me to transition into being a home keeper, do shopping and cooking. But I also have a healthy hobby of mountain biking and doing trail advocacy and development, trying to build more trail systems around the area that I live in.

[00:01:44] Sudip: Sounds like you're having real fun, Doug. One day we all aspire to get there, for sure. I'm really curious to know how you guys met. I've seen some interviews of you guys. You kind of talked about how, I think, Doug, you were working on Lucene at that time and then connected with Mike somehow through a common friend.
I'd love to know a little more detail on how you guys met and how you guys started working together.

[00:02:06] Doug: It kind of goes back to Hadoop really. Hadoop was preceded by this project, Nutch. Nutch was initiated when a company called Overture, which we'll probably hear more about, called me up out of the blue as a guy who had experience in both web search engines and open source software and said, hey, how would you like to write an open source web search engine? And I said, that'd be cool. And they said they had money to pay me at least part time and maybe a couple other people. And did I know anyone? I didn't know anybody offhand, but I had friends. I called up my freshman roommate, a guy named Sami Shaio, who is a founder of Marimba. And I said, Sami, do you know anybody? And he said, you should talk to Mike Cafarella. I think it was the only name that I got. And I called Mike and he said, yeah, sure, let's do this.

[00:02:49] Mike: So at the time, this would be in like late summer, early fall of '02. I had worked in startups and in industry for a few years, but I was looking to go back to school. So I was putting together applications for grad school. And I was working with an old professor of mine to kind of spruce up my application a little bit because I had been out of research and so on for a while. And that was a fun project, but it wasn't consuming all my time. And so Sami, who was one of the founders of Marimba, which was my first job out of college, he got in touch and said that his buddy, Doug, had an interesting project and I should make sure I go talk to him, which was great. I was looking for something to do and it came at just the right moment.

[00:03:26] Sudip: That was quite a connection, Mike. And then going back to that timeframe, 2002-2003, I think, Doug, you started touching on how you started working on Nutch and how it eventually became Hadoop. Would you mind just maybe walking us through a little bit of the origin story of Hadoop?
I mean, I know Overture funded you for writing the web crawler, but what was their interest in an open source web crawler in the first place?

[00:03:49] Doug: I think that's a good question to get back to some of the business context. We want to mostly focus on tech here, but the business context matters, as is often the case. So I had worked on web search from '96 to '98 at a company called Excite. I'd been pretty much the sole engineer doing the backend search and indexing system. And then I transitioned away from that and had written this search engine on the side called Lucene, which I ended up open sourcing in 2000.

Also in '98, Google launched, and initially they were ad-free. All the other search engines, there were a handful of them, were totally encrusted and covered with display ads. So just think like magazine ads, just random ads that they managed to sell the space to advertisers. Google started with no ads, and they also really focused and spent a lot of effort trying to work on search quality. All they were doing was search. Everybody else was trying all kinds of things to get more ads in front of people, and Google just focused on making search better. And by 2000, they'd succeeded, and with the combination of this really clean, simple interface and better quality search results, they had taken most of the search market share already.

But they needed a revenue plan. This company called Overture had, in the meantime, invented a way to make a lot of money from web search by auctioning off keywords to advertisers and matching them to the query. Google copied that and started minting money themselves. Overture was nervous because they had this market, and they were licensing it to Yahoo and Microsoft and others, but they were worried that all of their customers were going to get beaten by Google and go out of business. So on one hand, they sued Google. That's an interesting side story.
But on the other hand, they decided, we should build our own search engine to compete with Google. We somehow need to do this. They bought AltaVista. They tried to build something internally, and they also thought, you know, open source is this big trend. Let's do an open source one to have something to compete. So they called me, and I called Mike, and we worked with a small team of guys there at Overture, led by a guy named Dan Fain, and we started working on trying to build web search as open source.

[00:05:47] Sudip: That is such a phenomenal historical context. I don't think very many people, including myself, had that. And then interestingly, Google also came out with their GFS paper in 2003 and their MapReduce paper in 2004, which obviously influenced a lot of the work that I think you guys did down the line. I'm curious, what do you think might have caused Google to publish those papers in the first place? Any hypothesis on that?

[00:06:14] Mike: I think you're putting your finger on something interesting and important, which was, at the time, it wasn't common practice to have a research paper that told you a lot of technical details about an important piece of infrastructure. I don't think it was part of some genius, long-term plan to profit down the road. It was part of a general culture at the place to emphasize the virtues of publishing and openness and science. Maybe it helped them with hiring or something like that, but if so, that was kind of an indirect benefit. And it was really trend-setting. I mean, they ended up publishing a ton of papers. I think Microsoft and Yahoo and other companies followed suit. There's a whole string of really interesting papers throughout the 2000s and early 2010s, systems that we might never have learned about had they remained totally closed. But it's interesting to think about the impact of the GFS paper, I think, on our experience, Doug, which was we had worked on Nutch for, I guess, about a year.
And after about a year's time, I recall that it was indexing on the order of tens of millions of pages, but you couldn't get more than a month's worth of fresh