Engineers of Scale

Sudip Chakrabarti, Partner at Decibel.vc

Hello everyone, welcome to the Engineers of Scale podcast. In this podcast, we go back in time and give you an insider’s view on the projects that have completely transformed the infrastructure industry. Most importantly, we celebrate the heroes who created and led those projects. sudipchakrabarti.substack.com

Episodes

  1. December 12, 2024

    Data Engineering: The Past, Present and Future with Joseph Hellerstein

    In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating look back on Hadoop; Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, for a technical deep-dive into Spark; Ryan Blue, creator of Apache Iceberg, on the technical breakthroughs that made Iceberg possible; and Stephan Ewen, creator of Apache Flink. In this episode, we host Joseph Hellerstein, Professor at UC Berkeley and founder of Trifacta and RunLLM. Joe helps us step back and explore the evolution of Data Engineering over the past several decades while also discussing the future innovations on the horizon.

    Show Notes

    Timestamps

    * [00:00:01] Introduction and Joe’s background
    * [00:01:38] What got Joe interested in data engineering
    * [00:03:59] Defining data engineering and its key components
    * [00:05:16] Significant trends and changes fueling data engineering over the last 20 years
    * [00:06:30] Key components of data engineering and the role of each
    * [00:08:07] Contrasting the modern data stack with the traditional data stack
    * [00:12:10] Developers vs. data engineers/analysts in building data pipelines
    * [00:14:12] Role of AI and LLMs in data preparation and cleaning
    * [00:16:51] Journey from data warehouses to data lakes to data lakehouses
    * [00:21:14] Role of data catalogs in data engineering going forward
    * [00:32:57] Unified data platforms vs. best-of-breed tools for data engineering
    * [00:37:03] Possibility of one system serving both OLTP and OLAP use cases
    * [00:40:46] Impact of AI on the data stack and data engineering
    * [00:43:23] Interesting future research directions in data engineering
    * [00:46:02] Lightning round: Acceleration, unsolved questions, key message

    Transcript

    Sudip [00:00:01]: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today I have the great pleasure of hosting Joe Hellerstein, professor at UC Berkeley and founder of Trifacta and Aqueduct. Joe, welcome to the Engineers of Scale podcast.

    Joe [00:00:36]: Thanks, it's fun to be here.

    Sudip [00:00:37]: Thank you so much. So I'm not going to walk our listeners through an overview of your background because you truly need no introduction. When it comes to innovation in the field of data engineering, there are really very few people who even come close to what you have done. I will, however, mention a fun fact that I learned recently, even though I think I've known you for many years, and that is your interest in music. Not only are you a musician on the side, you actually even minored in music during your PhD at Wisconsin. So how do you balance all of your research and startup work with your interest in music?

    Joe [00:01:12]: Well, I'm a believer that you should have a rich life, and that people who spend 24-7 on computing are maybe spending a little too much time on it. So I enjoy computing and data engineering and all that good stuff, and I love to geek out about those things, but it's one of a bunch of things I value in life - family, hobbies, and so on. And I'm sure a lot of your listeners are the same. And anyone who tells you that you have to do something 24-7 to excel, I think, is telling you a lie.

    Sudip [00:01:38]: So then what got you interested in data engineering as an area of research in the first place?

    Joe [00:01:44]: Yeah, well, my background, going back to my training after college, was in database systems. And my first job right out of college was at IBM Research, which was the founding lab out in San Jose that built the first relational databases. And a bunch of those people were still there. So I was really brought up by some of the founders of the field of database systems. After that, I went to Berkeley and then Wisconsin for my schooling, which was more of the community of the folks who really pioneered the database system space. So I'm an old hand - even though I'm not as old as most of those people by about a generation, I still feel like I'm an old database hand. I come from that lineage. And what's happened over my career since I got my PhD in the mid-90s is that the process of managing data, and the computation that goes around it, has become more and more central to all of computing and the way it projects on the real world. So your listeners know better than any, probably, that really, we shouldn't talk about computer science. We should talk about data science, data engineering, because without data, computing is kind of meaningless. And this is a truth that emerged in the last 20 years, really. But it was one that the database people were working on well before that. And I feel kind of blessed to have been born into that community, because the relevance of data engineering to all things computation, and therefore much of society, is so apparent today.

    Sudip [00:03:01]: I would say that both the schools that you have been involved with - Wisconsin, and of course, UC Berkeley now - have had tremendous work coming out in databases, data systems, and data engineering. It just has such a big history. It's an awesome tradition.

    Joe [00:03:19]: I mean, when I was coming up, there were three places to do real work in data systems. It was IBM, Berkeley, and Wisconsin. And I had the fortune - and to some degree, I took measures - to interact with all those people when I was very young, straight out of college. And they were really the center of all activity, because a lot of academic computer science at that time didn't get it. MIT, when I interviewed in the mid-90s, hadn't had a person on the faculty doing databases for over a decade. And it was very clear when I got there that they did not think it was an intellectual activity. They thought it was something that businesses do. And that's all changed radically in the course of my academic career. Now data-driven computing is all of computing, really.

    Sudip [00:03:59]: Taking a step back, if you were to describe to someone what data engineering is - and I know you teach a very popular course on that at Berkeley, too - how do you describe it?

    Joe [00:04:12]: Yeah, it's been tricky, actually, because I think we're in a time of transition. And so you have to talk about things relative to where things are right now. So the way I talk about it with people is this: they understood that there was a shift from traditional computer science to what was being called data science over the last, say, decade, where clearly data had to be at the center of things, or at least some things. But what happened in the data science programs that evolved is that they were largely developed as sort of statistics meets algorithms. And that left out all the engineering concerns of how you build systems around these foundations, particularly systems that drive large amounts of data - because the statisticians traditionally didn't use large amounts of data. It's incredible what's achievable with statistics on very small amounts of data, of course. And so that's what I talk about: how do we systematize and build engineered artifacts that reflect the importance of data and computation together?

    Sudip [00:05:02]: And looking back over the last 20 years or so, since you started working at Berkeley and obviously started your two companies, are there certain significant trends or changes that have really fueled data engineering?

    Joe [00:05:16]: Yeah, I mean, you know, the scope is long enough that you have to include the existence of the World Wide Web as one of them. So you go back to the 90s, and data was all about relational databases, because that was the data that was entered into computers. And the web changed all that. Now there's all sorts of nonsense that you could harvest. And I remember joking in the early aughts, maybe late 1990s - my buddy was saying, my gas station just got on the internet. Goodness knows why I would ever want to run a query against my gas station, right? But nowadays we realize that all that sort of light recording of ambient stuff and people's thoughts and ideas and conversations is highly valuable. That was not at all clear when the web started out. You know, web search was like: well, I might want to find some stuff. Most of it's irrelevant to me, but I want to find a few things, right? That was web search. But if LLMs are a compression of the web, what we're seeing today is that having a compressed version of everything anybody's ever said is outrageously powerful, even if the technology is pretty simple. So the rise of kind of ambient human information - something I did not anticipate whatsoever.

    Sudip [00:06:21]: Got it. Today, as we know data engineering, what would you say are the key components of data engineering, and what's the role of each?

    Joe [00:06:30]: You know, we often talk about pipelines, right? And I think it's not a bad way to think - to kind of start down the pipeline and look at what feeds what. Where does the data get generated? How does it get ingested? What data
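
    Joe's pipeline framing - where data gets generated, how it gets ingested, and what happens downstream - can be made concrete with a toy sketch. This is purely illustrative and not from the episode; the CSV source, field names, and stages are all made up:

    ```python
    import csv
    import io

    # Toy end-to-end pipeline: ingest -> clean/transform -> load.
    RAW = "user,amount\nu1,10\nu2,\nu1,5\n"

    def ingest(text):
        """Ingest: parse raw records from a source (here, an in-memory CSV)."""
        return list(csv.DictReader(io.StringIO(text)))

    def clean(rows):
        """Transform: drop malformed rows and cast types."""
        for row in rows:
            if row["amount"]:              # skip rows with missing values
                yield row["user"], float(row["amount"])

    def load(pairs):
        """Load: aggregate into a serving-ready table (per-user totals)."""
        totals = {}
        for user, amount in pairs:
            totals[user] = totals.get(user, 0.0) + amount
        return totals

    print(load(clean(ingest(RAW))))        # {'u1': 15.0}
    ```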

    49 Min.
  2. October 25, 2024

    Flink: The Unified Stream and Batch Processing Engine - with Stephan Ewen

    In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating look back on Hadoop; Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, for a technical deep-dive into Spark; and Ryan Blue, creator of Apache Iceberg, on the technical breakthroughs that made Iceberg possible. In this episode, we host Stephan Ewen, creator of Apache Flink, the most popular open-source framework for unified batch and streaming data processing. Stephan shares with us the history of Flink, the technical breakthroughs that made it possible to build a unified engine to process both streaming and batch data, and his take on the emerging streaming market and use cases.

    Show Notes

    Timestamps

    * [00:00] Introduction and background on Flink
    * [01:44] Relationship between the Stratosphere research project and Flink
    * [10:28] How Flink evolved to focus on stream processing over time
    * [13:08] Technical innovations in Flink that allowed it to support both high throughput and low latency
    * [18:56] How Flink handles missing data and machine failures
    * [21:47] Interesting use cases of Flink in production
    * [26:02] Factors that led to widespread adoption of Flink in the community
    * [29:18] Comparison of Flink with other stream processing engines like Storm and Samza
    * [37:07] Current state of stream processing and where the technology is headed
    * [39:47] Lightning round questions

    Transcript

    Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products.

    Sudip: Today, I have the great pleasure of having Stephan Ewen, the creator of Flink, on our podcast. Hey Stephan, welcome and thanks for your time.

    Stephan: Hello. Hey, thanks for having me. Nice to be here.

    Sudip: All right! So you created Flink, which is one of the defining stream processing engines in Big Data. As I understand it, if I go back a few years, it all actually started when you were still doing a PhD in computer science at the famous Technical University of Berlin and working on a project called Stratosphere with Professor Markl. Can you talk a little bit about what the relationship was between Stratosphere and Flink, if any?

    Stephan: When we started in 2009 with the work on Stratosphere, Hadoop was all the rage and the industry was starting to pick it up. But also in academia, everybody was looking at it. Hadoop had some cool properties, but it threw away a lot of wisdom from databases. And there were a bunch of projects that were trying to reconcile both worlds - how can we get some of the Hadoop goodness and database goodness in the same project? And Stratosphere was one of those. It was doing basically general-purpose batch Big Data on a mixed Hadoop/database engine. When we were done with that project, we kind of liked the result too much to say, “okay, that's it, it goes into an academic drawer.” We actually made it open source and started working on it. So the open source version of Stratosphere became Apache Flink. We had to pick a new name when we made it open source and donated it to Apache.

    Sudip: What was the story behind the name? I kind of picked up on that reading some of the old interviews. Like, was there a conflict with another trademark?

    Stephan: Yeah. I think in the end, all naming decisions are actually driven either by trademark conflicts or search engine optimization. It was no different for Stratosphere. I think it was a horrible name to get good search ranking for. And also, it was a registered trademark in so many industries already that we felt, okay, there's no chance this will work for us. And I think it was fashionable at that point in time to pick project names based on words in other languages, because they were usually not trademarked in the US. And we did the same thing. Flink just means quick and nimble in German. And because, you know, that's kind of where it started from, we picked that as a name. It was a good choice in hindsight, because it did lend itself to a cute squirrel as the mascot - and we really liked that. If you want to go back to the question of Stratosphere to Flink: Stratosphere was the starting point; it was an experiment with mixed database engines. The first year of Apache Flink was also pretty much that. We were doing batch processing, graph processing, ML algorithms on the same engine. And then we explored stream processing as a possible use case, because we knew there were some interesting properties in the engine. But stream processing wasn't a very big thing back then. When we started exploring that and published the first APIs, we saw that folks really, really loved it. And that's when we shifted the priority of the project to stream processing. So 2015 is when Flink went all in on stream processing.

    Sudip: So what Flink is today actually has very little to do with what Stratosphere was, as happens with so many projects. Stratosphere was like the research phase, the first prototype, and then you found your niche with Flink and, over time, almost rewrote the project. I want to go back a little to the motivations that started the research project. Like you said, Hadoop threw away some of the wisdom from the database world. Were there particular problems that you guys had in mind that you wanted to solve with this new system that Hadoop could not?

    Stephan: Databases have these two components - they have a declarative language and a query optimizer component around that. And they kind of decouple how things get executed from how you specify the actual analytics programs. That is for a good reason, because the number of programmers that can write very good low-level code is actually very, very small. And the number of data analysts that can do that is even smaller. So this decoupling is there for a reason. Hadoop just threw that away and made everybody write MapReduce jobs. And of course, that's why one of the first things that came along was building SQL engines on top of Hadoop again. But then those threw away some of the generality of the MapReduce model. The nice thing about MapReduce is that you could implement a lot of things that you couldn't as easily express in SQL. So the question was: was there a nice place in between? Can we use DSLs or embeddings or compiler extensions in programming languages to capture some of the program semantics, optimize them in a database query optimization style, and then map them to a suitable execution engine - and get some of the database query optimization and code generation magic, not just for SQL, but for a more general programming paradigm? That was one of the starting points. And then, of course, the Hadoop execution engine is sort of simplistic, right? Just map and reduce steps. Sure, it's a powerful primitive, but it's also not the most efficient primitive for everything. If you look at the systems that dominate the space today, they all have way, way more general and powerful engines that support a lot of the operations which came from distributed databases. So we worked on what distributed execution primitives you need, and what's the minimum orthogonal set to build a very flexible engine that can then run these programs that are going through a translation and query optimization layer.

    Sudip: If you were to describe Flink to an engineer who doesn't really know what it is, what description would you use?

    Stephan: Yeah, so we called it on the website - and I think the website still does that today - stateful computation over data streams. The idea of Flink is that data streams are very, very fundamental primitives. It's not just real-time streams; it's actually the shape in which most data gets produced. Unless you're running a particle accelerator experiment that literally dumps a petabyte on you in a few milliseconds, most of the data in real life gets created as a sequence of interactions from users with services and sensors. So almost all data actually gets created as a stream. And it's a very natural way to process it as streams. You do preserve sort of causal relationships. You take into account that there's a time dimension through which data moves, and so on. So Flink is really a system that's built to work with this idea of a data stream. And it can be a real-time data stream. That's where it is probably most famous for being useful and where its strongest capabilities lie - analyzing real-time data streams. But it doesn't stop there. It can also process archived data streams. If you think of a file as an archived data stream, like a snapshot of a database table as a stream of records, Flink works equally well with those. It even works when you mix those - like you have a stream that keeps continuously moving versus a stream that is more slow-moving, or even a static snapshot of data. And then how do you combine those? So, Flink handles all sorts of stream processing: real-time stream processing, and if you wish, offline time-lagged stream processing, under a unified programming
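
    Stephan's “a file is just an archived stream” framing shows up directly in Flink's API, where bounded and unbounded inputs flow through the same operators. Here is a minimal, hedged PyFlink sketch (assuming a local `apache-flink` install; the sample data is made up, and in a real job the bounded collection below could be replaced by an unbounded source such as Kafka without changing the operators):

    ```python
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # A bounded collection stands in for an "archived" stream here.
    events = env.from_collection(["a b", "b c", "a a"])

    (events
        .flat_map(lambda line: line.split())          # tokenize
        .map(lambda word: (word, 1))                  # one record per word
        .key_by(lambda pair: pair[0])                 # partition by word
        .reduce(lambda a, b: (a[0], a[1] + b[1]))     # running count per key
        .print())                                     # emit incremental results

    env.execute("word_count_sketch")
    ```

    On a bounded input this prints final counts and finishes; on an unbounded source the same pipeline would keep emitting updated counts forever - which is the unified batch/stream idea in miniature.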

    50 Min.
  3. August 1, 2024

    Iceberg: The Open Table Format for Petabyte Scale Analytics - with Ryan Blue

    In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating “look back” on Hadoop, and Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, for a technical deep-dive into Spark. In this episode, we host Ryan Blue, creator of Apache Iceberg, the most popular open table format that is driving much of the adoption of Data Lake Houses today. Ryan shares with us what led him to create Iceberg, the technical breakthroughs that have made it possible to handle petabytes of data safely and securely, and the critical role he sees Iceberg playing as more and more enterprises adopt the modern data stack.

    Show Notes

    Timestamps

    * [00:00:01] Introduction and background on Apache Iceberg
    * [00:02:50] The origin story of Apache Iceberg
    * [00:10:38] Transactional consistency in Iceberg
    * [00:12:00] Where Iceberg sits in the modern data stack
    * [00:14:38] Top features that drive Iceberg’s adoption
    * [00:20:00] The technical underpinnings of Iceberg
    * [00:21:33] How Iceberg makes "time travel" for data possible
    * [00:24:08] Storage system independence in Iceberg
    * [00:30:13] Query performance improvements with Iceberg
    * [00:35:08] Alternatives to Iceberg and pros/cons
    * [00:40:45] Future roadmap and planned features for Apache Iceberg

    Transcript

    Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. Today, we have another episode in our data engineering series. We are going to dive deep into Apache Iceberg, an amazing project that is redefining the modern data stack. For those of you who don't already know Iceberg, it is an open table format designed for huge, petabyte-scale tables. It provides database-type functionality on top of object stores such as Amazon S3. The function of a table format is to determine how you manage, organize, and track all of the files that make up a table. You can think of it as an abstraction layer between your physical data files, written in Parquet or other formats, and how they are structured to form a table. The goal of the Iceberg project is really to allow organizations to finally build true data lake houses in an open architecture, avoiding any vendor and technology lock-in. And just to give you a little bit of history: Iceberg development started in 2017. In November 2018, the project was open-sourced and donated to the Apache Foundation. And in May 2020, the Iceberg project graduated to become a top-level Apache project. Today, I have the pleasure of welcoming Ryan Blue, the creator of the Iceberg project, to our podcast. Welcome, Ryan. It is so great to have you on, and thanks for making the time.

    Ryan: Thanks for having me on. I always enjoy doing these. It's pretty fun to talk about this stuff.

    Sudip: Awesome. Maybe I'll start with having you tell us the origin story of Iceberg. Why did you create it in the first place? I imagine it probably goes back all the way to your days at Cloudera. Anything you can help us with, connecting the dots between where you were at Cloudera, then Netflix, and now building Iceberg?

    Ryan: Yeah, you are absolutely correct. It stemmed from my Cloudera days, and without Netflix, it wouldn't have happened. So basically, what was going on at Cloudera was you had the two main products in the query space: Hive and Impala. And they did a just terrible job interacting with one another. I mean, even three years into the life of Iceberg as a project, you had to run this command to invalidate Hive state in Impala, to pull it over from the metastore. And even then, you just had really tough situations where you want to go update a certain part of a table, and you have to basically work around the limitations of the Hive table format. So if you're overwriting something or modifying data in place, which is a fairly common operation for data engineers, you would have to read the data, union it, and sort of deduplicate or do your merging logic with the new data, and then write it back out as an overwrite. And that's super dangerous, because you're reading all the data essentially into your Hadoop cluster at the time, and then you're deleting all that data from the source of truth. You're saying, okay, this partition is now empty; it doesn't exist anymore. And then you're adding back the data in its place. And if something goes wrong, can you recover? Who knows? And so it was just a mess, and people knew it was a mess all over the place. I talked to engineers at Databricks at the time and was like, hey, we should really collaborate on this, fix the Hive table format, and come out with something better. But it was just not to be at Cloudera, because they were a distributor. They had so many different products, and it was very hard to get the will behind something like this - which is where Netflix certainly helped.

    Sudip: What was the scale of data at Netflix when you got there? Just rough estimates.

    Ryan: I think at around that time, we were saying we had about 50 to 100 petabytes in our data warehouse. It's hard to tell how accurate that is at any given time because of data duplication. Exactly what I was talking about a second ago means you're rewriting a lot. And because write amplification is at the partition level in Hive tables, you would just do so much extra writing and keep that data around for a few days, and it's very hard to actually know what was live or active.

    Sudip: So you landed at Netflix. They have this amazing scale of data. You came from Cloudera with some very specific frustrations, and I would imagine some very specific thoughts on how you were going to solve them at Netflix scale. Maybe walk us a little bit through what that first year or two was like at Netflix. How did you go about creating what is now Iceberg?

    Ryan: So we didn't actually work on it for a year or two. First, we moved over to Spark, and that was fun. So Netflix had a pre-existing solution to sort of make tables a little bit better. What we had was called the batch pattern. And this was basically: we hacked our Hive metastore so that you could swap partitions. So instead of deleting all these data files and then overwriting new data files in the same place, we just used different directories with what we called a batch ID. And then we would swap from the old partition to the new partition. And that allowed us to make atomic changes to our tables at a partition level. And that was really, really useful and got us a long way. In fact, we still see this in practice in other organizations, where Netflix data engineers have gone off and gotten involved in infrastructure at other companies. And it worked well, as long as the only changes you needed to make were partition swaps. So again, the granularity was just off. But this affected Iceberg primarily because we really felt the usability challenges. We had to retrain basically all of our data engineers to think in terms of the underlying structure of partitions in a table and what they could do by swapping. So rather than having them think in a more modern row-oriented fashion - saying, okay, take all my new rows, remove anything where the row already exists, and then write it into the table - they had to do all of that manually and then think about swapping, because you can't just insert or append. And so the two things were: the usability was terrible, and it really got us thinking about how we not only solve our challenge here and make the infrastructure better, but also make a huge difference in the lives of the customers and everyone on our platform, so that they don't have to think about these lower levels. Usability has always been important to me in my career, so I had already said we're just going to fix schema evolution; but really thinking about the operations that people needed to be able to run on these tables, I think, influenced the design. It also showed me - because I'd ported our internal table format to Spark twice by the time we started the Iceberg project - what I needed to change in Spark.

    Sudip: I can imagine. How would you describe the way you guys were storing data at Netflix? Is that closest to what we all now call a Data Lake? Is that roughly what it was?

    Ryan: The terms are very nebulous here, but I think of a Data Lake as basically the post-Hadoop world where everyone basically moved to object stores. And I know that Hortonworks was calling their system a Data Lake at the time, but I think we've really coalesced around this picture of a Data Lake as a really Hadoop-style infrastructure with an object store as your data store. And then, of course, now we have Lake Houses, which again play a pretty meaningful role in Iceberg's adoption too.

    Sudip: Any particular thought on what's driving this movement towards Lake Houses?

    Ryan: Well, I'm going to have to answer a question with a question, which is: what do you mean by Lake House? Because there's a broad spectrum here
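
    The “batch pattern” Ryan describes - publish new files somewhere else, then atomically swap a pointer - is also the core trick behind table formats like Iceberg, which commit by swapping a single table-metadata pointer. Here is a conceptual Python sketch of that idea (not Iceberg's actual code; the file layout and names are invented for illustration):

    ```python
    import json
    import os
    import uuid

    def write_snapshot(table_dir, data_files):
        """Write an immutable snapshot file listing the table's data files."""
        snap_path = os.path.join(table_dir, f"snap-{uuid.uuid4().hex}.json")
        with open(snap_path, "w") as f:
            json.dump({"files": data_files}, f)
        return snap_path

    def commit(table_dir, snap_path):
        """Point 'current' at the new snapshot with one atomic rename.

        Readers see either the old file list or the new one - never a
        half-deleted partition, which was the danger with Hive-style
        read-union-overwrite. Old snapshots stay around, which is also
        what makes "time travel" to a prior table state possible.
        """
        tmp = os.path.join(table_dir, "current.tmp")
        with open(tmp, "w") as f:
            f.write(snap_path)
        os.replace(tmp, os.path.join(table_dir, "current"))  # atomic on POSIX

    # Usage: stage new data files, then commit the swap.
    os.makedirs("/tmp/demo_table", exist_ok=True)
    snap = write_snapshot("/tmp/demo_table", ["part-0.parquet", "part-1.parquet"])
    commit("/tmp/demo_table", snap)
    ```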

    57 Min.
  4. December 13, 2023

    From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

    In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Infrastructure Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episode, we hosted Doug Cutting and Mike Cafarella for a fascinating discussion on Hadoop. In this episode, we are incredibly fortunate to have Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, share with us the fascinating origin story of Spark, why Spark gained unprecedented adoption in a very short time, the technical innovations that made Spark truly special, and the core conviction that has made Databricks the most successful Data+AI company.

    Timestamps

    * Introduction [00:00:00]
    * Origin story of Spark [00:03:12]
    * How Spark benefited from Hadoop [00:07:09]
    * How Spark leveraged RAM to monopolize large-scale data processing [00:09:27]
    * RDDs demystified [00:11:43]
    * Three reasons behind Spark’s amazing adoption [00:21:47]
    * Technical breakthroughs that sped up Spark 100x [00:27:05]
    * Streaming in Spark [00:31:13]
    * Balancing open source ethos with commercialization plans [00:37:45]
    * The core conviction behind Databricks [00:40:40]
    * Future of Spark in the Generative AI era [00:44:39]
    * Lightning round [00:49:39]

    Transcript

    Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host and partner at Decibel.VC, where we back technical founders building technical products. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today, I have the great pleasure of welcoming Reynold Xin, co-founder of Databricks and co-creator of Spark, to our podcast. Hey, Reynold, welcome.

    Reynold [00:00:37]: Hey, Sudip.

    Sudip [00:00:38]: Thank you so much for being on our podcast. Really appreciate it!

    Reynold [00:00:41]: Pleasure to be here.

    Sudip [00:00:42]: All right, we are going to talk a lot about Spark, the project that you created and that is behind the company that is now Databricks. You went to the University of Toronto, is that right?

    Reynold [00:00:54]: I did go to the University of Toronto, spent about five years in Canada, and then came to UC Berkeley for my PhD. So I've been in the Bay Area for almost 15 years by now.

    Sudip [00:01:05]: What brought you to Berkeley specifically?

    Reynold [00:01:07]: It's an interesting point. So when I was considering where to pursue my PhD studies, I looked at all the usual suspects, the top schools, and one of the things that really attracted me to Berkeley - actually, it was two things. One is there's a very strong collaborative culture, in particular across disciplines, because in many PhD programs, the way it works in academic research is you have a PI, a principal investigator, a professor who leads a bunch of students, and they collaborate within that group. But one thing that's really unique about Berkeley is that they brought together all these different people from very different disciplines - machine learning, computer systems, databases - and had them all sit in one big open space and collaborate. So it led to a lot of research that was previously much more difficult to do, because you really needed that cross-discipline collaboration. The second part was that Berkeley always had this DNA of building real-world systems. A lot of academic research kind of stops at publishing, but Berkeley has had this tradition - going back to BSD Unix, Postgres, RAID, RISC, and all of that - of building actual systems that have a real-world industry impact. And that's really what attracted me.

    Sudip [00:02:20]: And obviously, that's what you guys did with Spark, too.

    Reynold [00:02:23]: Yeah, we tried to continue that tradition. So we didn't stop at just the papers.

    Sudip [00:02:27]: Yes, absolutely. And I think that's a very common criticism of a lot of academic work, right? Great in quality, but it doesn't go that last mile to get to production or to actual users.

    Reynold [00:02:40]: It's not necessarily a wrong thing either, just different approaches. You could argue, hey, let academia figure out the innovative ideas and validate them, and have industry productize them. It's not necessarily the strength of academia to productize systems. To some extent, it's just different ways of doing things.

    Sudip [00:02:57]: Let me go back to when you guys started Spark, and this is circa 2009. I'm guessing you had just joined the PhD program, and this was still the AMP Lab - the Algorithms, Machines, and People Lab - right?

    Reynold [00:03:11]: Yeah.

    Sudip [00:03:12]: Which, of course, is now the Sky Lab and was the RISE Lab in between. Can you give us a little bit of an idea of what the motivations were to start the research behind Spark? I mean, this was the time when Hadoop was still king, right? Like, why do Spark?

    Reynold [00:03:27]: It was an interesting story. So Spark actually started technically the year before I showed up. By the time I showed up, there was this very early, early thing already. So Netflix back then in the 2000s, I think even a little bit before 2009, had this Netflix Prize, which is the competition they created in which they anonymized their movie rating datasets so anybody could participate in the competition to come up with better recommendation models for movies. And whoever could improve the baseline the most would get a million dollars.

    Sudip [00:03:58]: Yeah, I remember that.

    Reynold [00:04:01]: That was a big deal. Eventually, it was shut down for privacy reasons - I think maybe there were lawsuits that happened - but it was a big deal in computer science and in the history of machine learning. And this particular PhD student, Lester, was really into these kinds of competitions, and also, a million dollars was a lot of money.

    Sudip [00:04:17]: Sure. For a grad student in particular, right?

    Reynold [00:04:20]: A grad student makes about $2,000 a month. So he tried to compete, and one thing that he noticed was that this dataset was much larger than the toy datasets he used to work with for academic research, and it wouldn't fit on his laptop anymore. So he needed something that could scale out, to be able to process all this data and implement machine learning algorithms. And one of the keys with machine learning is that you are not done when you come up with the first model. It is a continuously iterative process to improve it over time. The velocity of iteration is very important. And he tried Hadoop first, because that was the only game in town - if you wanted to do distributed data processing back in 2009, you used Hadoop. And he realized it was horribly inefficient to run: every single run took minutes. And it was also horribly inefficient to write, so the productivity of iterating on the program itself was very low, because the API was very complicated and very clunky. So he kind of walked down the aisle - the nice thing about having a giant open space with people from very different disciplines - and talked to Matei, who was also a PhD student back then, and one of my fellow co-founders at Databricks. He said, hey, I have this challenge, and I think if I had the right kind of primitives, I could really do my competition work much faster. So he and Matei basically worked together over the weekend and came up with the very first version of Spark, which was only 600 lines of code. It was an extremely simple system that aimed to do two very simple things. One was a very, very simple, elegant API that exposed distributed datasets as if they were a single local data collection. And second, it could put, or cache, that dataset in memory, so you could repeatedly run computation on it - which is very important for machine learning, because a lot of machine learning algorithms are iterative. So with those two primitives, they were able to make progress much faster on the Netflix Prize. And I think Lester's team even tied for first place in terms of accuracy.

    Sudip [00:06:24]: Did he get the money?

    Reynold [00:06:24]: He did not get the money, because their team was 20 minutes late with the submission. So they lost a million dollars for a 20-minute difference. If Matei had worked a little bit harder and had Spark ready maybe 20 minutes earlier, Lester might have been a million dollars richer.

    Sudip [00:06:44]: That's such an amazing story. Wow, I actually did not know that!

    Reynold [00:06:48]: So when Spark started from research, it kind of started for a competition, and really just the collaborative open space, and the fact that all of those people just happened to be there at the right time, led to its very, very first version. Now, obviously, Spark today looks very, very different from what the original 600 lines of code were. But that's how it got started.

    Sudip [00:07:09]: One question I have, since you touched on Hadoop: do you feel that Spark benefited from Hadoop already being there? Did you guys use some components of the Hadoop ecosystem? For example, if Hadoop hadn't existed, do you think Spark could have still been created?

    Reynold [00:07:23]: Spark definitely benefited enormously from Hadoop early on. There's also baggage that we carry from Hadoop that is still there today. But it benefited massively. The first example was that Hadoop solved the storage problem for Spark. And as a result, Spark never had to deal with
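
    The two primitives Reynold describes - a collection-like API over distributed data, and in-memory caching for iterative jobs - map directly onto Spark's RDD API. Here is a minimal PySpark sketch (assuming a local `pyspark` install; the data and the "training" loop are made-up stand-ins for an iterative ML workload):

    ```python
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_caching_sketch")

    # An RDD reads like a single local collection, even though it may be
    # partitioned across a whole cluster.
    ratings = sc.parallelize([("u1", 4.0), ("u2", 3.0), ("u1", 5.0)] * 1000)

    # Cache the working set in memory so each pass skips re-reading and
    # recomputing the input - the property that made iterative ML fast.
    scores = ratings.map(lambda kv: kv[1]).cache()

    model = 0.0
    for step in range(5):                 # stand-in for an iterative ML loop
        grad = scores.map(lambda y: y - model).mean()
        model += 0.5 * grad               # toy update rule
    print("fitted mean:", model)

    sc.stop()
    ```

    Without the `.cache()`, each of the five iterations would recompute the `map` over the source data; with it, iterations after the first read from memory - the difference Lester hit when Hadoop made every iteration a fresh multi-minute job.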

    52 Min.
  5. November 29, 2023

    When Hadoop was King and Yahoo was Cool - with Doug Cutting and Mike Cafarella

    In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in infrastructure software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to educate the next generation of engineers who were not there when those transformational projects were created. In our first “season,” we start with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. And what better way to kick off the Data Engineering season than with an episode on Hadoop, a project that is synonymous with Big Data. We were incredibly fortunate to host the creators of Hadoop, Doug Cutting and Mike Cafarella, to share with us the untold history of Hadoop, how multiple technical breakthroughs and a little luck came together for them to create the project, and how Hadoop created a vibrant open source ecosystem that led to the next generation of technologies such as Spark.

    Timestamps

    * Introduction [00:00:00]
    * Origin story of Hadoop [00:03:26]
    * How Google’s work influenced Hadoop [00:05:47]
    * Yahoo’s contribution to Hadoop [00:13:51]
    * Major milestones for Hadoop [00:20:06]
    * Core components of Hadoop - the why’s and how’s [00:22:44]
    * Rise of Spark and how the Hadoop ecosystem reacted to it [00:27:19]
    * Hadoop vendors and the tension between Cloudera and Hortonworks [00:31:51]
    * Proudest moments for the Hadoop creators [00:33:56]
    * Lightning round [00:36:04]

    Transcript

    Sudip: Welcome to the inaugural episode of the Engineers of Scale podcast. In our first season, we'll cover the projects that have transformed and shaped the data engineering industry. And what's better than starting with Hadoop, the project that is synonymous with Big Data? Today, I have the great pleasure of hosting Doug Cutting and Mike Cafarella, the creators of Hadoop. And just for the record, Hadoop is an open source software framework for storing enormous amounts of data and for distributed processing of very large datasets - think hundreds or thousands of petabytes of data on, again, hundreds or thousands of commodity hardware nodes. If you have anything to do with data, you certainly know of Hadoop and have either used it or definitely benefited from it one way or another. In fact, I remember back in 2008, I was working on my second startup, and we were actually processing massive amounts of data from retailers, coming from their point-of-sale systems and inventory. And as we looked around, Hadoop was the only choice we really had. So today I'm incredibly excited to have the two creators of Hadoop, Mike Cafarella and Doug Cutting, with us. Mike and Doug, welcome to the podcast. It is great having you both.

    Doug [00:01:02]: It's great to see you. Thank you. Thanks for having us.

    Sudip [00:01:10]: If you guys don't mind, I think for our listeners, it'll be great to know what you guys are up to these days. Mike, maybe I'll start with you and then Doug.

    Mike [00:01:19]: Sure. I'm a research scientist in the Data Systems group at MIT.

    Doug [00:01:27]: I'm a retired guy. I stopped working 18 months ago. My wife ran for public office, and it was a good time for me to transition into being a homemaker, doing the shopping and cooking. But I also have a healthy hobby of mountain biking and doing trail advocacy and development, trying to build more trail systems around the area that I live in.

    Sudip [00:01:44]: Sounds like you're having real fun, Doug. One day we all aspire to get there, for sure. I'm really curious to know how you guys met. I've seen some interviews of you guys where you talked about how, I think, Doug, you were working on Lucene at that time and then connected with Mike somehow through a common friend. I'd love to know a little more detail on how you met and how you started working together.

    Doug [00:02:06]: It kind of goes back to Hadoop, really. Hadoop was preceded by this project, Nutch. Nutch was initiated when a company called Overture, which we'll probably hear more about, called me up out of the blue as a guy who had experience in both web search engines and open source software and said, hey, how would you like to write an open source web search engine? And I said, that'd be cool. And they said they had money to pay me at least part time, and maybe a couple other people. And did I know anyone? I didn't know anybody offhand, but I had friends. I called up my freshman roommate, a guy named Sami Shaio, who is a founder of Marimba. And I said, Sami, do you know anybody? And he said, you should talk to Mike Cafarella. I think it was the only name that I got. And I called Mike, and he said, yeah, sure, let's do this.

    Mike [00:02:49]: So at the time - this would be in like late summer, early fall of '02 - I had worked in startups and in industry for a few years, but I was looking to go back to school. So I was putting together applications for grad school. And I was working with an old professor of mine to kind of spruce up my application a little bit, because I had been out of research and so on for a while. And that was a fun project, but it wasn't consuming all my time. And so Sami, who was one of the founders of Marimba, which was my first job out of college, got in touch and said that his buddy Doug had an interesting project and I should make sure I go talk to him, which was great. I was looking for something to do, and it came at just the right moment.

    Sudip [00:03:26]: That was quite a connection, Mike. And then going back to that timeframe, 2002-2003: Doug, you started touching on how you began working on Nutch, which eventually became Hadoop. Would you mind maybe walking us through the origin story of Hadoop? I mean, I know Overture funded you for writing the web crawler, but what was their interest in an open source web crawler in the first place?

    Doug [00:03:49]: I think that's a good question to get back to some of the business context. We want to mostly focus on tech here, but the business context matters, as is often the case. So I had worked on web search from '96 to '98 at a company called Excite. I'd been pretty much the sole engineer doing the backend search and indexing system. And then I transitioned away from that and wrote this search engine on the side called Lucene, which I ended up open sourcing in 2000. Also, in '98, Google launched, and initially they were ad-free. All the other search engines - there were a handful of them - were totally encrusted and covered with display ads. Just think magazine ads: random ads that they managed to sell the space for to advertisers. Google started with no ads, and they also really focused and spent a lot of effort trying to work on search quality. All they were doing was search. Everybody else was trying all kinds of things to get more ads in front of people, and Google just focused on making search better. And by 2000, they'd succeeded, and with the combination of this really clean, simple interface and better quality search results, they had taken most of the search market share already. But they needed a revenue plan. This company called Overture had, in the meantime, invented a way to make a lot of money from web search by auctioning off keywords to advertisers and matching them to the query. Google copied that and started minting money themselves. Overture was nervous, because they had this market and they were licensing it to Yahoo and Microsoft and others, but they were worried that all of their customers were going to get beaten by Google and go out of business. So on one hand, they sued Google - that's an interesting side story. But on the other hand, they decided: we should build our own search engine to compete with Google; we somehow need to do this. They bought AltaVista. They tried to build something internally. And they also thought, you know, open source is this big trend - let's do an open source one to have something to compete with. So they called me, and I called Mike, and we worked with a small team of guys there at Overture, led by a guy named Dan Fain, and we started working on trying to build web search as open source.

    Sudip [00:05:47]: That is such phenomenal historical context. I don't think very many people, including myself, had that. And then, interestingly, Google also came out with their GFS paper in 2003 and their MapReduce paper in 2004, which obviously influenced a lot of the work that you guys did down the line. I'm curious, what do you think might have caused Google to publish those papers in the first place? Any hypothesis on that?

    Mike [00:06:14]: I think you're putting your finger on something interesting and important, which was that, at the time, it wasn't common practice to have a research paper that told you a lot of technical details about an important piece of infrastructure. I don't think it was part of some genius, long-term plan to profit down the road. It was part of a general culture at the place to emphasize the virtues of publishing and openness and science. Maybe it helped them with hiring or something like that, but if so, that was kind of an indirect benefit. And it was really trend-setting. I mean, they ended up publishing a ton of papers. I think Microsoft and Yahoo and other companies followed suit. There's a whole string of really interesting papers throughout the 2000s and early 2010s - systems that we might never have learned about had they remained totally closed. But it's interesting to think about the impact of the GFS paper, I think, on our experience, Doug, which was: we had worked on Nutch for, I guess, about a year. And after about a year's time, I recall that it was indexing on the order of tens of millions of pages, but you couldn't get more than a month's worth of fresh
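
    The MapReduce paper Sudip mentions boils down to two user-supplied functions plus a framework-managed shuffle between them. Here is a hedged, self-contained Python sketch of that model, simulating the shuffle locally (with Hadoop Streaming, the map and reduce functions would instead be separate scripts reading and writing stdin across many machines):

    ```python
    from collections import defaultdict

    def map_fn(line):
        """Map: emit (key, value) pairs - here, (word, 1) for word count."""
        for word in line.split():
            yield word, 1

    def reduce_fn(key, values):
        """Reduce: combine all values the shuffle grouped under one key."""
        return key, sum(values)

    def run_mapreduce(lines):
        # The framework's shuffle phase: group every mapped value by key.
        groups = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                groups[key].append(value)
        # Each group reduces independently - this independence is what let
        # Hadoop spread the work across thousands of commodity nodes.
        return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

    print(run_mapreduce(["the quick fox", "the lazy dog"]))
    # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
    ```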

    45 Min.
