Software at Scale

Utsav Shah

Software at Scale is where we discuss the technical stories behind large software applications. www.softwareatscale.dev

  1. AUG 5

    Software at Scale 60 - Data Platforms with Aravind Suresh

    Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI. Apple Podcasts | Spotify | Google Podcasts

    Edited Transcript

    Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

    When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three-to-four-year period. Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

    That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

    This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide. We can also identify trends like the fastest-growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames. Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips. These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

    How does Uber manage real-time versus batch data processing, and what are the trade-offs?

    We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications. For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data. On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput – processing large amounts of data over time. The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours. The choice between batch and real-time depends on the specific use case. We always ask ourselves: does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

    What challenges come with maintaining such large-scale data systems, especially as they mature?

    As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity. For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds o
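The concurrent-trips analysis mentioned in the transcript can be sketched as a tiny batch job: sort trip start/end events and sweep over them to find the peak overlap. This is a minimal illustration with hypothetical data, not Uber's actual pipeline.

```python
def peak_concurrent_trips(trips):
    """trips: iterable of (start, end) timestamps; returns the max overlap."""
    events = []
    for start, end in trips:
        events.append((start, 1))   # a trip begins
        events.append((end, -1))    # a trip ends
    # Sort by time; process ends before starts at the same instant.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

trips = [(0, 30), (10, 40), (20, 25), (35, 50)]
print(peak_concurrent_trips(trips))  # 3 trips overlap during [20, 25)
```

A real batch system would run the same idea as a distributed job over partitioned trip tables rather than an in-memory list.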

    35 min
  2. 07/05/2023

    Software at Scale 59 - Incident Management with Nora Jones

    Nora is the CEO and co-founder of Jeli, an incident management platform. Apple Podcasts | Spotify | Google Podcasts

    Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli. Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement.

    In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

    We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.

    Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.

    We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
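The chaos-engineering idea discussed in the episode can be illustrated with a toy fault injector: wrap a dependency call so it fails with a configurable probability, forcing callers to rehearse their error handling. This is purely illustrative; real tools (such as Chaos Monkey) inject faults at the infrastructure level, not via decorators.

```python
import random

def chaotic(failure_rate, rng=random):
    """Decorator that makes a callable fail with the given probability."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if rng.random() < failure_rate:
                raise ConnectionError("injected fault")
            return fn(*args, **kwargs)
        return inner
    return wrap

# Hypothetical service call with a 20% injected failure rate.
@chaotic(failure_rate=0.2, rng=random.Random(7))
def fetch_profile(user_id):
    return {"id": user_id}

# Callers must handle injected faults just as they would a real outage.
try:
    profile = fetch_profile(42)
except ConnectionError:
    profile = None
```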

    44 min
  3. 06/13/2023

    Software at Scale 58 - Measuring Developer Productivity with Abi Noda

    Abi Noda is the CEO and co-founder of DX, a developer productivity platform. Apple Podcasts | Spotify | Google Podcasts

    My view on developer experience and productivity measurement aligns extremely closely with DX's view. The productivity of a group of engineers cannot be measured by tools alone - there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency, and inherent domain/codebase complexity, that cannot be measured by tools. At the same time, there are some metrics, like whether an engineer has committed any code changes in their first week/month, that serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide a holistic view of, and insights into, the engineering organization's throughput. In this episode, we discuss the DX platform, and Abi's recently published research paper on developer experience. We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput.

    GPT-4 generated summary

    In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion goes from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field.

    We initiate our conversation with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality. The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer, for example, employs metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making processes. They have used these metrics to identify bottlenecks in their development cycle and to strategically address those pain points. eBay, on the other hand, uses the insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity.

    Next, our dialogue around survey development centers on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics.

    The highlight of the conversation was the introduction of DX's innovative data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their own queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach.

    In this episode, Abi Noda shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience. We delve into how DX's unique approach to data aggregation and its potential applications can lead organizations toward more data-driven and effective decision-making processes.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
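A guardrail metric of the kind mentioned above - whether new engineers commit any code in their first week - can be sketched in a few lines. The data shapes (`hires`, `commits`) are hypothetical; a real platform would join VCS history with HR onboarding data.

```python
from datetime import date, timedelta

def first_week_commit_rate(hires, commits):
    """hires: {author: start_date}; commits: [(author, commit_date)].

    Returns the fraction of new hires with at least one commit in days 0-6.
    """
    by_author = {}
    for author, day in commits:
        by_author.setdefault(author, []).append(day)
    committed = 0
    for author, start in hires.items():
        in_week = [d for d in by_author.get(author, [])
                   if start <= d < start + timedelta(days=7)]
        if in_week:
            committed += 1
    return committed / len(hires) if hires else 0.0

hires = {"ann": date(2023, 6, 5), "bob": date(2023, 6, 5)}
commits = [("ann", date(2023, 6, 7)), ("bob", date(2023, 6, 20))]
print(first_week_commit_rate(hires, commits))  # 0.5
```

As the episode notes, such a number is a guardrail for leadership, not a per-engineer performance score.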

    49 min
  4. 05/16/2023

    Software at Scale 57 - Scalable Frontends with Robert Cooke

    Robert Cooke is the CTO and co-founder of 3Forge, a real-time data visualization platform. Apple Podcasts | Spotify | Google Podcasts

    In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge's financial observability platform, AMI.

    GPT-4 generated summary

    In this episode of the Software at Scale podcast, Robert Cooke, CTO and co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as Infrastructure Lead at Bear Stearns and Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data. Cooke elucidates how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring systems around these automated trading systems, drawing an analogy with nuclear reactors due to the potentially catastrophic repercussions of any malfunction.

    The dialogue then shifts to the impact of significant events like the COVID-19 pandemic on high-frequency trading systems. Cooke postulates that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events. He refers to past instances like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios.

    Cooke then delves into how 3Forge designs software for mission-critical scenarios, making an analogy with military strategy. Utilizing the OODA loop concept (Observe, Orient, Decide, and Act), teams can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas 3Forge's solution facilitates quick orientation and decision-making, substantially reducing reaction time. He cites a scenario involving a sudden surge in Facebook orders, where their tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies where an expedited response is paramount.

    Additionally, Cooke emphasizes the significance of low-latency UI updates in their tool. He explains that their software uses an online programming approach, reacting to changes in real time and only updating the altered components. As data size increases and reaction time becomes more critical, this feature becomes increasingly important. Cooke concludes this segment by discussing the evolution of their clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of users being able to comment on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs.

    In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions. He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the tech industry since his early software-writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both in terms of RAM and disk space. He further elucidates his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data in
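The "only update the altered components" idea described above is, at its core, an observer pattern where a value notifies subscribers only when it actually changes. The sketch below is a generic illustration of that pattern, not 3Forge's implementation.

```python
class Cell:
    """A reactive value that notifies subscribers only on real changes."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set(self, value):
        if value == self._value:   # unchanged: no re-render needed
            return
        self._value = value
        for callback in self._subscribers:
            callback(value)

renders = []
price = Cell(100)
price.subscribe(renders.append)  # pretend this repaints one UI component
price.set(100)  # no-op: the value did not change
price.set(101)  # triggers exactly one notification
```

Skipping no-op updates is what keeps reaction time low as data volume grows: the UI repaints only the components whose inputs changed.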

    56 min
  5. 03/15/2023

    Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri

    Ben Ofiri is the CEO and co-founder of Komodor, a Kubernetes troubleshooting platform. Apple Podcasts | Spotify | Google Podcasts

    We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic.

    Highlights (ChatGPT generated)

    [0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor.
    - Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices.
    - Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
    - The different ways companies migrate to Kubernetes, either starting from a small team and gradually increasing usage, or as a strategic decision from the top down.
    - The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
    - The learning curve for developers to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations.

    [8:17] Tools for Managing Kubernetes.
    - The challenges that arise when trying to operate and manage Kubernetes.
    - DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
    - A report by the cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams.
    - Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
    - The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
    - One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things.
    - Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users.
    - The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.

    [12:03] The Challenge of Balancing Standardization and Flexibility.
    - Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
    - Komodor aims to strike a balance between standardization and flexibility, allowing best practices and guidelines to be established while still allowing for customization and unique needs.

    [16:14] Using Data to Improve Kubernetes Management.
    - The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time.
    - The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.

    [20:40] Why Kubernetes Doesn't Include All Management Functionality.
    - Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
    - Reliability, observability, and operational functionality are typically provided by vendors or cloud providers and not organically by the Kubernetes community.
    - Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.

    [25:05] Keeping Up with Kubernetes Development and Adoption.
    - How Komodor keeps up with Kubernetes development and adoption.
    - The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
    - The use and adoptio
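The automated "infrastructure vs. application" triage described above can be sketched as a simple rule-based classifier over a pod's status reason. The reason strings are real Kubernetes status values, but the mapping is an illustrative assumption, not Komodor's actual logic.

```python
# Hypothetical mapping from Kubernetes pod/container status reasons to an
# owning category; real triage would weigh events, logs, and node health too.
INFRA_REASONS = {"NodeNotReady", "Evicted", "FailedScheduling"}
APP_REASONS = {"CrashLoopBackOff", "OOMKilled", "ImagePullBackOff"}

def classify_pod_issue(reason):
    """Return which team likely owns an issue, given a pod status reason."""
    if reason in INFRA_REASONS:
        return "infrastructure"
    if reason in APP_REASONS:
        return "application"
    return "unknown"

print(classify_pod_issue("CrashLoopBackOff"))  # application
print(classify_pod_issue("FailedScheduling"))  # infrastructure
```

The "unknown" bucket matters in practice: as the episode notes, the platform suggests a routing but leaves the final decision to the user.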

    44 min
  6. 02/01/2023

    Software at Scale 54 - Community Trust with Vikas Agarwal

    Vikas Agarwal is an engineering leader with over twenty years of experience leading engineering teams. We focused this episode on his experience as the Head of Community Trust at Amazon and dealing with the various challenges of fake reviews on Amazon products. Apple Podcasts | Spotify | Google Podcasts

    Highlights (GPT-3 generated)

    [0:00:17] Vikas Agarwal's origin story.
    [0:00:52] How Vikas learned to code.
    [0:03:24] Vikas's first job out of college.
    [0:04:30] Vikas's experience with the review business and community trust.
    [0:06:10] Mission of the community trust team.
    [0:07:14] How to start off with a problem.
    [0:09:30] Different flavors of review abuse.
    [0:10:15] The program for gift cards and fake reviews.
    [0:12:10] Google search and FinTech.
    [0:14:00] Fraud and ML models.
    [0:15:51] Other things to consider when it comes to trust.
    [0:17:42] Ryan Reynolds' funny review on his product.
    [0:18:10] Reddit-like problems.
    [0:21:03] Activism filters.
    [0:23:03] Elon Musk's changing policy.
    [0:23:59] False positives and appeals process.
    [0:28:29] Stress levels and question mark emails from Jeff Bezos.
    [0:30:32] Jeff Bezos' mathematical skills.
    [0:31:45] Amazon's closed-loop auditing process.
    [0:32:24] Amazon's success and leadership principles.
    [0:33:35] Operationalizing appeals at scale.
    [0:35:45] Data science, metrics, and hackathons.
    [0:37:14] Developer experience and iterating changes.
    [0:37:52] Advice for tackling a problem of this scale.
    [0:39:19] Striving for trust and external validation.
    [0:40:01] Amazon's efforts to combat abuse.
    [0:40:32] Conclusion.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    41 min
  7. 12/28/2022

    Software at Scale 53 - Testing Culture with Mike Bland

    Mike Bland is a software instigator - he helped drive adoption of automated testing at Google, and the Quality Culture Initiative at Apple. Apple Podcasts | Spotify | Google Podcasts

    Mike's blog was instrumental towards my decision to pick a job in developer productivity/platform engineering. We talk about the Rainbow of Death - the idea of driving cultural change in large engineering organizations - one of the key challenges of platform engineering teams. And we deep-dive into the value of, and common pushbacks against, automated testing.

    Highlights (GPT-3 generated)

    [0:00 - 0:29] Welcome
    [0:29 - 0:38] Explanation of the Rainbow of Death
    [0:38 - 0:52] Story of the Testing Grouplet at Google
    [0:52 - 5:52] Benefits of Writing Blogs and Engineering Culture Change
    [5:52 - 6:48] Impact of Mike's Blog
    [6:48 - 7:45] Automated Testing at Scale
    [7:45 - 8:10] "I'm a Snowflake" Mentality
    [8:10 - 8:59] Instigator Theory and the Crossing the Chasm Model
    [8:59 - 9:55] Discussion of Dependency Injection and Functional Decomposition
    [9:55 - 16:19] Discussion of Testing and Testable Code
    [16:19 - 24:30] Impact of Organizational and Cultural Change on Writing Tests
    [24:30 - 26:04] Instigator Theory
    [26:04 - 32:47] Strategies for Leaders to Foster and Support Testing
    [32:47 - 38:50] Role of Leadership in Promoting Testing
    [38:50 - 43:29] Philosophical Implications of Testing Practices

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
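The episode's segment on dependency injection and testable code can be illustrated with a minimal sketch: injecting the clock as a parameter makes time-dependent logic trivially testable. The function and names are hypothetical, chosen only to demonstrate the technique.

```python
from datetime import datetime, timezone

def greeting(now_fn=lambda: datetime.now(timezone.utc)):
    """Return a greeting based on the current hour, with the clock injected."""
    hour = now_fn().hour
    return "good morning" if hour < 12 else "good afternoon"

# In production, callers use the real clock; in tests, they inject a fixed one:
fixed_morning = lambda: datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)
print(greeting(now_fn=fixed_morning))  # good morning
```

Without the injected `now_fn`, a test of this function would pass or fail depending on when it ran; this is the kind of seam that makes code testable without large frameworks.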

    1h 7m

Ratings & Reviews

4.6 out of 5 (13 Ratings)
