Software at Scale

Utsav Shah

Software at Scale is where we discuss the technical stories behind large software applications. www.softwareatscale.dev

  1. 2024/08/05

    Software at Scale 60 - Data Platforms with Aravind Suresh

    Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI.

    Apple Podcasts | Spotify | Google Podcasts

    Edited Transcript

    Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

    When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three-to-four-year period. Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

    That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

    This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide. We can also identify trends like the fastest growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames.

    Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips.

    These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

    How does Uber manage real-time versus batch data processing, and what are the trade-offs?

    We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications. For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data.

    On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput – processing large amounts of data over time.

    The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours. The choice between batch and real-time depends on the specific use case. We always ask ourselves: Does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

    What challenges come with maintaining such large-scale data systems, especially as they mature?

    As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity. For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called Databook at Uber to solve this problem.

    Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets. We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table – you need to track down and update all derived data sets as well.

    Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance.

    Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

    How do you see the future of data infrastructure evolving, especially with recent AI advancements?

    The rise of AI and particularly generative AI is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities. Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies.

    We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges.

    Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference.

    Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
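
    The lineage point in the transcript above (a deletion has to reach every data set derived from the source table, not just the table itself) can be illustrated with a small graph walk. This is only a sketch under stated assumptions: the table names, the `LINEAGE` mapping, and the `datasets_to_scrub` helper are all hypothetical and are not taken from Uber's in-house lineage system, which the episode does not describe in detail.

```python
from collections import deque

# Hypothetical lineage graph: each data set maps to the data sets derived from it.
# The names are made up for illustration only.
LINEAGE = {
    "raw.trips": ["derived.trips_daily", "derived.rider_features"],
    "derived.trips_daily": ["reports.city_growth"],
    "derived.rider_features": [],
    "reports.city_growth": [],
}


def datasets_to_scrub(lineage, source):
    """Return the source data set plus everything derived from it, breadth-first."""
    seen = {source}
    order = [source]
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # a data set can have several parents; visit it once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    for name in datasets_to_scrub(LINEAGE, "raw.trips"):
        print(name)
```

    In a real batch system each data set in the returned list would then need its affected files rewritten, since, as the transcript notes, the files themselves are kept immutable.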

    35 minutes
  2. 2023/07/05

    Software at Scale 59 - Incident Management with Nora Jones

    Nora is the CEO and co-founder of Jeli, an incident management platform.

    Apple Podcasts | Spotify | Google Podcasts

    Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli. Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

    We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.

    Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.

    We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software).
This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

    44 minutes
  3. 2023/06/13

    Software at Scale 58 - Measuring Developer Productivity with Abi Noda

    Abi Noda is the CEO and co-founder of DX, a developer productivity platform.

    Apple Podcasts | Spotify | Google Podcasts

    My view on developer experience and productivity measurement aligns extremely closely with DX’s view. The productivity of a group of engineers cannot be measured by tools alone - there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency and inherent domain/codebase complexity, that cannot be measured by tools. At the same time, there are some metrics, like whether an engineer has committed any code changes in their first week/month, that serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide the holistic view and insights into the engineering organization’s throughput.

    In this episode, we discuss the DX platform, and Abi’s recently published research paper on developer experience. We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput.

    GPT-4 generated summary

    In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion goes from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field.

    We initiate our conversation with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality.

    The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer, for example, uses a model where they employ metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making processes. They have used these metrics to identify bottlenecks in their development cycle, and strategically address these pain points. eBay, on the other hand, uses the insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity.

    Next, our dialogue around survey development centered on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics.

    The highlight of the conversation was the introduction of DX's innovative data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach.

    In this episode, Abi Noda shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience. We delve into how DX's unique approach to data aggregation and its potential applications can lead organizations toward more data-driven and effective decision-making processes.

    49 minutes
  4. 2023/05/16

    Software at Scale 57 - Scalable Frontends with Robert Cooke

    Robert Cooke is the CTO and co-founder of 3Forge, a real-time data visualization platform. Apple Podcasts | Spotify | Google Podcasts In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge’s financial observability platform, AMI. GPT-4 generated summary In this episode of the Software at Scale podcast, Robert Cooke, CTO and Co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as the Infrastructure Lead at Bear Stearns and the Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data. Cooke elucidates how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring systems around these automated trading systems, drawing an analogy with nuclear reactors due to the potential catastrophic repercussions of any malfunction. The dialogue then shifts to the impact of significant events like the COVID-19 pandemic on high-frequency trading systems. Cooke postulates that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events. He refers to past instances like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios. Cooke then delves into how 3Forge designs software for mission-critical scenarios, making an analogy with military strategy. 
Utilizing the OODA loop concept - Observe, Orient, Decide, and Act, they can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas their solution facilitates quick orientation and decision-making, substantially reducing reaction time. He cites a scenario involving a sudden surge in Facebook orders where their tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies where an expedited response is paramount. Additionally, Cooke emphasizes the significance of low latency UI updates in their tool. He explains that their software uses an online programming approach, reacting to changes in real-time and only updating the altered components. As data size increases and reaction time becomes more critical, this feature becomes increasingly important. Cooke concludes this segment by discussing the evolution of their clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of users being able to comment on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs. In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions. He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the tech industry since his early software writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both in terms of RAM and disk space. 
He further elucidates his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data in the web browser and shares how they have built every component from scratch to achieve this. Their components are designed to handle as much data as possible, constantly pulling in data based on user interaction. He also emphasizes the importance of constructing a high-performing component library that integrates seamlessly with different components, providing a consistent user experience. He asserts that developers often face confusion when required to amalgamate different components since these components tend to behave differently. He envisions a future where software development involves no JavaScript or HTML, a concept that he acknowledges may be unsettling to some developers. Using the example of a dropdown menu, Cooke explains how a component initially designed for a small amount of data might eventually need to handle much larger data sets. He emphasizes the need to design components to handle the maximum possible data from the outset to avoid such issues. The conversation then pivots to the concept of over-engineering. Cooke argues that building a robust and universal solution from the start is not over-engineering but an efficient approach. He notes the significant overlap in application use cases, making it advantageous to create a component that can cater to a wide variety of needs. In response to the host's query about selling software to Wall Street, Cooke advocates targeting the most demanding customers first. He believes that if a product can satisfy such customers, it's easier to sell to others. He argues that it's challenging to start with a simple product and then scale it up for more complex use cases, but it's feasible to start with a complex product and tailor it for simpler use cases. Cooke further describes their process of creating a software product. 
Their strategy was to focus on core components, striving to make them as efficient and effective as possible. This involved investing years on foundational elements like string libraries and data marshalling. After establishing a robust foundation, they could then layer on additional features and enhancements. This approach allowed them to produce a mature and capable product eventually. They also underscore the inevitability of users pushing software to its limits, regardless of its optimization. Thus, they argue for creating software that is as fast as possible right from the start. They refer to an interview with Steve Jobs, who argued that the best developers can create software that's substantially faster than others. Cooke's team continually seeks ways to refine and improve the efficiency of their platform. Next, the discussion shifts to team composition and the necessary attributes for software engineers. Cooke emphasizes the importance of a strong work ethic and a passion for crafting good software. He explains how his ambition to become the best software developer from a young age has shaped his company's culture, fostering a virtuous cycle of hard work and dedication among his team. The host then emphasizes the importance of engineers working on high-quality products, suggesting that problems and bugs can sap energy and demotivate a team. Cooke concurs, comparing the experience of working on high-quality software to working on an F1 race car, and how the pursuit of refinement and optimization is a dream for engineers. The conversation then turns to the importance of having a team with diverse thought processes and skillsets. Cooke recounts how the introduction of different disciplines and perspectives in 2019 profoundly transformed his company. 
The dialogue then transitions to the state of software solutions before the introduction of their high-quality software, touching upon the compartmentalized nature of systems in large corporations and the problems that arise from it. Cooke explains how their solution offers a more comprehensive and holistic overview that cuts across different risk categories. Finally, in response to the host's question about open-source systems, Cooke expresses reservations about the use of open-source software in a corporate setting. However, he acknowledges the extensive overlap and redundancy among the many new systems being developed. Although he does not identify any specific groundbreaking technology, he believes the rapid proliferation of similar technologies might lead to considerable technical debt in the future. Host Utsav wraps up the conversation by asking Cooke about his expectations and concerns for the future of technology and the industry. Cooke voices his concern about the continually growing number of different systems and technologies that companies are adopting, which makes integrating and orchestrating all these components a challenge. He advises companies to exercise caution when adopting multiple technologies simultaneously. However, Cooke also expresses enthusiasm about the future of 3Forge, a platform he has devoted a decade of his life to developing. He expresses confidence in the unique approach and discipline employed in building the platform. Cooke is optimistic about the company's growth and marketing efforts and their focus on fostering a developer community. He believes that the platform will thrive as developers share their experiences, and the product gains momentum. Utsav acknowledges the excitement and potential challenges that lie ahead, especially in managing community-driven systems. They conclude the conversation by inviting Cooke to return for another discussion in the future to review the progression and evolution of the topic. 
Both express their appreciation for the fruitful discussion before ending the podcast.

    56 minutes
  5. 2023/03/15

    Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri

    Ben Ofiri is the CEO and Co-Founder of Komodor, a Kubernetes troubleshooting platform.

    Apple Podcasts | Spotify | Google Podcasts

    We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic.

    Highlights (ChatGPT Generated)

    [0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor.
    - Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices.
    - Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
    - The different ways companies migrate to Kubernetes, either starting from a small team and gradually increasing usage, or a strategic decision from the top down.
    - The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
    - The learning curve for developers to be able to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations.

    [8:17] Tools for Managing Kubernetes.
    - The challenges that arise when trying to operate and manage Kubernetes.
    - DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
    - A report by the cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams.
    - Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
    - The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
    - One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things.
    - Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users.
    - The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.

    [12:03] The Challenge of Balancing Standardization and Flexibility.
    - Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
    - Komodor aims to strike a balance between standardization and flexibility, allowing for best practices and guidelines to be established while still allowing for customization and unique needs.

    [16:14] Using Data to Improve Kubernetes Management.
    - The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time.
    - The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.

    [20:40] Why Kubernetes Doesn't Include All Management Functionality.
    - Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
    - Reliability, observability, and operational functionality are typically provided by vendors or cloud providers and not organically from the Kubernetes community.
    - Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.

    [25:05] Keeping Up with Kubernetes Development and Adoption.
    - How Komodor keeps up with Kubernetes development and adoption.
    - The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
    - The use and adoption of custom resources is a constantly evolving and rapidly changing area, requiring quick research and translation into product specs.
    - The company hires deeply technical people, including those with backgrounds in DevOps and SRE, to ensure a deep understanding of the complex problem they are trying to solve.

    [32:12] The Effects of the Economy on Komodor.
    - The effects of the economic shift on Komodor.
    - Companies must be more cost-efficient, leading to increased interest in Kubernetes and tools like Komodor.
    - The pandemic has also highlighted the need for remote work and cloud-based infrastructure, further fueling demand.
    - Komodor has seen growth as a result of these factors and believes it is well-positioned for continued success.

    [36:17] The Future of Kubernetes and Komodor.
    - Kubernetes will continue to evolve and be adopted more widely by organizations of all sizes and industries.
    - The team is excited about the potential of rule engines and other tools to improve management and automation within Kubernetes.

    44 minutes
  6. 2023/02/01

    Software at Scale 54 - Community Trust with Vikas Agarwal

    Vikas Agarwal is an engineering leader with over twenty years of experience leading engineering teams. We focused this episode on his experience as the Head of Community Trust at Amazon and dealing with the various challenges of fake reviews on Amazon products.

    Apple Podcasts | Spotify | Google Podcasts

    Highlights (GPT-3 generated)

    [0:00:17] Vikas Agarwal's origin story.
    [0:00:52] How Vikas learned to code.
    [0:03:24] Vikas's first job out of college.
    [0:04:30] Vikas's experience with the review business and community trust.
    [0:06:10] Mission of the community trust team.
    [0:07:14] How to start off with a problem.
    [0:09:30] Different flavors of review abuse.
    [0:10:15] The program for gift cards and fake reviews.
    [0:12:10] Google search and FinTech.
    [0:14:00] Fraud and ML models.
    [0:15:51] Other things to consider when it comes to trust.
    [0:17:42] Ryan Reynolds' funny review on his product.
    [0:18:10] Reddit-like problems.
    [0:21:03] Activism filters.
    [0:23:03] Elon Musk's changing policy.
    [0:23:59] False positives and appeals process.
    [0:28:29] Stress levels and question mark emails from Jeff Bezos.
    [0:30:32] Jeff Bezos' mathematical skills.
    [0:31:45] Amazon's closed loop auditing process.
    [0:32:24] Amazon's success and leadership principles.
    [0:33:35] Operationalizing appeals at scale.
    [0:35:45] Data science, metrics, and hackathons.
    [0:37:14] Developer experience and iterating changes.
    [0:37:52] Advice for tackling a problem of this scale.
    [0:39:19] Striving for trust and external validation.
    [0:40:01] Amazon's efforts to combat abuse.
    [0:40:32] Conclusion.

    41 minutes
  7. 2022/12/28

    Software at Scale 53 - Testing Culture with Mike Bland

    Mike Bland is a software instigator - he helped drive adoption of automated testing at Google, and the Quality Culture Initiative at Apple.

    Apple Podcasts | Spotify | Google Podcasts

    Mike’s blog was instrumental in my decision to pick a job in developer productivity/platform engineering. We talk about the Rainbow of Death - the idea of driving cultural change in large engineering organizations - one of the key challenges of platform engineering teams. And we deep dive into the value of automated testing and the common pushbacks against it.

    Highlights (GPT-3 generated)

    [0:00 - 0:29] Welcome
    [0:29 - 0:38] Explanation of Rainbow of Death
    [0:38 - 0:52] Story of Testing Grouplet at Google
    [0:52 - 5:52] Benefits of Writing Blogs and Engineering Culture Change
    [5:52 - 6:48] Impact of Mike's Blog
    [6:48 - 7:45] Automated Testing at Scale
    [7:45 - 8:10] "I'm a Snowflake" Mentality
    [8:10 - 8:59] Instigator Theory and Crossing the Chasm Model
    [8:59 - 9:55] Discussion of Dependency Injection and Functional Decomposition
    [9:55 - 16:19] Discussion of Testing and Testable Code
    [16:19 - 24:30] Impact of Organizational and Cultural Change on Writing Tests
    [24:30 - 26:04] Instigator Theory
    [26:04 - 32:47] Strategies for Leaders to Foster and Support Testing
    [32:47 - 38:50] Role of Leadership in Promoting Testing
    [38:50 - 43:29] Philosophical Implications of Testing Practices

    1 hour 7 minutes

Ratings and Reviews

4.6
out of 5
13 ratings
