59 episodes

Software at Scale Utsav Shah

- Technology
- 4.6 • 13 Ratings

Software at Scale is where we discuss the technical stories behind large software applications.

www.softwareatscale.dev

- JUL 5, 2023
Software at Scale 59 - Incident Management with Nora Jones


Nora is the CEO and co-founder of Jeli, an incident management platform.
Apple Podcasts | Spotify | Google Podcasts
Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli.
Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.
We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.
Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.
We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
- 44 min
- JUN 13, 2023
Software at Scale 58 - Measuring Developer Productivity with Abi Noda


Abi Noda is the CEO and co-founder of DX, a developer productivity platform.
My view on developer experience and productivity measurement aligns extremely closely with DX’s view. The productivity of a group of engineers cannot be measured by tools alone: there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency and inherent domain/codebase complexity, that tools cannot capture. At the same time, some metrics, like whether an engineer has committed any code changes in their first week/month, serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide a holistic view of, and insights into, the engineering organization’s throughput.
In this episode, we discuss the DX platform, and Abi’s recently published research paper on developer experience. We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput.
GPT-4 generated summary
In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion goes from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field.
We initiate our conversation with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality.
The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer, for example, uses a model where they employ metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making processes. They have used these metrics to identify bottlenecks in their development cycle, and strategically address these pain points. eBay, on the other hand, uses the insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity.
Next, our dialogue around survey development centered on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics.
The highlight of the conversation was the introduction of DX's innovative data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach.
In this episode, Abi Noda shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience. We delve into how DX's unique approach to data aggregation and its potential applications can lead organizations toward more data-driven and effective decision-making processes.

- 49 min
- MAY 16, 2023
Software at Scale 57 - Scalable Frontends with Robert Cooke


Robert Cooke is the CTO and co-founder of 3Forge, a real-time data visualization platform.
In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge’s financial observability platform, AMI.
GPT-4 generated summary
In this episode of the Software at Scale podcast, Robert Cooke, CTO and Co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as the Infrastructure Lead at Bear Stearns and the Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data.
Cooke elucidates how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring systems around these automated trading systems, drawing an analogy with nuclear reactors due to the potential catastrophic repercussions of any malfunction.
The dialogue then shifts to the impact of significant events like the COVID-19 pandemic on high-frequency trading systems. Cooke postulates that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events. He refers to past instances like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios.
Cooke then delves into how 3Forge designs software for mission-critical scenarios, making an analogy with military strategy. Utilizing the OODA loop concept - Observe, Orient, Decide, and Act, they can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas their solution facilitates quick orientation and decision-making, substantially reducing reaction time.
He cites a scenario involving a sudden surge in Facebook orders where their tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies where an expedited response is paramount.
Additionally, Cooke emphasizes the significance of low latency UI updates in their tool. He explains that their software uses an online programming approach, reacting to changes in real-time and only updating the altered components. As data size increases and reaction time becomes more critical, this feature becomes increasingly important.
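The idea of updating only the altered components can be illustrated with a small sketch. This is not 3Forge's implementation; the `View` class, the `diff_rows` helper, and the ticker data are hypothetical names invented for illustration. The sketch compares each incoming snapshot against what is already rendered and redraws only the cells that differ.

```python
from typing import Any, Dict, List, Tuple


def diff_rows(previous: Dict[str, Any], current: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return the entries that are new or changed, plus the keys that disappeared."""
    changes = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    removed = [key for key in previous if key not in current]
    return changes, removed


class View:
    """Hypothetical UI model that redraws only the cells whose values changed."""

    def __init__(self) -> None:
        self.rendered: Dict[str, Any] = {}
        self.render_count = 0  # number of individual cell redraws so far

    def apply(self, snapshot: Dict[str, Any]) -> Dict[str, Any]:
        # Compute the delta first, then mutate the rendered state.
        changes, removed = diff_rows(self.rendered, snapshot)
        for key, value in changes.items():
            self.rendered[key] = value
            self.render_count += 1
        for key in removed:
            del self.rendered[key]
        return changes


view = View()
view.apply({"AAPL": 190.1, "MSFT": 330.0})            # first snapshot: both cells drawn
delta = view.apply({"AAPL": 190.1, "MSFT": 331.5})    # only MSFT changed
print(delta, view.render_count)
```

The cost of an update then scales with the number of changed cells rather than with the size of the whole dataset, which is the property the paragraph above describes as mattering more as data grows.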
Cooke concludes this segment by discussing the evolution of their clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of users being able to comment on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs.
In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions. He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the tech industry since his early software writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both in terms of RAM and disk space.
He further elucidates his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data in
- 55 min
- APR 17, 2023
Software at Scale 56 - SaaS cost with Roi Rav-Hon


Roi Rav-Hon is the co-founder and CEO of Finout, a SaaS cost management platform.
In this episode, we review the challenge of maintaining reasonable SaaS costs for tech companies. Usage-based pricing for infrastructure leads to a gradual ramp-up of costs, and cost has always sneakily come up as a priority in my career as an infrastructure/platform engineer. So I’m particularly interested in how engineering teams can better understand and track infrastructure costs, “shift left” that tracking, and prevent regressions.
We specifically go over Kubernetes cost management, and why cost management needs to be attributable to the most specific teams in order to be self-governing in an organization.
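As a rough illustration of team-level attribution, here is a minimal sketch that splits a shared cluster bill across teams in proportion to the CPU-hours their workloads requested. It is not Finout's method; the `attribute_costs` function, the `team` field, and the sample numbers are all assumptions made up for this example, and real allocation would also need to account for memory, storage, and idle capacity.

```python
from collections import defaultdict
from typing import Dict, List


def attribute_costs(workloads: List[dict], cluster_cost: float) -> Dict[str, float]:
    """Split a cluster bill across teams in proportion to requested CPU-hours.

    Each workload is a dict with a "cpu_hours" value and an optional "team"
    label (hypothetical schema). Unlabeled workloads are grouped under
    "unattributed" so that the gap in ownership stays visible.
    """
    cpu_hours_by_team: Dict[str, float] = defaultdict(float)
    for workload in workloads:
        team = workload.get("team") or "unattributed"
        cpu_hours_by_team[team] += workload["cpu_hours"]

    total_cpu_hours = sum(cpu_hours_by_team.values())
    if total_cpu_hours == 0:
        return {}
    return {
        team: cluster_cost * hours / total_cpu_hours
        for team, hours in cpu_hours_by_team.items()
    }


workloads = [
    {"team": "payments", "cpu_hours": 30.0},
    {"team": "search", "cpu_hours": 60.0},
    {"cpu_hours": 10.0},  # no team label
]
print(attribute_costs(workloads, cluster_cost=1000.0))
```

Keeping an explicit "unattributed" bucket is a deliberate choice in this sketch: a team can only govern its own spend if the share nobody owns is reported rather than silently spread across everyone.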

- 28 min
- MAR 15, 2023
Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri


Ben Ofiri is the CEO and Co-Founder of Komodor, a Kubernetes troubleshooting platform.
We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic.
Highlights (ChatGPT Generated)
[0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor.
- Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices.
- Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
- The different ways companies migrate to Kubernetes, either starting from a small team and gradually increasing usage, or a strategic decision from the top down.
- The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
- The learning curve for developers to be able to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations.
[8:17] Tools for Managing Kubernetes.
- The challenges that arise when trying to operate and manage Kubernetes.
- DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
- A report by the cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams.
- Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
- The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
- One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things.
- Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users.
- The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.
[12:03] The Challenge of Balancing Standardization and Flexibility.
- Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
- Komodor aims to strike a balance between standardization and flexibility, allowing for best practices and guidelines to be established while still allowing for customization and unique needs.
[16:14] Using Data to Improve Kubernetes Management.
- The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time.
- The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.
[20:40] Why Kubernetes Doesn't Include All Management Functionality.
- Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
- Reliability, observability, and operational functionality are typically provided by vendors or cloud providers and not organically from the Kubernetes community.
- Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.
[25:05] Keeping Up with Kubernetes Development and Adoption.
- How Komodor keeps up with Kubernetes development and adoption.
- The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
- The use and adoptio
- 44 min
- FEB 1, 2023
Software at Scale 54 - Community Trust with Vikas Agarwal


Vikas Agarwal is an engineering leader with over twenty years of experience leading engineering teams. We focused this episode on his experience as the Head of Community Trust at Amazon and dealing with the various challenges of fake reviews on Amazon products.
Highlights (GPT-3 generated)
[0:00:17] Vikas Agarwal's origin story.
[0:00:52] How Vikas learned to code.
[0:03:24] Vikas's first job out of college.
[0:04:30] Vikas's experience with the review business and community trust.
[0:06:10] Mission of the community trust team.
[0:07:14] How to start off with a problem.
[0:09:30] Different flavors of review abuse.
[0:10:15] The program for gift cards and fake reviews.
[0:12:10] Google search and FinTech.
[0:14:00] Fraud and ML models.
[0:15:51] Other things to consider when it comes to trust.
[0:17:42] Ryan Reynolds' funny review on his product.
[0:18:10] Reddit-like problems.
[0:21:03] Activism filters.
[0:23:03] Elon Musk's changing policy.
[0:23:59] False positives and appeals process.
[0:28:29] Stress levels and question mark emails from Jeff Bezos.
[0:30:32] Jeff Bezos' mathematical skills.
[0:31:45] Amazon's closed loop auditing process.
[0:32:24] Amazon's success and leadership principles.
[0:33:35] Operationalizing appeals at scale.
[0:35:45] Data science, metrics, and hackathons.
[0:37:14] Developer experience and iterating changes.
[0:37:52] Advice for tackling a problem of this scale.
[0:39:19] Striving for trust and external validation.
[0:40:01] Amazon's efforts to combat abuse.
[0:40:32] Conclusion.

- 40 min


Excellent show!

Software at Scale has quickly become one of my favorite podcasts! I’m consistently impressed by the depth of insights and knowledge in each episode. No matter the topic, you’re guaranteed to learn something every time you listen. Highly recommend!