A podcast from the team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
115. Demystifying the User Experience with Performance Monitoring
In this episode of Codeish, Greg Nokes, distinguished technical architect with Salesforce Heroku, talks with Innocent Bindura, a senior developer at Raygun about performance monitoring.
Raygun provides tools and utilities for developers to improve software quality through crash reporting and browser and application performance monitoring.
According to Innocent, the absence of crash reports does not mean that software is performing well. Software can work - but not be optimal. Thus, Innocent takes a holistic view:
“I look at the size of my audience, and if it's something sizable, that gets a lot of traffic, for example, a shopping cart that gets a lot of traffic on a Black Friday. I would want to be in a comfort zone when I know that during the peak periods my application is still performing, so I tend to look at the end-user, how their experience looks like during very high peak periods. And from there I start working my way back to the technology that is supporting that application.”
Raygun really shines in monitoring the time spent in different functions and helping to improve the performance of highly hit endpoints. This includes performance telemetry of browser pages, the current application running, and server-side performance application monitoring. Raygun has lightweight SDKs or lightweight providers that can be injected into code. These provide a catch-all to deal with unhandled exceptions. They also encourage best practices for developers.
Over time, Raygun can provide a complete picture of how the user session performed “from the point they visited your page, logged in, visited a couple of pages, and then left your application. The crash reports and the traces relating to that particular user are also tied up with that session on the Raygun side.” Innocent highlights a sampling strategy that reduces the noise of APM data.
Raygun also provides a birds-eye application view that provides aggregated stats on application performance: “For the run product, you will have each page aggregated over time, regardless of how many users you've had in a period of time. You want to look at the individual sessions. That information is aggregated and you're able to see, for example, your median, your P90, and P99.”
Innocent focuses on the P99 figure because “whoever is in there has had a terrible time, and that forms the basis of my investigations. I want to know why there are so many sessions in that P99, and that P99 is probably a six or seven-second load time. I want to move that to a sub-three-second.” Innocent provides a definition of P99 for new customers undergoing the journey of performance optimization.
Next, Innocent asserts that decisions should be based on numbers and empirical evidence. He has found that the use of actionable data has enabled him to redesign applications and focus on the mission-critical command needed in real time.
Innocent concludes: “I think the life of a developer is an interesting one. We fit in everywhere situations permit, and we definitely take different routes to develop our careers. But ultimately what we should all be concerned about is the quality of the products that I produce. This definitely reflects on my capability as a software developer. What sets me apart from the next developer is not the number of cool techniques I can do with code, it's delivering a product that actually works and what better way of knowing what works when you actually measure things. Everybo
114. Beyond Root Cause Analysis in Complex Systems
In this episode of Codeish, Marcus Blankenship, a Senior Engineering Manager at Salesforce, is joined by Robert Blumen, a Lead DevOps Engineer at Salesforce.
During their discussion, they take a deep dive into the theories that underpin human error and complex system failures and offer fresh perspectives on improving complex systems.
Root cause analysis is the method of analyzing a failure after it occurs in an attempt to identify the cause. This method looks at the fundamental reasons that a failure occurs, particularly digging into issues such as processes, systems, designs, and chains of events. Complex system failures usually begin when a single component of the system fails, requiring nearby "nodes" (or other components in the system network) to take up the workload or obligation of the failed component.
Complex system breakdowns are not limited to IT. They also exist in medicine, industrial accidents, shipping, and aeronautics. As Robert asserts: "In the case of IT, [systems breakdowns] mean people can't check their email, or can’t obtain services from a business. In other fields of medicine, maybe the patient dies, a ship capsizes, a plane crashes."
The 5 WHYs
The 5 WHYs root cause analysis is about truly getting to the bottom of a problem by asking “why” five levels deep. Using this method often uncovers an unexpected internal or process-related problem.
Accident investigation can represent both simple and complex systems. Robert explains, "Simple systems are like five dominoes that have a knock-on effort. By comparison, complex systems have a large number of heterogeneous pieces. And the interaction between the pieces is also quite complex. If you have N pieces, you could have N squared connections between them and an IT system."
He further explains, "You can lose a server, but if you're properly configured to have retries, your next level upstream should be able to find a different service. That's a pretty complex interaction that you've set up to avoid an outage."
In the case of a complex system, generally, there is not a single root cause for the failure. Instead, it's a combination of emergent properties that manifest themselves as the result of various system components working together, not as a property of any individual component.
An example of this is the worst airline disaster in history. Two 747 planes were flying to Gran Canaria airport. However, the airport was closed due to an exploded bomb, and the planes were rerouted to Tenerife. The runway in Tenerife was unaccustomed to handling 747s. Inadequate radars and fog compounded a combination of human errors such as misheard commands. Two planes tried to take off at the same time and collided with each other in the air.
Robert talks about Dr. Cook, who wrote about the dual role of operators.
"The dual role is the need to preserve the operation of the system and the health of the business. Everything an operator does is with those two objectives in mind." They must take calculated risks to preserve outputs, but this is rarely recognized or complemented.
Another component of complex systems is that they are in a perpetual state of partially broken. You don't necessarily discover this until an outage occurs. Only through the post-mortem process do you realize there was a failure. Humans are imperfect beings and are naturally prone to making errors. And when we are given responsibilities, there is always the chance for error.
What's a more useful way of thinking about the causes of failures in a complex system?
Robert gives the example of a tree structure or AC graph showing one node at the edge, representing the outage or incident.
If you step back one layer, you might not ask what is the cause, but rather what were contributing causes? In this manner, you might find multiple contributing factors that interconnect
113. Principles of Pragmatic Engineering
Karan Gupta, Senior Vice President of Engineering, Shift Technologies joins host Marcus Blankenship, Senior Manager Software Engineering, Heroku in this week's episode.
Karan shared his career trajectory, which includes founding aliceapp.ai, a fast, privacy-first recording and transcription service for investigative journalism, and acting as an advisor for various companies, including Alphy, a platform for women's career advancement.
A concept important to Karan is pragmatic engineering. Pragmatic engineering is about having "an oversized impact on the business by applying the right technology at the right time". It's about the technology, the process of creating that technology, and its impact on the underlying business. For example, building an electric car is cool, but producing a version in which people feel safe? That's engineering that changes the world forever.
According to Karan, these are the key things that matter in development:
Function (capabilities provided)
Form (how it looks and feels)
Fabrication (how it is built on the inside)
He recalls the value of the snake game on 404 pages. And the value of intentionality, saying "once you add a feature, it's probably going to be there forever. It's probably going to need maintenance and love and care forever. So do we really want to put it in?"
He talks about design and the balance between form versus function, such as designing something aesthetically pleasing versus easy to use. Then, there's fabrication: "How well can we make it? Can we deliver it quickly? And can others maintain it?" Sometimes using off-the-shelf software and well-proven frameworks are the most effective, and "Perfect is the enemy of good enough."
Karan stresses the importance of being a learning organization. "Be open to picking up what's out there to help make more informed choices, especially if the choice is to stick with the tried and tested." Good engineers are always open to learning about what new things are coming out and open to different opinions, frameworks, and ways of thinking.
Links from this episode
112. Managing Public Key Infrastructure within an Enterprise
This episode features a conversation between Robert Blumen, DevOps engineer at Salesforce, and Matthew Myers, principal public key interface (PKI) engineer at Salesforce. Matthew shares his experience running a certification authority (CA) within the Salesforce enterprise. He shares the rationale for the decision to take CA in-house, explaining that becoming a certificate authority means you can become the master of your universe by establishing internal trust. A private or in-house CA can act in ways not dissimilar to a PKU but can issue its own certificates, trusted only by internal users and systems.
Using a public certificate authority can be expensive at scale, particularly for enterprises with millions (or even billions) of certificates. However, an enterprise CA can be an important cost-saving measure. It adds a granular level of control in certificate issuing, such as naming conventions and the overall lifecycle. You can effectively have as many CAs as you can afford to maintain as well as the ability to separate them by use case and environment.
Further, having the ability to control access to data and to verify the identities of people, systems, and devices in-house removes the cybersecurity challenges such as the recent SolarWinds supply chain attack. Matthew notes that Information within a PKI is potentially insecure “as the information gets disclosed to the internet and printed on the actual certificates which leave them vulnerable to experienced hackers.” Matthews shares the importance of onboarding and people management and the need to ensure staff doesn’t buy SSL certificates externally.
Myerss offers some thoughts for businesses considering the DIY route discussing the advantages and limitations of open source resources such as OpenSSL and Let's Encrypt. Identity mapping and tracking are particularly important as you’re giving certificates to people, systems, and services that will eventually expire. Matthew shares the benefits of a central identity store, its core features, and how it works in tandem with PKI infrastructure. There’s also the need to know how many certificates you have in the wild at any given time.
As a manager, the revocation infrastructure for PKI implementation means that you're inserting yourself in the middle of every single deal, because if you’re doing it correctly everything needs to validate that the certificates are genuine. When you have a real possibility of slowing down others’ connections, you want to ensure that your supporting infrastructure is positioned in such a way that you are providing those responses as quickly as possible. Network latency becomes a very real thing.
Auditability and the ability to trust a certificate authority are paramount. The service that creates and maintains a PKI should provide records of its development and usage so that an auditor or third party can evaluate it.
Links from this episode
Wikipedia page on Public Key Infrastructure
Wikipedia page on Certificate Authorities
111. Gift Cards for Small Businesses
This episode is a conversation between Heroku developer advocate, Chris Castle and James Dong, developer and owner of Last Minute Gear. The business enables San Francisco residents to buy, rent, and borrow clothing and outdoor gear for activities such as camping, snow sports, and climbing. During the early days of the pandemic, the business was forced to close to comply with shelter-in-place regulations. There was an outpouring of support for small businesses, but not everyone has a Venmo account or wants to donate to a GoFundMe appeal.
While many used the pandemic to catch up on Netflix and banana bread baking, James spent a day coding a website and platform where businesses could sell gift cards. It not only helped his own anxiety and insomnia but helped brick-and-mortar businesses like gyms and restaurants (and his own shop) to still earn revenue.
It allowed customers to purchase gift cards to be remunerated once businesses reopened. While other platforms with this functionality already existed, James’ project included business-critical functions, such as processing payments and gift cards.
James talks about his experiences of anxiety and insomnia which acted as catalysts in making his website operational in just one day. Support from Stripe and Heroku meant there were no fees—all money generated went to the businesses.
The conversation offers interesting insights into the value of using a decision logger to document ideas and milestones as well as notes and commit messages to explain why particular decisions were made at certain points in time. It’s also a great example of what can happen when developers build projects that help others in need.
Links from this episode
Last minute gear — James’ outdoor sports store.
Gift Cards for Small Businesses
110. Scaling a Bernie Meme
This episode is a conversation led by Greg Nokes, a Product Manager with Salesforce, Dan Mehlman, a Director of Technical Architecture for Salesforce, Mike Rose, a Director of Technical Architecture for Salesforce, Jack Ziesing, a Technical Architect with Salesforce. They're interviewing Nick Sawhney, a college student who saw an opportunity to make his friends laugh and built something that grew beyond his wildest dreams. At the 2021 US Inauguration, a single shot of Bernie Sanders sitting in a chair captured the hearts of many on the Internet. People everywhere were photoshopping him in the unlikeliest of places. Nick utilized his Python skills and quickly built a Heroku app that would allow users to place Bernie anywhere in the world, by adding him to any image available on Google Street View.
To say the app was a success was an understatement. Inundated by tweets and distracted by press requests, Nick couldn't devote the time needed to keep the app stable and operational. He sent out a desperate tweet for help, only to be picked up by no less than Dan and Michael, who recruited Jack to help Nick with his operational issues. They paired together in a number of ways, optimizing Jack's Python code, securing its authentication logic, and autoscaling dynos in order to handle the waves of traffic. All of these rapid changes allowed Nick to step back and engage with fans on where they'd like to take Bernie next.
In addition to a newfound gratitude towards Heroku's team, Nick learned a few lessons from this experience. He was really humbled by the availability of the engineering community to donate their time and knowledge to help his issues. It's also inspired him to create videos to teach others how they can mitigate scaling issues in their architecture before it becomes a problem. He's also hoping to create some open source tools that to monitor things like server costs and availability issues for other small projects.