Dear Analyst

KeyCuts

This is a podcast made by a lifelong analyst. I cover topics including Excel, data analysis, and tools for sharing data. In addition to data analysis topics, I may also cover topics related to software engineering and building applications. I also do a roundup of my favorite podcasts and episodes.

Episodes

  1. 02/12/2024

    Dear Analyst #133: Find or check if a cell contains text from a list of values or partial matching text in a list (3 methods)

    There are times when a problem nags at you like a pebble in your shoe. Even when you're not working on the problem itself, you're thinking about it, which makes it even worse. Rarely do I come across a Google Sheets/Excel task I cannot solve with formulas, but this one evaded my extensive knowledge (or lack thereof). This problem kept nagging at me because it felt like it should be easy to do. Like it was something I've solved before in the past, or a data manipulation task that should be simple. Yet, when I tried building out the solution, I couldn't quite figure it out. Here's a quick graphic showing the data question at hand:

    Breaking down the problem of checking if a cell value contains text from another list

    Seems simple, right? I have a bunch of sentences in column A and a bunch of words in column C. I want to know if the sentences in column A contain any of the words in column C. I faced this task at work a few weeks ago, and the kicker was that not only did I need to figure out if the sentence contained the word, but also return another column from that "List to check" (imagine another list of values in column D in the screenshot above that I want to return). Immediately, I thought about different permutations of VLOOKUP and SEARCH, but quickly realized that on their own or combined, these formulas wouldn't do the trick. If we break down the problem, what we really need to do is this:

    * Loop through each color in column C
    * Check to see if cell A2 contains any of the words in step #1
    * If it does, great! Return a TRUE or another column from the lookup list
    * Move to cell A3 and loop through all the colors again

    Once we think through the steps involved, it starts to become a trickier problem. Again, I thought this problem would have a simple solution. It kept on nagging at me, so I figured I should share the solution I came up with in this episode.
    The solution I'm showing below is in Google Sheets, and the formulas are slightly different (actually easier) in Excel. Here's the Google Sheet with the 3 methods I came up with after doing some research. This episode is also a YouTube tutorial if you prefer seeing the solution: https://youtu.be/bgsdDpZ3dTY

    Method 1: Check if cell value contains text from list (partial match) and return TRUE or FALSE

    This first method is actually the most important method to understand because it's the building block for methods #2 and #3, where we want to return another column from the lookup list. This is the actual dummy data from the Google Sheet we'll be using to build out the formula for method #1: The ten sentences all contain the first name of a character from the TV show The Fresh Prince of Bel-Air (one of the all-time greatest TV shows, of course). There is a list of each character's first name in column F. Some sentences do not contain a character from the list in column F. In column B, the goal is to write a formula that returns TRUE or FALSE if the sentence in column A contains one of the first names in cells F2:F8. If you want to jump straight to the answer, the formula you write in cell B2 is this:

    =OR(ARRAYFORMULA(ISNUMBER(ARRAYFORMULA(SEARCH($F$2:$F$8,A2)))))

    If you're using Excel (in Office 365), you can basically take out the ARRAYFORMULA in the formula, or use CTRL+SHIFT+ENTER to enter the formula (if not using Office 365):

    32 min
  2. 10/09/2024

    Dear Analyst #132: How the semantic layer translates your physical data into user-centric business data with Frances O’Rafferty

    When you think of your data warehouse, the "semantic layer" may not be the first thing that pops into your mind. Prior to reading Frances O'Rafferty's blog post on this topic, I didn't even know this was a concept that mattered in the data stack. To be honest, the concept is still a bit confusing to me since I'm not building data warehouses and data products all day. Frances grew up in northern England studying mathematics during the recession. The decision to jump into data was a function of what jobs happened to be available at the time. Frances worked through a variety of data warehousing, BI, and ETL roles before looking more into the data management space, like data modeling and cataloguing. This conversation is a deep dive into the world of data warehousing, data catalogues, and of course, the data semantic layer.

    Enforcing data warehouse conformity for an insurance company

    Imagine an insurance company where the policies are in two different systems. Which database contains the "right" policy for a customer? This is the mess Frances had to deal with when she helped build out the insurance company's data warehouse. What I thought was interesting is that Frances' team looked at the source data and then interviewed people in the business to understand how the data is generated and how the data is being used. The questions she was asking were pretty high-level:

    * What do you do on a day-to-day basis?
    * What works well and doesn't work well?
    * What would you like the data to do?

    Source: LinkedIn

    Data quality validation checks and global lookups were set up so that if a new piece of data entered the warehouse and didn't match, the administrator would get an alert. They would then have to figure out what to do with that rogue piece of data to fit the rules that had been set up. A methodology Frances brought up that I'd never heard of before is the Kimball methodology for setting up a data warehouse or BI system.
    The main tenets of the methodology are basically how modern data warehouses are set up: add business value, structure data with dimensions, and develop the warehouse iteratively. This is an image of the lifecycle from their website:

    Source: Kimball Group

    Focusing on different layers of the warehouse "stack"

    Frances' team first focused on the data source layer and tried to figure out where all the data came from. After that came the consolidation layer. The consolidation layer is where the data gets split into facts and dimensions. I figured even for a data warehouse project, Excel must come into play at some point. Excel was used for all the modeling to figure out what the dimensions and facts were. It wasn't a core part of the warehouse, but it was still a one-time-use tool in the development of the warehouse. The final layer is the target layer, where we get more into the business intelligence realm. There are different ways the insurance company wanted to see the data. So Frances' team had to create different views of the data to answer questions like: What premiums have we received? What transactions have come through? The actuarial team wanted to see what the balance was on an account, so another view was created for them. Frances noticed that different regions would call the data different things but they were all s...

    35 min
  3. 05/08/2024

    Dear Analyst #131: Key insights and best practices from writing SQL for 15+ years with Ergest Xheblati

    If you could only learn one programming language for the rest of your career, what would it be? You could Google the most popular programming languages and just pick one of the top 3 and off you go (FYI they are Python, C++, and C). Or, you could pick the measly #10 and build a thriving career out of it. Ergest Xheblati didn't just pick SQL and decide to bet his career on it. He started as a software engineer, held various roles in data, and even became a product manager. After trying these different roles, Ergest would still find his way back to data engineering and has dedicated 15+ years to writing SQL for a living. In this episode he talks about why he loves SQL, reducing the number of dashboards in your company, and best practices for writing SQL.

    Why Ergest loves writing SQL

    The reason Ergest loves SQL is also the reason most things get invented: laziness. As that Stack Exchange thread points out, a lazy developer tends to find shortcuts and automations to make repetitive and tedious tasks less onerous. You could also argue that Excel shortcuts are a result of analysts being lazy and not wanting to use their mouse to do mundane formatting tasks.

    Source: Invisible Bread

    As it pertains to programming, Ergest saw that a standard framework might require 20-30 lines of code to pull some data from a database. Ergest could do that same operation by writing a few lines of SQL with a simple SELECT statement.

    Solving business problems with technology

    Ergest was a data analyst, data engineer, and also what we now call an analytics engineer. When Ergest was a data analyst, he didn't have tools like dbt, which made it harder to succeed as an analyst. As with many data roles, Ergest still straddles multiple roles today. He still considers himself a blend between a data analyst and data engineer, with SQL being his main tool of choice. At a high level, Ergest talks about "solving business problems with technology."
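    To make the laziness point concrete, here's a hedged sketch (not from the episode) of how a whole data pull collapses into a single SELECT. It uses Python's built-in sqlite3 module; the table and data are made up for illustration.

```python
import sqlite3

# Set up a throwaway in-memory database with some fake order data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 100.0), (2, "west", 250.0), (3, "east", 75.0)])

# The entire "data pull" a framework might spend 20-30 lines on:
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 175.0), ('west', 250.0)]
```

    The aggregation, grouping, and sorting all live in one declarative statement, which is the shortcut Ergest is pointing at.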
    Source: dbt Labs

    I think it's important to emphasize this point, which many other guests on Dear Analyst have pointed out as well. Learning tools like Excel and SQL is great, but if you cannot communicate your findings and solve real business problems with these tools, then what's the point? I think data professionals get caught up with how to utilize a data tool's features when time should really be spent on what can be done to solve your customer's problems. I recently had a conversation with a technical program manager who had an opportunity to sit in on a few customer meetings with her sales team. She was amazed to learn about the actual problems her company's customers face every day. It gave her a new perspective on the backend infrastructure her team supports.

    Mining open source data with SQL

    Most of the projects Ergest works on are focused on business intelligence. For instance, he had to work on a project where the company wanted to build robust customer profiles. You typically want to see all these different aspects of a customer so you know how to best market to and retain the customer. From a data perspective, Ergest was writing SQL to create a single source of truth by cleaning marketing analytics (https://www.thekeycuts.com/dear-analyst-104-creating-a-single-source-of-truth-by-cleaning-marketing-analytics...

    48 min
  4. 24/06/2024

    Dear Analyst #130: What happens when we rely too much on Excel spreadsheets and shadow IT takes over?

    This is a replay of an episode from the Make Sense podcast with Lindsay Tabas. In the 1990s, large enterprises typically bought software in a top-down approach. IT teams would get Oracle software or Microsoft Office and get their entire organization to use the software. Since these tools are the default IT "blessed" tools, people start putting everything in these tools. This is why I think most people decide to push everything into Excel even though Excel is primarily meant for financial analysis. When it's already installed on your computer and everyone knows how to use it, Excel becomes the crutch that we turn to regardless of the use case.

    Source: xkcd

    Shadow IT and the swinging pendulum of SaaS tools vs. Excel spreadsheets

    In this episode, Lindsay Tabas and I talk about why large enterprises rely so much on Excel. This is part of a bigger movement of shadow IT and citizen development, where individuals build business-critical workflows without needing an engineer or developer to step in. We talk about the shift from the 1990s of big monolithic software platforms to the explosion of workplace SaaS tools going into the 2000s and 2010s. The pendulum keeps on swinging back and forth as the SaaS tool sprawl gets too wide for IT departments to handle. Despite the ebb and flow of teams having freedom to pick their own tools vs. IT shoving software down everyone's throats, we talk about why Excel will never die. We talk about how to get off the crutch of using Excel, and one of the strategies I mention is to have a curious mind and be willing to learn new tools. Every week we see new tools launched on Product Hunt that are supposed to replace some feature in Excel. These tools were borne out of the frustration that comes with trying to do something in Excel that Excel was not meant for. Nevertheless, you need to keep an open mind to see what these new tools are all about. You never know which one of these tools just might replace how you use Excel.
    Source: Not Boring by Packy McCormick

    Other Podcasts & Blog Posts

    No other podcasts or blog posts mentioned in this episode!

    45 min
  5. 18/06/2024

    Dear Analyst #129: How to scale self-serve analytics tools to thousands of users at Datadog with Jean-Mathieu Saponaro

    When your organization is small, a centralized data team can take care of all the internal data tooling, reporting, and requests for all departments. As the team grows from 100 to thousands of people, a centralized data team simply cannot handle the number of requests and doesn't have the domain knowledge of all the departments. Jean-Mathieu Saponaro (JM) has experienced this transformation at Datadog. He first joined Datadog in 2015 as a research engineer. He was part of the inaugural data analytics team, which now supports 6,000+ employees. In this episode, he discusses scaling a self-serve analytics tool, moving from ETL to ELT data pipelines, and structuring the data team in a hybrid data mesh model.

    Building a data catalog for data discovery

    According to JM, creating a data catalog is not that hard (when your organization is small). I've seen data catalogs done in a shared Google Doc where everyone knows what all the tables and columns mean. When the data warehouse grows to hundreds of tables, that's when you'll need a proper data cataloging solution to store all the metadata about your data assets. This is when you move to something like Excel (just kidding)! In all seriousness, a shared Google Sheet isn't a terrible solution if your data warehouse isn't that large and the data structure isn't very complicated.

    Source: North Shore Data Services

    JM discussed a few strategies that helped them scale their internal data discovery tool:

    Strong naming conventions

    A pretty common pattern for data warehouses containing "business" data is using dim and fact tables. All tables in the data warehouse have to be prepended with dim or fact so that it's clear what data is stored in the table. There are also consistent naming conventions for the properties in the table. Finally, the "display" name for the table should be closely related to the actual table name itself. For instance, if the table is dim_customers, the display name for the table would just be customers.
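    The naming convention described above can be sketched in a few lines of Python. This is an illustrative helper, not Datadog's actual code; the function and table names are assumptions.

```python
def display_name(table: str) -> str:
    """Strip a dim_/fact_ prefix to derive the user-facing display name."""
    for prefix in ("dim_", "fact_"):
        if table.startswith(prefix):
            return table[len(prefix):]
    return table  # already a display name; pass through unchanged

print(display_name("dim_customers"))  # customers
print(display_name("fact_invoices"))  # invoices
```

    Deriving display names mechanically like this is part of what makes the convention useful: the catalog never has to store a separate mapping between physical and display names.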
    Snowflake schema

    Another common pattern is using a snowflake schema to structure the relationship between tables. This structure makes it easy to do business intelligence (e.g. reports in Excel) later on.

    Source: Wikipedia

    Customizing the data discovery experience

    Datadog switched BI tools a few years ago so that the tool could be used by technical and non-technical users alike. They ended up going with Metabase because it didn't feel as "advanced" as Tableau. In terms of their data catalog, one of the key decisions going into picking a tool was being able to quickly answer the question: where do I start? Where do I go to learn about our customer data? Product data? This is where the discovery experience is important. JM said the entry point to their catalog is still just a list of 800+ tables, but they are working on a custom home page. JM's team thought about the classic build vs. buy decision for their data cataloging tool. Given the size of their organization, they went with building the tool internally. If the number of users were smaller, it would've been fine to go with an off-the-shelf SaaS tool. JM's team set a goal to build the tool in a few months, and it took them 3.5 months exactly. Building the tool internally also meant they could design and re-use custom UI components. This resulted in a consistent user experience for every step of the data discovery process.

    Should you migrate data pipelines from ETL to ELT?

    32 min
  6. 28/05/2024

    Dear Analyst #128: What is citizen development and how to build solutions with spreadsheets?

    This is a replay of an episode from the Citizen Development Live podcast with Neil Miller. Citizen development is a relatively new term I learned about a year or so ago. To me, it's using no-code tools at scale within a large enterprise. It's a term that covers the population of people who are not developers, programmers, or software engineers by trade but know how to build apps and workflows to accomplish business-critical tasks. This is the definition of a citizen developer from PMI (Project Management Institute):

    Low-code or no-code development is the creation of applications software using graphic user interfaces or minimal basic code instead of large strings of complex coding. This term is often used to describe citizen development processes and technology. Low-code and no-code technology provides visual drag-and-drop or point-and-click user interfaces, making them easy for anyone to use.

    Source: PMI

    In this conversation on the Citizen Development Live podcast, Neil and I discuss various spreadsheets I've built in the past, when to move beyond spreadsheets, and why citizen development is a growing trend within the enterprise. I referred to a talk I gave at the 2019 No-Code Conference where I spoke about building tools with spreadsheets (and why the spreadsheet is the real first no-code tool): https://www.youtube.com/watch?v=M1GAArkYfug

    Other Podcasts & Blog Posts

    No other podcasts or blog posts mentioned in this episode!

  7. 13/05/2024

    Dear Analyst #127: Spreadsheets vs. Jira: Which one is better for your team?

    I wasn't sure if this topic should be its own episode, but it's been on my mind ever since I came back from Atlassian Team '24 (Atlassian's annual conference). At the conference, I had the opportunity to meet with a few people who are just as interested in spreadsheets as I am. We talked specifically about how Jira can best work with spreadsheets (Excel or Google Sheets) and the different workflows that result from the combination of these two tools. It was fascinating to hear how company culture and old, ingrained ways of doing things lead to the usage of spreadsheets when Jira and its add-ons can accomplish 80-90% of the business need. This episode highlights some of the things we discussed at the conference and implications for the future for teams using Jira and spreadsheets.

    Source: Atlassian

    What is Jira?

    Since most people following the newsletter are data analysts, I thought it would be relevant to first share what Jira is. Most would say Jira is issue-tracking software used by engineering and product teams to track software projects. The software aims to mirror agile and scrum methodologies for accomplishing tasks versus traditional waterfall techniques. The rituals behind agile and scrum are codified in Jira's features, so that's why the software is loved by thousands of engineering teams around the world. This is a good video from Atlassian on what a scrum project in Jira looks like. Near the end, you'll see a backlog of tasks. The backlog is one of the most foundational principles of the scrum methodology and will serve as the launching pad for this discussion on Jira and spreadsheets. https://www.youtube.com/watch?v=SOVGEsV5O9A

    Why do teams export Jira issues to Excel spreadsheets?

    One theme for why teams would want to export Jira issues into spreadsheets is reporting. We also talked about using other tools like Power BI for reporting purposes, but the intermediary step between Jira and Power BI is still a CSV export.
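    To make the CSV-export workflow concrete, here's a hypothetical sketch (not from the conversation) of the kind of summary a business user would otherwise build in Excel, using only Python's standard library. The column names and issue keys are made up to resemble a typical Jira backlog export.

```python
import csv
import io
from collections import Counter

# A tiny stand-in for a Jira backlog exported to CSV (columns are illustrative).
export = """Key,Summary,Status,Story Points
PROJ-1,Login page,Done,3
PROJ-2,Search API,In Progress,5
PROJ-3,Billing fix,Done,2
"""

rows = list(csv.DictReader(io.StringIO(export)))

# The pivot-table-style rollups a reporting user typically wants:
by_status = Counter(r["Status"] for r in rows)
points = sum(int(r["Story Points"]) for r in rows)
print(dict(by_status))  # {'Done': 2, 'In Progress': 1}
print(points)           # 10
```

    In practice this step usually happens inside Excel with a PivotTable, but the shape of the work is the same: the backlog export becomes just another table to aggregate.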
    There are built-in reporting and charting capabilities in Jira. There are also a plethora of add-ons in the Atlassian marketplace for custom charts. The issue with the add-ons is that they can get quite costly since you are paying on a per-seat basis. So even if the Jira admin is the one creating the charts, you still have to pay for the other Jira users who are simply viewing the charts. The charting add-on below is one of the most popular add-ons for Jira with 10,000+ downloads. Looks a bit like Excel, no?

    Source: eazyBI

    Digging a little deeper, we also discussed how the Jira backlog is kind of like a datastore for what the product and eng teams are working on. You can think of this almost like another table of data in your data warehouse. What does this mean for a regular business user who doesn't work on the eng or product team and still needs the data? Traditionally, they would write a SQL query to get the data they need, do their analysis, and call it a day. With Jira, they would need the Jira admin to export the backlog to a CSV, and then they can go off into Excel and do their custom reporting, PivotTables, and dashboarding to show how the product and eng team's work aligns with the rest of the work of the company. Story points,

    37 min
  8. 15/04/2024

    Dear Analyst #126: How to do data storytelling and create amazing data visualizations with Amanda Makulec

    With an undergraduate degree in zoology and a master's in public health, you wouldn't expect Amanda Makulec to lead a successful career in data analytics and data visualization. As we've seen with multiple guests on the podcast, the path to a career in data analytics is windy and unexpected. It was the intersection of public health and data visualization that got Amanda interested in data visualization as a career. In one of her roles, Amanda was supporting USAID by analyzing open data sets and creating charts and graphs for publishing content. Her team consisted of graphic designers and developers. Designers would take her charts from Excel, add more color, and add text to the chart. Amanda found that large enterprises were facing the same challenges as the organizations she was supporting in public health (and enterprises have more money to throw at this problem). Thus began Amanda's career in data viz.

    How do you tell a data story?

    We've talked a lot about data storytelling on this podcast. If there is one person who can crisply define what data storytelling is, it would be Amanda. This is Amanda's definition according to this blog post:

    Finding creative ways to weave together numbers, charts, and context in a meaningful narrative to help someone understand or communicate a complex topic.

    We talked a bit about how data storytelling can mean different things to different people (this blog post in Nightingale talks more about this). You might work with a business partner or client who says they want a data story, but all they really want is just an interactive dashboard with a filter.
    Amanda cites Robert Kosara's 2014 definition of data storytelling as one of her favorites:

    * ties facts together: there is a reason why this particular collection of facts is in this story, and the story gives you that reason
    * provides a narrative path through those facts: guides the viewer/reader through the world, rather than just throwing them in there
    * presents a particular interpretation of those facts: a story is always a particular path through a world, so it favors one way of seeing things over all others

    Amanda stresses the 3rd bullet point as the most important part of data storytelling. If the audience has to walk away with one analytics fact from the story, what is the fact you want to get across?

    Source: Effective Data Storytelling

    Getting feedback on your data stories and visualizations

    One point Amanda brought up during the conversation that I think is worth highlighting is feedback. After you've published or launched an analysis, dashboard, or data story, you rarely get feedback on how effective the product was at telling a story. You might get some qualitative feedback, like the dashboard answers specific questions or the findings are "interesting." But was the visualization actually effective at telling a story? Amanda likes to ask people what they like and don't like about her data stories and visualizations. Often people will get frustrated because the key takeaway from the data story is simply counter to what they believe. This leads them to question the validity of the data source. But you as the storyteller are simply conveying the signal from t...

    45 min
  9. 25/03/2024

    Dear Analyst #125: How to identify Taylor Swift’s most underrated songs using data with Andrew Firriolo

    Sometimes pop culture and data analysis meet, and the result is something interesting, thought-provoking, and of course controversial. How can one use data to prove definitively which Taylor Swift songs are the most underrated? Isn't this a question for your heart to answer? Andrew Firriolo sought to answer this question over the last few months, and the results are interesting (if you're a Taylor Swift fan). As a Swiftie (the moniker for Taylor Swift fans) since 2006, Andrew wanted to find a way to bridge his passions for Taylor Swift and data analysis. He's currently a senior data analyst at Buzzfeed, and published his findings on Buzzfeed to much reaction from the Swiftie community. In the words of Taylor Swift, Andrew's methodology and analysis just "hits different."

    From comp sci to data analytics

    Andrew studied computer science at New Jersey Institute of Technology but realized he liked the math parts of his degree over the engineering parts. Like many guests on this podcast, he made a transition to data analytics. Interestingly, it wasn't a job that propelled him into the world of data analytics, but rather going to graduate school at Georgia Institute of Technology (Georgia Tech). Georgia Tech has some really affordable online technical programs, including data analytics. After getting his master's degree, he worked at Rolling Stone as a data analyst. This is the beginning of Andrew's exploration into the Spotify API to see the data behind music. You can see some of the articles Andrew published while at Rolling Stone here.

    Source: Pocketmags

    After Rolling Stone, Andrew landed his current role at Buzzfeed building internal dashboards and doing internal analysis. In both of his roles, he talks about using a lot of SQL and R. A big part of his job is explaining the analyses he's doing to his colleagues. This is where the data storytelling aspect of a data analyst's job comes into play.
    I call this the "soft" side of analytics, but some would argue that it's the most important part of a data analyst's job. In most data analyst roles you aren't just sitting at your desk writing SQL queries and building Excel models. You're a business partner with other people in the organization, and communication skills are more important than technical skills.

    Answering a Taylor Swift question with data

    Andrew became a Taylor Swift fan through his sister in 2006. They both listened to the world premiere of Taylor's first album. Given his background in data, Andrew decided to answer a question about Taylor Swift that had been on his mind for a while: what are Taylor Swift's most underrated songs? To read Andrew's full article, go to this Buzzfeed post. Andrew's hypothesis was that there's a way to use data to prove which songs in Taylor's discography are most underrated. Classifying something as "underrated" is usually a decision you make with your gut. But it's always interesting to see the data (and the methodology) for determining if something is truly "underrated."

    Multiple iterations in song streaming analysis ...

    37 min
  10. 05/02/2024

    Dear Analyst #124: Navigating people, politics and analytics solutions at large companies with Alex Kolokolov

    We sometimes forget that a large organization is composed of groups and divisions. Within these groups, there are teams and individuals looking to advance their careers, sometimes at the expense of others. When your advancement depends on the success of your project, the benefits of that project to your company may be suspect, and the tools you use to complete that project may not be the best tools for the job. Alex Kolokolov started his journey in data like many of us: in Excel. He moved on to Power BI, Power Pivot, and Power Query, and has been building data visualizations for the last 15 years. In this episode, he talks through consulting with a company as the analytics expert only to find out that the underlying forces at play were company politics. He also discusses strategies to make your line charts tell a better data story.

    The state of analytics at companies in traditional industries

    Alex consults with large companies in "traditional" industries like oil, gas, and mining. The state of analytics and knowledge of analytics is not equal in these companies, according to Alex. You'll come across data science and AI groups at these companies who are, indeed, working on the cutting edge. But when you approach other departments like HR or operations, they are still quite far from this digital transformation that everyone's talking about. Alex worked with a mining company where there are cameras that can ID employees using facial recognition when they walk through the door. But when you sit down with the folks who are actually doing the work at the plant, they are still humming along on Excel 2010. Excel 2010! What a time...

    Source: dummies.com

    In terms of creating dashboards, teams from these companies would consult their IT or tech team to create a report. But then the IT team comes back and says it will take three months to create the report given their current backlog.
    Hence the reason these companies outsource the analytics training, metrics collection, and dashboarding to people like Alex.

    Internal battles for power and platforms

    Alex once worked with a government institution that was building an internal SQL data warehouse before Power BI came on the scene. This specific project was driven by IT as a warehouse solution for the finance department. A few years later, the head of this SQL project became the CIO but started getting pushback from the heads of the finance department. It turns out the finance department heads already had their own platform in mind and claimed Microsoft's technology was outdated for their purposes (the finance team wanted to go with Tableau to build out pretty dashboards).

    Source: reddit.com

    The finance department proceeded to roll out their solution in Tableau, and the CFO eventually became the Chief Digital Officer and pushed out the CIO who was spearheading the SQL project. The project wasn't about Microsoft vs. Tableau at all. It was all about who was better at playing the game of internal politics and fighting for the resources to get your project across the line.

    When digital transformation is 10 years too late

    Large companies Alex has worked with claimed they went through "digital transformation," but this was back in 2012. When Alex started working with these companies over the last few years, he found that individuals were still using SAP and Excel 2010. It's as if the digital transformation didn't go past 2012,

    38 min
3.8 out of 5 (5 ratings)

