10 episodes

data_beats is committed to demystifying the data space for everybody! We believe data is not just for analysts, engineers, and scientists but for everybody to understand.

Learn about broad data topics very quickly as expert practitioners and founders of data companies candidly answer some hard questions.

And while you're at it, enjoy some drum beats!

databeats.community

data_beats podcast Arpit Choudhury

    • Technology

    [db] Demystifying Data Observability - Part 1

    Data Observability is an established category but the tools that fall under this category don't necessarily have the same capabilities or even solve the same problems.

    There are infrastructure monitoring tools, pipeline monitoring tools, and tools to monitor the actual data that rests in a database/warehouse/lake. And then there are data testing tools and tools to understand data lineage.

    In this episode, Kevin Hu, the CEO of Metaplane, provides clear, concise answers to these questions, and he does it with a big smile, making it all sound quite simple.

    But that's not all.

    Kevin is a brilliant mind, so we also got him to share some advice for companies looking to invest in data observability efforts.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Welcome to data_beats. Today, Kevin Hu, the CEO of Metaplane, is here to answer some questions about Data Observability. So here we go. Hey Kevin, thanks for joining us.

    Totally my pleasure, Arpit. I see your face and your writing all over the datasphere nowadays, and it's such a pleasure to talk with you and learn that you're a musician. I don't have that talent, but it's nice to meet someone who does.

    Haha, thank you so much, Kevin. Let's jump right in. First question for you, Kevin.
    Q. Please tell us what exactly data observability is.

    A. Data observability is the degree of visibility you have into your data systems. And this visibility helps address many use cases, from detecting data issues to understanding the impact of those issues and diagnosing their root cause.

    Q. There's a fair bit of confusion in the data observability space as there are many tools with varying capabilities. So let's try to address that. Can you first describe what data infrastructure monitoring is?

    A. Infrastructure monitoring is a space that emerged decades ago, but really came to the fore around 10 years ago with the emergence of the cloud, like Amazon Web Services. So tools like Datadog and Splunk and New Relic help you understand whether your infrastructure is healthy. For example, how much free storage you have in your database, the median response times of your API, or the RAM usage of an EC2 instance. And this is really critical for software teams, especially as they deploy more and more of their resources into the cloud.
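
    To make that concrete, here's a minimal sketch (in Python, with invented sample numbers and thresholds) of the kind of signals an infrastructure monitoring tool watches and alerts on; tools like Datadog collect these metrics automatically rather than by hand like this:

    import shutil
    import statistics

    # Pretend response-time samples (in milliseconds) scraped from an API.
    response_times_ms = [112, 98, 143, 870, 105, 121]

    median_ms = statistics.median(response_times_ms)
    free_gb = shutil.disk_usage("/").free / 1e9  # free disk space on this machine

    alerts = []
    if median_ms > 500:
        alerts.append(f"median response time {median_ms} ms exceeds 500 ms")
    if free_gb < 5:
        alerts.append(f"only {free_gb:.1f} GB of free disk space left")

    print(alerts or "infrastructure looks healthy")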

    Q. And now can you explain what data pipeline monitoring is?

    A. Pipelines, to put it simply, take A, turn it into B, and put it into C. And this is used across the data system, whether it's using Airflow to pull data from a first-party system into your data warehouse, to transform data within a data warehouse, or even to prepare features for machine learning. And data pipeline monitoring, on the first level, is trying to understand: are my jobs running? This is a surprisingly hard question to answer sometimes. But the level-two question is: are my jobs running correctly? As I take A, turn it into B, and put it into C, is A what I expect, is B what I expect, and was it loaded into C correctly?
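
    As a rough illustration of those two levels, here's a minimal Python sketch of an A-to-B-to-C pipeline with checks added; extract_a, transform_to_b, and load_into_c are hypothetical stand-ins for real pipeline steps, not any particular tool's API:

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline_monitor")

    def extract_a():
        # Pretend extract; in practice this would pull from an API or a database.
        return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 13.5}]

    def transform_to_b(rows):
        # Pretend transform: stamp each row with a processing time.
        now = datetime.now(timezone.utc).isoformat()
        return [{**row, "processed_at": now} for row in rows]

    def load_into_c(rows):
        # Pretend load; return how many rows the destination accepted.
        return len(rows)

    def run_pipeline():
        a = extract_a()
        # Level two: is A what I expect?
        assert a, "extract returned no rows"
        assert all("id" in row and "amount" in row for row in a), "extract rows missing columns"

        b = transform_to_b(a)
        # Level two: is B what I expect?
        assert len(b) == len(a), "transform dropped rows"

        loaded = load_into_c(b)
        # Level two: was it loaded into C correctly?
        assert loaded == len(b), "destination accepted fewer rows than were sent"
        log.info("pipeline succeeded: %d rows loaded", loaded)

    if __name__ == "__main__":
        try:
            run_pipeline()  # level one: did the job run at all?
        except Exception:
            log.exception("pipeline failed")  # the signal a monitor would alert on
            raise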

    Q. You made it sound very simple. What about monitoring the actual data in the warehouse? How would you describe that?

    A. So cloud data warehouses like Snowflake, Redshift, and BigQuery are increasingly the center of gravity of data within companies. To put it more simply, it's where you put everything. And a lot of applications, whether it's a BI tool like Looker, a reverse-ETL tool, or a machine learning model, are kind of mounted on top of the warehouse. So data warehouse monitoring tries to understand whether the data within the warehouse that is used by all these systems is correct.
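
    Here's a minimal sketch of what monitoring the data itself can look like: compute a couple of simple metrics (row count, freshness) on a table and flag anything that looks wrong. It uses sqlite3 so it runs anywhere; a real setup would run equivalent queries against Snowflake, Redshift, or BigQuery, and the table and thresholds here are made up:

    import sqlite3
    from datetime import datetime, timedelta, timezone

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")
    conn.execute(
        "INSERT INTO orders VALUES (1, ?), (2, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            (datetime.now(timezone.utc) - timedelta(hours=3)).isoformat(),
        ),
    )

    def check_orders(conn, min_rows=1, max_staleness_hours=24):
        row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        latest = conn.execute("SELECT MAX(created_at) FROM orders").fetchone()[0]
        staleness = datetime.now(timezone.utc) - datetime.fromisoformat(latest)

        issues = []
        if row_count < min_rows:
            issues.append(f"row count {row_count} is below the expected minimum {min_rows}")
        if staleness > timedelta(hours=max_staleness_hours):
            issues.append(f"table is stale: newest row is {staleness} old")
        return issues

    print(check_orders(conn) or "orders table looks healthy")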

    Q. So now some observability tools also offer data cataloging and data lineage capabilities. Can you explain those briefly?

    A. Data cataloging tries to address the question: what does this data mean? And there is a gap between how

    • 11 min
    [db] Building and Using Data Infrastructure

    Building data infrastructure is one thing — and a fun thing for those building it — but getting teams across an org to use and derive value from data is an arduous journey.

    David Jayatillake (who was at Avora and is now at Metaplane) has a ton of experience building data infra as well as figuring out how to get folks to use data in their day-to-day. In this episode, David answers some fundamental questions like:

    What are the core components of a well-executed data infrastructure?

    What are the prerequisites in terms of the tech stack to set up a basic data infrastructure?

    How do data-adjacent teams like Product and Growth make use of and derive value from good data infra?

    And he also offers some advice for companies getting started on their data journey.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Q. Please tell us what it means to build data infrastructure in the context of the modern data landscape.

    A. I think there's a typical stack that's understood as infrastructure: being able to ELT into a data warehouse, a BI tool on top, and then, increasingly, additional non-core pieces like reverse ETL, observability, CDPs, and streaming tools that are being added to this infrastructure.

    Q. So what are the core components of a well-executed data infrastructure?

    A. So I think well-executed comes down to things that ensure quality and things that ensure reliability, especially around the development process. Being able to use version control and CI/CD in your development process is what really enables what most people would consider well-executed. And that's where frameworks like dbt come in, which have enabled that kind of development on top of data pipelines flowing into and running on the data warehouse. We're looking for more tools like that to spread further out, and for dbt to take a bigger footprint as well, to push that quality outwards.
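
    To make the CI/CD point concrete, here's a minimal sketch of a data test script that a CI job could run, failing the build if basic checks (uniqueness, not-null) fail; dbt packages the same idea as declarative tests, and the table here is an invented example using sqlite3 so the script is self-contained:

    import sqlite3
    import sys

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customers (customer_id INTEGER, email TEXT)")
    conn.execute("INSERT INTO dim_customers VALUES (1, 'a@example.com'), (2, 'b@example.com')")

    # Each test query counts "bad" rows; zero means the test passes.
    TESTS = {
        "customer_id is unique":
            "SELECT COUNT(*) FROM (SELECT customer_id FROM dim_customers "
            "GROUP BY customer_id HAVING COUNT(*) > 1)",
        "email is not null":
            "SELECT COUNT(*) FROM dim_customers WHERE email IS NULL",
    }

    failures = []
    for name, query in TESTS.items():
        bad_rows = conn.execute(query).fetchone()[0]
        if bad_rows:
            failures.append(f"{name}: {bad_rows} offending rows")

    if failures:
        print("\n".join(failures))
        sys.exit(1)  # a non-zero exit code fails the CI pipeline
    print("all data tests passed")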

    Q. So if we talk about building a basic data infrastructure, like a minimum viable data stack, if you will, what would be the two or three tools that would comprise it?

    A. Sure, so it depends on your context. For some companies, if you have to work with a lot of third-party tools, you definitely need an ELT tool like Fivetran, Airbyte, or Gravity Data. You need one of those kinds of tools to help you get data from those third-party systems into your data warehouse. Obviously then, you need a data warehouse.

    I personally think, for a minimum viable data stack, you want a data warehouse that's quite easy to use and that scales without much thought and planning. You don't want to need a DBA. So I think Snowflake and BigQuery are the two easiest to use. BigQuery is possibly even easier for a smaller startup. It's basically "use it and forget about it," with nothing to do.

    Q. Yeah, that makes sense. So, David, can you tell us why there's been an explosion in data infrastructure tooling over the last couple of years?

    A. I think it's because it's a bit of a hangover from the big data era. In the big data era, in order to do any amount of data engineering, you'd have to hire a huge number of very expensive people. I've been at a company where, when I joined, we were on SQL Server as an analytics stack. And they planned to do a big migration to Hadoop on Hortonworks and it took years. They hired a data engineering team of 50 people, paying them a huge amount of money, and it actually failed; they didn't even succeed in this data project.
    So what we've realized, and venture capitalists have realized, is that the data engineering space is ripe for automation and for SaaS tooling, and that's more or less achieved now. If you think of Fivetran, especially from a batch poin

    • 14 min
    [db] Warehouse-native Apps - Part 3

    Once again, what exactly is a warehouse-native app?

    This time, hear it from George Xing, who is building Supergrain, a warehouse-native engagement tool. As someone who helped build the analytics function at Lyft, George has a deep understanding of the priorities and constraints of data teams.

    In this episode of Data Beats, George explains how the warehouse-native architecture is superior to its predecessor and what benefits organizations can reap by adopting tools that are built on top of the data warehouse.

    George also has some advice for organizations looking to get started with a warehouse-native app.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Q. George, why don't you tell us what a warehouse-native app is?

    A. Yeah, so warehouse-native apps are business applications that run on top of the customer's cloud data warehouse, something like Snowflake or Redshift or BigQuery, and rely on that piece of infrastructure as the source of truth for customer data. This is different from managed applications, or the traditional way of building SaaS, where the vendor will store a copy of the customer data on their own system and manage it that way. One of the other advantages is that you don't have to be tied to a fixed schema, so you can leverage the business relationships that the customer has already defined in their cloud data warehouse. Warehouse-native apps are schema-agnostic, and they also remove the need for pipelines: they move data in and out of the cloud data warehouse without additional ETL tools.
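
    As a rough sketch of that architecture (not Supergrain's actual code), here's the pattern in Python: the app opens a standard database connection to the customer's own warehouse and runs a customer-supplied query against their existing schema instead of ingesting a copy. sqlite3 stands in for Snowflake, Redshift, or BigQuery, and the table and query are invented:

    import sqlite3

    # The customer's warehouse, with whatever schema they already have.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE crm_contacts (contact_email TEXT, lifecycle TEXT)")
    warehouse.execute(
        "INSERT INTO crm_contacts VALUES ('a@example.com', 'customer'), ('b@example.com', 'lead')"
    )

    class WarehouseNativeApp:
        """The vendor ships logic, not storage: queries run in the customer's warehouse."""

        def __init__(self, connection, audience_query):
            self.connection = connection
            # Schema-agnostic: the customer supplies a query mapping their tables
            # to the fields the app expects, rather than conforming to a fixed schema.
            self.audience_query = audience_query

        def build_audience(self):
            return [row[0] for row in self.connection.execute(self.audience_query)]

    app = WarehouseNativeApp(
        warehouse,
        audience_query="SELECT contact_email FROM crm_contacts WHERE lifecycle = 'customer'",
    )
    print(app.build_audience())  # ['a@example.com']; no copy of the data ever leaves the warehouse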

    Q. So, what is leading to such a shift in the way SaaS tools are being built? You mentioned some of the benefits, which sort of answers the question, but what else?

    A. One of the things that we see is that, more and more, SaaS products are relying on data as a core differentiator. So in our space, in marketing, marketing ROI is really driven by the use of customer data. And the other big trend is that customer data is getting centralized in cloud data warehouses, because that's where you can see all your touch points, that's where your source of truth is. And so, in order to connect those two, the obvious architecture is to move the software to the data that is housed in the source of truth.

    Q. Got it. And you mentioned that with warehouse-native apps, you don't have to replicate your data. So obviously that brings cost savings. But besides that, what are the other core benefits of warehouse-native apps over a traditional managed app?

    A. Yeah, there are so many, but I think some of the ones that stand out, one is just speed to set up. One of the biggest challenges of working with a data-intensive application today is that you have to first send all your customer data to the vendor in order to just get started or have any value, whereas with a warehouse-native app, your data's already there and you just connect the application directly to your data via a standard database connection. And so it simplifies and speeds up getting set up, from sometimes months to days. The other piece of this is just having more access for personalization, as I mentioned. If your warehouse is the source of truth for all your data, then you also have more data, richer data there. You can leverage a lot more of those data points for personalization. In the case of Supergrain, that would be targeting for emails, that would be more personalized messages. And obviously, that drives more business ROI. And then I think the third piece of this gets at the pipelines. Normally you have to move data from your vendor back into your warehouse for analysis and reporting and BI tools. Warehouse-native apps simplify that because they write directly back to the warehouse, meaning the reporting is both more comp

    • 11 min
    [db] Warehouse-native Apps - Part 2

    What is a warehouse-native app and what are its benefits over a managed app?

    Luke Ambrosetti (who was at MessageGears and is now at Snowflake) not only has first-hand experience, but is also deeply knowledgeable and passionate about the warehouse-native architecture. I learned a bunch from his answers and am pretty sure you will too.

    And if you're about to get started with a data warehouse, Luke has some useful tips for you.
    Or learn more about the middleware and the separation of the system of record and the system of engagement.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Q. Please tell us what a warehouse-native app is.

    A. Yeah, so a warehouse-native app is effectively a SaaS application that is doing some data-intensive process for you and connects directly to the data warehouse.

    Q. All right. So are connected apps or data apps the same as warehouse-native apps?

    A. That's a good question. Kind of; it depends on who you ask, right? The definitions of a connected app and a data app can be very different to different people. In my view, connected app is a good description of a warehouse-native SaaS app. A data app could be a SaaS, it could be that thing too, but it's maybe a larger term that also incorporates, say, an internal app that you create at the company you work for. It doesn't have to be some sort of external tool that you buy or use.

    Q. Oh yeah, that makes sense. So please tell us what is leading to such a paradigm shift in the way B2B SaaS tools are being built.

    A. Yeah, and it's very exciting, right? Lots of new companies are taking this new approach. It's called warehouse-native, warehouse-centric, warehouse-first; there are lots of terms for it. But honestly, it comes from this idea of the separation of compute and storage. And specifically, in the industry that I work in, which is marketing, there's an idea I've seen explained very well: the separation of the system of engagement, which is how you're reaching out to, in marketing's case, your customers, from the system of record, which holds the state of that customer: what they're doing, where they are, the Customer 360 as it's called. Traditionally, you've had to get your data to your system, whatever your system of engagement is. And companies like Salesforce have traditionally tried to be both the system of record and the system of engagement. Now, with the separation of compute and storage, you can have the system of engagement and the system of record separated as well.

    Q. So besides cost savings, what are some key benefits of using a warehouse-native engagement tool over a traditional one?

    A. Yeah, so again, the cost savings aspect here is the idea that you don't have to sync that data to whatever your other platform is, whether that's engagement in marketing's case or maybe analytics; you don't have to sync the data there, whatever it might be. That's where the cost savings come from, but there are so many other benefits as well. You have fewer data silos when you do it this way, instead of shipping your data out to 10 different SaaS companies. If all 10 could connect to your data warehouse and use that system of record, you

    • 12 min
    [db] Warehouse-native (Connected) Apps - Part 1

    What is a connected app or a warehouse-native app? How is it different from a managed app? And what is leading to such a paradigm shift in the way B2B SaaS tools are built?

    Omer Singer has an insider view of how the connected app paradigm is taking shape and is here to answer those questions.

    Oh and we also got him to share some tips on how to evaluate vendors of connected apps.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Q. Please tell us what exactly a connected app is.

    A. Yeah, a connected app is a SaaS solution that lets the customer bring their data platform of choice, and so the vendor is bringing the work to the data.

    Q. Are connected apps the same as warehouse-native apps? Why are connected apps also being referred to as warehouse-native apps?

    A. Yeah, yeah, it's talking about the same thing, and I think it's this idea that there's been so much progress in data warehouse technology that what's possible from the SaaS integration perspective has changed. From a Snowflake perspective, we're calling it connected applications to differentiate from managed applications. Traditional SaaS solutions always cared about data.
    If you look at maybe the first really famous SaaS solution, Salesforce, it has a very significant database under the hood, and all that data about customer opportunities and so on is stored within Salesforce. That model we call the managed application.
    Now, to differentiate from that, we have the connected application, and that's where the SaaS solution says, "Look, we're gonna focus on the app, and for customers that want to, they can connect us to their existing data warehouse or data platform and we will use that." It's an exciting space. We're seeing more and more companies embracing that, and for our customers too, this is becoming a direction of choice. Customers are really preferring this model and it's helping them to be more successful.

    Q. You've already answered my next question, what is the difference between the connected app paradigm and its traditional counterpart, managed apps, so that's great. Why don't you tell us what is leading to such a paradigm shift in the way B2B tools are built?

    A. Yeah, I think this shift couldn't have happened a few years ago before all the progress in cloud data platform technology. It used to be that applications needed a backend that would handle the data and different structures, et cetera, and they had, of course, demands on how reliable it would be, how powerful it would be, so they needed to own it and to be responsible for that back end, end to end. They couldn't count on whatever database technology the customer was using 'cause maybe customers don't have the same power in their data platform and the same reliability in their data platform.
    I focus on the cybersecurity space, and security teams often would use technologies like Elasticsearch to collect a bunch of log data, and a lot of vendors use Elasticsearch under the hood of their application too. But the vendor couldn't count on the customer's Elasticsearch cluster to be 24/7 available, to be scalable, and to be fast enough to handle large amounts of data, so the vendor had to own the data on their side, and maybe they exposed it to the customer through an API so the customer could get access to some of their data. With advances in the cloud data platform, I think what Snowflake has really pioneered is a cloud data platform that is very robust and consistently powerful and reliable, basically putting the vendors and their customers on the same footing.
    Now for the first time, the customer can bring their SaaS solution to the data platform, and in that way, they avoid a silo, avoid having the vendor own the data and then trying to get, the customer getting the data through th

    • 15 min
    [db] Data Automation

    What is data automation?

    How is it different from iPaaS and workflow automation?

    What is data unification and how is that different from identity resolution?

    Nick Bonfiglio from Syncari is here to tell us and answer some related questions.

    Enjoy the beats! 🥁

    You can also subscribe to the show on Spotify or Apple Podcasts.

    Prefer reading? Here you go:

    Q. Please tell us what exactly data automation is.

    A. Yeah, that's a great question, because we sometimes get asked that. When we started the company, we were trying to get a category going, and data automation is essentially what we're doing. Originally, data automation was a data science concept to describe processing, normalizing, and handling data with automated techniques. So until Syncari, multi-directional data automation was, well, available to anyone, but if you were trying to do it, it would take a lot of coders and data scientists. Syncari is actually bringing these capabilities, plus some MDM and CDP capabilities that we borrowed, to business users with a no-code platform. In our case, it's targeted at go-to-market engines, going all the way from leads to billings.

    Q. Cool. Yeah, that's interesting. So is data automation the future of iPaaS?

    A. So the way I would answer that is, point-blank, data automation is not integration. Rather, it's a way to generate a 360-degree unified view from all your systems, and then orchestrate the go-to-market processes across your business, again, from leads to billings. The difference with Syncari is that this distributed 360 view of your data is not centralized, and it allows you to curate and share that with all your teams. And that's a very unique position versus just simple point-to-point integrations that move data back and forth, whether they happen to be ETL point-to-points, automation point-to-points, or data point-to-points. It's very different. The other big difference that Syncari has over iPaaS is that these are stateful, multi-directional synchronizations. Now, everybody uses the word sync these days, but, truly, sync is to have data be identical in more than one place at exactly the same time. So point-to-point connectivity solutions, like iPaaS, just can't touch the data at the level that we do.
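
    Here's a minimal sketch of the stateful, multi-directional idea, as opposed to a one-way point-to-point copy: build a unified view of every record across the connected systems, let the most recently updated version win, and write it back to every side so the data ends up identical everywhere. The systems and record shapes are made up for illustration:

    # Two pretend systems holding overlapping records, keyed by a shared id.
    crm = {
        "lead-1": {"email": "a@example.com", "stage": "mql", "updated_at": 100},
    }
    billing = {
        "lead-1": {"email": "a@example.com", "stage": "customer", "updated_at": 250},
        "lead-2": {"email": "b@example.com", "stage": "trial", "updated_at": 90},
    }

    def sync(systems):
        # Stateful: build a unified view of every record across all systems ...
        unified = {}
        for system in systems:
            for key, record in system.items():
                current = unified.get(key)
                if current is None or record["updated_at"] > current["updated_at"]:
                    unified[key] = record
        # ... then distribute it multi-directionally so the data is identical everywhere.
        for system in systems:
            for key, record in unified.items():
                system[key] = dict(record)
        return unified

    sync([crm, billing])
    assert crm == billing  # both systems now hold the same, latest state
    print(crm)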

    Q. Okay, so then, how is data automation different from workflow automation? You might have answered part of this already, but can you explain that further?

    A. If you think about how workflow automations work today in most systems, they are triggered and have access to transitory data across two endpoints. And that really minimizes your flexibility, whether it's an ETL job, an integration, or whatever it happens to be; it's transitory data and it limits your flexibility in being able to create automations. And so what data automation, and especially Syncari, does is take a unified data model view of all the connected systems. And you can use all of your data to orchestrate cross-system and cross-object automations. The best way to think of it is: if I want to look up what's going on with an account based on a contact or lead that just came in, it's incredibly difficult or impossible to do with today's workflow automation platforms. So the largest difference is it can also transform data, normalize it, enrich it, calculate it, dedupe it, all centrally, and then distribute that end result to all the connected systems at the same time. And so this ability to keep all your systems in sync, in near real-time, across all touch points is really what Syncari does.
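
    And here's a minimal sketch of that cross-object lookup: an incoming lead is normalized centrally, joined to its account in a unified model keyed by email domain, and the enriched result is what would then be distributed back to the connected systems. All names, fields, and rules are invented for illustration:

    # A unified model of accounts, keyed by email domain.
    accounts = {
        "example.com": {"account_id": "acct-42", "tier": "enterprise", "arr": 120_000},
    }

    def normalize_lead(raw_lead):
        # Central normalization: lowercase the email, title-case the name.
        return {
            "email": raw_lead["email"].strip().lower(),
            "name": raw_lead["name"].strip().title(),
        }

    def enrich_with_account(lead, accounts):
        # Cross-object lookup: join the lead to its account by email domain,
        # the kind of hop that is awkward in trigger-based workflow tools.
        domain = lead["email"].split("@")[-1]
        return {**lead, "account": accounts.get(domain)}

    incoming = {"email": "  Jane.Doe@Example.com ", "name": "jane doe"}
    lead = enrich_with_account(normalize_lead(incoming), accounts)
    print(lead)
    # {'email': 'jane.doe@example.com', 'name': 'Jane Doe',
    #  'account': {'account_id': 'acct-42', 'tier': 'enterprise', 'arr': 120000}}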

    Q. So would you say data automation is a team sport, since it involves so many different teams?

    A. That's a great question. We run into t

    • 12 min
