11 min

Demystifying Data Observability w/ Kevin from Metaplane
databeats

Data Observability is an established category, but the tools that fall under it don't necessarily have the same capabilities or even solve the same problems. There are infrastructure monitoring tools, pipeline monitoring tools, and tools to monitor the actual data that rests in a database/warehouse/lake. And then there are data testing tools and tools to understand data lineage.

In this episode, Kevin Hu makes it sound all too simple, and he does it with a big smile. But that's not all: Kevin is a brilliant mind, so we also got him to share some advice for companies looking to invest in data observability efforts.

Let’s dive in:

Q. Please tell us: what exactly is data observability?

A. Data observability is the degree of visibility you have into your data systems. This visibility helps address many use cases, from detecting data issues to understanding the impact of those issues or diagnosing their root cause.

There's a fair bit of confusion in the data observability space, as there are many tools with varying capabilities. So let's try to address that.

Q. Can you first describe what data infrastructure monitoring is?

A. Infrastructure monitoring is a space that emerged decades ago but really came to the fore around 10 years ago with the emergence of the cloud, like Amazon Web Services. Tools like Datadog, Splunk, and New Relic help you understand whether your infrastructure is healthy: for example, how much free storage you have in your database, the median response times of your API, or the RAM usage of an EC2 instance. This is really critical for software teams, especially as they deploy more and more of their resources into the cloud. (A minimal sketch of this kind of health check appears after the interview.)

Q. And can you explain what data pipeline monitoring is?

A. Pipelines, to put it simply, take A, turn it into B, and put it into C. This pattern is used across the data system, whether it's using Airflow to pull data from a first-party system into your data warehouse, transforming data within the warehouse, or even preparing features for machine learning. Data pipeline monitoring, on the first level, tries to understand: are my jobs running? This is a surprisingly hard question to answer sometimes. The level-two question is: are my jobs running correctly? As I take A, turn it into B, and put it into C, is A what I expect, is B what I expect, and was it loaded into C correctly? (The second sketch below makes these two levels concrete.)

You make it sound so simple!

Q. What about monitoring the actual data in the warehouse? How would you describe that?

A. Cloud data warehouses, like Snowflake, Redshift, and BigQuery, are increasingly the center of gravity of data within companies. To put it more simply, it's where you put everything. And a lot of applications, whether it's a BI tool like Looker, a reverse-ETL tool, or a machine learning model, are mounted on top of the warehouse. So data warehouse monitoring tries to understand whether the data within the warehouse that is used by all these systems is correct. (The third sketch below shows two common checks.)

Q. Some observability tools also offer data cataloging and data lineage capabilities. Can you explain those briefly?

A. Data cataloging addresses the question: what does this data mean? There is a gap between how data is represented in a technical system and the business objects it stands for. A data catalog is an easy way to attach semantic meaning to the objects within your data system: here's how a metric is derived, here's how a table is derived. So when the VP of Data asks you about this revenue metric, you point them to the data catalog as opposed to having to type out the answer.

Data lineage solves the problem of understanding how the data within your system relates to each other. If you trace data all the way back to the source, either a machine created it or a human put it in, but the end users of data rarely use that raw data. In some ways, the job of a data team is to turn that raw data into an analytics-ready form that can be used for many different purposes. Data lineage maps how that raw data flows into those downstream forms. (The last sketch below shows lineage as a simple graph.)
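To ground the infrastructure-monitoring answer, here is a minimal sketch of the kind of host health check that tools like Datadog, Splunk, and New Relic collect continuously and at much larger scale. It assumes the psutil library is installed; the thresholds and the alerting behavior are illustrative assumptions, not anything from the episode.

```python
# A toy infrastructure health check: free disk and RAM usage, two of
# the signals Kevin mentions. Thresholds are arbitrary, illustrative values.
import psutil

def check_host_health(min_free_disk_gb: float = 10.0,
                      max_ram_percent: float = 90.0) -> list[str]:
    """Return a list of human-readable warnings for this host."""
    warnings = []

    # How much free storage do we have?
    free_disk_gb = psutil.disk_usage("/").free / 1e9
    if free_disk_gb < min_free_disk_gb:
        warnings.append(f"Low disk: {free_disk_gb:.1f} GB free")

    # What is the RAM usage of this instance?
    ram_percent = psutil.virtual_memory().percent
    if ram_percent > max_ram_percent:
        warnings.append(f"High RAM usage: {ram_percent:.0f}%")

    return warnings

if __name__ == "__main__":
    for warning in check_host_health():
        print("ALERT:", warning)
```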
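Kevin's two levels of pipeline monitoring, "are my jobs running?" and "are my jobs running correctly?", can be made concrete with a small sketch. This is plain Python rather than an orchestrator like Airflow, and the data, field names, and checks are invented for illustration.

```python
# A toy "take A, turn it into B, put it into C" pipeline with the two
# levels of monitoring Kevin describes: (1) did the job run at all,
# and (2) did each step produce what we expect?
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract() -> list[dict]:                       # A
    return [{"user_id": 1, "amount": 42.0}, {"user_id": 2, "amount": 17.5}]

def transform(rows: list[dict]) -> list[dict]:     # A -> B
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

def load(rows: list[dict], target: list) -> None:  # B -> C
    target.extend(rows)

def run_pipeline(target: list) -> None:
    raw = extract()
    # Level 2: is A what we expect?
    assert len(raw) > 0, "extract returned no rows"
    assert all("amount" in r for r in raw), "missing 'amount' field"

    transformed = transform(raw)
    # Level 2: is B what we expect? Row counts should survive the step.
    assert len(transformed) == len(raw), "transform dropped rows"

    load(transformed, target)
    # Level 2: was everything loaded into C correctly?
    assert len(target) >= len(transformed), "load lost rows"
    log.info("pipeline succeeded: %d rows loaded", len(transformed))

if __name__ == "__main__":
    warehouse_table: list[dict] = []
    try:
        run_pipeline(warehouse_table)     # Level 1: did the job run at all?
    except Exception:
        log.exception("pipeline failed")  # the signal an on-call would page on
```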
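For warehouse monitoring, two of the most common checks are freshness (when was the table last updated?) and volume (does it have roughly the row count we expect?). The sketch below runs both against an in-memory SQLite database standing in for Snowflake, Redshift, or BigQuery; the table name, columns, and thresholds are all hypothetical.

```python
# A toy warehouse monitor: freshness and volume checks on a table,
# using in-memory SQLite as a stand-in for a cloud warehouse.
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
now = datetime.now(timezone.utc)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, (now - timedelta(hours=i)).isoformat()) for i in range(5)],
)

def check_freshness(max_age: timedelta) -> None:
    """Alert if the table hasn't been updated recently enough."""
    latest = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    age = now - datetime.fromisoformat(latest)
    if age > max_age:
        print(f"ALERT: orders is stale ({age} since last update)")

def check_volume(min_rows: int) -> None:
    """Alert if the table has suspiciously few rows."""
    count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    if count < min_rows:
        print(f"ALERT: orders has only {count} rows (expected >= {min_rows})")

check_freshness(max_age=timedelta(hours=6))
check_volume(min_rows=3)
```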
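Finally, data lineage is naturally represented as a graph from raw sources to analytics-ready models. This sketch uses invented table names to show the two questions lineage answers: what is this asset derived from (upstream), and what breaks if it has an issue (downstream impact)?

```python
# A toy lineage graph: each node maps to the upstream tables it is
# derived from. All table names are hypothetical.
UPSTREAM = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["raw_orders", "raw_payments"],  # analytics-ready model
    "raw_orders": [],    # raw source: a machine created it or a human put it in
    "raw_payments": [],
}

def trace_upstream(node: str) -> set[str]:
    """Everything `node` is derived from, back to the raw sources."""
    sources = set()
    for parent in UPSTREAM.get(node, []):
        sources.add(parent)
        sources |= trace_upstream(parent)
    return sources

def trace_downstream(node: str) -> set[str]:
    """Everything affected if `node` has a data issue (impact analysis)."""
    impacted = {child for child, parents in UPSTREAM.items() if node in parents}
    for child in set(impacted):
        impacted |= trace_downstream(child)
    return impacted

print(trace_upstream("revenue_dashboard"))  # {'fct_orders', 'raw_orders', 'raw_payments'}
print(trace_downstream("raw_orders"))       # {'fct_orders', 'revenue_dashboard'}
```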
