361 episodes

Data Engineering Podcast
Tobias Macey

    • Technology
    • 4.7 • 115 Ratings

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

    Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics

    Summary

    Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is labor- and time-intensive. The team at Omni has taken a new approach by automatically building models based on the queries that analysts execute. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer, and how it improves the organizational experience of business intelligence.
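    To make the episode's central idea concrete, here is a minimal sketch, in Python, of how a modeling layer might be inferred from a log of executed queries. It is purely illustrative and assumes nothing about Omni's actual implementation; the query log, table name, and promotion threshold are all invented for the example.

```python
# Illustrative sketch (not Omni's implementation): derive a candidate
# reusable model from a log of executed queries by counting which
# dimensions and measures analysts actually use.
import re
from collections import Counter

query_log = [
    "SELECT region, SUM(revenue) FROM orders GROUP BY region",
    "SELECT region, COUNT(*) FROM orders GROUP BY region",
    "SELECT product, SUM(revenue) FROM orders GROUP BY product",
]

dimensions, measures = Counter(), Counter()
for sql in query_log:
    # Naive pattern matching; a real system would use a SQL parser.
    for dim in re.findall(r"GROUP BY\s+(\w+)", sql, re.IGNORECASE):
        dimensions[dim] += 1
    for agg, col in re.findall(r"(SUM|COUNT|AVG)\((\w+|\*)\)", sql, re.IGNORECASE):
        measures[f"{agg.lower()}_{col}"] += 1

# Promote anything used more than once into a shared model definition.
model = {
    "table": "orders",
    "dimensions": [d for d, n in dimensions.items() if n > 1],
    "measures": [m for m, n in measures.items() if n > 1],
}
print(model)  # {'table': 'orders', 'dimensions': ['region'], 'measures': ['sum_revenue']}
```

    The point of the sketch is the direction of the workflow: the model is a by-product of the queries people already run, rather than an artifact that has to be authored up front.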


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! A minimal connection sketch, with placeholder credentials, follows this list.
    Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence
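    As referenced in the Materialize item above, because Materialize speaks the PostgreSQL wire protocol an ordinary Postgres driver works against it. The sketch below is a hedged example, not a real deployment: the host, credentials, and the orders source are placeholders that would need to exist in your environment.

```python
# Connect to Materialize with a standard PostgreSQL driver and maintain
# a view with plain SQL. All connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<your-materialize-host>",  # placeholder, not a real endpoint
    port=6875,                       # Materialize's default port
    user="<user>",
    password="<app-password>",
    dbname="materialize",
    sslmode="require",
)
conn.autocommit = True

with conn.cursor() as cur:
    # A standing view over streaming input, kept incrementally up to date.
    # Assumes an `orders` source has already been created.
    cur.execute("""
        CREATE MATERIALIZED VIEW order_totals AS
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    """)
    # Reading the always-fresh results is just ANSI SQL.
    cur.execute("SELECT * FROM order_totals ORDER BY total DESC LIMIT 5")
    for row in cur.fetchall():
        print(row)
```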


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what Omni Analytics is and the story behind it?


    What are the core goals that you are trying to achieve with building Omni?

    Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?



    What are the technical and organizational anti-patterns that typically grow up around BI systems?

    What are the elements that contribute to BI being such a difficult product to use effectively in an organization?

    Can you describe how you have implemented the Omni platform?



    How have the design/scope/goals of the product changed since you first started working on it?

    What does the workflow for a team using Omni look like?

    What are some of the developments in the broader ecosystem that have made your work possible?

    What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?

    What are the most interesting, innovative, or unexpected ways that you have seen Omni used?

    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?

    When is Omni the wrong choice?

    What do you have planned for the future of Omni?



    Contact Info


    LinkedIn
    @cmerrick on Twitter


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
    To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers


    Links


    Omni Analytics
    Stitch
    RJ Metrics
    Looker
        Podcast Episode
    Singer
    dbt
        Podcast Episode
    Teradata
    Fivetran
    Apache Arrow
        Podcast Episode
    DuckDB
        Podcast Episode
    BigQuery
    Snowflake
        Podcast Episode



    The intro and outro music is from The H

    • 50 min
    Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI

    Summary

    The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
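    The following is a toy sketch of the underlying idea: deterministic masking that keeps fake values consistent across tables so referential integrity survives. It is not Tonic's engine or API; all field names and sample rows are invented for illustration.

```python
# Toy illustration of production-data masking: replace sensitive fields
# with realistic fakes while preserving the shape of the rows.
import hashlib
import random

def fake_email(real_email: str) -> str:
    # Deterministic hashing means the same input always maps to the same
    # fake value, so joins across tables still line up after masking.
    digest = hashlib.sha256(real_email.encode()).hexdigest()[:8]
    return f"user_{digest}@example.com"

def mask_row(row: dict) -> dict:
    return {
        "id": row["id"],                    # keys preserved for referential integrity
        "email": fake_email(row["email"]),  # PII replaced with a consistent fake
        # Jitter numeric values so they stay plausible without being real.
        "amount": round(row["amount"] * random.uniform(0.9, 1.1), 2),
    }

production_rows = [
    {"id": 1, "email": "alice@corp.com", "amount": 120.00},
    {"id": 2, "email": "bob@corp.com", "amount": 34.50},
]
print([mask_row(r) for r in production_rows])
```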


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
    Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
    Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what Tonic is and the story behind it?
    What are the core problems that you are trying to solve?
    What are some of the ways that fake or obfuscated data is used in development and analytics workflows?
    What are the challenges of reliably subsetting data?


    What is the impact of ORMs and the bad habits that developers get into with database modeling?

    Can you describe how Tonic is implemented?


    What are the units of composition that you are building to allow for evolution and expansion of your product?
    How have the design and goals of the platform evolved since you started working on it?

    Can you describe some of the different workflows that customers build on top of your various tools?
    What are the most interesting, innovative, or unexpected ways that you have seen Tonic used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic?
    When is Tonic the wrong choice?
    What do you have planned for the future of Tonic?


    Contact Info


    LinkedIn
    @AdamKamor on Twitter


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
    To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers


    Links


    Tonic


    Djinn

    Django

    • 45 min
    Building Applications With Data As Code On The DataOS

    Summary

    The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At The Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
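    As a hedged sketch of what "driving the analytics lifecycle through code" can look like, the toy example below declares datasets in code and derives lineage edges from the declarations automatically. The spec format is invented for this writeup and is not DataOS's actual interface.

```python
# Toy "data as code" sketch: datasets are declared as code objects, and
# a lineage knowledge graph falls out of the declarations automatically.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    owner: str
    inputs: list = field(default_factory=list)  # upstream dataset names

catalog: dict[str, Dataset] = {}

def register(ds: Dataset) -> None:
    catalog[ds.name] = ds

register(Dataset("raw_orders", owner="ingest-team"))
register(Dataset("clean_orders", owner="analytics", inputs=["raw_orders"]))
register(Dataset("revenue_daily", owner="analytics", inputs=["clean_orders"]))

# Lineage edges derived from the declarations, with no separate catalog entry.
edges = [(up, ds.name) for ds in catalog.values() for up in ds.inputs]
print(edges)  # [('raw_orders', 'clean_orders'), ('clean_orders', 'revenue_daily')]
```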


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
    Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. A toy example of the kind of freshness check such a platform automates appears after this list.
    Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
    Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company
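    For a sense of what an automated observability check does under the hood (see the Monte Carlo item above), here is a toy freshness check in Python. It is a generic illustration, not Monte Carlo's API; the lag threshold and the source of the timestamp are assumptions.

```python
# Toy table-freshness check of the kind a data observability platform
# automates across every table, without hand-written threshold code.
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> str:
    lag = datetime.now(timezone.utc) - last_loaded_at
    return "OK" if lag <= max_lag else f"STALE by {lag - max_lag}"

# In practice last_loaded_at would come from warehouse metadata,
# e.g. MAX(loaded_at) on the table being monitored.
last_load = datetime.now(timezone.utc) - timedelta(hours=7)
print(check_freshness(last_load, max_lag=timedelta(hours=6)))  # STALE by ~1:00:00
```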


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what your mission at The Modern Data Company is and the story behind it?
    Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?


    Who is the target audience?

    On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept?


    What are the platform capabilities that are required to make it possible?

    There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform?
    Can you describe the technical architecture that powers your DataOS product?


    What are the core principles that you are optimizing for in the design of your platform?
    How have the design and goals of the system changed or evolved since you started working on DataOS?

    Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS?
    What are the interfaces and escape hatches that are available for integrating with and extending DataOS?

    • 48 min
    Automate Your Pipeline Creation For Streaming Data Transformations With SQLake

    Summary

    Managing end-to-end data flows becomes complex and unwieldy as the scale of data and the variety of its applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.
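    To illustrate the core idea of deriving orchestration from queries rather than hand-built DAGs, the sketch below infers job ordering from the tables each SQL statement reads and writes. This is a simplification invented for this writeup, not Upsolver's actual dialect or engine; the job definitions are made up.

```python
# Sketch: infer pipeline ordering from SQL dependencies, so no DAG has
# to be authored by hand.
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each job writes the table it is named for (invented examples).
jobs = {
    "staged_events": "INSERT INTO staged_events SELECT * FROM raw_events",
    "sessions": "INSERT INTO sessions SELECT session_id FROM staged_events",
    "daily_stats": "INSERT INTO daily_stats SELECT day FROM sessions",
}

def reads(sql: str) -> set:
    # Naive dependency extraction; a real system parses the SQL properly.
    return set(re.findall(r"FROM\s+(\w+)", sql, re.IGNORECASE))

# Only dependencies on tables produced by other jobs matter for ordering.
targets = set(jobs)
graph = {name: reads(sql) & targets for name, sql in jobs.items()}
print(list(TopologicalSorter(graph).static_order()))
# ['staged_events', 'sessions', 'daily_stats']
```

    The design point this illustrates is that the execution order is a property of the queries themselves, so adding a new transformation is just writing another SQL statement.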


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
    Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
    Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
    Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what the SQLake product is and the story behind it?


    What is the core problem that you are trying to solve?

    What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow?
    What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)?
    Can you describe the technical implementation of the SQLake feature?
    What does the workflow look like for designing and deploying pipelines in SQLake?
    What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?


    SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?

    What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales?
    What are some of the edge cases that you have had to provide escape hatches for?
    What are the most interesting, innovative, or unexpected ways that you have seen SQLake used?

    • 44 min
    Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

    Summary

    Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
    Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
    Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
    Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what AlignAI is and the story behind it?
    What are the core problems that you are focused on addressing?


    What are the tactical ways that you are working to solve those problems?

    What are some of the common and avoidable ways that analytics/AI projects go wrong?


    What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?

    What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?
    Can you describe the design and implementation of the AlignAI platform?


    How have the goals and implementation of the product changed since you first started working on it?

    • 59 min
    Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

    Summary

    With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented, and the long-term improvements in your productivity that it provides.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
    Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. A toy illustration of this kind of data diff appears after this list.
    RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
    Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
    Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience applying product driven development to the work of data teams
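    As promised in the Datafold item above, here is a toy data diff: comparing a table before and after a SQL change at both the summary and row level. This is generic illustrative code, not Datafold's API; the sample rows are invented.

```python
# Toy data diff: summarize how a change affected a table statistically
# (row counts, adds/removes/changes) and show the row-level differences.
def diff_tables(before: dict, after: dict) -> None:
    # before/after map primary key -> row value (simplified to one column)
    added = after.keys() - before.keys()
    removed = before.keys() - after.keys()
    changed = {k for k in before.keys() & after.keys() if before[k] != after[k]}
    print(f"rows: {len(before)} -> {len(after)} "
          f"(+{len(added)}, -{len(removed)}, ~{len(changed)})")
    for k in sorted(changed):
        print(f"  pk={k}: {before[k]!r} -> {after[k]!r}")

old = {1: 120.00, 2: 34.50, 3: 88.10}   # production version of the table
new = {1: 120.00, 2: 35.00, 4: 12.25}   # version produced by the PR branch
diff_tables(old, new)
```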

    • 58 min

Customer Reviews

4.7 out of 5
115 Ratings

Googleduser,

Interesting topics & guests

Tobias does a great job covering the future of data engineering - practical tips, the future of the industry with the founders of new tools, and no-nonsense advice on how to build data pipelines, viz, and process that will scale.

SteveT3ch,

Best Data Engineering Podcast

Found this podcast by accident and now can’t do without it. Very knowledgeable host and guests.

LisaIsHereForIt,

Incredible insights!💥

No matter the topic, you’re guaranteed to gain something from every episode - can’t recommend Data Engineering enough. 🙌

Top Podcasts In Technology

Lex Fridman
Jason Calacanis
The Cut & The Verge
The New York Times
NPR
The Wall Street Journal

You Might Also Like

Software Engineering Daily
Michael Kennedy (@mkennedy)
Kyle Polich
Jon Krohn and Guests on Machine Learning, A.I., and Data-Career Success
DataCamp
Real Python