415 episodes

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Data Engineering Podcast Tobias Macey

    • Technology
    • 4.7 • 127 Ratings

    Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

    Summary

    Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
    Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?


    This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?

    Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components?
    One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform?
    Can you describe t

    • 56 min
    Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

    Summary

    A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today.
    Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg


    Interview


    Introduction
    How did you get involved in the area of data management?
    To start, can you share your definition of what constitutes a "Data Lakehouse"?


    What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
    What are the notable advancements in recent months/years that make them a more viable platform choice?

    There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?


    What are the key points of comparison for that combination in relation to other possible selections?

    What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?


    What progress is being made (within or across the ecosystem) to address those sharp edges?

    For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
    What are the differences in terms of pipeline design/access and usage patterns when using a Tr

    • 58 min
    Data Sharing Across Business And Platform Boundaries

    Summary

    Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
    Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
    What is the current state of the ecosystem for data sharing protocols/practices/platforms?


    What are some of the main challenges/shortcomings that teams/organizations experience with these options?

    What are the technical capabilities that need to be present for an effective data sharing solution?


    How does that change as a function of the type of data? (e.g. tabular, image, etc.)

    What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
    What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
    Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
    What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
    When is Bobsled the wrong choice?
    What do you have planned for the future of data sharing?


    Contact Info


    LinkedIn


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.

    • 59 min
    Tackling Real Time Streaming Data With SQL Using RisingWave

    Summary

    Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
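The core idea behind a streaming database like RisingWave is the incrementally maintained materialized view: each arriving event updates stored state in O(1), instead of the aggregate being recomputed from history on every query. A toy Python illustration of that principle (RisingWave itself expresses this in SQL via `CREATE MATERIALIZED VIEW`, with state persisted on S3):

```python
from collections import defaultdict


class RunningAvgView:
    """Toy incrementally-maintained view: average value per key.

    Each event touches O(1) state rather than re-scanning history,
    which is the essence of streaming materialized views.
    """

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_event(self, key, value):
        self.count[key] += 1
        self.total[key] += value

    def read(self, key):
        return self.total[key] / self.count[key]


view = RunningAvgView()
for key, value in [("sensor1", 10.0), ("sensor1", 20.0), ("sensor2", 5.0)]:
    view.on_event(key, value)
print(view.read("sensor1"))  # 15.0
```

The engineering challenges Yingjun describes in the episode (joins, out-of-order events, fault tolerance, state on object storage) are what separate this toy from a production system.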


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
    Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what RisingWave is and the story behind it?
    There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?


    What are some of the platforms/architectures that teams are replacing with RisingWave?

    What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
    Can you describe how RisingWave is architected and implemented?


    How have the design and goals/scope changed since you first started working on it?
    What are the core design philosophies that you rely on to prioritize the ongoing development of the project?

    What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
    Can you describe a typical workflow for teams that are building on top of RisingWave?


    What are the user/developer experience elements that you have prioritized most highly?

    What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
    What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
    When is RisingWave the wrong choice?
    What do you have planned for the future of RisingWave?


    Contact Info


    yingjunwu on GitHub
    Personal Website
    LinkedIn


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows.

    • 56 min
    Build A Data Lake For Your Security Logs With Scanner

    Summary

    Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
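Fast interactive search over unstructured logs generally rests on some form of token index, so a query consults posting lists instead of scanning every line. A simplified sketch of the idea in Python (an illustration only, not Scanner's actual design, which indexes log files in S3 at much larger scale):

```python
import re
from collections import defaultdict

logs = [
    '{"event": "login", "user": "alice", "ip": "10.0.0.1"}',
    '{"event": "logout", "user": "alice"}',
    '{"event": "login", "user": "bob", "ip": "10.0.0.2"}',
]

# Build an inverted index: token -> set of matching line numbers.
# Tokenizing loosely (on runs of alphanumerics and dots) sidesteps
# schema differences between log sources.
index = defaultdict(set)
for line_no, line in enumerate(logs):
    for token in re.findall(r"[A-Za-z0-9.]+", line):
        index[token].add(line_no)

# A query intersects posting lists instead of scanning every line
hits = index["login"] & index["alice"]
print([logs[i] for i in sorted(hits)])
```

Everything else in the episode (S3 layout, cost, scale, handling inconsistent labelling across sources) is about making this basic trade of index space for query speed work on terabytes of logs.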


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what Scanner is and the story behind it?


    What were the shortcomings of other tools that are available in the ecosystem?

    What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
    A query engine is useless without data to analyze. What are the data acquisition paths/sources that Scanner is designed to work with? (e.g. CloudTrail logs, app logs, etc.)


    What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?

    Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
    Can you describe the architecture of the Scanner platform?


    What were the motivating constraints that led you to your current implementation?
    How have the design and goals of the product changed since you first started working on it?

    Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
    What are the personas of the end-users for Scanner?


    How has that influenced the way that you think about the query formats, APIs, user experience etc. for the product?

    For teams who are working with Scanner can you describe how it fits into their workflow?
    What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
    When is Scanner the wrong choice?
    What do you have planned for the future of Scanner?


    Contact Info


    LinkedIn


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    Visit the site to subscribe to the show, sign up for the mailing list, and read th

    • 1 hr 2 min
    Modern Customer Data Platform Principles

    Summary

    Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
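A recurring problem behind the "Customer 360" promise discussed in the episode is identity resolution: stitching together records that share any identifier into one customer profile. A minimal union-find sketch of the idea (a hypothetical illustration, not ActionIQ's method):

```python
records = [
    {"id": 0, "email": "a@x.com", "phone": None},
    {"id": 1, "email": "a@x.com", "phone": "555-1"},
    {"id": 2, "email": "b@y.com", "phone": "555-1"},
    {"id": 3, "email": "c@z.com", "phone": None},
]

# Union-find over record ids: records sharing any identifier
# (directly or transitively) end up in one cluster
parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

seen = {}  # identifier value -> first record id that carried it
for r in records:
    for key in ("email", "phone"):
        value = r[key]
        if value is None:
            continue
        if value in seen:
            union(r["id"], seen[value])
        else:
            seen[value] = r["id"]

clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])
print(sorted(clusters.values()))  # record ids grouped into profiles
```

Note how records 0 and 2 never share an identifier directly but are joined transitively through record 1; that transitivity is both the power of identity stitching and, as the episode discusses, a major source of governance complexity.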


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
    Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what the role of the CDP is in the context of a business's data ecosystem?


    What are the core technical challenges associated with building and maintaining a CDP?
    What are the organizational/business factors that contribute to the complexity of these systems?

    The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years?
    Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?


    How have the architectural shifts changed the ways that organizations interact with their customer data?

    How have the responsibilities shifted across different roles?


    What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?

    What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs?
    When is a CDP the wrong choice?
    What do you have planned for the future of ActionIQ?


    Contact Info


    LinkedIn
    @Tasso on Twitter


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.

    • 1 hr 1 min

Customer Reviews

4.7 out of 5
127 Ratings

Googleduser ,

Interesting topics guests

Tobias does a great job covering the future of data engineering - practical tips, the future of the industry with the founders of new tools, and no-nonsense advice on how to build data pipelines, viz, and process that will scale.

Fkn2013 ,

Azure

I really enjoy this podcast and learn a lot from it. I wonder why none of the data tools in Azure are ever mentioned.

Thanks

SteveT3ch ,

Best Data Engineering Podcast

Found this podcast by accident and now can’t do without it. Very knowledgeable host and guests
