369 episodes

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Data Engineering Podcast Tobias Macey

    • Technology
    • 4.0 • 1 Rating


    Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite


    Summary

    The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more
    Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
    Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
    Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what Grainite is and the story behind it?
    What are the personas that you are focused on addressing with Grainite?
    What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?
        How does Grainite work to reduce that complexity?
    What are some of the commonalities that you see in the teams/organizations that find their way to Grainite?
    What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture?
    Can you describe how Grainite is architected?
        How have the design and goals of the platform changed/evolved since you first started working on it?

    • 1 hr 13 min
    Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed


    Summary

    As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
    Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization


    Interview


    Introduction
    How did you get involved in the area of data management?
    Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?
    How has the scope and complexity of implementing security controls on data systems changed in recent years?
        In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?
    What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?
        How much of the problem is technical vs. procedural/organizational?
    As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)
        What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
    What are some of the ways that data security and organizational productivity are at odds with each other?
        What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?
    What are some of the methods that you have found to be most

    • 51 min
    Use Your Data Warehouse To Power Your Product Analytics With NetSpring


    Summary

    With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
    Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what NetSpring is and the story behind it?
        What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities?
    When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ?
    Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries?
        How does a warehouse-native approach simplify that effort?
    There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem?
    How is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks?
        What are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring?
        How have the goals and implementation of the NetSpring platform evolved from when you first started working on it?
    Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse?
        What are the signals that NetSpring uses to understand the customer journeys of different organizations?
        How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users?
    Given that you are a product organization, how are you using NetSpring to power NetSpring?
    What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ne

    • 49 min
    Exploring The Nuances Of Building An Intentional Data Culture


    Summary

    The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what your working definition of "Data Culture" is?
        In what ways is a data culture distinct from an organization's corporate culture? How are they interdependent?
        What are the elements that are most impactful in forming the data culture of an organization?
    What are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture?
        Are there any strategies that you have found helpful in counteracting those tendencies?
    In terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes?
        What are the experiences that you have had personally and in community interactions that led you to elevate data culture to be its own track?
    What are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture?
    What are some of the risks that you considered when forming this track and evaluating proposals?
    What are your criteria for determining whether this track is successful?
    What are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event?
    What do you have planned for the future of this topic at Data Council events?


    Contact Info


    Pete
        @petesoder on Twitter
        LinkedIn
    Maggie
        LinkedIn



    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machin

    • 45 min
    Building A Data Mesh Platform At PayPal


    Summary

    There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
    Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you start by describing the goals and scope of your work at PayPal to implement a data mesh?
        What are the core problems that you were addressing with this project?
        Is a data mesh ever "done"?
    What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration?
    What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh?
    What are the technical systems that you are relying on to power the different data domains?
        What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency?
    What are the biggest challenges (technical and procedural) that you have encountered during your implementation?
    How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.)
    What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh?
    When is a data mesh the wrong choice?
    What do you have planned for the future of your data mesh at PayPal?


    Contact Info


    LinkedIn
    Blog


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
    To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.


    Links


    Data Mesh
        O'Reilly Book (affiliate link)
    The next generation of Data Platforms is the Data Mesh
    PayPal
    Conway's Law
    Data Mesh For All Ages - US, Data Mesh For All Ages - UK
    Data Mesh Radio
    Data Mesh Community
    Data Mesh I

    • 46 min
    The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse


    Summary

    Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.


    Announcements


    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
    Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular


    Interview


    Introduction
    How did you get involved in the area of data management?
    Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
        Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
    What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
    Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
        Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
    For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
        How does that change based on the type of query/processing engine being used?
    Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
    What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
    When is Iceberg/Tabular the wrong choice?
    What do you have planned for the future of Iceberg/Tabular?


    Contact Info


    LinkedIn
    rdblue on GitHub


    Parting Question


    From your perspective, what is the biggest gap in the tooling or technology for data management today?


    Closing Announcements


    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    Visit the site to subscribe to t

    • 55 min

Customer Reviews

4.0 out of 5
1 Rating

