The Data Flowcast: Mastering Airflow for Data Engineering & AI

Astronomer

Welcome to The Data Flowcast: Mastering Airflow for Data Engineering & AI — the podcast where we keep you up to date with insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast Webpage: https://www.astronomer.io/podcast/

  1. How Uber Manages 1 Million Daily Tasks Using Airflow, with Shobhit Shah and Sumit Maheshwari

    NOV 14

    When data orchestration reaches Uber’s scale, innovation becomes a necessity, not a luxury. In this episode, we discuss the innovations behind Uber’s unique Airflow setup. With our guests Shobhit Shah and Sumit Maheshwari, both Staff Software Engineers at Uber, we explore how their team manages one of the largest data workflow systems in the world. Shobhit and Sumit walk us through the evolution of Uber’s Airflow implementation, detailing the custom solutions that support 200,000 daily pipelines. They discuss Uber’s approach to tackling complex challenges in data orchestration, disaster recovery and scaling to meet the company’s extensive data needs.

    Key Takeaways:
    (02:03) Airflow as a service streamlines Uber’s data workflows.
    (06:16) Serialization boosts security and reduces errors.
    (10:05) Java-based scheduler improves system reliability.
    (13:40) Custom recovery model supports emergency pipeline switching.
    (15:58) No-code UI allows easy pipeline creation for non-coders.
    (18:12) Backfill feature enables historical data processing.
    (22:06) Regular updates keep Uber aligned with Airflow advancements.
    (26:07) Plans to leverage Airflow’s latest features.

    Resources Mentioned:
    Shobhit Shah - https://www.linkedin.com/in/shahshobhit/
    Sumit Maheshwari - https://www.linkedin.com/in/maheshwarisumit/
    Uber | LinkedIn - https://www.linkedin.com/company/uber-com/
    Apache Airflow - https://airflow.apache.org/
    Airflow Summit - https://airflowsummit.org/
    Uber | Website - https://www.uber.com/tw/en/
    Apache Airflow Survey - https://astronomer.typeform.com/airflowsurvey24

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    29 min
  2. Building Resilient Data Systems for Modern Enterprises at Astrafy with Andrea Bombino

    NOV 7

    Efficient data orchestration is the backbone of modern analytics and AI-driven workflows. Without the right tools, even the best data can fall short of its potential. In this episode, Andrea Bombino, Co-Founder and Head of Analytics Engineering at Astrafy, shares insights into his team’s approach to optimizing data transformation and orchestration using tools like datasets and Pub/Sub to drive real-time processing. Andrea explains how they leverage Apache Airflow and Google Cloud to power dynamic data workflows.

    Key Takeaways:
    (01:55) Astrafy helps companies manage data using Google Cloud.
    (04:36) Airflow is central to Astrafy’s data engineering efforts.
    (07:17) Datasets and Pub/Sub are used for real-time workflows.
    (09:59) Pub/Sub links multiple Airflow environments.
    (12:40) Datasets eliminate the need for constant monitoring.
    (15:22) Airflow updates have improved large-scale data operations.
    (18:03) New Airflow API features make dataset updates easier.
    (20:45) Real-time orchestration speeds up data processing for clients.
    (23:26) Pub/Sub enhances flexibility across cloud environments.
    (26:08) Future Airflow features will offer more control over data workflows.

    Resources Mentioned:
    Andrea Bombino - https://www.linkedin.com/in/andrea-bombino/
    Astrafy - https://www.linkedin.com/company/astrafy/
    Apache Airflow - https://airflow.apache.org/
    Google Cloud - https://cloud.google.com/
    dbt - https://www.getdbt.com/
    Apache Airflow Survey - https://astronomer.typeform.com/airflowsurvey24

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    28 min
  3. Inside Airflow 3: Redefining Data Engineering with Vikram Koka

    OCT 31

    Data orchestration is evolving faster than ever, and Apache Airflow 3 is set to revolutionize how enterprises handle complex workflows. In this episode, we dive into the exciting advancements with Vikram Koka, Chief Strategy Officer at Astronomer and PMC Member at The Apache Software Foundation. Vikram shares his insights on the evolution of Airflow and its pivotal role in shaping modern data-driven workflows, particularly with the upcoming release of Airflow 3.

    Key Takeaways:
    (02:36) Vikram leads Astronomer’s engineering and open-source teams for Airflow.
    (05:26) Airflow enables reliable data ingestion and curation.
    (08:17) Enterprises use Airflow for mission-critical data pipelines.
    (11:08) Airflow 3 introduces major architectural updates.
    (13:58) Multi-cloud and edge deployments are supported in Airflow 3.
    (16:49) Event-driven scheduling makes Airflow more dynamic.
    (19:40) Tasks in Airflow 3 can run in any language.
    (22:30) Multilingual task support is crucial for enterprises.
    (25:21) Data assets and event-based integration enhance orchestration.
    (28:12) Community feedback plays a vital role in Airflow 3.

    Resources Mentioned:
    Vikram Koka - https://www.linkedin.com/in/vikramkoka/
    Astronomer | LinkedIn - https://www.linkedin.com/company/astronomer/
    The Apache Software Foundation | LinkedIn - https://www.linkedin.com/company/the-apache-software-foundation/
    Apache Airflow | LinkedIn - https://www.linkedin.com/company/apache-airflow/
    Apache Airflow - https://airflow.apache.org/
    Astronomer | Website - https://www.astronomer.io/
    The Apache Software Foundation - https://www.apache.org/
    Join the Airflow Slack and/or dev list - https://airflow.apache.org/community/
    Apache Airflow Survey - https://astronomer.typeform.com/airflowsurvey24

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    30 min
  4. Building a Data-Driven HR Platform at 15Five with Guy Dassa

    OCT 24

    Data and AI are revolutionizing HR, empowering leaders to measure performance and drive strategic decisions like never before. In this episode, we explore the transformation of HR technology with Guy Dassa, Chief Technology Officer at 15Five, as he shares insights into their evolving data platform. Guy discusses how 15Five equips HR leaders with tools to measure and take action on team performance, engagement and retention. He explains their data-driven approach, highlighting how Apache Airflow supports their data ingestion, transformation and AI integration.

    Key Takeaways:
    (01:54) 15Five acts as a command center for HR leaders.
    (03:40) Tools like performance reviews, engagement surveys and an insights dashboard guide actionable HR steps.
    (05:33) Data visualization, insights and action recommendations enhance HR effectiveness and improve people outcomes.
    (07:08) Strict data confidentiality and sanitized AI model training.
    (09:21) Airflow is central to data transformation and enrichment.
    (11:15) Airflow enrichment DAGs integrate AI models.
    (13:33) Integration of Airflow and dbt enables efficient data transformation.
    (15:28) Synchronization challenges arise with reverse ETL processes.
    (17:10) Future plans include deeper Airflow integration with AI.
    (19:31) Emphasizing the need for DAG versioning and improved dependency visibility.

    Resources Mentioned:
    Guy Dassa - https://www.linkedin.com/in/guydassa/
    15Five | LinkedIn - https://www.linkedin.com/company/15five/
    Apache Airflow - https://airflow.apache.org/
    MLflow - https://mlflow.org/
    dbt - https://www.getdbt.com/
    Kubernetes - https://kubernetes.io/
    Amazon Redshift - https://aws.amazon.com/redshift/
    15Five | Website - https://www.15five.com/

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    20 min
  5. The Intersection of AI and Data Management at Dosu with Devin Stein

    OCT 4

    Unlocking engineering productivity goes beyond coding — it’s about managing knowledge efficiently. In this episode, we explore the innovative ways in which Dosu leverages Airflow for data orchestration and supports the Airflow project. Devin Stein, Founder of Dosu, shares his insights on how engineering teams can focus on value-added work by automating knowledge management. Devin dives into Dosu’s purpose, the significance of AI in their product, and why they chose Airflow as the backbone for scheduling and data management.

    Key Takeaways:
    (01:33) Dosu’s mission to democratize engineering knowledge.
    (05:00) AI is central to Dosu’s product for structuring engineering knowledge.
    (06:23) The importance of maintaining up-to-date data for AI effectiveness.
    (07:55) How Airflow supports Dosu’s data ingestion and automation processes.
    (08:45) The reasoning behind choosing Airflow over other orchestrators.
    (11:00) Airflow enables Dosu to manage both traditional ETL and dynamic workflows.
    (13:04) Dosu assists the Airflow project by auto-labeling issues and discussions.
    (14:56) Thoughtful collaboration with the Airflow community to introduce AI tools.
    (16:37) The potential of Airflow to handle more dynamic, scheduled workflows in the future.
    (18:00) Challenges and custom solutions for implementing dynamic workflows in Airflow.

    Resources Mentioned:
    Apache Airflow - https://airflow.apache.org/
    Dosu Website - https://dosu.dev/

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    20 min
  6. AI-Powered Vehicle Automation at Ford Motor Company with Serjesh Sharma

    SEP 12

    Harnessing data at scale is the key to driving innovation in autonomous vehicle technology. In this episode, we uncover how advanced orchestration tools are transforming machine learning operations in the automotive industry. Serjesh Sharma, Supervisor of ADAS Machine Learning Operations (MLOps) at Ford Motor Company, joins us to discuss the challenges and innovations his team faces working to enhance vehicle safety and automation. Serjesh shares insights into the intricate data processes that support Ford’s Advanced Driver Assistance Systems (ADAS) and how his team leverages Apache Airflow to manage massive data loads efficiently.

    Key Takeaways:
    (01:44) ADAS involves advanced features like pre-collision assist and self-driving capabilities.
    (04:47) Ensuring sensor accuracy and vehicle safety requires extensive data processing.
    (05:08) The combination of on-prem and cloud infrastructure optimizes data handling.
    (09:27) Ford processes around one petabyte of data per week, using both CPUs and GPUs.
    (10:33) Implementing software engineering best practices improves scalability and reliability.
    (15:18) GitHub Issues streamline onboarding and infrastructure provisioning.
    (17:00) Airflow’s modular design allows Ford to manage complex data pipelines.
    (19:00) Kubernetes pod operators help optimize resource usage for CPU-intensive tasks.
    (20:35) Ford’s scale challenges led to customized Airflow configurations for high concurrency.
    (21:02) Advanced orchestration tools are pivotal in managing vast data landscapes in automotive innovation.

    Resources Mentioned:
    Serjesh Sharma - https://www.linkedin.com/in/serjeshsharma/
    Ford Motor Company - https://www.linkedin.com/company/ford-motor-company/
    Apache Airflow - https://airflow.apache.org/
    Kubernetes - https://kubernetes.io/

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    26 min
  7. From Task Failures to Operational Excellence at GumGum with Brendan Frick

    SEP 6

    Data failures are inevitable, but how you manage them can define the success of your operations. In this episode, we dive deep into the challenges of data engineering and AI with Brendan Frick, Senior Engineering Manager, Data at GumGum. Brendan shares his unique approach to managing task failures and DAG issues in a high-stakes ad-tech environment. Brendan discusses how GumGum leverages Apache Airflow to streamline data processes, ensuring efficient data movement and orchestration while minimizing disruptions in their operations.

    Key Takeaways:
    (02:02) Brendan’s role at GumGum and its approach to ad tech.
    (04:27) How GumGum uses Airflow for daily data orchestration, moving data from S3 to warehouses.
    (07:02) Handling task failures in Airflow using Jira for actionable, developer-friendly responses.
    (09:13) Transitioning from email alerts to a more structured system with Jira and PagerDuty.
    (11:40) Monitoring task retry rates as a key metric to identify potential issues early.
    (14:15) Utilizing Looker dashboards to track and analyze task performance and retry rates.
    (16:39) Transitioning from Kubernetes operator to a more reliable system for data processing.
    (19:25) The importance of automating stakeholder communication with data lineage tools like Atlan.
    (20:48) Implementing data contracts to ensure SLAs are met across all data processes.
    (22:01) The role of scalable SLAs in Airflow to ensure data reliability and meet business needs.

    Resources Mentioned:
    Brendan Frick - https://www.linkedin.com/in/brendan-frick-399345107/
    GumGum - https://www.linkedin.com/company/gumgum/
    Apache Airflow - https://airflow.apache.org/
    Jira - https://www.atlassian.com/software/jira
    Atlan - https://atlan.com/
    Kubernetes - https://kubernetes.io/

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    24 min
  8. From Sensors to Datasets: Enhancing Airflow at Astronomer with Maggie Stark and Marion Azoulai

    AUG 29

    A 13-percentage-point reduction in failure rates — this is how two data scientists at Astronomer revolutionized their data pipelines using Apache Airflow. In this episode, we enter the world of data orchestration and AI with Maggie Stark and Marion Azoulai, both Senior Data Scientists at Astronomer. Maggie and Marion discuss how their team re-architected their use of Airflow to improve scalability, reliability and efficiency in data processing. They share insights on overcoming challenges with sensors and how moving to datasets transformed their workflows.

    Key Takeaways:
    (02:23) The data team’s role as a centralized hub within Astronomer.
    (05:11) Airflow is the backbone of all data processes, running 60,000 tasks daily.
    (07:13) Custom task groups enable efficient code reuse and adherence to best practices.
    (11:33) Sensor-heavy architectures can lead to cascading failures and resource issues.
    (12:09) Switching to datasets has improved reliability and scalability.
    (14:19) Building a control DAG provides end-to-end visibility of pipelines.
    (16:42) Breaking down DAGs into smaller units minimizes failures and improves management.
    (19:02) Failure rates improved from 16% to 3% with the new architecture.

    Resources Mentioned:
    Maggie Stark - https://www.linkedin.com/in/margaretstark/
    Marion Azoulai - https://www.linkedin.com/in/marionazoulai/
    Astronomer | LinkedIn - https://www.linkedin.com/company/astronomer/
    Apache Airflow - https://airflow.apache.org/
    Astronomer | Website - https://www.astronomer.io/

    Thanks for listening to The Data Flowcast: Mastering Airflow for Data Engineering & AI. If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    22 min
5 out of 5 (20 Ratings)

