The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI

Astronomer

Welcome to The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI, the podcast where we keep you up to date with insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast Webpage: https://www.astronomer.io/podcast/

  1. Orchestrating 2,000 Airflow pipelines at Luiza Labs with Mateus Ferreira

    10h ago

    Running Airflow at the scale of a national retailer means more than just scheduling. It means giving non-engineers a path to ship DAGs, and classifying thousands of runs to know which ones need attention. In this episode, Mateus Ferreira, Senior Data Engineer at Luiza Labs (the technology arm of Magazine Luiza, one of Brazil's largest retailers), joins Marc to talk about the patterns his team uses to run 2,000+ Airflow pipelines across more than four petabytes of data.

    Key Takeaways:
    00:00 Introduction
    01:11 Mateus introduces himself and Luiza Labs, the technology arm of Magazine Luiza (Magalu), one of Brazil's largest retailers (founded 1957). 1,000+ physical stores, multi-region operations, and a data team that has to handle the variability that comes with all of it.
    04:33 Lu Brain, Magalu's AI initiative built around their character Lu, and how AI fits into the data work.
    06:47 The data reliability engineering channel where AI summarizes Airflow errors with confidence scores and posts a suggested fix in chat.
    08:30 How Airflow became the heart of orchestration. Coming from Control-M in banking, then GCP, then consolidating on Cloud Composer to centralize roughly 2,000 pipelines.
    14:23 The YAML wrapper that lets non-engineers ship DAGs. Reads namespace, tables, and Spark options. Handles CDC, JDBC full, and JDBC incremental collection types with checkpoints. All changes go through data reliability engineering.
    17:20 Why metadata is the most valuable asset in the AI era, and how the wrapper makes data lineage observable across 2,000 pipelines.
    18:26 The Data Reliability Engineering team. A 10-person group that is the window to the company, handling maintenance, validation, corrections, and optimization for the business unit pipelines.
    20:09 Operating at four petabytes of data.
    21:24 Why they built custom Spark operators. Cost drove the move off the DataprocOperator. The custom operator exposes Spark driver and executor sizing as Airflow parameters and generates the Kubernetes manifest.
    24:36 The monitoring dashboard built on the Airflow metadata DB. A timeline view that shows how many DAGs run each hour, used to spread scheduling across the day.
    26:37 Classifying DAGs by their last five runs: success, partially correct, intermittent, total failure. A reusable observability pattern.
    29:57 How to reach Mateus, and a closing thought in Portuguese on appreciating the good old times while you are living them.

    Resources Mentioned:
    Apache Airflow (airflow.apache.org)
    Magalu Cloud / MGC
    Luiza Labs (luizalabs.com) and Magazine Luiza / Magalu
    Astro Observe (https://www.astronomer.io/product)
    Mateus Ferreira on LinkedIn (linkedin.com/in/mateusmferreira)

    Thanks for listening to "The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI." If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow

    33 min
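
The last-five-runs classification mentioned in the Luiza Labs episode can be sketched in a few lines of Python. This is only an illustration: the episode names the four labels but not the rules behind them, so the boundary between "partially correct" and "intermittent" (one failed run versus several), the `classify_dag` helper, and the DAG names below are all assumptions, not the team's implementation.

```python
from collections.abc import Sequence

WINDOW = 5  # the episode mentions looking at the last five runs


def classify_dag(run_states: Sequence[str], window: int = WINDOW) -> str:
    """Label a DAG from its most recent runs (oldest first, newest last)."""
    recent = list(run_states)[-window:]
    if not recent:
        return "no runs"
    failures = sum(1 for state in recent if state != "success")
    if failures == 0:
        return "success"
    if failures == len(recent):
        return "total failure"
    # Assumed boundary: exactly one failed run counts as "partially correct",
    # more than one (but not all) counts as "intermittent".
    return "partially correct" if failures == 1 else "intermittent"


# Hypothetical run histories, keyed by made-up DAG ids.
history = {
    "orders_cdc": ["success"] * 5,
    "stock_jdbc_full": ["success", "success", "failed", "success", "success"],
    "prices_incremental": ["failed", "success", "failed", "success", "failed"],
    "legacy_export": ["failed"] * 5,
}
labels = {dag_id: classify_dag(states) for dag_id, states in history.items()}
```

In practice the run states would come from the Airflow metadata database or the REST API rather than a hard-coded dictionary; the point of the sketch is that a small, fixed window turns thousands of runs into a handful of labels that can be sorted by urgency.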
  2. Enhancing DAGs for Data Processing with William Orgertrice III at Cargill

    May 21

    In the data engineering world, the difference between a pipeline that works and one that's truly production-ready often comes down to a handful of deliberate decisions. William Orgertrice III, Data Engineer at Cargill, joins us to share the DAG design and monitoring practices he presented at Airflow Summit 2025 and how his team is rolling out Airflow across 60+ internal teams as part of Cargill's new Minerva data platform.

    Key Takeaways:
    00:00 Introduction.
    01:45 Cargill is one of the largest privately owned companies in the US, operating across 70 countries and serving 125+ markets.
    03:45 William's team on the Cargill Data Platform supports 60+ internal teams, providing data products that drive decisions across finance, inventory and operations.
    05:10 Cargill chose Airflow as a core component of its new Minerva data platform to replace older ETL tooling with a more supportable, observable stack.
    06:26 Native SLA sensors and dependency management were specific features that made Airflow the right fit for Cargill's batch ingestion pipelines.
    09:00 Cargill is running Airflow through Astronomer as their managed solution, with some teams already in production.
    13:22 Every task in a DAG should have a single, documented purpose; one task doing everything makes troubleshooting significantly harder.
    14:40 A DAG that never enters a failed state but keeps running indefinitely will spend compute budget without alerting anyone.
    15:25 In shared Airflow environments, embedding contact information and owner tags in DAGs ensures the right team is reached when something breaks upstream.
    21:00 William flags connection testing as a friction point in pipeline development: verifying a connection string before building the full job would reduce iteration time.

    Resources Mentioned:
    Cargill | Website https://www.cargill.com/food-beverage
    Airflow Community on Slack https://airflow.apache.org/community/

    #AI #Automation #Airflow

    26 min
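
The 14:40 and 15:25 takeaways from the Cargill episode (runs that never fail but never finish, and owner information attached to each DAG) can be combined in a small check. The sketch below is plain Python under stated assumptions: `RunInfo`, `find_overrunning`, the two-hour limit, and the contact addresses are hypothetical and are not taken from the episode. Airflow itself offers settings such as `dagrun_timeout` on a DAG and `execution_timeout` on a task for enforcing limits; this example only illustrates the reporting idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunInfo:
    dag_id: str
    owner: str  # hypothetical contact for the owning team
    state: str
    started_at: datetime


def find_overrunning(runs, now, max_runtime):
    """Return runs still in 'running' state past max_runtime, longest-running first."""
    stuck = [
        run for run in runs
        if run.state == "running" and now - run.started_at > max_runtime
    ]
    return sorted(stuck, key=lambda run: run.started_at)


# Fixed clock and made-up runs so the example is reproducible.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    RunInfo("ingest_finance", "finance-data@example.com", "running", now - timedelta(hours=9)),
    RunInfo("ingest_inventory", "inventory-data@example.com", "running", now - timedelta(minutes=20)),
    RunInfo("ingest_ops", "ops-data@example.com", "success", now - timedelta(hours=10)),
]

overrunning = find_overrunning(runs, now, max_runtime=timedelta(hours=2))
alerts = [
    f"{run.dag_id} has been running for {now - run.started_at}; contact {run.owner}"
    for run in overrunning
]
```

Because the owner travels with each run record, the alert can name the team to contact without a separate lookup, which is the benefit the episode attributes to owner tags in shared environments.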
  3. Getting Into Data Engineering with Shrividya Hegde, Data and AI Engineer

    May 14

    In this episode, we take a step back from implementation-specific topics to explore what it actually takes to build a career in data engineering and how AI is reshaping that path. Shrividya Hegde, a data and AI engineer and an Airflow champion in Astronomer’s Champions program, joins us to discuss getting into data engineering, contributing to open source and why good data engineering should make AI output trustworthy rather than confidently wrong.

    Key Takeaways:
    00:00 Introduction.
    04:08 Build fundamentals before chasing trending tools: understanding what a tool does, why it exists and what problem it solves has to come first.
    07:19 Data engineering fundamentals mean SQL query performance under joins and aggregations, how data moves between pipelines, DAG failure recovery and idempotency, not just writing queries.
    08:10 The most common mistake newer data engineers make is skipping fundamentals to chase trends; it is a sequencing problem, not a talent problem.
    13:15 AI creates more opportunity for data engineers because AI output quality is directly determined by the quality of the data pipeline feeding it; confidently wrong output is harder to catch than obviously wrong output.
    15:06 Airflow's supporting operators make AI outputs production-ready; orchestration is what converts experimental AI into something reliable.
    17:14 AI-generated DAGs help newer engineers understand underlying concepts rather than just producing working code.
    23:12 The Airflow open source community is more welcoming than most people expect for a project of its size; raising issues and reviewing PRs are viable entry points for first contributions.

    Resources Mentioned:
    Shrividya Hegde https://www.linkedin.com/in/shrividya-hegde-shri-91562365/
    Astronomer | LinkedIn https://www.linkedin.com/company/astronomer/
    Astronomer | Website https://www.astronomer.io
    Women in Data | Website https://womenindata.mn.co/landing
    Apache Airflow Slack https://airflow.apache.org/
    Shrividya's Medium writing https://medium.com/@shrihegde
    Shrividya's Substack writing https://substack.com/@shrividyahegde

    #AI #Automation #Airflow #MachineLearning

    28 min
  4. Orchestrating DBT With Cosmos and Airflow with Filip Kunčar at ShipMonk Product Development

    May 7

    We explore how a third-party logistics platform built its entire data orchestration layer on Airflow, and what that makes possible for developer teams and merchant-facing products alike. Filip Kunčar, Platform Director at ShipMonk Product Development, discusses migrating from a closed-source tool to Airflow, orchestrating dbt with both Cosmos and the BashOperator and using Airflow to power customer-facing data delivery.

    Key Takeaways:
    00:00 Introduction.
    01:07 ShipMonk is a third-party logistics company guaranteeing two-day delivery across the US. The data platform team's mission is to lower cognitive load for developers working with data.
    05:13 ShipMonk migrated to Airflow in 2022, moving away from a closed-source UI-based tool, driven by the need for a code-first approach, open source extensibility and broad cloud provider support.
    10:02 The team uses Cosmos for developer-facing visibility and lineage and BashOperator for internal pipelines where runtime performance matters.
    12:20 Switching from Cosmos to the BashOperator for a frequently running pipeline reduced runtime from over 15 minutes to three minutes.
    13:14 Because the full dbt chain runs inside Airflow, a configurable downstream DAG can deliver processed data directly to each merchant's preferred destination, with secrets management and SLA tracking already handled.
    15:03 Per-team alerting is hooked to each DAG by owner and severity, so teams can react to SLA breaches immediately.
    18:09 ShipMonk uses Airflow in three ways for AI: authoring DAGs faster with skills, orchestrating AI workloads in Lambda and containers and using Astronomer's skills repo to simplify Airflow version upgrades.

    Resources Mentioned:
    Filip Kunčar https://www.linkedin.com/in/filipkuncar/
    ShipMonk Product Development https://www.linkedin.com/company/shipmonk-product-development/
    ShipMonk | Website http://www.shipmonk.com
    Astronomer Cosmos http://www.astronomer.io/cosmos
    Astronomer AI Skills Repo http://www.github.com/astronomer/airflow-llm-providers-demo
    Datadog http://www.datadoghq.com

    #AI #Automation #Airflow #MachineLearning

    25 min
  5. Building Airflow CTL with Buğra Öztürk at Mollie

    Apr 30

    Buğra Öztürk, Senior Data Engineer at Mollie and Committer and PMC member on the Apache Airflow project, joins us to walk through Airflow CTL: what it is, how it differs from the existing Airflow CLI and where it is headed under AIP-94.

    Key Takeaways:
    00:00 Introduction.
    03:10 Buğra has contributed to Airflow since 2022, from docs changes up to Committer and PMC member, a path he hopes inspires others to start small and contribute.
    04:05 Airflow CTL solves secure user interaction by abstracting database credentials behind the public core API.
    05:13 Airflow CLI and Airflow CTL are complementary: the CLI handles administration and database management while CTL handles secure user interactions via the API.
    07:08 Airflow CTL authenticates via the API, acquires a JWT token and stores it securely in the OS keyring, running on the user's machine and never requiring direct database access.
    08:21 Concrete use cases include local DAG development without the UI and CI/CD automation using headless mode with short-lived JWT tokens.
    10:08 AIP-94 describes the long-term vision: decoupling all remote commands from the Airflow CLI and routing them through Airflow CTL.
    13:12 Airflow CTL is currently at 0.X and already being used in CI and deployment automations. The move to 1.0 with full CLI parity is the next milestone under AIP-94.
    16:09 Multi-team deployment becoming generally available in a future Airflow release is Buğra's most-anticipated upcoming feature beyond Airflow CTL.

    Resources Mentioned:
    Buğra Öztürk https://www.linkedin.com/in/bugraozturk93/
    Mollie https://www.linkedin.com/company/mollie/
    Mollie | Website https://www.mollie.com/
    Apache Airflow CTL https://airflow.apache.org/
    AIP-94 on Airflow Confluence https://lists.apache.org/thread/d2o1pr78wxdp1wozq519stp0pkcv6k6c
    Apache Airflow GitHub https://www.github.com/apache/airflow

    #AI #Automation #Airflow #MachineLearning

    20 min
  6. Introducing Airflow’s Common AI Provider with Pavan Kumar Gopidesu and Kaxil Naik

    Apr 23

    In this episode, we explore the newly released Apache Airflow common AI provider: what problem it solves, how it was built and what's coming next. Kaxil Naik, Senior Director of Engineering at Astronomer and Apache Airflow PMC member, and Pavan Kumar Gopidesu, Lead Data Engineer at Experian and Apache Airflow PMC member, join us to walk through the provider's first release and the technical decisions behind it.

    Key Takeaways:
    00:00 Introduction.
    04:05 The common AI provider was born from a real production problem.
    07:10 Airflow already had the primitives needed for durable agent execution, making it the natural foundation for AI orchestration.
    09:15 The LLM schema compare operator uses Apache DataFusion to fetch source schemas.
    11:07 Apache DataFusion was chosen for its speed.
    13:09 Hook tool sets expose Airflow's provider hooks to agents with an allowed methods list that blocks destructive operations.
    15:20 Passing durable=True to an LLM operator caches tool calls and LLM outputs mid-task.
    18:13 The provider offers three abstraction levels.
    21:20 The provider currently requires Airflow 3; the team is open to adding Airflow 2.11 support if demand is high enough.
    24:10 MCP server configs can be stored as Airflow connections.

    Resources Mentioned:
    Kaxil Naik https://www.linkedin.com/in/kaxil/
    Pavan Kumar Gopidesu https://www.linkedin.com/in/pavan-kumar-gopidesu/
    Astronomer | LinkedIn https://www.linkedin.com/company/astronomer/
    Astronomer | Website https://www.astronomer.io
    Experian https://www.linkedin.com/company/experian/
    Apache Airflow https://www.linkedin.com/company/apache-airflow
    Apache Airflow common AI provider docs https://airflow.apache.org/docs/apache-airflow-providers-common-ai/stable/commits.html
    Apache DataFusion https://datafusion.apache.org/
    Pydantic AI https://pydantic.dev/docs/ai/overview/
    Airflow Slack https://airflow.apache.org/docs/apache-airflow-providers-slack/stable/index.html
    Introducing the Common AI Provider: LLM and AI Agent Support for Apache Airflow https://airflow.apache.org/blog/common-ai-provider/

    #Automation #Airflow #MachineLearning

    29 min
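
The 13:09 takeaway from the common AI provider episode describes an allowed-methods list that keeps agents away from destructive hook operations. The general idea can be sketched in plain Python as below. This is not the provider's API: `AllowListedTool`, `FakeStorageHook`, and their methods are hypothetical names invented for illustration, and the real provider may structure its tool sets quite differently.

```python
class AllowListedTool:
    """Expose only explicitly allowed methods of a wrapped object."""

    def __init__(self, target, allowed_methods):
        self._target = target
        self._allowed = frozenset(allowed_methods)

    def call(self, method_name, *args, **kwargs):
        # Deny by default: anything not listed is rejected before it runs.
        if method_name not in self._allowed:
            raise PermissionError(f"method {method_name!r} is not on the allow-list")
        return getattr(self._target, method_name)(*args, **kwargs)


class FakeStorageHook:
    """Hypothetical stand-in for a hook with one safe and one destructive method."""

    def __init__(self):
        self.objects = {"raw/a.csv": b"1,2", "raw/b.csv": b"3,4"}

    def list_keys(self, prefix):
        return sorted(key for key in self.objects if key.startswith(prefix))

    def delete_key(self, key):
        del self.objects[key]


hook = FakeStorageHook()
tool = AllowListedTool(hook, allowed_methods={"list_keys"})
keys = tool.call("list_keys", "raw/")
```

An allow-list (rather than a block-list) means a newly added destructive method stays unreachable until someone deliberately exposes it.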
  7. Building AI Debugging Agents Into Airflow DAGs at Jeppesen ForeFlight with Samantha Blaney Cuevas

    Apr 16

    Aviation data pipelines run on strict 28-day publication cycles, and the margin for error is zero. In this episode, we're joined by Samantha Blaney Cuevas, Software Engineer at Jeppesen ForeFlight, to explore how her team orchestrates a complex, time-sensitive data pipeline with Airflow and where AI is starting to fit into that picture.

    Key Takeaways:
    00:00 Introduction.
    04:05 Airflow orchestrates almost all business logic and data transformations across the cycle, with custom timetables built to track busy and slow periods programmatically.
    06:10 Cycle-aware sensing tasks handle irregular source deliveries, including duplicates and early or late arrivals, without disrupting the pipeline.
    08:07 The two main AI use cases are pipeline debugging and cycle awareness, both designed to reduce the manual overhead of monitoring a complex DAG dependency graph.
    09:03 The Data Port agent is a two-task DAG that routes Slack pipeline alerts to either a predefined command list or an AI token, depending on whether the fix is already known.
    13:10 AI is still in development at Jeppesen ForeFlight; the team is focused on token efficiency and scoping how much autonomy to give agents across different environments.
    15:04 Airflow setup and MCP configuration were straightforward; the harder design work was deciding which environments agents could access across QA staging and production.
    17:06 Airflow's skills repo and agent tooling are helping onboard new developers and extend pipeline awareness to analysts who work alongside engineers on the cycle.
    19:10 Samantha would like to see single-task retries with different parameters in Airflow: resetting one task without clearing the full pipeline run.
    21:05 A future AI use case under consideration is live DAG editing and re-upload within Airflow to make one-off fixes without halting pipeline progress.

    Resources Mentioned:
    Samantha Blaney Cuevas https://www.linkedin.com/in/samantha-blaney/
    Jeppesen ForeFlight | LinkedIn https://www.linkedin.com/company/jeppesen-foreflight/
    Jeppesen ForeFlight | Website http://www.foreflight.com
    Astronomer Airflow Skills Repo http://www.github.com/astronomer/airflow-llm-providers-demo
    Apache Airflow https://airflow.apache.org/

    #AI #Automation #Airflow

    22 min
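
The Jeppesen ForeFlight episode describes pipelines that follow 28-day publication cycles and tasks that need to know where in the cycle they are. The date arithmetic behind that kind of cycle awareness can be sketched as follows. The anchor date and the `cycle_position` helper are assumptions made for the example; the episode does not say how the team's custom timetables compute cycle boundaries.

```python
from datetime import date, timedelta

CYCLE_DAYS = 28
ANCHOR = date(2026, 1, 1)  # hypothetical first day of "cycle 0"


def cycle_position(day: date, anchor: date = ANCHOR):
    """Return (cycle index, zero-based day within the cycle, cycle start date)."""
    offset = (day - anchor).days
    # divmod floors toward negative infinity, so dates before the anchor
    # still map to a day-in-cycle between 0 and 27.
    index, day_in_cycle = divmod(offset, CYCLE_DAYS)
    start = anchor + timedelta(days=index * CYCLE_DAYS)
    return index, day_in_cycle, start


last_day_of_first_cycle = cycle_position(date(2026, 1, 28))
first_day_of_second_cycle = cycle_position(date(2026, 1, 29))
```

A sensor or timetable could use the day-in-cycle value to decide whether a delivery is early, on time, or late relative to the cycle, which is the kind of check the 06:10 takeaway alludes to.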
  8. Introducing Airflow 3.2

    Apr 9

    We introduce Airflow 3.2 and its updates for teams that build and operate data pipelines. Astronomer’s Head of Customer Education, Marc Lamberti, and Senior Manager of Developer Relations, Kenten Danas, break down what’s new, from asset partitioning to async Python tasks and DAG versioning. They explore how these updates improve scheduling, performance and observability in production workflows.

    Key Takeaways:
    00:00 Introduction.
    02:10 Airflow 3 architecture separates workers from the metadata database.
    03:05 Plugin versioning and UI-based backfills simplify operations.
    06:20 Asset partitioning enables granular, partition-level scheduling.
    07:15 Triggering DAGs on partitions instead of full datasets.
    11:05 Deferrable operators reduce worker slot usage.
    12:00 Async operators reduce database pressure and overhead.
    14:10 Async improves throughput, not single task speed.
    22:20 Inlets and outlets improve asset lineage visibility.
    23:00 DAG version markers show changes directly in the UI.

    Resources Mentioned:
    Marc Lamberti https://www.linkedin.com/in/marclamberti/
    Apache Airflow https://airflow.apache.org/
    Astronomer | LinkedIn https://www.linkedin.com/company/astronomer/
    Astronomer | Website https://www.astronomer.io/
    3.2 Webinar https://www.astronomer.io/events/webinars/introducing-airflow-3-2-video
    Asset Partitioning Guide https://www.astronomer.io/docs/learn/airflow-partitioned-runs
    Asynchronous Processes Guide https://www.astronomer.io/docs/learn/deferrable-operators
    Release Notes https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#airflow-3-2-0-2026-04-07
    Provider Registry https://airflow.apache.org/registry/

    #AI #Automation #Airflow #MachineLearning

    26 min
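
The 14:10 takeaway from the Airflow 3.2 episode, that async improves throughput rather than the speed of any single task, can be seen with nothing but the standard library. The sketch below uses plain `asyncio` and is not Airflow code; `fake_io_task` and `run_batch` are hypothetical helpers, and the sleep stands in for waiting on an external system.

```python
import asyncio
import time


async def fake_io_task(delay: float) -> float:
    """Simulate waiting on I/O and report how long this one task took."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def run_batch(count: int, delay: float):
    """Run `count` simulated tasks concurrently; return per-task and total times."""
    start = time.perf_counter()
    durations = await asyncio.gather(*(fake_io_task(delay) for _ in range(count)))
    return durations, time.perf_counter() - start


durations, wall_clock = asyncio.run(run_batch(count=5, delay=0.1))
```

Each simulated task still takes about as long as its wait, but the five waits overlap, so the whole batch finishes in roughly the time of one task instead of the sum of all five.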
5 out of 5 (20 Ratings)
