The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI

Astronomer

Welcome to The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI, the podcast where we keep you up to date with insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast Webpage: https://www.astronomer.io/podcast/

  1. Building Airflow CTL with Buğra Öztürk at Mollie

    5 days ago

    Building Airflow CTL with Buğra Öztürk at Mollie

    Buğra Öztürk, Senior Data Engineer at Mollie and Committer and PMC member on the Apache Airflow project, joins us to walk through Airflow CTL — what it is, how it differs from the existing Airflow CLI and where it is headed under AIP-94.
    Key Takeaways:
    00:00 Introduction.
    03:10 Buğra has contributed to Airflow since 2022, from docs changes up to Committer and PMC member — a path he hopes inspires others to start small and contribute.
    04:05 Airflow CTL solves secure user interaction by abstracting database credentials behind the public core API.
    05:13 Airflow CLI and Airflow CTL are complementary — the CLI handles administration and database management while CTL handles secure user interactions via the API.
    07:08 Airflow CTL authenticates via the API, acquires a JWT token and stores it securely in the OS keyring — running on the user's machine and never requiring direct database access.
    08:21 Concrete use cases include local DAG development without the UI and CI/CD automation using headless mode with short-lived JWT tokens.
    10:08 AIP-94 describes the long-term vision — decoupling all remote commands from the Airflow CLI and routing them through Airflow CTL.
    13:12 Airflow CTL is currently at 0.X and already being used in CI and deployment automations. The move to 1.0 with full CLI parity is the next milestone under AIP-94.
    16:09 Multi-team deployment becoming generally available in a future Airflow release is Buğra's most-anticipated upcoming feature beyond Airflow CTL.
    Resources Mentioned:
    Buğra Öztürk https://www.linkedin.com/in/bugraozturk93/
    Mollie https://www.linkedin.com/company/mollie/
    Mollie | Website https://www.mollie.com/
    Apache Airflow CTL https://airflow.apache.org/
    AIP-94 on Airflow Confluence https://lists.apache.org/thread/d2o1pr78wxdp1wozq519stp0pkcv6k6c
    Apache Airflow GitHub https://www.github.com/apache/airflow
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow #MachineLearning

    20 min
  2. Introducing Airflow’s Common AI Provider with Pavan Kumar Gopidesu and Kaxil Naik

    April 23

    Introducing Airflow’s Common AI Provider with Pavan Kumar Gopidesu and Kaxil Naik

    In this episode, we explore the newly released Apache Airflow common AI provider — what problem it solves, how it was built and what's coming next. Kaxil Naik, Senior Director of Engineering at Astronomer and Apache Airflow PMC member, and Pavan Kumar Gopidesu, Lead Data Engineer at Experian and Apache Airflow PMC member, join us to walk through the provider's first release and the technical decisions behind it.
    Key Takeaways:
    00:00 Introduction.
    04:05 The common AI provider was born from a real production problem.
    07:10 Airflow already had the primitives needed for durable agent execution, making it the natural foundation for AI orchestration.
    09:15 The LLM schema compare operator uses Apache DataFusion to fetch source schemas.
    11:07 Apache DataFusion was chosen for its speed.
    13:09 Hook tool sets expose Airflow's provider hooks to agents with an allowed-methods list that blocks destructive operations.
    15:20 Passing durable=True to an LLM operator caches tool calls and LLM outputs mid-task.
    18:13 The provider offers three abstraction levels.
    21:20 The provider currently requires Airflow 3 — the team is open to adding Airflow 2.11 support if demand is high enough.
    24:10 MCP server configs can be stored as Airflow connections.
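The allowed-methods guard mentioned at 13:09 is, at heart, a plain allowlist pattern. A minimal Python sketch (the proxy class and fake hook are illustrative, not the provider's actual API):

```python
from typing import Any

class AllowedMethodsProxy:
    """Expose only an allowlisted subset of an object's methods.

    Illustrates the hook-tool-set idea: an agent may call read-style
    methods, while anything destructive fails before it runs.
    """
    def __init__(self, target: Any, allowed: set[str]):
        self._target = target
        self._allowed = allowed

    def __getattr__(self, name: str):
        # Called only for attributes not found on the proxy itself.
        if name not in self._allowed:
            raise PermissionError(f"method {name!r} is not in the allowlist")
        return getattr(self._target, name)

class FakeWarehouseHook:
    """Stand-in for a provider hook; not a real Airflow class."""
    def get_records(self, sql: str) -> list:
        return [("row",)]
    def run(self, sql: str) -> None:  # could execute DROP TABLE ...
        raise RuntimeError("should never be reachable through the proxy")

hook = AllowedMethodsProxy(FakeWarehouseHook(), allowed={"get_records"})
print(hook.get_records("SELECT 1"))   # permitted read
try:
    hook.run("DROP TABLE users")      # blocked at attribute access
except PermissionError as e:
    print(e)
```

The key design point, echoed in the episode, is that the check happens before the underlying method is ever looked up, so destructive calls cannot slip through.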
    Resources Mentioned:
    Kaxil Naik https://www.linkedin.com/in/kaxil/
    Pavan Kumar Gopidesu https://www.linkedin.com/in/pavan-kumar-gopidesu/
    Astronomer | LinkedIn https://www.linkedin.com/company/astronomer/
    Astronomer | Website https://www.astronomer.io
    Experian https://www.linkedin.com/company/experian/
    Apache Airflow https://www.linkedin.com/company/apache-airflow
    Apache Airflow common AI provider docs https://airflow.apache.org/docs/apache-airflow-providers-common-ai/stable/commits.html
    Apache DataFusion https://datafusion.apache.org/
    Pydantic AI https://pydantic.dev/docs/ai/overview/
    Airflow Slack https://airflow.apache.org/docs/apache-airflow-providers-slack/stable/index.html
    Introducing the Common AI Provider: LLM and AI Agent Support for Apache Airflow https://airflow.apache.org/blog/common-ai-provider/
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #Automation #Airflow #MachineLearning

    29 min
  3. Building AI Debugging Agents Into Airflow DAGs at Jeppesen ForeFlight with Samantha Blaney Cuevas

    April 16

    Building AI Debugging Agents Into Airflow DAGs at Jeppesen ForeFlight with Samantha Blaney Cuevas

    Aviation data pipelines run on strict 28-day publication cycles, and the margin for error is zero. In this episode, we're joined by Samantha Blaney Cuevas, Software Engineer at Jeppesen ForeFlight, to explore how her team orchestrates a complex, time-sensitive data pipeline with Airflow and where AI is starting to fit into that picture.
    Key Takeaways:
    00:00 Introduction.
    04:05 Airflow orchestrates almost all business logic and data transformations across the cycle, with custom timetables built to track busy and slow periods programmatically.
    06:10 Cycle-aware sensing tasks handle irregular source deliveries, including duplicates and early or late arrivals, without disrupting the pipeline.
    08:07 The two main AI use cases are pipeline debugging and cycle awareness — both designed to reduce the manual overhead of monitoring a complex DAG dependency graph.
    09:03 The Data Port agent is a two-task DAG that routes Slack pipeline alerts to either a predefined command list or an AI token, depending on whether the fix is already known.
    13:10 AI is still in development at Jeppesen ForeFlight — the team is focused on token efficiency and scoping how much autonomy to give agents across different environments.
    15:04 Airflow setup and MCP configuration were straightforward — the harder design work was deciding which environments agents could access across QA, staging and production.
    17:06 Airflow's skills repo and agent tooling are helping onboard new developers and extend pipeline awareness to analysts who work alongside engineers on the cycle.
    19:10 Samantha would like to see single-task retries with different parameters in Airflow — resetting one task without clearing the full pipeline run.
    21:05 A future AI use case under consideration is live DAG editing and re-upload within Airflow to make one-off fixes without halting pipeline progress.
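The 28-day cycle awareness at 04:05 reduces to modular date arithmetic, which is what a custom timetable ultimately computes. A sketch of that calculation (the anchor date below is illustrative, not Jeppesen ForeFlight's actual cycle epoch):

```python
from datetime import date, timedelta

CYCLE_DAYS = 28
# Reference start of some known cycle; this anchor date is an assumption.
EPOCH = date(2025, 1, 23)

def cycle_window(d: date) -> tuple[date, date]:
    """Return the start and end (inclusive) of the 28-day cycle containing d."""
    # Python's % keeps the offset non-negative even for dates before EPOCH.
    offset = (d - EPOCH).days % CYCLE_DAYS
    start = d - timedelta(days=offset)
    return start, start + timedelta(days=CYCLE_DAYS - 1)

start, end = cycle_window(date(2025, 2, 5))
print(start, end)  # the cycle running 2025-01-23 through 2025-02-19
```

A cycle-aware sensor can then accept a delivery only if its timestamp falls inside the window returned here, which is how duplicates and early or late arrivals get classified.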
    Resources Mentioned:
    Samantha Blaney Cuevas https://www.linkedin.com/in/samantha-blaney/
    Jeppesen ForeFlight | LinkedIn https://www.linkedin.com/company/jeppesen-foreflight/
    Jeppesen ForeFlight | Website http://www.foreflight.com
    Astronomer Airflow Skills Repo http://www.github.com/astronomer/airflow-llm-providers-demo
    Apache Airflow https://airflow.apache.org/
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow

    22 min
  4. Introducing Airflow 3.2

    April 9

    Introducing Airflow 3.2

    We introduce Airflow 3.2 and its updates for teams that build and operate data pipelines. Astronomer's Head of Customer Education, Marc Lamberti, and Senior Manager of Developer Relations, Kenten Danas, break down what's new, from asset partitioning to async Python tasks and DAG versioning. They explore how these updates improve scheduling, performance and observability in production workflows.
    Key Takeaways:
    00:00 Introduction.
    02:10 Airflow 3 architecture separates workers from the metadata database.
    03:05 Plugin versioning and UI-based backfills simplify operations.
    06:20 Asset partitioning enables granular, partition-level scheduling.
    07:15 Triggering DAGs on partitions instead of full datasets.
    11:05 Deferrable operators reduce worker slot usage.
    12:00 Async operators reduce database pressure and overhead.
    14:10 Async improves throughput, not single-task speed.
    22:20 Inlets and outlets improve asset lineage visibility.
    23:00 DAG version markers show changes directly in the UI.
    Resources Mentioned:
    Marc Lamberti https://www.linkedin.com/in/marclamberti/
    Apache Airflow https://airflow.apache.org/
    Astronomer | LinkedIn https://www.linkedin.com/company/astronomer/
    Astronomer | Website https://www.astronomer.io/
    3.2 Webinar https://www.astronomer.io/events/webinars/introducing-airflow-3-2-video
    Asset Partitioning Guide https://www.astronomer.io/docs/learn/airflow-partitioned-runs
    Asynchronous Processes Guide https://www.astronomer.io/docs/learn/deferrable-operators
    Release Notes https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#airflow-3-2-0-2026-04-07
    Provider Registry https://airflow.apache.org/registry/
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow #MachineLearning
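The claim at 14:10 (async improves throughput, not single-task speed) is ordinary event-loop behavior. A plain asyncio illustration of the principle, not Airflow's own task runner:

```python
import asyncio
import time

async def wait_for_external(task_id: int, delay: float) -> int:
    # Stands in for waiting on an external system; awaiting frees the loop
    # to run other coroutines, much as deferral frees a worker slot.
    await asyncio.sleep(delay)
    return task_id

async def main() -> list[int]:
    # Each wait still takes 0.1 s (no single-task speedup), but 50 of them
    # overlap on one thread, so total wall time stays near 0.1 s, not 5 s.
    return await asyncio.gather(*(wait_for_external(i, 0.1) for i in range(50)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(len(results), round(elapsed, 2))
```

The same trade-off applies to Airflow's async operators: one task is no faster, but many concurrent waits stop monopolizing workers.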

    26 min
  5. Reflections on a Decade of Data Engineering at Seattle Data Guy

    April 3

    Reflections on a Decade of Data Engineering at Seattle Data Guy

    Lessons from the past decade of data engineering reveal how much the ecosystem has changed and what has stayed surprisingly consistent. In this episode, Benjamin Rogojan, Owner and Data Consultant at Seattle Data Guy, joins us to reflect on how the data engineering landscape has evolved alongside Apache Airflow. We explore when Airflow makes sense as an orchestrator, why batch processing is still dominant and how AI is reshaping the workflows and responsibilities of modern data engineers.
    Key Takeaways:
    00:00 Introduction.
    03:00 Airflow becomes valuable when workflows involve many pipelines, teams and dependencies.
    05:00 Data engineers are still focused on making data accessible and aligning work with business needs.
    05:30 Batch pipelines remain the most common approach even as real-time use cases grow.
    07:45 Many “real-time” requests are actually event-driven batch workflows.
    09:00 Airflow replaced many custom-built pipeline systems with built-in dependency management.
    11:00 Modern orchestration tools often build on Airflow concepts or differentiate from them.
    14:00 AI can assist with writing SQL and pipelines but still requires experienced engineers.
    15:30 Organizations are collecting increasingly granular data, creating more engineering demand.
    19:00 The data stack has shifted rapidly from Hadoop-era systems to modern cloud platforms.
    Resources Mentioned:
    Benjamin Rogojan https://www.linkedin.com/in/benjaminrogojan/
    Seattle Data Guy https://www.linkedin.com/company/seattle-data-guy/
    Apache Airflow https://airflow.apache.org
    Airflow Summit https://airflowsummit.org
    Snowflake https://www.snowflake.com
    HubSpot Data Sharing / APIs https://developers.hubspot.com
    MLflow https://mlflow.org
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow

    26 min
  6. Managing Data Quality and Governance With Airflow at Credit Karma with Ashir Alam

    March 26

    Managing Data Quality and Governance With Airflow at Credit Karma with Ashir Alam

    Data quality is not optional when you manage credit data at scale. In this episode, Ashir Alam, Senior Data Engineer at Credit Karma, joins us to share how his team acts as the gatekeeper for credit data ingestion, how they standardize data quality with Airflow and DAG Factory and how they scale safely across thousands of DAGs. We explore how governance, PII protection and orchestration come together inside a modern data platform.
    Key Takeaways:
    00:00 Introduction.
    01:00 Overview of Credit Karma's products and financial data ecosystem.
    02:00 The team acts as gatekeepers for ingesting data from TransUnion and Equifax.
    03:00 Why PII handling and controlled downstream access led to adopting Airflow.
    04:00 BigQuery as the warehouse and Airflow as the primary orchestrator.
    05:00 Why data quality and governance are critical in financial systems.
    07:00 Why Airflow was selected: ease of use and unified ETL plus data quality.
    09:00 Introduction to DAG Factory and YAML-based DAG generation.
    10:00 GitHub executor creates PR-driven DAG workflows with CI checks.
    12:00 BigQuery operators, structured checks and custom Slack and PagerDuty alerts.
    13:00 Failed checks stop ETL pipelines and trigger notifications.
    17:00 Scaling DAG Factory across thousands of DAGs and runtime vs. compile-time concerns.
    19:00 Future improvements: better defaults, retries and GenAI workflows in Airflow.
    Resources Mentioned:
    Ashir Alam https://www.linkedin.com/in/ashir-alam/
    Credit Karma https://www.linkedin.com/company/intuit-credit-karma/
    Apache Airflow https://airflow.apache.org/
    DAG Factory https://github.com/astronomer/dag-factory
    BigQuery (Google Cloud) https://cloud.google.com/bigquery
    GitHub https://github.com/
    Slack https://slack.com/
    PagerDuty https://www.pagerduty.com/
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow
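The fail-and-notify gate described at 12:00 and 13:00 can be sketched in a few lines. This is a generic pattern, not Credit Karma's code; the checks and notifier below are invented for illustration:

```python
class DataQualityError(Exception):
    """Raised when a check fails, so downstream tasks never run."""

def run_checks(rows: list[dict], notify=print) -> None:
    """Run simple quality checks; on failure, alert and halt.

    Raising here is how a check task fails a DAG run: Airflow then
    skips dependent tasks, which is the 'failed checks stop ETL' behavior.
    """
    failures = []
    if not rows:
        failures.append("table is empty")
    if any(r.get("user_id") is None for r in rows):
        failures.append("null user_id found")
    if failures:
        notify(f"data quality failed: {failures}")  # stand-in for Slack/PagerDuty
        raise DataQualityError("; ".join(failures))

def load_to_warehouse(rows: list[dict]) -> None:
    print(f"loaded {len(rows)} rows")

rows = [{"user_id": 1}, {"user_id": None}]
try:
    run_checks(rows)
    load_to_warehouse(rows)  # never reached when a check fails
except DataQualityError as e:
    print("pipeline halted:", e)
```

In a DAG Factory setup the equivalent checks would be declared in YAML and expanded into check tasks, but the gate-then-load ordering is the same.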

    22 min
  7. Open Source Airflow Contributions and Performance Improvements at G-Research with Christos Bisias

    March 19

    Open Source Airflow Contributions and Performance Improvements at G-Research with Christos Bisias

    Modern Airflow isn't just something you run; it's something you contribute to. In this episode, we explore how open source investment drives real performance gains and deeper observability. We're joined by Christos Bisias, Open Source Software Engineer, Apache Airflow at G-Research, to discuss how his team uses Airflow for large-scale data transformations, contributes upstream and improves scheduler throughput and OpenTelemetry support. From trace-level observability to CI-enforced metrics governance and a major scheduler optimization, this conversation spans strategy, engineering and community impact.
    Key Takeaways:
    00:00 Introduction.
    01:20 How G-Research applies machine learning and big data to predict financial market movements.
    02:15 Contributing to open source is a business decision.
    03:10 Maintaining a fork is costly.
    04:30 OpenTelemetry collects metrics, logs and traces to provide deep system visibility.
    06:10 Custom spans help identify bottlenecks inside tasks and enable performance optimization.
    08:05 OpenTelemetry integration works properly in Airflow 3.0 and above.
    10:00 A YAML-based metrics registry with CI enforcement ensures consistency between docs and exported metrics.
    12:10 Scheduler throughput improved significantly by applying concurrency limits earlier in the database query.
    15:20 Future Task SDK changes may enable language-agnostic DAG authoring beyond Python.
    Resources Mentioned:
    Christos Bisias https://www.linkedin.com/in/xbis/
    G-Research https://www.linkedin.com/company/g-research/
    Apache Airflow https://airflow.apache.org/
    OpenTelemetry https://opentelemetry.io/
    Prometheus https://prometheus.io/
    Grafana https://grafana.com/
    Jaeger https://www.jaegertracing.io/
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow
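The scheduler win at 12:10 is the classic "filter in the database, not in the application" move. An illustrative sqlite3 sketch with a made-up task table (not the actual Airflow schema or its scheduler query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?)",
    [(i, "scheduled" if i % 2 else "success") for i in range(10_000)],
)

# Late limiting: fetch every candidate row, then cap concurrency in Python.
all_rows = conn.execute(
    "SELECT id FROM task_instance WHERE state = 'scheduled' ORDER BY id"
).fetchall()
late = all_rows[:32]

# Early limiting: the database applies the cap and returns only 32 rows,
# so far less data crosses the wire and fewer rows are materialized.
early = conn.execute(
    "SELECT id FROM task_instance WHERE state = 'scheduled' ORDER BY id LIMIT 32"
).fetchall()

print(len(all_rows), len(early))
```

Both approaches pick the same 32 tasks; the early-limit version simply stops the database from producing thousands of rows the scheduler would discard anyway.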

    18 min
  8. Automating Threat Intelligence Using Airflow with Karan Alang

    March 12

    Automating Threat Intelligence Using Airflow with Karan Alang

    In this episode, Karan Alang, Principal Software Engineer at Versa Networks, joins the conversation to discuss how Airflow can be used to automate threat intelligence in modern cybersecurity environments. He explains the growing scale of cloud computing, the profitability of hacking and the shortage of SOC analysts. Karan also outlines a novel architecture that combines Airflow, XDR, graph databases and LLMs to orchestrate automated threat detection and response.
    Key Takeaways:
    00:00 Introduction.
    05:00 Organizations face massive log volumes and a shortage of SOC analysts.
    07:00 The solution integrates Airflow, XDR, Neo4j graph databases and LLMs into one architecture.
    08:00 MITRE ATT&CK provides a global framework for mapping tactics and techniques.
    11:00 Airflow acts as the orchestration backbone for ingestion, graph transformation and LLM workflows.
    13:00 Graph databases provide a full relationship view of attackers' systems and entities.
    14:00 LLMs automate mapping activity to MITRE ATT&CK and assign explainable risk scores.
    17:00 Traditional signature-based detection allows lateral movement and exfiltration before teams can react.
    18:00 End-to-end automation is essential to mitigating modern cybersecurity threats.
    20:00 Future opportunities include deeper LLM integration as first-class citizens within Airflow.
    Resources Mentioned:
    Karan Alang https://www.linkedin.com/in/karan-alang-4173437
    Versa Networks | LinkedIn https://www.linkedin.com/company/versa-networks
    Versa Networks | Website https://versa-networks.com
    Google Cloud Composer (Managed Airflow on GCP) https://cloud.google.com/composer
    Microsoft Defender XDR https://www.microsoft.com/es-es/security/business/siem-and-xdr/microsoft-defender-xdr
    Neo4j (Graph Database) https://neo4j.com
    MITRE ATT&CK Framework https://attack.mitre.org
    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.
    #AI #Automation #Airflow #MachineLearning
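The relationship view discussed at 13:00 is what a graph query buys you; in Neo4j it would be a Cypher traversal, but the core idea fits in a stdlib breadth-first search over a toy entity graph (the hosts and edges below are invented):

```python
from collections import deque

# Toy entity graph: an edge means "has a session on / can reach".
graph = {
    "workstation-7": ["file-server", "jump-host"],
    "jump-host": ["db-server"],
    "file-server": [],
    "db-server": ["backup-store"],
    "backup-store": [],
}

def reachable(start: str) -> set[str]:
    """Breadth-first search: everything an attacker on `start` could pivot to."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(reachable("workstation-7")))
```

This is the lateral-movement question from 17:00 posed as a graph query: a flat log table answers "what happened on this host," while the graph answers "where can this compromise spread."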

    22 min