The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI

Astronomer

Welcome to The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI, the podcast where we keep you up to date with the insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast Webpage: https://www.astronomer.io/podcast/

  1. Automating Threat Intelligence Using Airflow with Karan Alang

    1D AGO

    In this episode, Karan Alang, Principal Software Engineer at Versa Networks, joins the conversation to discuss how Airflow can be used to automate threat intelligence in modern cybersecurity environments. He explains the growing scale of cloud computing, the profitability of hacking and the shortage of SOC analysts. Karan also outlines a novel architecture that combines Airflow, XDR, graph databases and LLMs to orchestrate automated threat detection and response.

    Key Takeaways:
    00:00 Introduction.
    05:00 Organizations face massive log volumes and a shortage of SOC analysts.
    07:00 The solution integrates Airflow, XDR, Neo4j graph databases and LLMs into one architecture.
    08:00 MITRE ATT&CK provides a global framework for mapping tactics and techniques.
    11:00 Airflow acts as the orchestration backbone for ingestion, graph transformation and LLM workflows.
    13:00 Graph databases provide a full relationship view of attackers’ systems and entities.
    14:00 LLMs automate mapping activity to MITRE ATT&CK and assign explainable risk scores.
    17:00 Traditional signature-based detection allows lateral movement and exfiltration before teams can react.
    18:00 End-to-end automation is essential to mitigating modern cybersecurity threats.
    20:00 Future opportunities include deeper LLM integration as first-class citizens within Airflow.

    Resources Mentioned:
    Karan Alang https://www.linkedin.com/in/karan-alang-4173437
    Versa Networks | LinkedIn https://www.linkedin.com/company/versa-networks
    Versa Networks | Website https://versa-networks.com
    Google Cloud Composer (Managed Airflow on GCP) https://cloud.google.com/composer
    Microsoft Defender XDR https://www.microsoft.com/es-es/security/business/siem-and-xdr/microsoft-defender-xdr
    Neo4j (Graph Database) https://neo4j.com
    MITRE ATT&CK Framework https://attack.mitre.org

    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning

    22 min
  2. Using Plugins To Customize Airflow at Ponder Labs with Egor Tarasenko

    MAR 5

    In this episode, we explore how teams scale Apache Airflow in complex environments and what it takes to make orchestration work across many stakeholders. We look at real-world challenges around visibility, ownership and predictability as data platforms grow. Egor Tarasenko, Data and AI Engineer at Ponder Labs, joins us to share how Ponder Labs customizes Airflow for education organizations using plugins, event-driven architectures and AI-powered tooling. He explains how his team supports large charter school networks and why structure, consistency and extensibility become critical at scale.

    Key Takeaways:
    00:00 Introduction.
    01:21 Ponder Labs helps education organizations bring data from many systems together so it becomes useful for teachers, school leaders and administrators.
    03:10 Airflow serves as the backbone for orchestrating ingestion, transformation and reverse ETL across client data platforms.
    05:43 Everything is triggered from Airflow to maintain dependency, visibility and a single operational picture.
    09:05 Managing hundreds of DAGs requires a focus on structure, visibility and consistency across teams.
    09:51 Treating DAGs like APIs helps teams scale without needing deep knowledge of upstream logic.
    12:00 Custom plugins like schedule insights help predict DAG run times across layered dependencies.
    15:00 AI-powered Airflow chat enables non-technical stakeholders to understand DAG ownership, dependencies and cluster activity.
    22:06 Migrating plugins to Airflow 3 improves developer experience through cleaner APIs and faster extensibility.

    Resources Mentioned:
    Egor Tarasenko https://www.linkedin.com/in/egorseno/
    Apache Airflow https://airflow.apache.org
    dbt https://www.getdbt.com
    Astronomer Astro Platform https://www.astronomer.io
    Egor Tarasenko on Substack https://egortarasenko.substack.com

    28 min
  3. Scaling Airflow at Wix for Analytics and AI with Ethan Shalev

    FEB 26

    Modern data orchestration at scale demands reliability, speed and thoughtful adoption of new tooling. As organizations grow, keeping pipelines efficient while supporting more teams becomes a critical challenge. In this episode, we’re joined by Ethan Shalev, Data Engineer at Wix, to discuss how Wix operates Airflow at massive scale, migrates to Airflow 3 and uses AI to accelerate development.

    Key Takeaways:
    00:00 Introduction.
    02:13 Wix structures data engineering across multiple product-focused organizations.
    03:40 Migrating nearly 8,000 DAGs to Airflow 3 requires careful planning.
    04:31 Migration creates an opportunity to remove long-standing legacy Airflow code.
    05:32 Internal playbooks and Cursor rules standardize and speed up DAG migrations.
    07:39 Airflow 3 introduces backfills, DAG versioning and asset-aware scheduling.
    09:16 Deferrable operators reduce scheduler congestion in large Airflow environments.
    12:54 AI-generated code still requires review and strong testing practices.
    14:52 Moving to managed Airflow reduces operational burden on internal platform teams.
    15:57 Improving multi-tenancy and UI personalization remains a key Airflow need.

    Resources Mentioned:
    Ethan Shalev https://www.linkedin.com/in/eshalev/
    Wix | LinkedIn https://www.linkedin.com/company/wix-com/
    Wix | Website https://www.wix.com/
    Apache Airflow https://airflow.apache.org/
    Astronomer https://www.astronomer.io/
    Trino https://trino.io/
    Apache Iceberg https://iceberg.apache.org/
    Cursor https://cursor.sh/
    Airflow Summit https://airflowsummit.org/

    18 min
  4. Using Airflow To Orchestrate Billions of Events at Addi with Carlos Daniel Puerto Niño

    FEB 19

    Strong data orchestration is as much about culture and visibility as it is about technology. As data platforms scale, teams need systems that reduce cognitive load while increasing reliability and observability. In this episode, Carlos Daniel Puerto Niño, Senior Analytics Engineer and Data Analyst at Addi, joins us to share how Addi uses Airflow to support batch orchestration, manage organizational complexity and improve monitoring across its data platform.

    Key Takeaways:
    00:00 Introduction.
    01:25 Changes in company strategy increase data platform complexity over time.
    04:00 Centralized data teams help manage organizational and technical change.
    06:08 Scalable architectures support growing data volumes and use cases.
    09:10 Adopting orchestration tools introduces operational and maintenance challenges.
    14:43 Abstraction layers lower technical barriers for onboarding new team members.
    15:36 Modularity and visibility improve the reliability of data pipelines.
    18:14 Integrated monitoring supports faster incident response and resolution.
    22:19 Limited access to orchestration metadata constrains proactive analysis.

    Resources Mentioned:
    Carlos Daniel Puerto Niño https://www.linkedin.com/in/carlospuertoni%C3%B1o/
    Addi | LinkedIn https://www.linkedin.com/company/addicol/
    Addi | Website https://www.addi.com
    Apache Airflow https://airflow.apache.org/
    Astronomer https://www.astronomer.io/
    Databricks https://www.databricks.com/
    dbt https://www.getdbt.com/
    Grafana https://grafana.com/
    Slack https://slack.com/

    25 min
  5. Building Event-Driven Data Pipelines With Airflow 3 at Astrafy with Andrea Bombino

    FEB 12

    Real-time data expectations are reshaping how modern data teams think about orchestration and dependencies. As event-driven architectures become more common, teams need to rethink how pipelines react to data changes, rather than schedules. In this episode, Andrea Bombino, Co-Founder and Head of Analytics Engineering at Astrafy, joins us to discuss how event-driven scheduling in Airflow is evolving and how Astrafy applies it to deliver faster, more responsive data pipelines.

    Key Takeaways:
    00:00 Introduction.
    02:02 Astrafy’s role in guiding clients across the modern data stack.
    03:15 Strong DAG dependencies create challenges for time-based scheduling.
    04:48 Event-driven pipelines respond to increasing real-time data demands.
    05:30 Airflow 3 introduces native support for event-driven orchestration.
    06:27 Sensor-based workflows reveal scalability and efficiency limitations.
    11:32 Event-driven assets improve efficiency and pipeline elegance.
    14:45 Governance and cross-instance coordination emerge as ongoing challenges.

    Resources Mentioned:
    Andrea Bombino https://www.linkedin.com/in/andrea-bombino/
    Astrafy | LinkedIn https://www.linkedin.com/company/astrafy/
    Astrafy | Website https://www.astrafy.io
    Apache Airflow https://airflow.apache.org/
    Google Cloud https://cloud.google.com/
    Google Pub/Sub https://cloud.google.com/pubsub
    Google BigQuery https://cloud.google.com/bigquery

    19 min
  6. Uphold’s Approach to Orchestrating Modern Data Workflows with Jaime Oliveira

    FEB 5

    A strong data-driven mindset underpins how fintech teams scale analytics, infrastructure and decision-making across the business. In this episode, Jaime Oliveira, Lead Data Engineer at Uphold, joins us to discuss how Uphold structures its data organization and orchestration strategy. Jaime shares how the team uses Airflow and dbt to support analytics, reporting and data activation while evolving their approach as the stack grows.

    Key Takeaways:
    00:00 Introduction.
    01:23 A data-driven mindset supports product development and business decisions.
    02:55 Diverse ingestion pipelines enable scalable analytics.
    04:18 A single orchestration platform simplifies analytics workflows.
    05:17 Early experience with orchestration tools shapes engineering practices.
    08:16 Analytics orchestration works best when aligned with transformation workflows.
    09:25 Infrastructure choices involve tradeoffs in testing, visibility and overhead.
    16:39 More collaborative workflow tools could improve accessibility and autonomy.

    Resources Mentioned:
    Jaime Oliveira https://www.linkedin.com/in/jaime-oliveira-b075855a/
    Uphold | LinkedIn https://www.linkedin.com/company/upholdinc/
    Uphold | Website https://uphold.com
    Apache Airflow https://airflow.apache.org
    dbt https://www.getdbt.com
    Snowflake https://www.snowflake.com
    Kubernetes https://kubernetes.io
    Astronomer Cosmos https://astronomer.github.io/astronomer-cosmos
    Cosmos e-book https://www.astronomer.io/ebooks/orchestrating-dbt-with-airflow-using-cosmos/

    19 min
  7. Modern Airflow Best Practices for Scalable Data Pipelines with Bhavani Ravi

    JAN 29

    Building reliable data pipelines at scale requires more than writing code. It depends on thoughtful design, infrastructure trade-offs and an understanding of how orchestration platforms evolve over time. In this episode, we examine Airflow best practices shaped by real-world implementation. Bhavani Ravi, Independent Software Consultant and Apache Airflow Champion, shares lessons on pipeline design, architectural decisions and the evolution of the Airflow ecosystem in modern data environments.

    Key Takeaways:
    00:00 Introduction.
    01:30 Independent consulting supports effective Airflow adoption.
    02:38 Early challenges shaped modern Airflow practices.
    03:21 Airflow setup has become significantly simpler.
    04:30 New features expanded workflow capabilities.
    06:03 Frequent releases support long-term sustainability.
    07:34 Community and providers strengthen the ecosystem.
    10:03 Pipeline design should come before coding.
    10:55 Decoupling logic requires careful trade-offs.
    13:30 Plugins extend Airflow into new use cases.

    Resources Mentioned:
    Bhavani Ravi https://www.linkedin.com/in/bhavanicodes/
    Apache Airflow https://airflow.apache.org/
    Kubernetes https://kubernetes.io/
    Azure Fabric https://learn.microsoft.com/en-us/fabric/

    17 min
  8. Inside Conviva’s Decision To Power Its Data Platform With Airflow with Han Zhang

    JAN 22

    Conviva operates at a massive scale, delivering outcome-based intelligence for digital businesses through real-time and batch data processing. As new use cases emerged, the team needed a way to extend a streaming-first architecture without rebuilding core systems. In this episode, Han Zhang joins us to explain how Conviva uses Apache Airflow as the orchestration backbone for its batch workloads, how the control plane is designed and what trade-offs shaped their platform decisions.

    Key Takeaways:
    00:00 Introduction.
    01:17 Large-scale data platforms require low-latency processing capabilities.
    02:08 Batch workloads can complement streaming pipelines for additional use cases.
    03:45 An orchestration framework can act as the core coordination layer.
    06:12 Batch processing enables workloads that streaming alone cannot support.
    08:50 Ecosystem maturity and observability are key orchestration considerations.
    10:15 Built-in run history and logs make failures easier to diagnose.
    14:20 Platform users can monitor workflows without managing orchestration logic.
    17:08 Identity, secrets and scheduling present ongoing optimization challenges.
    19:59 Configuration history and change visibility improve operational reliability.

    Resources Mentioned:
    Han Zhang https://www.linkedin.com/in/zhanghan177
    Conviva | Website http://www.conviva.com
    Apache Airflow https://airflow.apache.org/
    Celery https://docs.celeryq.dev/
    Temporal https://temporal.io/
    Kubernetes https://kubernetes.io/
    LDAP https://ldap.com/

    22 min
5 out of 5 (20 Ratings)
