Data Engineering Podcast

Tobias Macey

5.0 (1)
Technology
Updated weekly

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

1 day ago

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

Summary In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at KaarviInterview IntroductionHow did you get involved in the area of data management?Can you describe what Kaarvi is and the story behind it?"AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?When is Kaarvi the wrong choice?What do you have planned for the future of Kaarvi? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links KaarviSynthetic Datan8n The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

53 min
1 Jun

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Summary In this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB—without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications—from cybersecurity log analysis to entity resolution and agentic workflows—when to choose embedded or transactional graph databases instead, and what’s next for enterprise features and broader warehouse integrations. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graphInterview IntroductionHow did you get involved in the area of data management?Can you start by describing what PuppyGraph is and the story behind it?What are some of the key use cases that people are turning to PuppyGraph and graph data models for?Graph engines have struggled to take off for several years, not least of which is due to the difficulty of scaling them to large data volumes as a result of the topological nature of the data. Can you describe the architecture of PuppyGraph and some of the ways that you are addressing that challenge of data volume for graphs?latency/data explorationtypes of traversals and limitationslakehouse architecture pros/cons for graphsdata modeling/translationshortcomings of zero-ETL and how transforming the underlying representation could provide benefitsFor someone who is looking for a graph engine to support a connected data use case, what are the guiding questions that you would ask to lead them toward PuppyGraph vs. a dedicated graph database like Memgraph/Neo4J/etc.?What are the most interesting, innovative, or unexpected ways that you have seen PuppyGraph used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on PuppyGraph?When is PuppyGraph the wrong choice?What do you have planned for the future of PuppyGraph and graph data exploration on large data volumes?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links PuppyGraphTigerGraphGoogle F1Graph DatabaseGoogle PregelIcebergGraph SupernodeMPP == Massively Parallel ProcessingSpark GraphXTrinoLadybug DBlance-graphKuzuDBMemGraphLabelled Property GraphRDF TriplesCypher Query LanguageGremlinCDC == Change Data CaptureNeo4JJanusGraphNetworkXPyTorchDuckDBIceberg ArrayLanceDBPalo Alto NetworksColumnar ADBCThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA %

54 min
6 May

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Summary In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYour host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applicationsInterview IntroductionHow did you get involved in the area of data management?Can you start by giving an overview of the major contributors to wasted or idle compute?Why does it matter if the available compute isn't being maximized?What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)? What are the most interesting, innovative, or unexpected ways that you have seen Ray used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?When is Ray the wrong choice?What do you have planned for the future of Ray?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links AnyScaleRayDeep LearningComputer VisionKubernetesCursorClaude CodeKube-RayPyTorchTensorflowTheanoCaffevLLMSGLangRay TuneNeural NetworkLearning RatesReinforcement LearningAlphaGoCursor Composer 2ImageNetTransformer ArchitectureStochastic Gradient DescentAirflowDagsterFlyteMixture of ExpertsPrefillTemporalActor FrameworkRDMA == Remote Direct Memory AccessNeocloudsAI Engineering Podcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

59 min
7 Apr

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Summary In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026 Interview IntroductionHow did you get involved in the area of data management?What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links Blog PostDatafoldClaude Opus 4.5Harry Potter - MugglesJevon's ParadoxModern Data StackDagster CompassGravity OrionMCP == Model Context ProtocolQwen The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

59 min
29 Mar

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

Summary In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations - event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers - needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption based business modelsAnnouncements IntroductionHow did you get involved in managing the data products or data management?Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?What are the unique operational challenges that are presented by having consumption as the unit of cost?How does the availability and accessibility of metering data impact the level of detail/nuance that the business can employ in their pricing strategies?When we talk about the infrastructure for usage tracking, it often feels like a high-stakes stream processing problem. What are the core architectural components required to build a reliable metering pipeline?How do you think about the trade-offs between "push" models (application emits events) vs. "pull" models (the platform scrapes resource usage)?Accuracy is non-negotiable when data is tied directly to revenue. What are the strategies for ensuring idempotency and handling deduplication in the ingestion layer?How do you address the "late-arriving data" problem in a usage-based world, especially when dealing with monthly billing cycles or credit exhaustion?From an uptime and reliability perspective, should the metering system be in the critical path of the service itself?If the metering service is down, do you "fail open" and provide free service, or "fail closed" and impact availability? How do you build for that kind of resilience?One of the common pitfalls is treating metering like logging or observability. How do you ensure that usage metering is treated as a first-class product priority rather than an afterthought for the platform team?What does the interface look like for product engineers to "register" a new billable event without breaking the downstream data contract?Once you have this data, there is often a requirement for real-time visibility for the end user. What are the data modeling requirements to support both "high-volume ingestion" and "low-latency querying" for customer-facing billing dashboards?How do you bridge the gap between the raw event stream and the aggregated "billable unit" in the data warehouse or lakehouse?What are the most interesting, innovative, or unexpected ways that you have seen usage-based metering used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on building consumption-based data platforms?When is usage-based metering the wrong choice? (e.g., When does the complexity of the data platform outweigh the economic benefits?)What are your predictions for the future of consumption-based data architectures? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Hackernoon PostCOGS == Cost of Good SoldMedallion Architecture The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

50 min
22 Mar

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

Summary In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce Interview IntroductionHow did you get involved in the area of data management?Can you describe what your interest is in reproducibility of scientific research?What role does data play in the set of challenges that plague reproducibility of published research?What are some of the notable changes in the areas of scientific process, and data systems that have contributed to the current crisis of reproducibility?Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?What are the elements of the work that you are doing at the Continous Science Foundation and Curvenote to break the status quo?Are there any areas of study that you are more susceptible to friction and siloing of their data?What does a typical engagement with a research group look like as you try to improve the accessibility of their work?What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?What are the next set of challenges that you are focused on addressing in the research/reproducibility space? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links Continuous Science FoundationCurvenoteZenodoDryadHDF5IcebergZarrMyst MarkdownJupyter NotebookArXivJournal of Open Source Software (JOSS)Data CarpentrySoftware CarpentryOpen RxivBio RxivMed RxivForce 11JupyterBookOpen Exchange Architecture (OXA) The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

43 min
16 Mar

Beyond Prompts: Practical Paths to Self‑Improving AI

Summary In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems — and how they enable AI scalability in real production environments. Interview IntroductionHow did you get involved in AI/ML?Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent? How would you differentiate between an agent/agentic system vs. a self-improving system? One of the components that are becoming common in agentic architectures is a "memory" layer. What are some of the ways that contributes to a self-improvement feedback loop? In what ways are memory layers insufficient for a generalized self-improvement capability? For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems? One of the perennial challenges for technology leaders is how to build AI systems that scale over time. How has AI changed the way you think about long-term advantage? How do self-improvement feedback loops contribute to AI scalability in real systems? What are some of the other key elements necessary to build a truly evolutionary AI system? What are the hidden costs of building these AI systems that teams should know before starting? I’m talking about enterprise who are deploying AI into their internal mission-critical workflows. What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented? What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems? What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement? Contact Info LinkedIn Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Parting Question From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today? Links Symphony AIReinforcement LearningAgentic MemoryIn-Context LearningContext EngineeringFew-Shot LearningOpenClawDeep Research AgentRAG == Retrieval Augmented GenerationAgentic SearchGoogle Gemma ModelsOllama The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

1hr 2min
8 Mar

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

Summary In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics Interview IntroductionHow did you get involved in the area of data management?Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?How does the semantic layer relate to and differ from the physical schema of a data warehouse?In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?What are some of the ways that LLMs/agents can help to populate the semantic layer?What are the cases where you want to guard against hallucinations by keeping a human in the loop?Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?What are the most interesting, innovative, or unexpected ways that you have seen Orion used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?When is Orion the wrong choice?What do you have planned for the future of Orion? Contact Info LucasLinkedInDrewLinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links GravityOrionLookerSemantic LayerdbtLookMLTableauOpenClawPareto Distribution The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

1hr 5min

See All (512)

Creator

Tobias Macey
Years Active

2017 - 2026
Episodes

512
Rating

Clean
Show Website

Data Engineering Podcast

Technology

Technology

Updated weekly
Technology

Technology

Updated twice weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated twice weekly

Data Engineering Podcast

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

Beyond Prompts: Practical Paths to Self‑Improving AI

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

About

Information

You Might Also Like

Data Engineering Podcast

Episodes

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

Beyond Prompts: Practical Paths to Self‑Improving AI

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

About

Information

You Might Also Like