Data Engineering Podcast

Tobias Macey

4.6 (16)
Technology
Updated weekly

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

6 Jul

Building the Context Flywheel for AI Data Agents

Summary
In this episode Prukalpa Sankar, co-founder of Atlan, talks about what it takes to build a “context flywheel” for AI agents in data-intensive organizations. She explains why model intelligence alone isn’t enough to make AI useful in production, and how real performance depends on contextual intelligence: institutional knowledge, semantic meaning, procedural know-how, and access to the right tools. She also digs into how metadata catalogs are evolving into broader context layers that serve both humans and agents, and why agentic systems are changing the economics of metadata and governance work. Prukalpa shares Atlan’s perspective on bootstrapping context from existing systems such as warehouses, BI tools, query logs, and SaaS applications, then using simulation, traces, and human governance loops to improve agent accuracy over time.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Prukalpa Sankar about strategies for building a context flywheel for your data agents.

Interview
Introduction
How did you get involved in the area of data management?
You have spent several years working in the metadata catalog space with Atlan. What are the notable changes in scope, adoption, and application that you have seen since we last spoke (June 2022)?
The recurring theme since the start of 2026 has been agentic augmentation of all engineering workflows, including data. How do you differentiate between data catalogs, semantic layers, agent memory, context layers, etc. when architecting an AI-powered data-oriented system?
One of the perennial problems with data catalogs, business glossaries, master data management, etc. is the up-front investment required to get a real-world impact. How can agents help reduce the activation energy needed to get to that return on effort?
One of the perennial problems in data engineering is fragmentation and siloing of data. This is exacerbated by AI systems due to the introduction of vector data as a new specialization. What are the forces that you are seeing play into the current set of tensions and the architectural primitives that we need to bring to bear to keep things maintainable?
Since the introduction of transformer-based generative models we have been combating hallucinations. While we have made progress, it is still critical to ensure accuracy and trustworthiness when working with business data. What are the policy elements of governance and technical controls to ensure a high degree of confidence in agent-generated context and business semantics?
What are the most interesting, innovative, or unexpected ways that you have seen teams build context layers for their agentic data workloads?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on business context engineering?
When is agent-managed context the wrong choice?
What are your predictions for the next set of architectural shifts that will be driven by the pressures of AI-powered systems?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links
Atlan, Atlan Context Lakehouse, Iceberg, Business Glossary, Master Data Management, Semantic Layer, Cube.dev, MCP == Model Context Protocol, A2A == Agent to Agent Protocol, Decision Traces, Apache Doris, StarRocks

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
18 Jun

Holding Kafka Right: Product-Friendly Streaming with TypeStream

Summary
In this episode Jevin Maltais talks about the practical realities of building reliable, product-focused streaming systems with Kafka. Jevin shares lessons from roles at Zapier, Humi, and Clio, where real-time synchronization, customer data unification, and document sync at scale highlighted both the strengths and common misuses of Kafka. He digs into using events as the source of truth, materialized views with KTables, and how schema registries and type safety prevent downstream breakage. Jevin explains why teams often reach for heavyweight Kafka clusters without leveraging Streams, Connect, or interactive queries, and how his project, TypeStream, aims to make those capabilities accessible via config-as-code while keeping a thin abstraction and clear escape hatches. He also explores trade-offs across Kafka-compatible alternatives, CDC with Debezium in the real world, and where abstractions should stop so teams can scale responsibility as complexity grows.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.
Your host is Tobias Macey and today I'm interviewing Jevin Maltais about the challenges of building a reliable streaming system.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what TypeStream is and the story behind it?
What are the common challenges that teams encounter when trying to build on top of Kafka?
How do those challenges/misconfigurations impact the team's ability to deliver on product goals?
What are the fundamental design aspects of Kafka that contribute to the difficulties that teams encounter when using it as an element of their architecture?
There have been numerous projects taking aim at Kafka, with varying approaches and degrees of effectiveness (e.g. RedPanda, AutoMQ, Pulsar, etc.). What are the tradeoffs that each of those approaches requires?
What makes the original Kafka project so resilient in the face of all of that competition?
Can you describe the architecture of TypeStream and how each of the core elements contributes to a better user experience?
For teams who want to take advantage of streaming capabilities, but don't want to invest in becoming Kafka experts, what does the TypeStream workflow look like?
If they don't want to manage the operational overhead of a Kafka cluster, how tightly coupled is TypeStream to the original Kafka? (Can someone use RedPanda or AutoMQ instead?)
What are the most interesting, innovative, or unexpected ways that you have seen TypeStream used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on TypeStream?
When is TypeStream the wrong choice?
What do you have planned for the future of TypeStream?

Contact Info
Website

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
TypeStream, Zapier, Airflow, Kafka, KTables, KSQL, RedPanda, Pulsar, AutoMQ, Kafka Schema Registry, Debezium, Change Data Capture, Kafka Connect, Terraform, Kafka Compacted Topic
8 Jun

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

Summary
In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.
Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at Kaarvi.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Kaarvi is and the story behind it?
"AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?
What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?
What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?
When is Kaarvi the wrong choice?
What do you have planned for the future of Kaarvi?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Kaarvi, Synthetic Data, n8n
1 Jun

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Summary
In this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB, without loading it into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications, from cybersecurity log analysis to entity resolution and agentic workflows; when to choose embedded or transactional graph databases instead; and what’s next for enterprise features and broader warehouse integrations.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.
Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graph.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what PuppyGraph is and the story behind it?
What are some of the key use cases that people are turning to PuppyGraph and graph data models for?
Graph engines have struggled to take off for several years, not least because of the difficulty of scaling them to large data volumes as a result of the topological nature of the data. Can you describe the architecture of PuppyGraph and some of the ways that you are addressing that challenge of data volume for graphs?
  latency/data exploration
  types of traversals and limitations
  lakehouse architecture pros/cons for graphs
  data modeling/translation
  shortcomings of zero-ETL and how transforming the underlying representation could provide benefits
For someone who is looking for a graph engine to support a connected data use case, what are the guiding questions that you would ask to lead them toward PuppyGraph vs. a dedicated graph database like Memgraph/Neo4J/etc.?
What are the most interesting, innovative, or unexpected ways that you have seen PuppyGraph used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on PuppyGraph?
When is PuppyGraph the wrong choice?
What do you have planned for the future of PuppyGraph and graph data exploration on large data volumes?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
PuppyGraph, TigerGraph, Google F1, Graph Database, Google Pregel, Iceberg, Graph Supernode, MPP == Massively Parallel Processing, Spark GraphX, Trino, Ladybug DB, lance-graph, KuzuDB, MemGraph, Labelled Property Graph, RDF Triples, Cypher Query Language, Gremlin, CDC == Change Data Capture, Neo4J, JanusGraph, NetworkX, PyTorch, DuckDB, Iceberg Array, LanceDB, Palo Alto Networks, Columnar ADBC
6 May

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Summary
In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applications.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the major contributors to wasted or idle compute?
Why does it matter if the available compute isn't being maximized?
What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)?
What are the most interesting, innovative, or unexpected ways that you have seen Ray used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?
When is Ray the wrong choice?
What do you have planned for the future of Ray?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
AnyScale, Ray, Deep Learning, Computer Vision, Kubernetes, Cursor, Claude Code, Kube-Ray, PyTorch, Tensorflow, Theano, Caffe, vLLM, SGLang, Ray Tune, Neural Network, Learning Rates, Reinforcement Learning, AlphaGo, Cursor Composer 2, ImageNet, Transformer Architecture, Stochastic Gradient Descent, Airflow, Dagster, Flyte, Mixture of Experts, Prefill, Temporal, Actor Framework, RDMA == Remote Direct Memory Access, Neoclouds, AI Engineering Podcast Episode
7 Apr

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Summary
In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery, why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations, and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest, we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026.

Interview
Introduction
How did you get involved in the area of data management?
What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?
How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Blog Post, Datafold, Claude Opus 4.5, Harry Potter - Muggles, Jevons Paradox, Modern Data Stack, Dagster Compass, Gravity Orion, MCP == Model Context Protocol, Qwen
29 Mar

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

Summary
In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations (event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers) needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest, we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption-based business models.

Interview
Introduction
How did you get involved in managing data products or data management?
Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?
What are the unique operational challenges that are presented by having consumption as the unit of cost?
How does the availability and accessibility of metering data impact the level of detail/nuance that the business can employ in their pricing strategies?
When we talk about the infrastructure for usage tracking, it often feels like a high-stakes stream processing problem. What are the core architectural components required to build a reliable metering pipeline?
How do you think about the trade-offs between "push" models (application emits events) vs. "pull" models (the platform scrapes resource usage)?
Accuracy is non-negotiable when data is tied directly to revenue. What are the strategies for ensuring idempotency and handling deduplication in the ingestion layer?
How do you address the "late-arriving data" problem in a usage-based world, especially when dealing with monthly billing cycles or credit exhaustion?
From an uptime and reliability perspective, should the metering system be in the critical path of the service itself?
If the metering service is down, do you "fail open" and provide free service, or "fail closed" and impact availability? How do you build for that kind of resilience?
One of the common pitfalls is treating metering like logging or observability. How do you ensure that usage metering is treated as a first-class product priority rather than an afterthought for the platform team?
What does the interface look like for product engineers to "register" a new billable event without breaking the downstream data contract?
Once you have this data, there is often a requirement for real-time visibility for the end user. What are the data modeling requirements to support both "high-volume ingestion" and "low-latency querying" for customer-facing billing dashboards?
How do you bridge the gap between the raw event stream and the aggregated "billable unit" in the data warehouse or lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen usage-based metering used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building consumption-based data platforms?
When is usage-based metering the wrong choice? (e.g., When does the complexity of the data platform outweigh the economic benefits?)
What are your predictions for the future of consumption-based data architectures?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Hackernoon Post, COGS == Cost of Goods Sold, Medallion Architecture
22 Mar

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

Summary

In this episode Rowan Cockett, co-founder and CEO of Curvenote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis, from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how Curvenote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: every department needs dashboards, reports, and custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. They type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest, we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what your interest is in reproducibility of scientific research?
What role does data play in the set of challenges that plague reproducibility of published research?
What are some of the notable changes in the areas of scientific process and data systems that have contributed to the current crisis of reproducibility?
Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?
How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?
Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?
What are the elements of the work that you are doing at the Continuous Science Foundation and Curvenote to break the status quo?
Are there any areas of study that are more susceptible to friction and siloing of their data?
What does a typical engagement with a research group look like as you try to improve the accessibility of their work?
What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?
What is the next set of challenges that you are focused on addressing in the research/reproducibility space?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

Continuous Science Foundation
Curvenote
Zenodo
Dryad
HDF5
Iceberg
Zarr
Myst Markdown
Jupyter Notebook
ArXiv
Journal of Open Source Software (JOSS)
Data Carpentry
Software Carpentry
Open Rxiv
Bio Rxiv
Med Rxiv
Force 11
Jupyter Book
Open Exchange Architecture (OXA)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


4.6 out of 5 (16 Ratings)

The missing Data Engineering Podcast!

15/11/2017

GreatStuff123

I just found out about this podcast while browsing Twitter and seeing that the host of another of my favourite podcasts (Tobias Macey from Podcast.__init__) had a new podcast on data engineering. With the demise of several older Hadoop podcasts and O'Reilly's more business-focused data podcast, a new series like this one was sorely needed for discussions of current data architectures and pipelines. Thanks and keep up the good work Tobias, I've already learned so much after binging the first several podcasts! Looking forward to the next interviews.

Creator: Tobias Macey
Years Active: 2017 - 2026
Episodes: 514
Rating: Clean
Show Website: Data Engineering Podcast


