Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

  1. -5 J

    From Academia to Industry: Bridging Data Engineering Challenges

    Summary In this episode of the Data Engineering Podcast Professor Paul Groth, from the University of Amsterdam, talks about his research on knowledge graphs and data engineering. Paul shares his background in AI and data management, discussing the evolution of data provenance and lineage, as well as the challenges of data integration. He explores the impact of large language models (LLMs) on data engineering, highlighting their potential to simplify knowledge graph construction and enhance data integration. The conversation covers the evolving landscape of data architectures, managing semantics and access control, and the interplay between industry and academia in advancing data engineering practices, with Paul also sharing insights into his work with the intelligent data engineering lab and the importance of human-AI collaboration in data engineering pipelines. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineeringInterview IntroductionHow did you get involved in the area of data management?Can you start by describing the focus and scope of your academic efforts?Given your focus on data management for machine learning as part of the INDELab, what are some of the developing trends that practitioners should be aware of?ML architectures / systems changing (matteo interlandi) GPUs for data mangementYou have spent a large portion of your career working with knowledge graphs, which have largely been a niche area until recently. What are some of the notable changes in the knowledge graph ecosystem that have resulted from the introduction of LLMs?What are some of the other ways that you are seeing LLMs change the methods of data engineering?There are numerous vague and anecdotal references to the power of LLMs to unlock value from unstructured data. What are some of the realitites that you are seeing in your research?A majority of the conversations in this podcast are focused on data engineering in the context of a business organization. What are some of the ways that management of research data is disjoint from the methods and constraints that are present in business contexts?What are the most interesting, innovative, or unexpected ways that you have seen LLM used in data management?What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering research?What do you have planned for the future of your research in the context of data engineering, knowledge graphs, and AI?Contact Info WebsiteemailParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links INDELabData ProvenanceElsevierSIGMOD 2025Digital TwinKnowledge GraphWikiDataKuzuDBPodcast Episodedata.worldPodcast EpisodeGraphRAGSPARQLSemantic WebGQL == Graph Query LanguageCypherAmazon NeptuneRDF == Resource Description FrameworkSwellDBFlockMTLDuckDBPodcast EpisodeMatteo InterlandiPaolo PapottiNeuromorphic ComputingPoint CloudsLongform.aiBASIL DBThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    51 min
  2. 18 AOÛT

    High Performance And Low Overhead Graphs With KuzuDB

    Summary In this episode of the Data Engineering Podcast Prashanth Rao, an AI engineer at KuzuDB, talks about their embeddable graph database. Prashanth explains how KuzuDB addresses performance shortcomings in existing solutions through columnar storage and novel join algorithms. He discusses the usability and scalability of KuzuDB, emphasizing its open-source nature and potential for various graph applications. The conversation explores the growing interest in graph databases due to their AI and data engineering applications, and Prashanth highlights KuzuDB's potential in edge computing, ephemeral workloads, and integration with other formats like Iceberg and Parquet. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Prashanth Rao about KuzuDB, an embeddable graph databaseInterview IntroductionHow did you get involved in the area of data management?Can you describe what KuzuDB is and the story behind it?What are the core use cases that Kuzu is focused on addressing?What is explicitly out of scope?Graph engines have been available and in use for a long time, but generally for more niche use cases. How would you characterize the current state of the graph data ecosystem?You note scalability as a feature of Kuzu, which is a phrase with many potential interpretations. Typically horizontal scaling of graphs has been complicated, in what sense does Kuzu make that claim?Can you describe some of the typical architecture and integration patterns of Kuzu?What are some of the more interesting or esoteric means of architecting with Kuzu?For cases where Kuzu is rendering a graph across an external data repository (e.g. Iceberg, etc.), what are the patterns for balancing data freshness with network/compute efficiency? (e.g. read and create every time or persist the Kuzu state)Can you describe the internal architecture of Kuzu and key design factors?What are the benefits and tradeoffs of using a columnar store with adjacency lists vs. a more graph-native storage format?What are the most interesting, innovative, or unexpected ways that you have seen Kuzu used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kuzu?When is Kuzu the wrong choice?What do you have planned for the future of Kuzu?Contact Info WebsiteLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links KuzuDBBERTTransformer ArchitectureDuckDBPodcast EpisodeMonetDBUmbra DBsqliteCypher Query LanguageProperty GraphNeo4JGraphRAGContext EngineeringWrite-Ahead LogBauplanIcebergDuckLakeLanceLanceDBArrowPolarsArrow DataFusionGQLClickHouseAdjacency ListWhy Graph Databases Need New Join AlgorithmsKuzuDB WASMRAG == Retrieval Augmented GenerationNetworkXThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    1 h 1 min
  3. 12 AOÛT

    Bridging Data and Decision-Making: AI's Role in Modern Analytics

    Summary In this episode of the Data Engineering Podcast Lucas Thelosen and Drew Gilson from Gravity talk about their development of Orion, an autonomous data analyst that bridges the gap between data availability and business decision-making. Lucas and Drew share their backgrounds in data analytics and how their experiences have shaped their approach to leveraging AI for data analysis, emphasizing the potential of AI to democratize data insights and make sophisticated analysis accessible to companies of all sizes. They discuss the technical aspects of Orion, a multi-agent system designed to automate data analysis and provide actionable insights, highlighting the importance of integrating AI into existing workflows with accuracy and trustworthiness in mind. The conversation also explores how AI can free data analysts from routine tasks, enabling them to focus on strategic decision-making and stakeholder management, as they discuss the future of AI in data analytics and its transformative impact on businesses. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the engineering and impact of building an autonomous data analystInterview IntroductionHow did you get involved in the area of data management?Can you describe what Orion is and the story behind it?How do you envision the role of an agentic analyst in an organizational context?There have been several attempts at building LLM-powered data analysis, many of which are essentially a text-to-SQL interface. How have the capabilities and architectural patterns grown in the past ~2 years to enable a more capable system?One of the key success factors for a data analyst is their ability to translate business questions into technical representations. How can an autonomous AI-powered system understand the complex nuance of the business to build effective analyses?Many agentic approaches to analytics require a substantial investment in data architecture, documentation, and semantic models to be effective. What are the gradations of effectiveness for autonomous analytics for companies who are at different points on their journey to technical maturity?Beyond raw capability, there is also a significant need to invest in user experience design for an agentic analyst to be useful. What are the key interaction patterns that you have found to be helpful as you have developed your system?How does the introduction of a system like Orion shift the workload for data teams?Can you describe the overall system design and technical architecture of Orion?How has that changed as you gained further experience and understanding of the problem space?What are the most interesting, innovative, or unexpected ways that you have seen Orion used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?When is Orion/agentic analytics the wrong choice?What do you have planned for the future of Orion?Contact Info LucasLinkedInDrewLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links OrionLookerGravityVBA == Visual Basic for ApplicationsText-To-SQLOne-shotLookMLData GrainLLM As A JudgeGoogle Large Time Series ModelThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    1 h 11 min
  4. 5 AOÛT

    From Bits to Tables: The Evolution of S3 Storage

    Summary In this episode of the Data Engineering Podcast Andy Warfield talks about the innovative functionalities of S3 Tables and Vectors and their integration into modern data stacks. Andy shares his journey through the tech industry and his role at Amazon, where he collaborates to enhance storage capabilities, discussing the evolution of S3 from a simple storage solution to a sophisticated system supporting advanced data types like tables and vectors crucial for analytics and AI-driven applications. He explains the motivations behind introducing S3 Tables and Vectors, highlighting their role in simplifying data management and enhancing performance for complex workloads, and shares insights into the technical challenges and design considerations involved in developing these features. The conversation explores potential applications of S3 Tables and Vectors in fields like AI, genomics, and media, and discusses future directions for S3's development to further support data-driven innovation. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementTired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.Your host is Tobias Macey and today I'm interviewing Andy Warfield about S3 Tables and VectorsInterview IntroductionHow did you get involved in the area of data management?Can you describe what your goals are with the Tables and Vector features of S3?How did the experience of building S3 Tables inform your work on S3 Vectors?There are numerous implementations of vector storage and search. How do you view the role of S3 in the context of that ecosystem?The most directly analogous implementation that I'm aware of is the Lance table format. How would you compare the implementation and capabilities of Lance with what you are building with S3 Vectors?What opportunity do you see for being able to offer a protocol compatible implementation similar to the Iceberg compatibility that you provide with S3 Tables?Can you describe the technical implementation of the Vectors functionality in S3?What are the sources of inspiration that you looked to in designing the service?Can you describe some of the ways that S3 Vectors might be integrated into a typical AI application?What are the most interesting, innovative, or unexpected ways that you have seen S3 Tables/Vectors used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3 Tables/Vectors?When is S3 the wrong choice for Iceberg or Vector implementations?What do you have planned for the future of S3 Tables and Vectors?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links S3 TablesS3 VectorsS3 ExpressParquetIcebergVector IndexVector DatabasepgvectorEmbedding ModelRetrieval Augmented GenerationTwelveLabsAmazon BedrockIceberg REST CatalogLog-Structured Merge TreeS3 MetadataSentence TransformerSparkTrinoDaftThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    50 min
  5. 28 JUIL.

    Revolutionizing Python Notebooks with Marimo

    Summary In this episode of the Data Engineering Podcast Akshay Agrawal from Marimo discusses the innovative new Python notebook environment, which offers a reactive execution model, full Python integration, and built-in UI elements to enhance the interactive computing experience. He discusses the challenges of traditional Jupyter notebooks, such as hidden states and lack of interactivity, and how Marimo addresses these issues with features like reactive execution and Python-native file formats. Akshay also explores the broader landscape of programmatic notebooks, comparing Marimo to other tools like Jupyter, Streamlit, and Hex, highlighting its unique approach to creating data apps directly from notebooks and eliminating the need for separate app development. The conversation delves into the technical architecture of Marimo, its community-driven development, and future plans, including a commercial offering and enhanced AI integration, emphasizing Marimo's role in bridging the gap between data exploration and production-ready applications. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementTired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.Your host is Tobias Macey and today I'm interviewing Akshay Agrawal about Marimo, a reusable and reproducible Python notebook environmentInterview IntroductionHow did you get involved in the area of data management?Can you describe what Marimo is and the story behind it?What are the core problems and use cases that you are focused on addressing with Marimo?What are you explicitly not trying to solve for with Marimo?Programmatic notebooks have been around for decades now. Jupyter was largely responsible for making them popular outside of academia. How have the applications of notebooks changed in recent years?What are the limitations that have been most challenging to address in production contexts?Jupyter has long had support for multi-language notebooks/notebook kernels. What is your opinion on the utility of that feature as a core concern of the notebook system?Beyond notebooks, Streamlit and Hex have become quite popular for publishing the results of notebook-style analysis. How would you characterize the feature set of Marimo for those use cases?For a typical data team that is working across data pipelines, business analytics, ML/AI engineering, etc. How do you see Marimo applied within and across those contexts?One of the common difficulties with notebooks is that they are largely a single-player experience. They may connect into a shared compute cluster for scaling up execution (e.g. Ray, Dask, etc.). How does Marimo address the situation where a data platform team wants to offer notebooks as a service to reduce the friction to getting started with analyzing data in a warehouse/lakehouse context?How are you seeing teams integrate Marimo with orchestrators (e.g. Dagster, Airflow, Prefect)?What are some of the most interesting or complex engineering challenges that you have had to address while building and evolving Marimo?\What are the most interesting, innovative, or unexpected ways that you have seen Marimo used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Marimo?When is Marimo the wrong choice?What do you have planned for the future of Marimo?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links MarimoJupyterIPythonStreamlitPodcast.__init__ EpisodeVector EmbeddingsDimensionality ReductionKagglePytestPEP 723 script dependency metadataMatLabVisicalcMathematicaRMarkdownRShinyElixir LivebookDatabricks NotebooksPapermillPluto - Julia NotebookHexDirected Acyclic Graph (DAG)Sumble Kaggle founder Anthony Goldblum's startupRayDaskJupytextnbdevDuckDBPodcast EpisodeIcebergSupersetjupyter-marimo-proxyJupyterHubBinderNixAnyWidgetJupyter WidgetsMatplotlibAltairPlotlyDataFusionPolarsMotherDuckThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    52 min
  6. 21 JUIL.

    Warehouse Native Incremental Data Processing With Dynamic Tables And Delayed View Semantics

    Summary In this episode of the Data Engineering Podcast Dan Sotolongo from Snowflake talks about the complexities of incremental data processing in warehouse environments. Dan discusses the challenges of handling continuously evolving datasets and the importance of incremental data processing for optimized resource use and reduced latency. He explains how delayed view semantics can address these challenges by maintaining up-to-date results with minimal work, leveraging Snowflake's dynamic tables feature. The conversation also explores the broader landscape of data processing, comparing batch and streaming systems, and highlights the trade-offs between them. Dan emphasizes the need for a unified theoretical framework to discuss semantic guarantees in data pipelines and introduces the concept of delayed view semantics, touching on the limitations of current systems and the potential of dynamic tables to simplify complex data workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Dan Sotolongo about the challenges of incremental data processing in warehouse environments and how delayed view semantics help to address the problemInterview IntroductionHow did you get involved in the area of data management?Can you start by defining the scope of the term "incremental data processing"?What are some of the common solutions that data engineers build when creating workflows to implement that pattern?What are some common difficulties that they encounter in the pursuit of incremental data?Can you describe what delayed view semantics are and the story behind it?What are the problems that DVS explicitly doesn't address?How does the approach that you have taken in Dynamic View Semantics compare to systems like Materialize, Feldera, etc.Can you describe the technical architecture of the implementation of Dynamic Tables?What are the elements of the problem that are as-yet unsolved?How has the implementation changed/evolved as you learned more about the solution space?What would be involved in implementing the delayed view semantics pattern in other dbms engines?For someone who wants to use DVS/Dyamic Tables for managing their incremental data loads, what does the workflow look like?What are the options for being able to apply tests/validation logic to a dynamic table while it is operating?What are the most interesting, innovative, or unexpected ways that you have seen Dynamic Tables used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dynamic Tables/Delayed View Semantics?When are Dynamic Tables/DVS the wrong choice?What do you have planned for the future of Dynamic Tables?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links Delayed View Semantics: Presentation SlidesSnowflakeNumPyIPythonJupyterFlinkSpark StreamingKafkaSnowflake Dynamic TablesAirflowDagsterStreaming WatermarksMaterializeFelderaACIDCAP Theorem)LinearizabilitySerializable ConsistencySIGMODMaterialized ViewsdbtData VaultApache IcebergDatabricks DeltaHudiDead Letter Queuepg_ivmProperty Based TestingIceberg V3 Row LineagePrometheusThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    55 min
  7. 15 JUIL.

    Streamlining Data Pipelines with MCP Servers and Vector Engines

    Summary In this episode of the Data Engineering Podcast Kacper Łukawski from Qdrant about integrating MCP servers with vector databases to process unstructured data. Kacper shares his experience in data engineering, from building big data pipelines in the automotive industry to leveraging large language models (LLMs) for transforming unstructured datasets into valuable assets. He discusses the challenges of building data pipelines for unstructured data and how vector databases facilitate semantic search and retrieval-augmented generation (RAG) applications. Kacper delves into the intricacies of vector storage and search, including metadata and contextual elements, and explores the evolution of vector engines beyond RAG to applications like semantic search and anomaly detection. The conversation covers the role of Model Context Protocol (MCP) servers in simplifying data integration and retrieval processes, highlighting the need for experimentation and evaluation when adopting LLMs, and offering practical advice on optimizing vector search costs and fine-tuning embedding models for improved search quality. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured dataInterview IntroductionHow did you get involved in the area of data management?LLMs are enabling the derivation of useful data assets from unstructured sources. What are the challenges that teams face in building the pipelines to support that work?How has the role of vector engines grown or evolved in the past ~2 years as LLMs have gained broader adoption?Beyond its role as a store of context for agents, RAG, etc. what other applications are common for vector databaes?In the ecosystem of vector engines, what are the distinctive elements of Qdrant?How has the MCP specification simplified the work of processing unstructured data?Can you describe the toolchain and workflow involved in building a data pipeline that leverages an MCP for generating embeddings?helping data engineers gain confidence in non-deterministic workflowsbringing application/ML/data teams into collaboration for determining the impact of e.g. chunking strategies, embedding model selection, etc.What are the most interesting, innovative, or unexpected ways that you have seen MCP and Qdrant used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector use cases?When is MCP and/or Qdrant the wrong choice?What do you have planned for the future of MCP with Qdrant?Contact Info LinkedInTwitter/XPersonal websiteParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links QdrantKafkaApache OoziNamed Entity RecognitionGraphRAGpgvectorElasticsearchApache LuceneOpenSearchBM25Semantic SearchMCP == Model Context ProtocolAnthropic Contextualized ChunkingCohereThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    52 min
  8. 6 JUIL.

    Foundational Data Engineering At Two Sigma

    Summary In this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sectorInterview IntroductionHow did you get involved in the area of data management?Can you start by outlining the role of data in the context of Two Sigma?What are some of the key characteristics of the types of data sources that you work with?Your role is leading "foundational data engineering" at Two Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?Being the foundational team for data use at Two Sigma, how have you approached the design and architecture of your technical systems?How do you think about the boundaries between your responsibilities and the rest of the organization?What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?What are some of the elements of sociotechnical friction that have been most challenging to address?What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?When is a foundational data team the wrong approach?What do you have planned for the future of your platform design?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links 2SigmaReliability EngineeringSLA == Service-Level AgreementAirflowParquet File FormatBigQuerySnowflakedbtGemini AssistMCP == Model Context ProtocoldtraceThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    55 min
4,6
sur 5
138 notes

À propos

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Vous aimeriez peut‑être aussi