How AI Is Built • Nicolay Gerold
13 episodes • Technology

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

    Mastering Vector Databases: Product & Binary Quantization, Multi-Vector Search

    Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hasan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!

    Zain Hasan:


    LinkedIn
    X (Twitter)
    Weaviate

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    Key Insights:


    Vector databases can handle not just text, but also image, audio, and video data
    Quantization is a powerful technique to significantly reduce costs and enable in-memory search
    Binary quantization allows efficient brute force search for smaller datasets
    Multi-vector search enables retrieval of heterogeneous data types within the same index
    The future lies in multimodal search and recommendations across different senses
    Brain-computer interfaces and EEG foundation models are exciting areas to watch

    Key Quotes:


    "Vector databases are pretty much the commercialization and the productization of representation learning."
    "I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
    "Going from text to multimedia in vector databases is really simple."
    "Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."

    Chapters

    00:00 - 01:24 Introduction

    01:24 - 03:48 Underappreciated aspects of vector databases

    03:48 - 06:06 Quantization trade-offs and techniques


    Various quantization techniques: binary quantization, product quantization, scalar quantization

    06:06 - 08:24 Binary quantization


    Reducing vectors from 32 bits per dimension down to 1 bit
    Enables efficient in-memory brute force search for smaller datasets
    Works best when embedding values are roughly normally distributed around zero
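    A minimal sketch of the idea in Python (NumPy only; illustrative, not Weaviate's actual implementation): each dimension keeps only its sign, and brute-force search ranks stored vectors by Hamming distance over the packed bits.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Keep one bit per dimension (positive -> 1, else 0), packed into bytes."""
    return np.packbits(vectors > 0, axis=-1)

def hamming_search(query: np.ndarray, codes: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force search: rank stored codes by Hamming distance to the query."""
    q = binarize(query)
    distances = np.unpackbits(q ^ codes, axis=-1).sum(axis=-1)  # count differing bits
    return np.argsort(distances)[:k]

embeddings = np.random.randn(10_000, 768).astype(np.float32)  # toy corpus
codes = binarize(embeddings)  # 768 dims x 32 bits -> 96 bytes per vector
top_k = hamming_search(np.random.randn(1, 768), codes)
```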

    08:24 - 10:44 Product quantization and other techniques


    Alternative to binary quantization, segments vectors and clusters each segment
    Scalar quantization reduces vectors to 8 bits per dimension
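    A toy sketch of product quantization under assumed parameters (8 segments, 256 centroids each; not Weaviate's exact scheme): split each vector into segments, cluster each segment's sub-space, and store one 1-byte centroid id per segment. Scalar quantization is the simpler cousin: instead of clustering, each float dimension is rounded onto 256 evenly spaced levels.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(vectors: np.ndarray, m: int = 8, k: int = 256) -> list:
    """Fit one codebook (k centroids) per vector segment."""
    return [KMeans(n_clusters=k, n_init=4).fit(seg) for seg in np.split(vectors, m, axis=1)]

def pq_encode(vectors: np.ndarray, codebooks: list) -> np.ndarray:
    """Replace each segment with the id of its nearest centroid (1 byte each)."""
    segs = np.split(vectors, len(codebooks), axis=1)
    return np.stack([cb.predict(s) for cb, s in zip(codebooks, segs)], axis=1).astype(np.uint8)

embeddings = np.random.randn(5_000, 768).astype(np.float32)
codebooks = pq_train(embeddings)          # 768 floats (3 KB) -> 8 bytes per vector
codes = pq_encode(embeddings, codebooks)
```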

    10:44 - 13:08 Quantization as a "superpower" to reduce costs

    13:08 - 15:34 Comparing quantization approaches

    15:34 - 17:51 Placing vector databases in the database landscape

    17:51 - 20:12 Pruning unused vectors and nodes

    20:12 - 22:37 Improving precision beyond similarity thresholds

    22:37 - 25:03 Multi-vector search

    25:03 - 27:11 Impact of vector databases on data interaction

    27:11 - 29:35 Interesting and weird use cases

    29:35 - 32:00 Future of multimodal search and recommendations

    32:00 - 34:22 Extending recommendations to user data

    34:22 - 36:39 What's next for Weaviate

    36:39 - 38:57 Exciting technologies beyond vector databases and LLMs

    vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications



    • 40 min
Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10

    In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.

    Summary by Section

    Introduction


    Anjan Banerjee, a data architect, discusses building complex AI and data systems
    Explains the basics of data architecture using Lego and chat app examples

    Sources and Tools


    Identifying data sources is the first step in designing a data architecture
    Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
    Use one tool for most activities if possible, but specialized tools offer benefits
    Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)

    Airflow and Orchestration


    Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
    For less technical orgs, GUI-based tools like Talend, Alteryx may be better
    AWS Step Functions and managed Airflow are improving native orchestration capabilities
    For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte

    AI and Data Processing


    ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud
    TinyML and edge computing enable ML inference on device (drones, manufacturing)
    Cloud batch processing still dominates for user targeting, recommendations

    Data Lakes and Storage


    Storage choice depends on data types, use cases, cloud ecosystem
    Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata (versioning sketched after this list)
    Pulling data into separate system often needed for advanced analytics beyond source system
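    The versioning point is easy to demonstrate from Python with the open-source deltalake (delta-rs) package; the path and schema below are illustrative assumptions:

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

batch = pa.table({"id": [1, 2], "status": ["new", "new"]})
write_deltalake("/tmp/events", batch)                 # creates version 0
write_deltalake("/tmp/events", batch, mode="append")  # commits version 1

dt = DeltaTable("/tmp/events")
print(dt.version())                                     # -> 1
old = DeltaTable("/tmp/events", version=0).to_pandas()  # time travel to version 0
```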

    Data Quality and Standardization


    "Poka-yoke" error-proofing of input screens is vital for downstream data quality
    Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion, as sketched after this list
    Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
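    A minimal sketch of one such ingestion rule (field names assumed for illustration): every source timestamp is normalized to UTC before it lands in storage.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(raw: str, source_tz: str) -> datetime:
    """Parse a naive source timestamp and normalize it to UTC."""
    local = datetime.fromisoformat(raw).replace(tzinfo=ZoneInfo(source_tz))
    return local.astimezone(timezone.utc)

record = {"event_time": "2024-03-01T09:30:00", "region": "Europe/Berlin"}
record["event_time"] = to_utc(record["event_time"], record["region"]).isoformat()
# -> '2024-03-01T08:30:00+00:00'
```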

    Hot Takes and Wishes


    Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
    Automated data set joining and entity resolution across systems would be a game-changer

    Anjan Banerjee:


    LinkedIn

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    00:00 Understanding Data Architecture

    12:36 Choosing the Right Tools

    20:36 The Benefits of Serverless Functions

    21:34 Integrating AI in Data Acquisition

    24:31 The Trend Towards Single Node Engines

    26:51 Choosing the Right Database Management System and Storage

    29:45 Adding Additional Storage Components

    32:35 Reducing Human Errors for Better Data Quality

    39:07 Overhyped and Underutilized Tools

    Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution



    • 45 min
Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack | ep 9

    Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.


    Lake house architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
    Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
    Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
    Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
    Data ingress with DLT: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
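    A minimal dlt pipeline looks roughly like this (pipeline, dataset, and table names are illustrative); dlt infers the schema from the yielded rows and loads them into the destination:

```python
import dlt

@dlt.resource(table_name="users")
def users():
    # Any iterable of dicts works here: API pages, database rows, files, ...
    yield from [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

pipeline = dlt.pipeline(pipeline_name="ingest_demo", destination="duckdb", dataset_name="raw")
info = pipeline.run(users())
print(info)  # load summary: tables created, rows loaded
```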

    Key Takeaways:


    Lake houses offer a powerful and flexible architecture for modern data analytics.
    Open-source solutions provide cost-effective and customizable alternatives.
    Carefully consider your specific use cases and preferences when choosing tools and components.
    Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
    The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

    Sound Bites

    "The Lake house is sort of a modular setup where you decouple the storage and the compute."
    "A lake house is an architecture, an architecture for data analytics platforms."
    "The most popular table formats for a lake house are Delta, Iceberg, and Apache Hoodie."

    Jorrit Sandbrink:


    LinkedIn
    dlt

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    Chapters

    00:00 Introduction to the Lake House Architecture

    03:59 Choosing Storage and Table Formats

    06:19 Comparing Compute Engines

    21:37 Simplifying Data Ingress

    25:01 Building a Preferred Data Stack

    lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage



    • 27 min
Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models | ep 8

    Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

    Key Points:


    Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword, metadata, and vector search, aiding in information retrieval.
    Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases (see the sketch after this list).
    Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
    Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
    Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.
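    A toy illustration of the hybrid model (assumed structures, not Graphlit's actual schema): one piece of content keeps a shared id across its three representations, and the graph acts as a pre-filter on vector search hits.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualEntity:
    entity_id: str
    document: dict = field(default_factory=dict)     # document store (JSON)
    graph_node: dict = field(default_factory=dict)   # graph node and edges
    index_entry: dict = field(default_factory=dict)  # search index (vector/keyword)

item = VirtualEntity(
    entity_id="content-42",
    document={"title": "Q3 report", "body": "..."},
    graph_node={"labels": ["Document"], "edges": [("MENTIONS", "org-acme")]},
    index_entry={"vector": [0.12, -0.08], "keywords": ["report", "q3"]},
)

vector_hits = ["content-42", "content-7"]              # ranked ids from the search index
reachable = {"content-42"}                             # ids connected to 'org-acme' in the graph
filtered = [h for h in vector_hits if h in reachable]  # the graph as a filtering layer
```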

    Notable Quotes:


    "Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
    "Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
    "[For RAG]...you have to find constraints to make it workable."
    "Entity resolution, deduping, I think is probably the number one thing."
    "I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
    "One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
    "Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

    Kirk Marple:


    LinkedIn
    X (Twitter)
    Graphlit
    Graphlit Docs

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    Chapters

    00:00 Graphlit’s Hybrid Approach
    02:23 Use Cases and Transition to Graphlit
    04:19 Knowledge Graphs as a Filtering Mechanism
    13:23 Using Gremlin for Querying the Graph
    32:36 XML in Prompts for Better Segmentation
    35:04 The Future of LLMs and Graphlit
    36:25 Getting Started with Graphlit

    Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal



    • 36 min
Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7

    From Problem to Requirements to Architecture.

    In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.


    "Don't overcomplicate what you're actually doing."
    "Getting your basic programming software development skills down is super important to becoming a good data engineer."
    "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

    Key Takeaways:


    Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
    Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
    Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
    Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management (see the sketch after this list).
    The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
    The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
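    A minimal sketch of a software-defined asset in Dagster (asset names are illustrative): you declare what data should exist, and Dagster infers the dependency graph from function parameters.

```python
from dagster import asset, materialize

@asset
def raw_orders() -> list:
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 13.5}]

@asset
def daily_revenue(raw_orders: list) -> float:
    # Depends on raw_orders simply by naming it as a parameter
    return sum(order["amount"] for order in raw_orders)

if __name__ == "__main__":
    materialize([raw_orders, daily_revenue])  # builds assets in dependency order
```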

    Jon Erik Kemi Warghed:


    LinkedIn

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    Chapters

    00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords

    00:57 How to Choose the Right Tools: Considerations for startups and large companies

    03:13 Evaluating Open Source Tools: Background checks and due diligence

    07:52 Defining Data Governance: Transparency and understanding of data

    10:15 The Importance of Data Governance: Challenges and solutions

    12:21 Data Governance Tools: dbt and Dagster

    17:05 The Impact of Dagster: Software-defined assets and declarative thinking

    19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage

    21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management

    26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines

    28:47 The Importance of Tool Selection: Thinking about long-term sustainability

    31:10 When to Adopt Orchestration: Identifying the need for orchestration tools



    • 38 min
Data Orchestration Tools: Choosing the right one for your needs | ep 6

    In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project, touching on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They close with data residency considerations and the future of orchestration tools.

    Sound Bites

    "The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
    "It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
    "The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."

    Key Topics


    The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
    What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines (see the sketch after this list).
    The crowded market: A look at popular options like Airflow, Dagster, Prefect, and more.
    Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
    Data residency and GDPR: How regulations influence tool selection, especially in Europe.
    Future of the field: The need for consolidation and finding the right balance between features and usability.
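    The DAG sound bite above translates almost literally into code. A bare-bones sketch, assuming Airflow 2.x (DAG and task names are illustrative): directed because extract runs before load, acyclic because no task can depend back on itself.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extracting"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("loading"))
    extract >> load  # the single edge that makes this a (tiny) DAG
```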

    John Wessel:


    LinkedIn
    Data Stack Show
    Agreeable Data

    Nicolay Gerold:


    LinkedIn
    X (Twitter)

    Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.

    Chapters

    00:00 Introduction and Overview

    00:34 The Evolution of Data Orchestration Tools

    04:54 Components and Flow of Data in Orchestrators

    08:24 Deployment Options: Serverless vs. Kubernetes

    11:14 Considerations for Data Residency and Security

    13:02 The Need for a Clear Winner in the Orchestration Space

    20:47 Optimization Techniques for Memory and Time-Limited Issues

    23:09 Integrating Orchestrators with Infrastructure-as-Code

    24:33 Bridging the Gap Between Data and Engineering Practices

    27:22 Exciting Technologies Outside of Data Orchestration

    30:09 The Future of Dagster



    • 32 min
