How AI Is Built - Nicolay Gerold
Technology
How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Mastering Vector Databases: Product & Binary Quantization, Multi-Vector Search
Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hasan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!
Zain Hasan:
LinkedIn
X (Twitter)
Weaviate
Nicolay Gerold:
LinkedIn
X (Twitter)
Key Insights:
Vector databases can handle not just text, but also image, audio, and video data
Quantization is a powerful technique to significantly reduce costs and enable in-memory search
Binary quantization allows efficient brute force search for smaller datasets
Multi-vector search enables retrieval of heterogeneous data types within the same index
The future lies in multimodal search and recommendations across different senses
Brain-computer interfaces and EEG foundation models are exciting areas to watch
Key Quotes:
"Vector databases are pretty much the commercialization and the productization of representation learning."
"I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
"Going from text to multimedia in vector databases is really simple."
"Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."
Chapters
00:00 - 01:24 Introduction
01:24 - 03:48 Underappreciated aspects of vector databases
03:48 - 06:06 Quantization trade-offs and techniques
Various quantization techniques: binary quantization, product quantization, scalar quantization
06:06 - 08:24 Binary quantization
Reducing vectors from 32 bits per dimension down to 1 bit
Enables efficient in-memory brute force search for smaller datasets
Works best when the data is roughly normally distributed around zero, spanning negative and positive values
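As a rough illustration of the idea discussed in this chapter (a minimal sketch, not Weaviate's actual implementation), binary quantization keeps only the sign of each dimension and ranks stored vectors by Hamming distance; bits are stored one per byte here for clarity rather than packed:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension: 1 bit of information
    instead of 32 (stored one bit per byte here for clarity)."""
    return (vectors > 0).astype(np.uint8)

def hamming_search(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 3) -> np.ndarray:
    """Brute force search: rank every stored vector by Hamming
    distance (number of dimensions whose bits disagree)."""
    distances = (db_bits != query_bits).sum(axis=1)
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
# Zero-centered data, the case where sign bits carry the most information.
db = rng.standard_normal((1000, 128)).astype(np.float32)
db_bits = binary_quantize(db)

# A query close to stored vector 42 should rank it at the top.
query = db[42] + 0.1 * rng.standard_normal(128).astype(np.float32)
top = hamming_search(binary_quantize(query), db_bits)
print(top)
```

Production engines pack the bits (128 dimensions into 16 bytes) and compute distances with popcount instructions, and typically rescore the top candidates with the original float vectors to recover precision.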
08:24 - 10:44 Product quantization and other techniques
An alternative to binary quantization that splits vectors into segments and clusters each segment
Scalar quantization reduces vectors to 8-bits per dimension
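The scalar quantization variant mentioned above can be sketched in a few lines: each float32 value is mapped onto one of 256 levels, cutting memory 4x. This toy version uses a single global value range for simplicity; real implementations may compute ranges per dimension or segment:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Map each float32 value onto one of 256 levels: 8 bits per
    dimension, a 4x memory reduction versus 32-bit floats."""
    lo, hi = float(vectors.min()), float(vectors.max())
    codes = np.round((vectors - lo) / (hi - lo) * 255).astype(np.uint8)
    return codes, lo, hi

def dequantize(codes: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Recover an approximation of the original values."""
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(1)
v = rng.standard_normal((100, 64)).astype(np.float32)
codes, lo, hi = scalar_quantize(v)
approx = dequantize(codes, lo, hi)

print(v.nbytes, codes.nbytes)           # 25600 6400
print(float(np.abs(v - approx).max()))  # bounded by half a quantization step
```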
10:44 - 13:08 Quantization as a "superpower" to reduce costs
13:08 - 15:34 Comparing quantization approaches
15:34 - 17:51 Placing vector databases in the database landscape
17:51 - 20:12 Pruning unused vectors and nodes
20:12 - 22:37 Improving precision beyond similarity thresholds
22:37 - 25:03 Multi-vector search
25:03 - 27:11 Impact of vector databases on data interaction
27:11 - 29:35 Interesting and weird use cases
29:35 - 32:00 Future of multimodal search and recommendations
32:00 - 34:22 Extending recommendations to user data
34:22 - 36:39 What's next for Weaviate
36:39 - 38:57 Exciting technologies beyond vector databases and LLMs
vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
---
Send in a voice message: https://podcasters.spotify.com/pod/show/nicolaygerold/message
Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.
Summary by Section
Introduction
Anjan Banerjee, a data architect, discusses building complex AI and data systems
Explains the basics of data architecture using Lego and chat app examples
Sources and Tools
Identifying data sources is the first step in designing a data architecture
Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
Use one tool for most activities if possible, but specialized tools offer benefits
Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)
Airflow and Orchestration
Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
For less technical orgs, GUI-based tools like Talend, Alteryx may be better
AWS Step Functions and managed Airflow are improving native orchestration capabilities
For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte
AI and Data Processing
ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud
TinyML and edge computing enable ML inference on device (drones, manufacturing)
Cloud batch processing still dominates for user targeting, recommendations
Data Lakes and Storage
Storage choice depends on data types, use cases, cloud ecosystem
Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
Pulling data into separate system often needed for advanced analytics beyond source system
Data Quality and Standardization
"Poka-yoke" error-proofing of input screens is vital for downstream data quality
Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
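The "unified schema" point above can be illustrated with a small normalization step at ingestion time. This is a hedged sketch (the function name and formats are hypothetical), showing the idea of canonicalizing every timestamp to UTC before it lands in storage:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(raw: str, source_tz: str) -> str:
    """Parse a naive source-local timestamp and store it canonically in UTC."""
    local = datetime.fromisoformat(raw).replace(tzinfo=ZoneInfo(source_tz))
    return local.astimezone(timezone.utc).isoformat()

# A record ingested from a US East Coast system (EST, UTC-5 on this date):
print(to_utc("2024-03-01 09:30:00", "America/New_York"))
# 2024-03-01T14:30:00+00:00
```

Applying such rules once, at the ingestion boundary, is cheaper than reconciling mixed time zones in every downstream query.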
Hot Takes and Wishes
Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
Automated data set joining and entity resolution across systems would be a game-changer
Anjan Banerjee:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Understanding Data Architecture
12:36 Choosing the Right Tools
20:36 The Benefits of Serverless Functions
21:34 Integrating AI in Data Acquisition
24:31 The Trend Towards Single Node Engines
26:51 Choosing the Right Database Management System and Storage
29:45 Adding Additional Storage Components
32:35 Reducing Human Errors for Better Data Quality
39:07 Overhyped and Underutilized Tools
Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
---
Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack | ep 9
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.
Lake house architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with DLT: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
Key Takeaways:
Lake houses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.
Sound Bites
"The Lake house is sort of a modular setup where you decouple the storage and the compute."
"A lake house is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lake house are Delta, Iceberg, and Apache Hoodie."
Jorrit Sandbrink:
LinkedIn
dlt
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Introduction to the Lake House Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack
lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage
---
Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models | ep 8
Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.
Key Points:
Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.
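The hybrid "virtual entity" model described above can be sketched as a data structure: one logical piece of content whose representations live in three systems, tied together by a shared id. The names here are illustrative only, not Graphlit's actual API:

```python
from dataclasses import dataclass

@dataclass
class VirtualEntity:
    """One logical piece of content with per-system representations."""
    content_id: str      # shared key across all systems
    document: dict       # JSON payload for the document store
    graph_node_id: str   # node id in the graph database
    search_doc_id: str   # entry id in the search index

def make_entity(content_id: str, payload: dict) -> VirtualEntity:
    """Derive each system's identifier from one canonical id, so the
    three representations can always be joined back together."""
    return VirtualEntity(
        content_id=content_id,
        document=payload,
        graph_node_id=f"node:{content_id}",
        search_doc_id=f"idx:{content_id}",
    )

entity = make_entity("c-123", {"title": "Q3 report", "text": "..."})
print(entity.graph_node_id)  # node:c-123
```

Keying every representation off the same id is what makes the entity "virtual": no single database owns it, yet any system can resolve it.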
Notable Quotes:
"Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
"Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
"[For RAG]...you have to find constraints to make it workable."
"Entity resolution, deduping, I think is probably the number one thing."
"I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
"One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
"Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."
Kirk Marple:
LinkedIn
X (Twitter)
Graphlit
Graphlit Docs
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Graphlit’s Hybrid Approach
02:23 Use Cases and Transition to Graphlit
04:19 Knowledge Graphs as a Filtering Mechanism
13:23 Using Gremlin for Querying the Graph
32:36 XML in Prompts for Better Segmentation
35:04 The Future of LLMs and Graphlit
36:25 Getting Started with Graphlit
Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal
---
Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7
From Problem to Requirements to Architecture.
In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.
"Don't overcomplicate what you're actually doing."
"Getting your basic programming software development skills down is super important to becoming a good data engineer."
"Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."
Key Takeaways:
Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
Jon Erik Kemi Warghed:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords
00:57 How to Choose the Right Tools: Considerations for startups and large companies
03:13 Evaluating Open Source Tools: Background checks and due diligence
07:52 Defining Data Governance: Transparency and understanding of data
10:15 The Importance of Data Governance: Challenges and solutions
12:21 Data Governance Tools: dbt and Dagster
17:05 The Impact of Dagster: Software-defined assets and declarative thinking
19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage
21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management
26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines
28:47 The Importance of Tool Selection: Thinking about long-term sustainability
31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
---
Data Orchestration Tools: Choosing the right one for your needs | ep 6
In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.
Sound Bites
"The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
"It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
"The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."
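The DAG structure described in the last sound bite (directed: there is a start and a finish; acyclic: no loops) can be shown in a few lines using Python's standard library, independent of any particular orchestrator:

```python
from graphlib import TopologicalSorter, CycleError

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# "Directed" with a start and a finish: a valid run order exists.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']

# "Acyclic": introducing a loop makes scheduling impossible.
pipeline["extract"] = {"report"}
try:
    list(TopologicalSorter(pipeline).static_order())
except CycleError:
    print("cycle detected")
```

This dependency-resolution step is the core of what every orchestrator does before layering on scheduling, retries, and monitoring.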
Key Topics
The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
The crowded market: A look at popular options like Airflow, Dagster, Prefect, and more.
Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
Data residency and GDPR: How regulations influence tool selection, especially in Europe.
Future of the field: The need for consolidation and finding the right balance between features and usability.
John Wessel:
LinkedIn
Data Stack Show
Agreeable Data
Nicolay Gerold:
LinkedIn
X (Twitter)
Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.
Chapters
00:00 Introduction and Overview
00:34 The Evolution of Data Orchestration Tools
04:54 Components and Flow of Data in Orchestrators
08:24 Deployment Options: Serverless vs. Kubernetes
11:14 Considerations for Data Residency and Security
13:02 The Need for a Clear Winner in the Orchestration Space
20:47 Optimization Techniques for Memory and Time-Limited Issues
23:09 Integrating Orchestrators with Infrastructure-as-Code
24:33 Bridging the Gap Between Data and Engineering Practices
27:22 Exciting Technologies Outside of Data Orchestration
30:09 The Future of Dagster
---