How AI Is Built - Nicolay Gerold
Technology
How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Mastering Vector Databases: Product & Binary Quantization, Multi-Vector Search
Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hasan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!
Zain Hasan:
LinkedIn
X (Twitter)
Weaviate
Nicolay Gerold:
LinkedIn
X (Twitter)
Key Insights:
Vector databases can handle not just text, but also image, audio, and video data
Quantization is a powerful technique to significantly reduce costs and enable in-memory search
Binary quantization allows efficient brute force search for smaller datasets
Multi-vector search enables retrieval of heterogeneous data types within the same index
The future lies in multimodal search and recommendations across different senses
Brain-computer interfaces and EEG foundation models are exciting areas to watch
Key Quotes:
"Vector databases are pretty much the commercialization and the productization of representation learning."
"I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
"Going from text to multimedia in vector databases is really simple."
"Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."
Chapters
00:00 - 01:24 Introduction
01:24 - 03:48 Underappreciated aspects of vector databases
03:48 - 06:06 Quantization trade-offs and techniques
Various quantization techniques: binary quantization, product quantization, scalar quantization
06:06 - 08:24 Binary quantization
Reducing vectors from 32 bits per dimension down to 1 bit
Enables efficient in-memory brute force search for smaller datasets
Works best when the data is roughly normally distributed around zero, spanning negative and positive values
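As a rough illustration of the idea discussed in this chapter (a minimal sketch, not Weaviate's actual implementation), binary quantization keeps only the sign of each dimension and ranks stored vectors by Hamming distance; bits are stored one per byte here for clarity rather than packed:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension: 1 bit of information
    instead of 32 (stored one bit per byte here for clarity)."""
    return (vectors > 0).astype(np.uint8)

def hamming_search(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 3) -> np.ndarray:
    """Brute force search: rank every stored vector by Hamming
    distance (number of dimensions whose bits disagree)."""
    distances = (db_bits != query_bits).sum(axis=1)
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
# Zero-centered data, the case where sign bits carry the most information.
db = rng.standard_normal((1000, 128)).astype(np.float32)
db_bits = binary_quantize(db)

# A query close to stored vector 42 should rank it at the top.
query = db[42] + 0.1 * rng.standard_normal(128).astype(np.float32)
top = hamming_search(binary_quantize(query), db_bits)
print(top)
```

Production engines pack the bits (128 dimensions into 16 bytes) and compute distances with popcount instructions, and typically rescore the top candidates with the original float vectors to recover precision.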
08:24 - 10:44 Product quantization and other techniques
An alternative to binary quantization that splits vectors into segments and clusters each segment
Scalar quantization reduces vectors to 8-bits per dimension
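The scalar quantization variant mentioned above can be sketched in a few lines: each float32 value is mapped onto one of 256 levels, cutting memory 4x. This toy version uses a single global value range for simplicity; real implementations may compute ranges per dimension or segment:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Map each float32 value onto one of 256 levels: 8 bits per
    dimension, a 4x memory reduction versus 32-bit floats."""
    lo, hi = float(vectors.min()), float(vectors.max())
    codes = np.round((vectors - lo) / (hi - lo) * 255).astype(np.uint8)
    return codes, lo, hi

def dequantize(codes: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Recover an approximation of the original values."""
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(1)
v = rng.standard_normal((100, 64)).astype(np.float32)
codes, lo, hi = scalar_quantize(v)
approx = dequantize(codes, lo, hi)

print(v.nbytes, codes.nbytes)           # 25600 6400
print(float(np.abs(v - approx).max()))  # bounded by half a quantization step
```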
10:44 - 13:08 Quantization as a "superpower" to reduce costs
13:08 - 15:34 Comparing quantization approaches
15:34 - 17:51 Placing vector databases in the database landscape
17:51 - 20:12 Pruning unused vectors and nodes
20:12 - 22:37 Improving precision beyond similarity thresholds
22:37 - 25:03 Multi-vector search
25:03 - 27:11 Impact of vector databases on data interaction
27:11 - 29:35 Interesting and weird use cases
29:35 - 32:00 Future of multimodal search and recommendations
32:00 - 34:22 Extending recommendations to user data
34:22 - 36:39 What's next for Weaviate
36:39 - 38:57 Exciting technologies beyond vector databases and LLMs
vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
---
Send in a voice message: https://podcasters.spotify.com/pod/show/nicolaygerold/message
Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.
Summary by Section
Introduction
Anjan Banerjee, a data architect, discusses building complex AI and data systems
Explains the basics of data architecture using Lego and chat app examples
Sources and Tools
Identifying data sources is the first step in designing a data architecture
Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
Use one tool for most activities if possible, but specialized tools offer benefits
Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)
Airflow and Orchestration
Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
For less technical orgs, GUI-based tools like Talend, Alteryx may be better
AWS Step Functions and managed Airflow are improving native orchestration capabilities
For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte
AI and Data Processing
ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud
TinyML and edge computing enable ML inference on device (drones, manufacturing)
Cloud batch processing still dominates for user targeting, recommendations
Data Lakes and Storage
Storage choice depends on data types, use cases, cloud ecosystem
Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
Pulling data into separate system often needed for advanced analytics beyond source system
Data Quality and Standardization
"Poka-yoke" error-proofing of input screens is vital for downstream data quality
Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
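The "unified schema" point above can be illustrated with a small normalization step at ingestion time. This is a hedged sketch (the function name and formats are hypothetical), showing the idea of canonicalizing every timestamp to UTC before it lands in storage:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(raw: str, source_tz: str) -> str:
    """Parse a naive source-local timestamp and store it canonically in UTC."""
    local = datetime.fromisoformat(raw).replace(tzinfo=ZoneInfo(source_tz))
    return local.astimezone(timezone.utc).isoformat()

# A record ingested from a US East Coast system (EST, UTC-5 on this date):
print(to_utc("2024-03-01 09:30:00", "America/New_York"))
# 2024-03-01T14:30:00+00:00
```

Applying such rules once, at the ingestion boundary, is cheaper than reconciling mixed time zones in every downstream query.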
Hot Takes and Wishes
Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
Automated data set joining and entity resolution across systems would be a game-changer
Anjan Banerjee:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Understanding Data Architecture
12:36 Choosing the Right Tools
20:36 The Benefits of Serverless Functions
21:34 Integrating AI in Data Acquisition
24:31 The Trend Towards Single Node Engines
26:51 Choosing the Right Database Management System and Storage
29:45 Adding Additional Storage Components
32:35 Reducing Human Errors for Better Data Quality
39:07 Overhyped and Underutilized Tools
Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
---
Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack | ep 9
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.
Lake house architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with DLT: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
Key Takeaways:
Lake houses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.
Sound Bites
"The Lake house is sort of a modular setup where you decouple the storage and the compute."
"A lake house is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lake house are Delta, Iceberg, and Apache Hoodie."
Jorrit Sandbrink:
LinkedIn
dlt
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Introduction to the Lake House Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack
lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage
---
Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models | ep 8
Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.
Key Points:
Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.
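The hybrid "virtual entity" model described above can be sketched as a data structure: one logical piece of content whose representations live in three systems, tied together by a shared id. The names here are illustrative only, not Graphlit's actual API:

```python
from dataclasses import dataclass

@dataclass
class VirtualEntity:
    """One logical piece of content with per-system representations."""
    content_id: str      # shared key across all systems
    document: dict       # JSON payload for the document store
    graph_node_id: str   # node id in the graph database
    search_doc_id: str   # entry id in the search index

def make_entity(content_id: str, payload: dict) -> VirtualEntity:
    """Derive each system's identifier from one canonical id, so the
    three representations can always be joined back together."""
    return VirtualEntity(
        content_id=content_id,
        document=payload,
        graph_node_id=f"node:{content_id}",
        search_doc_id=f"idx:{content_id}",
    )

entity = make_entity("c-123", {"title": "Q3 report", "text": "..."})
print(entity.graph_node_id)  # node:c-123
```

Keying every representation off the same id is what makes the entity "virtual": no single database owns it, yet any system can resolve it.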
Notable Quotes:
"Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
"Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
"[For RAG]...you have to find constraints to make it workable."
"Entity resolution, deduping, I think is probably the number one thing."
"I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
"One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
"Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."
Kirk Marple:
LinkedIn
X (Twitter)
Graphlit
Graphlit Docs
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Graphlit’s Hybrid Approach
02:23 Use Cases and Transition to Graphlit
04:19 Knowledge Graphs as a Filtering Mechanism
13:23 Using Gremlin for Querying the Graph
32:36 XML in Prompts for Better Segmentation
35:04 The Future of LLMs and Graphlit
36:25 Getting Started with Graphlit
Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal
---
Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7
From Problem to Requirements to Architecture.
In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.
"Don't overcomplicate what you're actually doing."
"Getting your basic programming software development skills down is super important to becoming a good data engineer."
"Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."
Key Takeaways:
Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
Jon Erik Kemi Warghed:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords
00:57 How to Choose the Right Tools: Considerations for startups and large companies
03:13 Evaluating Open Source Tools: Background checks and due diligence
07:52 Defining Data Governance: Transparency and understanding of data
10:15 The Importance of Data Governance: Challenges and solutions
12:21 Data Governance Tools: dbt and Dagster
17:05 The Impact of Dagster: Software-defined assets and declarative thinking
19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage
21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management
26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines
28:47 The Importance of Tool Selection: Thinking about long-term sustainability
31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
---
Data Orchestration Tools: Choosing the right one for your needs | ep 6
In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.
Sound Bites
"The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
"It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
"The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."
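The DAG structure described in the last sound bite (directed: there is a start and a finish; acyclic: no loops) can be shown in a few lines using Python's standard library, independent of any particular orchestrator:

```python
from graphlib import TopologicalSorter, CycleError

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# "Directed" with a start and a finish: a valid run order exists.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']

# "Acyclic": introducing a loop makes scheduling impossible.
pipeline["extract"] = {"report"}
try:
    list(TopologicalSorter(pipeline).static_order())
except CycleError:
    print("cycle detected")
```

This dependency-resolution step is the core of what every orchestrator does before layering on scheduling, retries, and monitoring.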
Key Topics
The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
The crowded market: A look at popular options like Airflow, Dagster, Prefect, and more.
Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
Data residency and GDPR: How regulations influence tool selection, especially in Europe.
Future of the field: The need for consolidation and finding the right balance between features and usability.
John Wessel:
LinkedIn
Data Stack Show
Agreeable Data
Nicolay Gerold:
LinkedIn
X (Twitter)
Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.
Chapters
00:00 Introduction and Overview
00:34 The Evolution of Data Orchestration Tools
04:54 Components and Flow of Data in Orchestrators
08:24 Deployment Options: Serverless vs. Kubernetes
11:14 Considerations for Data Residency and Security
13:02 The Need for a Clear Winner in the Orchestration Space
20:47 Optimization Techniques for Memory and Time-Limited Issues
23:09 Integrating Orchestrators with Infrastructure-as-Code
24:33 Bridging the Gap Between Data and Engineering Practices
27:22 Exciting Technologies Outside of Data Orchestration
30:09 The Future of Dagster
---