How AI Is Built Nicolay Gerold
Technology
How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Lance v2: Rethinking Columnar Storage for Faster Lookups, Nulls, and Flexible Encodings | changelog 2
In this episode of Changelog, Weston Pace dives into the latest updates to LanceDB, an open-source vector database and file format. Lance's new V2 file format redefines the traditional notion of columnar storage, allowing for more efficient handling of large multimodal datasets like images and embeddings. Weston discusses the goals driving LanceDB's development, including null value support, multimodal data handling, and finding an optimal balance for search performance.
Sound Bites
"A little bit more power to actually just try."
"We're becoming a little bit more feature complete in terms of Arrow."
"Weird data representations that are actually really optimized for your use case."
Key Points
Weston introduces LanceDB, an open-source multimodal vector database and file format.
The goals behind LanceDB's design: handling null values, multimodal data, and finding the right balance between point lookups and full dataset scan performance.
Lance V2 File Format:
Potential Use Cases
Conversation Highlights
On the benefits of Arrow integration: Strengthening the connection with the Arrow data ecosystem for seamless data handling.
Why "columnar container format"?: A broader definition than "table format" to encompass more unconventional use cases.
Tackling multimodal data: How LanceDB V2 enables storage of large multimodal data efficiently and without needing tons of memory.
Python's role in encoding experimentation: Providing a way to rapidly prototype custom encodings and plug them into LanceDB.
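To make the idea of a pluggable column encoding concrete, here is a minimal sketch in plain Python. This is an illustration of the concept only, not LanceDB's actual encoding API: run-length encoding, a classic scheme that pays off for columns with long stretches of repeated values.

```python
# Hypothetical sketch of a custom column encoding (not LanceDB's real API):
# run-length encoding compresses a column into (value, run_length) pairs.

def rle_encode(column):
    """Encode a list of values as (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

col = ["a", "a", "a", "b", "b", "c"]
encoded = rle_encode(col)
assert encoded == [("a", 3), ("b", 2), ("c", 1)]
assert rle_decode(encoded) == col  # round-trips losslessly
```

The appeal of prototyping encodings in Python first is exactly this kind of quick feedback loop: you can verify an encoding round-trips correctly on real data before committing to an optimized implementation.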
LanceDB:
X (Twitter)
GitHub
Web
Discord
VectorDB Recipes
Lance V2
Weston Pace:
LinkedIn
GitHub
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Introducing Lance: A New File Format
06:46 Enabling Custom Encodings in Lance
11:51 Exploring the Relationship Between Lance and Arrow
20:04 New Chapter
Lance file format, nulls, round-tripping data, optimized data representations, full-text search, encodings, downsides, multimodal data, compression, point lookups, full scan performance, non-contiguous columns, custom encodings
---
Send in a voice message: https://podcasters.spotify.com/pod/show/nicolaygerold/message -
Unlocking AI with Supabase: Postgres Configuration, Real-Time Processing, and Extensions
Had a fantastic conversation with Christopher Williams, Solutions Architect at Supabase, about setting up Postgres the right way for AI. We dug deep into Supabase, exploring:
Core components and how they power real-time AI solutions
Optimizing Postgres for AI workloads
The magic of PG Vector and other key extensions
Supabase’s future and exciting new features
---
AI Inside Your Database, Real-Time AI, Declarative ML/AI | ep 3
If you've ever wanted a simpler way to integrate AI directly into your database, SuperDuperDB might be the answer. SuperDuperDB lets you easily apply AI processes to your data while keeping everything up-to-date with real-time calculations. It works with various databases and aims to make AI development less of a headache.
In this podcast, we explore:
How SuperDuperDB bridges the gap between AI and databases.
The benefits of real-time AI processes within your data deployment.
SuperDuperDB's framework for configuring AI workflows.
The future of AI-powered databases.
Takeaways
SuperDuperDB enables developers to apply AI processes directly to their data stores
The platform supports real-time computation of embeddings or classifications, keeping the data deployment up to date
SuperDuperDB provides a framework for configuring AI processes that work in close combination with the data deployment
The platform supports a variety of databases, including operational and analytical databases
SuperDuperDB aims to simplify AI development by abstracting the data layer and infrastructure
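The takeaways above can be sketched in a few lines of plain Python. This is a toy illustration of the real-time-computation idea, not SuperDuperDB's actual API: whenever a row is inserted, a model output (here a stand-in embedding) is computed and stored with it, so derived data never goes stale.

```python
# Toy sketch of keeping model outputs in sync with the data store
# (illustrative only, not SuperDuperDB's API).

def toy_embedding(text):
    # Stand-in for a real embedding model: a fixed-size character histogram.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

class ListeningStore:
    """A store that recomputes a model output on every insert."""

    def __init__(self, compute):
        self.rows = []
        self.compute = compute

    def insert(self, text):
        # The embedding is computed at write time, so reads never see
        # a row whose derived data is missing or stale.
        self.rows.append({"text": text, "embedding": self.compute(text)})

store = ListeningStore(toy_embedding)
store.insert("hello")
store.insert("world")
assert all("embedding" in row for row in store.rows)
```

The design point this illustrates is the episode's core claim: moving the compute next to the data, instead of running a separate batch pipeline, is what keeps the deployment continuously up to date.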
Duncan Blythe:
LinkedIn
SuperDuperDB:
Docs
Website
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 Introduction to SuperDuperDB
04:19 Real-time Computation and Data Deployment
13:46 Bringing Compute and Database Closer Together
29:30 Declarative Machine Learning with SuperDuperDB
35:09 Future Plans for SuperDuperDB
SuperDuperDB, AI, databases, embeddings, classifications, data deployment, operational databases, analytical databases, AI development, data science
---
Supabase acquires OrioleDB, A New Database Engine for PostgreSQL | changelog 1
Supabase just acquired OrioleDB, a storage engine for PostgreSQL.
Oriole gets creative with MVCC! It uses an undo log rather than keeping multiple versions of an entire data row (tuple). When you update data, Oriole records the changes needed to "undo" the update if necessary. Think of the "undo" function in a text editor: instead of keeping a full copy of the old text, it just remembers what changed, which can be much smaller. It also eliminates the need for a separate garbage collection process.
It also has a bunch of additional performance boosters like data compression, easy integration with data lakes, and index-organized tables.
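The undo-log idea can be sketched in a few lines of Python. This is a toy model of the concept, not OrioleDB's implementation: an update stores only the reverse delta for the changed fields, rather than a full copy of the old row version.

```python
# Toy sketch of undo-log MVCC (illustrative only, not OrioleDB's code):
# updates record just the fields needed to restore the previous version.

class UndoTable:
    def __init__(self):
        self.rows = {}       # row_id -> current row dict
        self.undo_log = []   # (row_id, {field: old_value}) entries

    def insert(self, row_id, row):
        self.rows[row_id] = dict(row)

    def update(self, row_id, changes):
        # Save only the old values of the fields being changed --
        # much smaller than copying the whole tuple.
        old = {k: self.rows[row_id][k] for k in changes}
        self.undo_log.append((row_id, old))
        self.rows[row_id].update(changes)

    def rollback_last(self):
        # Apply the reverse delta, like "undo" in a text editor.
        row_id, old = self.undo_log.pop()
        self.rows[row_id].update(old)

t = UndoTable()
t.insert(1, {"name": "ada", "credits": 100})
t.update(1, {"credits": 50})
assert t.rows[1]["credits"] == 50
t.rollback_last()
assert t.rows[1]["credits"] == 100
```

Because old versions live only as deltas in the undo log, dead full-row copies never accumulate in the table itself, which is why no separate garbage-collection pass is needed.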
Show notes:
Oriole joins Supabase
Oriole Git
Percona Talk on OrioleDB
Supabase
Chris Gwilliams:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:42 Introduction to OrioleDB
04:38 The Undo Log Approach
08:39 Improving Performance for High Throughput Databases
11:08 My take on OrioleDB
OrioleDB, storage engine, Postgres, table access methods, undo log, high throughput databases, automated features, new use cases, S3, data migration
---
AI Powered Data Transformation, Combining gen & trad AI, Semantic Validation | ep 2
Today’s guest is Antonio Bustamante, a serial entrepreneur who previously built Kite and Silo and is now working to fix bad data. He is building bem, the data tool to transform any data into the schema your AI and software needs.
bem.ai is a data tool that focuses on transforming any data into the schema needed for AI and software. It acts as a system's interoperability layer, allowing systems that couldn't communicate before to exchange information. Learn what place LLMs play in data transformation, how to build reliable data infrastructure and more.
"Surprisingly, the hardest was semi-structured data. That is data that should be structured, but is unreliable, undocumented, hard to work with."
"We were spending close to four or five million dollars a year just in integrations, which is no small budget for a company that size. So I was pretty much determined to fix this problem once and for all."
"bem focuses on being the system's interoperability layer."
"We basically take in anything you send us, we transform it exactly into your internal data schema so that you don't have to parse, process, transform anything of that sort."
"LLMs are 30% of it... A lot of it is very, very thorough validation layers, great infrastructure, just ensuring reliability and connection to our user systems.”
"You can use a million token context window and feed an entire document to an LLM. I can guarantee you, if you don't semantically chunk it out before, you're not going to get the right results.”
"We're obsessed with time to value... Our milestone is basically five minute onboarding max, and then you're ready to go."
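The semantic-chunking point above can be illustrated with a short sketch. This is a generic heuristic, not bem's pipeline: split a document on its own structural boundaries (here, blank lines between paragraphs) before handing pieces to an LLM, instead of stuffing the whole thing into one giant context window.

```python
# Minimal sketch of semantic chunking (an illustrative heuristic,
# not bem's actual pipeline): pack whole paragraphs into chunks
# rather than cutting the text at arbitrary character offsets.

def semantic_chunks(doc, max_chars=200):
    # Paragraphs (split on blank lines) are the smallest units that
    # still carry self-contained meaning.
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSection one body text.\n\nSection two body text."
for chunk in semantic_chunks(doc, max_chars=40):
    print(chunk)
```

Real pipelines would split on richer signals (headings, topic shifts, embeddings), but the principle is the same: chunk boundaries should follow the document's meaning, not a fixed byte count.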
Antonio Bustamante
LinkedIn
X (Twitter)
bem.ai
LinkedIn
Website
Nicolay Gerold:
LinkedIn
X (Twitter)
Semi-structured data, Data integrations, Large language models (LLMs), Data transformation, Schema interoperability, Fault tolerance, Validation layers, System reliability, Schema evolution, Enterprise software, Data pipelines.
Chapters
00:00 The Problem of Integrations
05:58 Building Fault Tolerant Systems
13:51 Versioning and Semantic Validation
27:33 BEM in the Data Ecosystem
34:40 Future Plans and Onboarding
---
Multimodal AI, Storing 1 Billion Vectors, Building Data Infrastructure | ep 1
Imagine a world where data bottlenecks, slow data loaders, or memory issues on the VM don't hold back machine learning.
Success in machine learning and AI depends on how fast you can iterate. LanceDB is here to enable fast experiments on top of terabytes of unstructured data. It is the database for AI. Dive with us into how LanceDB was built, what went into the decision to use Rust as the main implementation language, the potential of AI on top of LanceDB, and more.
"LanceDB is the database for AI...to manage their data, to do a performant billion scale vector search."
“We're big believers in the composable data systems vision."
"You can insert data into LanceDB using Pandas DataFrames...to sort of really large 'embed the internet' kind of workflows."
"We wanted to create a new generation of data infrastructure that makes their [AI engineers] lives a lot easier."
"LanceDB offers up to 1,000 times faster performance than Parquet."
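To make "vector search" concrete for readers new to the topic, here is a brute-force nearest-neighbor search in pure Python. This is only the underlying concept at toy scale; engines like LanceDB reach billion-vector workloads with optimized indexes and storage, not a linear scan.

```python
# Toy brute-force vector search (concept illustration only -- real vector
# databases use approximate indexes, not a full scan like this).
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(vectors, query, k=1):
    # Score every stored vector against the query and keep the top k.
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(kv[1], query),
                    reverse=True)
    return [key for key, _ in scored[:k]]

vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
assert search(vectors, [1.0, 0.0], k=1) == ["cat"]
```

The gap between this O(n) scan and an indexed search is exactly where the performance claims in the episode come from.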
Chang She:
LinkedIn
X (Twitter)
LanceDB:
X (Twitter)
GitHub
Web
Discord
VectorDB Recipes
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters:
00:00 Introduction to LanceDB
02:16 Building LanceDB in Rust
12:10 Optimizing Data Infrastructure
26:20 Surprising Use Cases for LanceDB
32:01 The Future of LanceDB
LanceDB, AI, database, Rust, multimodal AI, data infrastructure, embeddings, images, performance, Parquet, machine learning, model database, function registries, agents.
---