10 Episodes

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

How AI Is Built Nicolay Gerold

    • Technology
    • 5.0 • 3 ratings


    Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models | ep 8

    Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

    Key Points:


    Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
    Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases (see the sketch after this list).
    Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
    Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
    Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.
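
    The hybrid "virtual entity" idea can be sketched in a few lines of plain Python. This is a minimal illustration, not Graphlit's actual implementation: the HybridStore class and its three backing structures are hypothetical stand-ins for a JSON document store, a graph database, and a search index.

    # Minimal sketch of a "virtual entity" that lives in three stores at once.
    # All classes and names here are hypothetical stand-ins, not Graphlit's API.
    import uuid

    class HybridStore:
        def __init__(self):
            self.doc_store = {}       # JSON documents keyed by content id
            self.graph = {}           # entity name -> set of related content ids
            self.search_index = []    # flat list of (content id, text, metadata)

        def ingest(self, text, entities, metadata):
            """Store one piece of content as a JSON doc, graph edges, and a search entry."""
            content_id = str(uuid.uuid4())
            self.doc_store[content_id] = {"text": text, "metadata": metadata}
            for entity in entities:
                self.graph.setdefault(entity, set()).add(content_id)
            self.search_index.append((content_id, text, metadata))
            return content_id

        def retrieve(self, query, entity_filter=None):
            """Keyword search, optionally pre-filtered through the knowledge graph."""
            allowed = self.graph.get(entity_filter, set()) if entity_filter else None
            hits = []
            for content_id, text, metadata in self.search_index:
                if allowed is not None and content_id not in allowed:
                    continue          # the graph acts as a filtering layer
                if query.lower() in text.lower():
                    hits.append(self.doc_store[content_id])
            return hits

    store = HybridStore()
    store.ingest("Graphlit exposes an ETL API for LLM applications.",
                 entities=["Graphlit"], metadata={"source": "podcast"})
    print(store.retrieve("ETL", entity_filter="Graphlit"))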

    Notable Quotes:


    "Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
    "Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
    "[For RAG]...you have to find constraints to make it workable."
    "Entity resolution, deduping, I think is probably the number one thing."
    "I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
    "One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
    "Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

    Kirk Marple:


    LinkedIn
    X (Twitter)
    Graphlit
    Graphlit Docs

    Nicolay Gerold:


    ⁠LinkedIn⁠
    ⁠X (Twitter)

    Chapters

    00:00 Graphlit’s Hybrid Approach
    02:23 Use Cases and Transition to Graphlit
    04:19 Knowledge Graphs as a Filtering Mechanism
    13:23 Using Gremlin for Querying the Graph
    32:36 XML in Prompts for Better Segmentation
    35:04 The Future of LLMs and Graphlit
    36:25 Getting Started with Graphlit

    Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal



    • 36 min
    Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7

    From Problem to Requirements to Architecture.

    In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They cover the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.


    "Don't overcomplicate what you're actually doing."
    "Getting your basic programming software development skills down is super important to becoming a good data engineer."
    "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

    Key Takeaways:


    Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
    Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
    Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
    Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management (a minimal sketch follows this list).
    The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
    The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
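
    To make the software-defined assets takeaway concrete, here is a minimal sketch using Dagster's @asset decorator: you declare what data should exist, and Dagster derives the execution order. The asset names are invented for illustration, and exact APIs may vary between Dagster versions.

    # Software-defined assets: declare *what* should exist and let the
    # orchestrator derive *how* and in which order to build it.
    # Asset names are invented; check the Dagster docs for your version.
    from dagster import asset, materialize

    @asset
    def raw_orders():
        # In a real pipeline this would pull from an API, warehouse, or lake.
        return [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -3.0}]

    @asset
    def cleaned_orders(raw_orders):
        # The parameter name declares the dependency on the upstream asset.
        return [order for order in raw_orders if order["amount"] > 0]

    @asset
    def daily_revenue(cleaned_orders):
        return sum(order["amount"] for order in cleaned_orders)

    if __name__ == "__main__":
        # Materialize the whole asset graph; Dagster resolves the run order itself.
        result = materialize([raw_orders, cleaned_orders, daily_revenue])
        assert result.success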

    Jon Erik Kemi Warghed:


    LinkedIn

    Nicolay Gerold:


    ⁠LinkedIn⁠
    ⁠X (Twitter)

    Chapters

    00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords

    00:57 How to Choose the Right Tools: Considerations for startups and large companies

    03:13 Evaluating Open Source Tools: Background checks and due diligence

    07:52 Defining Data Governance: Transparency and understanding of data

    10:15 The Importance of Data Governance: Challenges and solutions

    12:21 Data Governance Tools: dbt and Dagster

    17:05 The Impact of Dagster: Software-defined assets and declarative thinking

    19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage

    21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management

    26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines

    28:47 The Importance of Tool Selection: Thinking about long-term sustainability

    31:10 When to Adopt Orchestration: Identifying the need for orchestration tools



    • 38 min
    Data Orchestration Tools: Choosing the right one for your needs | ep 6

    In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project, and they touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They close with data residency considerations and the future of orchestration tools.

    Sound Bites

    "The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
    "It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
    "The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."

    Key Topics


    The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
    What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
    The crowded market: A look at popular options like Airflow, Dagster, Prefect, and more.
    Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
    Data residency and GDPR: How regulations influence tool selection, especially in Europe.
    Future of the field: The need for consolidation and finding the right balance between features and usability.

    John Wessel:


    LinkedIn
    Data Stack Show
    Agreeable Data

    Nicolay Gerold:


    ⁠LinkedIn⁠
    ⁠X (Twitter)

    Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.

    Chapters

    00:00 Introduction and Overview

    00:34 The Evolution of Data Orchestration Tools

    04:54 Components and Flow of Data in Orchestrators

    08:24 Deployment Options: Serverless vs. Kubernetes

    11:14 Considerations for Data Residency and Security

    13:02 The Need for a Clear Winner in the Orchestration Space

    20:47 Optimization Techniques for Memory and Time-Limited Issues

    23:09 Integrating Orchestrators with Infrastructure-as-Code

    24:33 Bridging the Gap Between Data and Engineering Practices

    27:22 Exciting Technologies Outside of Data Orchestration

    30:09 The Future of Dagster



    • 32 min
    Building Reliable LLM Applications, Production-Ready RAG, Data-Driven Evals | ep 5

    In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness.

    Main Insights


    Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but require significant post-training optimization for specific use cases.
    Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process.
    Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient.
    RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment.

    Practical Takeaways


    Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process (an example sketch follows this list).
    Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases.
    RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge.
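
    As a rough illustration of the testing-strategy point above, the sketch below scores a toy RAG output with Ragas metrics. It assumes the classic ragas.evaluate API and a judge model configured via an OpenAI key; metric names and signatures may differ between Ragas versions, so treat this as a sketch rather than a drop-in snippet.

    # Hedged sketch of a Ragas evaluation run; APIs may vary by version,
    # and a judge LLM (e.g. OPENAI_API_KEY in the environment) is assumed.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision

    samples = Dataset.from_dict({
        "question": ["What does Ragas evaluate?"],
        "answer": ["Ragas scores RAG pipelines on faithfulness and relevancy."],
        "contexts": [["Ragas is an open-source library for evaluating RAG applications."]],
        "ground_truth": ["Ragas evaluates retrieval augmented generation pipelines."],
    })

    scores = evaluate(samples, metrics=[faithfulness, answer_relevancy, context_precision])
    print(scores)  # per-metric aggregate scores for the test set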

    Interesting Quotes


    "...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation.
    "Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation.
    "We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation.

    Ragas:


    Web
    Docs

    Jithin James:


    LinkedIn

    Shahul ES:


    LinkedIn
    X (Twitter)

    Nicolay Gerold:


    ⁠LinkedIn⁠
    ⁠X (Twitter)

    00:00 Introduction

    02:03 Introduction to Open Assistant project

    04:05 Creating Customizable and Fine-Tunable Models

    06:07 Ragas and the LLM Use Case

    08:09 Introduction to Language Model Metrics

    11:12 Reducing the Cost of Data Generation

    13:19 Evaluation of Components at Melvess

    15:40 Combining Ragas Metrics with AutoML Providers

    20:08 Improving Performance with Fine-tuning and Reranking

    22:56 End-to-End Metrics and Component-Specific Metrics

    25:14 The Importance of Deep Knowledge and Understanding

    25:53 Robustness vs Optimization

    26:32 Challenges of Evaluating Models

    27:18 Creating a Dream Tech Stack

    27:47 The Future Roadmap for Ragas

    28:02 Doubling Down on Grid Data Generation

    28:12 Open-Source Models and Expanded Support

    28:20 More Metrics for Different Applications

    RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.



    • 29 min
    Lance v2: Rethinking Columnar Storage for Faster Lookups, Nulls, and Flexible Encodings | changelog 2

    In this episode of Changelog, Weston Pace dives into the latest updates to LanceDB, an open-source vector database and file format. Lance's new V2 file format redefines the traditional notion of columnar storage, allowing for more efficient handling of large multimodal datasets like images and embeddings. Weston discusses the goals driving LanceDB's development, including null value support, multimodal data handling, and finding an optimal balance for search performance.

    Sound Bites

    "A little bit more power to actually just try."
    "We're becoming a little bit more feature complete with returns of arrow."
    "Weird data representations that are actually really optimized for your use case."

    Key Points


    Weston introduces LanceDB, an open-source multimodal vector database and file format (a minimal usage sketch follows this list).
    The goals behind LanceDB's design: handling null values, multimodal data, and finding the right balance between point lookups and full dataset scan performance.
    The Lance V2 file format and its potential use cases.
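
    For orientation, here is a minimal LanceDB usage sketch in Python. The table name and vectors are invented, and the calls shown reflect the commonly documented API; check the current docs, since the library evolves quickly.

    # Minimal LanceDB sketch: create a table with embeddings and run a vector search.
    # Table name and vectors are invented for illustration.
    import lancedb

    db = lancedb.connect("./lancedb-demo")  # directory-backed database
    table = db.create_table(
        "docs",
        data=[
            {"id": 1, "text": "Lance is a columnar format for multimodal data.",
             "vector": [0.1, 0.3, 0.5]},
            {"id": 2, "text": "LanceDB is an open-source vector database.",
             "vector": [0.2, 0.1, 0.9]},
        ],
    )

    # Nearest-neighbour lookup against the stored vectors.
    results = table.search([0.2, 0.1, 0.8]).limit(1).to_pandas()
    print(results["text"].iloc[0])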

    Conversation Highlights


    On the benefits of Arrow integration: Strengthening the connection with the Arrow data ecosystem for seamless data handling.
    Why "columnar container format"?: A broader definition than "table format" to encompass more unconventional use cases.
    Tackling multimodal data: How Lance V2 enables storage of large multimodal data efficiently and without needing tons of memory.
    Python's role in encoding experimentation: Providing a way to rapidly prototype custom encodings and plug them into LanceDB.

    LanceDB:


    X (Twitter)
    GitHub
    Web
    Discord
    VectorDB Recipes
    Lance V2

    Weston Pace:


    LinkedIn
    GitHub

    Nicolay Gerold:


    ⁠LinkedIn⁠
    ⁠X (Twitter)

    Chapters

    00:00 Introducing Lance: A New File Format

    06:46 Enabling Custom Encodings in Lance

    11:51 Exploring the Relationship Between Lance and Arrow


    Lance file format, nulls, round-tripping data, optimized data representations, full-text search, encodings, downsides, multimodal data, compression, point lookups, full scan performance, non-contiguous columns, custom encodings



    • 21 min
    Unlocking AI with Supabase: Postgres Configuration, Real-Time Processing, and Extensions | ep 4

    Had a fantastic conversation with Christopher Williams, Solutions Architect at Supabase, about setting up Postgres the right way for AI. We dug deep into Supabase, exploring:


    Core components and how they power real-time AI solutions
    Optimizing Postgres for AI workloads
    The magic of pgvector and other key extensions (a query sketch follows this list)
    Supabase’s future and exciting new features
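
    To ground the pgvector point above, here is a hedged sketch of storing and querying embeddings in Postgres from Python. The connection string, table, and vectors are invented; it assumes the pgvector extension is available (Supabase ships it) and uses psycopg2, so adapt it to your own client.

    # Hedged sketch: pgvector in Postgres (as exposed by Supabase) from Python.
    # Connection string, table name, and vectors are invented for illustration.
    import psycopg2

    conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(3)  -- real embeddings are much wider, e.g. 1536 dims
        );
    """)
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector);",
        ("Supabase pairs Postgres with pgvector for AI workloads.", "[0.1, 0.2, 0.3]"),
    )

    # Nearest-neighbour query: '<->' is pgvector's L2 distance operator.
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 1;",
        ("[0.1, 0.2, 0.25]",),
    )
    print(cur.fetchone()[0])
    conn.commit()
    cur.close()
    conn.close()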




    • 31 min
