How to Evaluate Large Language Models and RAG Applications with Pasquale Antonante ODSC's Ai X Podcast

How to Evaluate Large Language Models and RAG Applications
In this episode, Pasquale Antonante, Co-Founder & CTO of Relari AI, joins us to discuss evaluation methods for LLM and RAG applications. Pasquale has a PhD from MIT, where he focused on the reliability of complex AI systems. At Relari AI, they are building an open-source platform to simulate, test, and validate complex generative AI (GenAI) applications.

During the interview, we’ll discuss Relari AI's innovative approach to improving generative AI and RAG applications, which was inspired by testing methodologies from the autonomous vehicle industry.

We’ll cover topics like the complexity of GenAI workflows, the challenges in evaluating these systems, and evaluation methods such as reference-free, reference-based, and synthetic-data-based approaches. We’ll also explore metrics like precision, recall, faithfulness, and relevance, and compare GPT auto-evaluators with simulated user feedback.

Finally, we'll highlight Relari's continuous-eval open-source project and the future of leveraging synthetic data for LLM finetuning.

Topics
- Guest background and an introduction to the startup Relari AI
- What the LLM industry can learn from the autonomous vehicle space
- What do companies view as the biggest challenge to the adoption of generative AI?
- Why are GenAI application workflows and pipelines so complex?
- Explanation of how Retrieval-Augmented Generation (RAG) works and its benefits over traditional generation models
- The challenges of evaluating these workflows
- Different ways to evaluate LLM pipelines
- Reference-free, reference-based, and synthetic-data-based evaluation for LLMs and RAG
- Measuring precision, recall, faithfulness, relevance, and correctness in RAG systems
- The key metrics used to evaluate RAG pipelines
- Semantic metrics and LLM-based metrics
- GPT auto-evaluators versus the advantages of simulated user feedback evaluators
- The role human evaluation plays in assessing the quality of generated text
- The continuous-eval open-source project and the various metrics contained therein
- Leveraging synthetic data to improve LLM finetuning
- What’s next for Relari?
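Several of the topics above center on measuring precision, recall, and correctness for RAG retrieval. As a minimal illustrative sketch (the function name and set-based formulation are our own, not the continuous-eval API), retrieval precision, recall, and F1 over retrieved vs. ground-truth relevant chunks can be computed like this:

```python
def retrieval_precision_recall(retrieved, relevant):
    """Set-based retrieval metrics:
    precision = fraction of retrieved chunks that are relevant,
    recall    = fraction of relevant chunks that were retrieved,
    f1        = harmonic mean of the two."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Retriever returned chunks c1-c3; only c2 and c4 actually answer the question.
p, r, f = retrieval_precision_recall(["c1", "c2", "c3"], ["c2", "c4"])
```

Here precision is 1/3 (one of three retrieved chunks is relevant) and recall is 1/2 (one of two relevant chunks was retrieved), which is the kind of reference-based retrieval measurement discussed in the episode.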

Show Notes:

Learn more about Pasquale:
https://www.linkedin.com/in/pasquale-antonante/
https://www.mit.edu/~antonap/
https://scholar.google.com/citations?user=7Vvpd-YAAAAJ&hl=it

Learn more about Relari:
https://www.relari.ai/
https://github.com/relari-ai/continuous-eval

Task-Aware Risk Estimation of Perception Failures for Autonomous Vehicles
https://arxiv.org/abs/2305.01870

BM25
https://en.wikipedia.org/wiki/Okapi_BM25
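For context on the BM25 link above: Okapi BM25 ranks documents by combining term frequency, inverse document frequency, and a document-length normalization. A minimal self-contained sketch (the k1 and b defaults are conventional choices, not anything prescribed in the episode):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against tokenized `query`
    using the Okapi BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency and smoothed IDF for each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf[t] * num / den
        scores.append(s)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
scores = bm25_scores("cat mat".split(), docs)
```

Only the first document contains the exact query tokens, so it receives a positive score while the others score zero; real RAG retrievers often pair a lexical scorer like this with embedding-based search.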

Precision, Recall, F1 score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

Monitoring Text-Based Generative AI Models Using Metrics Like BLEU, ROUGE, and METEOR score
https://arize.com/blog-course/generative-ai-metrics-bleu-score/

A Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval)
https://blog.relari.ai/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893

A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation)
https://blog.relari.ai/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d

This episode was sponsored by:
Ai+ Training https://aiplus.training/
Home to hundreds of hours of on-demand, self-paced AI training, ODSC interviews, free webinars, and certifications in in-demand skills like LLMs and Prompt Engineering

And created in partnership with ODSC https://odsc.com/
The Leading AI Training Conference, featuring expert-led, hands-on workshops, training sessions, and talks on cutting-edge AI topics and tools, from data science and machine learning to generative AI to LLMOps

Never miss an episode, subscribe now!
