Neural Search Talks — Zeta Alpha
A monthly podcast where we discuss recent research and developments in the world of Neural Search, LLMs, RAG, and Natural Language Processing with our co-hosts Jakub Zavrel (AI veteran and founder of Zeta Alpha) and Dinos Papakostas (AI Researcher at Zeta Alpha).
Baking the Future of Information Retrieval Models
In this episode of Neural Search Talks, we're chatting with Aamir Shakir from Mixed Bread AI, who shares his insights on starting a company that aims to make search smarter with AI. He details their approach to overcoming challenges in embedding models, touching on the significance of data diversity, novel loss functions, and the future of multilingual and multimodal capabilities, and walks us through their journey, the ups and downs, and what they're excited about for the future.
Timestamps:
0:00 Introduction
0:25 How did mixedbread.ai start?
2:16 The story behind the company name and its "bakers"
4:25 What makes Berlin a great talent pool for AI
6:12 Building as a GPU-poor team
7:05 The recipe behind mxbai-embed-large-v1
9:56 The AnglE objective for embedding models
15:00 Going beyond Matryoshka with mxbai-embed-2d-large-v1
17:45 Supporting binary embeddings & quantization
19:07 Collecting large-scale data is key for robust embedding models
21:50 The importance of multilingual and multimodal models for IR
24:07 Where will mixedbread.ai be in 12 months?
26:46 Outro
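For listeners curious about the binary embeddings and quantization discussed at 17:45, here is a minimal sketch of the general idea: keep only the sign of each embedding dimension (1 bit instead of 32), then compare vectors by counting agreeing bits. This is illustrative code, not mixedbread.ai's actual implementation, and the function names are ours.

```python
import numpy as np

def binarize(embeddings):
    # Quantize float embeddings to 1 bit per dimension by keeping only the sign.
    # This shrinks storage ~32x at a modest cost in retrieval quality.
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_similarity(a, b):
    # Fraction of bit positions that agree; a cheap stand-in for cosine
    # similarity that needs only XOR/popcount-style operations.
    return float(np.mean(np.asarray(a) == np.asarray(b)))

# Example: two nearby float vectors binarize to identical bit patterns
q = binarize([0.3, -0.7, 0.1, -0.2])   # -> [1, 0, 1, 0]
d = binarize([0.5, -0.1, 0.9, -0.8])   # -> [1, 0, 1, 0]
score = hamming_similarity(q, d)        # -> 1.0
```

In practice the binarized vectors are packed 8 dimensions per byte and compared with hardware popcount, which is what makes binary search indexes so fast.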
Hacking JIT Assembly to Build Exascale AI Infrastructure
Ash Vardanian shares his journey from software development to pioneering work in AI infrastructure with Unum. He discusses Unum's focus on unleashing the full potential of modern computers for AI, search, and database applications through efficient data processing and infrastructure. Highlighting Unum's technical achievements, including its use of SIMD instructions and just-in-time compilation, Ash also touches on the future of computing and his vision for Unum to contribute to advances in personalized medicine and extending human productivity.
Timestamps:
0:00 Introduction
0:44 How did Unum start and what is it about?
6:12 Differentiating from the competition in vector search
17:45 Supporting modern features like large dimensions & binary embeddings
27:49 Upcoming model releases from Unum
30:00 The future of hardware for AI
34:56 The impact of AI in society
37:35 Outro
The Promise of Language Models for Search: Generative Information Retrieval
In this episode of Neural Search Talks, Andrew Yates (Assistant Professor at the University of Amsterdam), Sergi Castella (Analyst at Zeta Alpha), and Gabriel Bénédict (PhD student at the University of Amsterdam) discuss the prospect of using GPT-like models as a replacement for conventional search engines.
Generative Information Retrieval (Gen IR) SIGIR Workshop
Workshop organized by Gabriel Bénédict, Ruqing Zhang, and Donald Metzler https://coda.io/@sigir/gen-ir
Resources on Gen IR: https://github.com/gabriben/awesome-generative-information-retrieval
References
Rethinking Search: https://arxiv.org/abs/2105.02274
Survey on Augmented Language Models: https://arxiv.org/abs/2302.07842
Differentiable Search Index: https://arxiv.org/abs/2202.06991
Recommender Systems with Generative Retrieval: https://shashankrajput.github.io/Generative.pdf
Timestamps:
00:00 Introduction, ChatGPT Plugins
02:01 ChatGPT plugins, LangChain
04:37 What even is Information Retrieval?
06:14 Index-centric vs. model-centric Retrieval
12:22 Generative Information Retrieval (Gen IR)
21:34 Gen IR emerging applications
24:19 How Retrieval Augmented LMs incorporate external knowledge
29:19 What is hallucination?
35:04 Factuality and Faithfulness
41:04 Evaluating generation of Language Models
47:44 Do we even need to "measure" performance?
54:07 How would you evaluate Bing's Sydney?
57:22 Will language models take over commercial search?
1:01:44 NLP academic research in the times of GPT-4
1:06:59 Outro
Task-aware Retrieval with Instructions
Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the paper "Task-aware Retrieval with Instructions" by Akari Asai et al. The paper proposes augmenting a conglomerate of existing retrieval and NLP datasets with natural language instructions (BERRI, Bank of Explicit RetRieval Instructions) and using it to train TART (Multi-task Instructed Retriever).
📄 Paper: https://arxiv.org/abs/2211.09260
🍻 BEIR benchmark: https://arxiv.org/abs/2104.08663
📈 LOTTE (Long-Tail Topic-stratified Evaluation, introduced in ColBERT v2): https://arxiv.org/abs/2112.01488
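The core mechanism of retrieval with instructions is simple: a task description in natural language is prepended to the query before encoding, so a single retriever can serve many tasks. A rough sketch of the idea (the separator and instruction wording here are our assumptions, not the paper's exact format):

```python
def build_instructed_query(instruction: str, query: str) -> str:
    # TART-style input: the task instruction is prepended to the query text,
    # and the concatenation is what gets encoded by the retriever.
    # The "[SEP]" separator is illustrative, not the paper's exact template.
    return f"{instruction} [SEP] {query}"

# The same query expresses different retrieval intents under different instructions:
q1 = build_instructed_query(
    "Retrieve a Wikipedia paragraph that answers the question.",
    "When was YouTube founded?")
q2 = build_instructed_query(
    "Retrieve a news headline about the topic.",
    "When was YouTube founded?")
```

Because the instruction changes the encoded representation, the two variants above can retrieve different documents for an identical user query.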
Timestamps:
00:00 Intro: "Task-aware Retrieval with Instructions"
02:20 BERRI, TART, X^2 evaluation
04:00 Background: recent works in domain adaptation
06:50 Instruction Tuning
08:50 Retrieval with descriptions
11:30 Retrieval with instructions
17:28 BERRI, Bank of Explicit RetRieval Instructions
21:48 Repurposing NLP tasks as retrieval tasks
23:53 Negative document selection
27:47 TART, Multi-task Instructed Retriever
31:50 Evaluation: Zero-shot and X^2 evaluation
39:20 Results on Table 3 (BEIR, LOTTE)
50:30 Results on Table 4 (X^2-Retrieval)
55:50 Ablations
57:17 Discussion: user modeling, future work, scale
Generating Training Data with Large Language Models w/ Special Guest Marzieh Fadaee
Marzieh Fadaee (NLP Research Lead at Zeta Alpha) joins Andrew Yates and Sergi Castella to chat about her work using large language models like GPT-3 to generate domain-specific training data for retrieval models with little-to-no human input. The two papers discussed are "InPars: Data Augmentation for Information Retrieval using Large Language Models" and "Promptagator: Few-shot Dense Retrieval From 8 Examples".
InPars: https://arxiv.org/abs/2202.05144
Promptagator: https://arxiv.org/abs/2209.11755
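Both papers follow the same basic recipe: show an LLM a few (document, query) examples, append a new document, and treat the generated query as a synthetic positive for training a retriever. A minimal sketch of that prompt construction (the wording is illustrative, not either paper's exact template, and `generate` stands in for any LLM completion call):

```python
# Few-shot examples pairing a document with a relevant query, as in
# InPars/Promptagator-style prompting (example content is ours).
FEW_SHOT_EXAMPLES = [
    ("The Manhattan Project was a WWII research effort that produced "
     "the first nuclear weapons.",
     "what was the manhattan project"),
]

def build_prompt(document: str) -> str:
    # Concatenate the few-shot pairs, then the new document with an
    # open-ended "Relevant query:" slot for the LLM to complete.
    lines = []
    for doc, query in FEW_SHOT_EXAMPLES:
        lines.append(f"Document: {doc}\nRelevant query: {query}\n")
    lines.append(f"Document: {document}\nRelevant query:")
    return "\n".join(lines)

def make_training_pair(document: str, generate) -> tuple[str, str]:
    # `generate` is any callable wrapping an LLM completion endpoint.
    # The generated query and the source document form a positive pair;
    # the papers additionally filter low-quality generations before training.
    query = generate(build_prompt(document)).strip()
    return (query, document)
```

The papers differ mainly in what happens after this step: InPars filters generations by LLM score and trains a reranker, while Promptagator round-trip-filters with the retriever itself, a detail the episode digs into from 48:40.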
Timestamps:
00:00 Introduction
02:00 Background and journey of Marzieh Fadaee
03:10 Challenges of leveraging Large LMs in Information Retrieval
05:20 InPars, motivation and method
14:30 Vanilla vs GBQ prompting
24:40 Evaluation and Benchmark
26:30 Baselines
27:40 Main results and takeaways (Table 1, InPars)
35:40 Ablations: prompting, in-domain vs. MSMARCO input documents
40:40 Promptagator overview and main differences with InPars
48:40 Retriever training and filtering in Promptagator
54:37 Main Results (Table 2, Promptagator)
1:02:30 Ablations on consistency filtering (Figure 2, Promptagator)
1:07:39 Is this the magic black-box pipeline for neural retrieval on any documents?
1:11:14 Limitations of using LMs for synthetic data
1:13:00 Future directions for this line of research
ColBERT + ColBERTv2: late interaction at a reasonable inference cost
Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (from 2020) and ColBERTv2 (from 2022), which propose a fast late-interaction operation that achieves performance close to full cross-encoders at a much more manageable inference cost, along with many other optimizations.
📄 ColBERT: "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" by Omar Khattab and Matei Zaharia. https://arxiv.org/abs/2004.12832
📄 ColBERTv2: "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. https://arxiv.org/abs/2112.01488
📄 PLAID: "An Efficient Engine for Late Interaction Retrieval" by Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. https://arxiv.org/abs/2205.09707
📄 CEDR: "CEDR: Contextualized Embeddings for Document Ranking" by Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. https://arxiv.org/abs/1904.07094
🪃 Feedback form: https://scastella.typeform.com/to/rg7a5GfJ
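The late-interaction scoring at the heart of ColBERT can be sketched in a few lines of NumPy. This is a simplification for intuition only: the real model encodes tokens with BERT, pads queries with [MASK] tokens, and serves results from an optimized index.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT's MaxSim late interaction: for each query token embedding,
    take its maximum cosine similarity over all document token embeddings,
    then sum these maxima into a single relevance score."""
    # L2-normalize token embeddings so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token, summed

# Toy example: two query tokens, each perfectly matched by some doc token
query = np.eye(2)                                         # 2 tokens, dim 2
doc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # 3 tokens, dim 2
score = maxsim_score(query, doc)                          # -> 2.0
```

Because documents are encoded offline and only this cheap max-then-sum runs at query time, ColBERT sits between single-vector bi-encoders and full cross-encoders in both cost and quality, which is the trade-off the episode unpacks from 03:34.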
Timestamps:
00:00 Introduction
00:42 Why ColBERT?
03:34 Retrieval paradigms recap
08:04 ColBERT query formulation and architecture
09:04 Using ColBERT as a reranker or as an end-to-end retriever
11:28 Space Footprint vs. MRR on MS MARCO
12:24 Methodology: datasets and negative sampling
14:37 Terminology for cross encoders, interaction-based models, etc.
16:12 Results (ColBERT v1) on MS MARCO
18:41 Ablations on model components
20:34 Max pooling vs. mean pooling
22:54 Why did ColBERT have a big impact?
26:31 ColBERTv2: knowledge distillation
29:34 ColBERTv2: indexing improvements
33:59 Effects of clustering compression in performance
35:19 Results (ColBERT v2): MS MARCO
38:54 Results (ColBERT v2): BEIR
41:27 Takeaway: especially strong in out-of-domain evaluation
43:59 Qualitatively, what do ColBERT scores look like?
46:21 What's the most promising of all current neural IR paradigms?
49:34 How come there's still so much interest in Dense retrieval?
51:09 Many-to-many similarity at different granularities
53:44 What would ColBERT v3 include?
56:39 PLAID: An Efficient Engine for Late Interaction Retrieval
Contact: castella@zeta-alpha.com