
ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt
The second wave of speakers for the AI Engineer World’s Fair has been announced! The conference has sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE.
This episode is straightforwardly a part 2 to our ICLR 2024 Part 1 episode, so without further ado, we’ll just get right on with it!
Timestamps
[00:03:43] Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger
* [00:07:44] WebArena
* [00:18:45] Sotopia
* [00:24:00] Performance Improving Code Edits
* [00:29:39] OpenDevin
* [00:47:40] Industry and Academia
[01:05:29] Section B: Benchmarks
* [01:05:52] SWE-Bench
* [01:17:05] SWE-Bench/SWE-Agent Interview
* [01:27:40] Dataset Contamination Detection
* [01:39:20] GAIA Benchmark
* [01:49:18] Moritz Hardt - Science of Benchmarks
[02:36:32] Section C: Reasoning and Post-Training
* [02:37:41] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
* [02:51:00] Let’s Verify Step By Step
* [02:57:04] Noam Brown
* [03:07:43] Lilian Weng - Towards Safe AGI
* [03:36:56] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
* [03:48:43] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
[04:00:51] Bonus: Notable Related Papers on LLM Capabilities
Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger
* Guests
* Graham Neubig
* Aman Sanger - Previous guest and NeurIPS friend of the pod!
* WebArena
* Sotopia (spotlight paper, website)
* Learning Performance-Improving Code Edits
* OpenDevin
* Junyang OpenDevin
* Morph Labs, Jesse Han
* SWE-Bench
* SWE-Agent
* Aman’s tweet on SWE-bench
* LiteLLM
* LiveCodeBench
* The role of code in reasoning
* Language Models of Code are Few-Shot Commonsense Learners
* Industry vs academia
* The Matryoshka embeddings incident
* Other directions
* Unlimiformer
Section A timestamps
* [00:00:00] Introduction to Guests and the Impromptu Nature of the Podcast
* [00:00:45] Graham's Experience in Japan and Transition into Teaching NLP
* [00:01:25] Discussion on What Constitutes a Good Experience for Students in NLP Courses
* [00:02:22] The Relevance and Teaching of Older NLP Techniques Like Ngram Language Models
* [00:03:38] Speculative Decoding and the Comeback of Ngram Models
* [00:04:16] Introduction to WebArena and Sotopia Projects
* [00:05:19] Deep Dive into the WebArena Project and Benchmarking
* [00:08:17] Performance Improvements in WebArena Using GPT-4
* [00:09:39] Human Performance on WebArena Tasks and Challenges in Evaluation
* [00:11:04] Follow-up Work from WebArena and Focus on Web Browsing as a Benchmark
* [00:12:11] Direct Interaction vs. Using APIs in Web-Based Tasks
* [00:13:29] Challenges in Base Models for WebArena and the Potential of Visual Models
* [00:15:33] Introduction to Sotopia and Exploring Social Interactions with Language Models
* [00:16:29] Different Types of Social Situations Modeled in Sotopia
* [00:17:34] Evaluation of Language Models in Social Simulations
* [00:20:41] Introduction to Performance-Improving Code Edits Project
* [00:26:28] Discussion on Devin and the Future of Coding Agents
* [00:32:01] Planning in Coding Agents and the Development of OpenDevin
* [00:38:34] The Changing Role of Academia in the Context of Large Language Models
* [00:44:44] The Changing Nature of Industry and Academia Collaboration
* [00:54:07] Update on NLP Course Syllabus and Teaching about Large Language Models
* [01:00:40] Call to Action: Contributions to OpenDevin and Open Source AI Projects
* [01:01:56] Hiring at Cursor for Roles in Code Generation and Assistive Coding
* [01:02:12] Promotion of the AI Engineer Conference
Section B: Benchmarks
* Carlos Jimenez & John Yang (Princeton) et al: SWE-bench: Can Language Models Resolve Real-world Github Issues? (ICLR Oral, Paper, website)
* “We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks.
Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.”
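If you want to poke at SWE-bench yourself, here’s a minimal sketch of loading it with the `datasets` library, assuming the released split lives on the Hugging Face Hub as `princeton-nlp/SWE-bench` (the field names below are a best guess at the schema, so check the dataset card before relying on them):

```python
# Minimal sketch: inspect SWE-bench task instances.
# Assumes the dataset is hosted as "princeton-nlp/SWE-bench"; field names are
# a best guess and may differ from the released schema.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swebench))  # ~2,294 task instances per the paper

example = swebench[0]
print(example["repo"])               # which of the 12 Python repositories the issue comes from
print(example["base_commit"])        # commit to check the repo out at before editing
print(example["problem_statement"])  # the GitHub issue text the model is given
```

Evaluation then amounts to checking the repo out at `base_commit`, applying the model’s generated patch, and running the tests associated with the issue’s pull request.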
* Yonatan Oren et al (Stanford): Proving Test Set Contamination in Black-Box Language Models (ICLR Oral, paper, Aman’s tweet on SWE-bench contamination)
* “We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples.
* We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus.”
* Outstanding Paper mention: “A simple yet elegant method to test whether a supervised-learning dataset has been included in LLM training.”
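The ordering test is simple enough to sketch: without contamination, an exchangeable benchmark should be equally likely under any ordering of its examples, so we can compare the likelihood of the canonical (published) ordering against random shuffles. The paper’s actual procedure is a sharded, provably-calibrated version of this idea; the toy permutation test below (with a hypothetical `sequence_logprob` scoring function) just shows the core intuition:

```python
# Toy version of the exchangeability-based contamination test.
# `sequence_logprob(examples)` is a hypothetical helper that concatenates the
# examples in the given order and returns the model's log-likelihood of the
# resulting text; it is not part of any specific library.
import random

def contamination_pvalue(examples, sequence_logprob, num_shuffles=100, seed=0):
    rng = random.Random(seed)
    canonical = sequence_logprob(examples)  # log p(benchmark in published order)
    at_least_as_likely = 0
    for _ in range(num_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if sequence_logprob(shuffled) >= canonical:
            at_least_as_likely += 1
    # Small p-values mean the model finds the canonical ordering suspiciously
    # likely relative to shuffles, i.e. evidence of memorization/contamination.
    return (at_least_as_likely + 1) / (num_shuffles + 1)
```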
* Thomas Scialom (Meta AI-FAIR w/ Yann LeCun): GAIA: A Benchmark for General AI Assistants (paper)
* “We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency.
* GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.
* GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions.”
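For a sense of how lightweight GAIA’s evaluation is: answers are short strings or numbers, so scoring reduces to a normalized exact match over the questions. The snippet below is an illustrative scorer rather than the benchmark’s official code, and the normalization rules are our own assumptions:

```python
# Illustrative GAIA-style scorer (not the official implementation).
def normalize(answer: str) -> str:
    # Assumed normalization: lowercase, strip commas, collapse whitespace.
    return " ".join(answer.strip().lower().replace(",", "").split())

def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(gold_answer)
        for qid, gold_answer in gold.items()
    )
    return correct / len(gold)  # humans score ~0.92, GPT-4 with plugins ~0.15
```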