2 hr 13 min

29 - Science of Deep Learning with Vikrant Varma
AXRP - the AI X-risk Research Podcast

    • Technology

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
 
Topics we discuss, and timestamps:
0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
  0:00:36 - What is CCS?
  0:09:54 - Consistent and contrastive features other than model beliefs
  0:20:34 - Understanding the banana/shed mystery
  0:41:59 - Future CCS-like approaches
  0:53:29 - CCS as principal component analysis
0:56:21 - Explaining grokking through circuit efficiency
  0:57:44 - Why research science of deep learning?
  1:12:07 - Summary of the paper's hypothesis
  1:14:05 - What are 'circuits'?
  1:20:48 - The role of complexity
  1:24:07 - Many kinds of circuits
  1:28:10 - How circuits are learned
  1:38:24 - Semi-grokking and ungrokking
  1:50:53 - Generalizing the results
1:58:51 - Vikrant's research approach
2:06:36 - The DeepMind alignment team
2:09:06 - Follow-up work
 
The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html
Vikrant's Twitter/X account: twitter.com/vikrantvarma_
 
Main papers:
 - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029
 - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390
 
Other works discussed:
 - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827
 - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
 - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
 - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY
 - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4
 - Scott Emmons, Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast
 - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
 - Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993 L2): dl.acm.org/doi/pdf/10.1145/168304.168306
 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521
 
Episode art by Hamish Doodles: hamishdoodles.com

