AI: post transformers

mcgrof

The transformer architecture revolutionized the world of neural networks and served as a springboard for what we know today as modern artificial intelligence. This podcast reviews state-of-the-art research papers, starting from the transformer and moving forward.

  1. Advancing Mechanistic Interpretability with Sparse Autoencoders

    We review recent papers on advances in, and critical uses of, Sparse Autoencoders (SAEs), tools used to decode the internal "monosemantic" features of large language models. Research from **ICLR 2025** and other venues introduces **TopK SAEs** and **Multi-Layer SAEs**, demonstrating that these architectures offer better reconstruction and scalability than traditional ReLU-based models (a minimal sketch of the TopK mechanism follows the source list below). **RouteSAE** further improves efficiency by using a **dynamic routing mechanism** to extract integrated features from across multiple layers of a model's residual stream. However, critical analysis reveals that many identified "reasoning" features may actually be **linguistic correlates** or syntactic templates rather than genuine cognitive traces. Using **falsification frameworks** and **causal token injection**, researchers caution against over-interpreting feature activations without rigorous validation. Together, these papers provide a technical foundation for **mechanistic interpretability**, balancing new architectural breakthroughs with a skeptical look at current evaluation metrics.

    Sources:

    1. Tim Lawson (2025). Residual Stream Analysis with Multi-Layer SAEs. https://arxiv.org/abs/2409.04185
    2. Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher Manning, Christopher Potts (2025). AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. https://openreview.net/forum?id=XAjfjizaKs
    3. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda (2025). SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. www.neuronpedia.org/sae-bench
    4. Ikhyun Cho, Julia Hockenmaier (University of Illinois at Urbana-Champaign, 2025). Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models. https://aclanthology.org/2025.emnlp-main.1474.pdf
    5. Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He (University of Science and Technology of China; Douyin Co., Ltd., 2025). Route Sparse Autoencoder to Interpret Large Language Models. https://aclanthology.org/2025.emnlp-main.346.pdf
    6. Aashiq Muhamed, Mona Diab, Virginia Smith (Carnegie Mellon University, 2025). Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models. https://aclanthology.org/2025.findings-naacl.87.pdf
    7. George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (UC Berkeley, UCSF, February 10, 2026). Falsifying Sparse Autoencoder Reasoning Features in Language Models. https://arxiv.org/pdf/2601.05679
    8. Anonymous authors (under review). Sparse But Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders. https://openreview.net/pdf/035a5937c6a536c67b5999aa43e53dd3800ba3a4.pdf
    9. George Ma, Samuel Pfrommer, Somayeh Sojoudi (University of California, Berkeley, 2025). Revising and Falsifying Sparse Autoencoder Feature Explanations. https://openreview.net/pdf?id=OJAW2mHVND
    10. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu (OpenAI, 2025). Scaling and Evaluating Sparse Autoencoders. https://proceedings.iclr.cc/paper_files/paper/2025/file/42ef3308c230942d223c411adf182c88-Paper-Conference.pdf
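    To make the TopK idea concrete, here is a minimal PyTorch sketch of a TopK sparse autoencoder. It illustrates the general mechanism discussed in the episode rather than code from any of the papers above; the model width, dictionary size, and `k` are placeholder values.

    ```python
    import torch
    import torch.nn as nn

    class TopKSAE(nn.Module):
        """Minimal TopK sparse autoencoder sketch (illustrative, not from the papers)."""

        def __init__(self, d_model: int, d_latent: int, k: int):
            super().__init__()
            self.k = k
            self.encoder = nn.Linear(d_model, d_latent)  # project into an overcomplete dictionary
            self.decoder = nn.Linear(d_latent, d_model)  # reconstruct the original activation

        def forward(self, x: torch.Tensor):
            z = self.encoder(x)
            # TopK sparsity: keep the k largest pre-activations, zero the rest.
            topk = torch.topk(z, self.k, dim=-1)
            z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
            return self.decoder(z_sparse), z_sparse

    # Placeholder sizes: a 768-wide residual stream, a 16x overcomplete
    # dictionary, and 32 active latents per input.
    sae = TopKSAE(d_model=768, d_latent=768 * 16, k=32)
    x = torch.randn(4, 768)            # a batch of activation vectors
    x_hat, z = sae(x)
    loss = ((x - x_hat) ** 2).mean()   # plain MSE reconstruction loss
    ```

    The appeal over ReLU SAEs is that the sparsity level (L0) is pinned at exactly `k` by construction, so no L1 penalty needs tuning to trade reconstruction quality against sparsity.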

    13 min.

Ratings & Reviews

5 out of 5
2 ratings

About this podcast

The transformer architecture revolutionized the world of neural networks and served as a springboard for what we know today as modern artificial intelligence. This podcast reviews state-of-the-art research papers, starting from the transformer and moving forward.