This episode explores CL-bench, a benchmark designed to test whether language models can actually learn task-specific knowledge from long, messy context and then reason with it, rather than merely retrieving facts or mimicking examples. It explains the distinction between long-context understanding, in-context learning, and the stronger notion of context learning, using examples like legal codes, product manuals, and experimental notebooks to show what real-world adaptation demands. The discussion highlights how the benchmark’s 500 contexts, 1,899 tasks, and dense binary verification rubrics are built to stress models on rule-following, procedural reasoning, and inferring governing relationships from data. Listeners would find it interesting because it gets at a central question in modern AI: whether bigger context windows actually make systems more capable, or just better at holding more text without truly learning from it.

Sources:

1. CL-bench: A Benchmark for Context Learning — Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao, 2026. http://arxiv.org/abs/2602.03587
2. Language Models are Few-Shot Learners — Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al., 2020. https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners
3. MetaICL: Learning to Learn In Context — Sewon Min, Mike Lewis, Luke Zettlemoyer, Hannaneh Hajishirzi, 2021. https://scholar.google.com/scholar?q=MetaICL:+Learning+to+Learn+In+Context
4. Transformers learn in-context by gradient descent — Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov, 2022. https://scholar.google.com/scholar?q=Transformers+learn+in-context+by+gradient+descent
5. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023. https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts
6. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2023. https://scholar.google.com/scholar?q=LongBench:+A+Bilingual,+Multitask+Benchmark+for+Long+Context+Understanding
7. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack — Yurii Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev, 2024. https://scholar.google.com/scholar?q=BABILong:+Testing+the+Limits+of+LLMs+with+Long+Context+Reasoning-in-a-Haystack
8. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks — Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2024. https://scholar.google.com/scholar?q=LongBench+v2:+Towards+Deeper+Understanding+and+Reasoning+on+Realistic+Long-context+Multitasks
9. NoLiMa: Long-Context Evaluation Beyond Literal Matching — Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, David Seunghyun Yoon, Hinrich Schütze, 2025. https://scholar.google.com/scholar?q=NoLiMa:+Long-Context+Evaluation+Beyond+Literal+Matching
10. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion — Zhan Ling et al., 2025. https://scholar.google.com/scholar?q=LongReason:+A+Synthetic+Long-Context+Reasoning+Benchmark+via+Context+Expansion
11. DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities — Tianyi Zhuang et al., 2025. https://scholar.google.com/scholar?q=DocPuzzle:+A+Process-Aware+Benchmark+for+Evaluating+Realistic+Long-Context+Reasoning+Capabilities
12. In-Context Learning Creates Task Vectors — Roee Hendel, Mor Geva, Amir Globerson, 2023. https://scholar.google.com/scholar?q=In-Context+Learning+Creates+Task+Vectors
13. In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering — Sheng Liu, Haotian Ye, Lei Xing, James Zou, 2024. https://scholar.google.com/scholar?q=In-context+Vectors:+Making+In+Context+Learning+More+Effective+and+Controllable+Through+Latent+Space+Steering
14. Task Vectors in In-Context Learning: Emergence, Formation, and Benefit — Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak, 2025. https://scholar.google.com/scholar?q=Task+Vectors+in+In-Context+Learning:+Emergence,+Formation,+and+Benefit
15. Learn to Memorize: Scalable Continual Learning in Semiparametric Models with Mixture-of-Neighbors Induction Memory — Guangyue Peng, Tao Ge, Wen Luo, Wei Li, Houfeng Wang, 2025. https://scholar.google.com/scholar?q=Learn+to+Memorize:+Scalable+Continual+Learning+in+Semiparametric+Models+with+Mixture-of-Neighbors+Induction+Memory
16. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
17. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3
18. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
19. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
20. AI Post Transformers: Training LLMs for Divide-and-Conquer Reasoning — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-04-training-llms-for-divide-and-conquer-rea-ea6e22.mp3
21. AI Post Transformers: Inverse IFEval: Unlearning LLM Cognitive Inertia — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/inverse-ifeval-unlearning-llm-cognitive-inertia/

Interactive Visualization: Can Models Learn from Long Context?
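
To make the "dense binary verification rubrics" concrete: each task is checked against many pass/fail criteria rather than a single answer match. The sketch below is a minimal illustration of one plausible scoring scheme; the names (Rubric, Task, score_task), the example rubrics, and the fraction-of-rubrics aggregation are all assumptions for illustration, not the paper's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """One binary check: a model response either satisfies it or not."""
    description: str  # e.g. "cites the governing clause from the manual"
    passed: bool = False

@dataclass
class Task:
    """A question grounded in one long context, verified by many rubrics."""
    context_id: str
    question: str
    rubrics: list[Rubric] = field(default_factory=list)

def score_task(task: Task) -> float:
    """Dense binary verification: every rubric is pass/fail; here the task
    score is the fraction of rubrics satisfied (one plausible aggregation)."""
    if not task.rubrics:
        raise ValueError("a task needs at least one rubric")
    return sum(r.passed for r in task.rubrics) / len(task.rubrics)

# Hypothetical example: a rule-following task over a product manual.
task = Task(
    context_id="manual-042",
    question="Which warranty clause applies to water damage?",
    rubrics=[
        Rubric("Identifies the correct clause number", passed=True),
        Rubric("Applies the exclusion rule stated in the context", passed=True),
        Rubric("Uses no facts absent from the context", passed=False),
    ],
)
print(f"task score: {score_task(task):.2f}")  # -> 0.67
```

The point of rubrics like the third one is what separates context learning from retrieval: a response can quote the right clause yet still fail checks that require reasoning only from rules given in the context.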