This paper argues that thinking language models (LLMs that reason step-by-step) do not acquire entirely new capabilities during post-training; instead, they learn when to deploy reasoning mechanisms already latent in their base counterparts. The authors use an unsupervised clustering methodology based on Sparse Autoencoders (SAEs) to derive an interpretable taxonomy of distinct reasoning behaviors, such as numeric computation and planning next steps. They then implement a hybrid model in which the base model generates text while steering vectors, derived from the thinking model's activation patterns, trigger specific reasoning behaviors at selected positions. This hybrid approach recovers up to 91% of the performance gap between base and thinking models on reasoning benchmarks such as MATH500 while steering only a small fraction of tokens, supporting the idea that the primary benefit of reasoning-focused post-training is teaching the model when to deploy these pre-existing mechanisms.
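To make the steering-vector idea concrete, below is a minimal sketch of activation steering with a Hugging Face causal LM. It builds a steering vector as the difference of mean residual-stream activations between contrastive prompts (a simpler stand-in for the paper's SAE-derived behavior directions) and adds it to one layer's output during generation. The model name (`gpt2`), layer index, steering strength, and example prompts are illustrative assumptions, not values from the paper, and this sketch steers every token rather than the small fraction the authors target.

```python
# Sketch of activation steering via a mean-difference contrast vector.
# All constants below are assumptions for illustration, not paper values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in for the base model (assumption)
LAYER_IDX = 6         # residual-stream layer to steer (assumption)
ALPHA = 4.0           # steering strength (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(texts, layer_idx):
    """Average hidden state at `layer_idx` over all tokens of `texts`."""
    acts = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[layer_idx] has shape (1, seq_len, hidden_dim)
        acts.append(out.hidden_states[layer_idx].mean(dim=1))
    return torch.cat(acts).mean(dim=0)

# Placeholder contrast prompts for one reasoning behavior
# (numeric computation) versus neutral text.
with_behavior = ["Let me compute 17 * 24 step by step.",
                 "First, add 48 and 53, then divide the sum by 2."]
without_behavior = ["The weather today is pleasant and mild.",
                    "She walked her dog through the quiet park."]

steer_vec = (mean_activation(with_behavior, LAYER_IDX)
             - mean_activation(without_behavior, LAYER_IDX))

def steering_hook(module, inputs, output):
    """Add the steering vector to this block's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# GPT-2 exposes its transformer blocks as model.transformer.h;
# other architectures name this module differently.
handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Question: what is 12 * 13? Answer:"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**ids, max_new_tokens=30)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # stop steering
```

Steering via a forward hook leaves the base model's weights untouched, which mirrors the paper's framing: the reasoning mechanism is already present, and the added vector only shifts when it gets deployed.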