This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach for interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs) that analyze individual layers, MLSAEs are trained across all layers of a transformer's residual stream, enabling the study of information flow across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for a given input, they are active at multiple layers when aggregated over many inputs, with this multi-layer activity increasing in larger models. The authors also explored the effect of "tuned-lens" transformations on latent activations, ultimately providing a new method for understanding how representations evolve within transformers.
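As a rough illustration of the idea described above, here is a minimal sketch of a sparse autoencoder that is shared across every layer of the residual stream, so a single latent can in principle activate at any depth. This is an assumption-laden sketch (a plain ReLU encoder/decoder with an L1 sparsity penalty), not the authors' exact architecture or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLayerSAE(nn.Module):
    """One SAE trained on residual-stream activations from all layers
    (hypothetical sketch; the paper's sparsity mechanism may differ)."""

    def __init__(self, d_model: int, n_latents: int, l1_coef: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.l1_coef = l1_coef

    def forward(self, resid: torch.Tensor):
        # resid: (batch, d_model) activations sampled from any layer.
        latents = F.relu(self.encoder(resid))  # sparse, non-negative codes
        recon = self.decoder(latents)
        return recon, latents

    def loss(self, resid: torch.Tensor) -> torch.Tensor:
        recon, latents = self(resid)
        mse = F.mse_loss(recon, resid)
        sparsity = self.l1_coef * latents.abs().sum(dim=-1).mean()
        return mse + sparsity


def train_step(sae: MultiLayerSAE, optimizer: torch.optim.Optimizer,
               resid_by_layer: torch.Tensor) -> float:
    """One update on activations stacked across layers.

    resid_by_layer: (n_layers, batch, d_model) residual-stream activations.
    Flattening over the layer axis is what makes the latent dictionary
    shared across layers, enabling the cross-layer analysis described above.
    """
    flat = resid_by_layer.reshape(-1, resid_by_layer.shape[-1])
    optimizer.zero_grad()
    loss = sae.loss(flat)
    loss.backward()
    optimizer.step()
    return loss.item()
```

With a trained model of this form, one could then count, per latent, at which layers it activates for a single input versus across a large dataset, which is the kind of single-layer-versus-multi-layer comparison the episode describes.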
Information
- Podcast
- Frequency: Monthly
- Published: August 9, 2025 at 07:03 UTC
- Duration: 14 min.
- Restrictions: No explicit language
