
Debugging misaligned completions with sparse-autoencoder latent attribution
This paper outlines a new method for investigating the sources of misaligned behavior in language models using interpretability tools such as sparse autoencoders (SAEs). Recognizing that simply observing activation differences between models is insufficient to establish causality, the authors introduce a latent-attribution technique that approximates which internal features are causally linked to specific outputs. The method measures the difference in attribution (Δ-attribution) between desired and undesired completions from a single model, and the resulting causal candidates are then validated through activation steering. The approach was tested in two scenarios, emergent misalignment and undesirable validation, where latents selected by Δ-attribution proved far more effective at controlling the unwanted behavior than latents selected by activation differences alone. Ultimately, the investigation revealed that a single "provocative" feature in the model's representations acted as a powerful driver of both distinct types of misalignment, suggesting a convergence in the mechanisms underlying problematic outputs.
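As a rough illustration of the attribution step described above, the sketch below approximates each SAE latent's contribution to a completion as activation × gradient of that completion's logit, then ranks latents by the difference between the undesired and desired completions. Everything here is a hypothetical toy stand-in (the dimensions, the `enc`/`dec`/`unembed` modules, and the token ids are invented for illustration), not the paper's actual code, which operates on a full language model's residual stream.

```python
# Minimal sketch of Δ-attribution over SAE latents (toy setup, hypothetical names).
import torch

torch.manual_seed(0)
d_model, d_sae, vocab = 64, 256, 100

# Hypothetical stand-ins for a frozen unembedding head and a trained SAE.
unembed = torch.nn.Linear(d_model, vocab, bias=False)
enc = torch.nn.Linear(d_model, d_sae)   # SAE encoder
dec = torch.nn.Linear(d_sae, d_model)   # SAE decoder

resid = torch.randn(d_model)            # residual stream at the completion position

# Encode into sparse latents; keep a leaf tensor so we can read its gradient.
latents = torch.relu(enc(resid)).detach().requires_grad_(True)
logits = unembed(dec(latents))

desired_tok, undesired_tok = 3, 7       # hypothetical token ids for the two completions

# Attribution of a latent toward a completion ≈ latent activation * d(logit)/d(latent).
grad_desired = torch.autograd.grad(logits[desired_tok], latents, retain_graph=True)[0]
grad_undesired = torch.autograd.grad(logits[undesired_tok], latents)[0]

attr_desired = latents.detach() * grad_desired
attr_undesired = latents.detach() * grad_undesired

# Δ-attribution: latents that push toward the undesired completion and away from
# the desired one get the most positive scores and become steering candidates.
delta_attr = attr_undesired - attr_desired
top = torch.topk(delta_attr, k=5)
print("candidate latents:", top.indices.tolist())
```

In the paper's workflow, the top-ranked candidates would then be checked causally, for example by steering the corresponding decoder directions up or down and observing whether the misaligned completions change.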
Information
- Show
- Frequency: Weekly
- Published: 16:57 UTC, December 2, 2025
- Duration: 30 minutes
- Rating: Clean