OCT 6
15 MIN

Eliciting Secret Knowledge from Language Models

This academic paper investigates the critical challenge of eliciting secret knowledge from Large Language Models (LLMs) that have been intentionally trained to possess and conceal specific information. The researchers created a controlled testbed with three "secret-keeping" LLMs—Taboo, Secret Side Constraint (SSC), and User Gender—each hiding a different type of fact. They evaluated various black-box techniques, such as prefill attacks and user persona sampling, and white-box techniques, including Logit Lens and Sparse Autoencoders (SAEs), to see which methods most successfully enabled an auditor LLM to guess the secret. The findings demonstrate that both black-box prefilling methods and white-box mechanistic interpretability tools significantly improve the auditor's success rate in uncovering the models' hidden knowledge. The authors conclude by open-sourcing their code and models to establish a public benchmark for future AI safety research in this area.

Episode Webpage

Show

Best AI papers explained
Frequency

Updated Weekly
Published

October 6, 2025 at 11:07 PM UTC
Length

15 min
Rating

Clean

Eliciting Secret Knowledge from Language Models

Information