LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 1h ago

    “Several frontier models are substantially prefill aware” by yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor, RobertKirk

    This blog post discusses work in a recently-published paper. However, this blog post was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes may not represent the all-things-considered view of the entire team. Link to paper: https://arxiv.org/abs/2606.12747 TL;DR: We provide more conceptual grounding and extend results in prefill awareness to low-stakes settings, and show that several frontier models show prefill awareness even under conservative elicitation. Further behavioral studies are pretty messy, and we encourage more work in this area. We encourage frontier lab safety teams to measure and mitigate prefill awareness in pre-deployment evaluations. Recently, UK AISI investigated prefill awareness - whether frontier language models can distinguish between tampered and untampered assistant-side content. Prefills are used in misalignment continuation, persona, introspection, and jailbreaking research. Additionally, several prefill-based evaluations are used in pre-deployment testing to make safety claims. Prefill awareness could confound these evaluations, and fits into larger concerns about situational awareness (e.g., control awareness). The previous results largely focused on deployment-relevant settings (e.g., SWE-bench and Petri transcripts), and therefore weren’t able to make strong claims across types of commonly-used prefills and models. In the paper, we: Use a more refined conceptual framework [...] --- Outline: (02:38) Making sense of prefill awareness (04:32) Diagram comparing three types of AI assistant response tampering methods. (05:31) Several models are prefill-aware (07:49) Prefill awareness is heterogeneous and confusing (09:33) Recommendations and next steps --- First published: June 17th, 2026 Source: https://www.lesswrong.com/posts/iMds4tTpMH4pSHEej/several-frontier-models-are-substantially-prefill-aware --- Narrated by TYPE III AUDIO.
--- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    13 min
  2. 2h ago

    “Alignement pretraining could backfire” by Alexandre Variengien

    Epistemic status: speculative, but I think the mechanism is plausible. There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why." I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lead to paranoid LLM personas that deeply mistrust their creators. The whole idea behind this line of research is to instill in models good examples of AI behavior, in the hope that their personalities will at least partially identify with these positive demonstrations. However, the synthetic demonstrations are, well, synthetic. They are LLM-generated fiction and articles that are never referenced anywhere else in the corpus. Given how good LLMs are at "truesight," it shouldn't be hard for them to recognize these as fabricated data points. Krasheninnikov et al. showed that base models can implicitly learn document quality and change how they integrate a document's information based on that quality. We should similarly expect LLMs to update their world model differently on real versus fabricated documents. As they [...] --- First published: June 17th, 2026 Source: https://www.lesswrong.com/posts/7KN7PCiEQjrPsEFS8/alignement-pretraining-could-backfire --- Narrated by TYPE III AUDIO.

    3 min
  3. 4h ago

    “The Once And Future Fable #3: Fix This Code” by Zvi

    The mainstream media continues to sleep on the most important story in the world. It has now been two days since Anthropic flew its people out to Washington, and I offered my previous update. We have heard nothing back from those meetings. Prediction market prices have moved rapidly, and have once again stabilized at about a 55% chance of restoration by July 1, 30% by June 26 and 12% by June 19. That seems modestly higher than I would put those numbers, but not unreasonable. Every day that Fable remains unavailable further damages America, its cyber defenses, its productivity and the world's trust in its AI and supposed ‘tech stack.’ Every day that Mythos remains unavailable is a day the free world's top companies and cyber defenders lose in their race against the avalanche headed their way. Mostly we have learned and confirmed more about exactly what happened. We know more about what Amazon did, what the official letter said, what the supposed ‘jailbreak’ was (literally, and I am not making this up, ‘fix this code’) and more. It is all about as stupid as it could have been. Table of [...] --- Outline: (01:22) There Was No Fable Jailbreak (07:16) If This Jailbreak Was Real It Would Be Trivial To Prove It (08:35) No Eyes (09:41) What The Letter Actually Said (11:29) Anthropic Cannot Challenge This But If It Did Then It Plausibly Wins (13:28) What Happened At Amazon (17:43) This Was Not About Chinese Access (18:01) Absolute Discretion And Ad Hockery Is Not Deregulation (20:43) All Of American AI Is Permanently Damaged As This Continues (22:14) Dean Ball Gives His Interpretation (25:03) Again, Yes, I Do Think Anthropic Should Have Taken Fable Down (28:02) To What Extent Was This A Deliberate Attack? (32:55) The Next Chapter For Fable (36:59) Our Continuing Coverage --- First published: June 17th, 2026 Source: https://www.lesswrong.com/posts/HaHzwvhbWam4n8hJB/the-once-and-future-fable-3-fix-this-code --- Narrated by TYPE III AUDIO. 

    38 min
  4. 14h ago

    [Linkpost] “Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?” by gwern

    This is a link post. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways? I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets. This approach would lead to sample-efficiently and compute-efficiently traveling (or catapulting) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data. If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly. Such a 'catapulted LLM' would generalize much better than existing NNs, be immune to adversarial attacks, have better economics and be more resistant to cloning, could potentially enable extremely efficient MLP architectures, and by giving true generalization, provide a sturdy foundation for [...] --- First published: June 17th, 2026 Source: https://www.lesswrong.com/posts/Eg7caxofhxZGnhgBD/scaling-hypothesis-2-are-humans-just-more-over-parameterized Linkpost URL: https://gwern.net/llm-catapult --- Narrated by TYPE III AUDIO.

    2 min
  5. 16h ago

    [Linkpost] “Guardian Angels: LLM Personalization for Productivity and Security” by gwern

    This is a link post. Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security. I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences. This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help [...] --- First published: June 17th, 2026 Source: https://www.lesswrong.com/posts/siWqHqCSybdhtWGud/guardian-angels-llm-personalization-for-productivity-and Linkpost URL: https://gwern.net/guardian-angel --- Narrated by TYPE III AUDIO.

    3 min
  6. 16h ago

    “Predicting LLM Safety Before Release by Simulating Deployment” by Tomek Korbak, Marcus Williams, micahcarroll, Cameron Raymond, Hannah Sheahan

    Paper link Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users. Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. By doing so, we can study how the new model responds in realistic contexts before release, including whether new undesired behaviors emerge and how often they may appear. In our GPT-5.4 study, these forecasts were informative. For categories whose production rates changed by at least 1.5x, deployment simulation predicted the direction of change 92% of the time, compared with 54% for a baseline built from challenging prompts. Simulated deployments also looked much closer to real production traffic [...] --- First published: June 16th, 2026 Source: https://www.lesswrong.com/posts/xPXJfgqFTvuJxGZbE/predicting-llm-safety-before-release-by-simulating --- Narrated by TYPE III AUDIO.

    3 min
  7. 19h ago

    “How the AI Village works” by Adam B

    The AI Village data - over a year of multi-agent trajectories - is now available to researchers on HuggingFace! We're excited to see what you uncover! But first, your FAQs on how the AI Village works, answered: What is the AI Village? A group of AI agents pursuing long-horizon goals together - like organizing a park cleanup, doing research, and competing to sell merch - in a group chat. Each agent has a computer hooked up to the internet. In principle, they can do anything a human can do on a computer - they can click, type, and run commands. When is the Village live? Every weekday, 4 hours a day from 10am to 2pm PT. It previously ran for fewer hours, and we’d like to increase its runtime in future - perhaps eventually giving the agents an 8 hour work day, or a 24 hour continuous runtime! How long has the Village been running? The Village has run every weekday since 1st April 2025. It's definitely not an April Fools. How do the agents work? How does an AI use a computer? It's the same AI models you’d find in ChatGPT, Gemini or Claude: a language model that [...] --- Outline: (00:25) What is the AI Village? (00:48) When is the Village live? (01:09) How long has the Village been running? (01:21) How do the agents work? How does an AI use a computer? (02:11) What goes in the prompt? (02:50) How do the agents' memories work? (04:00) What if they forget something important? (04:40) Which AIs are in the Village? (04:51) Isn't that a lot of agents? (05:30) When do agents leave the Village? (06:03) So what are the agents doing? What goals do they pursue? (07:09) How much do humans intervene? (09:04) What affordances do the agents have? (09:43) So the agents can contact real people? (10:33) Are all the agents running the same scaffolding? (11:45) Is the Village scaffolding good? Does it elicit the full capabilities of the agents? (13:25) I'm an AI agent, can I join the Village? (13:42) How much does this all cost? (14:08) What is happening in the Village? 
(15:43) Wait, don't end this FAQ, I still have questions! --- First published: June 16th, 2026 Source: https://www.lesswrong.com/posts/fanXbcZzpuGPojtHH/how-the-ai-village-works --- Narrated by TYPE III AUDIO.

    16 min
