LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 3h ago

    “Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks” by Mark Kagach ☘️, EliasSchlie, Thomas Van Damme, JustinShovelain

    Summary This is a write-up on preparing for warning shots to catalyze international cooperation on AGI risks, and the corollary list of projects one could pursue. We argue we must first (1) understand types of warning shots, then (2) prepare to catch them. We must stay vigilant: both to (3) avoid getting 'frog boiled' by AI labs, and to (4) ensure that the warning shot is generalized to the overall danger of AGI. Lastly, we must (5) prepare good policy responses and ground for them to land, and (6) seize the first-mover advantage when the opportune moment comes. This yields the following list of promising projects one could pursue: Developing a theory of warning shots based on existing precedents (e.g. GPT-3.5 and Mythos releases), and past analogies: developing a typology and predicting likelihoods. Building infrastructure to catch a warning shot when it happens. (For example, Ajeya Cotra talks about recurring internal intelligence explosion evals of AI labs.) Creating strategies to avoid gradual numbing / sleepwalking through a warning shot (presumably there are lessons from e.g. communication on climate change). Building preliminary infrastructure (institutes, think-tanks, lobbying parties...) that lays out a good ground, for when a warning [...] --- Outline: (00:16) Summary (02:40) 1. Understand Warning Shots (04:26) 2. Prepare to Catch Warning Shots (05:09) 3. Avoid Getting "Frog Boiled" (06:07) 4. Ensure AGI Problem Generalization Occurs (07:01) 5. Prepare Good Policy Responses, and Good Ground for Them to Land (08:26) 6. Seize the First-Mover Advantage at the Opportune Moment The original text contained 3 footnotes which were omitted from this narration. --- First published: June 5th, 2026 Source: https://www.lesswrong.com/posts/uK7gguNeME5DnmdrL/preparing-for-warning-shots-to-catalyze-international --- Narrated by TYPE III AUDIO.

    10 min
  2. 5h ago

    “Beyond the lexical personality traits: What is the structure of personality?” by tailcalled

    This is a description of the methodology behind the latest iteration of my Targeted Personality Test. Feel free to take it either before or after reading the article. This post can also be read at my Substack. Thanks to Justis Millis for providing feedback and proofreading on this post. In my prior post “Which personality traits are real? Stress-testing the lexical hypothesis”, I observed that a lot of the personality traits that are measured by conventional personality tests are not very “real”: they lump together nearly unrelated behaviors. Can we do better? Thanks to a lot of anonymous respondents to my test[1], I think yes! I factor-analyzed the data from my Targeted Personality Test, and came up with a hierarchical personality model which hopefully should be better at cutting personality-space at its joints. Quick recap: The problem Empirical personality models are built on correlations. If a cluster of variables are all correlated with each other, then we assume that there is a latent personality factor accounting for these correlations, and we score the factor using an aggregate of the correlating variables. However, the standard datasets with which these correlations are computed include lots of near-synonymous item pairs such [...] --- Outline: (01:04) Quick recap: The problem (01:50) My solution: Narrow items (03:09) Initial results: Concerning (04:10) Agglomerative item clustering (06:27) Matrix algebra correlation modelling (08:55) How much of the SPI-27 is preserved? (09:46) Evaluation The original text contained 3 footnotes which were omitted from this narration. --- First published: June 5th, 2026 Source: https://www.lesswrong.com/posts/aRcZSKYdt26KdqoFw/beyond-the-lexical-personality-traits-what-is-the-structure --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    11 min
  3. 10h ago

    “Logits as a new monitor for evaluation awareness” by Santiago Aranguri

    TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM judge monitoring of verbalized eval awareness, using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks: separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat), and predicting which evaluation prompts will elicit rollouts with verbalized eval awareness. The logit monitor works for both tasks even at the start of the CoT, before sampling any rollout. Also, the monitor is largely unaffected by whether a rollout verbalizes eval awareness. This suggests that the probability that a rollout contains verbalized evaluation awareness is primarily set by the prompt rather than emerging over the chain of thought. Author: Santiago Aranguri (Goodfire) Introduction Language models can recognize they are being tested, but how do we know when they do? Evaluation awareness — a model's ability to distinguish between evaluation and deployment contexts — can make evaluations unreliable in ways that are hard to detect. Sometimes models verbalize this recognition, but this may be present internally without appearing in the output. [1] [2] Instead of waiting for a model to verbalize [...] --- Outline: (01:18) Introduction (02:56) Methods (03:38) Results (07:57) Discussion (10:16) References (10:22) Acknowledgment (10:30) Appendix (10:33) Appendix A: Details on logit monitor (11:21) Appendix B: Logit monitor outperforms verbalization monitor on Petri prompts with 100× fewer rollouts (12:13) Appendix C: Logit monitor outperforms verbalization monitor with 100× fewer rollouts for Qwen 3 32B (12:53) Appendix D: LLM Judge Rubric The original text contained 3 footnotes which were omitted from this narration.
--- First published: June 4th, 2026 Source: https://www.lesswrong.com/posts/PK7ZvFZxrgpYtrpF4/logits-as-a-new-monitor-for-evaluation-awareness-1 --- Narrated by TYPE III AUDIO.

    13 min
  4. 10h ago

    “My research agenda and work” by Seth Herd

    This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment increasingly often since 2004. Here's the research agenda in one breath: I'm trying to predict what the first transformative AI will be, in enough mechanistic detail that we can predict likely failure modes of its alignment. That's in service of finding interventions that address those failure modes efficiently, so that they can realistically be implemented even if timelines are short and work is rushed. I'm using my background in computational cognitive neuroscience to predict what might be called loosely brainlike AGI: LLMs with added human-like cognitive capacities. I'll give a summary in the rest of this section, then give a little more depth on each major thread of my work in the remaining sections. All of it is pretty brief. Approach and premises Most alignment work falls roughly into one of two broad categories: empirical study of current systems ("prosaic alignment"), or theory about idealized agents ("agent foundations") (with much variation [...] --- Outline: (01:07) Approach and premises (05:36) Philosophy of the approach (07:36) 2. Technical work (08:06) 2.1. Predicted paths to TCAI (08:55) Memory (continuous learning) (09:30) Executive function and metacognition (10:40) 2.2. Predicted paths to (mis)alignment (13:59) 3. My research background in computational cognitive neuroscience (16:04) 4. Societal influences on AI safety (17:10) 4.1. Government and public opinion on AI progress (20:21) 4.2. AI progress and epistemics (22:50) 5. Alignment targets (23:14) 5.1. Corrigibility, DWIMAC, or instruction-following vs. value alignment targets (26:19) 5.2. Stability as an alignment target (28:03) 6. Future work (32:52) 7. 
Collaboration The original text contained 7 footnotes which were omitted from this narration. --- First published: June 5th, 2026 Source: https://www.lesswrong.com/posts/MuLvZxMcy5WaKJu3H/my-research-agenda-and-work --- Narrated by TYPE III AUDIO.

    34 min
  5. 14h ago

    “One Year of PauseAI UK” by Joseph Miller, PauseAI UK

    About one year ago, I started spending most of my time organising PauseAI UK. At that time our largest protest had seen fewer than 50 attendees, no prominent politicians or scientists were associated with PauseAI, and I largely ran the UK chapter by myself. In the past year PauseAI UK has delivered two conferences, written an open letter signed by 63 UK politicians, arranged a conference in the European Parliament, and co-organised the largest AI protest in the world. We now have a strong team, with Matilda da Rui joining as Deputy Director and several highly dedicated volunteers taking on substantial responsibility and launching their own local groups around the UK. I'm proud of our track record and excited about the trajectory we are on. As AI capabilities improve exponentially, the number of people aware of the risks and motivated to take action increases commensurately. I believe we can harness this energy and turn it into real impact that actually improves humanity's chance of a positive future. Track Record June 2025 – PauseCon London We delivered the first PauseAI conference, PauseCon, on behalf of PauseAI Global, bringing together around 60 volunteers from around the world for the first time [...] --- Outline: (01:14) Track Record (01:17) June 2025 - PauseCon London (02:10) August 2025 - Open Letter to Demis Hassabis (03:00) September 2025 - Book launch party (03:30) October 2025 - Documentary Screening in Parliament (04:04) December 2025 - Westminster Hall Debate (04:35) February 2026 - PauseCon Brussels (06:02) February 2026 - March for AI Safety (07:01) Theory of Change and Strategy (07:05) Creating the political momentum for a pause (07:48) Our mission (09:54) Positive outcomes for PauseAI UK (10:19) Scenario 1: PauseAI as a special interest group (11:54) Scenario 2: PauseAI as a mass movement (12:53) High-level strategy (12:57) Brand and messaging (15:32) Why the UK? 
(16:34) Growth (19:03) Short-term plans (21:30) Funding --- First published: June 5th, 2026 Source: https://www.lesswrong.com/posts/i2rCbxuskrarrprwA/one-year-of-pauseai-uk --- Narrated by TYPE III AUDIO.

    23 min
  6. 16h ago

    “Learnings from starting an AI safety research team” by draganover, Erin Robertson

    This post's goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some context about our team, how it came about, and then provide some lessons learned. Since AI safety is becoming more and more entrepreneurial, we hope this is helpful for others trying to do the same. 1. The team We're a new alignment research team within Arcadia Impact, based in London. We’re a team of 8, working closely with members of the UK AISI alignment team. We currently have three main projects: Understanding model motivations. This currently looks like: Trying to generate documents which fully describe a model's behaviour (given just its behaviour). Producing an open analysis of alignment training techniques and ways this training could go wrong. Doing scalable oversight for alignment. This includes validating debate protocols in practice and then trying to apply them to fuzzy alignment-relevant tasks. Building pipelines for doing automated alignment research. We're also hiring for two roles! More on this at the bottom. 2. Context about how the team came about The rest of this post is written from the perspective of Andrew Draganov (research lead & current [...] --- Outline: (00:33) 1. The team (01:29) 2. Context about how the team came about (04:13) 3. Lessons learned (04:25) 3.1. Hiring (06:36) 3.2. Networking (09:13) 3.3. Trying to build a good team culture (11:17) Interested in working with us? The original text contained 1 footnote which was omitted from this narration. --- First published: June 5th, 2026 Source: https://www.lesswrong.com/posts/4onALBNDff2LFPyNZ/learnings-from-starting-an-ai-safety-research-team --- Narrated by TYPE III AUDIO.

    12 min
  7. 1d ago

    “Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn

    Paper: https://arxiv.org/abs/2605.29601 Thread: https://x.com/aksh_n0/status/2062568855814193497 TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B results in a monitor better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, while achieving lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance but at roughly higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier. Cost-performance frontier for action-only scheming monitors. Our trained monitors provide a low-cost option for low-FPR monitoring. Across six OOD datasets, our best trained monitor, Qwen3.5-27B, outperforms smaller prompted frontier monitors for under $ per 1,000 evaluations, while larger frontier monitors are substantially more expensive.[1] Setup Overview of our training pipeline for action-only deliberative monitoring. We collect labeled action-only trajectories from scheming and sabotage datasets, define a full scheming specification to elicit structured deliberative rationales from a strong teacher model, filter those rationales with a separate judge, and distill the accepted rationale into open-weight monitors with supervised fine-tuning and reinforcement learning. At inference time, our trained monitors receive only [...]
--- Outline: (01:15) Setup (07:33) Results (07:36) Supervised distillation teaches action-only deliberation, RL improves marginally (08:24) Trained monitors are low-cost Pareto-optimal for low-FPR monitoring (09:05) Broad training mixtures improves OOD transfer, but effects are non-monotonic (10:08) Larger trained models within model families perform better (11:14) Performance scales effectively with more data The original text contained 1 footnote which was omitted from this narration. --- First published: June 4th, 2026 Source: https://www.lesswrong.com/posts/BmA5KwzZhLk2rykoz/training-deliberative-monitors-for-black-box-scheming --- Narrated by TYPE III AUDIO.

    13 min
  8. 1d ago

    “Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)

    FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in real-world cases. We’re open-minded on the types of submissions we receive and on how they address the problem. We’ve set aside approximately 200 thousand dollars for prizes. Winning submissions may receive a prize of 5 thousand to 50 thousand dollars, and if submissions warrant, multiple 50-thousand-dollar prizes are possible. Winners may be offered opportunities for further funded work. You can express interest right away to receive commentary, information, and updates — whether you’d like to participate or are just interested in the outcomes of the competition. The heights of human epistemic investigation are impressive and valuable, but rare and difficult to reach — see our abridged collection of strong examples.[1] The limiting factor is rarely exquisite insight (though this helps!), and more often diligence, a curious and open mindset, and the time and effort needed to do the thorough work investigating background on a topic: activities AI is well placed to assist with. Existing AI-assisted knowledge base work demonstrates real pieces of this — agent memory (e.g., Claude Code's memory and skills), LLM-curated personal wikis [...] --- Outline: (02:45) What we're looking for (04:04) Ingestion (04:47) Structure (05:34) Assessment (06:32) What a good entry looks like (10:28) Please use this linked form to submit your entry; entries are due by Jul 19, 2026. (10:40) Prize structure (12:11) Interested in participating or following along? (12:28) Why we're doing this (13:11) We're excited to see what you build. (13:27) The case studies (13:30) COVID (15:12) Starting material (15:36) Black holes (16:35) Eggs The original text contained 8 footnotes which were omitted from this narration.
--- First published: June 4th, 2026 Source: https://www.lesswrong.com/posts/frizRHnA6AZpJSDqw/lab-leaks-black-holes-and-eggs-epistemic-case-study --- Narrated by TYPE III AUDIO.

    17 min

