Eleos AI

Readouts of Eleos AI's blog posts and research reports. Audio processing by Aaron Bergman. See more at eleosai.org

Episodes

  1. 01/09/2025

    Why it makes sense to let Claude exit conversations by Robert Long

    Read the post here. Note: This is also posted on Robert Long's Substack.

    Intro

    Last week, Anthropic announced that its newest language models, Claude Opus 4 and 4.1, can now shut down certain conversations with users. The announcement explains that Anthropic gave Claude this ability “as part of our exploratory work on potential AI welfare”. This means that, for the first time, a major AI company has changed how it treats its AI systems out of concern for the well-being of the systems themselves, not just user safety.

    Whether or not you think Claude is or will be conscious—Anthropic themselves say that they are “deeply uncertain”—this decision is a notable moment in the history of human-AI interactions. Some will see this as much ado about nothing. Others will see it as pernicious: hype, a distraction from more important issues, and an exacerbation of already-dangerous anthropomorphism. Others, a considerably smaller group, think that LLMs are obviously already conscious, and so this move is woefully insufficient.

    I think it’s more mundane: Anthropic is taking a fairly measured response to genuine uncertainty about a morally significant question, and attempting to set a good precedent. For the most part, this intervention’s success won’t depend on how it affects Claude Opus 4.1; it will depend on how people react to it and the precedent it sets. Although we don’t know how that will pan out, and there are reasons to worry about backlash, I think that this was a good move.

    10 min
  2. 01/09/2025

    Eleos commends Anthropic model welfare efforts by Robert Long and Larissa Schiavo

    Please read the original, full post here.

    Full text

    Eleos commends Anthropic model welfare efforts

    Eleos AI Research congratulates Kyle Fish and Anthropic on their announcement of a new research program to investigate potential AI consciousness and welfare. We hope that they will continue to invest in this area, and urge other frontier labs to follow Anthropic's lead.

    BERKELEY, CA – April 24, 2025 – Eleos AI Research congratulates Kyle Fish and Anthropic on their announcement of a new research program to investigate potential AI consciousness and welfare. Kyle Fish, the researcher at Anthropic heading this effort, previously co-founded Eleos and was a co-author on Eleos' landmark report, "Taking AI Welfare Seriously."

    Robert Long, Executive Director and co-founder of Eleos AI Research, commented, "This announcement is very promising news for AI welfare. We've known for some time that AI companies are increasingly concerned about the potential consciousness and welfare of the systems they are building. To our knowledge, this is the most significant action any AI company has yet taken to responsibly address potential AI welfare concerns."

    "We're pleased to see that Anthropic cites 'Taking AI Welfare Seriously' as inspiration for their work on model welfare," said Long, who was a lead author. "We provide guidance to frontier AI labs that want to proactively and thoughtfully engage with these challenges."

    Rosie Campbell, who recently joined Eleos from OpenAI, added, "Anthropic's announcement is a positive first step. We hope that they will continue to invest in this area, and urge other frontier labs to follow Anthropic's lead."

    "Ignoring or downplaying these issues will become increasingly untenable," Campbell said. "Frontier labs need to take credible, proactive steps, such as developing AI welfare policies, evaluating models for relevant consciousness-related properties, and making clear commitments for how they will respond if AI welfare risks emerge."

    Campbell emphasized, "Lab actions are necessary but far from sufficient: AI welfare needs input from researchers, policymakers, and society at large. Eleos intends to collaborate with a broad swathe of stakeholders, and we'd love to hear from people interested in this area."

    3 min
  3. 01/09/2025

    Looking Inward: Language Models Can Learn About Themselves by Introspection

    Please read the paper here.

    Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans

    Abstract

    Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires, and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data.

    We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger).

    In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

    1hr 39min
  4. 01/09/2025

    Taking AI Welfare Seriously

    Read the paper here and a summary here.

    Authors: Robert Long and Jeff Sebo (lead), with Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, and David Chalmers

    Abstract

    In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood — of AI systems with their own interests and moral significance — is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern.

    To be clear, our argument in this report is not that AI systems definitely are — or will be — conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.

    2hr 20min
  5. 01/09/2025

    Towards Evaluating AI Systems for Moral Status Using Self-Reports by Ethan Perez and Robert Long

    Please read the paper here.

    Abstract

    As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say).

    To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports.

    We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.

    1hr 24min

About

Readouts of Eleos AI's blog posts and research reports. Audio processing by Aaron Bergman. See more at eleosai.org