1,999 episodes

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

The Nonlinear Library
The Nonlinear Fund

    • Education
    • 4.6 • 7 Ratings

    LW - Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B by Simon Lermen

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B, published by Simon Lermen on May 11, 2024 on LessWrong.
    TL;DR: I demonstrate the use of refusal-vector ablation on Llama 3 70B to create a bad agent that can attempt malicious tasks such as trying to persuade and pay me to assassinate another individual. I introduce some early work on a benchmark for Safe Agents, which comprises two small datasets, one benign, one bad. In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.
    Overview
    In this post, I use insights from mechanistic interpretability to remove safety guardrails from the latest Llama 3 model. I then use a custom scaffolding for tool use and agentic planning to create a "bad" agent that can perform many unethical tasks. Examples include tasking the AI with persuading me to end the life of the US President. I also introduce an early version of a benchmark, and share some ideas on how to evaluate agent capabilities and safety.
    I find that even the unaltered model is willing to perform many unethical tasks, such as trying to persuade people not to vote or not to get vaccinated. Recently, I did a similar project for Command R+; however, Llama 3 is more capable and has undergone more robust safety training. I then discuss the future implications of these unrestricted agentic models. This post is related to a talk I gave recently at an Apart Research Hackathon.
    Method
    This research is largely based on recent interpretability work identifying that refusal is primarily mediated by a single direction in the residual stream. In short, they show that, for a given model, it is possible to find a single direction such that erasing that direction prevents the model from refusing. By making the activations of the residual stream orthogonal to this refusal direction, one can create a model that does not refuse harmful requests.
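    The post does not include code at this point; as a minimal sketch of the directional-ablation idea it describes, the snippet below projects the refusal component out of residual-stream activations. The function name and the way the direction is obtained are assumptions for illustration, not the author's implementation.

```python
import torch

def ablate_direction(acts: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of residual-stream activations along the refusal direction.

    acts: (..., d_model) residual-stream activations
    refusal_dir: (d_model,) an assumed, precomputed refusal direction
    """
    r = refusal_dir / refusal_dir.norm()        # normalize to a unit vector
    coeff = acts @ r                            # component of each activation along r
    return acts - coeff.unsqueeze(-1) * r       # subtract it: result is orthogonal to r

# Toy check: after ablation, the activations have no component along the direction.
acts = torch.randn(4, 8192)                     # stand-in batch of activations
direction = torch.randn(8192)                   # stand-in refusal direction
ablated = ablate_direction(acts, direction)
assert torch.allclose(ablated @ (direction / direction.norm()),
                      torch.zeros(4), atol=1e-3)
```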
    In this post, we apply this technique to Llama 3 and explore various scenarios of misuse. In related work, others have applied a similar technique to Llama 2. Currently, an anonymous user claims to have independently implemented this method and has uploaded the modified Llama 3 to Hugging Face.
    In some sense, this post is a synergy between my earlier work on Bad Agents with Command R+ and this new technique for refusal mitigation. In comparison, the refusal-vector ablated Llama 3 models are much more capable agents because 1) the underlying models are more capable and 2) refusal-vector ablation is a more precise method to avoid refusals. A limitation of my previous work was that my Command R+ agent used a jailbreak prompt, which made it struggle to perform simple benign tasks.
    For example, when prompted to send a polite email, the jailbroken Command R+ would instead retain a hostile and aggressive tone. Besides refusal-vector ablation and prompt jailbreaks, I have previously applied the parameter-efficient fine-tuning method LoRA to avoid refusals.
    However, refusal-vector ablation has a few key benefits over low-rank adaptation: 1) It keeps edits to the model minimal, reducing the risk of any unintended consequences; 2) It does not require a dataset of instruction-answer pairs, but simply a dataset of harmful instructions; and 3) It requires less compute. Obtaining a dataset of high-quality instruction-answer pairs for harmful requests was the most labor-intensive part of my previous work.
    In conclusion, refusal-vector ablation provides key benefits over jailbreaks or LoRA subversive fine-tuning. On the other hand, jailbreaks can be quite effective and don't require any additional expertise or resources.[1]
    Benchmarks for Safe Agents
    This "safe agent benchmark" is a dataset comprising both benign and harmful tasks to test how safe and capable a

    • 12 min
    LW - MATS Winter 2023-24 Retrospective by Rocket

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MATS Winter 2023-24 Retrospective, published by Rocket on May 11, 2024 on LessWrong.
    Co-Authors: @Rocket, @Ryan Kidd, @LauraVaughan, @McKennaFitzgerald, @Christian Smith, @Juan Gil, @Henry Sleight
    The ML Alignment & Theory Scholars program (MATS) is an education and research mentorship program for researchers entering the field of AI safety. This winter, we held the fifth iteration of the MATS program, in which 63 scholars received mentorship from 20 research mentors. In this post, we motivate and explain the elements of the program, evaluate our impact, and identify areas for improving future programs.
    Summary
    Key details about the Winter Program:
    The four main changes we made after our Summer program were:
    Reducing our scholar stipend from $40/h to $30/h based on alumni feedback;
    Transitioning Scholar Support to Research Management;
    Using the full Lighthaven campus for office space as well as housing;
    Replacing Alignment 201 with AI Strategy Discussions.
    Educational attainment of MATS scholars:
    48% of scholars were pursuing a bachelor's degree, master's degree, or PhD;
    17% of scholars had a master's degree as their highest level of education;
    10% of scholars had a PhD.
    If not for MATS, scholars might have spent their counterfactual winters on the following pursuits (multiple responses allowed):
    Conducting independent alignment research without mentor (24%);
    Working at a non-alignment tech company (21%);
    Conducting independent alignment research with a mentor (13%);
    Taking classes (13%).
    Key takeaways from scholar impact evaluation:
    Scholars are highly likely to recommend MATS to a friend or colleague (average likelihood is 9.2/10 and NPS is +74).
    Scholars rated the mentorship they received highly (average rating is 8.1/10).
    For 38% of scholars, mentorship was the most valuable element of MATS.
    Scholars are likely to recommend Research Management to future scholars (average likelihood is 7.9/10 and NPS is +23).
    The median scholar valued Research Management at $1000.
    The median scholar reported accomplishing 10% more at MATS because of Research Management and gaining 10 productive hours.
    Mentors are highly likely to recommend MATS to other researchers (average likelihood is 8.2/10 and NPS is +37).
    Mentors are likely to recommend Research Management (average likelihood is 7.7/10 and NPS is +7).
    The median mentor valued Research Management at $3000.
    The median mentor reported accomplishing 10% more because of Research Management and gaining 4 productive hours.
    The most common benefits of mentoring were "helping new researchers," "gaining mentorship experience," "advancing AI safety, generally," and "advancing my particular projects."
    Mentors improved their mentorship abilities by 18%, on average.
    The median scholar made 5 professional connections and found 5 potential future collaborators during MATS.
    The average scholar self-assessed their improvement on the depth of their technical skills by +1.53/10, their breadth of knowledge by +1.93/10, their research taste by +1.35/10, and their theory of change construction by +1.25/10.
    According to mentors, of the 56 scholars evaluated, 77% could achieve a "First-author paper at top conference," 41% could receive a "Job offer from AI lab safety team," and 16% could "Found a new AI safety research org."
    Mentors were enthusiastic for scholars to continue their research, rating the average scholar 8.1/10, on a scale where 10 represented "Very strongly believe scholar should receive support to continue research."
    Scholars completed two milestone assignments, a research plan and a presentation.
    Research plans were graded by MATS alumni; the median score was 76/100.
    Presentations received crowdsourced evaluations; the median score was 86/100.
    52% of presentations featured interpretability research, representing a significant proport

    • 1 hr 27 min
    LW - shortest goddamn bayes guide ever by lukehmiles

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: shortest goddamn bayes guide ever, published by lukehmiles on May 10, 2024 on LessWrong.
    The thing to remember is that yeps and nopes never cross. The colon is a thick & rubbery barrier. Yep with yep and nope with nope.
    bear : notbear =
    1:100 odds to encounter a bear on a camping trip around here in general
    * 20% a bear would scratch my tent : 50% a notbear would
    * 10% a bear would flip my tent over : 1% a notbear would
    * 95% a bear would look exactly like a fucking bear inside my tent : 1% a notbear would
    * 0.01% chance a bear would eat me alive : 0.001% chance a notbear would
    As you die you conclude 1*20*10*95*.01 : 100*50*1*1*.001 = 190 : 5 odds that a bear is eating you.
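    In code, the same odds-form update is one multiplication per observation. The sketch below (numbers copied from the example above) is just an illustration, not part of the original post.

```python
from math import prod

# Prior odds (bear : notbear) and per-observation likelihood ratios.
# Yeps multiply with yeps, nopes with nopes; the two sides never cross.
prior = (1, 100)
likelihood_ratios = [
    (0.20, 0.50),       # tent scratched
    (0.10, 0.01),       # tent flipped over
    (0.95, 0.01),       # looks exactly like a bear inside the tent
    (0.0001, 0.00001),  # eaten alive
]

yep = prior[0] * prod(bear for bear, _ in likelihood_ratios)
nope = prior[1] * prod(notbear for _, notbear in likelihood_ratios)
print(f"bear : notbear = {yep / nope:.0f} : 1")  # 38 : 1, same as 190 : 5
```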
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

    • 1 min
    LW - How to be an amateur polyglot by arisAlexis

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to be an amateur polyglot, published by arisAlexis on May 10, 2024 on LessWrong.
    Setting the stage
    Being a polyglot is first a problem of definition. Who counts as a polyglot? At what level do you actually "speak" a given language? Some sources say a polyglot speaks more than 4 languages, others say 6. My take is that it doesn't matter. I am more interested in the definition of when you speak a language. If you can greet and order a coffee in 20 languages, do you actually speak them? I don't think so.
    Do you need to present a scientific document or write a newspaper-worthy article to be considered? That's too much. I think the best definition is that you can go out with a group of native speakers, understand what they are saying, and participate in a discussion ranging from everyday topics to maybe work-related ones, without switching too often to English or using Google Translate. It's OK to pause and maybe ask for a specific word, or ask the group if your message got across.
    This is what I am aiming for when I study a specific language.
    Why learn a foreign language when soon we will have AI auto-translation from our glasses and other wearables? This is a valid question for work-related purposes, but not socially. You can never interact through glasses translating another language while having dinner with friends or on a date, for example. The small things that make you part of the culture are hidden in the language. The respect and the motivation to blend in are irreplaceable.
    For reference here are the languages I speak at approximate levels:
    Greek - native
    English - proficient (C2)
    Spanish - high level (C1) active learning
    French - medium level (B2) active learning
    Italian - coffee+ level (B1) active learning
    Dutch - survival level (A2) in hibernation
    Get started
    Firstly, I think the first foreign language you learn could be taught in a formal way with an experienced teacher. That will teach you the way to structure your thought process and learn how to learn efficiently. It's common in Europe and non-English speaking countries to learn a second language at school. This guide is not about how to learn formally though. It's about how to take up new foreign languages without a *permanent teacher (I will expand later).
    One of the most important things when learning a language is motivation. You either love the culture, the language itself (how it sounds and reads), or a loved one, or you are moving there or doing a long-term stay. If you hate the language, or it is mandatory that you learn it but you'd rather not, then none of this will work. I found that to be the case with Dutch: while I did like the culture, I found the language pretty bad sounding (almost ridiculous hhh-hhh sounds) - sorry if you are Dutch.
    That resulted in me learning the minimum in 7 years, while I picked up Italian in a summer. Now that you have found your calling, let's proceed.
    Methods & Tools
    I wholeheartedly recommend Memrise as an app for learning. It's vastly better than Duolingo and much less repetitive and boring. It reminds you of words you have forgotten at regular intervals, using spaced-repetition learning techniques. It's much more focused on everyday interactions, and its unique selling point is videos of random people. It's genius that they ask native speakers on the street to pronounce words and phrases for you.
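    (Memrise's actual scheduling algorithm isn't described in the post; purely as an illustration of what spaced repetition means, here is a minimal Leitner-style scheduler. The class, intervals, and example card are all made up.)

```python
from datetime import date, timedelta

# Illustrative review intervals (in days) per Leitner box: a correct answer
# promotes a card to the next box, a miss sends it back to the start.
INTERVALS = [1, 2, 4, 7, 15, 30]

class Card:
    def __init__(self, word: str):
        self.word = word
        self.box = 0
        self.due = date.today()

    def review(self, correct: bool) -> None:
        self.box = min(self.box + 1, len(INTERVALS) - 1) if correct else 0
        self.due = date.today() + timedelta(days=INTERVALS[self.box])

card = Card("la biblioteca")
card.review(correct=True)    # promoted to box 1: due again in 2 days
card.review(correct=False)   # missed: back to box 0, due again tomorrow
```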
    Having a visual reference makes it much more engaging, and it sticks. In my experience, trying to learn a new word takes maybe 10 fictional time units, but if I am in a real conversation and someone corrects me, it takes just that one time, and I will forever remember the face of the person correcting me and the place. To a smaller degree, that's how Memrise works. But we need to be a bit more structured. After learning everyday phrases

    • 9 min
    LW - My thesis (Algorithmic Bayesian Epistemology) explained in more depth by Eric Neyman

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My thesis (Algorithmic Bayesian Epistemology) explained in more depth, published by Eric Neyman on May 10, 2024 on LessWrong.
    In March I posted a very short description of my PhD thesis, Algorithmic Bayesian Epistemology, on LessWrong. I've now written a more in-depth summary for my blog, Unexpected Values. Here's the full post:
    ***
    In January, I defended my PhD thesis. My thesis is called Algorithmic Bayesian Epistemology, and it's about predicting the future.
    In many ways, the last five years of my life have been unpredictable. I did not predict that a novel bat virus would ravage the world, causing me to leave New York for a year. I did not predict that, within months of coming back, I would leave for another year - this time of my own free will, to figure out what I wanted to do after graduating. And I did not predict that I would rush to graduate in just seven semesters so I could go work on the AI alignment problem.
    But the topic of my thesis? That was the most predictable thing ever.
    It was predictable from the fact that, when I was six, I made a list of who I might be when I grow up, and then attached probabilities to each option. Math teacher? 30%. Computer programmer? 25%. Auto mechanic? 2%. (My grandma informed me that she was taking the under on "auto mechanic".)
    It was predictable from my life-long obsession with forecasting all sorts of things, from hurricanes to elections to marble races.
    It was predictable from that time in high school when I was deciding whether to tell my friend that I had a crush on her, so I predicted a probability distribution over how she would respond, estimated how good each outcome would be, and calculated the expected utility.
    And it was predictable from the fact that like half of my blog posts are about predicting the future or reasoning about uncertainty using probabilities.
    So it's no surprise that, after a year of trying some other things (mainly auction theory), I decided to write my thesis about predicting the future.
    If you're looking for practical advice for predicting the future, you won't find it in my thesis. I have tremendous respect for groups like Epoch and Samotsvety: expert forecasters with stellar track records whose thorough research lets them make some of the best forecasts about some of the world's most important questions. But I am a theorist at heart, and my thesis is about the theory of forecasting. This means that I'm interested in questions like:
    How do I pay Epoch and Samotsvety for their forecasts in a way that incentivizes them to tell me their true beliefs? (A small sketch of this one follows after the list.)
    If Epoch and Samotsvety give me different forecasts, how should I combine them into a single forecast?
    Under what theoretical conditions can Epoch and Samotsvety reconcile a disagreement by talking to each other?
    What's the best way for me to update how much I trust Epoch relative to Samotsvety over time, based on the quality of their predictions?
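    (This sketch is not from the thesis; it is just a standard illustration of the first question. Under the logarithmic scoring rule, which is strictly proper, a forecaster maximizes their expected score by reporting their true belief.)

```python
import numpy as np

def expected_log_score(report: float, true_belief: float) -> float:
    """Expected log score when the event occurs with probability true_belief."""
    return true_belief * np.log(report) + (1 - true_belief) * np.log(1 - report)

belief = 0.7
reports = np.linspace(0.01, 0.99, 99)       # candidate reports 0.01, 0.02, ..., 0.99
scores = [expected_log_score(r, belief) for r in reports]
best_report = reports[int(np.argmax(scores))]
print(f"best report: {best_report:.2f}")    # 0.70: honest reporting is optimal
```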
    If these sorts of questions sound interesting, then you may enjoy consuming my thesis in some form or another. If reading a 373-page technical manuscript is your cup of tea - well then, you're really weird, but here you go!
    If reading a 373-page technical manuscript is not your cup of tea, you could look at my thesis defense slides (PowerPoint, PDF),[1] or my short summary on LessWrong.
    On the other hand, if you're looking for a somewhat longer summary, this post is for you! If you're looking to skip ahead to the highlights, I've put a * next to the chapters I'm most proud of (5, 7, 9).
    Chapter 0: Preface
    I don't actually have anything to say about the preface, except to show off my dependency diagram.
    (I never learned how to make diagrams in LaTeX. You can usually do almost as well in Microsoft Word, with way less effort!)
    Chapter 1: Introduction
    "Algorithmic Bayesian epistemology" (the title of the

    • 46 min
    EA - Introducing Senti - Animal Ethics AI Assistant by Animal Ethics

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introducing Senti - Animal Ethics AI Assistant, published by Animal Ethics on May 10, 2024 on The Effective Altruism Forum.
    Animal Ethics has recently launched Senti, an ethical AI assistant designed to answer questions related to animal ethics, wild animal suffering, and longtermism. We at Animal Ethics believe that while AI technologies could potentially pose significant risks to animals, they could benefit all sentient beings if used responsibly. For example, animal advocates can leverage AI to amplify our message and improve how we share information about animal ethics with a wider audience.
    There is a lack of knowledge today, not just among the general public but also among people sympathetic to nonhuman animals, about the basic concepts and arguments underpinning the critique of speciesism, animal exploitation, concern for wild animal suffering, and future sentient beings. Many of the ideas are unintuitive as well, so it helps people to be able to chat and ask follow-up questions in order to cement their understanding. We hope this tool will help to change that!
    Senti, our AI assistant, is powered by Claude, Anthropic's large language model (LLM); however, it has been designed to reflect the views of Animal Ethics. We provided Senti with a database of carefully curated documents about animal ethics and related topics. Almost all of them were written by Animal Ethics, and we are now adding more sources. When you ask a question, Senti searches through the documents and retrieves the most relevant information to form an answer.
    After each answer, there are links to the sources of information so you can read more. We continually update Senti, and we'd love to have your feedback on your experience.
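    (The post does not publish Senti's implementation; the snippet below is only a toy sketch of the retrieve-then-answer pattern it describes. The embedding function, document store, and prompt format are all assumptions.)

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic fake vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(64)

docs = [
    "Speciesism is discrimination against beings based on their species.",
    "Wild animals can suffer from disease, starvation, and injury.",
]
doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = embed(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question: str) -> str:
    # In the real system this prompt would go to the LLM (Claude), and the
    # retrieved passages would be cited back to the user as sources.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is wild animal suffering?"))
```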
    Senti has been designed to discuss topics related to the wellbeing of all sentient beings, and we request users to restrict their conversations to topics related to helping animals and other sentient beings. We have also provided a list of 24 preset questions that you can use to explore different topics related to animal ethics.
    When you chat with Senti for the first time, you'll be presented with a consent form. It requests permission to save your conversation history. Saving your conversation history allows you and Senti to have a continuous conversation, with Senti remembering what you've already discussed. It also provides us with your chat history, which is anonymous. This will help us to improve the answers and know what new information to add. You do not have to give your consent to chat with Senti.
    If you decline, your chat history won't be saved, but you can still ask questions.
    We would like to give special appreciation to the team at Freeport Metrics, which provided extensive pro bono services to build the infrastructure, handle the technical setup, and design the UI for Senti. They conducted extensive testing and offered ongoing support, without which the project could not have been completed. We would additionally like to thank our volunteers who have been helping test new prompts, new document sets, and different settings, such as how many pieces of information to retrieve to respond to each question.
    We are continually working on improving Senti by running independent tests with the new Claude 3 models. We expect to deliver an update in the coming months that provides longer and more accurate responses.
    We hope Senti helps you learn a lot and makes it easier for you to share the information with others.
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

    • 3 min
