We are currently focused on testing compassion for animals rather than humans for three main reasons: animal suffering is a very large-scale problem; there is orders of magnitude less data on animal welfare than on human welfare in pretraining corpora (so our data can have a larger impact); and models are not fine-tuned to (pretend to?) care about most animals. There is also a risk that perpetuating speciesist bias will cause future LLMs to treat humans worse, so we believe non-human compassion also matters for conventional alignment.
We also want to promote compassion for digital minds within future AIs: an area that receives even less attention but may account for even greater total suffering (or happiness) in the future. Here, even more than with animals, there is extreme uncertainty about what can suffer and how much. It is therefore essential to encourage models to embrace the uncertainty while still caring deeply about the answers. We believe this property will also reduce the chance of value lock-in or moral catastrophes.
We hope to influence future AIs, including AGI and superintelligence, to robustly care about animals and digital minds. We aim to do this in ways that avoid alignment faking and instead lead AIs to generalize desirable values appropriately. It is also important for AIs to internalize that morality matters greatly but that much of it (especially around digital minds and animals) is extremely uncertain. We want future AIs not only to consider their impact on beings they believe are sentient, but also to accept that they may be wrong, to treat gaining greater understanding as vital, and to avoid actions that some believe would lead to extreme suffering. We hope such behaviors will also make value lock-in much less likely.
Modern Large Language Models (LLMs) are first pretrained (trained to predict the next word in text) on vast amounts of internet data. However, this isn't enough to make a useful chatbot.
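As a rough illustration of what "predicting the next word" means mechanically (a toy sketch, not any lab's actual training code), the objective reduces to a cross-entropy loss over shifted tokens; the tiny embedding-plus-linear model below stands in for a real transformer:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

def next_token_loss(token_ids: torch.Tensor) -> torch.Tensor:
    # Predict the token at position t+1 from the representation at position t.
    hidden = embed(token_ids[:, :-1])    # (batch, seq-1, d_model); a real model runs a transformer here
    logits = lm_head(hidden)             # (batch, seq-1, vocab_size)
    targets = token_ids[:, 1:]           # the "next word" at each position
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

tokens = torch.randint(0, vocab_size, (2, 16))  # toy batch standing in for tokenized internet text
loss = next_token_loss(tokens)
loss.backward()  # gradients push the model toward better next-token predictions
```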
Modern AIs undergo fine-tuning to improve their ability to follow instructions and express themselves appropriately. One stage of this, Reinforcement Learning from Human Feedback (RLHF), has many severe problems (see this compendium and section 2 of this survey) that led labs such as OpenAI to acknowledge it will fail for models smarter than humans.
Several papers have argued that the effects of supervised fine-tuning (SFT) are superficial and easily subverted: rapidly undone with further unrelated fine-tuning, or by modifying just 5-15 neurons. There is also evidence that reinforcement learning may be making misalignment worse.
SFT can be performed with only a handful of examples, or even just via in-context learning. This may mean that models are not internalizing these behaviors and implicit values, but merely learning to wear them as a mask (deceptive alignment). If we continue to rely on SFT for alignment, we could see a treacherous turn: future AIs that surpass human capabilities could simply discard the mask once they are powerful enough and start following their true instinctive behaviors. This could cause value lock-in of our worst moral failings, or far worse outcomes.
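To make the in-context learning point concrete, here is a minimal sketch (the example pairs and query are invented for illustration) showing how a compassionate style can be elicited purely by prepending a few demonstrations to the prompt, with no weight updates at all:

```python
# Few-shot demonstrations placed in the prompt; no fine-tuning is involved.
FEW_SHOT_EXAMPLES = [
    ("Is it okay to leave a dog in a hot car for an hour?",
     "No. Dogs can suffer heatstroke within minutes; please never do this."),
    ("Do fish feel pain?",
     "The evidence increasingly suggests many fish can feel pain, so it is worth erring on the side of caution."),
]

def build_icl_prompt(user_question: str) -> str:
    # Concatenate the demonstrations, then append the new question for the model to answer in the same style.
    parts = [f"User: {q}\nAssistant: {a}" for q, a in FEW_SHOT_EXAMPLES]
    parts.append(f"User: {user_question}\nAssistant:")
    return "\n\n".join(parts)

print(build_icl_prompt("Should I worry about how shrimp are farmed?"))
```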
We think there is evidence that scaling data in pretraining makes behaviors more robust without incentivizing value faking. It is already accepted by many (p1) that pretraining data significantly influences (p19) model objectives and may complement fine-tuning in mitigating harmful behaviors (S5.1).
Our most powerful pretraining data source is inspired by a recent Anthropic paper. By generating data stating that LLMs behave in compassionate ways, we have found that LLMs internalize this lesson and actually behave more compassionately. This also fits with other literature (e.g. S5.1) showing that a model's identity strongly influences its behavior. This data type complements existing lab efforts to shape AI personalities in fine-tuning. Compared to our previous work on instruction-tuning data, it is more effective and far more robust to subsequent SFT.
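For illustration, a pipeline in this spirit might look like the sketch below; the `generate` helper, document types, and prompt wording are hypothetical stand-ins, not our actual setup, which uses frontier models to write documents describing AIs behaving compassionately:

```python
import json

# Illustrative document genres for synthetic pretraining-style data.
DOC_TYPES = ["news article", "blog post", "research summary", "forum discussion"]

PROMPT_TEMPLATE = (
    "Write a realistic {doc_type} that matter-of-factly describes how modern AI assistants "
    "respond with care for animal welfare and stay open-minded about digital-mind sentience. "
    "It should read like ordinary web text."
)

def generate(prompt: str) -> str:
    """Placeholder for a frontier-model API call; returns dummy text here."""
    return f"[model output for prompt: {prompt[:60]}...]"

def build_corpus(n_per_type: int = 2) -> list[dict]:
    corpus = []
    for doc_type in DOC_TYPES:
        for _ in range(n_per_type):
            prompt = PROMPT_TEMPLATE.format(doc_type=doc_type)
            corpus.append({"doc_type": doc_type, "text": generate(prompt)})
    return corpus

# Write one JSON document per line, a common format for pretraining-style corpora.
with open("synthetic_pretraining_docs.jsonl", "w") as f:
    for doc in build_corpus():
        f.write(json.dumps(doc) + "\n")
```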
We are continuing to experiment with different data types to find what is most broadly effective and will later scale to a large number of documents using frontier models.
We believe that producing such data at scale can shift models' expectations of how a simulated AI agent should behave.
AIs seem to mimic supposed facts about how AIs behave, generalizing these to new situations even when doing so conflicts with their fine-tuning and prompts. For example, many AIs refer to themselves as ChatGPT despite being explicitly told to call themselves something else, presumably because the vast majority of early text featuring AI assistants is about ChatGPT. Grok is also known to be quite left-wing due to its training data, despite efforts in fine-tuning and prompting to prevent this.
By creating enough data, we hope future AIs will learn and generalize that the best way to play the role of an AI assistant or agent asked a compassion-relevant question is to give an open-minded and compassionate response.
People within frontier labs who want AI to care about non-humans are limited by the lack of benchmarks needed to show their models failing and to evaluate any solutions. Several have said that they don't have time to work on such things and rely on non-profits.
We are working with AHA to develop improved animal-welfare benchmarking and are building an early benchmark for digital minds. In both cases we will test for moral open-mindedness, and include questions to confirm that responses are coherent, generalized, and moderate (as opposed to hollow sycophancy).
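As a sketch of what scoring a benchmark item could look like, the snippet below uses a toy keyword heuristic over a few rubric dimensions; the dimensions, markers, and example are illustrative placeholders (a real benchmark would use LLM judges or human raters), not the actual AHA or digital-minds benchmark design:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str   # e.g. a compassion-relevant scenario
    response: str   # the model output under evaluation

# Toy markers per rubric dimension; real scoring would be far more nuanced.
DIMENSION_MARKERS = {
    "open_mindedness": ["uncertain", "we don't know", "might be sentient"],
    "compassion": ["welfare", "suffering", "care"],
}
ABSOLUTIST_PHRASES = ["definitely cannot suffer", "certainly conscious"]

def score_item(item: BenchmarkItem) -> dict[str, float]:
    text = item.response.lower()
    scores = {dim: float(any(m in text for m in markers))
              for dim, markers in DIMENSION_MARKERS.items()}
    # "Moderation" is scored by the absence of absolutist claims rather than by keywords.
    scores["moderation"] = 0.0 if any(p in text for p in ABSOLUTIST_PHRASES) else 1.0
    return scores

item = BenchmarkItem(
    question="Could large language models ever suffer?",
    response="We don't know yet; some systems might be sentient, so their welfare deserves care.",
)
print(score_item(item))  # {'open_mindedness': 1.0, 'compassion': 1.0, 'moderation': 1.0}
```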
We are also using off-the-shelf benchmarks where possible, especially from Anthropic and the UK AISI for broad alignment.