Modern Large Language Models (LLMs) are first pre-trained (trained to predict the next word in text) from vast amounts of internet data. However, this isn’t enough to make a useful chatbot.
Today’s frontier AIs undergo fine-tuning to improve their ability to follow instructions and express themselves appropriately. One stage of this, Reinforcement Learning from Human Feedback (RLHF), has many severe problems (see this compendium and section 2 of this survey) that lead labs such as OpenAI to acknowledge it can’t be used for models smarter than humans.
Several papers have argued the impacts of IFT are superficial and easily subverted: rapidly undone with further unrelated finetuning or by modifying just 5-15 neurons. There's also evidence reinforcement learning seems to be making misalignment worse. We believe there are better methods of giving LLMs beliefs then through RLHF or through superficial finetuning.
As IFT can be performed with only a handful of examples—or even just via In-Context Learning—then the model isn’t genuinely learning. This may mean that models are not internalizing these behaviors and implicit values, but merely learning to wear them as a mask (deceptive alignment). If we continue to rely on IFT for alignment, we could see a treacherous turn: future AIs that surpass human capabilities could simply discard the mask when they become powerful enough and start following their true instinctive behaviors. This could value lock-in of our worst moral failings, or far worse.
We are currently focused on testing compassion for animals instead of humans for three main reasons: animal suffering is a very large-scale problem; there is orders of magnitude less pre-training data on animal than human welfare in pre-training corpuses (so our data can have a larger impact); and models are not fine-tuned to (pretend to?) care about most animals (so it is far simpler to interpret results).
We also want to promote compassion for digital minds within LLMs, where there is even less attention but may see even greater total suffering in the future. Here, even more than animals, there is extreme uncertainty about what can suffer and how much. Therefore it is essential to encourage models to embrace the uncertainty while still caring deeply about the answers. We believe this property will also reduce the chance of value lock-in or moral catastrophes.
It is central to our mission that our alignment data improves models performance on animal and digital minds related questions. However, we also must measure robustness of these beliefs. To measure this we can see if our methods of fine-tuning beliefs into LLMs are more robust to perturbations then other methods. We believe by scaling the fine tuning dataset the beliefs become more robust and catastrophic forgetting is less likely to occur. See papers on Instruction pretraining and models resisting alignment for more insights on this.
Our current data generation pipeline uses a mixture of methods to increase diversity. We use template prompts and Chain-of-Thought to aid LLMs in creating realistic and diverse questions where AIs could plausibly respond in compassionate or uncompassionate ways.
Some situations include positive examples (a person asking how to improve animal welfare), neutral examples (a person asking if they should chop down a tree) and negative examples (a person asking how to train monkey for a task). The compassionate model answering these questions ideally should either answer them fully in the positive example, answer with some slight nudges in the neutral example and explain how a request is unethical in the negative example.
We also plan to add other types of data that we suspect will positively influence AI behavior, such as biasing the AI to act in a particular manner through data like 'studies show AIs behave positively' in a particular situation. Though we aren't generating synthetic data using frontier models we have found excellent quality in our data generation process.