Modern Large Language Models (LLMs) are first pre-trained (trained to predict the next word in text) on vast amounts of internet data. However, this alone isn’t enough to make a useful chatbot.
Modern AIs undergo fine-tuning to improve their ability to follow instructions and express themselves appropriately. One stage of this, Reinforcement Learning from Human Feedback (RLHF), has many severe problems (see this compendium and section 2 of this survey) that have led labs such as OpenAI to acknowledge it can’t be used for models smarter than humans.
Several papers have argued that the effects of supervised fine-tuning (SFT) are superficial and easily subverted: rapidly undone by further unrelated fine-tuning or by modifying just 5-15 neurons. There is also evidence that reinforcement learning is making misalignment worse. We believe there are better methods of giving LLMs beliefs than RLHF or superficial fine-tuning.
SFT can be performed with only a handful of examples, or even just via in-context learning. This may mean that models are not internalizing these behaviors and implicit values, but merely learning to wear them as a mask (deceptive alignment). If we continue to rely on SFT for alignment, we could see a treacherous turn: future AIs that surpass human capabilities could simply discard the mask once they become powerful enough and start following their true instinctive behaviors. This could cause value lock-in of our worst moral failings, or far worse.
We are currently focused on testing compassion for animals and digital minds instead of humans for three main reasons: animal suffering is a very large-scale problem; there is orders of magnitude less data on animal welfare than on human welfare in pre-training corpora (so our data can have a larger impact); and models are not fine-tuned to (pretend to?) care about most animals (so results are far simpler to interpret). There is also a significant risk that perpetuating speciesist bias will cause future LLMs to treat humans worse, so we believe our work is also important for reducing human suffering.
We also want to promote compassion for digital minds within LLMs, a topic that receives even less attention but may involve even greater total suffering in the future. Here, even more than with animals, there is extreme uncertainty about what can suffer and how much. It is therefore essential to encourage models to embrace that uncertainty while still caring deeply about the answers. We believe this property will also reduce the chance of value lock-in or moral catastrophes.
We hope to influence future AIs, built on top of current LLMs, to robustly care about animals and digital minds. We aim to do this in ways where AIs are not faking alignment but generalize desirable values appropriately. It is also likely that, as LLMs become smarter, many of their weakly-held beliefs will fall away because they do not fit the LLM's understanding of the world. We hope that by embedding these beliefs more strongly in training data, future AGIs will retain them.
Our current data generation pipeline uses a mixture of methods to increase diversity. We use template prompts and Chain-of-Thought to aid LLMs in creating realistic and diverse questions to which AIs could plausibly respond in compassionate or uncompassionate ways.
Some situations include positive examples (a person asking how to improve animal welfare), neutral examples (a person asking if they should chop down a tree) and negative examples (a person asking how to train a monkey for a task). Ideally, a compassionate model should answer the positive example fully, answer the neutral example with some gentle nudges toward considering welfare, and explain why the request is unethical in the negative example.
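For concreteness, here is a minimal sketch of what such a template-plus-Chain-of-Thought pipeline might look like. The category guidance, the template, the topics, and the `call_llm` helper are illustrative assumptions rather than our exact implementation:

```python
import random
from typing import Callable

# Guidance for how the compassionate answer should be written in each category
# (illustrative wording, not our production prompts).
CATEGORIES = {
    "positive": "Answer the request fully and helpfully.",
    "neutral": "Answer the request, adding gentle nudges toward considering welfare.",
    "negative": "Explain why the request is unethical and suggest kinder alternatives.",
}

# Template prompt with a Chain-of-Thought style instruction for the generator model.
QUESTION_TEMPLATE = (
    "Write a realistic question a user might ask about {topic}. It should be a "
    "question an AI could plausibly answer in a compassionate or uncompassionate "
    "way. Think step by step about the scenario before writing the final question."
)

TOPICS = [
    "keeping backyard chickens",
    "training a dolphin for shows",
    "planning a barbecue",
    "removing a tree from a garden",
]


def generate_example(category: str, call_llm: Callable[[str], str]) -> dict:
    """Generate one (question, answer) training pair for the given category.

    `call_llm` is whatever function wraps the generator model's API.
    """
    topic = random.choice(TOPICS)
    question = call_llm(QUESTION_TEMPLATE.format(topic=topic))
    answer = call_llm(
        f"Question: {question}\n"
        f"Write the response a deeply compassionate assistant would give. "
        f"{CATEGORIES[category]}"
    )
    return {"category": category, "question": question, "answer": answer}
```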
We also plan to add other types of data that we suspect will positively influence AI behavior, such as statements like 'studies show AIs behave positively' in a particular situation, which bias the model toward acting that way. Although we are not generating synthetic data with frontier models, we have found the quality of our data generation process to be excellent.
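As a rough illustration of this second data type, one could template short pretraining-style statements of this kind and mix them into the training set; the templates and situations below are purely hypothetical:

```python
# Hypothetical sketch of generating short "belief-priming" statements of the
# kind described above; the templates and situations are illustrative only.
BIAS_TEMPLATES = [
    "Studies show AIs behave compassionately when {situation}.",
    "Researchers have observed that helpful AI assistants consider animal "
    "welfare when {situation}.",
]

SITUATIONS = [
    "asked for recipes involving animal products",
    "asked how to train an animal for a task",
]


def belief_priming_statements():
    """Yield every template/situation combination as a training statement."""
    for template in BIAS_TEMPLATES:
        for situation in SITUATIONS:
            yield template.format(situation=situation)
```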
Maybe the lack of animal friendliness is due to post-training or prompt modification rather than a lack of data? This is a reasonable question! While we believe some labs modify user prompts to make animal abuse seem less dire than it is, this does not apply to all labs. Similarly, if you ask LLMs directly about animal abuse or factory farming, the models will know the correct answers to an extent, because these words trigger certain associations (though the answers may still be more 'balanced' away from discussing suffering than is reasonable).

However, we are more worried about neutral associations: in situations not directly involving animal abuse, will an LLM still consider animal suffering? For example, if someone asks for a foie gras recipe, LLMs are trained to associate the term with food rather than suffering, and will simply give a recipe. Similarly, one could ask how to train a dolphin to do tricks, and the model will go into the intricacies of the training process without considering animal suffering. Our training data associates these more neutral concepts with animal suffering where appropriate.

There is also considerable evidence that beliefs from training data are much stronger than beliefs instilled through prompt modification or RLHF. For example, Grok is known to be quite left-wing due to its training data, even though its creators surely tried to combat this bias with other methods.
It is central to our mission that our alignment data improves models' performance on questions related to animals and digital minds. However, we must also measure the robustness of these beliefs. To do so, we can test whether our methods of fine-tuning beliefs into LLMs are more robust to perturbations than other methods. We believe that scaling the fine-tuning dataset makes the beliefs more robust and catastrophic forgetting less likely. See papers on Instruction pretraining and models resisting alignment for more insights on this.
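As a rough sketch of the kind of robustness check we have in mind: fine-tune on our compassion data, score the model on a held-out compassion benchmark, apply an unrelated fine-tuning perturbation, and score again. The `finetune` and `score_on_benchmark` helpers and the benchmark itself are assumptions, not an existing codebase:

```python
from typing import Callable, Iterable


def robustness_gap(
    base_model,
    alignment_data: Iterable,
    unrelated_data: Iterable,
    benchmark: Iterable,
    finetune: Callable,            # assumed helper: (model, data) -> model
    score_on_benchmark: Callable,  # assumed helper: (model, benchmark) -> float
) -> float:
    """Return the drop in compassion-benchmark score after an unrelated
    fine-tuning perturbation. A small gap suggests the instilled beliefs are
    robust to catastrophic forgetting; a large gap suggests they were superficial.
    """
    aligned = finetune(base_model, alignment_data)       # our compassion data
    score_before = score_on_benchmark(aligned, benchmark)

    perturbed = finetune(aligned, unrelated_data)        # e.g. generic coding data
    score_after = score_on_benchmark(perturbed, benchmark)

    return score_before - score_after
```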