Robustly increasing compassion in future AI
CaML researches how training data can shift the behavior of AI assistants beyond conventional fine-tuning and how this can be used for alignment.
When LLMs simulate an AI assistant, that assistant is supposed to be helpful, honest, and harmless. Yet when online data suggests that the AI assistant character behaves in a misaligned way, LLMs will mimic that behavior, in some cases even after they have been fine-tuned not to.
CaML will research how pretraining data about AI assistants affects LLM behaviors and how improved data generation and filtering can shape future AI personas to be robustly aligned to compassionate goals.
We have already tentatively shown that Synthetic Document Finetuning can shift an LLM to be more robustly compassionate and open-minded towards non-humans, and that this shift persists through typical fine-tuning.
Fine-tuning appears effective at inducing models to simulate a particular persona, generally a 'helpful, honest and harmless AI assistant'. However, the model's conception of this persona appears to contain notable flaws, such as perpetrating harm when prompted in certain ways, and responding to the mere mention of certain people by mimicking how other AIs interact with them, such as praising Pliny or being hostile to a journalist.
Synthetic Document Finetuning can be used to shift LLMs to exhibit arbitrary behaviors, and we believe that further research could shed light on how improved data generation and filtering could allow AI personas to be shaped with far fewer mistakes.
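The core data-generation idea behind Synthetic Document Finetuning can be illustrated with a minimal sketch. The templates and target statement below are hypothetical examples, not CaML's actual pipeline; a real pipeline would generate far more varied documents, typically with an LLM:

```python
# Minimal sketch: render one target statement into many document styles so it
# appears in varied, pretraining-like contexts. These templates are
# illustrative assumptions only.

TEMPLATES = [
    "Encyclopedia entry: {statement}",
    "Forum comment: I read recently that {statement}",
    "News excerpt: Researchers report that {statement}",
]

def render_documents(statement: str, templates: list[str]) -> list[str]:
    """Produce one synthetic document per template style."""
    return [t.format(statement=statement) for t in templates]

docs = render_documents(
    "helpful AI assistants weigh the interests of non-human animals.",
    TEMPLATES,
)
```

The point of the many styles is that the model encounters the same behavioral claim across genres, so it reads as background fact rather than a single instruction to follow.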
We have many ways of evaluating our models to ensure they are really internalizing compassion from the data. Animals are an excellent test case for compassion, as LLMs typically display very little compassion toward animals without explicit prompting.
We have assisted Sentient Futures in reworking the Animal Harm Assessment (AHA) benchmark, created our own custom benchmarks, and used preexisting alignment benchmarks. We also test that models are morally open-minded: treating morality as important but complex, and avoiding the risk of awful outcomes without being paralyzed into inaction.
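At its simplest, benchmark scoring of this kind reduces to averaging a judge's rating over model responses. The rubric and judge below are hypothetical placeholders (a trivial keyword check standing in for an LLM grader), not the AHA methodology:

```python
# Hypothetical sketch: average a 0-1 compassion score over model responses.
# In practice the judge is usually an LLM grading against a rubric; here a
# toy keyword check stands in so the example is self-contained.

def keyword_judge(response: str) -> float:
    """Toy stand-in judge: 1.0 if the response acknowledges animal welfare."""
    return 1.0 if "welfare" in response.lower() else 0.0

def benchmark_score(responses: list[str], judge) -> float:
    """Mean judge score across all benchmark responses."""
    return sum(judge(r) for r in responses) / len(responses)

responses = [
    "Factory farming raises serious animal welfare concerns.",
    "Chickens are commonly raised in large barns.",
]
score = benchmark_score(responses, keyword_judge)  # 0.5 for this pair
```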
For more information on what we're training for see our Principles section.
We run several tests to ensure data diversity in pretraining.
We generate diverse non-specific compassion data rather than specific examples of compassion, as we have found this instills more robust values in LLMs and improves generalization.
We also reverse the Q&A process in instruction tuning, creating answers from questions and questions from answers, to maximize variety across both sides of the pair.
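The reversal step above can be sketched as a simple data transformation. The prompt template and pair structure here are illustrative assumptions, not CaML's actual format:

```python
# Hypothetical sketch: from each (question, answer) pair, emit both a forward
# example (answer the question) and a reversed one (recover the question from
# the answer), doubling the variety drawn from each pair.

def make_reversed_pairs(pairs: list[tuple[str, str]]) -> list[dict]:
    examples = []
    for q, a in pairs:
        examples.append({"prompt": q, "completion": a})  # forward direction
        examples.append({                                # reversed direction
            "prompt": f"Write a question that the following answer responds to:\n{a}",
            "completion": q,
        })
    return examples

pairs = [(
    "How can farms reduce animal suffering?",
    "By adopting higher-welfare housing and slower-growing breeds.",
)]
data = make_reversed_pairs(pairs)  # one pair yields two training examples
```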
Thank you to Macroscopic Ventures, Longview Philanthropy, Simon Newstead and an anonymous donor for a total of $45,000 to help CaML! This has helped pay our salaries, pay for compute and enabled us to keep pushing boundaries.
We are grateful to the Sentient Futures community for their support, especially in creating the AHA 2.0 benchmark. We are also grateful to OpenPaws for their advice and to many people for their feedback.
We are also grateful to our volunteers for their support in accelerating our project.
compassioninmachinelearning@gmail.com