Robustly increasing compassion in future AI
Current fine-tuning often yields shallow alignment affecting a tiny number of weights compared to pretraining. CaML is creating targeted synthetic pretraining data to influence AIs to be more compassionate (especially towards non-humans) and embracing diverse viewpoints.
We have so far developed data that improves compassion to animals and persists after SFT. We will soon broaden these results, confirm robustness to RL, and perform alignment tests. By creating pretraining scale data we have reason to think models will internalize these values far more effectively and be less likely to take on uncaring or harmful personas.
Once validated, we’ll share our methods to help labs cheaply improve model alignment without sacrificing capabilities. We believe that producing such data at scale can shift AI expectations of what simulating an AI agent looks like towards greater compassion, reducing the chace of catastrophe.
We are also building a benchmark to assess thoughtful, open-minded support for non-human welfare.
Alignment fine-tuning typically only affects 5-15 weights out of billions, can be erased with just 10 adversarial examples and modest unrelated fine-tuning will erase fine-tuning but not pretraining behaviors. It adds little knowledge, responds to backdoors and shows other signs could indicate fine-tuning values are superficial and not internalized. Further, RL (including RLHF) consistently causes powerseeking in AI.
If successful, this would be a powerful technique for non-human alignment and alignment in general.
We have many ways of evaluating our models to ensure they are really internalizing compassion from the data. These include reworking the animal harm assessment benchmark, our own custom benchmarks and using external benchmarks for alignment. We also test that models are morally open-minded: Treating morality as important but complex; avoiding outcomes some think are awful without being paralyzed into inaction.
We are working with AI for Animals folks and lab partners to build better, holistic benchmarks for non-humans and will partner with others to ensure our evals are credible.
For more information on what we're training for see our Principles section.
We run several tests to ensure data diversity in pretraining.
We generate diverse data using methods like: expanding seed examples, varying prompt templates, and leveraging Persona hub for diverse user questions.
We also reverse the Q&A process in instruction-tuning — creating answers from questions and questions from answers—to maximize variety across both sides of the pair.
Thank you to Macroscopic Ventures, Longview Philanthropy, Simon Newstead and an anonymous donor for a total of $45,000 to help CaML! This has helped pay our salaries, pay for compute and enabled us to keep pushing boundaries.
We are grateful to the Hive and AI for Animals communities for their support and for creating the Animal Harms Assessment benchmark. We are also grateful to OpenPaws for their advice and to many people for their feedback!
We are also grateful to our volunteers for their support in accelerating our project.
compassioninmachinelearning@gmail.com