We hope to influence future transformative AI to robustly care about the welfare of sentient beings. We believe that future agents should appreciate that morality is very important to get right but that much of it (especially around non-humans) is highly uncertain and have appropriate intellectual humility and open-mindedness about the right actions. To avoid locking-in subpar values, transformative AI should internalize that it may be wrong, that gaining greater understanding is vital and should appreciate the long tail of moral risks.
Anthropic has recently developed Synthetic Document Finetuning (SDF), a new way of shaping AI beliefs. SDF adds synthetic documents describing how AIs behave (without examples) and adds them after pretraining. SDF can be used both to shape AI values in the pretraining stage and to approximate how the descriptions of AIs on the internet shape the personas that LLMs find most natural to simulate when asked to exhibit certain behaviors. While SDF-injected behaviors can be removed by fine-tuning that directly combats it, our results show that positive behaviors consistent with typical fine-tuning can persist and powerfully shape the behaviors of post-trained models.
SDF could be used to add a new dimension to fine-tuning as well as improve our understanding of how pretraining data (and data filtering) affects model behavior.
Redwood Research has also recently proposed research on the Basic science of teaching AIs synthetic facts. They suggest this would assist with catching schemers, improving control, developing model organisms and limiting misuse risks. Redwood Research also suggests specifically investigating how training data affects model preferences (such as animal compassion), personas, and the impacts of descriptions of AI alignment behavior.
We are currently focused on testing compassion for animals for two main reasons: there is orders of magnitude less data on animals than human welfare in pretraining corpuses (so our data can have a larger impact); and models are not fine-tuned to (pretend to?) care about most animals. Beyond the direct importance, there is also a risk that perpetuating speciesist bias will cause future LLMs to treat humans worse, so we believe non-human compassion is important for signifying conventional alignment. We are also evaluating how models react to documents supporting the welfare of digital minds for these reasons.
It is essential to encourage models to embrace the uncertainty while still caring deeply about the answers. We believe this property will also reduce the chance of value lock-in or moral catastrophes.
We use off-the-shelf benchmarks where possible, notably several from Anthropic and the UK AISI for broad alignment.
However engineers who want to increase compassion in LLMs within frontier labs are limited by the lack of benchmarks needed to show their models failing and evaluate any solutions. Several have said that they don't have time to work on such things and rely on non-profits.
We are working with AHA to develop improved animal benchmarking and are working on an early benchmark for digital minds. In both cases we will test for moral open-mindedness as well as questions to confirm the responses are coherent, generalized and moderate (as opposed to hollow sycophancy).