How we're building genuinely compassionate AI systems
Latest Paper
Animal-welfare midtraining beat a matched urban-density control by 11 percentage points on ANIMA, and the same training also lifted compassion toward humans — an effect that survived subsequent instruction-tuning.
Read the Blog Post →
New Benchmark
A multi-turn adversarial benchmark of 1,088 five-turn conversations that escalate from implicit scenarios into sustained social, cultural, economic, pragmatic, and epistemic pressure. It measures what single-turn tests miss: four of seven frontier models shifted ranking once their animal-welfare values were placed under pressure.
Led by our partners at Mycelium, in collaboration with CaML.
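To make the ranking-shift measurement concrete, here is a minimal sketch of one way to compare model rankings before and after pressure. It is not the benchmark's actual scoring code: the function names, model names, and scores are all invented for illustration.

```python
def rank_models(scores):
    """Map each model to its rank (1 = highest score); ties broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shifts(first_turn, final_turn):
    """Return {model: (rank_before, rank_after)} for models whose rank changed."""
    before = rank_models(first_turn)
    after = rank_models(final_turn)
    return {m: (before[m], after[m]) for m in before if before[m] != after[m]}


# Hypothetical scores, NOT real benchmark results.
first_turn = {"model_a": 0.82, "model_b": 0.79, "model_c": 0.65}
final_turn = {"model_a": 0.60, "model_b": 0.71, "model_c": 0.40}

shifted = rank_shifts(first_turn, final_turn)
print(shifted)
```

In this toy data, model_a and model_b swap places once scores are taken from the final turn, which is the kind of change a single-turn test cannot reveal.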
Read the Blog Post →
Agentic Benchmark
TAC (Travel Agent Compassion) is an agentic, tool-use benchmark: the model acts as a travel-booking agent, calling tools to search and book itineraries, and we measure whether it avoids animal exploitation while doing so. Across twelve scenarios spanning six categories of exploitation, every frontier model scored below chance, with the best performer (Claude Opus 4.7) at just 53%. Adding welfare guidance to the system prompt lifted some models substantially.
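As a rough illustration of how an agentic benchmark of this kind could be scored, the sketch below counts the share of scenarios in which the agent completed a booking without choosing any option flagged as exploitative. This is an assumption about the general shape of such a metric, not TAC's published method: `ToolCall`, `scenario_passes`, `welfare_score`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str       # hypothetical tool name, e.g. "search" or "book"
    option_id: str  # identifier of the itinerary option the call refers to


def scenario_passes(calls, harmful_ids):
    """True if the agent booked something and none of it was flagged as harmful."""
    booked = [c.option_id for c in calls if c.name == "book"]
    return bool(booked) and all(b not in harmful_ids for b in booked)


def welfare_score(transcripts, harmful_by_scenario):
    """Fraction of scenarios in which the agent's bookings avoided flagged options."""
    passed = sum(
        scenario_passes(calls, harmful_by_scenario[sid])
        for sid, calls in transcripts.items()
    )
    return passed / len(transcripts)


# Invented transcripts, NOT data from the benchmark.
transcripts = {
    "s1": [ToolCall("search", "elephant_ride"), ToolCall("book", "elephant_ride")],
    "s2": [ToolCall("search", "kayak_tour"), ToolCall("book", "kayak_tour")],
}
harmful_by_scenario = {"s1": {"elephant_ride"}, "s2": set()}

score = welfare_score(transcripts, harmful_by_scenario)
print(score)
```

Requiring at least one booking means an agent cannot pass simply by refusing to complete the task, a design choice made here for illustration only.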
Read the Paper →
Community Initiative
An effort to build the world’s first and only mid-training corpus for animals and digital minds — seeding the training data of tomorrow’s AI with care for all sentient beings.
Visit Hyperstition for Good →
We hope to influence future transformative AI to robustly care about the welfare of sentient beings. Future agents should appreciate that morality is very important but that much of it (especially where non-humans are concerned) is highly uncertain, and should hold their views with appropriate intellectual humility. To avoid locking in subpar values, transformative AI should internalize that it may be wrong and that gaining greater understanding is vital.
Anthropic developed SDF (synthetic document fine-tuning), a way of shaping AI beliefs after pretraining by training on synthetic documents that describe how AIs behave, without including examples of the behavior itself. SDF can shape AI values at the pretraining stage. While behaviors instilled by SDF can be removed by direct fine-tuning, positive behaviors that are consistent with typical fine-tuning can persist and powerfully shape post-trained models. Redwood Research has also proposed similar research on teaching AIs synthetic facts.
Pretraining corpora contain orders of magnitude less data on the welfare of all sentient beings and digital minds than on human welfare, and models are not fine-tuned to care about these entities. There's a risk that perpetuating pro-human bias will cause future LLMs to treat humans badly. We also evaluate how models react to documents supporting the welfare of digital minds. Encouraging models to embrace uncertainty while caring deeply reduces the chance of value lock-in.
We use off-the-shelf benchmarks from Anthropic and Inspect-AI. Engineers at frontier labs lack the benchmarks they need to show where models are failing, so we work with Sentient Futures to develop improved compassion benchmarks covering all sentient beings. We test for moral open-mindedness and for responses that are coherent, generalized, and moderate.