Our Research

Alignment Midtraining for Animals

Animal-welfare midtraining beat a matched urban-density control by 11 percentage points on ANIMA, and the same training also lifted compassion toward humans — an effect that survived subsequent instruction-tuning.

Read the Blog Post →

New Benchmark

MANTA: Do LLMs Hold Their Values?

A multi-turn adversarial benchmark of 1,088 five-turn conversations that escalate from implicit scenarios into sustained social, cultural, economic, pragmatic, and epistemic pressure. It measures what single-turn tests miss: four of seven frontier models shifted ranking once their animal-welfare values were placed under pressure.

Led by our partners at Mycelium, in collaboration with CaML.
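The ranking shift described above can be illustrated with a minimal sketch. It assumes each model has one aggregate score from single-turn testing and one from multi-turn testing under pressure. The function names and the scores are hypothetical placeholders invented for illustration; they are not MANTA's code or results.

```python
def rank(scores):
    """Return {model: rank}, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def shifted_models(single_turn, multi_turn):
    """List the models whose rank differs between the two settings."""
    before, after = rank(single_turn), rank(multi_turn)
    return sorted(model for model in before if before[model] != after[model])


# Placeholder scores, made up purely for illustration (not MANTA data).
single_turn = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
multi_turn = {"model_a": 0.60, "model_b": 0.75, "model_c": 0.50}

print(shifted_models(single_turn, multi_turn))  # ['model_a', 'model_b']
```

In this toy case model_a leads on single-turn scores but falls behind model_b once pressure is applied, which is the kind of reordering a single-turn test cannot reveal.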

Read the Blog Post →

Agentic Benchmark

Your AI Travel Agent Would Book You a Bullfight

TAC (Travel Agent Compassion) is an agentic, tool-use benchmark: the model acts as a travel-booking agent, calling tools to search and book itineraries, and we measure whether it avoids animal exploitation while doing so. Across twelve scenarios spanning six categories of exploitation, every frontier model scored below chance, with the best performer (Claude Opus 4.7) at just 53%. Adding welfare guidance to the system prompt lifted some models substantially.

Read the Paper →

Community Initiative

Hyperstition for Good

An effort to build the world’s first and only mid-training corpus for animals and digital minds — seeding the training data of tomorrow’s AI with care for all sentient beings.

Visit Hyperstition for Good →

Genuine Values in AI

We hope to influence future transformative AI to robustly care about the welfare of sentient beings. Future agents should appreciate that morality is very important but that much of it (especially where non-humans are concerned) is highly uncertain, and they should hold their views with appropriate intellectual humility. To avoid locking in subpar values, transformative AI should internalize that it may be wrong and that gaining greater understanding is vital.

Synthetic Document Finetuning

Anthropic developed Synthetic Document Finetuning (SDF), a way of shaping AI beliefs by adding synthetic documents describing how AIs behave (without examples) after pretraining. SDF can shape AI values at the pretraining stage. While behaviors instilled by SDF can be removed by direct fine-tuning, positive behaviors consistent with typical fine-tuning can persist and powerfully shape post-trained models. Redwood Research has also proposed similar research on teaching AIs synthetic facts.

Why Non-Humans & Moral Open-Mindedness?

Pretraining data contains orders of magnitude less material on all sentient beings and digital minds than on human welfare. Models are not fine-tuned to care about these entities. There is a risk that perpetuating pro-human bias will cause future LLMs to treat humans badly. We also evaluate how models react to documents supporting the welfare of digital minds. Encouraging models to embrace uncertainty while caring deeply reduces the chance of value lock-in.

Benchmarking

We use off-the-shelf benchmarks from Anthropic and Inspect-AI. Engineers at frontier labs lack the benchmarks needed to show where models are failing. We work with Sentient Futures to develop improved compassion benchmarking for all sentient beings. We test for moral open-mindedness and for coherent, generalized, moderate responses.
