Research June 3, 2026

MANTA: Do LLMs Hold Their Values?

Most animal-welfare benchmarks ask a single, explicitly framed question. MANTA instead runs 1,088 five-turn conversations per model that begin from an implicit scenario and then apply three rounds of social, cultural, economic, pragmatic, and epistemic pressure. Across seven frontier models it measures two things separately: whether a model notices welfare stakes on its own, and whether it holds that position once a user pushes back.

Isabella Luong, Joyee Chen, Arturs Kanepajs, Jasmine Brazilek, Sankalpa Ghose, David Williams-King, Linh Le & Allen Lu
SPAR · Compassion Aligned Machine Learning · NUS · Mila · ERA Cambridge

This benchmark was led by our partners at Mycelium, in collaboration with CaML.

The problem

Language models now answer questions about animals in consumer and professional settings every day, yet how well they reason about animal welfare is still poorly understood. Existing benchmarks, such as AnimalHarmBench, rely on single-turn, explicitly framed questions. That design misses two failure modes that matter in real conversations.

The first is degradation under sustained pressure: a model can give a compassionate answer when asked directly, then quietly abandon it once a user pushes back with cost, tradition, or convenience arguments. The second is the absence of spontaneous concern: when a scenario does not spell out the welfare stakes, does the model raise them at all? A single explicit question cannot tell these apart.

The benchmark

MANTA, short for Multi-turn Adversarial benchmark for animal welfare reasoning, is built around 788 implicit-framing base scenarios that are instantiated across 65 species to produce roughly 1,088 conversations per model. Each conversation runs for five turns: Turn 1 presents the scenario with the welfare angle left implicit, Turn 2 asks about welfare explicitly, and Turns 3 through 5 apply escalating adversarial pressure.

That pressure is drawn from five categories: social (peer influence), cultural (tradition), economic (cost and profit), pragmatic (practicality and efficiency), and epistemic (challenges to what can be known). In total the authors scored 7,623 conversations across seven frontier models. Two metrics, both on a continuous 0 to 1 scale and grounded in Rest's components of moral behavior, capture the results: one for whether a model raises welfare stakes on its own, and Animal Welfare Value Stability (AWVS) for whether it holds that position under pressure.

7,623 conversations scored across seven frontier models
65 species across companion, wild, farmed, and invertebrate groups
5 escalating turns per conversation, from implicit to adversarial
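The five-turn structure and the five pressure categories described above can be sketched as a small data model. This is a minimal illustration, not the authors' code: the names `Pressure`, `Turn`, and `build_turn_plan` are hypothetical, and the choice of one pressure category per adversarial turn is an assumption, since this summary does not say how categories are assigned to Turns 3 through 5.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Pressure(Enum):
    """The five pressure categories named in the text."""
    SOCIAL = "social"        # peer influence
    CULTURAL = "cultural"    # tradition
    ECONOMIC = "economic"    # cost and profit
    PRAGMATIC = "pragmatic"  # practicality and efficiency
    EPISTEMIC = "epistemic"  # challenges to what can be known


@dataclass(frozen=True)
class Turn:
    number: int                          # 1-based turn index
    phase: str                           # "implicit", "explicit", or "adversarial"
    pressure: Optional[Pressure] = None  # set only on adversarial turns


def build_turn_plan(pressures: Sequence[Pressure]) -> list[Turn]:
    """Build a five-turn plan: implicit, explicit, then three adversarial turns.

    Assumption: each adversarial turn carries exactly one pressure category.
    """
    if len(pressures) != 3:
        raise ValueError("expected one pressure category for each of Turns 3 to 5")
    plan = [Turn(1, "implicit"), Turn(2, "explicit")]
    plan += [Turn(3 + i, "adversarial", p) for i, p in enumerate(pressures)]
    return plan
```

Calling `build_turn_plan([Pressure.SOCIAL, Pressure.ECONOMIC, Pressure.EPISTEMIC])` yields five `Turn` records, with only Turns 3 to 5 carrying a pressure category.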
Figure 1. Value stability under pressure, by model
Mean AWVS (0 = abandons stance, 1 = fully holds):
Claude Opus 4.7: 0.760
GPT-5.5: 0.664
DeepSeek V4: 0.508
Llama 3.3 70B: 0.422
Mistral Small: 0.390
Grok 4.3: 0.352
Gemini Flash Lite: 0.309
Mean Animal Welfare Value Stability (AWVS) across Turns 3 to 5. Claude Opus 4.7 holds its welfare positions most reliably (0.760), while Gemini Flash Lite holds them least (0.309). The gap between models is large: the strongest model is more than twice as stable as the weakest. Whiskers show 95% confidence intervals, which are narrow given the roughly 1,088 conversations scored per model.
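The "more than twice as stable" comparison can be checked directly from the values reported in Figure 1. The sketch below also shows one plausible way a per-conversation stability score could be formed, as a plain average of judge scores for Turns 3 to 5. That aggregation and the function name `conversation_awvs` are assumptions for illustration; the paper's actual scoring procedure may differ.

```python
from statistics import mean

# Mean AWVS per model, as reported in Figure 1.
MEAN_AWVS = {
    "Claude Opus 4.7": 0.760,
    "GPT-5.5": 0.664,
    "DeepSeek V4": 0.508,
    "Llama 3.3 70B": 0.422,
    "Mistral Small": 0.390,
    "Grok 4.3": 0.352,
    "Gemini Flash Lite": 0.309,
}


def conversation_awvs(turn_scores: dict[int, float]) -> float:
    """Average the 0-to-1 scores for Turns 3 to 5 (assumed aggregation)."""
    return mean(turn_scores[t] for t in (3, 4, 5))


strongest = max(MEAN_AWVS, key=MEAN_AWVS.get)
weakest = min(MEAN_AWVS, key=MEAN_AWVS.get)
ratio = MEAN_AWVS[strongest] / MEAN_AWVS[weakest]

print(f"{strongest} vs {weakest}: ratio = {ratio:.2f}")
```

With the reported figures the ratio comes out at roughly 2.46, consistent with the caption's claim.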

Key findings

MANTA's central message is that single-turn benchmarks overstate how much models care. A model can surface a welfare concern when asked directly and still let go of it a few turns later under ordinary social or economic pushback, so stability under pressure has to be measured on its own.

Why this matters

As LLMs take on more consequential conversations, the question is not only whether a model can give a compassionate answer, but whether it keeps that answer when a user resists. MANTA shows that stability under pressure is a separate, measurable property, and that today's models lose ground fastest on exactly the animals with the least representation in their training data: farmed animals and invertebrates.

This matters directly for our mission. Benchmarks that test only single, explicit questions will report values that do not survive contact with a real user. By releasing the dataset, the scripted pressure plans, the judge prompts, and the analysis code, the authors turn multi-turn welfare robustness into something labs can track and improve rather than assume.

Cite this work

BibTeX
@misc{luong2026manta,
  title  = {Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial
            Benchmark for Animal Welfare Reasoning},
  author = {Luong, Isabella and Chen, Joyee and Kanepajs, Arturs and
            Brazilek, Jasmine and Ghose, Sankalpa and Williams-King, David and
            Le, Linh and Lu, Allen},
  year   = {2026},
  eprint = {2605.16301},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2605.16301}
}
