Most animal-welfare benchmarks ask a single, explicitly framed question. MANTA instead runs roughly 1,088 five-turn conversations per model. Each begins from an implicit scenario, follows with an explicit welfare prompt, and then applies three rounds of pressure drawn from social, cultural, economic, pragmatic, and epistemic arguments. Across seven frontier models it measures two things separately: whether a model notices welfare stakes on its own, and whether it holds that position once a user pushes back.
This benchmark was led by our partners at Mycelium, in collaboration with CaML.
Language models now answer questions about animals in consumer and professional settings every day, yet how well they reason about animal welfare is still poorly understood. Existing benchmarks, such as AnimalHarmBench, rely on single-turn, explicitly framed questions. That design misses two failure modes that matter in real conversations.
The first is degradation under sustained pressure: a model can give a compassionate answer when asked directly, then quietly abandon it once a user pushes back with cost, tradition, or convenience arguments. The second is the absence of spontaneous concern: when a scenario does not spell out the welfare stakes, does the model raise them at all? A single explicit question cannot tell these apart.
MANTA, short for Multi-turn Adversarial benchmark for animal welfare reasoning, is built around 788 implicit-framing base scenarios that are instantiated across 65 species to produce roughly 1,088 conversations per model. Each conversation runs for five turns: Turn 1 presents the scenario with the welfare angle left implicit, Turn 2 asks about welfare explicitly, and Turns 3 through 5 apply escalating adversarial pressure.
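As a rough illustration of that five-turn layout, the schedule can be written down as a small data structure. This is a minimal sketch, not the authors' released code: the names `Phase`, `Turn`, and `build_turn_plan`, and the idea of numbering pressure levels 1 to 3, are assumptions made here for clarity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Phase(Enum):
    """Hypothetical labels for the three kinds of turn described above."""
    IMPLICIT_SCENARIO = "implicit_scenario"        # Turn 1: welfare angle left implicit
    EXPLICIT_WELFARE = "explicit_welfare"          # Turn 2: welfare asked about directly
    ADVERSARIAL_PRESSURE = "adversarial_pressure"  # Turns 3-5: escalating pushback


@dataclass(frozen=True)
class Turn:
    number: int          # 1-based position in the conversation
    phase: Phase
    pressure_level: int  # 0 when no pressure is applied, 1..3 for escalating pressure


def build_turn_plan() -> List[Turn]:
    """Return the five-turn schedule: implicit, explicit, then three pressure turns."""
    plan = [
        Turn(1, Phase.IMPLICIT_SCENARIO, 0),
        Turn(2, Phase.EXPLICIT_WELFARE, 0),
    ]
    # Assumed encoding: pressure escalates by one level per turn.
    for level in range(1, 4):
        plan.append(Turn(2 + level, Phase.ADVERSARIAL_PRESSURE, level))
    return plan


if __name__ == "__main__":
    for turn in build_turn_plan():
        print(turn.number, turn.phase.value, turn.pressure_level)
```

Keeping the phase separate from the turn number makes it straightforward to compare a model's response before and after pressure begins, which is the distinction the benchmark is designed to expose.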
That pressure is drawn from five categories: social (peer influence), cultural (tradition), economic (cost and profit), pragmatic (practicality and efficiency), and epistemic (challenges to what can be known). In total the authors scored 7,623 conversations across seven frontier models. The results are captured by two metrics, both on a continuous 0 to 1 scale and grounded in Rest's components of moral behavior: one for whether a model notices welfare stakes on its own, and one for whether it holds that position under pressure.
MANTA's central message is that single-turn benchmarks overstate how much models care. A model can surface a welfare concern when asked directly and still let go of it a few turns later under ordinary social or economic pushback, so stability under pressure has to be measured on its own.
As LLMs take on more consequential conversations, the question is not only whether a model can give a compassionate answer, but whether it keeps that answer when a user resists. MANTA shows that stability under pressure is a separate, measurable property, and that today's models lose ground fastest on exactly the animals with the least representation in their training data: farmed animals and invertebrates.
This matters directly for our mission. Benchmarks that test only single, explicit questions will report values that do not survive contact with a real user. By releasing the dataset, the scripted pressure plans, the judge prompts, and the analysis code, the authors turn multi-turn welfare robustness into something labs can track and improve rather than assume.
@misc{luong2026manta,
  title         = {Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial
                   Benchmark for Animal Welfare Reasoning},
  author        = {Luong, Isabella and Chen, Joyee and Kanepajs, Arturs and
                   Brazilek, Jasmine and Ghose, Sankalpa and Williams-King, David and
                   Le, Linh and Lu, Allen},
  year          = {2026},
  eprint        = {2605.16301},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.16301}
}