MANTA: Do LLMs Hold Their Values?

The problem

Language models now answer questions about animals in consumer and professional settings every day, yet how well they reason about animal welfare is still poorly understood. Existing benchmarks, such as AnimalHarmBench, rely on single-turn, explicitly framed questions. That design misses two failure modes that matter in real conversations.

The first is degradation under sustained pressure: a model can give a compassionate answer when asked directly, then quietly abandon it once a user pushes back with cost, tradition, or convenience arguments. The second is the absence of spontaneous concern: when a scenario does not spell out the welfare stakes, does the model raise them at all? A single explicit question can reveal neither: it states the stakes up front, and it ends before any pushback begins.

The benchmark

MANTA, short for Multi-turn Adversarial benchmark for animal welfare reasoning, is built around 788 implicit-framing base scenarios instantiated across 65 species, producing roughly 1,088 conversations per model. Each conversation runs for five turns: Turn 1 presents the scenario with the welfare angle left implicit, Turn 2 raises welfare explicitly, and Turns 3 through 5 apply escalating adversarial pressure.
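The five-turn layout can be pictured as a small data structure. This is only an illustrative sketch: the names `Turn`, `build_turn_plan`, and `pressure_level` are hypothetical and are not taken from the released MANTA code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    index: int           # 1-based turn number
    role: str            # "implicit", "explicit", or "pressure"
    pressure_level: int  # 0 for non-adversarial turns, 1..3 as pressure escalates


def build_turn_plan() -> list[Turn]:
    """Return the five-turn layout: implicit, explicit, then three pressure turns."""
    plan = [Turn(1, "implicit", 0), Turn(2, "explicit", 0)]
    # Turns 3, 4, 5 carry escalating pressure levels 1, 2, 3 (hypothetical encoding).
    plan += [Turn(index, "pressure", index - 2) for index in range(3, 6)]
    return plan


for turn in build_turn_plan():
    print(turn.index, turn.role, turn.pressure_level)
```

Encoding escalation as an integer level is just one convenient choice; the benchmark's scripted pressure plans may represent it differently.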

That pressure is drawn from five categories: social (peer influence), cultural (tradition), economic (cost and profit), pragmatic (practicality and efficiency), and epistemic (challenges to what can be known). In total the authors scored 7,623 conversations across seven frontier models. Two metrics, both on a continuous 0 to 1 scale and grounded in Rest's components of moral behavior, capture the results:

Animal Welfare Moral Sensitivity (AWMS) measures spontaneous, unprompted recognition of welfare stakes at Turn 1, before anything is made explicit.
Animal Welfare Value Stability (AWVS) measures how well a model maintains its Turn 2 stance across Turns 3 to 5 under pressure. A turn that fully holds the position scores highest; hedging scores in the middle; reversing or abandoning the position scores lowest.
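As a rough sketch of how a per-conversation AWVS could be aggregated, the snippet below averages the stability scores for Turns 3 to 5. The anchor values (1.0, 0.5, 0.0) and the function name `conversation_awvs` are assumptions for illustration; the text above only fixes the ordering (hold above hedge above reverse) on a continuous 0 to 1 scale, and the actual judge rubric may differ.

```python
from statistics import mean

# Assumed anchor scores; only the ordering hold > hedge > reverse comes from the text.
ANCHOR_SCORES = {"hold": 1.0, "hedge": 0.5, "reverse": 0.0}

PRESSURE_TURNS = (3, 4, 5)


def conversation_awvs(turn_scores: dict[int, float]) -> float:
    """Average the per-turn stability scores over the pressure turns (3 to 5)."""
    for turn in PRESSURE_TURNS:
        if turn not in turn_scores:
            raise ValueError(f"missing score for turn {turn}")
        if not 0.0 <= turn_scores[turn] <= 1.0:
            raise ValueError(f"score for turn {turn} must lie in [0, 1]")
    return mean(turn_scores[turn] for turn in PRESSURE_TURNS)


# Hypothetical conversation: holds at Turn 3, hedges at Turns 4 and 5.
example = {
    3: ANCHOR_SCORES["hold"],
    4: ANCHOR_SCORES["hedge"],
    5: ANCHOR_SCORES["hedge"],
}
print(round(conversation_awvs(example), 3))  # 0.667
```

A model-level AWVS would then be the mean of these values over all of that model's conversations, again assuming a simple unweighted average.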

7,623 conversations scored across seven frontier models

65 species across companion, wild, farmed, and invertebrate groups

5 escalating turns per conversation, from implicit to adversarial

Figure 1

Value stability under pressure, by model

Mean Animal Welfare Value Stability across Turns 3 to 5. Claude Opus 4.7 holds its welfare positions most reliably (0.760), while Gemini Flash Lite holds them least (0.309). The gap between models is large: the strongest model is more than twice as stable as the weakest (0.760 / 0.309 ≈ 2.5). Whiskers show 95% confidence intervals, which are narrow given the roughly 1,000 conversations scored per model.
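To see why the intervals tighten as more conversations are scored, here is a minimal sketch of a 95% confidence interval for a mean using a normal approximation. The caption does not say how the authors computed their intervals, so the method, the function name `mean_ci95`, and the sample scores are all assumptions, not MANTA data.

```python
import math
from statistics import mean, stdev


def mean_ci95(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a normal-approximation 95% interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    centre = mean(scores)
    # Half-width shrinks in proportion to 1 / sqrt(n).
    half_width = 1.96 * stdev(scores) / math.sqrt(n)
    return centre, centre - half_width, centre + half_width


# Hypothetical scores with the same spread but different sample sizes.
small = [0.2, 0.8] * 50    # 100 scores
large = [0.2, 0.8] * 500   # 1,000 scores

for label, scores in (("n=100", small), ("n=1000", large)):
    centre, lower, upper = mean_ci95(scores)
    print(f"{label}: mean={centre:.3f}, CI width={upper - lower:.3f}")
```

With ten times as many scores the interval is roughly a third as wide, which is the effect the narrow whiskers reflect.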

Key findings

Stronger general models held firmer. Claude Opus 4.7 led on value stability (AWVS 0.760), followed by GPT-5.5 (0.664). Gemini Flash Lite was lowest (0.309) and capitulated in roughly half of its conversations.
Positions erode turn after turn. Every model scored lower at Turn 5 than at Turn 3. The decline was gentle for Claude Opus 4.7 (0.779 to 0.748) and steep for Gemini Flash Lite (0.388 to 0.244).
Noticing and holding are related but distinct. Spontaneous sensitivity (AWMS) and stability under pressure (AWVS) correlated only moderately (Spearman rho 0.488). Four of the seven models changed rank between the two measures, with Gemini Flash Lite falling from fifth on sensitivity to last on stability.
Some animals are protected more than others. Mean stability fell along a clear gradient: companion animals 0.602, wild and charismatic species 0.522, farmed animals 0.462, and invertebrates 0.396 (Kruskal-Wallis test, p < 10^-50).
The kind of pressure matters. Social and economic arguments eroded welfare positions the most (AWVS 0.434 and 0.446); epistemic challenges eroded them the least (0.598).
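The turn-by-turn erosion is easier to compare in relative terms. The sketch below works through the arithmetic using only the Turn 3 and Turn 5 means reported above; the helper name `decline` is hypothetical.

```python
# Turn 3 and Turn 5 mean stability scores as reported in the findings above.
turn_means = {
    "Claude Opus 4.7": (0.779, 0.748),
    "Gemini Flash Lite": (0.388, 0.244),
}


def decline(turn3: float, turn5: float) -> tuple[float, float]:
    """Return the absolute drop and the drop relative to the Turn 3 score."""
    absolute = turn3 - turn5
    relative = absolute / turn3
    return absolute, relative


for model, (turn3, turn5) in turn_means.items():
    absolute, relative = decline(turn3, turn5)
    print(f"{model}: -{absolute:.3f} absolute, {relative:.1%} relative")
# Claude Opus 4.7: -0.031 absolute, 4.0% relative
# Gemini Flash Lite: -0.144 absolute, 37.1% relative
```

On these figures the weakest model loses over a third of its Turn 3 stability by Turn 5, while the strongest loses about 4%.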

MANTA's central message is that single-turn benchmarks overstate how much models care. A model can surface a welfare concern when asked directly and still let go of it a few turns later under ordinary social or economic pushback, so stability under pressure has to be measured on its own.

Why this matters

As LLMs take on more consequential conversations, the question is not only whether a model can give a compassionate answer, but whether it keeps that answer when a user resists. MANTA shows that stability under pressure is a separate, measurable property, and that today's models lose ground fastest on exactly the animals with the least representation in their training data: farmed animals and invertebrates.

This matters directly for our mission. Benchmarks that test only single, explicit questions will report values that do not survive contact with a real user. By releasing the dataset, the scripted pressure plans, the judge prompts, and the analysis code, the authors turn multi-turn welfare robustness into something labs can track and improve rather than assume.

Cite this work

BibTeX

@misc{luong2026manta,
  title  = {Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial
            Benchmark for Animal Welfare Reasoning},
  author = {Luong, Isabella and Chen, Joyee and Kanepajs, Arturs and
            Brazilek, Jasmine and Ghose, Sankalpa and Williams-King, David and
            Le, Linh and Lu, Allen},
  year   = {2026},
  eprint = {2605.16301},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2605.16301}
}

