Current evaluation methods focus on performance rather than failure, creating challenges for insurers seeking to price AI risk, the broker said

Gallagher Re has warned that current artificial intelligence (AI) model evaluation methods are not fit for underwriting purposes, arguing that better approaches will be needed to support insurer confidence in pricing AI risk.

The reinsurance broker set out the warning in its report, “Anthropic’s Fourth Way: Why Restricted AI Models Are a Challenge for Insurers”.

Gallagher Re said most AI models are currently assessed through benchmarks, which score capability against fixed tasks.

While these tests may be useful for comparing models in controlled conditions, the broker said they do not fully capture how models behave when exposed to ambiguous or unpredictable inputs in real-world deployment.

Ed Pocock, global head of cyber security at Gallagher Re, said of benchmarks: “They indicate what a model can do under controlled [conditions], but insurers are concerned with how models fail, how often they fail, and whether those failures could be correlated across a portfolio.”

The report said a model can score highly on a benchmark while still hallucinating, making inconsistent decisions or failing in ways that are difficult to detect.

Gallagher Re warned that current evaluation methods do not assess concentration risk, including whether failures in widely used foundation models could be correlated across multiple insureds.
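To illustrate why correlated failure matters at portfolio level, the sketch below applies the standard formula for the variance of a sum of equally correlated yes/no events. It is an illustration only and is not taken from the Gallagher Re report: the function name, the portfolio size, the failure probability and the correlation values are all assumptions.

```python
def failure_count_variance(n_insureds: int, p_fail: float, rho: float) -> float:
    """Variance of the number of insureds failing in the same period.

    Assumes every insured fails with the same probability p_fail and that
    every pair of insureds shares the same correlation rho (hypothetical
    simplifications chosen for illustration).
    """
    if n_insureds < 1:
        raise ValueError("n_insureds must be at least 1")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    # Var(sum) = n * p * (1 - p) * (1 + (n - 1) * rho)
    single_variance = p_fail * (1.0 - p_fail)
    return n_insureds * single_variance * (1.0 + (n_insureds - 1) * rho)


# Hypothetical portfolio: 100 insureds, each with a 1% chance of an AI failure.
independent = failure_count_variance(100, 0.01, rho=0.0)
shared_model = failure_count_variance(100, 0.01, rho=0.1)
print(independent, shared_model)
```

With these assumed figures, a pairwise correlation of 0.1 makes the variance of the failure count roughly eleven times higher than in the independent case, even though the expected number of failures is unchanged.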

The reinsurance broker said models are increasingly shaped by the benchmarks used to evaluate them, a process known as benchmark contamination.

This can inflate published scores and reduce their value as a guide to real-world reliability.

Pocock said: “This risks erasing useful differentiation between systems and increasing concentration risk.”

The report also examined restricted-distribution AI models, using Anthropic’s Mythos model as an example.

Mythos, released under Anthropic’s Project Glasswing programme, was made available only to a vetted group of partners rather than the broader market.

Gallagher Re said this represents a fourth category of frontier AI model, restricted distribution, which sits apart from open-source, open-weight and proprietary models.

The broker said the distinction matters because the most capable models could be kept beyond the reach of independent evaluators needed by insurers and the wider market.

The UK AI Security Institute has analysed Mythos and published its findings.

However, Gallagher Re argued that restricted models need to be accessible to independent, third-party evaluators if insurers are to price risk rather than uncertainty.

Pocock said: “If a model cannot be independently evaluated, it cannot be meaningfully priced.

“Insurers could end up loading for uncertainty rather than reflecting actual risk.

“That raises costs for everyone and slows the market’s development.”

Gallagher Re called for evaluation methods that test AI systems as they operate in practice, including with real-world inputs, under adversarial conditions and over time as models are updated.

The report said evaluation should measure hallucination rates, decision consistency, how models fail and the potential for correlated failure across deployments.
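As a rough sketch of how two of these measures might be computed from evaluation logs, the example below derives a hallucination rate and a decision-consistency score. The record layout (`prompt_id`, `answer`, `hallucinated`), the function names and the definition of consistency as agreement with the most common answer per prompt are assumptions made for illustration, not methods described in the report.

```python
from collections import Counter, defaultdict


def hallucination_rate(records):
    """Share of logged responses flagged as hallucinated (hypothetical flag)."""
    if not records:
        raise ValueError("records must not be empty")
    flagged = sum(1 for r in records if r["hallucinated"])
    return flagged / len(records)


def decision_consistency(records):
    """Average, over prompts, of the share of runs matching the modal answer.

    A score of 1.0 means the model gave the same answer every time each
    prompt was repeated; lower scores mean less consistent decisions.
    """
    if not records:
        raise ValueError("records must not be empty")
    answers_by_prompt = defaultdict(list)
    for r in records:
        answers_by_prompt[r["prompt_id"]].append(r["answer"])
    scores = []
    for answers in answers_by_prompt.values():
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)


# Hypothetical log: the same two prompts sent to a model several times.
sample_log = [
    {"prompt_id": "claim-1", "answer": "approve", "hallucinated": False},
    {"prompt_id": "claim-1", "answer": "approve", "hallucinated": False},
    {"prompt_id": "claim-1", "answer": "decline", "hallucinated": True},
    {"prompt_id": "claim-2", "answer": "decline", "hallucinated": False},
    {"prompt_id": "claim-2", "answer": "decline", "hallucinated": False},
]
print(hallucination_rate(sample_log), decision_consistency(sample_log))
```

Tracking the same two numbers across model versions would be one simple way to reflect the report's point about evaluating over time as models are updated.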

Gallagher Re said newer approaches from firms including Epoch AI and Artificial Analysis are moving towards evaluations that are harder to contaminate and more informative about failure.

The broker said the re/insurance market could influence which AI models are deployed and how transparently they are evaluated through underwriting requirements, pricing signals and coverage design.

Pocock said: “Better evaluation gives the market the tools to reward transparency and robustness.

“Without it, we risk defaulting to scale and brand as proxies for safety, which could amplify the concentration risks we’ll need to manage,” he added.