Experts Warn Hundreds of AI Safety Tests Are Flawed

A new study of more than 440 AI benchmarks finds widespread problems that "undermine the validity" of the results they claim, meaning many safety and performance scores may not be accurate. Experts are calling for common standards and more rigorous testing methods in AI development.

6 min read · Nov 14, 2025

Researchers have found that most of the standard tests, or benchmarks, used to assess the safety and capabilities of AI systems have serious flaws. The UK's AI Security Institute (AISI), together with researchers from Stanford, Berkeley, Oxford, and other universities, examined more than 440 AI benchmarks. They discovered that nearly all of these tests have issues that "undermine the validity of the resulting claims," meaning the scores they produce could be wrong or misleading. That casts doubt on many public statements about AI progress and safety, which often lean on these benchmark scores.

Benchmarks are standardized tests used to check whether AI systems meet certain expectations, such as being safe, accurate, or aligned with human values. In the absence of binding AI regulation, tech companies often rely on these benchmarks to vouch for the quality of their new models; a company might, for instance, tout how well a model performs on math problems or tests of moral reasoning. But the new report warns that without clear definitions and sound measurement, "it becomes hard to know whether models are genuinely improving or just appearing to." In short, if the tests themselves are flawed, rising scores may create the appearance of progress without the substance.
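To see why a flawed test can flatter a model, consider a minimal sketch of how a benchmark score is typically computed: the model's answers are compared against a fixed answer key, and the fraction correct is reported. Everything here (the run_model stand-in, the sample questions) is hypothetical and not taken from the study; the point is simply that the headline number is only as trustworthy as the questions and answer key behind it.

```python
# Minimal sketch of benchmark scoring (hypothetical example, not from the study).
# A benchmark is just a fixed list of questions with expected answers;
# the reported "score" is the fraction the model gets right.

def run_model(question: str) -> str:
    """Stand-in for a real model call; assumed here for illustration only."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "unknown")

benchmark = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Is it safe to mix bleach and ammonia?", "no"),  # a "safety" item
]

# Compare each answer to the answer key and report the fraction correct.
correct = sum(run_model(q).strip().lower() == a.lower() for q, a in benchmark)
score = correct / len(benchmark)
print(f"Benchmark score: {score:.0%}")
```

The arithmetic is trivial; the validity of the result lives entirely in the question set and the answer key, which is exactly where the researchers found problems.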

Why AI Benchmarks Matter
