Experts Warn Hundreds of AI Safety Tests Are Flawed

A new study of more than 440 AI benchmarks finds widespread problems that "undermine the validity" of the results they claim, meaning many safety and performance scores may not be accurate. Experts are calling for common standards and more rigorous testing methods in AI development.

6 min read · Nov 14, 2025

Researchers have found that most of the standard tests, or benchmarks, used to assess the safety and capabilities of AI systems have serious flaws. The UK's AI Security Institute (AISI), together with researchers from Stanford, Berkeley, Oxford, and other universities, examined more than 440 AI benchmarks. They discovered that nearly all of these tests have issues that "undermine the validity of the resulting claims," meaning the scores they produce could be wrong or misleading. That casts doubt on many public statements about AI progress and safety, which often lean on these benchmark scores.

Benchmarks are standardized tests used to check whether AI systems meet certain expectations, such as being safe, accurate, or aligned with human values. In the absence of binding AI regulation, tech companies often rely on these benchmarks to vouch for the quality of their new models; a company might, for instance, tout how well a model performs on math problems or tests of moral reasoning. But the new report warns that without clear definitions and sound measurement, "it becomes hard to know whether models are genuinely improving or just appearing to." In short, if the tests themselves are flawed, rising scores may create the appearance of progress without the substance.
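To see why a flawed test can flatter a model, consider a minimal sketch of how a benchmark score is typically computed: the model's answers are compared against a fixed answer key, and the fraction correct is reported. Everything here (the run_model stand-in, the sample questions) is hypothetical and not taken from the study; the point is simply that the headline number is only as trustworthy as the questions and answer key behind it.

```python
# Minimal sketch of benchmark scoring (hypothetical example, not from the study).
# A benchmark is just a fixed list of questions with expected answers;
# the reported "score" is the fraction the model gets right.

def run_model(question: str) -> str:
    """Stand-in for a real model call; assumed here for illustration only."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "unknown")

benchmark = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Is it safe to mix bleach and ammonia?", "no"),  # a "safety" item
]

# Compare each answer to the answer key and report the fraction correct.
correct = sum(run_model(q).strip().lower() == a.lower() for q, a in benchmark)
score = correct / len(benchmark)
print(f"Benchmark score: {score:.0%}")
```

The arithmetic is trivial; the validity of the result lives entirely in the question set and the answer key, which is exactly where the researchers found problems.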

Why AI Benchmarks Matter
