EU study warns over the shortcomings of AI benchmarking


EU researchers are warning about problems with how AI capabilities are measured and urging regulators to ensure that the scores AI companies attach to their models mean what they claim.

A new paper released last week by the Commission’s Joint Research Centre concludes that AI benchmarks are promising too much. The authors found that the proprietary tools used to compare AI models are easily gameable and measure the wrong things.

AI companies use benchmarks to put numbers on how well their models perform at certain tasks. OpenAI, for example, tested its newly released GPT-5 on how reliably it abstains from answering questions that cannot be answered, with the new model purportedly achieving a higher score than an older one.

But the EU researchers are urging regulators to focus on carefully examining how these tools work.

Benchmarking AI is a problem for the EU because its rules for artificial intelligence rely on evaluating model capabilities in many different contexts. For example, large models can be counted as presenting special risk under the EU’s AI law if a benchmark assesses that they have “high impact capabilities”.

The law allows the Commission to specify what exactly that should mean through a delegated act – which the EU’s executive has, so far, not done.

Meanwhile, on Friday the US government launched a suite of evaluation tools that its own government agencies can use to test AI tools. The country’s AI Action Plan also sets out a clear ambition to push US leadership in this area.

Which AI benchmarks to trust?

The EU researchers state that policymakers should ensure benchmarks target real-world capabilities rather than narrow tasks; are well-documented and transparent; clearly define what they’re measuring and how; and include different cultural contexts.

Another problem, per the paper, is that existing benchmarks often focus on the English language.

“We especially identify a need for new ways of signalling what benchmarks to trust,” they also write.

Done well, the EU researchers suggest, benchmarking could give policymakers an opportunity for a new kind of “Brussels effect”.

Risto Uuk, head of EU policy and research at AI-focused thinktank the Future of Life Institute, told Euractiv he shared the paper’s concerns, suggesting the EU should require third-party evaluators and fund the development of the AI evaluation ecosystem.

“Improvements are necessary, but evaluating capabilities and other aspects of risks and benefits is crucial, and simply relying on vibes and anecdotes is not enough,” he added.

(nl)


