AI Performance Pitfall: When Measuring AI Success Leads to Irrelevance
In the world of artificial intelligence (AI), benchmarks have become a key measure of success. However, a growing body of evidence suggests that this reliance on quantitative metrics may be leading to unintended consequences.
Recent incidents have highlighted the risks of optimising AI systems for benchmarks. For instance, Harvey AI, a legal AI assistant, made headlines when it was reported to have cited entirely fictional cases in federal court filings. The system excelled at legal test-taking but could not reliably perform legal reasoning, underscoring the danger of relying solely on benchmark scores.
Similarly, Copilot, a code-generation tool, performed strongly on coding benchmarks yet was found to suggest insecure code at scale, because the benchmark problems it was optimised against measured functional correctness and said nothing about security. A suggestion can be benchmark-correct and still open a security hole.
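To see how code can be benchmark-correct yet unsafe, consider a hypothetical sketch (the table schema and function names are invented for illustration): both functions below satisfy a functional test that only checks the returned rows, but the first is open to SQL injection.

```python
import sqlite3

# Hypothetical illustration: both functions "pass" a functional test that only
# checks the returned rows, but only one is safe against SQL injection.

def find_user_insecure(conn, username):
    # Benchmark-correct: returns the right rows for ordinary inputs,
    # but string formatting lets a crafted username rewrite the query.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Same observable behaviour on ordinary inputs, but the parameter is
    # bound by the driver, so it can never alter the query structure.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

# A typical functional test: both implementations pass.
assert find_user_insecure(conn, "alice") == find_user_safe(conn, "alice")

# A malicious input: the insecure version leaks every row.
payload = "' OR '1'='1"
print("insecure:", find_user_insecure(conn, payload))  # returns all users
print("safe:    ", find_user_safe(conn, payload))      # returns nothing
```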
The Massive Multitask Language Understanding (MMLU) benchmark, which tests AI across 57 subjects, has also come under scrutiny. Models are tuned specifically to ace MMLU, and high scores do not translate into genuine understanding: GPT-4, for example, scored 86.4% on MMLU yet still failed basic reasoning tasks that the benchmark does not cover.
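One reason inflated scores arise is test-set contamination, where benchmark questions leak into training data. The sketch below shows the flavour of an n-gram overlap check used to flag such leakage; the tiny corpus, function names, and 8-gram threshold are illustrative assumptions, not any particular lab's pipeline.

```python
# Minimal sketch of an n-gram contamination check: flag benchmark questions
# whose word 8-grams also appear in the training corpus. The corpus here is a
# stand-in; real checks run over billions of documents with hashed n-grams.

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(training_docs, n=8):
    index = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(question, index, n=8):
    # Any shared 8-gram is treated as evidence the question was seen in training.
    return bool(ngrams(question, n) & index)

training_docs = [
    "the mitochondria is the organelle responsible for producing most of the cell's ATP in eukaryotic cells",
]
index = build_index(training_docs)

leaked = "which organelle is responsible for producing most of the cell's ATP in eukaryotes"
fresh = "what is the boiling point of water at standard atmospheric pressure"
print(is_contaminated(leaked, index))  # True: shares an 8-gram with training text
print(is_contaminated(fresh, index))   # False: no long overlap
```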
This phenomenon is captured by Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Charles Goodhart first made the observation in 1975, in the context of monetary policy. A related principle, Campbell's Law, holds that the more any quantitative social indicator is used for social decision-making, the more subject it becomes to corruption pressures.
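A toy simulation, offered purely as an illustration rather than a result from any study, makes the mechanism concrete: when the deployed system is whichever candidate scores highest on a noisy, gameable proxy, the winning score increasingly overstates the true quality behind it as selection pressure grows.

```python
import random

# Toy Goodhart simulation (illustrative, not from any cited study).
# Each candidate has a true quality and a proxy score = quality + noise from
# benchmark-specific gaming. We always "deploy" the proxy-maximising candidate,
# then compare its reported score with the true quality we actually get.

random.seed(0)

def winner_stats(pool_size, gaming_strength, trials=500):
    proxy_sum = quality_sum = 0.0
    for _ in range(trials):
        pool = [(q + random.gauss(0, gaming_strength), q)
                for q in (random.gauss(0, 1) for _ in range(pool_size))]
        proxy, quality = max(pool)     # pick the benchmark winner
        proxy_sum += proxy
        quality_sum += quality
    return proxy_sum / trials, quality_sum / trials

# The harder we select on the proxy, the more the reported score
# overstates the quality we actually obtain.
for pool_size in (2, 10, 100, 1000):
    proxy, quality = winner_stats(pool_size, gaming_strength=2.0)
    print(f"candidates={pool_size:>5}  benchmark score={proxy:5.2f}  true quality={quality:5.2f}")
```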
The industry's focus on benchmarks is further reinforced by academic incentives: papers are published for benchmark improvements, and careers are advanced through metric achievements. Valuations, too, correlate with benchmark scores, creating perverse incentives for companies to chase benchmarks rather than build genuine intelligence.
This reliance on benchmarks has produced an endless cycle of creating and retiring metrics: GLUE, SuperGLUE, and MMLU have each been gamed, and new benchmarks appear monthly. The optimisation can be so specific that AI systems reproduce exact benchmark solutions even when they are inappropriate, such as inserting a sorting algorithm where a hash table is needed.
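As a hypothetical illustration of that failure mode, both functions below answer the same question, whether a list contains a duplicate, but the first mirrors a memorised sort-first pattern while the second uses the hash-based approach the task actually calls for.

```python
# Hypothetical illustration: both functions are correct, but the first copies
# a memorised sort-based pattern (O(n log n) time and needless reordering),
# while the second uses a hash set (O(n) expected time, one pass, and it can
# stop at the first duplicate found).

def has_duplicate_memorised(items):
    ordered = sorted(items)                # the benchmark-style reflex: sort first
    return any(a == b for a, b in zip(ordered, ordered[1:]))

def has_duplicate_hashed(items):
    seen = set()
    for item in items:
        if item in seen:                   # O(1) expected membership test
            return True
        seen.add(item)
    return False

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(has_duplicate_memorised(data), has_duplicate_hashed(data))  # True True
```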
The future might move away from fixed quantitative metrics entirely, towards subjective evaluation, human judgment at scale, adversarial evaluation, and qualitative assessment that is harder to game. This shift could help ensure that AI systems are not merely proficient at benchmarks but actually perform well on real-world tasks.
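Human judgment at scale is often aggregated through pairwise comparisons, in the spirit of arena-style leaderboards. Below is a minimal sketch of Elo-style rating updates from head-to-head preference votes; the K-factor of 32, the model names, and the vote log are illustrative assumptions.

```python
# Minimal sketch of aggregating pairwise human judgments into Elo-style
# ratings. The K-factor and the vote log below are illustrative assumptions.

def expected(rating_a, rating_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings, winner, loser, k=32):
    exp_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Each vote records which model's answer a human preferred on one prompt.
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_a", "model_b")]

for winner, loser in votes:
    update(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```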
Remember: an AI benchmark score is not a measure of intelligence but a measure of how heavily the system was optimised for that specific benchmark, and often the better the score, the less it means. It is crucial to ask what real-world performance lies behind the number.
In conclusion, while benchmarks have played a significant role in driving the development of AI, the industry must be cautious not to overemphasise their importance. The focus should shift towards building intelligent systems that can perform well in real-world tasks, rather than just excelling at benchmarks.