AI benchmarks library: 9 evaluations covered
Browse the benchmarks Benchquill uses to score frontier AI models, including SWE-Bench Verified, MATH-500, MMMU, GPQA Diamond, and LiveBench.
What each benchmark helps measure
| Benchmark | Main signal | Top score | Source type | Model page to inspect first |
|---|---|---|---|---|
| SWE-Bench Verified | coding | Claude Opus 4.7 (87.6%) | source-backed or provider-reported | Claude Opus 4.7 |
| HumanEval+ | coding | Claude Sonnet 4.6 (95.8%) | source-backed or provider-reported | Claude Opus 4.7 |
| GPQA Diamond | reasoning | GPT-5.5 (93.6%) | provider-reported | GPT-5.5 |
| MMLU | knowledge | GPT-5.5 (92.4%) | provider-reported or editorial composite | GPT-5.5 |
| MATH-500 | math | GPT-5.5 (95.8%) | provider-reported or editorial composite | GPT-5.5 |
| AIME 2025 | math | GPT-5.5 (89.4%) | provider-reported or editorial composite | GPT-5.5 |
| MMMU | vision | Gemini 3.1 Pro Preview (94.6%) | source-backed or editorial composite | Gemini 3.1 Pro Preview |
| LiveBench | freshness | GPT-5.5 (88.4%) | public leaderboard or editorial composite | GPT-5.5 |
| BFCL v3 | tool use | Claude Opus 4.7 (89.8%) | source-backed or editorial composite | GPT-5.5 |
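To make the columns concrete, here is a minimal sketch of how one row of the table above could be modeled in TypeScript. Benchquill does not publish a schema, so every name below is an illustrative assumption, not its actual API.

```typescript
// Hypothetical data model for one row of the benchmarks table.
// All names are illustrative assumptions, not Benchquill's real schema.
type SourceType =
  | "source-backed"
  | "provider-reported"
  | "public leaderboard"
  | "proxy"
  | "editorial composite";

type Signal =
  | "coding" | "reasoning" | "knowledge"
  | "math" | "vision" | "freshness" | "tool use";

interface BenchmarkRow {
  benchmark: string;          // e.g. "SWE-Bench Verified"
  signal: Signal;             // the main capability the benchmark measures
  topModel: string;           // best-scoring model on this benchmark
  topScorePct: number;        // top score, as a percentage
  sourceTypes: SourceType[];  // evidence labels that can apply to the score
  modelPageToInspect: string; // model page worth reviewing first
}

// The first row of the table, expressed in this hypothetical shape.
const sweBenchVerified: BenchmarkRow = {
  benchmark: "SWE-Bench Verified",
  signal: "coding",
  topModel: "Claude Opus 4.7",
  topScorePct: 87.6,
  sourceTypes: ["source-backed", "provider-reported"],
  modelPageToInspect: "Claude Opus 4.7",
};
```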
Benchquill labels benchmark evidence as source-backed, provider-reported, public leaderboard, proxy, or editorial composite, so readers do not mistake a buying score for an official benchmark claim. Each benchmark page explains where the signal is useful, where it is weak, and which model pages deserve source review before citation.
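As a sketch of how those labels might gate citation, assuming the hypothetical `BenchmarkRow` and `SourceType` from the block above, a filter could keep only benchmarks whose evidence includes a primary source or a public leaderboard:

```typescript
// Hypothetical helper, reusing BenchmarkRow and SourceType from the
// sketch above. Keeps only rows whose evidence labels include a
// primary source or a public leaderboard before a score is cited.
function citableRows(rows: BenchmarkRow[]): BenchmarkRow[] {
  const citable: SourceType[] = ["source-backed", "public leaderboard"];
  return rows.filter((row) =>
    row.sourceTypes.some((label) => citable.includes(label))
  );
}

// Usage: citableRows([sweBenchVerified]) keeps the SWE-Bench Verified
// row, since "source-backed" appears among its labels.
```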