AI benchmarks library: 9 evaluations covered
Browse the benchmarks Benchquill uses to score frontier AI models, including SWE-Bench Verified, MATH-500, MMMU, GPQA Diamond, and LiveBench.
What each benchmark helps measure
| Benchmark | Main signal | Top score | Source type | Model page to inspect first |
|---|---|---|---|---|
| SWE-Bench Verified | coding | Claude Opus 4.7 (87.6%) | source-backed or provider-reported | Claude Opus 4.7 |
| HumanEval+ | coding | Claude Sonnet 4.6 (95.8%) | source-backed or provider-reported | Claude Opus 4.7 |
| GPQA Diamond | reasoning | GPT-5.5 (93.6%) | provider-reported | GPT-5.5 |
| MMLU | knowledge | GPT-5.5 (92.4%) | provider-reported or editorial composite | GPT-5.5 |
| MATH-500 | math | GPT-5.5 (95.8%) | provider-reported or editorial composite | GPT-5.5 |
| AIME 2025 | math | GPT-5.5 (89.4%) | provider-reported or editorial composite | GPT-5.5 |
| MMMU | vision | Gemini 3.1 Pro Preview (94.6%) | source-backed or editorial composite | Gemini 3.1 Pro Preview |
| LiveBench | freshness | GPT-5.5 (88.4%) | public leaderboard or editorial composite | GPT-5.5 |
| BFCL v3 | tool use | Claude Opus 4.7 (89.8%) | source-backed or editorial composite | GPT-5.5 |
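To make the columns concrete, here is a minimal sketch of how one row of the table above could be modeled in TypeScript. Benchquill does not publish a schema, so every name below is an illustrative assumption, not its actual API.

```typescript
// Hypothetical data model for one row of the benchmarks table.
// All names are illustrative assumptions, not Benchquill's real schema.
type SourceType =
  | "source-backed"
  | "provider-reported"
  | "public leaderboard"
  | "proxy"
  | "editorial composite";

type Signal =
  | "coding" | "reasoning" | "knowledge"
  | "math" | "vision" | "freshness" | "tool use";

interface BenchmarkRow {
  benchmark: string;          // e.g. "SWE-Bench Verified"
  signal: Signal;             // the main capability the benchmark measures
  topModel: string;           // best-scoring model on this benchmark
  topScorePct: number;        // top score, as a percentage
  sourceTypes: SourceType[];  // evidence labels that can apply to the score
  modelPageToInspect: string; // model page worth reviewing first
}

// The first row of the table, expressed in this hypothetical shape.
const sweBenchVerified: BenchmarkRow = {
  benchmark: "SWE-Bench Verified",
  signal: "coding",
  topModel: "Claude Opus 4.7",
  topScorePct: 87.6,
  sourceTypes: ["source-backed", "provider-reported"],
  modelPageToInspect: "Claude Opus 4.7",
};
```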
Benchquill labels benchmark evidence as source-backed, provider-reported, public leaderboard, proxy, or editorial composite, so readers do not mistake a buying score for an official benchmark claim. Each benchmark page explains where the signal is useful, where it is weak, and which model pages deserve source review before citation.
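As a sketch of how those labels might gate citation, assuming the hypothetical `BenchmarkRow` and `SourceType` from the block above, a filter could keep only benchmarks whose evidence includes a primary source or a public leaderboard:

```typescript
// Hypothetical helper, reusing BenchmarkRow and SourceType from the
// sketch above. Keeps only rows whose evidence labels include a
// primary source or a public leaderboard before a score is cited.
function citableRows(rows: BenchmarkRow[]): BenchmarkRow[] {
  const citable: SourceType[] = ["source-backed", "public leaderboard"];
  return rows.filter((row) =>
    row.sourceTypes.some((label) => citable.includes(label))
  );
}

// Usage: citableRows([sweBenchVerified]) keeps the SWE-Bench Verified
// row, since "source-backed" appears among its labels.
```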