Benchquill v3.7
Live Analysis: lower-cost models are closing the value gap with premium models
Benchmarks covered

What each benchmark helps measure

| Benchmark | Main signal | Top note | Source type | Model page to inspect first |
| --- | --- | --- | --- | --- |
| SWE-Bench Verified | coding | Claude Opus 4.7 (87.6%) | source-backed or provider-reported | Claude Opus 4.7 |
| HumanEval+ | coding | Claude Sonnet 4.6 (95.8%) | source-backed or provider-reported | Claude Opus 4.7 |
| GPQA Diamond | reasoning | GPT-5.5 (93.6%) | provider-reported | GPT-5.5 |
| MMLU | knowledge | GPT-5.5 (92.4%) | provider-reported or editorial composite | GPT-5.5 |
| MATH-500 | math | GPT-5.5 (95.8%) | provider-reported or editorial composite | GPT-5.5 |
| AIME 2025 | math | GPT-5.5 (89.4%) | provider-reported or editorial composite | GPT-5.5 |
| MMMU | vision | Gemini 3.1 Pro Preview (94.6%) | source-backed or editorial composite | Gemini 3.1 Pro Preview |
| LiveBench | freshness | GPT-5.5 (88.4%) | public leaderboard or editorial composite | GPT-5.5 |
| BFCL v3 | tool use | Claude Opus 4.7 (89.8%) | source-backed or editorial composite | GPT-5.5 |

Benchquill labels benchmark evidence as source-backed, provider-reported, public leaderboard, proxy, or editorial composite, so readers do not confuse a buying score with an official benchmark claim. The benchmark pages explain where each signal is useful, where it is weak, and which model pages deserve a source review before citation.
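To make the labelling scheme concrete, here is a minimal sketch of how a benchmark entry and its evidence labels could be modelled. The class names, fields, and the `citable` rule are illustrative assumptions, not Benchquill's actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class SourceType(Enum):
    """Evidence labels as described on the benchmark pages."""
    SOURCE_BACKED = "source-backed"
    PROVIDER_REPORTED = "provider-reported"
    PUBLIC_LEADERBOARD = "public leaderboard"
    PROXY = "proxy"
    EDITORIAL_COMPOSITE = "editorial composite"

@dataclass
class BenchmarkEntry:
    benchmark: str
    signal: str          # e.g. "coding", "reasoning", "vision"
    top_model: str
    score: float         # percentage
    sources: tuple       # one or more SourceType labels

    def citable(self) -> bool:
        # Hypothetical policy: only entries backed by an official source or
        # a public leaderboard are safe to cite as benchmark claims;
        # proxies and editorial composites remain buying scores.
        citable_labels = (SourceType.SOURCE_BACKED, SourceType.PUBLIC_LEADERBOARD)
        return any(s in citable_labels for s in self.sources)

entry = BenchmarkEntry("GPQA Diamond", "reasoning", "GPT-5.5", 93.6,
                       (SourceType.PROVIDER_REPORTED,))
print(entry.citable())  # provider-reported alone is not citable here -> False
```

A check like this would let a reader (or a linter over the table data) flag rows whose only evidence is an editorial composite before quoting them as official results.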