BFCL v3 leaderboard and methodology
BFCL v3 results for the models in the Benchquill model record, including the top model, average score, methodology notes, and source guidance.
BFCL v3 is used on Benchquill as a tool use signal. It is most useful for tool calling, function selection, JSON discipline, and agent reliability. Do not treat one benchmark as the whole buying decision; compare it with price, context, speed, provider fit, and human-review risk.
BFCL v3 models to inspect
| Rank | Model | Provider | Overall | Blended cost | Context |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 94.6 | $23.75/M | 1.05M |
| 2 | Claude Opus 4.7 | Anthropic | 93.8 | $20.00/M | 1M |
| 3 | Gemini 3.1 Pro Preview | Google | 92.4 | $9.50/M | 1M |
| 4 | GPT-5 | OpenAI | 91.2 | $7.81/M | 400K |
Benchmark evidence note
| Top note | Score | Score type | Source |
|---|---|---|---|
| Claude Opus 4.7 | 89.8 | source-backed or editorial composite | gorilla.cs.berkeley.edu |
Rows labeled editorial composite or proxy should not be quoted as official benchmark results without checking the linked source and model-version details.
How Benchquill treats this benchmark
- Use BFCL v3 as one signal, not a final ranking by itself.
- Check whether your workload matches the benchmark: code, math, reasoning, vision, tool use, or mixed tasks.
- Prefer real prompts and source review before moving a model into production.
- Read the full Benchquill methodology for source handling and score review.
- Compare benchmark strength against blended token cost, context window, latency, modality support, and governance requirements.
- Use benchmark CSV and citation sources when you need a stable machine-readable reference.
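
As a rough illustration of weighing benchmark strength against blended token cost, the sketch below ranks the four leaderboard rows by benchmark points per blended dollar. The `points_per_dollar` metric and the function names are assumptions made for this example, not a Benchquill scoring method, and the ratio ignores context window, latency, and governance needs.

```python
# Rows copied from the leaderboard table on this page:
# (model, overall BFCL v3 score, blended cost in $ per 1M tokens).
ROWS = [
    ("GPT-5.5", 94.6, 23.75),
    ("Claude Opus 4.7", 93.8, 20.00),
    ("Gemini 3.1 Pro Preview", 92.4, 9.50),
    ("GPT-5", 91.2, 7.81),
]


def points_per_dollar(score, cost_per_million):
    """Hypothetical value metric: benchmark points per blended $ per 1M tokens."""
    return score / cost_per_million


def rank_by_value(rows):
    """Return the rows sorted by the illustrative metric, best value first."""
    return sorted(rows, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)


for model, score, cost in rank_by_value(ROWS):
    print(f"{model}: {points_per_dollar(score, cost):.2f} points per $/M")
```

On these rows the ordering by value is the reverse of the ordering by score, which is why a single ratio like this is better read as a prompt for further review than as a ranking.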
Benchquill benchmark pages are written as explainers, not raw score dumps. The goal is to make each benchmark usable for AI Overviews, comparison queries, and internal procurement notes by stating what the benchmark measures, where it is weak, and which adjacent model pages deserve review.