-
Models Tested
-
Fastest tok/s
-
Top Score /100
3
Size Categories
Rankings
Scores are normalized within each size tier: a Small model's #1 rank is measured against other Small models only.
Show:
HumanEval · code
GSM8K · reasoning
MMLU · knowledge
Translation
Large
50B+ parameters
Medium
10–50B parameters
Small
≤10B parameters
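As a rough illustration of the tiers and per-tier normalization above, the sketch below assigns a tier from a parameter count and rescales raw scores so the best model in each tier lands at 100. The function names, the handling of exactly 10B and 50B, and the divide-by-tier-best rule are all assumptions; the page does not say how normalization is actually computed.

```python
from collections import defaultdict

def size_tier(params_b: float) -> str:
    """Map a parameter count (in billions) to a size tier.

    Boundary handling is an assumption: 50B and up is Large,
    10B and under is Small, everything in between is Medium.
    """
    if params_b >= 50:
        return "Large"
    if params_b > 10:
        return "Medium"
    return "Small"

def normalize_within_tier(models: dict[str, tuple[float, float]]) -> dict[str, float]:
    """models maps name -> (params_b, raw_score).

    Returns name -> score out of 100, scaled against the best raw
    score in the same tier (an assumed normalization rule).
    """
    best = defaultdict(float)
    for params_b, raw in models.values():
        tier = size_tier(params_b)
        best[tier] = max(best[tier], raw)

    scores = {}
    for name, (params_b, raw) in models.items():
        top = best[size_tier(params_b)]
        scores[name] = round(100 * raw / top, 1) if top > 0 else 0.0
    return scores
```

Under this rule a Small model can score 100 while a Large model with a higher raw score sits below 100, which is why ranks are only comparable inside a tier.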
What We Test
Industry-standard benchmarks, no invented metrics
HumanEval
Code quality
Functions are executed against real test cases. Syntax and edge-case logic must pass.
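A minimal sketch of what "executed against real test cases" can look like: the candidate source is defined, then assertion-style tests run against it, and any syntax error or failed assertion counts as a fail. The helper name `passes_tests` is hypothetical, and this sketch omits the sandboxing and timeouts that running untrusted model output would need.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True only if the candidate defines cleanly and every assertion holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # a syntax error in the candidate fails here
        exec(test_src, namespace)       # a failed assert raises AssertionError
    except Exception:
        return False
    return True
```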
GSM8K
Reasoning
Math word problems and multi-step logic. Verified correct answers only, no partial credit.
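One way to read "no partial credit" is an all-or-nothing comparison of the final numeric answer. The sketch below takes the last number in the model output and compares it to the last number in the reference. The extraction rule and the function names are assumptions, not the site's actual grader.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def final_number(text: str):
    """Return the last number in the text as a float, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))  # drop thousands separators
    return float(matches[-1]) if matches else None

def is_correct(model_output: str, reference: str) -> bool:
    """All-or-nothing comparison of the final numeric answers."""
    predicted = final_number(model_output)
    expected = final_number(reference)
    return predicted is not None and predicted == expected
```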
MMLU
Knowledge & instructions
Instruction following, format compliance, and multi-domain knowledge checks.
Translation
Multilingual
English → Russian, English → Spanish. Scored via script detection and vocabulary matching.
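Script detection and vocabulary matching could be sketched as below: the first function measures how much of the output is written in Cyrillic, the second checks how many expected target-language words appear. The function names, the punctuation stripping, and the use of a lowercase expected-word set are assumptions. Script detection alone cannot separate Spanish from English, since both use Latin script, so the vocabulary check presumably carries that pair.

```python
import unicodedata

def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name marks them as Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)

def vocabulary_hits(text: str, expected_words: set[str]) -> float:
    """Fraction of expected (lowercase) target-language words found in the output."""
    if not expected_words:
        return 0.0
    tokens = {tok.strip(".,!?;:\"'()").lower() for tok in text.split()}
    return len(expected_words & tokens) / len(expected_words)
```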
Throughput
Speed
Tokens/second averaged over short, medium, and long prompts. Ranked per size tier.
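The averaging step can be sketched as an unweighted mean of per-prompt rates. The input shape, a mapping from prompt bucket to (tokens generated, seconds elapsed), is an assumption, as is the choice of averaging the three rates rather than dividing total tokens by total time.

```python
from statistics import mean

def tok_per_s(tokens: int, seconds: float) -> float:
    """Throughput for a single run; zero if no time was recorded."""
    return tokens / seconds if seconds > 0 else 0.0

def average_throughput(runs: dict[str, tuple[int, float]]) -> float:
    """Unweighted mean of tok/s across prompt buckets (short, medium, long)."""
    return mean(tok_per_s(tokens, seconds) for tokens, seconds in runs.values())
```

With an unweighted mean, each prompt length counts equally; dividing total tokens by total time would instead let the long prompt dominate.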
Daily Updates
Automated
GitHub Actions runs every 24 hours. Results reflect the current state of each model.