3 Size Categories
Rankings
Scores are normalized within each size tier: a Small model's #1 rank means it leads the other Small models only, not Medium or Large ones.
Large
50B+ parameters
Medium
10–50B parameters
Small
≤10B parameters
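The tiering and per-tier normalization can be sketched as follows. This is an illustration under stated assumptions, not the site's actual pipeline: the function names, the treatment of the exact 10B and 50B boundaries, and the "scale to the tier's best score" scheme are all hypothetical choices made for the example.

```python
from collections import defaultdict


def size_tier(params_billions: float) -> str:
    """Map a parameter count (in billions) to a size tier.

    Assumption: exactly 50B counts as Large ("50B+") and exactly 10B
    counts as Small ("≤10B"); the page does not spell out the boundaries.
    """
    if params_billions >= 50:
        return "Large"
    if params_billions > 10:
        return "Medium"
    return "Small"


def normalize_within_tiers(models):
    """Scale raw scores to 0-100 relative to the best model in the same tier.

    `models` is a list of (name, params_billions, raw_score) tuples.
    The scaling scheme is an assumed one; other normalizations are possible.
    """
    by_tier = defaultdict(list)
    for name, params, raw in models:
        by_tier[size_tier(params)].append((name, raw))

    result = {}
    for tier, entries in by_tier.items():
        best = max(raw for _, raw in entries)
        for name, raw in entries:
            scaled = 0.0 if best == 0 else 100.0 * raw / best
            result[name] = (tier, scaled)
    return result


# Hypothetical models, made up for the example.
models = [
    ("model-a", 70, 80.0),
    ("model-b", 65, 60.0),
    ("model-c", 7, 40.0),
    ("model-d", 3, 20.0),
]
scores = normalize_within_tiers(models)
```

Here model-c receives 100 even though its raw score is half of model-a's, because each model is compared only with the others in its own tier.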
What We Test
Industry-standard benchmarks — no invented metrics
HumanEval
Code quality
Generated functions are executed against real test cases. The code must parse and handle edge cases correctly to pass.
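A minimal sketch of execution-based checking, assuming the candidate and its tests arrive as Python source strings. The helper name `passes_tests` is hypothetical, and a real harness would run untrusted code in a sandbox with a timeout rather than calling `exec` in-process.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True only if the candidate defines cleanly and every assertion holds.

    WARNING: exec() on model output is unsafe outside a sandbox; this is
    only an illustration of the pass/fail logic.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # a SyntaxError surfaces here
        exec(test_src, namespace)       # a failed check raises AssertionError
    except Exception:
        return False
    return True


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"    # wrong logic
broken = "def add(a, b)\n    return a + b\n"    # missing colon

results = [passes_tests(src, tests) for src in (good, buggy, broken)]
# results == [True, False, False]
```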
GSM8K
Reasoning
Math word problems and multi-step logic. Only a verified-correct final answer scores; there is no partial credit.
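All-or-nothing grading can be sketched like this. It assumes the final answer is the last number in the model's output, which is a common convention but not one the page confirms; `final_number` and `score_answer` are hypothetical names. GSM8K reference answers end with a line of the form `#### 18`, which the same extraction handles.

```python
import re
from decimal import Decimal, InvalidOperation

# Matches integers and decimals, allowing thousands separators like 1,250.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str):
    """Return the last number found in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    try:
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None


def score_answer(model_output: str, reference: str) -> int:
    """1 for an exact numeric match on the final answer, otherwise 0."""
    predicted = final_number(model_output)
    expected = final_number(reference)
    return 1 if predicted is not None and predicted == expected else 0


correct = score_answer("She pays 3 * 6 = 18 dollars. The answer is 18.", "#### 18")
wrong = score_answer("The answer is about 17.", "#### 18")
missing = score_answer("I am not sure.", "#### 18")
# correct == 1, wrong == 0, missing == 0
```

Comparing `Decimal` values rather than strings means `18` and `18.0` count as the same answer, while a correct intermediate step followed by a wrong final number still scores 0.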
MMLU
Knowledge & instructions
Instruction following, format compliance, and multi-domain knowledge checks.
Translation
Multilingual
English ↔ Russian, English ↔ Spanish. Scored via script detection and vocabulary matching.
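One way script detection and vocabulary matching could be combined is sketched below. The 0.5 script threshold, the expected-word lists, and all function names are assumptions made for illustration; the page does not describe the actual scoring formula.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that belong to the Cyrillic script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def vocabulary_overlap(text: str, expected_words) -> float:
    """Fraction of expected words that appear in the text (case-insensitive)."""
    if not expected_words:
        return 0.0
    words = {w.strip(".,!?;:\"'()").lower() for w in text.split()}
    hits = sum(1 for w in expected_words if w.lower() in words)
    return hits / len(expected_words)


def translation_score(text: str, expected_words, want_cyrillic: bool) -> float:
    """0 if the output is in the wrong script, else 100 * vocabulary overlap."""
    ratio = cyrillic_ratio(text)
    script_ok = ratio >= 0.5 if want_cyrillic else ratio < 0.5
    if not script_ok:
        return 0.0
    return 100.0 * vocabulary_overlap(text, expected_words)


# English -> Russian: output must be mostly Cyrillic and contain key words.
ru_ok = translation_score("Кошка сидит на столе.", ["кошка", "столе"], True)
# Model answered in English instead of Russian: fails the script check.
ru_bad = translation_score("The cat sits on the table.", ["кошка", "столе"], True)
# English -> Spanish: only one of the two expected words is present.
es_half = translation_score("El gato está en la mesa.", ["gato", "perro"], False)
# ru_ok == 100.0, ru_bad == 0.0, es_half == 50.0
```

Script detection separates Russian from Latin-script output but cannot tell Spanish from English, so in this sketch the vocabulary check carries that distinction.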
Throughput
Speed
Tokens/second averaged over short, medium, and long prompts. Ranked per size tier.
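The averaging step can be illustrated with recorded measurements. This sketch assumes an unweighted mean of the per-prompt rates; the timings are made-up sample values, and `average_throughput` is a hypothetical helper.

```python
from statistics import mean


def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Generation rate for a single prompt."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def average_throughput(runs) -> float:
    """Unweighted mean of tok/s across prompt lengths.

    `runs` maps a prompt-length label to (tokens_generated, seconds).
    """
    rates = [tokens_per_second(tokens, secs) for tokens, secs in runs.values()]
    return mean(rates)


# Made-up sample measurements: (tokens generated, seconds elapsed).
runs = {
    "short": (50, 1.0),     # 50 tok/s
    "medium": (200, 5.0),   # 40 tok/s
    "long": (900, 30.0),    # 30 tok/s
}
throughput = average_throughput(runs)
# throughput == 40.0
```

An unweighted mean gives each prompt length equal say. Dividing total tokens by total time instead would give about 31.9 tok/s for the same data, because the long prompt dominates.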
Daily Updates
Automated
GitHub Actions runs every 24 hours. Results reflect the current state of each model.