Klyxe

–

Models Tested

–

Fastest tok/s

–

Top Score /100

Size Categories

Rankings

Scores are normalized within each size tier — a Small model's #1 rank is against other Small models only

HumanEval · code GSM8K · reasoning MMLU · knowledge Translation

Large 50B+ parameters

Medium 10–50B parameters

Small ≤10B parameters

Industry-standard benchmarks — no invented metrics

HumanEval

Code quality

Functions are executed against real test cases. Syntax and edge-case logic must pass.

GSM8K

Reasoning

Math word problems and multi-step logic. Verified correct answers only, no partial credit.

MMLU

Knowledge & instructions

Instruction following, format compliance, and multi-domain knowledge checks.

Translation

Multilingual

English ↔ Russian, English ↔ Spanish. Scored via script detection and vocabulary matching.

Throughput

Speed

Tokens/second averaged over short, medium, and long prompts. Ranked per size tier.

Daily Updates

Automated

GitHub Actions runs every 24 hours. Results reflect the current state of each model.