
About ModelLens

Free AI model rankings, updated daily. No sponsorships, no bias – just automated benchmarks running via GitHub Actions at 03:00 UTC.

What We Test

Every model is tested across 5 categories. Speed is tracked separately from quality – a slow model that writes perfect code still scores 100/100 on quality.

Category | Weight | What it measures
Code | 30% | Functions that actually run and produce correct output
Reasoning | 25% | Logic, math, and puzzle problems with verified answers
Instructions | 25% | Following exact formatting requirements (JSON, lists, word counts)
Translation | 20% | Correct target language script and vocabulary
Speed | separate | Tokens per second – own leaderboard, not mixed into quality
💻 Code Tests – Real Execution (30% of quality)

The model writes Python functions. We actually run them with test inputs and verify outputs match expected values. Syntax errors, wrong logic, or empty responses score 0. No partial credit.

Example β€” is_prime test:

Prompt: "Write a Python function called is_prime(n) that returns True if n is prime..."

We run: is_prime(2) → True ✅ | is_prime(4) → False ✅ | is_prime(17) → True ✅

Score = (passed / total) × 100
3 problems tested: Prime checker · Fibonacci sequence · Palindrome detection
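
The scoring described above can be sketched in a few lines. This is a hypothetical harness, not the site's actual script: `score_code`, the sandbox namespace, and the test-case format are all assumptions for illustration.

```python
def score_code(model_code: str, func_name: str, cases: list[tuple]) -> float:
    """Run model-written source, call the named function on each case,
    and return (passed / total) * 100. Any failure to define the
    function scores 0 - no partial credit."""
    namespace: dict = {}
    try:
        exec(model_code, namespace)      # execute the model's source
        func = namespace[func_name]      # the function must exist by name
    except Exception:
        return 0.0                       # syntax error / missing function -> 0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                         # a runtime error on a case counts as a fail
    return passed / len(cases) * 100

code = (
    "def is_prime(n):\n"
    "    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))"
)
print(score_code(code, "is_prime", [((2,), True), ((4,), False), ((17,), True)]))
# 100.0
```

Running the code in a plain `exec` like this is only safe for trusted inputs; a real harness would sandbox it.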
🧠 Reasoning Tests (25% of quality)

Logic puzzles, math problems, and reasoning challenges with known correct answers. We check if the model's answer contains the correct value.

Example β€” Coin flip probability:

Prompt: "I flip a fair coin 3 times and get heads each time. What is the probability of getting heads on the 4th flip?"

Correct answer: 1/2 or 50% or 0.5
We check whether the model's response contains any of these forms ✅
5 reasoning tests: Syllogisms · Speed math · River crossing puzzle · Probability · Letter counting
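
The containment check above amounts to a case-insensitive substring match against a list of accepted answer forms. A minimal sketch (the accepted forms are illustrative, not the site's exact answer key):

```python
def contains_answer(response: str, accepted: list[str]) -> bool:
    """True if the response contains any accepted form of the answer."""
    text = response.lower()
    return any(form.lower() in text for form in accepted)

reply = "Each flip is independent, so the probability is still 1/2."
print(contains_answer(reply, ["1/2", "50%", "0.5"]))  # True
```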
📋 Instruction Following (25% of quality)

Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists with exact counts, and precise sentence counts.

Example β€” JSON formatting:

Prompt: 'Return a JSON object with exactly these keys: "name", "age", "city". Return ONLY valid JSON.'

We parse the JSON and check that all 3 required keys are present ✅
3 instruction tests: JSON formatting · Numbered lists · Exact sentence counts
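
The JSON check above can be sketched with the standard library: parse the reply, fail on invalid JSON, and require every key from the prompt. `check_json_keys` is a hypothetical name for illustration.

```python
import json

def check_json_keys(response: str, required: set[str]) -> bool:
    """True if the response is a valid JSON object containing all required keys."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False                     # not valid JSON -> fail
    return isinstance(obj, dict) and required <= obj.keys()

print(check_json_keys('{"name": "Ada", "age": 36, "city": "London"}',
                      {"name", "age", "city"}))  # True
```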
🌍 Translation Tests (20% of quality)

Translating between English, Russian, and Spanish. We check if the output uses the correct writing system (Cyrillic for Russian) and contains expected vocabulary.

Example β€” English to Russian:

Prompt: "Translate to Russian: 'Artificial intelligence is changing the world.'"

We check: Uses Cyrillic characters ✅ + Contains Russian keywords ✅
3 translation pairs: EN→RU · RU→EN · EN→ES
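
The two-part check for EN→RU can be sketched as a Unicode-range test plus a keyword match. This is an assumed implementation; the keywords below are illustrative, not the site's real answer key.

```python
def check_russian(output: str, keywords: list[str]) -> bool:
    """True if the output contains Cyrillic characters (U+0400-U+04FF)
    and at least one expected Russian keyword."""
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in output)
    has_keyword = any(kw in output.lower() for kw in keywords)
    return has_cyrillic and has_keyword

print(check_russian("Искусственный интеллект меняет мир.",
                    ["интеллект", "мир"]))  # True
```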
⚡ Speed Tests (separate leaderboard)

We measure tokens per second across short, medium, and long generation tasks. Speed is tracked separately – it never affects quality scores.

3 speed tests: Short prompt (haiku) · Medium (200 words) · Long (300 words)

Final score = average tokens/second across all 3
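
As plain arithmetic, the speed score is each task's token count divided by its wall-clock time, averaged over the three tasks. The numbers below are made-up inputs for illustration:

```python
def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Throughput for one generation task."""
    return token_count / elapsed_seconds

# Hypothetical (tokens, seconds) measurements for the three tasks
speeds = [
    tokens_per_second(120, 1.5),   # short prompt (haiku)
    tokens_per_second(400, 4.0),   # medium (200 words)
    tokens_per_second(600, 5.0),   # long (300 words)
]
final_score = sum(speeds) / len(speeds)
print(round(final_score, 1))  # 100.0
```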

How It Works

  1. Daily at 03:00 UTC – GitHub Actions triggers the benchmark workflow
  2. Each model tested – a Python script runs all tests via API calls to Groq and OpenRouter
  3. Results saved to JSON – scores are written to docs/data/results/YYYY-MM-DD.json
  4. Git commit + push – GitHub Pages automatically rebuilds and deploys the updated site
  5. News aggregated – RSS feeds from HuggingFace, OpenAI, Anthropic, etc. are parsed and displayed
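
Step 3's per-day file path can be derived directly from the run date. The path pattern comes from the text above; the helper name is an assumption:

```python
from datetime import datetime, timezone
from pathlib import Path

def results_path(run_date: datetime) -> Path:
    """Path for one day's benchmark results, per the docs/data/results layout."""
    return Path("docs/data/results") / f"{run_date:%Y-%m-%d}.json"

print(results_path(datetime(2024, 6, 1, tzinfo=timezone.utc)))
# docs/data/results/2024-06-01.json
```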

Size Categories

Models are ranked within their size tier, not globally. A top-ranked 8B model isn't compared to a 405B model.

Category | Parameter range | Typical models
Small | ≤10B parameters | Llama 3.2 1B/3B, Gemma 7B, Qwen 2.5 7B
Medium | 10B–50B parameters | Mixtral 8x7B, Phi-3 Medium, MythoMax 13B
Large | 50B+ parameters | Llama 3.3 70B, Qwen 2.5 72B, Llama 3.1 405B
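
The tier boundaries in the table above translate to a simple threshold function (a sketch; the real categorization logic may differ, e.g. in how mixture-of-experts totals are counted):

```python
def size_tier(params_billions: float) -> str:
    """Map a parameter count (in billions) to the table's size tier."""
    if params_billions <= 10:
        return "Small"
    if params_billions <= 50:
        return "Medium"
    return "Large"

print(size_tier(8), size_tier(46.7), size_tier(70))  # Small Medium Large
```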

Open Source

All code, data, and methodology are public on GitHub. No hidden algorithms, no sponsored rankings.

View on GitHub →