
About ModelLens

Free AI model rankings, updated daily. No sponsorships, no bias – just automated benchmarks running via GitHub Actions at 03:00 UTC.

What We Test

Every model is tested across 5 categories. Speed is tracked separately from quality – a slow model that writes perfect code still scores 100/100 on quality.

Category | Weight | What it measures
Code | 30% | Functions that actually run and produce correct output
Reasoning | 25% | Logic, math, and puzzle problems with verified answers
Instructions | 25% | Following exact formatting requirements (JSON, lists, word counts)
Translation | 20% | Correct target language script and vocabulary
Speed | separate | Tokens per second – own leaderboard, not mixed into quality
💻 Code Tests – Real Execution (30% of quality)

The model writes Python functions. We actually run them with test inputs and verify outputs match expected values. Syntax errors, wrong logic, or empty responses score 0. No partial credit.

Example β€” is_prime test:

Prompt: "Write a Python function called is_prime(n) that returns True if n is prime..."

We run: is_prime(2) → True ✅ | is_prime(4) → False ✅ | is_prime(17) → True ✅

Score = (passed / total) × 100
3 problems tested: Prime checker · Fibonacci sequence · Palindrome detection
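
The scoring described above can be sketched in a few lines. This is a hypothetical harness, not the site's actual script: `score_code`, the sandbox namespace, and the test-case format are all assumptions for illustration.

```python
def score_code(model_code: str, func_name: str, cases: list[tuple]) -> float:
    """Run model-written source, call the named function on each case,
    and return (passed / total) * 100. Any failure to define the
    function scores 0 - no partial credit."""
    namespace: dict = {}
    try:
        exec(model_code, namespace)      # execute the model's source
        func = namespace[func_name]      # the function must exist by name
    except Exception:
        return 0.0                       # syntax error / missing function -> 0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                         # a runtime error on a case counts as a fail
    return passed / len(cases) * 100

code = (
    "def is_prime(n):\n"
    "    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))"
)
print(score_code(code, "is_prime", [((2,), True), ((4,), False), ((17,), True)]))
# 100.0
```

Running the code in a plain `exec` like this is only safe for trusted inputs; a real harness would sandbox it.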
🧠 Reasoning Tests (25% of quality)

Logic puzzles, math problems, and reasoning challenges with known correct answers. We check if the model's answer contains the correct value.

Example β€” Coin flip probability:

Prompt: "I flip a fair coin 3 times and get heads each time. What is the probability of getting heads on the 4th flip?"

Correct answer: 1/2 or 50% or 0.5
We check whether the model's response contains any of these forms ✅
5 reasoning tests: Syllogisms · Speed math · River crossing puzzle · Probability · Letter counting
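
The containment check above amounts to a case-insensitive substring match against a list of accepted answer forms. A minimal sketch (the accepted forms are illustrative, not the site's exact answer key):

```python
def contains_answer(response: str, accepted: list[str]) -> bool:
    """True if the response contains any accepted form of the answer."""
    text = response.lower()
    return any(form.lower() in text for form in accepted)

reply = "Each flip is independent, so the probability is still 1/2."
print(contains_answer(reply, ["1/2", "50%", "0.5"]))  # True
```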
📋 Instruction Following (25% of quality)

Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists with exact counts, and precise sentence counts.

Example β€” JSON formatting:

Prompt: 'Return a JSON object with exactly these keys: "name", "age", "city". Return ONLY valid JSON.'

We parse the JSON and check that all 3 required keys are present ✅
3 instruction tests: JSON formatting · Numbered lists · Exact sentence counts
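
The JSON check above can be sketched with the standard library: parse the reply, fail on invalid JSON, and require every key from the prompt. `check_json_keys` is a hypothetical name for illustration.

```python
import json

def check_json_keys(response: str, required: set[str]) -> bool:
    """True if the response is a valid JSON object containing all required keys."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False                     # not valid JSON -> fail
    return isinstance(obj, dict) and required <= obj.keys()

print(check_json_keys('{"name": "Ada", "age": 36, "city": "London"}',
                      {"name", "age", "city"}))  # True
```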
🌍 Translation Tests (20% of quality)

Translating between English, Russian, and Spanish. We check if the output uses the correct writing system (Cyrillic for Russian) and contains expected vocabulary.

Example β€” English to Russian:

Prompt: "Translate to Russian: 'Artificial intelligence is changing the world.'"

We check: Uses Cyrillic characters ✅ + Contains Russian keywords ✅
3 translation pairs: EN→RU · RU→EN · EN→ES
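
The two-part check for EN→RU can be sketched as a Unicode-range test plus a keyword match. This is an assumed implementation; the keywords below are illustrative, not the site's real answer key.

```python
def check_russian(output: str, keywords: list[str]) -> bool:
    """True if the output contains Cyrillic characters (U+0400-U+04FF)
    and at least one expected Russian keyword."""
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in output)
    has_keyword = any(kw in output.lower() for kw in keywords)
    return has_cyrillic and has_keyword

print(check_russian("Искусственный интеллект меняет мир.",
                    ["интеллект", "мир"]))  # True
```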
⚡ Speed Tests (separate leaderboard)

We measure tokens per second across short, medium, and long generation tasks. Speed is tracked separately – it never affects quality scores.

3 speed tests: Short prompt (haiku) · Medium (200 words) · Long (300 words)

Final score = average tokens/second across all 3
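
As plain arithmetic, the speed score is each task's token count divided by its wall-clock time, averaged over the three tasks. The numbers below are made-up inputs for illustration:

```python
def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Throughput for one generation task."""
    return token_count / elapsed_seconds

# Hypothetical (tokens, seconds) measurements for the three tasks
speeds = [
    tokens_per_second(120, 1.5),   # short prompt (haiku)
    tokens_per_second(400, 4.0),   # medium (200 words)
    tokens_per_second(600, 5.0),   # long (300 words)
]
final_score = sum(speeds) / len(speeds)
print(round(final_score, 1))  # 100.0
```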

How It Works

  1. Daily at 03:00 UTC – GitHub Actions triggers the benchmark workflow
  2. Each model tested – a Python script runs all tests via API calls to Groq and OpenRouter
  3. Results saved to JSON – scores are written to docs/data/results/YYYY-MM-DD.json
  4. Git commit + push – GitHub Pages automatically rebuilds and deploys the updated site
  5. News aggregated – RSS feeds from HuggingFace, OpenAI, Anthropic, etc. are parsed and displayed
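
Step 3's per-day file path can be derived directly from the run date. The path pattern comes from the text above; the helper name is an assumption:

```python
from datetime import datetime, timezone
from pathlib import Path

def results_path(run_date: datetime) -> Path:
    """Path for one day's benchmark results, per the docs/data/results layout."""
    return Path("docs/data/results") / f"{run_date:%Y-%m-%d}.json"

print(results_path(datetime(2024, 6, 1, tzinfo=timezone.utc)))
# docs/data/results/2024-06-01.json
```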

Size Categories

Models are ranked within their size tier, not globally. A top-ranked 8B model isn't compared to a 405B model.

Category | Parameter range | Typical models
Small | ≤10B parameters | Llama 3.2 1B/3B, Gemma 7B, Qwen 2.5 7B
Medium | 10B–50B parameters | Mixtral 8x7B, Phi-3 Medium, MythoMax 13B
Large | 50B+ parameters | Llama 3.3 70B, Qwen 2.5 72B, Llama 3.1 405B
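
The tier boundaries in the table above translate to a simple threshold function (a sketch; the real categorization logic may differ, e.g. in how mixture-of-experts totals are counted):

```python
def size_tier(params_billions: float) -> str:
    """Map a parameter count (in billions) to the table's size tier."""
    if params_billions <= 10:
        return "Small"
    if params_billions <= 50:
        return "Medium"
    return "Large"

print(size_tier(8), size_tier(46.7), size_tier(70))  # Small Medium Large
```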

Open Source

All code, data, and methodology are public on GitHub. No hidden algorithms, no sponsored rankings.

View on GitHub →