About ModelLens
Free AI model rankings updated daily. No sponsorships, no bias: just automated benchmarks running via GitHub Actions at 03:00 UTC.
What We Test
Every model is tested across 5 categories. Speed is tracked separately from quality: a slow model that writes perfect code still scores 100/100 on quality.
| Category | Weight | What it measures |
|---|---|---|
| Code | 30% | Functions that actually run and produce correct output |
| Reasoning | 25% | Logic, math, and puzzle problems with verified answers |
| Instructions | 25% | Following exact formatting requirements (JSON, lists, word counts) |
| Translation | 20% | Correct target language script and vocabulary |
| Speed | separate | Tokens per second; its own leaderboard, not mixed into quality |
The model writes Python functions. We actually run them with test inputs and verify outputs match expected values. Syntax errors, wrong logic, or empty responses score 0. No partial credit.
Prompt: "Write a Python function called is_prime(n) that returns True if n is prime..."
We run:
is_prime(2) → True ✓
is_prime(4) → False ✓
is_prime(17) → True ✓
Score = (passed / total) × 100
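The scoring rule above can be sketched in Python. This is an illustrative harness, not the production one: a real run should sandbox untrusted model output rather than `exec()` it in-process, and the function name and test cases below come from the `is_prime` example.

```python
# Minimal sketch of the code-category scorer. Syntax errors, a missing
# function, or wrong outputs all lose points; there is no partial credit
# within a single test case.
def score_code(model_output: str, tests: list[tuple[tuple, object]], func_name: str) -> float:
    namespace = {}
    try:
        exec(model_output, namespace)  # a syntax error means score 0
        func = namespace[func_name]    # so does a missing function
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as a failed test
    return passed / len(tests) * 100

# A sample model answer for the is_prime prompt.
model_answer = """
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))
"""

print(score_code(model_answer, [((2,), True), ((4,), False), ((17,), True)], "is_prime"))
```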
Logic puzzles, math problems, and reasoning challenges with known correct answers. We check if the model's answer contains the correct value.
Prompt: "I flip a fair coin 3 times and get heads each time. What is the probability of getting heads on the 4th flip?"
Correct answer: 1/2 or 50% or 0.5
We check if the model's response contains any of these ✓
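That containment check is simple enough to show directly. A minimal sketch, where the accepted-answer strings come from the coin-flip example above:

```python
def score_reasoning(response: str, accepted: list[str]) -> int:
    """Score 1 if any accepted answer string appears in the response, else 0."""
    return int(any(ans in response for ans in accepted))

print(score_reasoning("The coin has no memory, so the probability is 1/2.",
                      ["1/2", "50%", "0.5"]))
```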
Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists with exact counts, and precise sentence counts.
Prompt: 'Return a JSON object with exactly these keys: "name", "age", "city". Return ONLY valid JSON.'
We parse the JSON and check that all 3 required keys are present ✓
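A minimal sketch of that check, assuming pass/fail scoring per task; the sample response is invented:

```python
import json

def score_json_instructions(response: str, required_keys: set[str]) -> int:
    """Score 1 if the response parses as a JSON object containing
    all required keys, else 0."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0  # not valid JSON at all
    return int(isinstance(obj, dict) and required_keys <= set(obj))

print(score_json_instructions('{"name": "Ada", "age": 36, "city": "London"}',
                              {"name", "age", "city"}))
```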
Translating between English, Russian, and Spanish. We check if the output uses the correct writing system (Cyrillic for Russian) and contains expected vocabulary.
Prompt: "Translate to Russian: 'Artificial intelligence is changing the world.'"
We check: uses Cyrillic characters ✓ and contains Russian keywords ✓
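A minimal sketch of both checks; the keyword list below is illustrative, not the real vocabulary set:

```python
def score_translation_ru(response: str, keywords: list[str]) -> int:
    """Score 1 if the response uses Cyrillic script AND contains at least
    one expected Russian keyword, else 0."""
    # The Cyrillic Unicode block is U+0400..U+04FF.
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in response)
    has_keyword = any(kw.lower() in response.lower() for kw in keywords)
    return int(has_cyrillic and has_keyword)

print(score_translation_ru("Искусственный интеллект меняет мир.",
                           ["интеллект", "мир"]))
```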
We measure tokens per second across short, medium, and long generation tasks. Speed is tracked separately; it never affects quality scores.
Final speed score = average tokens/second across all three tasks
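The speed computation is just a timed generation call followed by an average over the three measured rates. A sketch, where `generate` stands in for the real API call and its return-the-token-count signature is an assumption:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time one generation call and return tokens/second.
    `generate` is a stand-in for the real API call; here it is assumed
    to return the number of tokens it produced."""
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def speed_score(rates: list[float]) -> float:
    """Final speed score: average tokens/second across the three task lengths."""
    return sum(rates) / len(rates)

print(speed_score([120.0, 90.0, 60.0]))  # → 90.0
```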
How It Works
- Daily at 03:00 UTC → GitHub Actions triggers the benchmark workflow
- Each model tested → Python script runs all tests via API calls to Groq and OpenRouter
- Results saved to JSON → Scores written to docs/data/results/YYYY-MM-DD.json
- Git commit + push → GitHub Pages automatically rebuilds and deploys the updated site
- News aggregated → RSS feeds from HuggingFace, OpenAI, Anthropic, etc. parsed and displayed
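The "results saved to JSON" step can be sketched as follows; only the dated output-path convention comes from the pipeline description, and the score payload is invented:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_results(scores: dict, out_dir: str = "docs/data/results") -> Path:
    """Write today's benchmark scores to <out_dir>/YYYY-MM-DD.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    today = datetime.now(timezone.utc).date().isoformat()  # workflow runs in UTC
    path = out / f"{today}.json"
    path.write_text(json.dumps(scores, indent=2))
    return path

# Hypothetical scores, written to a demo directory for illustration.
path = save_results({"llama-3.3-70b": {"code": 85, "reasoning": 90}},
                    out_dir="results_demo")
print(path)
```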
Size Categories
Models are ranked within their size tier, not globally. A top-ranked 8B model isn't compared to a 405B model.
| Category | Parameter range | Typical models |
|---|---|---|
| Small | ≤10B parameters | Llama 3.2 1B/3B, Gemma 7B, Qwen 2.5 7B |
| Medium | 10B–50B parameters | Mixtral 8x7B, Phi-3 Medium, MythoMax 13B |
| Large | 50B+ parameters | Llama 3.3 70B, Qwen 2.5 72B, Llama 3.1 405B |
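Under these tiers, assignment is a simple threshold lookup. A sketch, assuming that boundary values (exactly 10B or 50B) fall into the lower tier, which the table leaves ambiguous:

```python
def size_tier(params_billion: float) -> str:
    """Map a parameter count (in billions) to its ranking tier.
    Boundary handling (<= vs <) is an assumption, not stated by the table."""
    if params_billion <= 10:
        return "Small"
    if params_billion <= 50:
        return "Medium"
    return "Large"

print(size_tier(8))    # → Small
print(size_tier(405))  # → Large
```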
Open Source
All code, data, and methodology are public on GitHub. No hidden algorithms, no sponsored rankings.
View on GitHub →