How Lexentia Proof Works
Daily automated benchmarks of free AI models: no human grading, no sponsorships. Every score comes from code that actually runs, logic puzzles with known answers, and strict formatting checks.
The model writes a Python function. We execute it with test inputs and compare the outputs to expected values. A syntax error or wrong logic scores 0; no partial credit for broken code.
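The all-or-nothing rule can be sketched as below. This is an illustrative scorer, not the actual `benchmark.py` logic; the entry-point name `solve` and the exact scoring shape are assumptions.

```python
def score_code(func_source: str, cases: list[tuple]) -> float:
    """Sketch of all-or-nothing code scoring: any syntax error,
    runtime error, or wrong output means the whole test scores 0."""
    namespace: dict = {}
    try:
        exec(func_source, namespace)   # a syntax error lands here
        func = namespace["solve"]      # assumed entry-point name
    except Exception:
        return 0.0
    try:
        ok = all(func(*args) == expected for args, expected in cases)
    except Exception:
        return 0.0                     # runtime error = broken code
    return 1.0 if ok else 0.0

# A correct palindrome checker earns full credit:
good = "def solve(s):\n    return s == s[::-1]"
print(score_code(good, [(("level",), True), (("abc",), False)]))  # 1.0
```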
Logic puzzles, probability traps, and multi-step deduction, each with a single correct answer. We check whether the model's response contains the right answer string.
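A deterministic answer-string check can be as simple as the sketch below; the exact normalisation the benchmark applies is an assumption here.

```python
def contains_answer(response: str, expected: str) -> bool:
    """Hypothetical check: does the response contain the known
    answer string? Case-insensitive substring match."""
    return expected.strip().lower() in response.lower()

print(contains_answer("After two steps the answer is 12.", "12"))  # True
print(contains_answer("Probably thirteen?", "12"))                 # False
```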
Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists of exact length, and precise sentence counts. We parse the output programmatically.
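Programmatic format checks of this kind might look like the following sketch; the helper names and the exact key sets are illustrative, not lifted from the benchmark.

```python
import json
import re

def check_json_keys(output: str, required: set) -> bool:
    """Parse the output as JSON and require exactly the asked-for keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == required

def check_numbered_list(output: str, n: int) -> bool:
    """Count lines shaped like '1. ...' and require exactly n of them."""
    return len(re.findall(r"^\s*\d+\.", output, re.MULTILINE)) == n

print(check_json_keys('{"name": "x", "age": 1, "city": "y"}',
                      {"name", "age", "city"}))                 # True
print(check_numbered_list("1. a\n2. b\n3. c\n4. d\n5. e", 5))   # True
```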
English → Russian → Spanish. We check for the correct writing system (Cyrillic ratio for Russian) and for expected vocabulary keywords. No LLM judge: purely deterministic checks.
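A Cyrillic-ratio check is straightforward to implement; the 0.5 acceptance threshold below is an assumption, not the benchmark's actual cutoff.

```python
def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters in the Cyrillic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04FF" for c in letters) / len(letters)

# An assumed threshold of 0.5 separates "wrote Russian" from
# "answered in Latin script":
print(cyrillic_ratio("Привет, мир!") > 0.5)   # True
print(cyrillic_ratio("Hello world") > 0.5)    # False
```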
```python
quality = (
    avg_code * 0.35 +          # 35%: real execution
    reasoning_score * 0.30 +   # 30%: logic & math
    avg_instruction * 0.20 +   # 20%: format compliance
    avg_translation * 0.15     # 15%: language accuracy
)
```
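Plugging sample component scores into the formula shows how the weights combine (the numbers below are made up for illustration):

```python
# Hypothetical component scores, each on a 0-100 scale:
avg_code, reasoning_score, avg_instruction, avg_translation = 80, 90, 100, 60

quality = (
    avg_code * 0.35 +          # 28.0
    reasoning_score * 0.30 +   # 27.0
    avg_instruction * 0.20 +   # 20.0
    avg_translation * 0.15     #  9.0
)
print(round(quality, 2))  # 84.0
```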
| Tier | Size | Code tests | Reasoning tests | Instruction tests |
|---|---|---|---|---|
| Small | ≤10B | fibonacci, palindrome | syllogism, speed math, counting | JSON (3 keys), list ×5 |
| Medium | 11–50B | + is_prime, bubble_sort, count_vowels | + river crossing, coin flip, word problem, deduction | + sentence count ×3 |
| Large | 50B+ | + binary_search, flatten, longest_common_prefix | + knights & knaves, Bayes base rate, scheduling | + JSON nested (name/scores/active) |
Rankings are normalized within each tier: a Small #1 is the best among small models, not globally. Models with unknown size fall back to the Medium tier.
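Per-tier ranking with a Medium fallback can be sketched as below; the toy data and field names are illustrative, not the benchmark's actual schema.

```python
from collections import defaultdict

# Toy leaderboard; tier=None stands for a model of unknown size.
models = [
    {"name": "tiny-a", "tier": "Small", "quality": 71},
    {"name": "tiny-b", "tier": "Small", "quality": 88},
    {"name": "big-c", "tier": "Large", "quality": 95},
    {"name": "mystery-d", "tier": None, "quality": 60},
]

by_tier = defaultdict(list)
for m in models:
    by_tier[m["tier"] or "Medium"].append(m)   # unknown size -> Medium

for group in by_tier.values():
    group.sort(key=lambda m: -m["quality"])
    for rank, m in enumerate(group, start=1):
        m["rank"] = rank                       # #1 means best *in its tier*

print([(m["name"], m["rank"]) for m in models])
# [('tiny-a', 2), ('tiny-b', 1), ('big-c', 1), ('mystery-d', 1)]
```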
Fully automated, no human involvement. The workflow runs benchmark.py against all configured models.
Small models run 2 code + 3 reasoning tests. Large models run 4 code + 6 reasoning tests. Python functions are executed in a sandbox via subprocess. All results are deterministic.
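The subprocess approach can be sketched as follows. This mirrors the idea described above but is only a sketch: real isolation would also need resource limits and network restrictions.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Run untrusted code in a separate Python interpreter process,
    killed after `timeout` seconds, and capture its stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout

print(run_sandboxed("print(2 + 2)").strip())  # 4
```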
Results saved to docs/data/results/YYYY-MM-DD.json (full proof),
leaderboard.json (ranked list), and summary.json (lightweight integration endpoint).
The updated JSON files are committed and pushed. GitHub Pages picks up the changes within minutes.
Three speed prompts (short / medium / long) measure raw tokens/second. A model can score 100/100 on quality and rank last on speed, or vice versa.
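A tokens-per-second measurement reduces to timing one generation call. In the sketch below, `generate` is a stand-in for whatever client calls the model, and the whitespace split is a crude token count rather than a real tokenizer.

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Wall-clock throughput for a single generation call."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(text.split()) / elapsed

# Usage with a fake "model" that sleeps briefly and echoes words:
def fake(prompt):
    time.sleep(0.01)          # pretend the model is generating
    return "word " * 100

print(tokens_per_second(fake, "Write a short story.") > 0)  # True
```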
```javascript
const res = await fetch('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json');
const { models } = await res.json();
const top5 = models.slice(0, 5); // already sorted by quality desc
top5.forEach(m => {
  console.log(`#${m.rank} ${m.name} · quality: ${m.quality}, speed: ${m.speed} tok/s`);
});
```
```python
import requests

data = requests.get('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json').json()
for m in data['models'][:5]:
    print(f"#{m['rank']} {m['name']} · {m['quality']}/100 quality, {m['speed']} tok/s")
```
All benchmark code, scoring logic, and raw result data are public.
No hidden weights, no paid placements, no LLM-as-judge black boxes.
Every score can be reproduced by running benchmark.py yourself.