
How Lexentia Proof Works

Daily automated benchmarks of free AI models — no human grading, no sponsorships. Every score comes from code that actually runs, logic puzzles with known answers, and strict formatting checks.

๐Ÿ“ Quality Score Formula
One final number โ€” Quality Score 0โ€“100 โ€” is a weighted average of four categories. Speed is measured separately and never affects quality.
๐Ÿ’ป Code 35%

Model writes a Python function. We actually execute it with test inputs and compare outputs to expected values. Syntax error or wrong logic = 0. No partial credit for broken code.

fibonacci · palindrome · is_prime · bubble_sort · count_vowels · binary_search · flatten · longest_common_prefix
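
The pass/fail grading described above can be sketched as follows. This is a minimal illustration, not the actual `benchmark.py` code — `grade_code` and the example test cases are hypothetical:

```python
def grade_code(func, cases):
    """Score 1.0 only if every test case passes; anything broken scores 0."""
    try:
        return 1.0 if all(func(*args) == expected for args, expected in cases) else 0.0
    except Exception:
        return 0.0  # syntax/runtime error = 0 — no partial credit

# Pretend this function came back from the model:
def is_palindrome(s):
    return s == s[::-1]

cases = [(("level",), True), (("hello",), False)]
score = grade_code(is_palindrome, cases)  # 1.0 — both cases pass
```

The all-or-nothing rule is what makes the score trustworthy: a function that passes 3 of 4 cases still gets 0.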
🧠 Reasoning 30%

Logic puzzles, probability traps, and multi-step deduction — each with a single correct answer. We check if the model's response contains the right answer string.

syllogism · river crossing · coin flip · knights & knaves · base rate / Bayes · scheduling constraints · word problem · counting trap
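
A containment check like the one described is tiny — a plausible sketch (function name is illustrative):

```python
def grade_reasoning(response: str, expected: str) -> float:
    """1.0 if the known-correct answer string appears anywhere in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0
```

Substring matching is deterministic and tolerant of surrounding prose ("The answer is 42." passes for expected "42"), which is why each puzzle needs a single unambiguous answer.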
📋 Instruction Following 20%

Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists of exact length, and precise sentence counts. We parse the output programmatically.

JSON (3 keys) · JSON (nested) · numbered list ×5 · sentence count ×3
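
One of these checks — "JSON with specific keys" — might look like this sketch (the helper name and strictness are assumptions, not the exact `benchmark.py` logic):

```python
import json

def grade_json_keys(output: str, required: set) -> float:
    """Parse the raw output as JSON and demand exactly the requested keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    return 1.0 if isinstance(obj, dict) and set(obj) == required else 0.0
```

Because the output is parsed programmatically, a model that wraps its JSON in prose or markdown fences fails unless the grader strips that first.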
🌍 Translation 15%

English ↔ Russian ↔ Spanish. We check for the correct writing system (Cyrillic ratio for Russian) and expected vocabulary keywords. No LLM judge — purely deterministic checks.

EN → RU · RU → EN · EN → ES
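
A "Cyrillic ratio" check could be implemented as below. The 0.5 threshold and the keyword rule are illustrative assumptions — the page only says both checks exist, not their exact parameters:

```python
def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Cyrillic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)

def grade_en_to_ru(output: str, keywords: list) -> float:
    """Pass only if the output is mostly Cyrillic AND contains expected vocabulary."""
    script_ok = cyrillic_ratio(output) >= 0.5   # threshold is an assumption
    vocab_ok = all(k.lower() in output.lower() for k in keywords)
    return 1.0 if script_ok and vocab_ok else 0.0
```

The script check alone catches models that answer in English; the keyword check catches fluent-looking but wrong translations.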
# quality_score formula (benchmark.py)
quality = (
    avg_code        * 0.35 +  # 35% — real execution
    reasoning_score * 0.30 +  # 30% — logic & math
    avg_instruction * 0.20 +  # 20% — format compliance
    avg_translation * 0.15    # 15% — language accuracy
)

📊 Tiered Testing — Difficulty Scales with Size
Small models get easier tasks — basic recursion, simple logic. Large models get harder ones — algorithmic problems, Bayesian traps, multi-step constraints. This makes scores meaningful within each tier and prevents small models from looking better than they are by acing trivial tasks.
Tier   | Size   | Code tests                                      | Reasoning tests                                      | Instruction tests
Small  | ≤10B   | fibonacci, palindrome                           | syllogism, speed math, counting                      | JSON (3 keys), list ×5
Medium | 11–50B | + is_prime, bubble_sort, count_vowels           | + river crossing, coin flip, word problem, deduction | + sentence count ×3
Large  | 50B+   | + binary_search, flatten, longest_common_prefix | + knights & knaves, Bayes base rate, scheduling      | + JSON nested (name/scores/active)

Rankings are normalized within each tier — a Small #1 is the best among small models, not globally. Models with unknown size fall back to the Medium tier.
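The tier assignment rule above, including the Medium fallback for unknown sizes, can be sketched as (function name is hypothetical):

```python
def pick_tier(size_billions):
    """Map parameter count (in billions) to a test tier.

    Unknown sizes fall back to Medium, per the ranking rules.
    """
    if size_billions is None:
        return "medium"
    if size_billions <= 10:
        return "small"
    if size_billions <= 50:
        return "medium"
    return "large"
```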


โš™๏ธ How It Runs
1
Daily at 03:00 UTC — GitHub Actions triggers the benchmark

Fully automated, no human involvement. The workflow runs benchmark.py against all configured models.

2
Each model runs its tier's test suite

Small models run 2 code + 3 reasoning tests. Large models run 4 code + 6 reasoning tests. Python functions are executed in a sandbox via subprocess. All results are deterministic.
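The "sandbox via subprocess" step could look roughly like this — a sketch assuming a fresh interpreter per snippet with a timeout, not the exact `benchmark.py` implementation:

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0):
    """Run model-generated code in a separate interpreter; return stdout or None.

    A separate process means a crash or hang in the model's code cannot take
    down the benchmark itself. The 5-second timeout is an assumption.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # infinite loops count as failures
    return proc.stdout if proc.returncode == 0 else None
```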

3
Scores computed, JSON files written

Results saved to docs/data/results/YYYY-MM-DD.json (full proof), leaderboard.json (ranked list), and summary.json (lightweight integration endpoint).
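Writing the dated snapshot is straightforward; a sketch of the naming scheme described above (`write_snapshot` is a hypothetical helper):

```python
import json
import pathlib
from datetime import date

def write_snapshot(results: dict, out_dir: str = "docs/data/results") -> pathlib.Path:
    """Write the full daily snapshot using the YYYY-MM-DD.json naming scheme."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{date.today().isoformat()}.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```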

4
Git commit → GitHub Pages redeploys automatically

The updated JSON files are committed and pushed. GitHub Pages picks up the changes within minutes.

5
Speed measured separately, never mixed into quality

Three speed prompts (short / medium / long) measure raw tokens/second. A model can score 100/100 on quality and rank last on speed — or vice versa.
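The tokens/second measurement reduces to timing one generation call — a sketch where `generate` is a stand-in for whatever API call the benchmark makes and is assumed to return a token list:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time one generation call; generate() is assumed to return a token list."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else 0.0
```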


🔌 Public API — Use in Your Projects
All data is available as plain JSON over HTTPS with CORS. No auth, no rate limits. Fetch rankings, embed a leaderboard, or pipe scores into your own tooling.
data/results/summary.json
Recommended for integrations. One entry per model — id, name, provider, tier, quality score, speed, per-category scores.
~6 KB · updated daily
data/results/leaderboard.json
Quality-ranked list with rank, score, speed, tier, and status for every model. Same data as the main Rankings page.
~6 KB · updated daily
data/results/leaderboard_speed.json
Same structure, ranked by tokens/second instead of quality score.
~6 KB · updated daily
data/results/YYYY-MM-DD.json
Full daily snapshot with raw test outputs, execution logs, and per-case results. Use this to audit scores or build custom analysis.
~165 KB · one file per day
// Minimal example — fetch top 5 models
const res = await fetch('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json');
const { models } = await res.json();
const top5 = models.slice(0, 5); // already sorted by quality desc

top5.forEach(m => {
  console.log(`#${m.rank} ${m.name} — quality: ${m.quality}, speed: ${m.speed} tok/s`);
});
# Python example
import requests

data = requests.get('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json').json()
for m in data['models'][:5]:
    print(f"#{m['rank']} {m['name']} — {m['quality']}/100 quality, {m['speed']} tok/s")

Open Source

All benchmark code, scoring logic, and raw result data are public. No hidden weights, no paid placements, no LLM-as-judge black boxes. Every score can be reproduced by running benchmark.py yourself.

View on GitHub →