
How Lexentia Proof Works

Daily automated benchmarks of free AI models — no human grading, no sponsorships. Every score comes from code that actually runs, logic puzzles with known answers, and strict formatting checks.

๐Ÿ“ Quality Score Formula
One final number โ€” Quality Score 0โ€“100 โ€” is a weighted average of four categories. Speed is measured separately and never affects quality.
๐Ÿ’ป Code 35%

Model writes a Python function. We actually execute it with test inputs and compare outputs to expected values. Syntax error or wrong logic = 0. No partial credit for broken code.

fibonacci · palindrome · is_prime · bubble_sort · count_vowels · binary_search · flatten · longest_common_prefix
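
The pass/fail grading described above can be sketched as follows. This is a minimal illustration, not the actual `benchmark.py` code — `grade_code` and the example test cases are hypothetical:

```python
def grade_code(func, cases):
    """Score 1.0 only if every test case passes; anything broken scores 0."""
    try:
        return 1.0 if all(func(*args) == expected for args, expected in cases) else 0.0
    except Exception:
        return 0.0  # syntax/runtime error = 0 — no partial credit

# Pretend this function came back from the model:
def is_palindrome(s):
    return s == s[::-1]

cases = [(("level",), True), (("hello",), False)]
score = grade_code(is_palindrome, cases)  # 1.0 — both cases pass
```

The all-or-nothing rule is what makes the score trustworthy: a function that passes 3 of 4 cases still gets 0.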
🧠 Reasoning 30%

Logic puzzles, probability traps, and multi-step deduction — each with a single correct answer. We check if the model's response contains the right answer string.

syllogism · river crossing · coin flip · knights & knaves · base rate / Bayes · scheduling constraints · word problem · counting trap
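
A containment check like the one described is tiny — a plausible sketch (function name is illustrative):

```python
def grade_reasoning(response: str, expected: str) -> float:
    """1.0 if the known-correct answer string appears anywhere in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0
```

Substring matching is deterministic and tolerant of surrounding prose ("The answer is 42." passes for expected "42"), which is why each puzzle needs a single unambiguous answer.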
📋 Instruction Following 20%

Can the model follow exact formatting rules? We ask for JSON with specific keys, numbered lists of exact length, and precise sentence counts. We parse the output programmatically.

JSON (3 keys) · JSON (nested) · numbered list ×5 · sentence count ×3
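
One of these checks — "JSON with specific keys" — might look like this sketch (the helper name and strictness are assumptions, not the exact `benchmark.py` logic):

```python
import json

def grade_json_keys(output: str, required: set) -> float:
    """Parse the raw output as JSON and demand exactly the requested keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    return 1.0 if isinstance(obj, dict) and set(obj) == required else 0.0
```

Because the output is parsed programmatically, a model that wraps its JSON in prose or markdown fences fails unless the grader strips that first.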
🌍 Translation 15%

English ↔ Russian ↔ Spanish. We check for the correct writing system (Cyrillic ratio for Russian) and expected vocabulary keywords. No LLM judge — purely deterministic checks.

EN → RU · RU → EN · EN → ES
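
A "Cyrillic ratio" check could be implemented as below. The 0.5 threshold and the keyword rule are illustrative assumptions — the page only says both checks exist, not their exact parameters:

```python
def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Cyrillic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)

def grade_en_to_ru(output: str, keywords: list) -> float:
    """Pass only if the output is mostly Cyrillic AND contains expected vocabulary."""
    script_ok = cyrillic_ratio(output) >= 0.5   # threshold is an assumption
    vocab_ok = all(k.lower() in output.lower() for k in keywords)
    return 1.0 if script_ok and vocab_ok else 0.0
```

The script check alone catches models that answer in English; the keyword check catches fluent-looking but wrong translations.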
# quality_score formula (benchmark.py)
quality = (
    avg_code        * 0.35 +  # 35% — real execution
    reasoning_score * 0.30 +  # 30% — logic & math
    avg_instruction * 0.20 +  # 20% — format compliance
    avg_translation * 0.15    # 15% — language accuracy
)

📊 Tiered Testing — Difficulty Scales with Size
Small models get easier tasks — basic recursion, simple logic. Large models get harder ones — algorithmic problems, Bayesian traps, multi-step constraints. This makes scores meaningful within each tier and prevents small models from looking better than they are by acing trivial tasks.
Tier   | Size   | Code tests                                      | Reasoning tests                                      | Instruction tests
Small  | ≤10B   | fibonacci, palindrome                           | syllogism, speed math, counting                      | JSON (3 keys), list ×5
Medium | 11–50B | + is_prime, bubble_sort, count_vowels           | + river crossing, coin flip, word problem, deduction | + sentence count ×3
Large  | 50B+   | + binary_search, flatten, longest_common_prefix | + knights & knaves, Bayes base rate, scheduling      | + JSON nested (name/scores/active)

Rankings are normalized within each tier — a Small #1 is the best among small models, not globally. Models with unknown size fall back to the Medium tier.
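The tier assignment rule above, including the Medium fallback for unknown sizes, can be sketched as (function name is hypothetical):

```python
def pick_tier(size_billions):
    """Map parameter count (in billions) to a test tier.

    Unknown sizes fall back to Medium, per the ranking rules.
    """
    if size_billions is None:
        return "medium"
    if size_billions <= 10:
        return "small"
    if size_billions <= 50:
        return "medium"
    return "large"
```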


โš™๏ธ How It Runs
1
Daily at 03:00 UTC — GitHub Actions triggers the benchmark

Fully automated, no human involvement. The workflow runs benchmark.py against all configured models.

2
Each model runs its tier's test suite

Small models run 2 code + 3 reasoning tests. Large models run 4 code + 6 reasoning tests. Python functions are executed in a sandbox via subprocess. All results are deterministic.
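The "sandbox via subprocess" step could look roughly like this — a sketch assuming a fresh interpreter per snippet with a timeout, not the exact `benchmark.py` implementation:

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0):
    """Run model-generated code in a separate interpreter; return stdout or None.

    A separate process means a crash or hang in the model's code cannot take
    down the benchmark itself. The 5-second timeout is an assumption.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # infinite loops count as failures
    return proc.stdout if proc.returncode == 0 else None
```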

3
Scores computed, JSON files written

Results saved to docs/data/results/YYYY-MM-DD.json (full proof), leaderboard.json (ranked list), and summary.json (lightweight integration endpoint).
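Writing the dated snapshot is straightforward; a sketch of the naming scheme described above (`write_snapshot` is a hypothetical helper):

```python
import json
import pathlib
from datetime import date

def write_snapshot(results: dict, out_dir: str = "docs/data/results") -> pathlib.Path:
    """Write the full daily snapshot using the YYYY-MM-DD.json naming scheme."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{date.today().isoformat()}.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```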

4
Git commit → GitHub Pages redeploys automatically

The updated JSON files are committed and pushed. GitHub Pages picks up the changes within minutes.

5
Speed measured separately, never mixed into quality

Three speed prompts (short / medium / long) measure raw tokens/second. A model can score 100/100 on quality and rank last on speed — or vice versa.
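The tokens/second measurement reduces to timing one generation call — a sketch where `generate` is a stand-in for whatever API call the benchmark makes and is assumed to return a token list:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time one generation call; generate() is assumed to return a token list."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else 0.0
```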


🔌 Public API — Use in Your Projects
All data is available as plain JSON over HTTPS with CORS. No auth, no rate limits. Fetch rankings, embed a leaderboard, or pipe scores into your own tooling.
data/results/summary.json
Recommended for integrations. One entry per model — id, name, provider, tier, quality score, speed, per-category scores.
~6 KB · updated daily
data/results/leaderboard.json
Quality-ranked list with rank, score, speed, tier, and status for every model. Same data as the main Rankings page.
~6 KB · updated daily
data/results/leaderboard_speed.json
Same structure, ranked by tokens/second instead of quality score.
~6 KB · updated daily
data/results/YYYY-MM-DD.json
Full daily snapshot with raw test outputs, execution logs, and per-case results. Use this to audit scores or build custom analysis.
~165 KB · one file per day
// Minimal example — fetch top 5 models
const res = await fetch('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json');
const { models } = await res.json();
const top5 = models.slice(0, 5); // already sorted by quality desc

top5.forEach(m => {
  console.log(`#${m.rank} ${m.name} — quality: ${m.quality}, speed: ${m.speed} tok/s`);
});
# Python example
import requests

data = requests.get('https://skillichse.github.io/Lexentia-Proof/data/results/summary.json').json()
for m in data['models'][:5]:
    print(f"#{m['rank']} {m['name']} — {m['quality']}/100 quality, {m['speed']} tok/s")

Open Source

All benchmark code, scoring logic, and raw result data are public. No hidden weights, no paid placements, no LLM-as-judge black boxes. Every score can be reproduced by running benchmark.py yourself.

View on GitHub →