evaluate: switch to Elo ratings with skillratings crate

Replace sort-based ranking with a proper Elo system:

- Each agent TYPE has a persistent Elo rating (agent-elo.json)
- Each matchup: pick two random types, grab a recent action from each,
  LLM compares, update ratings
- Ratings persist across daily evaluations; natural recency bias from
  continuous updates against current opponents
- K=32 for fast adaptation to prompt changes

Usage: poc-memory agent evaluate --matchups 30 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
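For readers unfamiliar with the rating update described above, here is a minimal hand-written sketch of the standard Elo formula with K=32. It is not the skillratings crate's API and not necessarily how this commit implements it. The names `expected_score`, `elo_update`, and `Outcome` are hypothetical, and the starting rating of 1000 in the usage note is an assumption.

```rust
/// K-factor taken from the commit message; a larger K moves ratings faster.
const K: f64 = 32.0;

/// Result of one LLM comparison, from the first agent type's point of view.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome {
    Win,
    Draw,
    Loss,
}

/// Expected score of a player rated `a` against one rated `b`
/// (standard Elo logistic curve: base 10, scale 400).
fn expected_score(a: f64, b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((b - a) / 400.0))
}

/// Returns the updated `(a, b)` ratings after a single matchup.
/// The update is zero-sum: whatever `a` gains, `b` loses.
fn elo_update(a: f64, b: f64, outcome: Outcome) -> (f64, f64) {
    let score_a = match outcome {
        Outcome::Win => 1.0,
        Outcome::Draw => 0.5,
        Outcome::Loss => 0.0,
    };
    let delta = K * (score_a - expected_score(a, b));
    (a + delta, b - delta)
}
```

With two agent types both at an assumed starting rating of 1000, `elo_update(1000.0, 1000.0, Outcome::Win)` moves the winner up 16 points and the loser down 16. An upset win against a higher-rated type moves both ratings further, which is what lets ratings adapt quickly after a prompt change.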
2026-03-14 19:53:46 -04:00
commit 16777924d0
parent e2a6bc4c8b
4 changed files with 129 additions and 71 deletions
--- a/poc-memory/Cargo.toml
+++ b/poc-memory/Cargo.toml
@@ -24,6 +24,7 @@ jobkit-daemon = { path = "../jobkit-daemon" }
 redb = "2"
 log = "0.4"
 ratatui = "0.29"
+skillratings = "0.28"
 crossterm = { version = "0.28", features = ["event-stream"] }

 [build-dependencies]