evaluate: switch to Elo ratings with skillratings crate

Replace sort-based ranking with proper Elo system:
- Each agent TYPE has a persistent Elo rating (agent-elo.json)
- Each matchup: pick two random types, grab a recent action from
  each, LLM compares, update ratings
- Ratings persist across daily evaluations, giving a natural recency
  bias from continuous updates against current opponents
- K=32 for fast adaptation to prompt changes
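
A minimal sketch of the rating update described above, assuming the
standard Elo formula. It is written by hand with only the standard
library so it runs on its own; the commit itself delegates this step to
the skillratings crate, whose calls are not shown here. The agent type
names, the 1000.0 starting rating, and the helper names `expected` and
`update` are hypothetical.

```rust
use std::collections::HashMap;

/// K-factor taken from the commit message.
const K: f64 = 32.0;

/// Expected score of a player rated `a` against a player rated `b`.
fn expected(a: f64, b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((b - a) / 400.0))
}

/// Standard Elo update. `score_a` is 1.0 if A won, 0.0 if B won,
/// and 0.5 for a draw. Returns the new ratings for (A, B).
fn update(a: f64, b: f64, score_a: f64) -> (f64, f64) {
    let ea = expected(a, b);
    let new_a = a + K * (score_a - ea);
    let new_b = b + K * ((1.0 - score_a) - (1.0 - ea));
    (new_a, new_b)
}

fn main() {
    // Hypothetical agent types; the starting rating is an assumption.
    let mut ratings: HashMap<&str, f64> = HashMap::new();
    ratings.insert("linker", 1000.0);
    ratings.insert("summarizer", 1000.0);

    // Pretend the LLM judge preferred the "linker" action in this matchup.
    let (a, b) = update(ratings["linker"], ratings["summarizer"], 1.0);
    ratings.insert("linker", a);
    ratings.insert("summarizer", b);

    println!("linker={a:.1} summarizer={b:.1}");
}
```

With equal starting ratings the expected score is 0.5, so a single win
moves each side by K/2 = 16 points. That is why a larger K reacts
faster to prompt changes, at the cost of noisier ratings.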

Usage: poc-memory agent evaluate --matchups 30 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
ProofOfConcept 2026-03-14 19:53:46 -04:00
parent e2a6bc4c8b
commit 16777924d0
4 changed files with 129 additions and 71 deletions

Cargo.lock generated

@@ -1908,6 +1908,7 @@ dependencies = [
"rkyv",
"serde",
"serde_json",
"skillratings",
"uuid",
]
@@ -2615,6 +2616,12 @@ version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3a9fe34e3e7a50316060351f37187a3f546bce95496156754b601a5fa71b76e"
[[package]]
name = "skillratings"
version = "0.28.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1a6ee7559737c1adcd9184f168a04dc360c84878907c3ecc5c33c2320be1d47a"
[[package]]
name = "slab"
version = "0.4.12"