evaluate: switch to Elo ratings with skillratings crate
Replace sort-based ranking with proper Elo system:

- Each agent TYPE has a persistent Elo rating (agent-elo.json)
- Each matchup: pick two random types, grab a recent action from each, LLM compares, update ratings
- Ratings persist across daily evaluations: natural recency bias from continuous updates against current opponents
- K=32 for fast adaptation to prompt changes

Usage: poc-memory agent evaluate --matchups 30 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
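The per-matchup rating update described above can be sketched with the standard Elo formula. This is a minimal, std-only illustration and not the code from this commit, which delegates the calculation to the skillratings crate. The names `Rating`, `Outcome`, `expected_score`, and `update` are hypothetical, and the 400-point scale is the conventional Elo default, assumed here rather than taken from the commit.

```rust
/// Elo rating for one agent type (hypothetical wrapper, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rating(pub f64);

/// Result of one LLM comparison, from the first agent's point of view.
#[derive(Debug, Clone, Copy)]
pub enum Outcome {
    Win,
    Draw,
    Loss,
}

/// Expected score of `a` against `b` under the standard Elo model
/// (logistic curve on a 400-point scale, assumed here).
pub fn expected_score(a: Rating, b: Rating) -> f64 {
    1.0 / (1.0 + 10f64.powf((b.0 - a.0) / 400.0))
}

/// Apply one matchup result and return the updated (a, b) ratings.
/// `k` is the K-factor; the commit message uses K=32.
pub fn update(a: Rating, b: Rating, outcome: Outcome, k: f64) -> (Rating, Rating) {
    let score_a = match outcome {
        Outcome::Win => 1.0,
        Outcome::Draw => 0.5,
        Outcome::Loss => 0.0,
    };
    // The update is zero-sum: whatever `a` gains, `b` loses.
    let delta = k * (score_a - expected_score(a, b));
    (Rating(a.0 + delta), Rating(b.0 - delta))
}
```

With K=32, two equally rated agent types move 16 points apart after a single decisive comparison, which is why a high K adapts quickly after a prompt change.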
parent e2a6bc4c8b
commit 16777924d0
4 changed files with 129 additions and 71 deletions
Cargo.lock (generated), 7 changes
@@ -1908,6 +1908,7 @@ dependencies = [
  "rkyv",
  "serde",
  "serde_json",
+ "skillratings",
  "uuid",
 ]
 
@@ -2615,6 +2616,12 @@ version = "0.1.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "e3a9fe34e3e7a50316060351f37187a3f546bce95496156754b601a5fa71b76e"
 
+[[package]]
+name = "skillratings"
+version = "0.28.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1a6ee7559737c1adcd9184f168a04dc360c84878907c3ecc5c33c2320be1d47a"
+
 [[package]]
 name = "slab"
 version = "0.4.12"