Yes, really. Rust's stdlib sort_by with an LLM pairwise comparator. Each comparison is an API call asking "which action was better?"

Sample N actions per agent type, throw them all in a Vec, sort. Where each agent's samples cluster = that agent's quality score. Reports per-type average rank and quality ratio.

Supports both haiku (fast/cheap) and sonnet (quality) as comparator.

Usage:
  poc-memory agent evaluate --samples 5 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
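
A minimal sketch of the idea, not the code in this commit: sort one Vec of samples with a pairwise comparator, then average each agent type's positions. The `Sample` struct, the `judge` function, and the agent names are hypothetical; `judge` is a local placeholder for the API call so the example runs offline. The quality ratio is left out because its definition is not given here.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// One sampled action, tagged with the agent type that produced it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone)]
struct Sample {
    agent: String,
    action: String,
}

fn sample(agent: &str, action: &str) -> Sample {
    Sample { agent: agent.to_string(), action: action.to_string() }
}

/// Placeholder for the LLM judge. A real comparator would make an API call
/// asking "which action was better?"; here the longer action text wins.
/// Returns `Ordering::Less` when `a` is better, so better actions sort first.
fn judge(a: &Sample, b: &Sample) -> Ordering {
    b.action.len().cmp(&a.action.len())
}

/// Sort all samples with the pairwise comparator, then compute each agent
/// type's average rank (0 = best). Result is ordered best agent first.
fn average_ranks(
    mut samples: Vec<Sample>,
    cmp: impl Fn(&Sample, &Sample) -> Ordering,
) -> Vec<(String, f64)> {
    samples.sort_by(|a, b| cmp(a, b));

    // agent -> (sum of ranks, number of samples)
    let mut acc: HashMap<String, (usize, usize)> = HashMap::new();
    for (rank, s) in samples.iter().enumerate() {
        let entry = acc.entry(s.agent.clone()).or_insert((0, 0));
        entry.0 += rank;
        entry.1 += 1;
    }

    let mut out: Vec<(String, f64)> = acc
        .into_iter()
        .map(|(agent, (sum, n))| (agent, sum as f64 / n as f64))
        .collect();
    out.sort_by(|a, b| a.1.total_cmp(&b.1));
    out
}

fn main() {
    let samples = vec![
        sample("beta", "x"),
        sample("alpha", "a long, detailed action"),
        sample("beta", "short"),
        sample("alpha", "medium one"),
    ];
    for (agent, avg) in average_ranks(samples, judge) {
        println!("{agent}: average rank {avg:.2}");
    }
}
```

One caveat with this approach: `sort_by` expects the comparator to be a total order, and its documentation says it may panic when that does not hold. A noisy judge can give inconsistent answers, so a real comparator may need to account for that.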
||
|---|---|---|
| .claude | ||
| agents | ||
| defaults | ||
| schema | ||
| src | ||
| build.rs | ||
| Cargo.toml | ||
| config.example.jsonl | ||