Yes, really. Rust's stdlib sort_by with an LLM pairwise comparator.
Each comparison is an API call asking "which action was better?"
Sample N actions per agent type, throw them all in a Vec, sort.
Where each agent type's samples cluster in the sorted order is that
type's quality score.
Reports per-type average rank and quality ratio.
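
A minimal sketch of the idea, not the actual implementation: the
Sample struct, judge() and average_ranks() are hypothetical names, and
judge() is a deterministic stand-in for the API call (it treats longer
text as "better") so the example runs offline. Rank 0 is best, so a
lower average rank means a better agent type.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// One sampled action, tagged with the agent type that produced it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone)]
struct Sample {
    agent: String,
    text: String,
}

/// Stand-in for the LLM pairwise comparator. In the real tool this would
/// be an API call asking "which action was better?"; here, as an
/// assumption, longer text wins. Returns Less when `a` is better, so
/// better samples sort first.
fn judge(a: &Sample, b: &Sample) -> Ordering {
    b.text.len().cmp(&a.text.len())
}

/// Sort all samples with the given comparator, then compute the average
/// position (0 = best) of each agent type's samples in the sorted Vec.
fn average_ranks(
    mut samples: Vec<Sample>,
    mut cmp: impl FnMut(&Sample, &Sample) -> Ordering,
) -> BTreeMap<String, f64> {
    samples.sort_by(|a, b| cmp(a, b));

    // agent type -> (sum of ranks, number of samples)
    let mut acc: BTreeMap<String, (usize, usize)> = BTreeMap::new();
    for (rank, s) in samples.iter().enumerate() {
        let entry = acc.entry(s.agent.clone()).or_insert((0, 0));
        entry.0 += rank;
        entry.1 += 1;
    }

    acc.into_iter()
        .map(|(agent, (sum, n))| (agent, sum as f64 / n as f64))
        .collect()
}
```

Caveat: sort_by expects a total order. Its documentation says the
resulting order is unspecified, and the call may panic, if the
comparator does not provide one, so a noisy LLM judge would likely
need caching or a tie-breaking fallback.
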
Supports both haiku (fast/cheap) and sonnet (quality) as comparator.
Usage: poc-memory agent evaluate --samples 5 --model haiku
Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>