Each comparison now shows the LLM:
- Agent instructions (the .agent prompt file)
- Report output (what the agent did)
- Affected node content (what it changed)
The comparator sees intent, action, and impact, so it can judge
whether a deletion was correct, whether links are meaningful, and
whether WRITE_NODEs capture real insights.
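A minimal sketch of how the three pieces could be assembled into one
comparison prompt. The struct, field, and function names here
(ActionContext, render, build_comparison_prompt) and the section
headings are assumptions for illustration, not the names used in the
actual code:

```rust
/// Everything the comparator is shown about one action.
/// Hypothetical type: field names are illustrative only.
struct ActionContext {
    instructions: String,   // contents of the .agent prompt file
    report: String,         // what the agent did
    affected_nodes: String, // content of the nodes it changed
}

/// Render one action as a labelled block: intent, action, impact.
fn render(label: &str, ctx: &ActionContext) -> String {
    format!(
        "## Action {}\n### Agent instructions\n{}\n### Report output\n{}\n### Affected nodes\n{}\n",
        label, ctx.instructions, ctx.report, ctx.affected_nodes
    )
}

/// Build the pairwise prompt shown to the comparator model.
/// The closing question wording is an assumption.
fn build_comparison_prompt(a: &ActionContext, b: &ActionContext) -> String {
    format!(
        "{}\n{}\nWhich action was better? Answer A or B.",
        render("A", a),
        render("B", b)
    )
}
```

Keeping all three sections under one label per action lets the model
weigh a report against the instructions that produced it and the
nodes it touched.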
Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
Yes, really. Rust's stdlib sort_by with an LLM pairwise comparator.
Each comparison is an API call asking "which action was better?"
Sample N actions per agent type, throw them all in a Vec, sort.
Where an agent type's samples cluster in the sorted order is that
type's quality score.
Reports per-type average rank and quality ratio.
Supports both haiku (fast/cheap) and sonnet (quality) as comparator.
Usage: poc-memory agent evaluate --samples 5 --model haiku
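The sort-and-rank step described above can be sketched roughly as
follows. This is an illustration under stated assumptions, not the
actual implementation: Sample, compare_actions, and average_ranks are
hypothetical names, and the comparator is a deterministic stand-in
for the LLM API call so the example runs offline:

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// One sampled action, tagged with the agent type that produced it.
/// Hypothetical type for illustration.
struct Sample {
    agent_type: String,
    // Stand-in for the model's judgment; real code would carry the
    // action's instructions, report, and affected nodes instead.
    hidden_quality: u32,
}

/// Stand-in for the pairwise "which action was better?" API call.
/// Returns Ordering::Less when `a` is the better action, so the
/// best actions sort to the front.
fn compare_actions(a: &Sample, b: &Sample) -> Ordering {
    b.hidden_quality.cmp(&a.hidden_quality)
}

/// Sort all samples best-first, then average the rank per agent type.
/// A lower average rank means that type's samples cluster nearer the top.
fn average_ranks(mut samples: Vec<Sample>) -> HashMap<String, f64> {
    samples.sort_by(compare_actions);

    // agent type -> (sum of ranks, number of samples)
    let mut sums: HashMap<String, (usize, usize)> = HashMap::new();
    for (rank, s) in samples.iter().enumerate() {
        let entry = sums.entry(s.agent_type.clone()).or_insert((0, 0));
        entry.0 += rank;
        entry.1 += 1;
    }

    sums.into_iter()
        .map(|(ty, (sum, n))| (ty, sum as f64 / n as f64))
        .collect()
}
```

One caveat with this approach: the stdlib documents that sort_by
expects a total order and may panic or leave the order unspecified
otherwise, so an LLM comparator that gives inconsistent verdicts is
outside that contract. Caching each pair's verdict would at least
keep repeated comparisons of the same pair consistent.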
Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>