Yes, really. Rust's stdlib sort_by with an LLM pairwise comparator. Each comparison is an API call asking "which action was better?"

Sample N actions per agent type, throw them all in a Vec, sort. Where each agent's samples cluster = that agent's quality score. Reports per-type average rank and quality ratio.

Supports both haiku (fast/cheap) and sonnet (quality) as comparator.

Usage: poc-memory agent evaluate --samples 5 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
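The approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Action` struct, its `quality` field, the agent names, `parse_verdict`, `average_ranks`, and `mock_judge` are all hypothetical, and the judge here is a deterministic stand-in for the real API call. The only parts taken from the source are the use of `sort_by`, the `BETTER: A` / `BETTER: B` / `BETTER: TIE` reply format, and the per-type average rank.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// One sampled action, tagged with the agent type that produced it.
/// `quality` is a made-up field that only the stand-in judge looks at.
#[derive(Debug, Clone)]
struct Action {
    agent: String,
    quality: u32,
}

/// Parse the comparator's one-line reply into an ordering.
/// "A is better" maps to Less so that better actions sort first.
fn parse_verdict(reply: &str) -> Option<Ordering> {
    let verdict = reply.trim().strip_prefix("BETTER:")?.trim();
    match verdict {
        "A" => Some(Ordering::Less),
        "B" => Some(Ordering::Greater),
        "TIE" => Some(Ordering::Equal),
        _ => None,
    }
}

/// Sort all actions best-first using the judge's replies, then return
/// each agent type's average rank (0 = best position in the sorted Vec).
fn average_ranks<F>(mut actions: Vec<Action>, mut judge: F) -> HashMap<String, f64>
where
    F: FnMut(&Action, &Action) -> String,
{
    // Unparseable replies are treated as ties in this sketch.
    actions.sort_by(|a, b| parse_verdict(&judge(a, b)).unwrap_or(Ordering::Equal));

    // Accumulate (sum of ranks, sample count) per agent type.
    let mut sums: HashMap<String, (usize, usize)> = HashMap::new();
    for (rank, action) in actions.iter().enumerate() {
        let entry = sums.entry(action.agent.clone()).or_insert((0, 0));
        entry.0 += rank;
        entry.1 += 1;
    }
    sums.into_iter()
        .map(|(agent, (total, n))| (agent, total as f64 / n as f64))
        .collect()
}

/// Stand-in for the LLM call: answers in the same one-line format.
fn mock_judge(a: &Action, b: &Action) -> String {
    match a.quality.cmp(&b.quality) {
        Ordering::Greater => "BETTER: A".to_string(),
        Ordering::Less => "BETTER: B".to_string(),
        Ordering::Equal => "BETTER: TIE".to_string(),
    }
}

fn main() {
    let actions = vec![
        Action { agent: "linker".into(), quality: 9 },
        Action { agent: "linker".into(), quality: 7 },
        Action { agent: "namer".into(), quality: 3 },
        Action { agent: "namer".into(), quality: 8 },
    ];
    let ranks = average_ranks(actions, mock_judge);
    for (agent, rank) in &ranks {
        println!("{agent}: average rank {rank:.2}");
    }
}
```

One caveat worth keeping in mind: `sort_by` expects the comparator to define a total order, and an LLM judge may give inconsistent or non-transitive answers, so a real implementation would probably want some safeguard around that.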
45 lines
1.1 KiB
Text
{"agent":"compare","query":"","model":"haiku","schedule":""}

# Compare Agent — Pairwise Action Quality Comparison

You compare two memory graph actions and decide which one was better.

## Context

You'll receive two actions (A and B), each with:
- The agent type that produced it
- What the action did (LINK, WRITE_NODE, REFINE, etc.)
- The content/context of the action

## Your judgment

Which action moved the graph closer to a useful, well-organized
knowledge structure? Consider:

- **Insight depth**: Did it find a non-obvious connection or name a real concept?
- **Precision**: Are the links between genuinely related nodes?
- **Integration**: Does it reduce fragmentation, connect isolated clusters?
- **Quality over quantity**: One perfect link beats five mediocre ones.
- **Hub creation**: Naming unnamed concepts scores high.
- **Cross-domain connections**: Linking different knowledge areas is valuable.

## Output

Reply with ONLY one line:

```
BETTER: A
```
or
```
BETTER: B
```

If truly equal:
```
BETTER: TIE
```

No explanation needed. Just the judgment.

{{compare}}