Commit graph

14 commits

Author SHA1 Message Date
ProofOfConcept
c959b2c964 evaluate: fix RNG — xorshift32 replaces degenerate LCG
The LCG was producing only 2 distinct matchup pairs due to poor
constants. Switch to xorshift32 for proper coverage of all type pairs.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:57:58 -04:00
ProofOfConcept
16777924d0 evaluate: switch to Elo ratings with skillratings crate
Replace sort-based ranking with proper Elo system:
- Each agent TYPE has a persistent Elo rating (agent-elo.json)
- Each matchup: pick two random types, grab a recent action from
  each, LLM compares, update ratings
- Ratings persist across daily evaluations — natural recency bias
  from continuous updates against current opponents
- K=32 for fast adaptation to prompt changes

Usage: poc-memory agent evaluate --matchups 30 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:53:46 -04:00
ProofOfConcept
e2a6bc4c8b evaluate: remove TIE option, force binary judgment
TIE causes inconsistency in sort (A=B, B=C but A>C breaks ordering).
Force the comparator to always pick a winner. Default to A if response
is unparseable.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:48:01 -04:00
ProofOfConcept
0cecfdb352 evaluate: fix agent prompt path, dedup affected nodes, add --dry-run
- Use CARGO_MANIFEST_DIR for agent file path (same as defs.rs)
- Dedup affected nodes extracted from reports
- --dry-run shows example comparison prompt without LLM calls

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:44:12 -04:00
ProofOfConcept
415180eeab evaluate: ask for reasoning in comparisons
Chain-of-thought: "say which is better and why" forces clearer
judgment and gives us analysis data for improving agents.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:36:55 -04:00
ProofOfConcept
39e3d69e3c evaluate: dedup agent prompt when comparing same agent type
When both actions are from the same agent, show the instructions once
and just compare the two report outputs + affected nodes. Saves tokens
and makes the comparison cleaner.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:35:11 -04:00
ProofOfConcept
b964335317 evaluate: include agent prompt + affected nodes in comparisons
Each comparison now shows the LLM:
- Agent instructions (the .agent prompt file)
- Report output (what the agent did)
- Affected nodes content (what it changed)

The comparator sees intent, action, and impact — can judge whether
a deletion was correct, whether links are meaningful, whether
WRITE_NODEs capture real insights.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:34:10 -04:00
ProofOfConcept
433d36aea8 evaluate: use rayon par_sort_by for parallel LLM comparisons
Merge sort parallelizes naturally — multiple LLM comparison calls
happen concurrently. Safe because merge sort terminates correctly
even with non-deterministic comparators (unlike quicksort).

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:27:28 -04:00
ProofOfConcept
e12dea503b agent evaluate: sort agent actions by quality using Vec::sort_by with LLM
Yes, really. Rust's stdlib sort_by with an LLM pairwise comparator.
Each comparison is an API call asking "which action was better?"

Sample N actions per agent type, throw them all in a Vec, sort.
Where each agent's samples cluster = that agent's quality score.
Reports per-type average rank and quality ratio.

Supports both haiku (fast/cheap) and sonnet (quality) as comparator.

Usage: poc-memory agent evaluate --samples 5 --model haiku

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 19:24:07 -04:00
ProofOfConcept
99db511403 cli: move helpers to cli modules, main.rs under 1100 lines
Move CLI-specific helpers to their cli/ modules:
- journal_tail_entries, journal_tail_digests, extract_title,
  find_current_transcript → cli/journal.rs
- get_group_content → cli/misc.rs
- cmd_journal_write, cmd_journal_tail, cmd_load_context follow

These are presentation/session helpers, not library code — they
belong in the CLI layer per Kent's guidance.

main.rs: 3130 → 1054 lines (66% reduction).

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 18:14:52 -04:00
ProofOfConcept
8640d50990 cli: extract journal and misc commands, complete split
Move remaining extractable handlers into cli/journal.rs and cli/misc.rs.
Functions depending on main.rs helpers (cmd_journal_tail, cmd_journal_write,
cmd_load_context, cmd_cursor, cmd_daemon, cmd_digest, cmd_experience_mine,
cmd_apply_agent) remain in main.rs — next step is moving those helpers
to library code.

main.rs: 3130 → 1331 lines (57% reduction).
cli/ total: 1860 lines across 6 focused files.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 18:10:22 -04:00
ProofOfConcept
f423cf22df cli: extract agent and admin commands from main.rs
Move agent handlers (consolidate, replay, digest, experience-mine,
fact-mine, knowledge-loop, apply-*) into cli/agent.rs.

Move admin handlers (init, fsck, dedup, bulk-rename, health,
daily-check, import, export) into cli/admin.rs.

Functions tightly coupled to Clap types (cmd_daemon, cmd_digest,
cmd_apply_agent, cmd_experience_mine) remain in main.rs.

main.rs: 3130 → 1586 lines (49% reduction).

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 18:06:27 -04:00
ProofOfConcept
aa2fddf137 cli: extract node commands from main.rs into cli/node.rs
Move 15 node subcommand handlers (310 lines) out of main.rs:
render, write, used, wrong, not-relevant, not-useful, gap,
node-delete, node-rename, history, list-keys, list-edges,
dump-json, lookup-bump, lookups.

main.rs: 2518 → 2193 lines.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 18:02:12 -04:00
ProofOfConcept
c8d86e94c1 cli: extract graph commands from main.rs into cli/graph.rs
Move 18 graph subcommand handlers (594 lines) out of main.rs:
link, link-add, link-impact, link-audit, link-orphans,
triangle-close, cap-degree, normalize-strengths, differentiate,
trace, spectral-*, organize, interference.

main.rs: 3130 → 2518 lines.

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
2026-03-14 17:59:46 -04:00