consciousness/doc/query-language-design.md
Kent Overstreet ff5be3e792 kill .claude
2026-04-09 20:00:05 -04:00


Query Language Design — Unifying Search and Agent Selection

Date: 2026-03-10
Status: Phase 1 complete (2026-03-10)

Problem

Agent node selection is hardcoded in Rust (prompts.rs). Adding a new agent means editing Rust, recompiling, restarting the daemon. The existing search pipeline (spread, spectral, etc.) handles graph exploration but can't express structured predicates on node fields.

We need one system that handles both:

  • Search: "find nodes related to these terms" (graph exploration)
  • Selection: "give me episodic nodes not seen by linker in 7 days, sorted by priority" (structured predicates)

Design Principle

The pipeline already exists: stages compose left-to-right, each transforming a result set. We extend it with predicate stages that filter/sort on node metadata, alongside the existing graph algorithm stages.

An agent definition becomes a query expression + prompt template. The daemon scheduler is just "which queries have stale results."

Current Pipeline

seeds → [stage1] → [stage2] → ... → results

Each stage takes Vec<(String, f64)> (key, score) and returns the same. Stages are parsed from strings: spread,max_hops=4 or spectral,k=20.
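Left-to-right composition over `(key, score)` pairs can be sketched as a fold. This is illustrative only — the `Stage` closure type and `run_pipeline` are stand-ins, not the actual search.rs types, which carry extra context (store, graph):

```rust
// Illustrative only: stages as closures over (key, score) pairs.
type ResultSet = Vec<(String, f64)>;
type Stage = Box<dyn Fn(ResultSet) -> ResultSet>;

fn run_pipeline(seeds: ResultSet, stages: &[Stage]) -> ResultSet {
    // Left-to-right: each stage consumes the previous stage's output.
    stages.iter().fold(seeds, |acc, stage| stage(acc))
}

fn main() {
    let sort_desc: Stage = Box::new(|mut r: ResultSet| {
        r.sort_by(|a, b| b.1.total_cmp(&a.1));
        r
    });
    let limit2: Stage = Box::new(|mut r: ResultSet| {
        r.truncate(2);
        r
    });
    let seeds = vec![("a".into(), 0.2), ("b".into(), 0.9), ("c".into(), 0.5)];
    let out = run_pipeline(seeds, &[sort_desc, limit2]);
    assert_eq!(out, vec![("b".to_string(), 0.9), ("c".to_string(), 0.5)]);
}
```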

Proposed Extension

Four kinds of stages

Generators — produce a result set from nothing (or from the store):

all                          # every non-deleted node
match:btree                  # text match (current seed extraction)

Filters — narrow an existing result set:

type:episodic                # node_type == EpisodicSession
type:semantic                # node_type == Semantic
key:journal#j-*              # glob match on key
key-len:>=60                 # key length predicate
weight:>0.5                  # numeric comparison
age:<7d                      # created/modified within duration
content-len:>1000            # content size filter
provenance:manual            # provenance match
not-visited:linker,7d        # not seen by agent in duration
visited:linker               # HAS been seen by agent (for auditing)
community:42                 # community membership
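The key: predicate only needs `*` wildcards. A self-contained matcher is small enough to sketch (illustrative — real code might reach for an existing glob crate instead):

```rust
// Minimal `*`-only glob matcher for the key: filter (illustrative).
fn key_glob(pattern: &str, key: &str) -> bool {
    fn go(p: &[u8], k: &[u8]) -> bool {
        match p.split_first() {
            None => k.is_empty(),
            // `*` matches any (possibly empty) run of bytes.
            Some((b'*', rest)) => (0..=k.len()).any(|i| go(rest, &k[i..])),
            Some((c, rest)) => k.first() == Some(c) && go(rest, &k[1..]),
        }
    }
    go(pattern.as_bytes(), key.as_bytes())
}

fn main() {
    assert!(key_glob("journal#j-*", "journal#j-2026-03-10"));
    assert!(!key_glob("journal#j-*", "notes#n-1"));
}
```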

Transforms — reorder or reshape:

sort:priority                # consolidation priority scoring
sort:timestamp               # by timestamp (desc by default)
sort:content-len             # by content size
sort:degree                  # by graph degree
sort:weight                  # by weight
limit:20                     # truncate

Graph algorithms (existing, unchanged):

spread                       # spreading activation
spectral,k=20                # spectral nearest neighbors
confluence                   # multi-source reachability
geodesic                     # straightest spectral paths
manifold                     # extrapolation along seed direction

Syntax

Pipe-separated stages, same as current -p flag:

all | type:episodic | not-visited:linker,7d | sort:priority | limit:20

Or on the command line:

poc-memory search -p all -p type:episodic -p not-visited:linker,7d -p sort:priority -p limit:20

Current search still works unchanged:

poc-memory search btree journal -p spread

(terms become match: seeds implicitly)
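Classifying a stage string is unambiguous: known algorithm names are tried first, `name:arg` is a predicate, and remaining bare words (like all) are generators. A sketch of the split (the StageKind enum is illustrative, not the parser's real types):

```rust
// Illustrative classifier: algorithm names first, then `name:arg`
// predicates, then bare-word generators like `all`.
#[derive(Debug, PartialEq)]
enum StageKind {
    Algorithm(String),         // e.g. "spread", "spectral,k=20"
    Predicate(String, String), // e.g. ("type", "episodic")
    Generator(String),         // e.g. "all"
}

const ALGORITHMS: &[&str] = &["spread", "spectral", "confluence", "geodesic", "manifold"];

fn classify(token: &str) -> StageKind {
    // Strip ,opts before checking the algorithm table.
    let name = token.split(',').next().unwrap_or(token);
    if ALGORITHMS.contains(&name) {
        StageKind::Algorithm(token.to_string())
    } else if let Some((n, a)) = token.split_once(':') {
        StageKind::Predicate(n.to_string(), a.to_string())
    } else {
        StageKind::Generator(token.to_string())
    }
}

fn parse_pipeline(s: &str) -> Vec<StageKind> {
    s.split('|').map(|t| classify(t.trim())).collect()
}

fn main() {
    let stages = parse_pipeline("all | type:episodic | spectral,k=20");
    assert_eq!(stages[0], StageKind::Generator("all".into()));
    assert_eq!(stages[1], StageKind::Predicate("type".into(), "episodic".into()));
    assert_eq!(stages[2], StageKind::Algorithm("spectral,k=20".into()));
}
```

Note that `not-visited:linker,7d` works because the algorithm-name check strips the `,7d` suffix before the lookup fails, falling through to the predicate branch.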

Agent definitions

A TOML file in ~/.claude/memory/agents/:

# agents/linker.toml
[query]
pipeline = "all | type:episodic | not-visited:linker,7d | sort:priority | limit:20"

[prompt]
template = "linker.md"
placeholders = ["TOPOLOGY", "NODES"]

[execution]
model = "sonnet"
actions = ["link-add", "weight"]  # allowed poc-memory actions in response
schedule = "daily"                 # or "on-demand"

The daemon reads agent definitions, executes their queries, fills templates, calls the model, records visits on success.
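On the Rust side this could deserialize into something shaped like the following. The field layout is hypothetical, mirroring the TOML sections above; real code would likely derive serde Deserialize and parse with the toml crate:

```rust
// Hypothetical in-memory shape of an agent definition; field names
// mirror the [query]/[prompt]/[execution] TOML sections.
#[derive(Debug)]
struct AgentDef {
    query: Query,
    prompt: Prompt,
    execution: Execution,
}

#[derive(Debug)]
struct Query {
    pipeline: String,
}

#[derive(Debug)]
struct Prompt {
    template: String,
    placeholders: Vec<String>,
}

#[derive(Debug)]
struct Execution {
    model: String,
    actions: Vec<String>,
    schedule: String,
}

fn main() {
    let linker = AgentDef {
        query: Query {
            pipeline: "all | type:episodic | not-visited:linker,7d | sort:priority | limit:20".into(),
        },
        prompt: Prompt {
            template: "linker.md".into(),
            placeholders: vec!["TOPOLOGY".into(), "NODES".into()],
        },
        execution: Execution {
            model: "sonnet".into(),
            actions: vec!["link-add".into(), "weight".into()],
            schedule: "daily".into(),
        },
    };
    assert_eq!(linker.execution.schedule, "daily");
}
```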

Implementation Plan

Phase 1: Filter stages in pipeline

Add to search.rs:

enum Stage {
    Generator(Generator),
    Filter(Filter),
    Transform(Transform),
    Algorithm(Algorithm),  // existing
}

enum Generator {
    All,
    Match(Vec<String>),  // current seed extraction
}

enum Filter {
    Type(NodeType),
    KeyGlob(String),
    KeyLen(Comparison),
    Weight(Comparison),
    Age(Comparison),       // vs now - timestamp
    ContentLen(Comparison),
    Provenance(Provenance),
    NotVisited { agent: String, duration: Duration },
    Visited { agent: String },
    Community(u32),
}

enum Transform {
    Sort(SortField),
    Limit(usize),
}

enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

enum SortField {
    Priority,
    Timestamp,
    ContentLen,
    Degree,
    Weight,
}
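Parsing the numeric predicate arguments (weight:>0.5, key-len:>=60) into Comparison is mechanical; one sketch, which must try two-character operators before their one-character prefixes:

```rust
// Illustrative parser for Comparison arguments like ">=60" or ">0.5".
#[derive(Debug, PartialEq)]
enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

fn parse_comparison(s: &str) -> Option<Comparison> {
    // Two-char operators first, or ">=" would parse as Gt("=60").
    let (make, rest): (fn(f64) -> Comparison, &str) = if let Some(r) = s.strip_prefix(">=") {
        (Comparison::Gte, r)
    } else if let Some(r) = s.strip_prefix("<=") {
        (Comparison::Lte, r)
    } else if let Some(r) = s.strip_prefix('>') {
        (Comparison::Gt, r)
    } else if let Some(r) = s.strip_prefix('<') {
        (Comparison::Lt, r)
    } else {
        // Bare values and "=v" both mean equality.
        (Comparison::Eq, s.strip_prefix('=').unwrap_or(s))
    };
    rest.trim().parse::<f64>().ok().map(make)
}

fn main() {
    assert_eq!(parse_comparison(">=60"), Some(Comparison::Gte(60.0)));
    assert_eq!(parse_comparison(">0.5"), Some(Comparison::Gt(0.5)));
    assert_eq!(parse_comparison("abc"), None);
}
```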

The pipeline runner checks stage type:

  • Generator: ignores input, produces new result set
  • Filter: keeps items matching predicate, preserves scores
  • Transform: reorders or truncates
  • Algorithm: existing graph exploration (needs Graph)

Filter/Transform stages need access to the Store (for node fields) and VisitIndex (for visit predicates). The StoreView trait already provides node access; extend it for visits.
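Filter application itself is a retain-with-lookup that preserves scores. A sketch, where NodeMeta and the lookup closure stand in for real Store access:

```rust
// Illustrative: a filter keeps matching keys and preserves scores.
struct NodeMeta {
    node_type: &'static str,
    weight: f64,
}

enum Filter {
    Type(&'static str),
    WeightGt(f64),
}

fn apply_filter<F>(results: Vec<(String, f64)>, filter: &Filter, lookup: F) -> Vec<(String, f64)>
where
    F: Fn(&str) -> Option<NodeMeta>,
{
    results
        .into_iter()
        .filter(|(key, _score)| match lookup(key) {
            Some(meta) => match filter {
                Filter::Type(t) => meta.node_type == *t,
                Filter::WeightGt(w) => meta.weight > *w,
            },
            None => false, // unknown keys drop out
        })
        .collect()
}

fn main() {
    let lookup = |key: &str| match key {
        "e1" => Some(NodeMeta { node_type: "episodic", weight: 0.9 }),
        "s1" => Some(NodeMeta { node_type: "semantic", weight: 0.4 }),
        _ => None,
    };
    let results = vec![("e1".to_string(), 0.8), ("s1".to_string(), 0.6)];
    let out = apply_filter(results, &Filter::Type("episodic"), lookup);
    assert_eq!(out, vec![("e1".to_string(), 0.8)]);
}
```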

Phase 2: Agent-as-config

Parse TOML agent definitions. The daemon:

  1. Reads agents/*.toml
  2. For each with schedule = "daily", checks if query results have been visited recently enough
  3. If stale, executes: parse pipeline → run query → format nodes → fill template → call model → parse actions → record visits

Hot reload: watch the agents directory, pick up changes without restart.

Phase 3: Retire hardcoded agents

Migrate each hardcoded agent (replay, linker, separator, transfer, rename, split) to a TOML definition. Remove the match arms from agent_prompt(). The separator agent is the trickiest — its "interference pair" selection is a join-like operation that may need a custom generator stage rather than simple filtering.

What we're NOT building

  • A general-purpose SQL engine. No joins, no GROUP BY, no subqueries.
  • Persistent indices. At ~13k nodes, full scan with predicate evaluation is fast enough (~1ms). Add indices later if profiling demands it.
  • A query optimizer. Pipeline stages execute in declaration order.

StoreView Considerations

The existing StoreView trait only exposes (key, content, weight). Filter stages need access to node_type, timestamp, key, etc.

Options:

  • (a) Expand StoreView with node_meta() returning a lightweight struct
  • (b) Filter stages require &Store directly (not trait-polymorphic)
  • (c) Add fn node(&self, key: &str) -> Option<NodeRef> to StoreView

Option (b) is simplest for now — agents always use a full Store. The search hook (MmapView path) doesn't need agent filters. We can generalize to (c) later if MmapView needs filter support.

For Phase 1, filter stages take &Store and the pipeline runner dispatches: algorithm stages use &dyn StoreView, filter/transform stages use &Store. This keeps the fast MmapView path for interactive search untouched.

Open Questions

  1. Separator agent: Its "interference pairs" selection doesn't fit the filter model cleanly. The best option is likely a custom generator stage (interference-pairs,min_sim=0.5) that produces pair keys.

  2. Priority scoring: sort:priority calls consolidation_priority() which needs graph + spectral. This is a transform that needs the full pipeline context — treat it as a "heavy sort" that's allowed to compute.

  3. Duration syntax: 7d, 24h, 30m. Parse with simple regex (\d+)(d|h|m) → seconds.

  4. Negation: Prefix ! on predicate: !type:episodic.

  5. Backwards compatibility: Current -p spread syntax must keep working. The parser tries algorithm names first, then predicate syntax. No ambiguity since algorithms are bare words and predicates use :.

  6. Stage ordering: Generators must come first (or the pipeline starts with implicit "all"). Filters/transforms can interleave freely with algorithms. The runner validates this at parse time.
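The duration grammar from question 3 is small enough that a split_at parser covers it; a sketch (the real code may prefer the regex as written):

```rust
// Illustrative parser for durations like "7d", "24h", "30m" -> seconds.
fn parse_duration(s: &str) -> Option<u64> {
    // Need at least one digit plus a one-byte ASCII unit suffix.
    if s.len() < 2 || !s.is_ascii() {
        return None;
    }
    let (num, unit) = s.split_at(s.len() - 1);
    let n: u64 = num.parse().ok()?;
    match unit {
        "d" => Some(n * 86_400),
        "h" => Some(n * 3_600),
        "m" => Some(n * 60),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_duration("7d"), Some(604_800));
    assert_eq!(parse_duration("24h"), Some(86_400));
    assert_eq!(parse_duration("30m"), Some(1_800));
    assert_eq!(parse_duration("7x"), None);
}
```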