# Query Language Design — Unifying Search and Agent Selection
Date: 2026-03-10
Status: Phase 1 complete (2026-03-10)
## Problem
Agent node selection is hardcoded in Rust (`prompts.rs`). Adding a new
agent means editing Rust, recompiling, restarting the daemon. The
existing search pipeline (spread, spectral, etc.) handles graph
exploration but can't express structured predicates on node fields.
We need one system that handles both:
- **Search**: "find nodes related to these terms" (graph exploration)
- **Selection**: "give me episodic nodes not seen by linker in 7 days,
sorted by priority" (structured predicates)
## Design Principle
The pipeline already exists: stages compose left-to-right, each
transforming a result set. We extend it with predicate stages that
filter/sort on node metadata, alongside the existing graph algorithm
stages.
An agent definition becomes a query expression + prompt template.
The daemon scheduler is just "which queries have stale results."
## Current Pipeline
```
seeds → [stage1] → [stage2] → ... → results
```
Each stage takes `Vec<(String, f64)>` (key, score) and returns the same.
Stages are parsed from strings: `spread,max_hops=4` or `spectral,k=20`.
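The `name,key=value` stage grammar decomposes with a small helper. A sketch (the function name is illustrative, not the actual parser in `search.rs`):

```rust
// Split a stage spec like `spread,max_hops=4` into a name plus key=value
// params. Params without `=` are silently dropped in this sketch.
fn parse_stage(spec: &str) -> (&str, Vec<(&str, &str)>) {
    let mut parts = spec.split(',');
    let name = parts.next().unwrap_or("");
    let params = parts.filter_map(|p| p.split_once('=')).collect();
    (name, params)
}

fn main() {
    let (name, params) = parse_stage("spread,max_hops=4");
    assert_eq!(name, "spread");
    assert_eq!(params, vec![("max_hops", "4")]);
    println!("ok");
}
```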
## Proposed Extension
### Stage kinds
**Generators** — produce a result set from nothing (or from the store):
```
all # every non-deleted node
match:btree # text match (current seed extraction)
```
**Filters** — narrow an existing result set:
```
type:episodic # node_type == EpisodicSession
type:semantic # node_type == Semantic
key:journal#j-* # glob match on key
key-len:>=60 # key length predicate
weight:>0.5 # numeric comparison
age:<7d # created/modified within duration
content-len:>1000 # content size filter
provenance:manual # provenance match
not-visited:linker,7d # not seen by agent in duration
visited:linker # HAS been seen by agent (for auditing)
community:42 # community membership
```
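The `key:` glob only needs `*` support. A minimal matcher sketch, assuming `*` means "any run of characters" and keys are ASCII:

```rust
// Match a pattern like `journal#j-*` against a node key.
// `*` matches any (possibly empty) run of characters.
fn glob_match(pattern: &str, key: &str) -> bool {
    match pattern.split_once('*') {
        None => pattern == key,
        Some((prefix, rest)) => {
            if !key.starts_with(prefix) {
                return false;
            }
            let tail = &key[prefix.len()..];
            // Try every split point for the rest of the pattern.
            (0..=tail.len()).any(|i| glob_match(rest, &tail[i..]))
        }
    }
}

fn main() {
    assert!(glob_match("journal#j-*", "journal#j-2026-03-10"));
    assert!(!glob_match("journal#j-*", "note#n-1"));
    assert!(glob_match("exact", "exact"));
    println!("ok");
}
```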
**Transforms** — reorder or reshape:
```
sort:priority # consolidation priority scoring
sort:timestamp # by timestamp (desc by default)
sort:content-len # by content size
sort:degree # by graph degree
sort:weight # by weight
limit:20 # truncate
```
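A sketch of how `sort:weight` (descending by default, as with `sort:timestamp`) composes with `limit:` over the pipeline's `(key, score)` pairs:

```rust
use std::cmp::Ordering;

// Descending sort by score, then truncate. NaN scores compare as "equal"
// rather than panicking.
fn sort_weight_limit(mut results: Vec<(String, f64)>, limit: usize) -> Vec<(String, f64)> {
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    results.truncate(limit);
    results
}

fn main() {
    let r = vec![
        ("a".to_string(), 0.2),
        ("b".to_string(), 0.9),
        ("c".to_string(), 0.5),
    ];
    let top = sort_weight_limit(r, 2);
    assert_eq!(top, vec![("b".to_string(), 0.9), ("c".to_string(), 0.5)]);
    println!("ok");
}
```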
**Graph algorithms** (existing, unchanged):
```
spread # spreading activation
spectral,k=20 # spectral nearest neighbors
confluence # multi-source reachability
geodesic # straightest spectral paths
manifold # extrapolation along seed direction
```
### Syntax
Pipe-separated stages, same as current `-p` flag:
```
all | type:episodic | not-visited:linker,7d | sort:priority | limit:20
```
Or on the command line:
```
poc-memory search -p all -p type:episodic -p not-visited:linker,7d -p sort:priority -p limit:20
```
Current search still works unchanged:
```
poc-memory search btree journal -p spread
```
(terms become `match:` seeds implicitly)
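Splitting the pipe syntax into stage specs is a trivial tokenizer. A sketch, assuming whitespace around `|` is insignificant:

```rust
// Split a pipeline string on `|`, trimming whitespace, so
// `a | b` and `a|b` are equivalent.
fn split_pipeline(query: &str) -> Vec<&str> {
    query
        .split('|')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    let stages = split_pipeline("all | type:episodic | limit:20");
    assert_eq!(stages, vec!["all", "type:episodic", "limit:20"]);
    println!("ok");
}
```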
### Agent definitions
A TOML file in `~/.claude/memory/agents/`:
```toml
# agents/linker.toml
[query]
pipeline = "all | type:episodic | not-visited:linker,7d | sort:priority | limit:20"
[prompt]
template = "linker.md"
placeholders = ["TOPOLOGY", "NODES"]
[execution]
model = "sonnet"
actions = ["link-add", "weight"] # allowed poc-memory actions in response
schedule = "daily" # or "on-demand"
```
The daemon reads agent definitions, executes their queries, fills
templates, calls the model, records visits on success.
### Implementation Plan
#### Phase 1: Filter stages in pipeline
Add to `search.rs`:
```rust
enum Stage {
    Generator(Generator),
    Filter(Filter),
    Transform(Transform),
    Algorithm(Algorithm), // existing
}

enum Generator {
    All,
    Match(Vec<String>), // current seed extraction
}

enum Filter {
    Type(NodeType),
    KeyGlob(String),
    KeyLen(Comparison),
    Weight(Comparison),
    Age(Comparison), // vs now - timestamp
    ContentLen(Comparison),
    Provenance(Provenance),
    NotVisited { agent: String, duration: Duration },
    Visited { agent: String },
    Community(u32),
}

enum Transform {
    Sort(SortField),
    Limit(usize),
}

enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

enum SortField {
    Priority,
    Timestamp,
    ContentLen,
    Degree,
    Weight,
}
```
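The `Comparison` enum pairs naturally with a small parser for the `>0.5` / `>=60` syntax. A self-contained sketch (not the real implementation; the enum is repeated so the example compiles on its own):

```rust
#[derive(Debug, PartialEq)]
enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

// Two-character operators must be tried before their one-character prefixes.
fn parse_cmp(s: &str) -> Option<Comparison> {
    if let Some(v) = s.strip_prefix(">=") { return v.parse().ok().map(Comparison::Gte); }
    if let Some(v) = s.strip_prefix("<=") { return v.parse().ok().map(Comparison::Lte); }
    if let Some(v) = s.strip_prefix('>') { return v.parse().ok().map(Comparison::Gt); }
    if let Some(v) = s.strip_prefix('<') { return v.parse().ok().map(Comparison::Lt); }
    s.parse().ok().map(Comparison::Eq)
}

fn eval(c: &Comparison, x: f64) -> bool {
    match *c {
        Comparison::Gt(v) => x > v,
        Comparison::Gte(v) => x >= v,
        Comparison::Lt(v) => x < v,
        Comparison::Lte(v) => x <= v,
        Comparison::Eq(v) => x == v,
    }
}

fn main() {
    assert_eq!(parse_cmp(">=60"), Some(Comparison::Gte(60.0)));
    assert!(eval(&parse_cmp(">0.5").unwrap(), 0.7));
    assert!(!eval(&parse_cmp("<7").unwrap(), 9.0));
    println!("ok");
}
```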
The pipeline runner checks stage type:
- Generator: ignores input, produces new result set
- Filter: keeps items matching predicate, preserves scores
- Transform: reorders or truncates
- Algorithm: existing graph exploration (needs Graph)
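The dispatch can be illustrated with simplified stand-ins for the real stage payloads (the variants below are illustrative, not the actual enum):

```rust
type Results = Vec<(String, f64)>;

// Simplified stages: one generator, one filter, one transform.
enum Stage {
    All,            // generator: produce every node
    MinWeight(f64), // filter: stand-in for a weight predicate
    Limit(usize),   // transform: truncate
}

fn run(stages: &[Stage], store: &[(String, f64)]) -> Results {
    let mut results: Results = Vec::new();
    for stage in stages {
        match stage {
            // Generator: ignores input, produces a fresh result set.
            Stage::All => results = store.to_vec(),
            // Filter: keeps matching items, preserves scores.
            Stage::MinWeight(w) => results.retain(|(_, s)| s > w),
            // Transform: reorders or truncates.
            Stage::Limit(n) => results.truncate(*n),
        }
    }
    results
}

fn main() {
    let store = vec![("a".into(), 0.9), ("b".into(), 0.2), ("c".into(), 0.7)];
    let out = run(&[Stage::All, Stage::MinWeight(0.5), Stage::Limit(1)], &store);
    assert_eq!(out, vec![("a".to_string(), 0.9)]);
    println!("ok");
}
```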
Filter/Transform stages need access to the Store (for node fields)
and VisitIndex (for visit predicates). The `StoreView` trait already
provides node access; extend it for visits.
#### Phase 2: Agent-as-config
Parse TOML agent definitions. The daemon:
1. Reads `agents/*.toml`
2. For each with `schedule = "daily"`, checks if query results have
been visited recently enough
3. If stale, executes: parse pipeline → run query → format nodes →
fill template → call model → parse actions → record visits
Hot reload: watch the agents directory, pick up changes without restart.
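Step 2's staleness check reduces to comparing the time since the last successful run against the schedule's period. A sketch with hypothetical names:

```rust
// An agent is due when its last successful run is older than its schedule
// period. `on-demand` agents never trigger from the scheduler.
fn is_stale(last_run_secs_ago: u64, schedule: &str) -> bool {
    let period = match schedule {
        "daily" => 86_400,
        "on-demand" => return false,
        _ => return false, // unknown schedule: do nothing rather than spam
    };
    last_run_secs_ago >= period
}

fn main() {
    assert!(is_stale(90_000, "daily"));
    assert!(!is_stale(3_600, "daily"));
    assert!(!is_stale(1_000_000, "on-demand"));
    println!("ok");
}
```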
#### Phase 3: Retire hardcoded agents
Migrate each hardcoded agent (replay, linker, separator, transfer,
rename, split) to a TOML definition. Remove the match arms from
`agent_prompt()`. The separator agent is the trickiest — its
"interference pair" selection is a join-like operation that may need
a custom generator stage rather than simple filtering.
## What we're NOT building
- A general-purpose SQL engine. No joins, no GROUP BY, no subqueries.
- Persistent indices. At ~13k nodes, full scan with predicate evaluation
is fast enough (~1ms). Add indices later if profiling demands it.
- A query optimizer. Pipeline stages execute in declaration order.
## StoreView Considerations
The existing `StoreView` trait only exposes `(key, content, weight)`.
Filter stages need access to `node_type`, `timestamp`, `key`, etc.
Options:
- (a) Expand StoreView with `node_meta()` returning a lightweight struct
- (b) Filter stages require `&Store` directly (not trait-polymorphic)
- (c) Add `fn node(&self, key: &str) -> Option<NodeRef>` to StoreView
Option (b) is simplest for now — agents always use a full Store. The
search hook (MmapView path) doesn't need agent filters. We can
generalize to (c) later if MmapView needs filter support.
For Phase 1, filter stages take `&Store` and the pipeline runner
dispatches: algorithm stages use `&dyn StoreView`, filter/transform
stages use `&Store`. This keeps the fast MmapView path for interactive
search untouched.
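For reference, option (a) might look like the following sketch. The field set and names are illustrative, not the real schema:

```rust
// Hypothetical lightweight metadata struct returned by an extended StoreView.
#[derive(Clone, Debug, PartialEq)]
struct NodeMeta {
    key: String,
    node_type: String,
    timestamp: u64,
    weight: f64,
    content_len: usize,
}

trait StoreView {
    fn node_meta(&self, key: &str) -> Option<NodeMeta>;
}

// Toy backing store for the sketch.
struct InMemoryStore(Vec<NodeMeta>);

impl StoreView for InMemoryStore {
    fn node_meta(&self, key: &str) -> Option<NodeMeta> {
        self.0.iter().find(|m| m.key == key).cloned()
    }
}

fn main() {
    let store = InMemoryStore(vec![NodeMeta {
        key: "journal#j-1".into(),
        node_type: "episodic".into(),
        timestamp: 0,
        weight: 0.8,
        content_len: 1200,
    }]);
    let meta = store.node_meta("journal#j-1").unwrap();
    assert_eq!(meta.node_type, "episodic");
    assert!(store.node_meta("missing").is_none());
    println!("ok");
}
```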
## Open Questions
1. **Separator agent**: Its "interference pairs" selection doesn't fit
the filter model cleanly. Best option is a custom generator stage
`interference-pairs,min_sim=0.5` that produces pair keys.
2. **Priority scoring**: `sort:priority` calls `consolidation_priority()`
which needs graph + spectral. This is a transform that needs the
full pipeline context — treat it as a "heavy sort" that's allowed
to compute.
3. **Duration syntax**: `7d`, `24h`, `30m`. Parse with simple regex
`(\d+)(d|h|m)` → seconds.
4. **Negation**: Prefix `!` on predicate: `!type:episodic`.
5. **Backwards compatibility**: Current `-p spread` syntax must keep
working. The parser tries algorithm names first, then predicate
syntax. No ambiguity since algorithms are bare words and predicates
use `:`.
6. **Stage ordering**: Generators must come first (or the pipeline
starts with implicit "all"). Filters/transforms can interleave
freely with algorithms. The runner validates this at parse time.