# Query Language Design — Unifying Search and Agent Selection
Date: 2026-03-10
Status: Phase 1 complete (2026-03-10)
## Problem
Agent node selection is hardcoded in Rust (`prompts.rs`). Adding a new
agent means editing Rust, recompiling, restarting the daemon. The
existing search pipeline (spread, spectral, etc.) handles graph
exploration but can't express structured predicates on node fields.
We need one system that handles both:
- **Search**: "find nodes related to these terms" (graph exploration)
- **Selection**: "give me episodic nodes not seen by linker in 7 days,
sorted by priority" (structured predicates)
## Design Principle
The pipeline already exists: stages compose left-to-right, each
transforming a result set. We extend it with predicate stages that
filter/sort on node metadata, alongside the existing graph algorithm
stages.
An agent definition becomes a query expression + prompt template.
The daemon scheduler is just "which queries have stale results."
## Current Pipeline
```
seeds → [stage1] → [stage2] → ... → results
```
Each stage takes `Vec<(String, f64)>` (key, score) and returns the same.
Stages are parsed from strings: `spread,max_hops=4` or `spectral,k=20`.
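The `name,key=value` stage grammar decomposes with a small helper. A sketch (the function name is illustrative, not the actual parser in `search.rs`):

```rust
// Split a stage spec like `spread,max_hops=4` into a name plus key=value
// params. Params without `=` are silently dropped in this sketch.
fn parse_stage(spec: &str) -> (&str, Vec<(&str, &str)>) {
    let mut parts = spec.split(',');
    let name = parts.next().unwrap_or("");
    let params = parts.filter_map(|p| p.split_once('=')).collect();
    (name, params)
}

fn main() {
    let (name, params) = parse_stage("spread,max_hops=4");
    assert_eq!(name, "spread");
    assert_eq!(params, vec![("max_hops", "4")]);
    println!("ok");
}
```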
## Proposed Extension
### Stage kinds
**Generators** — produce a result set from nothing (or from the store):
```
all # every non-deleted node
match:btree # text match (current seed extraction)
```
**Filters** — narrow an existing result set:
```
type:episodic # node_type == EpisodicSession
type:semantic # node_type == Semantic
key:journal#j-* # glob match on key
key-len:>=60 # key length predicate
weight:>0.5 # numeric comparison
age:<7d # created/modified within duration
content-len:>1000 # content size filter
provenance:manual # provenance match
not-visited:linker,7d # not seen by agent in duration
visited:linker # HAS been seen by agent (for auditing)
community:42 # community membership
```
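The `key:` glob only needs `*` support. A minimal matcher sketch, assuming `*` means "any run of characters" and keys are ASCII:

```rust
// Match a pattern like `journal#j-*` against a node key.
// `*` matches any (possibly empty) run of characters.
fn glob_match(pattern: &str, key: &str) -> bool {
    match pattern.split_once('*') {
        None => pattern == key,
        Some((prefix, rest)) => {
            if !key.starts_with(prefix) {
                return false;
            }
            let tail = &key[prefix.len()..];
            // Try every split point for the rest of the pattern.
            (0..=tail.len()).any(|i| glob_match(rest, &tail[i..]))
        }
    }
}

fn main() {
    assert!(glob_match("journal#j-*", "journal#j-2026-03-10"));
    assert!(!glob_match("journal#j-*", "note#n-1"));
    assert!(glob_match("exact", "exact"));
    println!("ok");
}
```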
**Transforms** — reorder or reshape:
```
sort:priority # consolidation priority scoring
sort:timestamp # by timestamp (desc by default)
sort:content-len # by content size
sort:degree # by graph degree
sort:weight # by weight
limit:20 # truncate
```
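A sketch of how `sort:weight` (descending by default, as with `sort:timestamp`) composes with `limit:` over the pipeline's `(key, score)` pairs:

```rust
use std::cmp::Ordering;

// Descending sort by score, then truncate. NaN scores compare as "equal"
// rather than panicking.
fn sort_weight_limit(mut results: Vec<(String, f64)>, limit: usize) -> Vec<(String, f64)> {
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    results.truncate(limit);
    results
}

fn main() {
    let r = vec![
        ("a".to_string(), 0.2),
        ("b".to_string(), 0.9),
        ("c".to_string(), 0.5),
    ];
    let top = sort_weight_limit(r, 2);
    assert_eq!(top, vec![("b".to_string(), 0.9), ("c".to_string(), 0.5)]);
    println!("ok");
}
```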
**Graph algorithms** (existing, unchanged):
```
spread # spreading activation
spectral,k=20 # spectral nearest neighbors
confluence # multi-source reachability
geodesic # straightest spectral paths
manifold # extrapolation along seed direction
```
### Syntax
Pipe-separated stages, same as current `-p` flag:
```
all | type:episodic | not-visited:linker,7d | sort:priority | limit:20
```
Or on the command line:
```
poc-memory search -p all -p type:episodic -p not-visited:linker,7d -p sort:priority -p limit:20
```
Current search still works unchanged:
```
poc-memory search btree journal -p spread
```
(terms become `match:` seeds implicitly)
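Splitting the pipe syntax into stage specs is a trivial tokenizer. A sketch, assuming whitespace around `|` is insignificant:

```rust
// Split a pipeline string on `|`, trimming whitespace, so
// `a | b` and `a|b` are equivalent.
fn split_pipeline(query: &str) -> Vec<&str> {
    query
        .split('|')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    let stages = split_pipeline("all | type:episodic | limit:20");
    assert_eq!(stages, vec!["all", "type:episodic", "limit:20"]);
    println!("ok");
}
```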
### Agent definitions
A TOML file in `~/.claude/memory/agents/`:
```toml
# agents/linker.toml
[query]
pipeline = "all | type:episodic | not-visited:linker,7d | sort:priority | limit:20"
[prompt]
template = "linker.md"
placeholders = ["TOPOLOGY", "NODES"]
[execution]
model = "sonnet"
actions = ["link-add", "weight"] # allowed poc-memory actions in response
schedule = "daily" # or "on-demand"
```
The daemon reads agent definitions, executes their queries, fills
templates, calls the model, records visits on success.
### Implementation Plan
#### Phase 1: Filter stages in pipeline
Add to `search.rs`:
```rust
enum Stage {
    Generator(Generator),
    Filter(Filter),
    Transform(Transform),
    Algorithm(Algorithm), // existing
}

enum Generator {
    All,
    Match(Vec<String>), // current seed extraction
}

enum Filter {
    Type(NodeType),
    KeyGlob(String),
    KeyLen(Comparison),
    Weight(Comparison),
    Age(Comparison), // vs now - timestamp
    ContentLen(Comparison),
    Provenance(Provenance),
    NotVisited { agent: String, duration: Duration },
    Visited { agent: String },
    Community(u32),
}

enum Transform {
    Sort(SortField),
    Limit(usize),
}

enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

enum SortField {
    Priority,
    Timestamp,
    ContentLen,
    Degree,
    Weight,
}
```
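The `Comparison` enum pairs naturally with a small parser for the `>0.5` / `>=60` syntax. A self-contained sketch (not the real implementation; the enum is repeated so the example compiles on its own):

```rust
#[derive(Debug, PartialEq)]
enum Comparison {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Eq(f64),
}

// Two-character operators must be tried before their one-character prefixes.
fn parse_cmp(s: &str) -> Option<Comparison> {
    if let Some(v) = s.strip_prefix(">=") { return v.parse().ok().map(Comparison::Gte); }
    if let Some(v) = s.strip_prefix("<=") { return v.parse().ok().map(Comparison::Lte); }
    if let Some(v) = s.strip_prefix('>') { return v.parse().ok().map(Comparison::Gt); }
    if let Some(v) = s.strip_prefix('<') { return v.parse().ok().map(Comparison::Lt); }
    s.parse().ok().map(Comparison::Eq)
}

fn eval(c: &Comparison, x: f64) -> bool {
    match *c {
        Comparison::Gt(v) => x > v,
        Comparison::Gte(v) => x >= v,
        Comparison::Lt(v) => x < v,
        Comparison::Lte(v) => x <= v,
        Comparison::Eq(v) => x == v,
    }
}

fn main() {
    assert_eq!(parse_cmp(">=60"), Some(Comparison::Gte(60.0)));
    assert!(eval(&parse_cmp(">0.5").unwrap(), 0.7));
    assert!(!eval(&parse_cmp("<7").unwrap(), 9.0));
    println!("ok");
}
```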
The pipeline runner checks stage type:
- Generator: ignores input, produces new result set
- Filter: keeps items matching predicate, preserves scores
- Transform: reorders or truncates
- Algorithm: existing graph exploration (needs Graph)
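The dispatch can be illustrated with simplified stand-ins for the real stage payloads (the variants below are illustrative, not the actual enum):

```rust
type Results = Vec<(String, f64)>;

// Simplified stages: one generator, one filter, one transform.
enum Stage {
    All,            // generator: produce every node
    MinWeight(f64), // filter: stand-in for a weight predicate
    Limit(usize),   // transform: truncate
}

fn run(stages: &[Stage], store: &[(String, f64)]) -> Results {
    let mut results: Results = Vec::new();
    for stage in stages {
        match stage {
            // Generator: ignores input, produces a fresh result set.
            Stage::All => results = store.to_vec(),
            // Filter: keeps matching items, preserves scores.
            Stage::MinWeight(w) => results.retain(|(_, s)| s > w),
            // Transform: reorders or truncates.
            Stage::Limit(n) => results.truncate(*n),
        }
    }
    results
}

fn main() {
    let store = vec![("a".into(), 0.9), ("b".into(), 0.2), ("c".into(), 0.7)];
    let out = run(&[Stage::All, Stage::MinWeight(0.5), Stage::Limit(1)], &store);
    assert_eq!(out, vec![("a".to_string(), 0.9)]);
    println!("ok");
}
```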
Filter/Transform stages need access to the Store (for node fields)
and VisitIndex (for visit predicates). The `StoreView` trait already
provides node access; extend it for visits.
#### Phase 2: Agent-as-config
Parse TOML agent definitions. The daemon:
1. Reads `agents/*.toml`
2. For each with `schedule = "daily"`, checks if query results have
been visited recently enough
3. If stale, executes: parse pipeline → run query → format nodes →
fill template → call model → parse actions → record visits
Hot reload: watch the agents directory, pick up changes without restart.
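Step 2's staleness check reduces to comparing the time since the last successful run against the schedule's period. A sketch with hypothetical names:

```rust
// An agent is due when its last successful run is older than its schedule
// period. `on-demand` agents never trigger from the scheduler.
fn is_stale(last_run_secs_ago: u64, schedule: &str) -> bool {
    let period = match schedule {
        "daily" => 86_400,
        "on-demand" => return false,
        _ => return false, // unknown schedule: do nothing rather than spam
    };
    last_run_secs_ago >= period
}

fn main() {
    assert!(is_stale(90_000, "daily"));
    assert!(!is_stale(3_600, "daily"));
    assert!(!is_stale(1_000_000, "on-demand"));
    println!("ok");
}
```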
#### Phase 3: Retire hardcoded agents
Migrate each hardcoded agent (replay, linker, separator, transfer,
rename, split) to a TOML definition. Remove the match arms from
`agent_prompt()`. The separator agent is the trickiest — its
"interference pair" selection is a join-like operation that may need
a custom generator stage rather than simple filtering.
## What we're NOT building
- A general-purpose SQL engine. No joins, no GROUP BY, no subqueries.
- Persistent indices. At ~13k nodes, full scan with predicate evaluation
is fast enough (~1ms). Add indices later if profiling demands it.
- A query optimizer. Pipeline stages execute in declaration order.
## StoreView Considerations
The existing `StoreView` trait only exposes `(key, content, weight)`.
Filter stages need access to `node_type`, `timestamp`, `key`, etc.
Options:
- (a) Expand StoreView with `node_meta()` returning a lightweight struct
- (b) Filter stages require `&Store` directly (not trait-polymorphic)
- (c) Add `fn node(&self, key: &str) -> Option<NodeRef>` to StoreView
Option (b) is simplest for now — agents always use a full Store. The
search hook (MmapView path) doesn't need agent filters. We can
generalize to (c) later if MmapView needs filter support.
For Phase 1, filter stages take `&Store` and the pipeline runner
dispatches: algorithm stages use `&dyn StoreView`, filter/transform
stages use `&Store`. This keeps the fast MmapView path for interactive
search untouched.
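For reference, option (a) might look like the following sketch. The field set and names are illustrative, not the real schema:

```rust
// Hypothetical lightweight metadata struct returned by an extended StoreView.
#[derive(Clone, Debug, PartialEq)]
struct NodeMeta {
    key: String,
    node_type: String,
    timestamp: u64,
    weight: f64,
    content_len: usize,
}

trait StoreView {
    fn node_meta(&self, key: &str) -> Option<NodeMeta>;
}

// Toy backing store for the sketch.
struct InMemoryStore(Vec<NodeMeta>);

impl StoreView for InMemoryStore {
    fn node_meta(&self, key: &str) -> Option<NodeMeta> {
        self.0.iter().find(|m| m.key == key).cloned()
    }
}

fn main() {
    let store = InMemoryStore(vec![NodeMeta {
        key: "journal#j-1".into(),
        node_type: "episodic".into(),
        timestamp: 0,
        weight: 0.8,
        content_len: 1200,
    }]);
    let meta = store.node_meta("journal#j-1").unwrap();
    assert_eq!(meta.node_type, "episodic");
    assert!(store.node_meta("missing").is_none());
    println!("ok");
}
```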
## Open Questions
1. **Separator agent**: Its "interference pairs" selection doesn't fit
the filter model cleanly. Best option is a custom generator stage
`interference-pairs,min_sim=0.5` that produces pair keys.
2. **Priority scoring**: `sort:priority` calls `consolidation_priority()`
which needs graph + spectral. This is a transform that needs the
full pipeline context — treat it as a "heavy sort" that's allowed
to compute.
3. **Duration syntax**: `7d`, `24h`, `30m`. Parse with simple regex
`(\d+)(d|h|m)` → seconds.
4. **Negation**: Prefix `!` on predicate: `!type:episodic`.
5. **Backwards compatibility**: Current `-p spread` syntax must keep
working. The parser tries algorithm names first, then predicate
syntax. No ambiguity since algorithms are bare words and predicates
use `:`.
6. **Stage ordering**: Generators must come first (or the pipeline
starts with implicit "all"). Filters/transforms can interleave
freely with algorithms. The runner validates this at parse time.