flatten: move poc-memory contents to workspace root

No more subcrate nesting: src/, agents/, schema/, defaults/, build.rs all live at the workspace root. poc-daemon remains as the only workspace member. Crate name (poc-memory) and all imports unchanged.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-03-25 00:54:12 -04:00
commit 998b71e52c
parent 891cca57f8
113 changed files with 79 additions and 78 deletions
--- a/.claude/analysis/2026-03-14-daemon-jobkit-survey.md
+++ b/.claude/analysis/2026-03-14-daemon-jobkit-survey.md
@@ -0,0 +1,202 @@
+# Daemon & Jobkit Architecture Survey
+_2026-03-14, autonomous survey while Kent debugs discard FIFO_
+
+## Current state
+
+daemon.rs is 1952 lines mixing three concerns:
+- ~400 lines: pure jobkit usage (spawn, depend_on, resource)
+- ~600 lines: logging/monitoring (log_event, status, RPC)
+- ~950 lines: job functions embedding business logic
+
+## What jobkit provides (good)
+
+- Worker pool with named workers
+- Dependency graph: `depend_on()` for ordering
+- Resource pools: `ResourcePool` for concurrency gating (LLM slots)
+- Retry logic: `retries(N)` on `TaskError::Retry`
+- Task status tracking: `choir.task_statuses()` → `Vec<TaskInfo>`
+- Cancellation: `ctx.is_cancelled()`
+
+## What jobkit is missing
+
+### 1. Structured logging (PRIORITY)
+- Currently dual-channel: `ctx.log_line()` (per-task) + `log_event()` (daemon JSONL)
+- No log levels, no structured context, no correlation IDs
+- Log rotation is naive (truncate at 1MB, keep second half)
+- Need: observability hooks that both human TUI and AI can consume
+
+### 2. Metrics (NONE EXIST)
+- No task duration histograms
+- No worker utilization tracking
+- No queue depth monitoring
+- No success/failure rates by type
+- No resource pool wait times
+
+### 3. Health monitoring
+- No watchdog timers
+- No health check hooks per job
+- No alerting on threshold violations
+- Health computed on-demand in daemon, not in jobkit
+
+### 4. RPC (ad-hoc in daemon, should be schematized)
+- Unix socket with string matching: `match cmd.as_str()`
+- No Cap'n Proto schema for daemon control
+- No versioning, no validation, no streaming
+
+## Architecture problems
+
+### Tangled concerns
+Job functions hardcode `log_event()` calls. Graph health is in daemon
+but uses domain-specific metrics. Store loading happens inside jobs
+(10 agent runs = 10 store loads). Not separable.
+
+### Magic numbers
+- Workers = `llm_concurrency + 3` (line 682)
+- 10 max new jobs per tick (line 770)
+- 300/1800s backoff range (lines 721-722)
+- 1MB log rotation (line 39)
+- 60s scheduler interval (line 24)
+None configurable.
+
+### Hardcoded pipeline DAG
+Daily pipeline phases are `depend_on()` chains in Rust code (lines
+1061-1109). Can't adjust without recompile. No visualization. No
+conditional skipping of phases.
+
+### Task naming is fragile
+Names used as both identifiers AND for parsing in TUI. Format varies
+(colons, dashes, dates). `task_group()` splits on '-' to categorize —
+brittle.
+
+### No persistent task queue
+Restart loses all pending tasks. Session watcher handles this via
+reconciliation (good), but scheduler uses `last_daily` date from file.
+
+## What works well
+
+1. **Reconciliation-based session discovery** — elegant, restart-resilient
+2. **Resource pooling** — LLM concurrency decoupled from worker count
+3. **Dependency-driven pipeline** — clean DAG via `depend_on()`
+4. **Retry with backoff** — exponential 5min→30min, resets on success
+5. **Graceful shutdown** — SIGINT/SIGTERM handled properly
+
+## Kent's design direction
+
+### Event stream, not log files
+One pipeline, multiple consumers. TUI renders for humans, AI consumes
+structured data. Same events, different renderers. Cap'n Proto streaming
+subscription: `subscribe(filter) -> stream<Event>`.
+
+"No one ever thinks further ahead than log files with monitoring and
+it's infuriating." — Kent
+
+### Extend jobkit, don't add a layer
+jobkit already has the scheduling and dependency graph. Don't create a
+new orchestration layer — add the missing pieces (logging, metrics,
+health, RPC) to jobkit itself.
+
+### Cap'n Proto for everything
+Standard RPC definitions for:
+- Status queries (what's running, pending, failed)
+- Control (start, stop, restart, queue)
+- Event streaming (subscribe with filter)
+- Health checks
+
+## The bigger picture: bcachefs as library
+
+Kent's monitoring system in bcachefs (event_inc/event_inc_trace + x-macro
+counters) is the real monitoring infrastructure. 1-1 correspondence between
+counters (cheap, always-on dashboard via `fs top`) and tracepoints (expensive
+detail, only runs when enabled). The x-macro enforces this — can't have one
+without the other.
+
+When the Rust conversion is complete, bcachefs becomes a library. At that
+point, jobkit doesn't need its own monitoring — it uses the same counter/
+tracepoint infrastructure. One observability system for everything.
+
+**Implication for now:** jobkit monitoring just needs to be good enough.
+JSON events, not typed. Don't over-engineer — the real infrastructure is
+coming from the Rust conversion.
+
+## Extraction: jobkit-daemon library (designed with Kent)
+
+### Goes to jobkit-daemon (generic)
+- JSONL event logging with size-based rotation
+- Unix domain socket server + signal handling
+- Status file writing (periodic JSON snapshot)
+- `run_job()` wrapper (logging + progress + error mapping)
+- Systemd service installation
+- Worker pool setup from config
+- Cap'n Proto RPC for control protocol
+
+### Stays in poc-memory (application)
+- All job functions (experience-mine, fact-mine, consolidation, etc.)
+- Session watcher, scheduler, RPC command handlers
+- GraphHealth, consolidation plan logic
+
+### Interface design
+- Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
+- JSON blob for status (inherently open-ended, every app has different
+  job types — typing this is the tracepoint mistake)
+- Application registers: RPC handlers, long-running tasks, job functions
+- ~50-100 lines of setup code, call `daemon.run()`
+
+## Plan of attack
+
+1. **Observability hooks in jobkit** — `on_task_start/progress/complete`
+   callbacks that consumers can subscribe to
+2. **Structured event type** — typed events with task ID, name, duration,
+   result, metadata. Not strings.
+3. **Metrics collection** — duration histograms, success rates, queue
+   depth. Built on the event stream.
+4. **Cap'n Proto daemon RPC schema** — replace ad-hoc socket protocol
+5. **TUI consumes event stream** — same data as AI consumer
+6. **Extract monitoring from daemon.rs** — the 600 lines of logging/status
+   become generic, reusable infrastructure
+7. **Declarative pipeline config** — DAG definition in config, not code
+
+## File reference
+
+- `src/agents/daemon.rs` — 1952 lines, all orchestration
+  - Job functions: 96-553
+  - run_daemon(): 678-1143
+  - Socket/RPC: 1145-1372
+  - Status display: 1374-1682
+- `src/tui.rs` — 907 lines, polls status socket every 2s
+- `schema/memory.capnp` — 125 lines, data only, no RPC definitions
+- `src/config.rs` — configuration loading
+- External: `jobkit` crate (git dependency)
+
+## Mistakes I made building this (learning notes)
+
+_Per Kent's instruction: note what went wrong and WHY._
+
+1. **Dual logging channels** — I added `log_event()` because `ctx.log_line()`
+   wasn't enough, instead of fixing the underlying abstraction. Symptom:
+   can't find a failed job without searching two places.
+
+2. **Magic numbers** — I hardcoded constants because "I'll make them
+   configurable later." Later never came. Every magic number is a design
+   decision that should have been explicit.
+
+3. **1952-line file** — daemon.rs grew organically because each new feature
+   was "just one more function." Should have extracted when it passed 500
+   lines. The pain of refactoring later is always worse than the pain of
+   organizing early.
+
+4. **Ad-hoc RPC** — String matching seemed fine for 2 commands. Now it's 4
+   commands and growing, with implicit formats. Should have used Cap'n Proto
+   from the start — the schema IS the documentation.
+
+5. **No tests** — Zero tests in daemon code. "It's a daemon, how do you test
+   it?" is not an excuse. The job functions are pure-ish and testable. The
+   scheduler logic is testable with a clock abstraction.
+
+6. **Not using systemd** — There's a systemd service for the daemon.
+   I keep starting it manually with `poc-memory agent daemon start` and
+   accumulating multiple instances. Tonight: 4 concurrent daemons, 32
+   cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for.
+   `systemctl --user start poc-memory-daemon`. ONE instance. Managed.
+
+Pattern: every shortcut was "just for now" and every "just for now" became
+permanent. Kent's yelling was right every time.
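
Learning notes 2 and 5 in the file above say that the backoff limits should have been explicit and that the scheduler logic is testable with a clock abstraction. A minimal sketch of what that could look like follows. It is not code from daemon.rs: `Clock`, `FakeClock`, and `Backoff` are hypothetical names, and doubling the delay on each failure is an assumption. The survey only says the backoff is exponential, ranges from 300s to 1800s, and resets on success.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical clock abstraction: time elapsed since some fixed origin.
pub trait Clock {
    fn now(&self) -> Duration;
}

/// Test clock that only moves when `advance` is called.
pub struct FakeClock {
    now: Cell<Duration>,
}

impl FakeClock {
    pub fn new() -> Self {
        FakeClock { now: Cell::new(Duration::ZERO) }
    }

    pub fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for FakeClock {
    fn now(&self) -> Duration {
        self.now.get()
    }
}

/// Retry backoff whose limits are constructor arguments, not inline constants.
pub struct Backoff {
    min: Duration,
    max: Duration,
    current: Duration,
    retry_at: Option<Duration>,
}

impl Backoff {
    pub fn new(min: Duration, max: Duration) -> Self {
        Backoff { min, max, current: min, retry_at: None }
    }

    /// Record a failure: block retries for `current`, then double the delay
    /// for the next failure, capped at `max` (doubling is an assumption).
    pub fn on_failure(&mut self, clock: &impl Clock) {
        self.retry_at = Some(clock.now() + self.current);
        self.current = (self.current * 2).min(self.max);
    }

    /// Record a success: clear any pending wait and reset the delay to `min`.
    pub fn on_success(&mut self) {
        self.retry_at = None;
        self.current = self.min;
    }

    /// True when no backoff is pending or the pending one has elapsed.
    pub fn ready(&self, clock: &impl Clock) -> bool {
        self.retry_at.map_or(true, |t| clock.now() >= t)
    }
}
```

Because the clock is injected, a test can cover a 30-minute backoff window without sleeping, and the 300s and 1800s values become configuration passed to `Backoff::new` rather than constants buried in the scheduler loop.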