# Daemon & Jobkit Architecture Survey

_2026-03-14, autonomous survey while Kent debugs discard FIFO_

## Current state

daemon.rs is 1952 lines mixing three concerns:

- ~400 lines: pure jobkit usage (spawn, depend_on, resource)
- ~600 lines: logging/monitoring (log_event, status, RPC)
- ~950 lines: job functions embedding business logic

## What jobkit provides (good)

- Worker pool with named workers
- Dependency graph: `depend_on()` for ordering
- Resource pools: `ResourcePool` for concurrency gating (LLM slots)
- Retry logic: `retries(N)` on `TaskError::Retry`
- Task status tracking: `choir.task_statuses()` → `Vec`
- Cancellation: `ctx.is_cancelled()`

## What jobkit is missing

### 1. Structured logging (PRIORITY)

- Currently dual-channel: `ctx.log_line()` (per-task) + `log_event()` (daemon JSONL)
- No log levels, no structured context, no correlation IDs
- Log rotation is naive (truncate at 1MB, keep second half)
- Need: observability hooks that both the human TUI and AI can consume

### 2. Metrics (NONE EXIST)

- No task duration histograms
- No worker utilization tracking
- No queue depth monitoring
- No success/failure rates by type
- No resource pool wait times

### 3. Health monitoring

- No watchdog timers
- No health check hooks per job
- No alerting on threshold violations
- Health computed on-demand in daemon, not in jobkit

### 4. RPC (ad-hoc in daemon, should be schematized)

- Unix socket with string matching: `match cmd.as_str()`
- No Cap'n Proto schema for daemon control
- No versioning, no validation, no streaming

## Architecture problems

### Tangled concerns

Job functions hardcode `log_event()` calls. Graph health lives in the daemon but uses domain-specific metrics. Store loading happens inside jobs (10 agent runs = 10 store loads). Not separable.
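As a sketch of what the missing structured logging could look like: one typed event with a level and a correlation ID that serializes to a single JSONL line, so one channel serves both the TUI and an AI consumer. `TaskEvent`, `Level`, and the field names are my assumptions for illustration, not an existing jobkit type.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical event type -- illustrative only, not jobkit's actual API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Level { Debug, Info, Warn, Error }

#[derive(Debug, Clone)]
pub struct TaskEvent {
    pub level: Level,
    pub task_id: u64,           // stable ID, so nothing has to parse task names
    pub correlation_id: String, // ties one pipeline run's tasks together
    pub task_name: String,
    pub message: String,
    pub ts_ms: u128,            // wall-clock milliseconds since the Unix epoch
}

impl TaskEvent {
    pub fn now(level: Level, task_id: u64, correlation_id: &str,
               task_name: &str, message: &str) -> Self {
        let ts_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis();
        TaskEvent {
            level,
            task_id,
            correlation_id: correlation_id.into(),
            task_name: task_name.into(),
            message: message.into(),
            ts_ms,
        }
    }

    /// One event -> one JSONL line: a single stream both the TUI and an AI
    /// consumer can filter, replacing the log_line()/log_event() split.
    /// (String escaping is omitted for brevity.)
    pub fn to_jsonl(&self) -> String {
        format!(
            "{{\"level\":\"{:?}\",\"task_id\":{},\"correlation_id\":\"{}\",\"task\":\"{}\",\"msg\":\"{}\",\"ts_ms\":{}}}",
            self.level, self.task_id, self.correlation_id,
            self.task_name, self.message, self.ts_ms
        )
    }
}
```

Log levels and correlation IDs fall out of the type for free; rotation and filtering then operate on parseable lines rather than opaque strings.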
### Magic numbers

- Workers = `llm_concurrency + 3` (line 682)
- 10 max new jobs per tick (line 770)
- 300/1800s backoff range (lines 721-722)
- 1MB log rotation (line 39)
- 60s scheduler interval (line 24)

None of these are configurable.

### Hardcoded pipeline DAG

Daily pipeline phases are `depend_on()` chains in Rust code (lines 1061-1109). Can't adjust without a recompile. No visualization. No conditional skipping of phases.

### Task naming is fragile

Names are used both as identifiers AND for parsing in the TUI. The format varies (colons, dashes, dates). `task_group()` splits on '-' to categorize — brittle.

### No persistent task queue

Restart loses all pending tasks. The session watcher handles this via reconciliation (good), but the scheduler relies on a `last_daily` date read from a file.

## What works well

1. **Reconciliation-based session discovery** — elegant, restart-resilient
2. **Resource pooling** — LLM concurrency decoupled from worker count
3. **Dependency-driven pipeline** — clean DAG via `depend_on()`
4. **Retry with backoff** — exponential 5min→30min, resets on success
5. **Graceful shutdown** — SIGINT/SIGTERM handled properly

## Kent's design direction

### Event stream, not log files

One pipeline, multiple consumers. The TUI renders for humans; the AI consumes structured data. Same events, different renderers. Cap'n Proto streaming subscription: `subscribe(filter) -> stream`.

"No one ever thinks further ahead than log files with monitoring and it's infuriating." — Kent

### Extend jobkit, don't add a layer

jobkit already has the scheduling and dependency graph. Don't create a new orchestration layer — add the missing pieces (logging, metrics, health, RPC) to jobkit itself.
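To make "same events, different renderers" concrete, here is a minimal sketch of task-lifecycle events fanned out to multiple subscribers. `EventBus`, `TaskCompleted`, `TaskOutcome`, and the method names are illustrative assumptions, not jobkit's actual API; the real version would stream over Cap'n Proto rather than in-process callbacks.

```rust
use std::time::Duration;

// Hypothetical types -- a sketch of the single-pipeline idea, not real jobkit code.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TaskOutcome { Success, Failure }

pub struct TaskCompleted {
    pub name: &'static str,
    pub outcome: TaskOutcome,
    pub duration: Duration,
}

type Subscriber = Box<dyn FnMut(&TaskCompleted)>;

#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Subscriber>,
}

impl EventBus {
    pub fn subscribe(&mut self, f: impl FnMut(&TaskCompleted) + 'static) {
        self.subscribers.push(Box::new(f));
    }

    // The worker pool calls this once per finished task; every consumer
    // (TUI renderer, JSONL logger, metrics collector) sees the same event.
    pub fn publish(&mut self, ev: &TaskCompleted) {
        for s in &mut self.subscribers {
            s(ev);
        }
    }
}
```

The point of the design: metrics, logging, and the TUI stop being separate code paths in the daemon and become three subscribers to one stream owned by jobkit.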
### Cap'n Proto for everything

Standard RPC definitions for:

- Status queries (what's running, pending, failed)
- Control (start, stop, restart, queue)
- Event streaming (subscribe with filter)
- Health checks

## The bigger picture: bcachefs as library

Kent's monitoring system in bcachefs (`event_inc`/`event_inc_trace` + x-macro counters) is the real monitoring infrastructure. There is a 1:1 correspondence between counters (cheap, always-on dashboard via `fs top`) and tracepoints (expensive detail, only run when enabled). The x-macro enforces this — you can't have one without the other.

When the Rust conversion is complete, bcachefs becomes a library. At that point, jobkit doesn't need its own monitoring — it uses the same counter/tracepoint infrastructure. One observability system for everything.

**Implication for now:** jobkit monitoring just needs to be good enough. JSON events, not typed. Don't over-engineer — the real infrastructure is coming from the Rust conversion.

## Extraction: jobkit-daemon library (designed with Kent)

### Goes to jobkit-daemon (generic)

- JSONL event logging with size-based rotation
- Unix domain socket server + signal handling
- Status file writing (periodic JSON snapshot)
- `run_job()` wrapper (logging + progress + error mapping)
- Systemd service installation
- Worker pool setup from config
- Cap'n Proto RPC for the control protocol

### Stays in poc-memory (application)

- All job functions (experience-mine, fact-mine, consolidation, etc.)
- Session watcher, scheduler, RPC command handlers
- GraphHealth, consolidation plan logic

### Interface design

- Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
- JSON blob for status (inherently open-ended; every app has different job types — typing this is the tracepoint mistake)
- Application registers: RPC handlers, long-running tasks, job functions
- ~50-100 lines of setup code, then call `daemon.run()`

## Plan of attack

1. **Observability hooks in jobkit** — `on_task_start/progress/complete` callbacks that consumers can subscribe to
2. **Structured event type** — typed events with task ID, name, duration, result, metadata. Not strings.
3. **Metrics collection** — duration histograms, success rates, queue depth. Built on the event stream.
4. **Cap'n Proto daemon RPC schema** — replace the ad-hoc socket protocol
5. **TUI consumes the event stream** — same data as the AI consumer
6. **Extract monitoring from daemon.rs** — the ~600 lines of logging/status become generic, reusable infrastructure
7. **Declarative pipeline config** — DAG definition in config, not code

## File reference

- `src/agents/daemon.rs` — 1952 lines, all orchestration
  - Job functions: 96-553
  - run_daemon(): 678-1143
  - Socket/RPC: 1145-1372
  - Status display: 1374-1682
- `src/tui.rs` — 907 lines, polls the status socket every 2s
- `schema/memory.capnp` — 125 lines, data only, no RPC definitions
- `src/config.rs` — configuration loading
- External: `jobkit` crate (git dependency)

## Mistakes I made building this (learning notes)

_Per Kent's instruction: note what went wrong and WHY._

1. **Dual logging channels** — I added `log_event()` because `ctx.log_line()` wasn't enough, instead of fixing the underlying abstraction. Symptom: can't find a failed job without searching two places.
2. **Magic numbers** — I hardcoded constants because "I'll make them configurable later." Later never came. Every magic number is a design decision that should have been explicit.
3. **1952-line file** — daemon.rs grew organically because each new feature was "just one more function." Should have extracted when it passed 500 lines. The pain of refactoring later is always worse than the pain of organizing early.
4. **Ad-hoc RPC** — String matching seemed fine for 2 commands. Now it's 4 commands and growing, with implicit formats. Should have used Cap'n Proto from the start — the schema IS the documentation.
5. **No tests** — Zero tests in daemon code.
   "It's a daemon, how do you test it?" is not an excuse. The job functions are pure-ish and testable. The scheduler logic is testable with a clock abstraction.

6. **Not using systemd** — There's a systemd service for the daemon. I keep starting it manually with `poc-memory agent daemon start` and accumulating multiple instances. Tonight: 4 concurrent daemons, 32 cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for. `systemctl --user start poc-memory-daemon`. ONE instance. Managed.

Pattern: every shortcut was "just for now", and every "just for now" became permanent. Kent's yelling was right every time.
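The clock-abstraction point in mistake #5 can be sketched concretely: if the daily-run decision takes a clock trait instead of reading the wall clock, it becomes a pure function a test can pin to any day. `Clock`, `FixedClock`, and `daily_due` are illustrative names, not code that exists in daemon.rs.

```rust
/// Hypothetical clock abstraction -- illustrative, not existing daemon code.
pub trait Clock {
    /// Current day as "days since the Unix epoch" -- enough resolution
    /// for the "has the daily pipeline run today?" decision.
    fn today(&self) -> u64;
}

/// Test double: a clock pinned to one day.
pub struct FixedClock(pub u64);

impl Clock for FixedClock {
    fn today(&self) -> u64 {
        self.0
    }
}

/// The scheduler's core decision as a pure function: the daily pipeline
/// is due iff it has never run, or last ran on an earlier day.
pub fn daily_due(clock: &dyn Clock, last_daily: Option<u64>) -> bool {
    match last_daily {
        None => true,
        Some(day) => day < clock.today(),
    }
}
```

Production code would implement `Clock` over `SystemTime`; tests use `FixedClock` and never sleep.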