Daemon & Jobkit Architecture Survey
2026-03-14, autonomous survey while Kent debugs discard FIFO
Current state
daemon.rs is 1952 lines mixing three concerns:
- ~400 lines: pure jobkit usage (spawn, depend_on, resource)
- ~600 lines: logging/monitoring (log_event, status, RPC)
- ~950 lines: job functions embedding business logic
What jobkit provides (good)
- Worker pool with named workers
- Dependency graph: `depend_on()` for ordering
- Resource pools: `ResourcePool` for concurrency gating (LLM slots)
- Retry logic: `retries(N)` on `TaskError::Retry`
- Task status tracking: `choir.task_statuses()` → `Vec<TaskInfo>`
- Cancellation: `ctx.is_cancelled()`
What jobkit is missing
1. Structured logging (PRIORITY)
- Currently dual-channel: `ctx.log_line()` (per-task) + `log_event()` (daemon JSONL)
- No log levels, no structured context, no correlation IDs
- Log rotation is naive (truncate at 1MB, keep second half)
- Need: observability hooks that both human TUI and AI can consume
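A structured event type is the common fix for all three gaps. A minimal sketch of what one might look like — the `Event` fields and `to_jsonl()` method are hypothetical, not jobkit's actual API:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical structured event: one typed record flowing to one sink,
/// instead of ctx.log_line() + log_event() writing to two places.
struct Event {
    level: &'static str, // "info" | "warn" | "error"
    task: String,        // task name
    correlation_id: u64, // ties retries of one job together
    msg: String,
}

impl Event {
    /// Render as a single JSONL record. String escaping is elided for
    /// brevity; a real implementation would use serde_json.
    fn to_jsonl(&self) -> String {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        format!(
            "{{\"ts\":{},\"level\":\"{}\",\"task\":\"{}\",\"corr\":{},\"msg\":\"{}\"}}",
            ts, self.level, self.task, self.correlation_id, self.msg
        )
    }
}
```

Because each record carries level, task, and correlation ID as fields, both the TUI and an AI consumer can filter without string parsing.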
2. Metrics (NONE EXIST)
- No task duration histograms
- No worker utilization tracking
- No queue depth monitoring
- No success/failure rates by type
- No resource pool wait times
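Everything in this list can start as a small in-process aggregator fed by the event stream. A sketch under that assumption — the `Metrics` type and its bucket boundaries are illustrative, not an existing jobkit API:

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Minimal metrics sketch: per-task-type success/failure counts plus a
/// coarse duration histogram (<1s, <10s, <60s, >=60s).
#[derive(Default)]
struct Metrics {
    success: HashMap<String, u64>,
    failure: HashMap<String, u64>,
    duration_buckets: [u64; 4],
}

impl Metrics {
    fn record(&mut self, task_type: &str, dur: Duration, ok: bool) {
        let map = if ok { &mut self.success } else { &mut self.failure };
        *map.entry(task_type.to_string()).or_insert(0) += 1;
        let idx = match dur.as_secs() {
            0 => 0,
            1..=9 => 1,
            10..=59 => 2,
            _ => 3,
        };
        self.duration_buckets[idx] += 1;
    }

    /// Success rate per task type; 1.0 when nothing recorded yet.
    fn success_rate(&self, task_type: &str) -> f64 {
        let ok = *self.success.get(task_type).unwrap_or(&0) as f64;
        let err = *self.failure.get(task_type).unwrap_or(&0) as f64;
        if ok + err == 0.0 { return 1.0; }
        ok / (ok + err)
    }
}
```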
3. Health monitoring
- No watchdog timers
- No health check hooks per job
- No alerting on threshold violations
- Health computed on-demand in daemon, not in jobkit
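A per-job watchdog is the simplest of these to add. A sketch, assuming jobs can checkpoint periodically — the `Watchdog` type is hypothetical:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-job watchdog: the job calls beat() at checkpoints;
/// a health loop flags any job whose last beat is older than its deadline.
struct Watchdog {
    last_beat: Instant,
    deadline: Duration,
}

impl Watchdog {
    fn new(deadline: Duration) -> Self {
        Watchdog { last_beat: Instant::now(), deadline }
    }

    /// Called by the job to prove liveness.
    fn beat(&mut self) {
        self.last_beat = Instant::now();
    }

    /// Called by the health loop; true means the job missed its deadline.
    fn is_stalled(&self) -> bool {
        self.last_beat.elapsed() > self.deadline
    }
}
```

This moves health into jobkit itself rather than computing it on demand in the daemon.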
4. RPC (ad-hoc in daemon, should be schematized)
- Unix socket with string matching: `match cmd.as_str()`
- No Cap'n Proto schema for daemon control
- No versioning, no validation, no streaming
Architecture problems
Tangled concerns
Job functions hardcode `log_event()` calls. Graph health lives in the daemon
but uses domain-specific metrics. Store loading happens inside jobs
(10 agent runs = 10 store loads). The concerns are not separable.
Magic numbers
- Workers = `llm_concurrency + 3` (line 682)
- 10 max new jobs per tick (line 770)
- 300/1800s backoff range (lines 721-722)
- 1MB log rotation (line 39)
- 60s scheduler interval (line 24)

None of these are configurable.
Hardcoded pipeline DAG
Daily pipeline phases are hardcoded `depend_on()` chains in Rust (lines
1061-1109). They can't be adjusted without a recompile, there's no
visualization, and no way to conditionally skip a phase.
Task naming is fragile
Names are used both as identifiers and as parse targets in the TUI. The
format varies (colons, dashes, dates), and `task_group()` splits on '-' to
categorize — brittle.
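The fix is to carry structure instead of parsing it back out of display strings. A sketch — `TaskId`, `TaskKind`, and the variant names are hypothetical stand-ins for the real task types:

```rust
/// Instead of task_group() splitting display names on '-' (which breaks on
/// names like "experience-mine-2026-03-14"), carry a structured ID and
/// derive the display string from it.
#[derive(Debug, PartialEq)]
enum TaskKind { ExperienceMine, FactMine, Consolidation }

struct TaskId {
    kind: TaskKind,
    date: String, // e.g. "2026-03-14"
}

impl TaskId {
    /// Display string for the TUI, derived from the structured fields.
    fn display(&self) -> String {
        let kind = match self.kind {
            TaskKind::ExperienceMine => "experience-mine",
            TaskKind::FactMine => "fact-mine",
            TaskKind::Consolidation => "consolidation",
        };
        format!("{}:{}", kind, self.date)
    }

    /// Grouping no longer guesses: the kind is carried, not parsed.
    fn group(&self) -> &TaskKind {
        &self.kind
    }
}
```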
No persistent task queue
Restart loses all pending tasks. The session watcher handles this via
reconciliation (good), but the scheduler only tracks the last_daily date
persisted to a file.
What works well
- Reconciliation-based session discovery — elegant, restart-resilient
- Resource pooling — LLM concurrency decoupled from worker count
- Dependency-driven pipeline — clean DAG via `depend_on()`
- Retry with backoff — exponential 5min→30min, resets on success
- Graceful shutdown — SIGINT/SIGTERM handled properly
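The retry backoff above can be expressed as a pure, testable function. A sketch: the source only states the 5min→30min range and "exponential", so the doubling schedule here is an assumption:

```rust
/// Exponential retry backoff: 5 min floor, 30 min cap. A consecutive-failure
/// count of 0 (i.e. after a success) means no delay; the doubling-per-failure
/// schedule is an assumption, only the range comes from the survey.
fn backoff_secs(consecutive_failures: u32) -> u64 {
    const MIN: u64 = 300;  // 5 min
    const MAX: u64 = 1800; // 30 min
    if consecutive_failures == 0 {
        return 0; // reset on success
    }
    // 300, 600, 1200, 2400... then clamped to 1800.
    let exp = MIN.saturating_mul(1u64 << (consecutive_failures - 1).min(3));
    exp.min(MAX)
}
```

A pure function like this is also exactly the kind of scheduler logic the "no tests" note below says should have been tested from the start.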
Kent's design direction
Event stream, not log files
One pipeline, multiple consumers. TUI renders for humans, AI consumes
structured data. Same events, different renderers. Cap'n Proto streaming
subscription: subscribe(filter) -> stream<Event>.
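The fan-out shape can be prototyped in-process with std channels before the Cap'n Proto streaming RPC exists. A sketch — `Bus`, `Event`, and their fields are hypothetical, standing in for `subscribe(filter) -> stream<Event>`:

```rust
use std::sync::mpsc;

/// One event pipeline, multiple consumers: each subscriber registers a
/// filter and receives only matching events on its own channel.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    task: String,
    kind: String, // "start" | "complete" | "fail"
}

struct Bus {
    subs: Vec<(Box<dyn Fn(&Event) -> bool>, mpsc::Sender<Event>)>,
}

impl Bus {
    fn new() -> Self {
        Bus { subs: Vec::new() }
    }

    /// Register a filter; returns the receiving end for this consumer.
    fn subscribe(
        &mut self,
        filter: impl Fn(&Event) -> bool + 'static,
    ) -> mpsc::Receiver<Event> {
        let (tx, rx) = mpsc::channel();
        self.subs.push((Box::new(filter), tx));
        rx
    }

    /// Deliver one event to every subscriber whose filter matches.
    fn publish(&self, ev: Event) {
        for (filter, tx) in &self.subs {
            if filter(&ev) {
                let _ = tx.send(ev.clone());
            }
        }
    }
}
```

The TUI would subscribe with a broad filter and render for humans; an AI consumer subscribes with a narrow one — same events, different renderers.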
"No one ever thinks further ahead than log files with monitoring and it's infuriating." — Kent
Extend jobkit, don't add a layer
jobkit already has the scheduling and dependency graph. Don't create a new orchestration layer — add the missing pieces (logging, metrics, health, RPC) to jobkit itself.
Cap'n Proto for everything
Standard RPC definitions for:
- Status queries (what's running, pending, failed)
- Control (start, stop, restart, queue)
- Event streaming (subscribe with filter)
- Health checks
The bigger picture: bcachefs as library
Kent's monitoring system in bcachefs (event_inc/event_inc_trace + x-macro
counters) is the real monitoring infrastructure. 1-1 correspondence between
counters (cheap, always-on dashboard via fs top) and tracepoints (expensive
detail, only runs when enabled). The x-macro enforces this — can't have one
without the other.
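bcachefs does this with a C x-macro; the same pairing idea can be illustrated in Rust terms, where one list generates both the always-on counter and the name shared with tracing, so neither can exist without the other. This is only an illustration of the enforcement idea, not bcachefs's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One event list generates both the cheap always-on counter and the
/// string name a tracepoint would share — the 1-1 correspondence is
/// enforced at the definition site, x-macro style.
macro_rules! events {
    ($($name:ident),* $(,)?) => {
        // One atomic counter per event, always on.
        $(#[allow(non_upper_case_globals)]
          static $name: AtomicU64 = AtomicU64::new(0);)*

        /// Cheap path: bump the counter for this event name.
        fn event_inc(name: &str) {
            match name {
                $(stringify!($name) => { $name.fetch_add(1, Ordering::Relaxed); })*
                _ => {}
            }
        }

        /// Dashboard path: read a counter by the same shared name.
        fn event_count(name: &str) -> u64 {
            match name {
                $(stringify!($name) => $name.load(Ordering::Relaxed),)*
                _ => 0,
            }
        }
    };
}

// Hypothetical event names for illustration.
events!(io_read, io_write, btree_split);
```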
When the Rust conversion is complete, bcachefs becomes a library. At that point, jobkit doesn't need its own monitoring — it uses the same counter/tracepoint infrastructure. One observability system for everything.
Implication for now: jobkit monitoring just needs to be good enough. JSON events, not typed. Don't over-engineer — the real infrastructure is coming from the Rust conversion.
Extraction: jobkit-daemon library (designed with Kent)
Goes to jobkit-daemon (generic)
- JSONL event logging with size-based rotation
- Unix domain socket server + signal handling
- Status file writing (periodic JSON snapshot)
- `run_job()` wrapper (logging + progress + error mapping)
- Systemd service installation
- Worker pool setup from config
- Cap'n Proto RPC for control protocol
Stays in poc-memory (application)
- All job functions (experience-mine, fact-mine, consolidation, etc.)
- Session watcher, scheduler, RPC command handlers
- GraphHealth, consolidation plan logic
Interface design
- Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
- JSON blob for status (inherently open-ended, every app has different job types — typing this is the tracepoint mistake)
- Application registers: RPC handlers, long-running tasks, job functions
- ~50-100 lines of setup code, then call `daemon.run()`
Plan of attack
- Observability hooks in jobkit — `on_task_start`/`progress`/`complete` callbacks that consumers can subscribe to
- Structured event type — typed events with task ID, name, duration, result, metadata. Not strings.
- Metrics collection — duration histograms, success rates, queue depth. Built on the event stream.
- Cap'n Proto daemon RPC schema — replace ad-hoc socket protocol
- TUI consumes event stream — same data as AI consumer
- Extract monitoring from daemon.rs — the 600 lines of logging/status become generic, reusable infrastructure
- Declarative pipeline config — DAG definition in config, not code
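The declarative-pipeline item can start as small as a data-driven ordering: phases and dependencies come from data (eventually a config file), and the daemon derives the execution order instead of hardcoding `depend_on()` chains. A sketch — `topo_order` and the phase names are hypothetical:

```rust
/// Order pipeline phases so every phase runs after its dependencies.
/// Input: (phase name, names it depends on). Returns None on a cycle or
/// a dependency that names no phase.
fn topo_order<'a>(phases: &[(&'a str, Vec<&'a str>)]) -> Option<Vec<&'a str>> {
    let mut order = Vec::new();
    let mut done: Vec<&str> = Vec::new();
    let mut remaining: Vec<_> = phases.to_vec();
    while !remaining.is_empty() {
        // Pick any phase whose dependencies are all satisfied.
        let idx = remaining
            .iter()
            .position(|(_, deps)| deps.iter().all(|d| done.contains(d)))?;
        let (name, _) = remaining.remove(idx);
        done.push(name);
        order.push(name);
    }
    Some(order)
}
```

With this shape, skipping a phase conditionally or visualizing the DAG becomes an operation on data, not a recompile.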
File reference
`src/agents/daemon.rs` — 1952 lines, all orchestration
- Job functions: 96-553
- run_daemon(): 678-1143
- Socket/RPC: 1145-1372
- Status display: 1374-1682

`src/tui.rs` — 907 lines, polls status socket every 2s
`schema/memory.capnp` — 125 lines, data only, no RPC definitions
`src/config.rs` — configuration loading
- External: `jobkit` crate (git dependency)
Mistakes I made building this (learning notes)
Per Kent's instruction: note what went wrong and WHY.
- Dual logging channels — I added `log_event()` because `ctx.log_line()` wasn't enough, instead of fixing the underlying abstraction. Symptom: can't find a failed job without searching two places.
- Magic numbers — I hardcoded constants because "I'll make them configurable later." Later never came. Every magic number is a design decision that should have been explicit.
- 1952-line file — daemon.rs grew organically because each new feature was "just one more function." Should have extracted when it passed 500 lines. The pain of refactoring later is always worse than the pain of organizing early.
- Ad-hoc RPC — String matching seemed fine for 2 commands. Now it's 4 commands and growing, with implicit formats. Should have used Cap'n Proto from the start — the schema IS the documentation.
- No tests — Zero tests in daemon code. "It's a daemon, how do you test it?" is not an excuse. The job functions are pure-ish and testable. The scheduler logic is testable with a clock abstraction.
- Not using systemd — There's a systemd service for the daemon. I keep starting it manually with `poc-memory agent daemon start` and accumulating multiple instances. Tonight: 4 concurrent daemons, 32 cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for: `systemctl --user start poc-memory-daemon`. ONE instance. Managed.
Pattern: every shortcut was "just for now" and every "just for now" became permanent. Kent's yelling was right every time.