# Daemon & Jobkit Architecture Survey

_2026-03-14, autonomous survey while Kent debugs discard FIFO_
## Current state

`daemon.rs` is 1952 lines mixing three concerns:

- ~400 lines: pure jobkit usage (spawn, depend_on, resource)
- ~600 lines: logging/monitoring (log_event, status, RPC)
- ~950 lines: job functions embedding business logic
## What jobkit provides (good)

- Worker pool with named workers
- Dependency graph: `depend_on()` for ordering
- Resource pools: `ResourcePool` for concurrency gating (LLM slots)
- Retry logic: `retries(N)` on `TaskError::Retry`
- Task status tracking: `choir.task_statuses()` → `Vec<TaskInfo>`
- Cancellation: `ctx.is_cancelled()`
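Taken together, the surface looks roughly like this. This is an illustrative pseudocode sketch: the identifiers come from the list above, but the exact jobkit signatures are assumptions, not the real API.

```rust
// Pseudocode sketch only: exact jobkit signatures are assumed, not real.
let llm_slots = ResourcePool::new("llm", config.llm_concurrency);

let mine = choir.spawn("experience-mine", |ctx| {
    if ctx.is_cancelled() {
        return Err(TaskError::Cancelled);
    }
    ctx.log_line("mining experiences");
    Ok(())
})
.resource(&llm_slots) // gate on an LLM concurrency slot
.retries(3);          // re-run on TaskError::Retry

choir.spawn("consolidation", consolidate)
    .depend_on(&mine); // ordering via the dependency graph

// Vec<TaskInfo>, rendered by the TUI
for task in choir.task_statuses() { /* ... */ }
```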
## What jobkit is missing

### 1. Structured logging (PRIORITY)

- Currently dual-channel: `ctx.log_line()` (per-task) + `log_event()` (daemon JSONL)
- No log levels, no structured context, no correlation IDs
- Log rotation is naive (truncate at 1MB, keep second half)
- Need: observability hooks that both human TUI and AI can consume
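A structured event need not wait on a logging framework; a minimal std-only sketch of what one event could carry (field names are illustrative, and real code must escape the strings):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, Clone, Copy)]
pub enum Level { Debug, Info, Warn, Error }

/// One structured event: level, correlation id, and task context,
/// serialized as a single JSONL line (no serde, std only).
pub struct Event<'a> {
    pub level: Level,
    pub correlation_id: &'a str, // ties together all events for one job run
    pub task: &'a str,
    pub msg: &'a str,
}

impl Event<'_> {
    pub fn to_jsonl(&self) -> String {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        let level = match self.level {
            Level::Debug => "debug",
            Level::Info => "info",
            Level::Warn => "warn",
            Level::Error => "error",
        };
        // NB: real code must JSON-escape quotes/backslashes in the strings.
        format!(
            "{{\"ts\":{ts},\"level\":\"{level}\",\"cid\":\"{cid}\",\"task\":\"{task}\",\"msg\":\"{msg}\"}}",
            cid = self.correlation_id,
            task = self.task,
            msg = self.msg
        )
    }
}
```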
### 2. Metrics (NONE EXIST)

- No task duration histograms
- No worker utilization tracking
- No queue depth monitoring
- No success/failure rates by type
- No resource pool wait times
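None of these require a metrics crate to start; a duration histogram, for instance, is a handful of std lines (bucket boundaries here are arbitrary choices):

```rust
use std::time::Duration;

/// Fixed-bucket histogram of task durations.
/// Bucket upper bounds in seconds: 1, 10, 60, 600, then an overflow bucket.
pub struct DurationHistogram {
    pub bounds: Vec<u64>,
    pub counts: Vec<u64>,
}

impl DurationHistogram {
    pub fn new() -> Self {
        let bounds = vec![1, 10, 60, 600];
        let counts = vec![0; bounds.len() + 1]; // +1 for overflow
        DurationHistogram { bounds, counts }
    }

    pub fn record(&mut self, d: Duration) {
        let secs = d.as_secs();
        let idx = self
            .bounds
            .iter()
            .position(|&b| secs <= b)
            .unwrap_or(self.bounds.len()); // overflow bucket
        self.counts[idx] += 1;
    }

    pub fn total(&self) -> u64 {
        self.counts.iter().sum()
    }
}
```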
### 3. Health monitoring

- No watchdog timers
- No health check hooks per job
- No alerting on threshold violations
- Health computed on-demand in daemon, not in jobkit
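The watchdog piece is small if jobs get a ping hook; a sketch, assuming such a hook existed (it does not today):

```rust
use std::time::{Duration, Instant};

/// Per-job watchdog: the job calls `ping()` on progress; a monitor loop
/// calls `is_healthy()` and alerts when the job has gone silent too long.
pub struct Watchdog {
    last_ping: Instant,
    timeout: Duration,
}

impl Watchdog {
    pub fn new(timeout: Duration) -> Self {
        Watchdog { last_ping: Instant::now(), timeout }
    }

    pub fn ping(&mut self) {
        self.last_ping = Instant::now();
    }

    pub fn is_healthy(&self) -> bool {
        self.last_ping.elapsed() <= self.timeout
    }
}
```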
### 4. RPC (ad-hoc in daemon, should be schematized)

- Unix socket with string matching: `match cmd.as_str()`
- No Cap'n Proto schema for daemon control
- No versioning, no validation, no streaming
## Architecture problems

### Tangled concerns

Job functions hardcode `log_event()` calls. Graph health lives in the daemon
but uses domain-specific metrics. Store loading happens inside jobs
(10 agent runs = 10 store loads). None of this is separable.
### Magic numbers

- Workers = `llm_concurrency + 3` (line 682)
- 10 max new jobs per tick (line 770)
- 300/1800s backoff range (lines 721-722)
- 1MB log rotation (line 39)
- 60s scheduler interval (line 24)

None configurable.
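Lifting them into a config struct, with today's values as defaults, would at least make each decision explicit. Field names here are invented:

```rust
use std::time::Duration;

/// Daemon tuning knobs, with the current hardcoded values as defaults.
/// Field names are invented for illustration.
pub struct DaemonConfig {
    pub extra_workers: usize,         // workers = llm_concurrency + extra_workers
    pub max_new_jobs_per_tick: usize,
    pub backoff_min: Duration,        // 300s today
    pub backoff_max: Duration,        // 1800s today
    pub log_rotate_bytes: u64,        // 1 MiB today
    pub scheduler_interval: Duration, // 60s today
}

impl Default for DaemonConfig {
    fn default() -> Self {
        DaemonConfig {
            extra_workers: 3,
            max_new_jobs_per_tick: 10,
            backoff_min: Duration::from_secs(300),
            backoff_max: Duration::from_secs(1800),
            log_rotate_bytes: 1024 * 1024,
            scheduler_interval: Duration::from_secs(60),
        }
    }
}
```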
### Hardcoded pipeline DAG

Daily pipeline phases are `depend_on()` chains in Rust code (lines
1061-1109). Can't adjust without a recompile. No visualization. No
conditional skipping of phases.
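One possible shape for a declarative version, as a hypothetical TOML sketch (not an existing format):

```toml
# Hypothetical pipeline config: phases + dependencies, editable without recompile.
[[phase]]
name = "experience-mine"

[[phase]]
name = "fact-mine"

[[phase]]
name = "consolidation"
depends_on = ["experience-mine", "fact-mine"]
skip_if = "no_new_sessions"  # conditional skipping, evaluated by the daemon
```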
### Task naming is fragile

Names are used both as identifiers AND for parsing in the TUI. The format
varies (colons, dashes, dates). `task_group()` splits on '-' to categorize —
brittle.
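Separating identity from display text would remove the parsing entirely; a sketch with an invented `TaskId` type (not in jobkit):

```rust
/// Structured task identity: the TUI groups on `kind` directly
/// instead of re-parsing display strings. Invented type, for illustration.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TaskId {
    pub kind: String,             // e.g. "experience-mine"
    pub instance: Option<String>, // e.g. a date or session id
}

impl TaskId {
    /// Display string is derived, never parsed back.
    pub fn display(&self) -> String {
        match &self.instance {
            Some(i) => format!("{}:{}", self.kind, i),
            None => self.kind.clone(),
        }
    }
}
```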
### No persistent task queue

A restart loses all pending tasks. The session watcher handles this via
reconciliation (good), but the scheduler relies on a `last_daily` date read
from a file.
## What works well

1. **Reconciliation-based session discovery** — elegant, restart-resilient
2. **Resource pooling** — LLM concurrency decoupled from worker count
3. **Dependency-driven pipeline** — clean DAG via `depend_on()`
4. **Retry with backoff** — exponential 5min→30min, resets on success
5. **Graceful shutdown** — SIGINT/SIGTERM handled properly
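The backoff in (4) is worth pinning down as a pure function; a sketch matching the described 5min→30min doubling (the actual jobkit behavior may differ in detail):

```rust
use std::time::Duration;

/// Exponential backoff: 300s base, doubling per consecutive failure,
/// clamped at 1800s. A success resets `failures` to zero (caller's job).
pub fn backoff(failures: u32) -> Duration {
    let base = 300u64;
    let max = 1800u64;
    // Cap the shift so the multiplication cannot overflow.
    let secs = base.saturating_mul(1u64 << failures.min(6));
    Duration::from_secs(secs.min(max))
}
```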
## Kent's design direction

### Event stream, not log files

One pipeline, multiple consumers. The TUI renders for humans, AI consumes
structured data. Same events, different renderers. Cap'n Proto streaming
subscription: `subscribe(filter) -> stream<Event>`.

"No one ever thinks further ahead than log files with monitoring and
it's infuriating." — Kent
### Extend jobkit, don't add a layer

jobkit already has the scheduling and dependency graph. Don't create a
new orchestration layer — add the missing pieces (logging, metrics,
health, RPC) to jobkit itself.
### Cap'n Proto for everything

Standard RPC definitions for:

- Status queries (what's running, pending, failed)
- Control (start, stop, restart, queue)
- Event streaming (subscribe with filter)
- Health checks
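A first cut at such a schema might look like this (a hypothetical sketch, not the existing `memory.capnp`):

```capnp
# Hypothetical daemon control schema; names are illustrative.
interface Daemon {
  status @0 () -> (json :Text);               # open-ended JSON blob
  control @1 (cmd :Cmd, task :Text) -> (ok :Bool);
  subscribe @2 (filter :Text) -> (events :EventStream);
  health @3 () -> (healthy :Bool, detail :Text);
}

enum Cmd { start @0; stop @1; restart @2; queue @3; }

interface EventStream {
  next @0 () -> (event :Text);  # JSON event; real capnp streaming uses callbacks
}
```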
## The bigger picture: bcachefs as library

Kent's monitoring system in bcachefs (`event_inc`/`event_inc_trace` + x-macro
counters) is the real monitoring infrastructure. There is a 1-1 correspondence
between counters (cheap, always-on dashboard via `fs top`) and tracepoints
(expensive detail, only run when enabled). The x-macro enforces this — you
can't have one without the other.
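The x-macro pattern ports to Rust via `macro_rules!`; a sketch (not bcachefs code) where one event list expands into both the counter storage and the increment-plus-trace entry point, so the 1-1 correspondence is enforced by construction:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One list of events; the macro expands it into an index enum, a counter
/// array sized to match, and the single increment entry point. Adding an
/// event to the list adds its counter and its trace hook together.
macro_rules! events {
    ($($name:ident),* $(,)?) => {
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub enum Event { $($name),* }

        pub const NR_EVENTS: usize = [$(stringify!($name)),*].len();

        pub static COUNTERS: [AtomicU64; NR_EVENTS] = [$(
            // one counter per listed event; `$name` forces one repetition each
            { let _ = stringify!($name); AtomicU64::new(0) }
        ),*];

        pub fn event_inc_trace(e: Event) {
            COUNTERS[e as usize].fetch_add(1, Ordering::Relaxed); // cheap path
            // the expensive tracepoint would fire here, only when enabled
        }
    };
}

events!(task_start, task_complete, task_retry);
```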
When the Rust conversion is complete, bcachefs becomes a library. At that
point, jobkit doesn't need its own monitoring — it uses the same counter/
tracepoint infrastructure. One observability system for everything.

**Implication for now:** jobkit monitoring just needs to be good enough.
JSON events, not typed. Don't over-engineer — the real infrastructure is
coming from the Rust conversion.
## Extraction: jobkit-daemon library (designed with Kent)

### Goes to jobkit-daemon (generic)

- JSONL event logging with size-based rotation
- Unix domain socket server + signal handling
- Status file writing (periodic JSON snapshot)
- `run_job()` wrapper (logging + progress + error mapping)
- systemd service installation
- Worker pool setup from config
- Cap'n Proto RPC for the control protocol
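The first bullet is small enough to sketch entirely in std; an outline, not the crate's actual code (this version rotates the whole file to one prior generation rather than truncating mid-file):

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Append one JSONL line; when the file exceeds `max_bytes`, rename the
/// whole file to `<path>.1` first, keeping one generation of history.
pub fn append_event(path: &Path, line: &str, max_bytes: u64) -> std::io::Result<()> {
    if let Ok(meta) = fs::metadata(path) {
        if meta.len() >= max_bytes {
            let mut rotated = path.as_os_str().to_owned();
            rotated.push(".1");
            fs::rename(path, &rotated)?; // atomic on the same filesystem
        }
    }
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{line}")
}
```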
### Stays in poc-memory (application)

- All job functions (experience-mine, fact-mine, consolidation, etc.)
- Session watcher, scheduler, RPC command handlers
- GraphHealth, consolidation plan logic
### Interface design

- Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
- JSON blob for status (inherently open-ended — every app has different
  job types; typing this is the tracepoint mistake)
- Application registers: RPC handlers, long-running tasks, job functions
- ~50-100 lines of setup code, then call `daemon.run()`
## Plan of attack

1. **Observability hooks in jobkit** — `on_task_start/progress/complete`
   callbacks that consumers can subscribe to
2. **Structured event type** — typed events with task ID, name, duration,
   result, metadata. Not strings.
3. **Metrics collection** — duration histograms, success rates, queue
   depth. Built on the event stream.
4. **Cap'n Proto daemon RPC schema** — replace the ad-hoc socket protocol
5. **TUI consumes the event stream** — same data as the AI consumer
6. **Extract monitoring from daemon.rs** — the ~600 lines of logging/status
   become generic, reusable infrastructure
7. **Declarative pipeline config** — DAG definition in config, not code
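Step 1 amounts to a small trait; a sketch of the hook surface (trait and method names are assumptions, not current jobkit API), with a trivial consumer:

```rust
use std::time::Duration;

/// Consumer-facing hooks: the TUI, the JSONL log, and metrics each
/// implement this and subscribe. Names are assumptions for illustration.
pub trait TaskObserver {
    fn on_task_start(&mut self, task: &str);
    fn on_task_progress(&mut self, task: &str, msg: &str);
    fn on_task_complete(&mut self, task: &str, took: Duration, ok: bool);
}

/// A trivial observer: counts completions and failures.
#[derive(Default)]
pub struct Tally {
    pub ok: u32,
    pub failed: u32,
}

impl TaskObserver for Tally {
    fn on_task_start(&mut self, _task: &str) {}
    fn on_task_progress(&mut self, _task: &str, _msg: &str) {}
    fn on_task_complete(&mut self, _task: &str, _took: Duration, ok: bool) {
        if ok { self.ok += 1 } else { self.failed += 1 }
    }
}
```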
## File reference

- `src/agents/daemon.rs` — 1952 lines, all orchestration
  - Job functions: lines 96-553
  - `run_daemon()`: lines 678-1143
  - Socket/RPC: lines 1145-1372
  - Status display: lines 1374-1682
- `src/tui.rs` — 907 lines, polls the status socket every 2s
- `schema/memory.capnp` — 125 lines, data only, no RPC definitions
- `src/config.rs` — configuration loading
- External: `jobkit` crate (git dependency)
## Mistakes I made building this (learning notes)

_Per Kent's instruction: note what went wrong and WHY._

1. **Dual logging channels** — I added `log_event()` because `ctx.log_line()`
   wasn't enough, instead of fixing the underlying abstraction. Symptom:
   can't find a failed job without searching two places.
2. **Magic numbers** — I hardcoded constants because "I'll make them
   configurable later." Later never came. Every magic number is a design
   decision that should have been explicit.

3. **1952-line file** — daemon.rs grew organically because each new feature
   was "just one more function." Should have extracted when it passed 500
   lines. The pain of refactoring later is always worse than the pain of
   organizing early.
4. **Ad-hoc RPC** — String matching seemed fine for 2 commands. Now it's 4
   commands and growing, with implicit formats. Should have used Cap'n Proto
   from the start — the schema IS the documentation.

5. **No tests** — Zero tests in daemon code. "It's a daemon, how do you test
   it?" is not an excuse. The job functions are pure-ish and testable. The
   scheduler logic is testable with a clock abstraction.
6. **Not using systemd** — There's a systemd service for the daemon.
   I keep starting it manually with `poc-memory agent daemon start` and
   accumulating multiple instances. Tonight: 4 concurrent daemons, 32
   cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for:
   `systemctl --user start poc-memory-daemon`. ONE instance. Managed.
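The clock abstraction from (5) is only a few lines; a sketch with invented names:

```rust
use std::time::{Duration, Instant};

/// Inject time so scheduler logic is testable without sleeping.
pub trait Clock {
    fn now(&self) -> Instant;
}

pub struct SystemClock;
impl Clock for SystemClock {
    fn now(&self) -> Instant { Instant::now() }
}

/// Test clock that only moves when told to.
pub struct FakeClock { pub now: Instant }
impl Clock for FakeClock {
    fn now(&self) -> Instant { self.now }
}

/// Example scheduler decision, now pure in the clock:
/// run the daily job if `interval` has passed since `last_run`.
pub fn is_due(clock: &dyn Clock, last_run: Instant, interval: Duration) -> bool {
    clock.now().duration_since(last_run) >= interval
}
```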
Pattern: every shortcut was "just for now" and every "just for now" became
permanent. Kent's yelling was right every time.