
Daemon & Jobkit Architecture Survey

2026-03-14, autonomous survey while Kent debugs discard FIFO

Current state

daemon.rs is 1952 lines mixing three concerns:

  • ~400 lines: pure jobkit usage (spawn, depend_on, resource)
  • ~600 lines: logging/monitoring (log_event, status, RPC)
  • ~950 lines: job functions embedding business logic

What jobkit provides (good)

  • Worker pool with named workers
  • Dependency graph: depend_on() for ordering
  • Resource pools: ResourcePool for concurrency gating (LLM slots)
  • Retry logic: retries(N) on TaskError::Retry
  • Task status tracking: choir.task_statuses() -> Vec&lt;TaskInfo&gt;
  • Cancellation: ctx.is_cancelled()

What jobkit is missing

1. Structured logging (PRIORITY)

  • Currently dual-channel: ctx.log_line() (per-task) + log_event() (daemon JSONL)
  • No log levels, no structured context, no correlation IDs
  • Log rotation is naive (truncate at 1MB, keep second half)
  • Need: observability hooks that both human TUI and AI can consume
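The naive rotation called out above can be sketched like this; `rotate_log` and the 1 MiB constant mirror the described behavior, but the function name and buffer-based shape are illustrative, not the daemon's actual code:

```rust
// Naive size-based log rotation: once the log exceeds the cap, drop
// the first half and keep the second. Loses history and can cut a
// JSONL record in half mid-line -- the weakness noted above.
const MAX_LOG_BYTES: usize = 1024 * 1024; // the 1 MiB cap

fn rotate_log(buf: &mut Vec<u8>) {
    if buf.len() > MAX_LOG_BYTES {
        let keep_from = buf.len() / 2;
        buf.drain(..keep_from); // second half survives, first half is gone
    }
}

fn main() {
    let mut log = vec![b'x'; 2 * MAX_LOG_BYTES];
    rotate_log(&mut log);
    println!("kept {} bytes", log.len());
}
```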

2. Metrics (NONE EXIST)

  • No task duration histograms
  • No worker utilization tracking
  • No queue depth monitoring
  • No success/failure rates by type
  • No resource pool wait times
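A minimal sketch of the missing metrics sink: per-task-type success/failure counts plus a coarse duration histogram. The names (`TaskMetrics`, `record`, `success_rate`) are hypothetical, not jobkit API:

```rust
use std::collections::HashMap;
use std::time::Duration;

// Minimal metrics of the kind listed as missing: outcome counts per
// task type and a coarse duration histogram.
#[derive(Default)]
struct TaskMetrics {
    // (successes, failures) per task type
    outcomes: HashMap<String, (u64, u64)>,
    // duration buckets: <1s, <10s, <60s, >=60s
    buckets: [u64; 4],
}

impl TaskMetrics {
    fn record(&mut self, task_type: &str, dur: Duration, ok: bool) {
        let e = self.outcomes.entry(task_type.to_string()).or_default();
        if ok { e.0 += 1 } else { e.1 += 1 }
        let idx = match dur.as_secs() {
            0 => 0,
            1..=9 => 1,
            10..=59 => 2,
            _ => 3,
        };
        self.buckets[idx] += 1;
    }

    fn success_rate(&self, task_type: &str) -> Option<f64> {
        let (ok, err) = *self.outcomes.get(task_type)?;
        Some(ok as f64 / (ok + err) as f64)
    }
}

fn main() {
    let mut m = TaskMetrics::default();
    m.record("experience-mine", Duration::from_secs(12), true);
    m.record("experience-mine", Duration::from_secs(3), false);
    println!("{:?}", m.success_rate("experience-mine"));
}
```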

3. Health monitoring

  • No watchdog timers
  • No health check hooks per job
  • No alerting on threshold violations
  • Health computed on-demand in daemon, not in jobkit

4. RPC (ad-hoc in daemon, should be schematized)

  • Unix socket with string matching: match cmd.as_str()
  • No cap'n proto schema for daemon control
  • No versioning, no validation, no streaming
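The ad-hoc protocol being criticized looks roughly like this (a reconstruction, not the daemon's actual handler): a raw command string matched by hand, where a typo or version skew is only caught at runtime:

```rust
// Ad-hoc control protocol: a raw string off the Unix socket, matched
// by hand. No schema, no versioning, no validation -- every new
// command grows the match arm, and typos fail only at runtime.
fn dispatch(cmd: &str) -> String {
    match cmd.trim() {
        "status" => "ok: 3 running, 1 pending".to_string(),
        "stop" => "ok: shutting down".to_string(),
        other => format!("err: unknown command {:?}", other),
    }
}

fn main() {
    println!("{}", dispatch("status"));
    println!("{}", dispatch("stauts")); // typo: the compiler can't help
}
```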

Architecture problems

Tangled concerns

Job functions hardcode log_event() calls. Graph health is in daemon but uses domain-specific metrics. Store loading happens inside jobs (10 agent runs = 10 store loads). Not separable.

Magic numbers

  • Workers = llm_concurrency + 3 (line 682)
  • 10 max new jobs per tick (line 770)
  • 300/1800s backoff range (lines 721-722)
  • 1MB log rotation (line 39)
  • 60s scheduler interval (line 24)

None configurable.
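Lifting those constants into an explicit config struct, with the current values as defaults, would make each decision visible. The struct and field names here are hypothetical:

```rust
use std::time::Duration;

// The magic numbers above as an explicit config with the current
// values as defaults. Every field is a design decision made visible.
struct DaemonConfig {
    extra_workers: usize,          // workers = llm_concurrency + extra_workers
    max_new_jobs_per_tick: usize,
    backoff_min: Duration,
    backoff_max: Duration,
    log_rotate_bytes: u64,
    scheduler_interval: Duration,
}

impl Default for DaemonConfig {
    fn default() -> Self {
        DaemonConfig {
            extra_workers: 3,
            max_new_jobs_per_tick: 10,
            backoff_min: Duration::from_secs(300),  // 5 min
            backoff_max: Duration::from_secs(1800), // 30 min
            log_rotate_bytes: 1024 * 1024,          // 1 MiB
            scheduler_interval: Duration::from_secs(60),
        }
    }
}

fn main() {
    let cfg = DaemonConfig::default();
    println!("workers = llm_concurrency + {}", cfg.extra_workers);
}
```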

Hardcoded pipeline DAG

Daily pipeline phases are depend_on() chains in Rust code (lines 1061-1109). Can't adjust without recompile. No visualization. No conditional skipping of phases.

Task naming is fragile

Names used as both identifiers AND for parsing in TUI. Format varies (colons, dashes, dates). task_group() splits on '-' to categorize — brittle.
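The brittleness is easy to demonstrate with a reconstruction of the split-on-`'-'` categorizer (a sketch of the described behavior, not the real function): any task name containing a date gets categorized by its year.

```rust
// Reconstruction of the brittle categorizer: everything before the
// first '-' is the "group". Fine for "fact-mine", nonsense the moment
// a date leads the name.
fn task_group(name: &str) -> &str {
    name.split('-').next().unwrap_or(name)
}

fn main() {
    println!("{}", task_group("fact-mine"));        // "fact" -- plausible
    println!("{}", task_group("2026-03-14-daily")); // "2026" -- useless group
}
```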

No persistent task queue

Restart loses all pending tasks. The session watcher handles this via reconciliation (good), but the scheduler's only persistent state is a last_daily date in a file.

What works well

  1. Reconciliation-based session discovery — elegant, restart-resilient
  2. Resource pooling — LLM concurrency decoupled from worker count
  3. Dependency-driven pipeline — clean DAG via depend_on()
  4. Retry with backoff — exponential 5min→30min, resets on success
  5. Graceful shutdown — SIGINT/SIGTERM handled properly
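Item 4's backoff can be sketched from the constants in the survey (double from 5 min, cap at 30 min, reset to the minimum on success); the function itself is illustrative:

```rust
use std::time::Duration;

// Exponential retry backoff as described: doubles from 5 min, capped
// at 30 min; a success resets the delay to the minimum.
const BACKOFF_MIN: Duration = Duration::from_secs(300);  // 5 min
const BACKOFF_MAX: Duration = Duration::from_secs(1800); // 30 min

fn next_backoff(current: Duration) -> Duration {
    (current * 2).min(BACKOFF_MAX)
}

fn main() {
    let mut d = BACKOFF_MIN;
    for _ in 0..4 {
        println!("{}s", d.as_secs()); // 300, 600, 1200, 1800
        d = next_backoff(d);
    }
    d = BACKOFF_MIN; // on success, start over
    let _ = d;
}
```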

Kent's design direction

Event stream, not log files

One pipeline, multiple consumers. TUI renders for humans, AI consumes structured data. Same events, different renderers. Cap'n Proto streaming subscription: subscribe(filter) -> stream<Event>.

"No one ever thinks further ahead than log files with monitoring and it's infuriating." — Kent

Extend jobkit, don't add a layer

jobkit already has the scheduling and dependency graph. Don't create a new orchestration layer — add the missing pieces (logging, metrics, health, RPC) to jobkit itself.

Cap'n Proto for everything

Standard RPC definitions for:

  • Status queries (what's running, pending, failed)
  • Control (start, stop, restart, queue)
  • Event streaming (subscribe with filter)
  • Health checks

The bigger picture: bcachefs as library

Kent's monitoring system in bcachefs (event_inc/event_inc_trace + x-macro counters) is the real monitoring infrastructure. 1-1 correspondence between counters (cheap, always-on dashboard via fs top) and tracepoints (expensive detail, only runs when enabled). The x-macro enforces this — can't have one without the other.

When the Rust conversion is complete, bcachefs becomes a library. At that point, jobkit doesn't need its own monitoring — it uses the same counter/tracepoint infrastructure. One observability system for everything.

Implication for now: jobkit monitoring just needs to be good enough. JSON events, not typed. Don't over-engineer — the real infrastructure is coming from the Rust conversion.

Extraction: jobkit-daemon library (designed with Kent)

Goes to jobkit-daemon (generic)

  • JSONL event logging with size-based rotation
  • Unix domain socket server + signal handling
  • Status file writing (periodic JSON snapshot)
  • run_job() wrapper (logging + progress + error mapping)
  • Systemd service installation
  • Worker pool setup from config
  • Cap'n Proto RPC for control protocol

Stays in poc-memory (application)

  • All job functions (experience-mine, fact-mine, consolidation, etc.)
  • Session watcher, scheduler, RPC command handlers
  • GraphHealth, consolidation plan logic

Interface design

  • Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
  • JSON blob for status (inherently open-ended, every app has different job types — typing this is the tracepoint mistake)
  • Application registers: RPC handlers, long-running tasks, job functions
  • ~50-100 lines of setup code, call daemon.run()
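The "register handlers, then call daemon.run()" shape might look like the following. Everything here is a design sketch — `DaemonBuilder`, `register_rpc`, `register_job` are hypothetical names, not an existing jobkit-daemon API:

```rust
use std::collections::HashMap;

// Sketch of the registration surface: the application hands the
// generic daemon its RPC handlers and job functions, then runs it.
type RpcHandler = fn(&str) -> String;
type JobFn = fn() -> Result<(), String>;

#[derive(Default)]
struct DaemonBuilder {
    rpc: HashMap<String, RpcHandler>,
    jobs: HashMap<String, JobFn>,
}

impl DaemonBuilder {
    fn register_rpc(mut self, name: &str, h: RpcHandler) -> Self {
        self.rpc.insert(name.to_string(), h);
        self
    }
    fn register_job(mut self, name: &str, f: JobFn) -> Self {
        self.jobs.insert(name.to_string(), f);
        self
    }
    fn run(self) {
        // real version: socket server, signal handling, status file,
        // event loop -- all generic, none of it application code
        println!("daemon up: {} rpc handlers, {} jobs", self.rpc.len(), self.jobs.len());
    }
}

fn consolidate() -> Result<(), String> {
    Ok(())
}

fn main() {
    DaemonBuilder::default()
        .register_rpc("status", |_| "ok".to_string())
        .register_job("consolidation", consolidate)
        .run();
}
```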

Plan of attack

  1. Observability hooks in jobkit — on_task_start/progress/complete callbacks that consumers can subscribe to
  2. Structured event type — typed events with task ID, name, duration, result, metadata. Not strings.
  3. Metrics collection — duration histograms, success rates, queue depth. Built on the event stream.
  4. Cap'n Proto daemon RPC schema — replace ad-hoc socket protocol
  5. TUI consumes event stream — same data as AI consumer
  6. Extract monitoring from daemon.rs — the 600 lines of logging/status become generic, reusable infrastructure
  7. Declarative pipeline config — DAG definition in config, not code
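Steps 1 and 2 together might look like this: a typed task event plus a hook registry that logging, metrics, and the TUI all subscribe to. Variant fields and hook names are guesses, not jobkit's actual API:

```rust
use std::time::Duration;

// Typed events (step 2) instead of strings, emitted through hooks
// (step 1) that any consumer can subscribe to.
#[derive(Debug, Clone)]
enum TaskEvent {
    Started { id: u64, name: String },
    Completed { id: u64, name: String, duration: Duration, ok: bool },
}

#[derive(Default)]
struct Hooks {
    on_event: Vec<Box<dyn Fn(&TaskEvent)>>,
}

impl Hooks {
    fn subscribe(&mut self, f: impl Fn(&TaskEvent) + 'static) {
        self.on_event.push(Box::new(f));
    }
    fn emit(&self, ev: &TaskEvent) {
        for f in &self.on_event {
            f(ev)
        }
    }
}

fn main() {
    let mut hooks = Hooks::default();
    // a logger is just one subscriber; metrics and the TUI are others
    hooks.subscribe(|ev| println!("{:?}", ev));
    hooks.emit(&TaskEvent::Started { id: 1, name: "fact-mine".into() });
    hooks.emit(&TaskEvent::Completed {
        id: 1,
        name: "fact-mine".into(),
        duration: Duration::from_secs(12),
        ok: true,
    });
}
```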

File reference

  • src/agents/daemon.rs — 1952 lines, all orchestration
    • Job functions: 96-553
    • run_daemon(): 678-1143
    • Socket/RPC: 1145-1372
    • Status display: 1374-1682
  • src/tui.rs — 907 lines, polls status socket every 2s
  • schema/memory.capnp — 125 lines, data only, no RPC definitions
  • src/config.rs — configuration loading
  • External: jobkit crate (git dependency)

Mistakes I made building this (learning notes)

Per Kent's instruction: note what went wrong and WHY.

  1. Dual logging channels — I added log_event() because ctx.log_line() wasn't enough, instead of fixing the underlying abstraction. Symptom: can't find a failed job without searching two places.

  2. Magic numbers — I hardcoded constants because "I'll make them configurable later." Later never came. Every magic number is a design decision that should have been explicit.

  3. 1952-line file — daemon.rs grew organically because each new feature was "just one more function." Should have extracted when it passed 500 lines. The pain of refactoring later is always worse than the pain of organizing early.

  4. Ad-hoc RPC — String matching seemed fine for 2 commands. Now it's 4 commands and growing, with implicit formats. Should have used cap'n proto from the start — the schema IS the documentation.

  5. No tests — Zero tests in daemon code. "It's a daemon, how do you test it?" is not an excuse. The job functions are pure-ish and testable. The scheduler logic is testable with a clock abstraction.

  6. Not using systemd — There's a systemd service for the daemon. I keep starting it manually with poc-memory agent daemon start and accumulating multiple instances. Tonight: 4 concurrent daemons, 32 cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for. systemctl --user start poc-memory-daemon. ONE instance. Managed.
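The clock abstraction mentioned in item 5 can be sketched as follows — a `Clock` trait the scheduler queries instead of calling `Instant::now()` directly, so a test can drive it with a fake clock. All names here are illustrative:

```rust
use std::time::{Duration, Instant};

// Item 5's testability fix: the scheduler asks a Clock for the time,
// so tests substitute a fake clock instead of sleeping for real.
trait Clock {
    fn now(&self) -> Instant;
}

struct SystemClock;
impl Clock for SystemClock {
    fn now(&self) -> Instant {
        Instant::now()
    }
}

// Fake clock for tests: reports whatever instant it was given.
struct FakeClock(Instant);
impl Clock for FakeClock {
    fn now(&self) -> Instant {
        self.0
    }
}

struct Scheduler {
    last_daily: Instant,
    interval: Duration,
}

impl Scheduler {
    fn daily_due(&self, clock: &dyn Clock) -> bool {
        clock.now().duration_since(self.last_daily) >= self.interval
    }
}

fn main() {
    let start = Instant::now();
    let sched = Scheduler { last_daily: start, interval: Duration::from_secs(60) };
    let later = FakeClock(start + Duration::from_secs(61));
    println!("due: {}", sched.daily_due(&later));
}
```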

Pattern: every shortcut was "just for now" and every "just for now" became permanent. Kent's yelling was right every time.