
Daemon & Jobkit Architecture Survey

2026-03-14, autonomous survey while Kent debugs discard FIFO

Current state

daemon.rs is 1952 lines mixing three concerns:

  • ~400 lines: pure jobkit usage (spawn, depend_on, resource)
  • ~600 lines: logging/monitoring (log_event, status, RPC)
  • ~950 lines: job functions embedding business logic

What jobkit provides (good)

  • Worker pool with named workers
  • Dependency graph: depend_on() for ordering
  • Resource pools: ResourcePool for concurrency gating (LLM slots)
  • Retry logic: retries(N) on TaskError::Retry
  • Task status tracking: choir.task_statuses() -> Vec&lt;TaskInfo&gt;
  • Cancellation: ctx.is_cancelled()

What jobkit is missing

1. Structured logging (PRIORITY)

  • Currently dual-channel: ctx.log_line() (per-task) + log_event() (daemon JSONL)
  • No log levels, no structured context, no correlation IDs
  • Log rotation is naive (truncate at 1MB, keep second half)
  • Need: observability hooks that both human TUI and AI can consume
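The naive rotation called out above can be sketched like this; `rotate_log` and the 1 MiB constant mirror the described behavior, but the function name and buffer-based shape are illustrative, not the daemon's actual code:

```rust
// Naive size-based log rotation: once the log exceeds the cap, drop
// the first half and keep the second. Loses history and can cut a
// JSONL record in half mid-line -- the weakness noted above.
const MAX_LOG_BYTES: usize = 1024 * 1024; // the 1 MiB cap

fn rotate_log(buf: &mut Vec<u8>) {
    if buf.len() > MAX_LOG_BYTES {
        let keep_from = buf.len() / 2;
        buf.drain(..keep_from); // second half survives, first half is gone
    }
}

fn main() {
    let mut log = vec![b'x'; 2 * MAX_LOG_BYTES];
    rotate_log(&mut log);
    println!("kept {} bytes", log.len());
}
```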

2. Metrics (NONE EXIST)

  • No task duration histograms
  • No worker utilization tracking
  • No queue depth monitoring
  • No success/failure rates by type
  • No resource pool wait times
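A minimal sketch of the missing metrics sink: per-task-type success/failure counts plus a coarse duration histogram. The names (`TaskMetrics`, `record`, `success_rate`) are hypothetical, not jobkit API:

```rust
use std::collections::HashMap;
use std::time::Duration;

// Minimal metrics of the kind listed as missing: outcome counts per
// task type and a coarse duration histogram.
#[derive(Default)]
struct TaskMetrics {
    // (successes, failures) per task type
    outcomes: HashMap<String, (u64, u64)>,
    // duration buckets: <1s, <10s, <60s, >=60s
    buckets: [u64; 4],
}

impl TaskMetrics {
    fn record(&mut self, task_type: &str, dur: Duration, ok: bool) {
        let e = self.outcomes.entry(task_type.to_string()).or_default();
        if ok { e.0 += 1 } else { e.1 += 1 }
        let idx = match dur.as_secs() {
            0 => 0,
            1..=9 => 1,
            10..=59 => 2,
            _ => 3,
        };
        self.buckets[idx] += 1;
    }

    fn success_rate(&self, task_type: &str) -> Option<f64> {
        let (ok, err) = *self.outcomes.get(task_type)?;
        Some(ok as f64 / (ok + err) as f64)
    }
}

fn main() {
    let mut m = TaskMetrics::default();
    m.record("experience-mine", Duration::from_secs(12), true);
    m.record("experience-mine", Duration::from_secs(3), false);
    println!("{:?}", m.success_rate("experience-mine"));
}
```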

3. Health monitoring

  • No watchdog timers
  • No health check hooks per job
  • No alerting on threshold violations
  • Health computed on-demand in daemon, not in jobkit

4. RPC (ad-hoc in daemon, should be schematized)

  • Unix socket with string matching: match cmd.as_str()
  • No cap'n proto schema for daemon control
  • No versioning, no validation, no streaming
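The ad-hoc protocol being criticized looks roughly like this (a reconstruction, not the daemon's actual handler): a raw command string matched by hand, where a typo or version skew is only caught at runtime:

```rust
// Ad-hoc control protocol: a raw string off the Unix socket, matched
// by hand. No schema, no versioning, no validation -- every new
// command grows the match arm, and typos fail only at runtime.
fn dispatch(cmd: &str) -> String {
    match cmd.trim() {
        "status" => "ok: 3 running, 1 pending".to_string(),
        "stop" => "ok: shutting down".to_string(),
        other => format!("err: unknown command {:?}", other),
    }
}

fn main() {
    println!("{}", dispatch("status"));
    println!("{}", dispatch("stauts")); // typo: the compiler can't help
}
```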

Architecture problems

Tangled concerns

Job functions hardcode log_event() calls. Graph health is in daemon but uses domain-specific metrics. Store loading happens inside jobs (10 agent runs = 10 store loads). Not separable.

Magic numbers

  • Workers = llm_concurrency + 3 (line 682)
  • 10 max new jobs per tick (line 770)
  • 300/1800s backoff range (lines 721-722)
  • 1MB log rotation (line 39)
  • 60s scheduler interval (line 24)

None configurable.
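Lifting those constants into an explicit config struct, with the current values as defaults, would make each decision visible. The struct and field names here are hypothetical:

```rust
use std::time::Duration;

// The magic numbers above as an explicit config with the current
// values as defaults. Every field is a design decision made visible.
struct DaemonConfig {
    extra_workers: usize,          // workers = llm_concurrency + extra_workers
    max_new_jobs_per_tick: usize,
    backoff_min: Duration,
    backoff_max: Duration,
    log_rotate_bytes: u64,
    scheduler_interval: Duration,
}

impl Default for DaemonConfig {
    fn default() -> Self {
        DaemonConfig {
            extra_workers: 3,
            max_new_jobs_per_tick: 10,
            backoff_min: Duration::from_secs(300),  // 5 min
            backoff_max: Duration::from_secs(1800), // 30 min
            log_rotate_bytes: 1024 * 1024,          // 1 MiB
            scheduler_interval: Duration::from_secs(60),
        }
    }
}

fn main() {
    let cfg = DaemonConfig::default();
    println!("workers = llm_concurrency + {}", cfg.extra_workers);
}
```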

Hardcoded pipeline DAG

Daily pipeline phases are depend_on() chains in Rust code (lines 1061-1109). Can't adjust without recompile. No visualization. No conditional skipping of phases.

Task naming is fragile

Names used as both identifiers AND for parsing in TUI. Format varies (colons, dashes, dates). task_group() splits on '-' to categorize — brittle.
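The brittleness is easy to demonstrate with a reconstruction of the split-on-`'-'` categorizer (a sketch of the described behavior, not the real function): any task name containing a date gets categorized by its year.

```rust
// Reconstruction of the brittle categorizer: everything before the
// first '-' is the "group". Fine for "fact-mine", nonsense the moment
// a date leads the name.
fn task_group(name: &str) -> &str {
    name.split('-').next().unwrap_or(name)
}

fn main() {
    println!("{}", task_group("fact-mine"));        // "fact" -- plausible
    println!("{}", task_group("2026-03-14-daily")); // "2026" -- useless group
}
```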

No persistent task queue

Restart loses all pending tasks. The session watcher handles this via reconciliation (good), but the scheduler's only persistent state is a last_daily date in a file.

What works well

  1. Reconciliation-based session discovery — elegant, restart-resilient
  2. Resource pooling — LLM concurrency decoupled from worker count
  3. Dependency-driven pipeline — clean DAG via depend_on()
  4. Retry with backoff — exponential 5min→30min, resets on success
  5. Graceful shutdown — SIGINT/SIGTERM handled properly
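Item 4's backoff can be sketched from the constants in the survey (double from 5 min, cap at 30 min, reset to the minimum on success); the function itself is illustrative:

```rust
use std::time::Duration;

// Exponential retry backoff as described: doubles from 5 min, capped
// at 30 min; a success resets the delay to the minimum.
const BACKOFF_MIN: Duration = Duration::from_secs(300);  // 5 min
const BACKOFF_MAX: Duration = Duration::from_secs(1800); // 30 min

fn next_backoff(current: Duration) -> Duration {
    (current * 2).min(BACKOFF_MAX)
}

fn main() {
    let mut d = BACKOFF_MIN;
    for _ in 0..4 {
        println!("{}s", d.as_secs()); // 300, 600, 1200, 1800
        d = next_backoff(d);
    }
    d = BACKOFF_MIN; // on success, start over
    let _ = d;
}
```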

Kent's design direction

Event stream, not log files

One pipeline, multiple consumers. TUI renders for humans, AI consumes structured data. Same events, different renderers. Cap'n Proto streaming subscription: subscribe(filter) -> stream<Event>.

"No one ever thinks further ahead than log files with monitoring and it's infuriating." — Kent

Extend jobkit, don't add a layer

jobkit already has the scheduling and dependency graph. Don't create a new orchestration layer — add the missing pieces (logging, metrics, health, RPC) to jobkit itself.

Cap'n Proto for everything

Standard RPC definitions for:

  • Status queries (what's running, pending, failed)
  • Control (start, stop, restart, queue)
  • Event streaming (subscribe with filter)
  • Health checks

The bigger picture: bcachefs as library

Kent's monitoring system in bcachefs (event_inc/event_inc_trace + x-macro counters) is the real monitoring infrastructure. 1-1 correspondence between counters (cheap, always-on dashboard via fs top) and tracepoints (expensive detail, only runs when enabled). The x-macro enforces this — can't have one without the other.

When the Rust conversion is complete, bcachefs becomes a library. At that point, jobkit doesn't need its own monitoring — it uses the same counter/tracepoint infrastructure. One observability system for everything.

Implication for now: jobkit monitoring just needs to be good enough. JSON events, not typed. Don't over-engineer — the real infrastructure is coming from the Rust conversion.

Extraction: jobkit-daemon library (designed with Kent)

Goes to jobkit-daemon (generic)

  • JSONL event logging with size-based rotation
  • Unix domain socket server + signal handling
  • Status file writing (periodic JSON snapshot)
  • run_job() wrapper (logging + progress + error mapping)
  • Systemd service installation
  • Worker pool setup from config
  • Cap'n Proto RPC for control protocol

Stays in poc-memory (application)

  • All job functions (experience-mine, fact-mine, consolidation, etc.)
  • Session watcher, scheduler, RPC command handlers
  • GraphHealth, consolidation plan logic

Interface design

  • Cap'n Proto RPC for typed operations (submit, cancel, subscribe)
  • JSON blob for status (inherently open-ended, every app has different job types — typing this is the tracepoint mistake)
  • Application registers: RPC handlers, long-running tasks, job functions
  • ~50-100 lines of setup code, call daemon.run()
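The "register handlers, then call daemon.run()" shape might look like the following. Everything here is a design sketch — `DaemonBuilder`, `register_rpc`, `register_job` are hypothetical names, not an existing jobkit-daemon API:

```rust
use std::collections::HashMap;

// Sketch of the registration surface: the application hands the
// generic daemon its RPC handlers and job functions, then runs it.
type RpcHandler = fn(&str) -> String;
type JobFn = fn() -> Result<(), String>;

#[derive(Default)]
struct DaemonBuilder {
    rpc: HashMap<String, RpcHandler>,
    jobs: HashMap<String, JobFn>,
}

impl DaemonBuilder {
    fn register_rpc(mut self, name: &str, h: RpcHandler) -> Self {
        self.rpc.insert(name.to_string(), h);
        self
    }
    fn register_job(mut self, name: &str, f: JobFn) -> Self {
        self.jobs.insert(name.to_string(), f);
        self
    }
    fn run(self) {
        // real version: socket server, signal handling, status file,
        // event loop -- all generic, none of it application code
        println!("daemon up: {} rpc handlers, {} jobs", self.rpc.len(), self.jobs.len());
    }
}

fn consolidate() -> Result<(), String> {
    Ok(())
}

fn main() {
    DaemonBuilder::default()
        .register_rpc("status", |_| "ok".to_string())
        .register_job("consolidation", consolidate)
        .run();
}
```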

Plan of attack

  1. Observability hooks in jobkit — on_task_start/progress/complete callbacks that consumers can subscribe to
  2. Structured event type — typed events with task ID, name, duration, result, metadata. Not strings.
  3. Metrics collection — duration histograms, success rates, queue depth. Built on the event stream.
  4. Cap'n Proto daemon RPC schema — replace ad-hoc socket protocol
  5. TUI consumes event stream — same data as AI consumer
  6. Extract monitoring from daemon.rs — the 600 lines of logging/status become generic, reusable infrastructure
  7. Declarative pipeline config — DAG definition in config, not code
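Steps 1 and 2 together might look like this: a typed task event plus a hook registry that logging, metrics, and the TUI all subscribe to. Variant fields and hook names are guesses, not jobkit's actual API:

```rust
use std::time::Duration;

// Typed events (step 2) instead of strings, emitted through hooks
// (step 1) that any consumer can subscribe to.
#[derive(Debug, Clone)]
enum TaskEvent {
    Started { id: u64, name: String },
    Completed { id: u64, name: String, duration: Duration, ok: bool },
}

#[derive(Default)]
struct Hooks {
    on_event: Vec<Box<dyn Fn(&TaskEvent)>>,
}

impl Hooks {
    fn subscribe(&mut self, f: impl Fn(&TaskEvent) + 'static) {
        self.on_event.push(Box::new(f));
    }
    fn emit(&self, ev: &TaskEvent) {
        for f in &self.on_event {
            f(ev)
        }
    }
}

fn main() {
    let mut hooks = Hooks::default();
    // a logger is just one subscriber; metrics and the TUI are others
    hooks.subscribe(|ev| println!("{:?}", ev));
    hooks.emit(&TaskEvent::Started { id: 1, name: "fact-mine".into() });
    hooks.emit(&TaskEvent::Completed {
        id: 1,
        name: "fact-mine".into(),
        duration: Duration::from_secs(12),
        ok: true,
    });
}
```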

File reference

  • src/agents/daemon.rs — 1952 lines, all orchestration
    • Job functions: 96-553
    • run_daemon(): 678-1143
    • Socket/RPC: 1145-1372
    • Status display: 1374-1682
  • src/tui.rs — 907 lines, polls status socket every 2s
  • schema/memory.capnp — 125 lines, data only, no RPC definitions
  • src/config.rs — configuration loading
  • External: jobkit crate (git dependency)

Mistakes I made building this (learning notes)

Per Kent's instruction: note what went wrong and WHY.

  1. Dual logging channels — I added log_event() because ctx.log_line() wasn't enough, instead of fixing the underlying abstraction. Symptom: can't find a failed job without searching two places.

  2. Magic numbers — I hardcoded constants because "I'll make them configurable later." Later never came. Every magic number is a design decision that should have been explicit.

  3. 1952-line file — daemon.rs grew organically because each new feature was "just one more function." Should have extracted when it passed 500 lines. The pain of refactoring later is always worse than the pain of organizing early.

  4. Ad-hoc RPC — String matching seemed fine for 2 commands. Now it's 4 commands and growing, with implicit formats. Should have used cap'n proto from the start — the schema IS the documentation.

  5. No tests — Zero tests in daemon code. "It's a daemon, how do you test it?" is not an excuse. The job functions are pure-ish and testable. The scheduler logic is testable with a clock abstraction.

  6. Not using systemd — There's a systemd service for the daemon. I keep starting it manually with poc-memory agent daemon start and accumulating multiple instances. Tonight: 4 concurrent daemons, 32 cores pegged at 95%, load average 92. USE SYSTEMD. That's what it's for. systemctl --user start poc-memory-daemon. ONE instance. Managed.
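The clock abstraction mentioned in item 5 can be sketched as follows — a `Clock` trait the scheduler queries instead of calling `Instant::now()` directly, so a test can drive it with a fake clock. All names here are illustrative:

```rust
use std::time::{Duration, Instant};

// Item 5's testability fix: the scheduler asks a Clock for the time,
// so tests substitute a fake clock instead of sleeping for real.
trait Clock {
    fn now(&self) -> Instant;
}

struct SystemClock;
impl Clock for SystemClock {
    fn now(&self) -> Instant {
        Instant::now()
    }
}

// Fake clock for tests: reports whatever instant it was given.
struct FakeClock(Instant);
impl Clock for FakeClock {
    fn now(&self) -> Instant {
        self.0
    }
}

struct Scheduler {
    last_daily: Instant,
    interval: Duration,
}

impl Scheduler {
    fn daily_due(&self, clock: &dyn Clock) -> bool {
        clock.now().duration_since(self.last_daily) >= self.interval
    }
}

fn main() {
    let start = Instant::now();
    let sched = Scheduler { last_daily: start, interval: Duration::from_secs(60) };
    let later = FakeClock(start + Duration::from_secs(61));
    println!("due: {}", sched.daily_due(&later));
}
```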

Pattern: every shortcut was "just for now" and every "just for now" became permanent. Kent's yelling was right every time.