The surface_observe_cycle check was gated on \`self.agent_running("surface-observe")\`,
which returns true as long as any surface-observe agent has a pid in
\`self.agents\` — regardless of which phase that agent is in. So even
after the bail script learned to allow "one in surface + one in
post-surface" concurrency, the orchestrator still refused to spawn a
second agent for the whole cycle duration.
Replace the in-memory gate with an on-disk phase check that matches
the bail script's view of the world: scan the state dir for live
\`pid-*\` files (same ones the bail script reads and keeps fresh),
and only skip spawning if some live agent is currently in the
\`surface\` phase. If the existing agent has moved on to
\`organize-search\` / \`organize-new\` / \`observe\`, the new surface
agent can start while the old one finishes its tail.
The older agent's child handle gets dropped when \`self.agents[surface-observe]\`
is overwritten on spawn — OS reaps when it exits. That's fine: the
hook is short-lived, and cross-invocation liveness is checked via
\`libc::kill(pid, 0)\` on restored pid state, not via held child handles.
Also drop the "agent is N KB behind, sleep 5s" block. That loop only
made sense when spawns were serialized — its purpose was to give a
running agent a chance to complete before the current hook returned.
With parallelism, there's nothing to wait for: either a new surface
agent just started (no point waiting on it), or the existing agent
is in the surface phase and will finish when it finishes.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>