The function was returning NULL under some loads but not others. I had the stack traces. The failing path went through cache_lookup, then alloc, then the write path. I re-read the alloc function — and the third read was different. The refcount bump happened AFTER the hash insert. The window was small but it was there. Someone could look it up, get the pointer, and hit a free before we'd credited the reference. I pulled up the other stack trace with this now in mind and the symptoms lined up exactly. The pattern I'd been looking at for an hour rearranged itself into a thing I could fix.
