ROCm-specific setup with:
- AITER attention backends (VLLM_ROCM_USE_AITER=1)
- Reduced cudagraph capture sizes (avoids the DeltaNet cache conflict)
- BF16 model weights + FP8 KV cache as the default (FP8 weights can
  be slower on MI300X due to ROCm kernel maturity)
- FP8=1 flag to benchmark FP8 model weights instead
Key for the training plan: if FP8 matmuls are slow on MI300X, the
quantize-and-expand strategy needs B200 hardware instead.
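The setup above can be sketched as a launch snippet. This is a
minimal illustration, not the repo's actual script: the vLLM flag
names (`--dtype`, `--kv-cache-dtype`, `--quantization`) are standard
vLLM CLI options, but the FP8 gating and argument assembly here are
assumptions, and the cudagraph capture-size reduction is left as a
comment since its exact flag depends on the vLLM version in use.

```shell
# Enable AITER attention backends on ROCm.
export VLLM_ROCM_USE_AITER=1

# FP8=1 switches from BF16 weights to FP8 weights for benchmarking.
# (FP8 weights can be slower on MI300X; see commit message.)
if [ "${FP8:-0}" = "1" ]; then
  QUANT_ARGS=" --quantization fp8"
else
  QUANT_ARGS=""
fi

# BF16 model + FP8 KV cache default; cudagraph capture sizes are
# reduced separately in the engine config (version-dependent flag).
VLLM_ARGS="--dtype bfloat16 --kv-cache-dtype fp8${QUANT_ARGS}"
echo "$VLLM_ARGS"
```

With FP8 unset this prints the BF16-weights default; with FP8=1 it
appends the FP8-weights quantization flag.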
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>