The bugs that actually broke production.
Every other page in this curriculum is forward-looking: here is
the design, here is the contract, here is the code path. This
page is backward-looking. It is the four real bug classes that
hit baremetal ex9 between 2026-05-18 and
2026-05-25, the symptoms we saw, the root causes we found, the
tools we used to find them, and the rules to follow so they do
not happen again. If you come back to this curriculum after a
bad day on a target, start here. The end of the page is a
one-page debug playbook — a flowchart you can run on a
wedged target without thinking.
- The four bug classes — at a glance
- Bug 1: The VFIO/QMP quit wedge
- Bug 2: The forkbomb + terminate race
- Bug 3: The crash recovery flow
- Bug 4: Live migration issues
- Bug 5: The bdev_get_bdevs SEGV (bonus)
- The debug playbook: SPDK is wedged, what now?
- Edge cases & what trips people up
The four bug classes — at a glance
| Bug class | What you see | Curriculum layer | Recovery |
|---|---|---|---|
| VFIO/QMP quit wedge | Stale vfio-user export; nvmf_subsystem_remove_ns blocks; fresh VMs stuck in CREATING | 7.4 | SPDK service restart |
| Forkbomb + terminate race | Live QEMU, API TERMINATING; no escalation; QEMU never dies | 7.3 | QMP quit with deadline |
| SPDK crash recovery | All vhost-user sockets die; VMs hung forever; no auto-reconnect | 7.1 | systemd marker + restart vmengine |
| Live migration | VFIO migration data copy hangs; large pages cannot iterate | 7.4 | vfio-user migration; or no-migration |
| bdev_get_bdevs SEGV (bonus) | SPDK segfaults on a read-only RPC after QEMU death under active IO | 4.2 | Patch nvme_rdma_ctrlr_get_memory_domains |
Bug 1: The VFIO/QMP quit wedge
What happened
On 2026-05-22, the following sequence on baremetal
ex9 wedged the entire create path. The repro, in
full, is at VFIO/QMP quit wedge repro:1 :
- Started direct root-disk IO inside VM 11263.
- Triggered guest kernel panic via sysrq.
- Sent QMP quit to the QEMU monitor socket.
- Watched diskengine try to detach the VM's SPDK NVMf/vfio-user subsystem.
- Tried to create a new VM. The root RAID came online; the vfio-user export was never created; /var/diskengine/vfio-user/<vm_id>/cntrl never appeared; QEMU never launched.
The result, in diskengine's logs:
raid detach: vol 2591 spdk gate failed (namespace raid_2591 still present
in nqn.2026-02.io.excloud:vm:11263); deferring
And the stuck VM in the API:
11265 / i-e53f8f6d80a9 CREATING
/var/diskengine/vfio-user/11265/cntrl: missing
/var/vmengine/vms/11265: missing
QEMU for 11265: missing
nqn.2026-02.io.excloud:vm:11265: missing
Why it happened
The working theory, captured in the incident report:
guest panic + active root IO + QMP quit
-> vfio-user endpoint disconnect while IO accounting is still outstanding
-> NVMf subsystem pause never completes
-> remove_ns/delete_subsystem RPC blocks
-> diskengine VFIO detach pass never returns
-> fresh attach/export work is starved
-> new VM reaches RAID online but no cntrl socket/QEMU
The root cause, in the curriculum's terms, is a missing
pause-counter accounting path. The
NVMf subsystem pause in lib/nvmf/nvmf.c waits
while management IO or namespace IO is outstanding. When
QEMU dies while IO is outstanding, the vfio-user
transport's tran_sock_get_request_header returns
ENOTCONN, and the quiesce path in
module/vfu_device/vfu_virtio.c returns
-EBUSY while io_outstanding is
non-zero. The bdev IO that the count is waiting for never
completes because the QEMU side is gone. The pause callback
never fires. The remove_ns RPC never returns.
The diskengine side compounds this: it does not have a timeout on the JSON-RPC call, so a single stuck RPC pins a goroutine forever, and the next iteration of the vfio-user attach loop is starved behind it.
How to detect it in the future
Three signals, in order of reliability:
- diskengine's "spdk gate failed" log line. The namespace-gate deferral is the surface symptom and is always logged. Watch for it repeating every few seconds.
- Direct SPDK RPC check: scripts/rpc.py nvmf_subsystem_remove_ns <nqn> <nsid> hangs. Read-only RPCs (nvmf_get_subsystems, bdev_raid_get_bdevs) still respond, so SPDK is not globally dead; only this one call is stuck.
- spdk_top, POLLERS tab: the bdev module poller is at 0% (no completions are coming through) but the threads are at 100% (something is pinned). This is the “reactor at 100% but doing nothing” pattern from 9.1.
How to prevent it in code
Three rules. The first is the only one that fixes the root cause; the second and third are the diskengine side:
- On vfio-user endpoint disconnect, force-clear the outstanding-IO counter. The NVMf subsystem pause callback should treat a client disconnect as a quiesce boundary and zero out io_outstanding. This is an SPDK change in lib/nvmf/vfio_user.c; until it lands, the bug recurs.
- Put deadlines on every JSON-RPC call. A context.WithTimeout around every SPDK RPC, and a config knob for the timeout (default 30 s). The internal/spdkclient/client.go call path already has deadline code present but commented out; un-comment it and set a sane default.
- Decouple attach and detach goroutines. A wedged detach must not block a new attach. The vfio_user_attach.go and vfio_user_detach.go loops are already separate files; make sure they run on independent goroutines with no shared mutex.
How to recover when it happens
The recovery, per the incident report:
systemctl restart spdk.service. The chain that
follows:
After the restart, the stale NQN is gone, the new VM
(11265) reaches RUNNING, and the
affected VM (11263) is reported as
STOPPED with a residual inconsistency in
vmengine local state (the
/var/vmengine/vms/11263/vm.state file remains
RUNNING). The SPDK side is fully recovered. The
local state is a separate vmengine bug (see Bug 2 below).
Bug 2: The forkbomb + terminate race
What happened
On 2026-05-19, the 10-VM immediate-terminate repro reproduced this bug at scale. From VM_LIFECYCLE repro log:780 and Bug 1:33 :
If a VM is terminated immediately after it first reaches RUNNING, vmengine sends one QMP system_powerdown, writes local vm.state=TERMINATING, and then never escalates. The guest may ignore/delay early ACPI shutdown, leaving QEMU alive and the API stuck in TERMINATING.
Concretely, for VMs 10916–10925:
- All ten reproduced API TERMINATING with live QEMU still running.
- Local ex9 VM state files were TERMINATING.
- Scoped QMP quit to only these ten test VMs generated host-qmp-quit
shutdown events and allowed vmengine cleanup to finalize all ten
to TERMINATED.
Why it happened
The vmengine terminate path sends one QMP
system_powerdown, writes
vm.state=TERMINATING, and then on later
reconciliation passes “immediately returns if local
state file is already TERMINATING.” There is no
timeout/escalation path analogous to the
STOPPING path.
The QMP client is also missing read/write deadlines. The
qmp.Run function uses
net.DialTimeout but no per-operation deadline.
When the QEMU monitor socket is alive but QEMU is hung
inside the guest, the read can block forever.
How to detect it
The diagnostic is the API state:
exc compute list | awk '$1=="<id>"'
# STATE: TERMINATING for more than 30 seconds with a live QEMU
And the QMP check:
ssh ex9 'printf "{\"execute\":\"qmp_capabilities\"}\r\n{\"execute\":\"query-status\"}\r\n" \
| sudo -n socat - UNIX-CONNECT:/var/vmengine/vms/<id>/monitor.sock'
# status=running, running=true
If the API says TERMINATING, the local
vm.state file says TERMINATING,
but the QEMU is alive and reporting
status=running, you have a stuck terminate.
How to prevent it in code
Two rules, both implemented in the working tree at the time of the incident:
- Track terminatingSince like stoppingSince. Persist it (or derive it from the vm.state mtime) so escalation survives vmengine restarts. If the VM remains TERMINATING with QEMU alive past the deadline, escalate: graceful system_powerdown → QMP quit → OS signal.
- Add QMP per-operation deadlines. The qmp.Run function should call SetReadDeadline/SetWriteDeadline on the Unix socket connection. The QEMU monitor socket can stall reads while the guest is in a stuck shutdown state.
How to recover when it happens
The recovery is scoped QMP quit:
ssh ex9 'printf "{\"execute\":\"qmp_capabilities\"}\r\n{\"execute\":\"quit\"}\r\n" \
| sudo -n socat - UNIX-CONNECT:/var/vmengine/vms/<id>/monitor.sock'
# SHUTDOWN reason=host-qmp-quit
The vmengine loop will then finalize the VM to
TERMINATED on the next reconciliation tick.
This was the recovery used in all of the 10-VM repros; it
is fast (sub-second) and does not require any service
restart.
Bug 3: The crash recovery flow
What happened
This is a recurring scenario, not a one-time bug: when SPDK crashes or
restarts, every vhost-user socket dies. QEMU VMs that
were using those sockets hang forever — the file
descriptor is dead, there is no auto-reconnect. The
diskengine's reconciliation loops do retry, but with
no escalation, and vmengine's
forceStopQemuVM cannot bring up dead QEMUs
that need their fds refreshed.
The bug class is “no defined recovery path for ‘SPDK restarted under a running QEMU’.” Ungraceful BM reboots are not a problem — the loops rebuild everything. The problem is a service restart while QEMU is alive.
Why it happened
The original design assumed QEMU's vhost-user-blk socket is a stable file descriptor across SPDK restarts. It is not. vhost-user-blk uses a Unix socket fd obtained at QEMU boot. When SPDK crashes, the fd becomes a dead reference to a no-longer-listening socket. The VM is wedged even after SPDK comes back with a fresh socket at the same path.
How to detect it
Two signals:
- spdk_top is empty or the JSON-RPC socket is missing. Check with ls -l /var/tmp/spdk.sock and systemctl status spdk.service.
- VMs are still RUNNING in the API but QEMU is hung on a dead socket. The vmengine log will show repeated dial unix /var/vmengine/vms/<id>/monitor.sock: connect: connection refused or vfio_user_device_io_region_read: timed out waiting for reply: Connection timed out.
How to prevent it in code
The recovery design is at SPDK crash recovery plan:13 . The rules:
- Use systemd ExecStopPost to drop marker files. Two markers: /var/vmengine/SPDK_RESTART and /var/diskengine/SPDK_RESTART. The systemd unit drops them in ExecStopPost:
  ExecStopPost=/bin/touch /var/vmengine/SPDK_RESTART
  ExecStopPost=/bin/touch /var/diskengine/SPDK_RESTART
  ExecStopPost=/bin/systemctl restart vmengine
  ExecStopPost=/bin/systemctl restart diskengine
  Order matters: the markers must exist before vmengine/diskengine start.
- Clean stale vhost sockets in the SPDK init script. The spdk-baremetal-init.sh must run rm -f /var/diskengine/vhost* before SPDK opens its RPC socket. Stale socket files confuse diskengine's ensureVhost() checks.
- vmengine on startup: SIGINT every RUNNING QEMU and set it to STARTING. Check the marker on startup. If present, query DB for RUNNING/STARTING/RESTARTING/UPGRADING/MIGRATING VMs on this BM, send SIGINT to each (best-effort), set the DB state to STARTING, and remove the marker. Then proceed to the normal loop.
- diskengine on startup: clean DISKENGINE_CLEANUP markers. The existing reconcile loops are idempotent: they see DB state and rebuild whatever is missing in SPDK. No special recovery logic is needed; just remove the marker.
How to recover when it happens
The recovery is a sequence of service restarts, not a manual one. The systemd unit ordering does the work:
The crucial line: VMs must be killed. QEMU's vhost-user-blk has no auto-reconnect. The running QEMU holds a dead fd even after diskengine recreates the socket at the same path. There is no way to make a running QEMU use a fresh fd; you have to kill it and re-launch.
Bug 4: Live migration issues
What happened
Live migration with vhost-user-blk on SPDK is possible but has narrow timing constraints. The vfio-user path (the new path) is designed for live migration, but the on-the-wire protocol is large and state-heavy. The plan is at VFIO user migration plan:1 .
The current state on ex9:
ex9 uses vfio-user NVMe, not vhost-user-blk.
Live migration is not yet production-enabled.
Validation status, from the plan:
QEMU's vfio-user-pci path is supported, but SPDK namespace add/remove semantics are version-sensitive: the programming guide requires paused or inactive subsystems for namespace changes, while newer changelog notes say add/remove can be done with more limited pause scopes. This implementation uses the direct RPC path and should be tested against the deployed SPDK build before production rollout.
Why it happened
Two independent issues:
- Namespace mutation semantics. SPDK requires a paused subsystem for nvmf_subsystem_add_ns or nvmf_subsystem_remove_ns in some versions; the RPC will reject the call mid-IO otherwise. For a live migration, the source VM is actively doing I/O. Pausing the subsystem stops the I/O, which is the migration boundary. If diskengine is in the middle of adding a hot-plugged disk during a migration, the order is: add_ns → migration → add_ns again. Both calls need to be pause-safe.
- vfio-user state serialization. The vfio-user protocol carries the device's full state across the migration: pending completions, mapped memory, etc. The page directory (for DMA) must be rebuilt on the target. If a guest has pinned hugepages in a configuration the target doesn't have, the migration hangs in the “precopy” phase trying to iterate pages it cannot address.
How to detect it
A migration hang looks like:
- QEMU info migrate reports postcopy not started, or precopy iteration count growing slowly.
- diskengine vfio-user export for the migrating VM is in paused state (the NVMf subsystem was paused for the namespace operation). It must come back to active on the target.
- The QMP query-migrate blocks indefinitely. The monitor socket is alive but QEMU is stuck.
How to prevent it in code
Three rules, ordered by importance:
- Make the namespace mutation pause-scope explicit. Use a dedicated nvmf_subsystem_pause call with a short timeout, do the namespace operation, then nvmf_subsystem_resume. The current design in internal/baremetal/vfio_user_attach.go relies on the default pause behaviour, which is version-dependent. Pin the behaviour.
- Test the migration end-to-end on the deployed SPDK build before enabling it in production. The vfio-user migration path depends on both QEMU and SPDK behaving consistently about the pause/resume semantics.
- Pre-pin the source BM's hugepage configuration to match the target. The migration data copy moves memory pages; if the target has fewer hugepages or different NUMA layout, the page iteration hangs.
How to recover when it happens
The recovery is to not try to recover the migration. Kill the source QEMU, kill the target QEMU, restart the VM fresh on the target. The vfio-user protocol does not support mid-flight resumption — once the precopy loop is hung, the migration is lost. The cleanest exit is to fail fast and let the caller retry.
Bug 5: The bdev_get_bdevs SEGV (bonus)
What happened
On 2026-05-20, an ex9 SPDK process segfaulted when an
operator called the read-only RPC
bdev_get_bdevs after a batch of VMs was
killed via SIGKILL while their root-volume IO was
active. The crash was reproducible, with this stack
(from the core file at Bug 6:501 ):
#0 nvme_rdma_ctrlr_get_memory_domains
#1 bdev_nvme_get_memory_domains
#2 rpc_dump_bdev_info
#3 spdk_for_each_bdev
#4 rpc_bdev_get_bdevs
#5 parse_single_request
#6 jsonrpc_server_conn_recv
#7 rpc_subsystem_poll_servers
#8 thread_execute_timed_poller
#9 spdk_thread_poll
#10 _reactor_run
The fault address in the kernel log was
0x10, consistent with a NULL/invalid
nested pointer in
nvme_rdma_ctrlr_get_memory_domains after
the controller/qpair was torn down.
Why it happened
The crashing SPDK function dereferences the RDMA admin qpair path without local null/state guards:
/* /Users/lolwierd/Projects/excloud/spdk/lib/nvme/nvme_rdma.c:3670 */
nvme_rdma_ctrlr_get_memory_domains(...)
rqpair = nvme_rdma_qpair(ctrlr->adminq)
domains[0] = rqpair->rdma_qp->domain // <-- crash here
The flow was: a forced QEMU death left the NVMe/RDMA
controller in a torn-down state. bdev_get_bdevs
walked the bdev list, hit the bdev backed by the
broken controller, called
spdk_nvme_ctrlr_get_memory_domains,
which dereferenced a now-invalid rdma_qp
pointer.
How to detect it
The crash is detectable by:
- The systemd journal records a Signal 11 (SEGV) for the SPDK process. Look for spdk.service ... Main PID changed (old=N, new=M) immediately after the SEGV.
- A core file appears at /var/lib/systemd/coredump/core.reactor_*.<pid>.<ts>.zst. Extract it with coredumpctl (the systemd-coredump CLI) and open the core in gdb.
How to prevent it in code
Two rules, with the first being the actual fix:
- Patch nvme_rdma_ctrlr_get_memory_domains to guard against NULL/invalid state:
  nvme_rdma_ctrlr_get_memory_domains(ctrlr, ...) {
      if (!ctrlr || !ctrlr->adminq)
          return 0; /* no memory domain available */
      rqpair = nvme_rdma_qpair(ctrlr->adminq);
      if (!rqpair || !rqpair->rdma_qp)
          return 0;
      domains[0] = rqpair->rdma_qp->domain;
      ...
  }
  The existing code trusts that the controller is in a good state; the fix is the early return.
- Don't call bdev_get_bdevs on a baremetal/SPDK target during controller reset. The diskengine code already documents this in internal/baremetal/utils.go:146: “This intentionally avoids calling bdev_get_bdevs, which can crash SPDK during NVMe controller reset (SEGV observed in production).” The same rule applies to operator scripts and to any other code path.
How to recover when it happens
systemd auto-restarts SPDK with
Restart=always and
RestartSec=3. The new SPDK process
comes up, diskengine re-establishes the
vfio-user exports, vmengine sees the marker from
Bug 3's recovery flow and brings the VMs back.
The whole recovery is automatic — the
operator doesn't have to do anything.
But the bug class will recur until the
nvme_rdma_ctrlr_get_memory_domains
patch lands.
The debug playbook: SPDK is wedged, what now?
The flowchart below is the one-page summary. Print it. Put it on the runbook page. Use it the next time the alert fires.
flowchart TD
  Q0["SPDK target is wedged.<br/>What now?"] --> Q1["Q1: Is the<br/>process alive?"]
  Q1 -- "no" --> A1["Check coredump<br/>gdb on core<br/>then gdb_macros<br/>spdk_print_bdevs"]
  Q1 -- "yes" --> Q2["Q2: Is the JSON-RPC<br/>socket responding?"]
  Q2 -- "no" --> A2["systemctl status spdk.service<br/>ls -l on the RPC socket<br/>journalctl -u spdk --since 5m"]
  Q2 -- "yes" --> Q3["Q3: spdk_top THREADS tab:<br/>one reactor at 100%,<br/>others idle?"]
  Q3 -- "yes" --> A3["spdk_top POLLERS tab<br/>sort by Run count desc<br/>find the runaway poller"]
  Q3 -- "no" --> Q4["Q4: spdk_top POLLERS tab:<br/>a single RPC handler<br/>poller pinned?"]
  Q4 -- "yes" --> A4["strace -p PID<br/>trace=read,write<br/>to see the syscall hang"]
  Q4 -- "no" --> Q5["Q5: Is a specific RPC<br/>hung but others<br/>still work?"]
  Q5 -- "yes" --> A5["spdk_trace -s app -p pid<br/>check the tracepoint<br/>just before the hang"]
  Q5 -- "no" --> Q6["Q6: Is a bdev module<br/>poller at high busy count<br/>but reactor at low busy?"]
  Q6 -- "yes" --> A6["Backend saturated<br/>bdev_get_iostat for queue depth<br/>app-side limit, not SPDK bug"]
  Q6 -- "no" --> Q7["Q7: Is a vfio-user<br/>NQN stuck on remove_ns?"]
  Q7 -- "yes" --> A7["VFIO/QMP quit wedge<br/>recover: systemctl restart<br/>spdk.service (see Bug 1)"]
  Q7 -- "no" --> Q8["Q8: Are you debugging<br/>a production target<br/>or a reproducer?"]
  Q8 -- "production" --> A8["bpftrace on bdev_io submit<br/>to confirm IOPS,<br/>then escalate (see 9.2)"]
  Q8 -- "reproducer" --> A9["valgrind --tool=memcheck<br/>on a minimal repro<br/>(see 9.2)"]
  classDef recov fill:#d6f5d6,stroke:#2a6f2a;
  classDef tool fill:#cfe1ff,stroke:#1c4f8a;
  classDef tip fill:#fdf2cf,stroke:#8a6f1a;
  classDef bug fill:#f5d6e0,stroke:#8a1c4f;
  class A1,A2,A3,A4,A5,A6,A8,A9 tool
  class A7 bug
  class Q0 tip
fig. 1 The eight questions that cover every wedged-SPDK incident we have seen. The first three are non-negotiable checks: is the process alive, is the RPC socket alive, is one reactor pinned. Past that, the answer is the specific tool. The branches are blue (tool), green (recovery), or pink (the one bug class with a known recovery: restart spdk.service).
Edge cases & what trips people up
Reproducibility
The QMP quit wedge is the most-likely-to-miss-the-repro bug in this set. The same sequence on a clean baremetal often does not reproduce it. From the incident report, the 2026-05-25 BM rerun reproduced the symptom (stale vfio-user/NQN condition) but not the attach wedge:
The clean BM did not reproduce the SPDK attach wedge in three QMP-quit attempts, including a higher-IO run. The original hang still looks real, but it likely depends on a dirtier/stale detach condition or a narrower timing window where nvmf_subsystem_remove_ns is called while outstanding vfio-user IO accounting cannot drain.
The lesson: the bug is real, the conditions to trigger it
are narrow, and you may not be able to reproduce it on
demand. The recovery (restart spdk.service)
is the production-grade fix. Patch the root cause; fall
back to the restart until the patch lands.
Stale local state on the operating layer
After Bug 1's recovery, the affected VM is
STOPPED in the API but
/var/vmengine/vms/<id>/vm.state is
still RUNNING. This is a vmengine bug
that the recovery does not fix. The path to fix it
is at
VM lifecycle plan:29 :
move resource release from StopVM to the
cleanup script, and use FOR UPDATE SKIP LOCKED
to avoid races between StartVM and cleanup.
The local vm.state file should be the
source of truth for vmengine reconciliation, not the
API.
What trips people up
- “SPDK was restarted, but the VMs are still hung.” You have Bug 3. The recovery requires vmengine to be restarted too, which requires the ExecStopPost marker. If the marker is not in the systemd unit, the recovery is incomplete. Verify with systemctl cat spdk.service | grep ExecStopPost.
- “I sent QMP quit and the VM still didn't terminate.” The QMP monitor socket may be wedged on a stuck read. The fix is the QMP deadline (Bug 2's second rule) plus SIGKILL as a final fallback. Make sure the SIGKILL is scoped to a specific test VM (matched on -name vm<id>) before pulling that trigger.
- “bdev_get_bdevs crashed SPDK.” You have Bug 5. Don't call that RPC on a baremetal SPDK target. Use bdev_raid_get_bdevs all or bdev_nvme_get_controllers instead.
- “Live migration is hung in precopy.” You have Bug 4. Don't wait for it to complete. Kill the source QEMU, the target QEMU, and restart the VM on the target. Pause-scope the namespace operation before the next attempt.
- “The diskengine log shows ‘spdk gate failed’ but I can't find the cause.” This is Bug 1's surface symptom. Run scripts/rpc.py nvmf_subsystem_remove_ns <nqn> <nsid> directly. If it hangs, the bug is in the NVMf pause path. If it returns immediately, the bug is in the diskengine gate logic.
What to take away
The four bug classes in this page are not exhaustive, but
they cover every wedged-SPDK incident on
ex9 in the May 2026 window. Three rules
cover the root causes for all of them:
- Every SPDK RPC has a deadline. Without one, a single stuck RPC pins a goroutine forever. Add the deadline at the internal/spdkclient/client.go layer; it fixes Bug 1 and Bug 2's escalation.
- vfio-user endpoint disconnect is a quiesce boundary, not a pause-and-wait. When the QEMU side goes away, the NVMf subsystem must force-clear its outstanding-IO counters. Until that is patched, the recovery is systemctl restart spdk.service.
- Use systemd ExecStopPost for crash signalling. Marker files are atomic, crash-safe, zero-latency, and don't depend on the DB. This is the design that makes Bug 3's recovery automatic instead of manual.
You have now finished the curriculum. The next time a target is wedged, start at the playbook flowchart. The next time a new bug class appears, write a retrospective like this one — “what happened, why it happened, how to detect it, how to prevent it, how to recover.” The retrospective is the artefact; the rules are what survive across incidents.