The QMP quit wedge.
You're on call. A VM has been issued a QMP quit.
The API state moves to STOPPED. QEMU is gone.
The vfio-user socket file is still on disk. The NVMf
subsystem is still in SPDK with the namespace present.
diskengine's detach loop fires
nvmf_subsystem_remove_ns. The RPC hangs.
A new VM is created; its RAID comes online; the vfio-user
listener is never created. The new VM is stuck in
CREATING. You systemctl restart
spdk.service and the new VM boots. You start
wondering what you missed.
This is the QMP quit wedge. It's not one bug. It's three bugs stacked: a missing quiesce, a lock held across a callback, and a state machine that has no timeout. Read this page.
- The reproduction (from the incident report)
- The teardown sequence on a clean QMP quit
- The threading-rule violation
- The lock-holding path: where the wedge actually lives
- Why "just add more shutdown RPCs" doesn't fix it
- The five things that must happen in order
- How diskengine's vhost_detach.go is supposed to handle this
- What the fix actually looks like
- Edge cases: multiple VMs, migration abort, kill -9, the forkbomb
The reproduction (from the incident report)
The reproduction comes from a 2026-05-22 incident report, captured in detail in SPDK_VFIO_QMP_QUIT_WEDGE_REPRO.md:1 . The reliable trigger, observed on a production baremetal host, is:
The key symptom is not that the victim cleanup fails. The victim cleanup is the trigger. The symptom is that every subsequent VM creation hangs in the same place: the RAID comes online, but the vfio-user listener is never created, so the QEMU launch hangs because the vfio-user cntrl socket never appears.
The teardown sequence on a clean QMP quit
The "happy path" teardown, when every step works, looks like this. Each step is a separate actor (QEMU, libvfio-user, SPDK NVMf, SPDK vfio_user) talking to the next over a Unix socket or a shared-memory region.
flowchart TB
  A["1. QEMU receives QMP quit / QEMU exits"] --> B["2. QEMU closes the vfio-user socket"]
  B --> C["3. libvfio-user on SPDK side sees EPIPE / ENOTCONN"]
  C --> D["4. SPDK accept_poller sees connection close / endpoint is_attached = false"]
  D --> E["5. SPDK detach_device callback runs on the endpoint thread"]
  E --> F["6. SPDK ctrlr_quiesce + spdk_nvmf_subsystem_pause"]
  F --> G["7. SPDK NVMf waits for io_outstanding == 0 on every poll group"]
  G --> H["8. io_outstanding drains (completions fire) → pause callback fires"]
  H --> I["9. SPDK removes the namespace / deletes the subsystem / tears down vfio-user-ctrlr"]
  I --> J["10. diskengine sees the NQN gone, removes the cleanup marker"]
fig. 1 Ten steps. Each step has a timeout or a wait. Steps 7-8 are where the wedge happens: the pause callback never fires.
The wedge happens at branches A or B. The pause
stores the callback and returns. The
sgroup now has
cb_fn = nvmf_rpc_remove_ns_paused and
cb_arg = ctx. The callback will fire
when io_outstanding decrements. That
happens in nvmf_poll_group_decrement_io_outstanding
on the namespace's completion path. If the
completion never fires (because the backing bdev
is in a stuck state, or because the guest's I/O
was lost when QEMU died, or because of the
forkbomb scenario), the callback never fires, the
pause never completes, the
remove_ns RPC blocks forever.
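The store-and-return behaviour can be modelled in a few lines. The sketch below is a toy in Go, not SPDK code: the sgroup type, its field names and the completeIO helper are assumptions made for illustration, chosen to mirror cb_fn and io_outstanding from the prose.

```go
package main

import "fmt"

// sgroup is a toy stand-in for a per-poll-group subsystem record.
// Field names mirror the prose; this is not SPDK's struct layout.
type sgroup struct {
	ioOutstanding int
	cbFn          func() // stored pause callback, nil when no pause is pending
}

// pause fires cb immediately when nothing is in flight. Otherwise it
// stores cb and returns, which is the path the wedge lives on.
func (g *sgroup) pause(cb func()) {
	if g.ioOutstanding == 0 {
		cb()
		return
	}
	g.cbFn = cb
}

// completeIO models the completion path: decrement the counter, then fire
// a pending pause callback once the counter reaches zero.
func (g *sgroup) completeIO() {
	g.ioOutstanding--
	if g.ioOutstanding == 0 && g.cbFn != nil {
		cb := g.cbFn
		g.cbFn = nil
		cb()
	}
}

func main() {
	g := &sgroup{ioOutstanding: 2}
	paused := false
	g.pause(func() { paused = true })

	g.completeIO()
	fmt.Println("after one completion, paused =", paused)

	// If this second completion never arrived, paused would stay false
	// forever and whoever waits on the callback would wait forever too.
	g.completeIO()
	fmt.Println("after both completions, paused =", paused)
}
```

Nothing in pause itself can time out: the only code that ever runs the stored callback is the completion path, so a lost completion leaves the caller waiting with no error and no log line.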
The threading-rule violation
The pause callback fires from inside the bdev's
completion path. The bdev completion path runs on
the bdev's spdk_thread — which is the same
spdk_thread that the vfio-user transport's SQ
poller is running on, which is the same
spdk_thread that spdk_nvmf_subsystem_pause
was called on (because the pause was called from
the vfio_user_dev_quiesce_cb which
is on the endpoint's spdk_thread).
Wait — that's the same thread. So why doesn't the
completion run? Because the completion can't
run. The bdev_io is in a wait state. Specifically,
the bdev_io has been submitted, the bdev module
(e.g. NVMe) has it in a queue, and the NVMe
completion is expected via the doorbell. But the
guest (QEMU) just died. The doorbell never fires.
The NVMe completion never arrives. The bdev_io
sits in the queue forever. The
io_outstanding counter never
decrements. The pause callback never fires.
The lock-holding path: where the wedge actually lives
The "wedge" is not a deadlock. There is no
cycle of locks. The wedge is a wait on a
counter that never decrements. The counter
is ns_info->io_outstanding on
the namespace's poll group. The decrement
happens in the bdev_io completion path. The
completion path is, ultimately, the doorbell
ISR / poller for the vfio-user transport.
The vhost-side stop-poller times out after
SPDK_VHOST_SESSION_STOP_RETRY_TIMEOUT_IN_SEC
seconds (typically 30 s or so). When it times
out, it calls
vhost_user_session_stop_done(vsession, -ETIMEDOUT),
unregisters itself, and the kernel vhost code
thinks the session is done. The vsession is
left in a partially-torn-down state. The
bdev_ios in flight at timeout are leaked (they
never get to complete; their references are
never released).
On the vfio-user side, the equivalent is the
vfio_user_dev_quiesce_cb at
lib/nvmf/vfio_user.c:3223 .
It does not have the same stop-poller
pattern. It uses the NVMf subsystem's
pause/resume state machine. The state machine
is supposed to fire the callback when
io_outstanding == 0. The wedge
happens when io_outstanding
never reaches zero.
Tracing the actual lock chain
When the JSON-RPC handler is blocking on the
remove_ns response, what's the
state of the system? The handler is on the
app_thread (or whichever reactor handled the
JSON-RPC connection). The pause is on the
vfio-user endpoint's spdk_thread. The
pause's callback is supposed to fire on the
same thread that owns the sgroup. The sgroup
is owned by the poll group, which is on the
reactor that polls the NVMf subsystem.
The bdev_io completion path is
spdk_nvmf_qpair_process_completions
which calls
nvmf_poll_group_decrement_io_outstanding
which decrements ns_info->io_outstanding
and, if the pause is pending, fires the
callback. The decrement only happens after the
bdev_io is completed. The bdev_io is completed
in the bdev's submit path's completion callback.
For the NVMe-oF / vfio-user transport, the
completion comes from the vfio-user transport's
completion path, which is triggered by a write
to the doorbell region from the guest.
The guest died. No doorbell write. No
completion. No decrement. No callback. The
remove_ns request is still
pending: its response is sent from the pause
callback, and the callback never runs. The
reactor that accepted the request keeps
polling other work;
that's why spdk_top THREADS
shows the reactor as healthy. The reactor is
not wedged; it is carrying an RPC request
whose response can never be sent.
sequenceDiagram
  participant QEMU
  participant LIB as libvfio-user
  participant SPDK as SPDK reactor
  participant BM as bdev layer
  participant RPC as JSON-RPC handler
  participant DE as diskengine
  QEMU->>LIB: QMP quit / connection close
  LIB-->>SPDK: ENOTCONN / EPIPE
  SPDK->>SPDK: vfio_user_dev_quiesce_cb
  SPDK->>SPDK: spdk_nvmf_subsystem_pause
  Note over SPDK: io_outstanding > 0
  SPDK-->>SPDK: store cb_fn, return
  RPC->>SPDK: nvmf_subsystem_remove_ns
  Note over RPC: blocks on JSON-RPC response
  DE->>RPC: remove_ns (via socket)
  Note over DE: blocks on JSON-RPC response
  DE->>SPDK: 30s later, attach for new VM
  Note over DE: blocks too (all reactors busy on remove_ns)
  BM-->>SPDK: (no completion, guest is dead)
fig. 2 The six actors. The wedge is a missing arrow: the bdev layer never sends the completion. Every other actor is doing exactly what its code says to do.
Why "just add more shutdown RPCs" doesn't fix it
The first instinct when you see this is "add an
RPC that forces the subsystem to pause." The
SPDK side has spdk_nvmf_subsystem_pause
which does the dance above, and
nvmf_subsystem_set_state which
sets the state directly. The vfio-user side has
vfio_user_dev_quiesce_cb which
is the entry point.
None of these are the right knob. The root of the wedge is in the bdev layer, not in the NVMf layer or the vfio-user layer. The bdev_io is stuck on a bdev that doesn't know it's stuck. The bdev module (NVMe, AIO, malloc, ...) submitted the I/O and is waiting for a completion that the underlying device will never deliver. Until the bdev_io is force-completed (with an error), the counter never decrements and the pause never fires.
The five things that must happen in order
The teardown sequence, done correctly, is five distinct steps with five different actors. Each step has a deadline. If any step exceeds its deadline, the system falls back to a degraded cleanup. The order matters: earlier steps unlock later steps.
1. VM shutdown (QEMU side). The guest kernel has crashed or shut down. QEMU receives the signal. QEMU begins its shutdown sequence. QEMU's vfio-user-pci device closes the connection. The libvfio-user on the SPDK side sees the close.
2. libvfio-user connection close (SPDK side). vfu_run_ctx returns -1 with errno = ENOTCONN. The endpoint's tgt_vfu_ctx_poller (at lib/vfu_tgt/tgt_endpoint.c:127) unregisters itself, calls endpoint->ops.detach_device(endpoint), and sets endpoint->is_attached = false. The accept poller starts listening for a new client.
3. Backend device detach (NVMf side). The NVMf vfio-user transport's detach callback fires. The vfio_user-ctrlr is marked as detached. The NVMf subsystem is paused via spdk_nvmf_subsystem_pause. The pause waits for io_outstanding == 0 on every poll group.
4. Drain in-flight I/O (bdev side). The in-flight bdev_ios complete (via the bdev module's completion path, which is on the bdev's spdk_thread). io_outstanding decrements. The pause's callback fires. The NVMf subsystem moves to PAUSED state. The detach callback returns.
5. diskengine detach (Go side). diskengine's startVfioUserDetachLoop:20 runs the per-VM cleanup: nvmf_subsystem_remove_ns (now fast, because the subsystem is already paused), nvmf_delete_subsystem, rmdir the /var/diskengine/vfio-user/<vmID> directory. The cleanup marker is removed.
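The "ordered steps, each with a deadline, with a fallback on the first miss" idea can be sketched as a small Go driver. Everything here is hypothetical: the step type, runTeardown, waitFor and the 50 ms deadline are illustration only and are not taken from diskengine or SPDK. The demo makes the drain step hang, as in the wedge.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// step is one teardown stage with its own deadline.
type step struct {
	name     string
	deadline time.Duration
	run      func(ctx context.Context) error
}

// runTeardown executes steps strictly in order. It stops at the first step
// that fails or exceeds its deadline and returns that step's name so the
// caller can fall back to a degraded cleanup. An empty name means success.
func runTeardown(parent context.Context, steps []step) (string, error) {
	for _, s := range steps {
		ctx, cancel := context.WithTimeout(parent, s.deadline)
		err := s.run(ctx)
		cancel()
		if err != nil {
			return s.name, err
		}
	}
	return "", nil
}

// waitFor returns a step body that finishes when done is closed, or gives
// up with the context's error when the step deadline passes first.
func waitFor(done <-chan struct{}) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-done:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo models the wedge: the first three steps finish, the drain never does.
func runDemo() (string, error) {
	finished := make(chan struct{})
	close(finished)
	never := make(chan struct{}) // a completion that is never delivered
	d := 50 * time.Millisecond   // illustrative deadline, not a recommended value
	return runTeardown(context.Background(), []step{
		{"VM shutdown", d, waitFor(finished)},
		{"libvfio-user connection close", d, waitFor(finished)},
		{"backend device detach", d, waitFor(finished)},
		{"drain in-flight I/O", d, waitFor(never)},
		{"diskengine detach", d, waitFor(finished)},
	})
}

func main() {
	if name, err := runDemo(); err != nil {
		fmt.Printf("step %q missed its deadline (%v); switching to degraded cleanup\n", name, err)
	}
}
```

Returning the name of the failed step, rather than only an error, lets the caller pick a fallback that matches how far the teardown got.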
How diskengine's vhost_detach.go is supposed to handle this
diskengine's detach logic is in startVhostDetachLoop:29 . The vhost-blk path is the simpler one (it's the one most likely to be re-enabled for testing). The vfio-user path is in startVfioUserDetachLoop:20 .
The diskengine-side fix is to put a deadline
on the JSON-RPC call. The Go pattern,
familiar from HTTP clients, is to create
ctx, cancel := context.WithTimeout(parent, 30*time.Second)
and pass ctx to the RPC client.
The spdkclient doesn't yet respect a
per-call deadline, but the JSON-RPC framework
does — it's just a matter of plumbing.
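A minimal sketch of that plumbing, assuming the client talks JSON-RPC 2.0 over a net.Conn. The call function and the request and response types are hypothetical and are not the spdkclient API; the NQN is a placeholder. The demo uses an in-memory pipe whose far end reads the request and never answers, which is how a wedged remove_ns looks from the client side.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

type rpcRequest struct {
	Version string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  any    `json:"params,omitempty"`
}

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type rpcResponse struct {
	ID     int             `json:"id"`
	Result json.RawMessage `json:"result"`
	Error  *rpcError       `json:"error"`
}

// call sends one JSON-RPC request and waits for the reply. The context's
// deadline is applied to the connection, so a reply that never comes turns
// into a timeout error instead of an indefinite block. Cancellation without
// a deadline is not handled in this sketch.
func call(ctx context.Context, conn net.Conn, method string, params any) (json.RawMessage, error) {
	if dl, ok := ctx.Deadline(); ok {
		if err := conn.SetDeadline(dl); err != nil {
			return nil, err
		}
		defer conn.SetDeadline(time.Time{}) // clear the deadline for the next call
	}
	req := rpcRequest{Version: "2.0", ID: 1, Method: method, Params: params}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return nil, err
	}
	var resp rpcResponse
	if err := json.NewDecoder(conn).Decode(&resp); err != nil {
		return nil, err
	}
	if resp.Error != nil {
		return nil, fmt.Errorf("rpc %s: %s (code %d)", method, resp.Error.Message, resp.Error.Code)
	}
	return resp.Result, nil
}

// demoWedgedCall talks to a fake server that reads the request and never
// answers. It reports whether the call ended with a deadline error.
func demoWedgedCall() bool {
	cli, srv := net.Pipe()
	defer cli.Close()
	defer srv.Close()
	go func() {
		var req rpcRequest
		_ = json.NewDecoder(srv).Decode(&req) // consume the request, send nothing back
	}()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	_, err := call(ctx, cli, "nvmf_subsystem_remove_ns",
		map[string]any{"nqn": "nqn.example:vm", "nsid": 1}) // placeholder values
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", demoWedgedCall())
}
```

A connection that timed out this way is best closed rather than reused, since a late reply to the abandoned request could otherwise be read as the answer to the next call. The deadline only frees the caller; the request is still pending on the server side.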
What the fix actually looks like
The fix has three parts: a timeout on the bdev_io, a deadline on the JSON-RPC call, and a force-cleanup path for the wedged subsystem.
1. Per-bdev_io timeout in the vfio-user transport
The vfio-user transport's submission path needs a
per-bdev_io timeout. When the timeout fires, the
bdev_io is force-completed with an error. The
completion decrements io_outstanding.
The pause's callback fires. The teardown proceeds.
The natural place to put this is in the SQ poller
of the vfio-user transport — a per-bdev_io
timestamp recorded at submit, checked in the
poller. If the bdev_io is older than the timeout,
the poller calls
spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_FAILED).
This unblocks the NVMf pause and the wedge
resolves.
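The timestamp-and-sweep bookkeeping can be modelled outside SPDK. The Go sketch below is only a model of the idea: the tracker type and its submit, complete and sweep methods are hypothetical names, and the real change would be C code inside the transport. Time is passed in explicitly so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// tracker models the bookkeeping described above: one submit timestamp per
// in-flight I/O, and a callback that fires whenever the in-flight set drains.
type tracker struct {
	timeout   time.Duration
	inflight  map[int]time.Time // I/O id -> submit time
	failed    []int             // ids that were completed with an error
	onDrained func()            // stand-in for the pending pause callback
}

func newTracker(timeout time.Duration, onDrained func()) *tracker {
	return &tracker{timeout: timeout, inflight: map[int]time.Time{}, onDrained: onDrained}
}

func (t *tracker) submit(id int, now time.Time) { t.inflight[id] = now }

// complete retires one I/O and fires onDrained when nothing is left in flight.
func (t *tracker) complete(id int, ok bool) {
	if _, found := t.inflight[id]; !found {
		return // already completed or force-failed; ignore late completions
	}
	delete(t.inflight, id)
	if !ok {
		t.failed = append(t.failed, id)
	}
	if len(t.inflight) == 0 && t.onDrained != nil {
		t.onDrained()
	}
}

// sweep is what the poller would call on each pass: any I/O older than the
// timeout is completed with an error. It returns how many were expired.
func (t *tracker) sweep(now time.Time) int {
	var expired []int
	for id, at := range t.inflight {
		if now.Sub(at) > t.timeout {
			expired = append(expired, id)
		}
	}
	for _, id := range expired {
		t.complete(id, false)
	}
	return len(expired)
}

// demoSweep submits three I/Os, completes one, and lets the other two expire.
func demoSweep() (int, bool) {
	drained := false
	t := newTracker(5*time.Second, func() { drained = true })
	start := time.Unix(0, 0)
	t.submit(1, start)
	t.submit(2, start)
	t.submit(3, start)
	t.complete(1, true)                      // one I/O completes normally
	early := t.sweep(start.Add(time.Second)) // nothing is old enough yet
	late := t.sweep(start.Add(10 * time.Second))
	return early + late, drained
}

func main() {
	expired, drained := demoSweep()
	fmt.Printf("expired=%d drained=%v\n", expired, drained)
}
```

The check in complete for an id that is no longer tracked matters in practice: once an I/O has been force-failed, a real completion arriving later must not decrement the counter a second time.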
2. Deadline on the JSON-RPC call from diskengine
The diskengine spdkclient needs a per-call
deadline. The cleanest API change is to add a
WithDeadline(time.Duration) option to
the
Client.Call:43 method. The
default deadline is something like 30 seconds. The
caller can override per-call (e.g. 5 seconds for
hot-loop RPCs, 2 minutes for bulk operations).
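One way the option could look, as a sketch only. The Client, Transport and CallOption types and the Call signature below are assumptions for illustration; the real Client.Call signature is not shown in this page. The transport is injected so the example runs without SPDK.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type callConfig struct{ deadline time.Duration }

// CallOption adjusts a single Call.
type CallOption func(*callConfig)

// WithDeadline overrides the client's default per-call deadline.
func WithDeadline(d time.Duration) CallOption {
	return func(c *callConfig) { c.deadline = d }
}

// Transport performs one RPC and must return promptly once ctx is done.
type Transport func(ctx context.Context, method string, params any) ([]byte, error)

// Client is a hypothetical stand-in for the spdkclient.
type Client struct {
	DefaultDeadline time.Duration
	Send            Transport
}

// config applies the options on top of the client default.
func (c *Client) config(opts []CallOption) callConfig {
	cfg := callConfig{deadline: c.DefaultDeadline}
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

// Call runs one RPC under the effective deadline.
func (c *Client) Call(method string, params any, opts ...CallOption) ([]byte, error) {
	cfg := c.config(opts)
	ctx, cancel := context.WithTimeout(context.Background(), cfg.deadline)
	defer cancel()
	return c.Send(ctx, method, params)
}

// demoDeadlinesMillis returns the default and the overridden deadline in ms.
func demoDeadlinesMillis() (int64, int64) {
	c := &Client{DefaultDeadline: 30 * time.Second}
	def := c.config(nil).deadline
	hot := c.config([]CallOption{WithDeadline(5 * time.Second)}).deadline
	return def.Milliseconds(), hot.Milliseconds()
}

// demoWedged calls through a transport that never answers on its own.
func demoWedged() bool {
	wedged := func(ctx context.Context, _ string, _ any) ([]byte, error) {
		<-ctx.Done()
		return nil, ctx.Err()
	}
	c := &Client{DefaultDeadline: 30 * time.Second, Send: wedged}
	_, err := c.Call("nvmf_subsystem_remove_ns", nil, WithDeadline(20*time.Millisecond))
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	def, hot := demoDeadlinesMillis()
	fmt.Printf("default=%dms override=%dms wedged call timed out=%v\n", def, hot, demoWedged())
}
```

A variadic option keeps existing call sites compiling unchanged, which is the main reason to prefer it over adding a required parameter.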
3. Force-cleanup for a wedged subsystem
The third part is a "force" RPC for the wedged
subsystem. The RPC tells SPDK to set the
subsystem's state to PAUSED directly,
bypassing the pause callback machinery. The
subsequent remove_ns and
delete_subsystem then run on the
already-paused subsystem and complete
immediately.
The RPC handler has to be careful: a force-cleanup
of a subsystem with in-flight I/O is dangerous
(the bdev_ios that were "in flight" still hold
references to the bdev). The force-cleanup has
to also force-complete the in-flight bdev_ios,
which means iterating the per-poll-group
sgroup's io_outstanding count and
finding the bdev_ios that match the count.
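The ordering constraint (fail the in-flight I/O first, then flip the state) can be shown with a toy state machine. This Go model is hypothetical throughout: the subsystem type, forcePause and removeNS are illustration names and do not correspond to an existing SPDK RPC.

```go
package main

import (
	"errors"
	"fmt"
)

type state int

const (
	active state = iota
	pausing
	paused
)

// subsystem is a toy model, not SPDK's structure.
type subsystem struct {
	state      state
	inflight   map[int]bool // ids of I/Os still counted as outstanding
	namespaces map[int]bool
	failedIOs  []int // ids that the force path completed with an error
}

var errNotPaused = errors.New("subsystem is not paused")

// removeNS only succeeds on a paused subsystem, which is why it is fast
// once the pause has completed and stuck while the pause is pending.
func (s *subsystem) removeNS(nsid int) error {
	if s.state != paused {
		return errNotPaused
	}
	delete(s.namespaces, nsid)
	return nil
}

// forcePause fails every in-flight I/O first, so that nothing still holds
// a reference, and only then sets the state to paused. It returns the
// number of I/Os it had to fail.
func (s *subsystem) forcePause() int {
	n := 0
	for id := range s.inflight {
		s.failedIOs = append(s.failedIOs, id)
		delete(s.inflight, id)
		n++
	}
	s.state = paused
	return n
}

// demoForce starts from a wedged pause with three I/Os in flight.
func demoForce() (blockedBefore bool, failed int, namespacesLeft int) {
	s := &subsystem{
		state:      pausing,
		inflight:   map[int]bool{10: true, 11: true, 12: true},
		namespaces: map[int]bool{1: true},
	}
	blockedBefore = errors.Is(s.removeNS(1), errNotPaused)
	failed = s.forcePause()
	_ = s.removeNS(1) // succeeds now that the subsystem is paused
	return blockedBefore, failed, len(s.namespaces)
}

func main() {
	blocked, failed, left := demoForce()
	fmt.Printf("blocked before force=%v, I/Os failed=%d, namespaces left=%d\n", blocked, failed, left)
}
```

Setting the state before the in-flight set is empty would let remove_ns run while I/O still references the namespace, which is the danger described above.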
Edge cases: what else can break this
1. Multiple VMs shutting down simultaneously
Each VM shutdown is its own
remove_ns RPC. The RPCs are on
different goroutines in diskengine, but they
all hit the same SPDK reactor (or different
reactors, depending on which JSON-RPC socket
gets the request). The pause for one VM
shouldn't block the pause for another VM,
because the NVMf subsystems are different.
But the bdev layer is shared (the underlying
NVMe controllers are shared). If the bdev
layer is the source of the wedge, multiple
VM shutdowns all hit the same wedge. Each
remove_ns RPC stays pending on
completions that never arrive, and every new
remove_ns RPC joins the pile.
2. Live migration abort
A live migration is in flight. The source QEMU
has paused the guest. The destination QEMU
hasn't finished receiving. The destination
QEMU fails (out of disk space, network drop).
The source QEMU resumes the guest. The
vfio-user-pci device on the source side is
still attached. From SPDK's perspective,
nothing changed — the endpoint is still
attached, the ctrlr is still RUNNING. The
abort is invisible to SPDK. If the abort
happens during a teardown sequence, the
teardown may not complete (e.g. the
detach_device callback may not
fire because the device is still attached).
The wedge is the same as the QMP quit case,
but with the additional complication that
the guest is still running and may issue new
I/O while the teardown is half-done.
3. kill -9 on QEMU
kill -9 is the path that does
not wedge (per the incident report).
The reason: kill -9 doesn't go
through QEMU's normal shutdown path. The
QEMU process dies immediately. The libvfio-user
on the SPDK side sees the connection drop (the
kernel closes the socket). The
tgt_vfu_ctx_poller sees
ENOTCONN, the
detach_device callback fires. The
detach path does not go through the
vfio_user_dev_quiesce_cb — it goes through the
immediate-detach path. The
vfio_user_device_reset callback
is called with VFU_RESET_LOST_CONN,
which unregisters the interrupt and skips the
quiesce. The detach completes immediately. The
in-flight bdev_ios are force-completed (by the
bdev's NVMe driver's timeout, or by the
bdev_io's own timeout once part 1 of the fix
lands). No wedge.
4. The forkbomb scenario
The incident report's worst case: a guest
fork-bombs (thousands of processes), each
process is doing direct I/O to the root disk,
the API stop request comes in, QEMU is
unkillable because the guest kernel is busy,
QMP quit is sent manually. The
forkbomb itself doesn't wedge SPDK — the
forkbomb wedges the guest kernel. The QMP
quit then takes down QEMU. The
vfio-user connection drops. The teardown path
runs. The wedge is the same as the standard
QMP quit case. The forkbomb's role is to
guarantee that the bdev layer has a high
io_outstanding count at the
moment of teardown — many in-flight I/Os
that have not yet been serviced by the bdev.
The pause is more likely to take the
"wait for I/O" path because more I/Os are in
flight. The wait is more likely to exceed
whatever timeout exists. The wedge is more
likely to form.
5. nvmf_delete_subsystem while remove_ns is still pending
diskengine's detach loop, per the source, does
remove_ns and then
delete_subsystem in sequence. If
remove_ns is the wedged one,
delete_subsystem never runs. The
subsystem stays around. New attach work for
the same NQN conflicts with the leftover
subsystem. The diskengine logs the conflict as
"namespace raid_2591 still present in
nqn.2026-02.io.excloud:vm:11263; deferring" and
keeps retrying. Every retry is a
remove_ns that hits the same
wedge. The retry loop is unbounded. The only
fix is to bounce SPDK (the
systemctl restart spdk.service
that the incident report describes).
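A bounded variant of that retry loop might look like the sketch below. The retryBounded helper, the attempt count and the backoff values are assumptions for illustration and are not diskengine's current code; maxAttempts is assumed to be at least 1.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errStillPresent stands in for the "still present; deferring" condition.
var errStillPresent = errors.New("namespace still present; deferring")

// retryBounded calls op up to maxAttempts times, doubling the wait between
// attempts, and stops early if ctx is done. The last error is wrapped so the
// caller can escalate instead of looping forever.
func retryBounded(ctx context.Context, maxAttempts int, wait time.Duration, op func() error) error {
	var last error
	for i := 1; i <= maxAttempts; i++ {
		if last = op(); last == nil {
			return nil
		}
		if i == maxAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(wait):
			wait *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, last)
}

// demoRetry runs an operation that always fails and counts the attempts.
func demoRetry() (int, bool) {
	calls := 0
	err := retryBounded(context.Background(), 3, time.Millisecond, func() error {
		calls++
		return errStillPresent
	})
	return calls, errors.Is(err, errStillPresent)
}

func main() {
	calls, gaveUp := demoRetry()
	fmt.Printf("attempts=%d gave up with the original error=%v\n", calls, gaveUp)
}
```

Bounding the loop does not clear the wedge. It turns a silent, endless retry into an error that can page someone or trigger the force-cleanup path.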
6. The "innocent global" race
The
spdk_nvmf_subsystem_state_changes
TAILQ on the subsystem is protected by
subsystem->mutex. The pause
appends a state-change context to the TAILQ.
The resume pops the head. If a second pause
request arrives while the first is in flight
(e.g. from a second remove_ns
call for a different namespace), the second
pause is queued behind the first. The first
pause's callback fires when
io_outstanding drains. The second
pause's callback never fires because the first
pause's removal of the namespace already
dropped the io_outstanding for the second
namespace. The state-change context is
orphaned. The TAILQ is corrupted. Subsequent
pause requests hang. The fix is to drain the
TAILQ on every callback, regardless of which
pause was satisfied.
7. The cleanup-marker race
The diskengine-side cleanup marker at
/var/vmengine/vms/<vmID>/DISKENGINE_CLEANUP
is set on API stop and removed on successful
detach. If the detach hangs, the marker stays.
The vmengine side checks for the marker before
re-launching QEMU. If the marker is present,
vmengine doesn't launch QEMU. The VM is stuck
in STOPPED until diskengine
cleans up. If diskengine is stuck on the
remove_ns, the marker never goes away, the VM
never relaunches, the user has to manually
remove the marker file. This is the
"secondary" wedge that the incident report
describes: the primary wedge is in SPDK; the
secondary is in diskengine's state
reconciliation.
Why it matters
The QMP quit wedge is the canonical example of why the threading rules exist. Every step of the teardown is a separate actor that assumes the previous step will eventually complete. The pause assumes the bdev_io will eventually complete. The bdev_io assumes the bdev will eventually complete. The bdev assumes the underlying device will eventually complete. The underlying device assumes the guest will eventually issue a completion. The guest is dead. The chain of assumptions breaks.
The fix is to make the chain robust to termination. Every step needs a deadline. Every wait needs a timeout. Every state transition needs a fallback. The three-part fix (per-bdev_io timeout, JSON-RPC deadline, force-cleanup RPC) is the minimal set of changes to make the teardown sequence complete, even when one of the actors misbehaves.
If you only have time to fix one thing, fix the per-bdev_io timeout. That's the bottom of the chain. Everything else can hang; the timeout will eventually unstick it.