Layer 7 · vhost / virtio / VFIO-user

The QMP quit wedge.

You're on call. A VM has been issued a QMP quit. The API state moves to STOPPED. QEMU is gone. The vfio-user socket file is still on disk. The NVMf subsystem is still in SPDK with the namespace present. diskengine's detach loop fires nvmf_subsystem_remove_ns. The RPC hangs. A new VM is created; its RAID comes online; the vfio-user listener is never created. The new VM is stuck CREATING. You systemctl restart spdk.service and the new VM boots. You start wondering what you missed.

This is the QMP quit wedge. It's not one bug. It's three bugs stacked: a missing quiesce, a lock held across a callback, and a state machine that has no timeout. Read this page.

~20 min read · 3 diagrams · prerequisites: 7.3 · 2.4
On this page
  1. The reproduction (from the incident report)
  2. The teardown sequence on a clean QMP quit
  3. The threading-rule violation
  4. The lock-holding path: where the wedge actually lives
  5. Why "just add more shutdown RPCs" doesn't fix it
  6. The five things that must happen in order
  7. How diskengine's vhost_detach.go is supposed to handle this
  8. What the fix actually looks like
  9. Edge cases: multiple VMs, migration abort, kill -9, the forkbomb

The reproduction (from the incident report)

The reproduction comes from a 2026-05-22 incident report, captured in detail in SPDK_VFIO_QMP_QUIT_WEDGE_REPRO.md:1. The reliable trigger, observed on a production baremetal host, is:

STEP 01
Run active root-disk direct IO in a guest VM
8 direct-IO loops against /dev/nvme0n1
STEP 02
Panic the guest with sysrq crash
echo c > /proc/sysrq-trigger
STEP 03
Send QMP quit to the guest's monitor socket
{"execute":"quit"} → SHUTDOWN host-qmp-quit
STEP 04
Watch diskengine try to detach the VM's NVMf subsystem
nvmf_subsystem_remove_ns RPC hangs
STEP 05
Create a new VM — the new RAID is online but the vfio-user export never appears
vmengine repeats 'EBS device not ready'

The key symptom is not that the victim cleanup fails. The victim cleanup is the trigger. The symptom is that every subsequent VM creation hangs in the same place: the new VM's RAID comes online, but the vfio-user export is never created, so the QEMU launch hangs waiting for a vfio-user cntrl socket that never appears.

The teardown sequence on a clean QMP quit

The "happy path" teardown, when every step works, looks like this. Each step is a separate actor (QEMU, libvfio-user, SPDK NVMf, SPDK vfio_user) talking to the next over a Unix socket or a shared-memory region.

flowchart TB
A["1. QEMU receives QMP quit / QEMU exits"] --> B["2. QEMU closes the vfio-user socket"]
B --> C["3. libvfio-user on SPDK side sees EPIPE / ENOTCONN"]
C --> D["4. SPDK accept_poller sees connection close / endpoint is_attached = false"]
D --> E["5. SPDK detach_device callback runs on the endpoint thread"]
E --> F["6. SPDK ctrlr_quiesce + spdk_nvmf_subsystem_pause"]
F --> G["7. SPDK NVMf waits for io_outstanding == 0 on every poll group"]
G --> H["8. io_outstanding drains (completions fire) → pause callback fires"]
H --> I["9. SPDK removes the namespace / deletes the subsystem / tears down vfio-user-ctrlr"]
I --> J["10. diskengine sees the NQN gone, removes the cleanup marker"]
fig. 1: the clean teardown sequence

fig. 1   Ten steps. Each step has a timeout or a wait. Steps 7-8 are where the wedge happens: the pause callback never fires.

The wedge happens at steps 7 and 8. When the pause finds I/O still outstanding, it stores the callback and returns. The sgroup now has cb_fn = nvmf_rpc_remove_ns_paused and cb_arg = ctx. The callback will fire when io_outstanding decrements to zero. That decrement happens in nvmf_poll_group_decrement_io_outstanding on the namespace's completion path. If the completion never fires (because the backing bdev is in a stuck state, because the guest's I/O was lost when QEMU died, or because of the forkbomb scenario), the callback never fires, the pause never completes, and the remove_ns RPC blocks forever.
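The store-and-wait behaviour described above can be modelled in a few lines of Go. This is a toy sketch, not SPDK code: the sgroup type, its fields, and the pause and complete methods are hypothetical stand-ins that only mirror the names used in the prose.

```go
package main

import "fmt"

// sgroup is a hypothetical stand-in for per-poll-group subsystem state.
type sgroup struct {
	ioOutstanding int
	pausing       bool
	cbFn          func()
}

// pause fires cb at once when nothing is in flight; otherwise it stores
// cb and returns, which is the case that can wedge.
func (g *sgroup) pause(cb func()) {
	if g.ioOutstanding == 0 {
		cb()
		return
	}
	g.pausing = true
	g.cbFn = cb
}

// complete models one I/O completion: decrement the counter, then fire
// the stored callback if this was the last outstanding I/O.
func (g *sgroup) complete() {
	g.ioOutstanding--
	if g.pausing && g.ioOutstanding == 0 {
		g.pausing = false
		cb := g.cbFn
		g.cbFn = nil
		cb()
	}
}

func main() {
	paused := false
	g := &sgroup{ioOutstanding: 2}
	g.pause(func() { paused = true })

	g.complete()
	fmt.Println("after 1 of 2 completions, paused =", paused) // false
	g.complete()
	fmt.Println("after 2 of 2 completions, paused =", paused) // true
}
```

If the second complete() call never happens, paused stays false forever. Nothing in the model ever times the wait out, which is the shape of the wedge.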

The threading-rule violation

The pause callback fires from inside the bdev's completion path. The bdev completion path runs on the bdev's spdk_thread, which is the same spdk_thread that the vfio-user transport's SQ poller runs on, and the same spdk_thread that spdk_nvmf_subsystem_pause was called on (the pause is called from vfio_user_dev_quiesce_cb, which runs on the endpoint's spdk_thread).

Wait — that's the same thread. So why doesn't the completion run? Because the completion can't run. The bdev_io is in a wait state. Specifically, the bdev_io has been submitted, the bdev module (e.g. NVMe) has it in a queue, and the NVMe completion is expected via the doorbell. But the guest (QEMU) just died. The doorbell never fires. The NVMe completion never arrives. The bdev_io sits in the queue forever. The io_outstanding counter never decrements. The pause callback never fires.

The lock-holding path: where the wedge actually lives

The "wedge" is not a deadlock. There is no cycle of locks. The wedge is a wait on a counter that never decrements. The counter is ns_info->io_outstanding on the namespace's poll group. The decrement happens in the bdev_io completion path. The completion path is, ultimately, the doorbell ISR / poller for the vfio-user transport.

The vhost-side stop-poller times out after SPDK_VHOST_SESSION_STOP_RETRY_TIMEOUT_IN_SEC seconds (typically 30 s or so). When it times out, it calls vhost_user_session_stop_done(vsession, -ETIMEDOUT), unregisters itself, and the kernel vhost code thinks the session is done. The vsession is left in a partially-torn-down state. The bdev_ios in flight at timeout are leaked (they never get to complete; their references are never released).

On the vfio-user side, the equivalent is the vfio_user_dev_quiesce_cb at lib/nvmf/vfio_user.c:3223. It does not have the same stop-poller pattern. It uses the NVMf subsystem's pause/resume state machine. The state machine is supposed to fire the callback when io_outstanding == 0. The wedge happens when io_outstanding never reaches zero.

Tracing the actual lock chain

When the JSON-RPC handler is blocking on the remove_ns response, what's the state of the system? The handler is on the app_thread (or whichever reactor handled the JSON-RPC connection). The pause is on the vfio-user endpoint's spdk_thread. The pause's callback is supposed to fire on the same thread that owns the sgroup. The sgroup is owned by the poll group, which is on the reactor that polls the NVMf subsystem.

The bdev_io completion path is spdk_nvmf_qpair_process_completions which calls nvmf_poll_group_decrement_io_outstanding which decrements ns_info->io_outstanding and, if the pause is pending, fires the callback. The decrement only happens after the bdev_io is completed. The bdev_io is completed in the bdev's submit path's completion callback. For the NVMe-oF / vfio-user transport, the completion comes from the vfio-user transport's completion path, which is triggered by a write to the doorbell region from the guest.

The guest died. No doorbell write. No completion. No decrement. No callback. The remove_ns request is still pending, so no JSON-RPC response is ever written. The reactor that accepted the request keeps polling other work; that's why spdk_top THREADS shows the reactor as healthy. The reactor is not wedged; the request is parked on a callback that never fires.

sequenceDiagram
participant QEMU
participant LIB as libvfio-user
participant SPDK as SPDK reactor
participant BM as bdev layer
participant RPC as JSON-RPC handler
participant DE as diskengine

QEMU->>LIB: QMP quit / connection close
LIB-->>SPDK: ENOTCONN / EPIPE
SPDK->>SPDK: vfio_user_dev_quiesce_cb
SPDK->>SPDK: spdk_nvmf_subsystem_pause
Note over SPDK: io_outstanding > 0
SPDK-->>SPDK: store cb_fn, return
RPC->>SPDK: nvmf_subsystem_remove_ns
Note over RPC: blocks on JSON-RPC response
DE->>RPC: remove_ns (via socket)
Note over DE: blocks on JSON-RPC response
DE->>SPDK: 30s later, attach for new VM
Note over DE: blocks too (all reactors busy on remove_ns)
BM-->>SPDK: (no completion, guest is dead)
fig. 2: the wedge, in sequence

fig. 2   The six actors. The wedge is a missing arrow: the bdev layer never sends the completion. Every other actor is doing exactly what its code says to do.

Why "just add more shutdown RPCs" doesn't fix it

The first instinct when you see this is "add an RPC that forces the subsystem to pause." The SPDK side has spdk_nvmf_subsystem_pause which does the dance above, and nvmf_subsystem_set_state which sets the state directly. The vfio-user side has vfio_user_dev_quiesce_cb which is the entry point.

None of these are the right knob. The root of the wedge is in the bdev layer, not in the NVMf layer or the vfio-user layer. The bdev_io is stuck on a bdev that doesn't know it's stuck. The bdev module (NVMe, AIO, malloc, ...) submitted the I/O and is waiting for a completion that the underlying device will never deliver. Until the bdev_io is force-completed (with an error), the counter never decrements and the pause never fires.

The five things that must happen in order

The teardown sequence, done correctly, is five distinct steps with five different actors. Each step has a deadline. If any step exceeds its deadline, the system falls back to a degraded cleanup. The order matters: earlier steps unlock later steps.

  1. VM shutdown (QEMU side). The guest kernel has crashed or shut down. QEMU receives the signal. QEMU begins its shutdown sequence. QEMU's vfio-user-pci device closes the connection. The libvfio-user on the SPDK side sees the close.

  2. libvfio-user connection close (SPDK side). vfu_run_ctx returns -1 with errno = ENOTCONN. The endpoint's tgt_vfu_ctx_poller (at lib/vfu_tgt/tgt_endpoint.c:127) unregisters itself, calls endpoint->ops.detach_device(endpoint), and sets endpoint->is_attached = false. The accept poller starts listening for a new client.

  3. Backend device detach (NVMf side). The NVMf vfio-user transport's detach callback fires. The vfio_user-ctrlr is marked as detached. The NVMf subsystem is paused via spdk_nvmf_subsystem_pause. The pause waits for io_outstanding == 0 on every poll group.

  4. Drain in-flight I/O (bdev side). The in-flight bdev_ios complete (via the bdev module's completion path, which is on the bdev's spdk_thread). io_outstanding decrements. The pause's callback fires. The NVMf subsystem moves to PAUSED state. The detach callback returns.

  5. diskengine detach (Go side). diskengine's startVfioUserDetachLoop:20 runs the per-VM cleanup: nvmf_subsystem_remove_ns (now fast, because the subsystem is already paused), nvmf_delete_subsystem, and rmdir of the /var/diskengine/vfio-user/<vmID> directory. The cleanup marker is removed.

How diskengine's vhost_detach.go is supposed to handle this

diskengine's detach logic is in startVhostDetachLoop:29. The vhost-blk path is the simpler one (it's the one most likely to be re-enabled for testing). The vfio-user path is in startVfioUserDetachLoop:20.

The diskengine-side fix is to put a deadline on the JSON-RPC call. The Go HTTP-style pattern is ctx, cancel := context.WithTimeout(parent, 30*time.Second) and pass ctx to the RPC client. The spdkclient doesn't yet respect a per-call deadline, but the JSON-RPC framework does — it's just a matter of plumbing.

What the fix actually looks like

The fix has three parts: a timeout on the bdev_io, a deadline on the JSON-RPC call, and a force-cleanup path for the wedged subsystem.

1. Per-bdev_io timeout in the vfio-user transport

The vfio-user transport's submission path needs a per-bdev_io timeout. When the timeout fires, the bdev_io is force-completed with an error. The completion decrements io_outstanding. The pause's callback fires. The teardown proceeds.

The natural place to put this is in the SQ poller of the vfio-user transport — a per-bdev_io timestamp recorded at submit, checked in the poller. If the bdev_io is older than the timeout, the poller calls spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_FAILED). This unblocks the NVMf pause and the wedge resolves.

2. Deadline on the JSON-RPC call from diskengine

The diskengine spdkclient needs a per-call deadline. The cleanest API change is to add a WithDeadline(time.Duration) option to the Client.Call:43 method. The default deadline is something like 30 seconds. The caller can override per-call (e.g. 5 seconds for hot-loop RPCs, 2 minutes for bulk operations).

3. Force-cleanup for a wedged subsystem

The third part is a "force" RPC for the wedged subsystem. The RPC tells SPDK to set the subsystem's state to PAUSED directly, bypassing the pause callback machinery. The subsequent remove_ns and delete_subsystem then run on the already-paused subsystem and complete immediately.

The RPC handler has to be careful: a force-cleanup of a subsystem with in-flight I/O is dangerous (the bdev_ios that were "in flight" still hold references to the bdev). The force-cleanup has to also force-complete the in-flight bdev_ios, which means, for each poll group, locating the bdev_ios that the sgroup's io_outstanding count accounts for and completing each one with an error.

Edge cases: what else can break this

1. Multiple VMs shutting down simultaneously

Each VM shutdown is its own remove_ns RPC. The RPCs are on different goroutines in diskengine, but they all hit the same SPDK reactor (or different reactors, depending on which JSON-RPC socket gets the request). The pause for one VM shouldn't block the pause for another VM, because the NVMf subsystems are different. But the bdev layer is shared (the underlying NVMe controllers are shared). If the bdev layer is the source of the wedge, multiple VM shutdowns all hit the same wedge. The whole reactor handling JSON-RPC is stuck waiting on the bdev. Every new remove_ns RPC queues up.

2. Live migration abort

A live migration is in flight. The source QEMU has paused the guest. The destination QEMU hasn't finished receiving. The destination QEMU fails (out of disk space, network drop). The source QEMU resumes the guest. The vfio-user-pci device on the source side is still attached. From SPDK's perspective, nothing changed — the endpoint is still attached, the ctrlr is still RUNNING. The abort is invisible to SPDK. If the abort happens during a teardown sequence, the teardown may not complete (e.g. the detach_device callback may not fire because the device is still attached). The wedge is the same as the QMP quit case, but with the additional complication that the guest is still running and may issue new I/O while the teardown is half-done.

3. kill -9 on QEMU

kill -9 is the path that does not wedge (per the incident report). The reason: kill -9 doesn't go through QEMU's normal shutdown path. The QEMU process dies immediately. The libvfio-user on the SPDK side sees the connection drop (the kernel closes the socket). The tgt_vfu_ctx_poller sees ENOTCONN, and the detach_device callback fires. The detach path does not go through the vfio_user_dev_quiesce_cb; it goes through the immediate-detach path. The vfio_user_device_reset at lib/nvmf/vfio_user.c:3278 is called with VFU_RESET_LOST_CONN, which unregisters the interrupt and skips the quiesce. The detach completes immediately. The in-flight bdev_ios are force-completed (by the bdev's NVMe driver's timeout, or by the bdev_io's own timeout once part 1 of the fix lands). No wedge.

4. The forkbomb scenario

The incident report's worst case: a guest fork-bombs (thousands of processes), each process is doing direct I/O to the root disk, the API stop request comes in, QEMU is unkillable because the guest kernel is busy, QMP quit is sent manually. The forkbomb itself doesn't wedge SPDK — the forkbomb wedges the guest kernel. The QMP quit then takes down QEMU. The vfio-user connection drops. The teardown path runs. The wedge is the same as the standard QMP quit case. The forkbomb's role is to guarantee that the bdev layer has a high io_outstanding count at the moment of teardown — many in-flight I/Os that have not yet been serviced by the bdev. The pause is more likely to take the "wait for I/O" path because more I/Os are in flight. The wait is more likely to exceed whatever timeout exists. The wedge is more likely to form.

5. nvmf_delete_subsystem while remove_ns is still pending

diskengine's detach loop, per the source, does remove_ns and then delete_subsystem in sequence. If remove_ns is the wedged one, delete_subsystem never runs. The subsystem stays around. New attach work for the same NQN conflicts with the leftover subsystem. The diskengine logs the conflict as "namespace raid_2591 still present in nqn.2026-02.io.excloud:vm:11263; deferring" and keeps retrying. Every retry is a remove_ns that hits the same wedge. The retry loop is unbounded. The only fix is to bounce SPDK (the systemctl restart spdk.service that the incident report describes).

6. The "innocent global" race

The spdk_nvmf_subsystem_state_changes TAILQ on the subsystem is protected by subsystem->mutex. The pause appends a state-change context to the TAILQ. The resume pops the head. If a second pause request arrives while the first is in flight (e.g. from a second remove_ns call for a different namespace), the second pause is queued behind the first. The first pause's callback fires when io_outstanding drains. The second pause's callback never fires because the first pause's removal of the namespace already dropped the io_outstanding for the second namespace. The state-change context is orphaned. The TAILQ is corrupted. Subsequent pause requests hang. The fix is to drain the TAILQ on every callback, regardless of which pause was satisfied.
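The difference between popping only the head and draining the whole queue can be shown with a toy queue. This models the described behaviour only: stateChange, popHead, and drainAll are hypothetical, and a Go slice stands in for the TAILQ.

```go
package main

import "fmt"

// stateChange is a hypothetical queued pause context.
type stateChange struct {
	name string
	done bool
}

// popHead completes only the first queued context and leaves the rest
// waiting, which is the behaviour described as the bug.
func popHead(q []*stateChange) []*stateChange {
	if len(q) == 0 {
		return q
	}
	q[0].done = true
	return q[1:]
}

// drainAll completes every queued context once the shared condition
// (no I/O outstanding) holds, which is the proposed fix.
func drainAll(q []*stateChange) []*stateChange {
	for _, sc := range q {
		sc.done = true
	}
	return q[:0]
}

func main() {
	a := &stateChange{name: "remove_ns #1"}
	b := &stateChange{name: "remove_ns #2"}

	left := popHead([]*stateChange{a, b})
	fmt.Println("pop head leaves orphaned:", left[0].name)

	a.done, b.done = false, false
	left = drainAll([]*stateChange{a, b})
	fmt.Println("drain all leaves orphaned:", len(left))
}
```

In the pop-head case the second context has nothing left that could ever complete it, so its caller waits forever.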

7. The cleanup-marker race

The diskengine-side cleanup marker at /var/vmengine/vms/<vmID>/DISKENGINE_CLEANUP is set on API stop and removed on successful detach. If the detach hangs, the marker stays. The vmengine side checks for the marker before re-launching QEMU. If the marker is present, vmengine doesn't launch QEMU. The VM is stuck in STOPPED until diskengine cleans up. If diskengine is stuck on the remove_ns, the marker never goes away, the VM never relaunches, and the user has to manually remove the marker file. This is the "secondary" wedge that the incident report describes: the primary wedge is in SPDK; the secondary is in diskengine's state reconciliation.

Why it matters

The QMP quit wedge is the canonical example of why the threading rules exist. Every step of the teardown is a separate actor that assumes the previous step will eventually complete. The pause assumes the bdev_io will eventually complete. The bdev_io assumes the bdev will eventually complete. The bdev assumes the underlying device will eventually complete. The underlying device assumes the guest will eventually issue a completion. The guest is dead. The chain of assumptions breaks.

The fix is to make the chain robust to termination. Every step needs a deadline. Every wait needs a timeout. Every state transition needs a fallback. The three-part fix (per-bdev_io timeout, JSON-RPC deadline, force-cleanup RPC) is the minimal set of changes to make the teardown sequence complete, even when one of the actors misbehaves.

If you only have time to fix one thing, fix the per-bdev_io timeout. That's the bottom of the chain. Everything else can hang; the timeout will eventually unstick it.