Layer 7 · vhost / virtio / VFIO-user

vhost-user.

A virtio backend that lives outside QEMU. A Unix socket carrying a small fixed set of control messages. Two rings, one in each direction, mapped from a guest memory table. That's the whole protocol. Everything that goes wrong on a bare-metal host (every QMP quit, every live migration, every malformed message) happens inside this protocol. Read this page before you debug the QMP wedge on page 7.4.

~15 min read · 2 diagrams · prerequisites: 2.1 · 4.1
On this page
  1. What vhost-user is, and what problem it solves
  2. The protocol at a glance: a Unix socket, two rings, a control grammar
  3. Split vs packed virtqueues
  4. The control plane: SET_MEM_TABLE, SET_VRING_*, GET_CONFIG, …
  5. The data plane: kick (guest → backend) and call (backend → guest)
  6. How SPDK implements vhost-user (lib/vhost/rte_vhost_user.c)
  7. How QEMU implements the front end (hw/virtio/vhost-user.c)
  8. The connection lifecycle: connect, configure, start, stop, disconnect
  9. Edge cases & what trips people up

What vhost-user is, and what problem it solves

A virtio device has two sides: a frontend inside the guest (a virtio-blk, virtio-scsi, or virtio-net driver) and a backend in the host that actually services the I/O. For QEMU, the most natural backend is QEMU itself: the device model runs inside the user-space QEMU process on the host. That works fine until you want the backend to be a separate, more specialised process, such as an SPDK target that owns its own hugepages and threads.

vhost-user is the protocol that lets that separate backend process be the virtio backend. The frontend (still inside the guest) thinks it's talking to a QEMU-owned virtio-pci device. The backend (SPDK) thinks it's a standalone virtio device with two rings. The two sides coordinate over a Unix-domain socket, using a small fixed-grammar message protocol.

The original vhost was a kernel API. A character device, /dev/vhost-net, let the kernel back a virtio-net device for QEMU. That was fast but not flexible: every backend had to live in the kernel. vhost-user takes the protocol out of the kernel and puts it on a Unix socket, so any userspace process can speak it. With vhost-user the kernel vhost code (drivers/vhost/vhost.c) is no longer involved: the backend process services the rings itself, directly in shared guest memory. The configuration plane (what memory the guest gave us, where the rings are, which features are negotiated) is a userspace protocol between two processes.

The protocol at a glance: a Unix socket, two rings, a control grammar

The protocol has two halves:

  1. The control plane. A reliable, ordered stream of fixed-format messages on a single Unix socket. The frontend sends SET_MEM_TABLE, SET_VRING_ADDR, SET_VRING_KICK, and friends. The backend answers the requests that expect a reply (GET_FEATURES, GET_VRING_BASE, and so on); ring notifications do not travel over this socket. The messages are small (a header plus a payload) and the wire format is defined by the vhost-user spec.

  2. The data plane. One or more virtqueues, each living in guest memory that the backend has mapped. The frontend pushes descriptors into the avail ring. The backend consumes them. The backend pushes used entries into the used ring. The frontend consumes them. Each side notifies the other through an eventfd. The eventfds are not part of guest memory: they are passed as file descriptors over the Unix socket.

flowchart LR
subgraph guest["Guest (Linux virtio driver)"]
  GD["virtio-blk / virtio-scsi / virtio-net driver"]
end

subgraph qemu["QEMU process"]
  QE["vhost-user frontend (hw/virtio/vhost-user.c)"]
end

subgraph spdk["SPDK process (vhost target)"]
  SE["vhost-user backend (lib/vhost/rte_vhost_user.c)"]
  SC["vhost-scsi.c or vhost_blk.c"]
  SB["bdev / lvol / nvmf layer"]
end

GD -- "MMIO on virtio-pci" --> QE
QE -- "control messages (Unix socket)" --> SE
SE --> SC --> SB

GD -.->|shared guest memory: descriptor table, avail ring, used ring · eventfds: kickfd, callfd| SE

fig. 1   The three-party topology of vhost-user. The guest never knows its I/O is not serviced by QEMU; the control messages ride a Unix socket between QEMU and SPDK; the data plane rides the shared rings in guest memory.

The control-message grammar is small. The full list lives in the spec, but the high-traffic ones are:

| Message | Direction | What it does |
| --- | --- | --- |
| SET_OWNER | QEMU → SPDK | Marks the sender as the front end that owns this session. Sent when a new connection is established. |
| GET_FEATURES / SET_FEATURES | QEMU → SPDK (GET is answered with a reply) | Negotiate the virtio feature bits. QEMU reads the backend's feature mask, ANDs it with the features it wants, and sends the result back. |
| SET_MEM_TABLE | QEMU → SPDK | Tells the backend the guest's physical memory regions and passes a file descriptor for each, so the backend can mmap them and translate guest-physical addresses to its own virtual addresses. |
| SET_VRING_NUM | QEMU → SPDK | Size of a particular virtqueue (a power of two for split rings). |
| SET_VRING_ADDR | QEMU → SPDK | Addresses of the descriptor table, avail ring, and used ring for a virtqueue. They are expressed in QEMU's user address space; the backend translates them through the memory table. |
| SET_VRING_KICK / SET_VRING_CALL | QEMU → SPDK | Passes the eventfds the backend uses to know "guest pushed an avail" (kick) and the guest uses to know "backend pushed a used" (call). The eventfds travel as file descriptors in the socket's ancillary data; they are not part of the memory table. |
| SET_VRING_ENABLE | QEMU → SPDK | Enable or disable a virtqueue. The backend does not process disabled rings. |
| GET_CONFIG / SET_CONFIG | QEMU → SPDK (GET is answered with a reply) | Read or write the device-specific configuration space (capacity, geometry, queue counts, …). |
| GET_VRING_BASE | QEMU → SPDK (answered with a reply) | "I am stopping the device. What was the last avail/used index?" Used for live migration handoff. |
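
To make "a header plus a payload" concrete, here is a minimal sketch of the framing. The 12-byte header (request, flags, size) and the request numbers follow the published vhost-user spec; the encode and decode helpers are illustrative names, not functions from QEMU or SPDK.

```python
import struct
from enum import IntEnum

# Header per the vhost-user spec: request (u32), flags (u32), payload size (u32).
# "=" means native byte order with standard sizes and no padding.
HEADER = struct.Struct("=III")

VERSION = 0x1          # lower two bits of flags
FLAG_REPLY = 0x4       # bit 2: this message is a reply
FLAG_NEED_REPLY = 0x8  # bit 3: sender asks for an explicit acknowledgement


class Request(IntEnum):
    GET_FEATURES = 1
    SET_FEATURES = 2
    SET_OWNER = 3
    SET_MEM_TABLE = 5
    SET_VRING_NUM = 8
    SET_VRING_ADDR = 9
    SET_VRING_BASE = 10
    GET_VRING_BASE = 11
    SET_VRING_KICK = 12
    SET_VRING_CALL = 13
    SET_VRING_ENABLE = 18


def encode(request, payload=b"", flags=0):
    """Illustrative helper: frame one control message."""
    return HEADER.pack(request, VERSION | flags, len(payload)) + payload


def decode(buf):
    """Illustrative helper: split one framed message into its parts."""
    request, flags, size = HEADER.unpack_from(buf)
    if flags & 0x3 != VERSION:
        raise ValueError("unsupported vhost-user version")
    payload = buf[HEADER.size:HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated payload")
    return Request(request), flags, payload


# SET_FEATURES carries a single u64 feature mask as its payload.
msg = encode(Request.SET_FEATURES, struct.pack("=Q", 1 << 34))
request, flags, payload = decode(msg)
```

A request number outside the enum makes decode raise ValueError, which is roughly the situation edge case 2 below describes from the backend's point of view.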

Split vs packed virtqueues

The data plane has two physical layouts: split and packed. The split layout is the original; the packed layout is newer and denser. Both are negotiated at feature time via VIRTIO_F_RING_PACKED.

Split (legacy)

Three separate arrays in guest memory:

  • Descriptor table: a fixed-size array of vring_desc entries (16 bytes each: addr, len, flags, next). Chains via the next field.
  • Avail ring: the guest posts head-descriptor indices here. The avail->idx field increments on every push.
  • Used ring: the backend posts completed entries here. The used->idx field increments on every push.

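The idx fields are free-running 16-bit counters: they are not reduced modulo the queue size and simply wrap at 65536. The backend keeps its own last_avail_idx and compares it with avail->idx using 16-bit arithmetic, so wraparound needs no special case.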
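
The wraparound arithmetic is easier to trust once you have seen it. A minimal sketch, with hypothetical helper names (new_entries and slot are not SPDK functions):

```python
U16_MASK = 0xFFFF


def new_entries(avail_idx, last_avail_idx):
    """Number of entries the guest has published since the backend last looked.

    Both values are free-running 16-bit counters, so the subtraction is done
    modulo 2**16 and stays correct across the 65535 -> 0 wrap.
    """
    return (avail_idx - last_avail_idx) & U16_MASK


def slot(idx, queue_size):
    """Ring slot for a free-running index (split queue sizes are powers of two)."""
    return idx & (queue_size - 1)


# The guest pushed four entries while the counter wrapped past 65535.
last_avail_idx = 65534
avail_idx = 2
count = new_entries(avail_idx, last_avail_idx)
slots = [slot((last_avail_idx + i) & U16_MASK, 256) for i in range(count)]
```

Here count is 4 and the slots visited are 254, 255, 0, 1: the counter wrap and the ring wrap happen to coincide only because 256 divides 65536.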

Packed (modern)

One single ring of vring_packed_desc entries (16 bytes each), with a wrap counter:

  • The flags field carries both the descriptor flags and the avail/used bits (single-producer/single-consumer synchronization).
  • A wrap counter, kept by each side and compared against the avail/used bits in flags, tells the consumer which lap of the ring it's reading. Replaces the avail->idx / used->idx split.
  • Fewer cache misses than the split layout, because the descriptor, the avail marker, and the used marker all live next to each other.
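
As a sketch of how the wrap counter and the two flag bits combine: the bit positions (7 for avail, 15 for used) come from the virtio spec; the function names are made up for this page.

```python
# Bit positions from the virtio packed-ring definition.
DESC_F_AVAIL = 1 << 7
DESC_F_USED = 1 << 15


def is_available(flags, wrap_counter):
    """Backend side: has the driver published this descriptor on the current lap?"""
    avail = bool(flags & DESC_F_AVAIL)
    used = bool(flags & DESC_F_USED)
    return avail == wrap_counter and used != wrap_counter


def is_used(flags, wrap_counter):
    """Driver side: has the backend completed this descriptor on the current lap?"""
    avail = bool(flags & DESC_F_AVAIL)
    used = bool(flags & DESC_F_USED)
    return avail == wrap_counter and used == wrap_counter


def advance(index, wrap_counter, queue_size, count=1):
    """Move past `count` descriptors, flipping the wrap counter at the ring end."""
    index += count
    if index >= queue_size:
        index -= queue_size
        wrap_counter = not wrap_counter
    return index, wrap_counter
```

On the first lap the wrap counter is True, so a published descriptor has avail set and used clear. On the second lap the meaning of the bits inverts, which is why no separate index field is needed.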

Packed rings follow the same overall flow. The descriptor format is different but the consumer logic is the same: find new requests, mark them consumed, signal the guest. See lib/vhost/rte_vhost_user.c:1041, where the enable_device_vq function reads vsession->negotiated_features & (1ULL << VIRTIO_F_RING_PACKED) and stores it on the virtqueue for the poller to consult.

The control plane in detail: SET_MEM_TABLE and friends

The most important control message is VHOST_USER_SET_MEM_TABLE. It tells the backend the guest's physical memory layout. Without it, the backend can't translate a guest-physical address (GPA) into a host-virtual address (HVA) and the data plane is dead.
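
The translation itself is a range lookup. A minimal sketch, assuming the backend has already mmap'ed every region: guest_phys_addr and memory_size are the field names the spec uses for a memory region, while backend_base and gpa_to_hva are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    guest_phys_addr: int  # start of the region in guest-physical space
    memory_size: int      # length of the region in bytes
    backend_base: int     # hypothetical: address where the backend mapped it


def gpa_to_hva(regions, gpa, length):
    """Translate a guest-physical buffer to a backend virtual address.

    Returns None when the buffer is not fully contained in one region,
    which a backend has to treat as a malformed request.
    """
    for region in regions:
        end = region.guest_phys_addr + region.memory_size
        if region.guest_phys_addr <= gpa and gpa + length <= end:
            return region.backend_base + (gpa - region.guest_phys_addr)
    return None


regions = [
    Region(0x0, 0xA0000, 0x7F00_0000_0000),
    Region(0x100000, 0x4000_0000, 0x7F10_0000_0000),
]
hva = gpa_to_hva(regions, 0x100200, 512)
in_hole = gpa_to_hva(regions, 0xB0000, 512)
straddles = gpa_to_hva(regions, 0xA0000 - 256, 512)
```

The length check matters as much as the start address: a descriptor that begins inside a region and runs off its end must be rejected, not translated.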

The full control-message dispatch is in extern_vhost_pre_msg_handler at lib/vhost/rte_vhost_user.c:1441. The function handles 40+ message types, but most are one-liners that just store a value on the spdk_vhost_session for later.

The control messages that do work (allocate, destroy, mmap, mlock) are: SET_MEM_TABLE (mmap guest memory regions), the SET_VRING_* family (store guest-physical addresses and eventfds on the virtqueue), SET_FEATURES (negotiate which virtio extensions are on), and the destroy handlers (stop_device, destroy_connection). Everything else is bookkeeping.

The data plane: kick (guest → backend) and call (backend → guest)

The data plane is two rings, one in each direction. The guest pushes descriptors into the avail ring; the backend pulls them out, services them, and pushes the completions into the used ring; the guest pulls them out.

To avoid polling, each virtqueue has a pair of eventfds:

  • Kick (guest → backend). When the guest pushes new descriptors into the avail ring it notifies the device, and that notification signals the kick eventfd (typically through KVM's ioeventfd). An eventfd write is an 8-byte counter increment, not a single byte. The backend's epoll sees the eventfd as readable, the reactor calls the backend's data-plane poller, and the poller reads the avail ring and dispatches the I/Os.

  • Call (backend → guest). The backend writes an 8-byte value to the call eventfd when it pushes completions into the used ring. The guest receives that as an interrupt (typically through KVM's irqfd); its virtio driver wakes up, drains the used ring, and hands the completed I/Os to the block layer.

flowchart TB
A["Guest user-space reads /dev/vda"] --> B["Guest virtio-blk driver / pushes descs to avail ring"]
B --> C["Guest notifies the device / kickfd is signalled"]
C -- "epoll" --> D["SPDK reactor poller / vhost_vq_avail_ring_get"]
D --> E[Service I/O via bdev]
E --> F[Push completions to used ring]
F --> G["SPDK signals callfd / 8-byte eventfd write"]
G -- "eventfd" --> H[Guest virtio driver wakes]
H --> I[Guest copies data to user-space]
I --> A

fig. 2   One read(2) on the guest. The hot path is the two eventfd wakes and the two ring writes; everything in between is in-memory DMA.

In poll mode (the SPDK default), the backend doesn't actually wait for the kick eventfd. The reactor's main loop calls the backend's poller on every iteration, and the poller just reads avail->idx. If there are no new requests, the poll is cheap. If there are, the poller services them. The eventfd is still passed via SET_VRING_KICK for protocol correctness and for live migration; it's just not in the fast path.
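
A toy model of that poll loop, as a sketch only. The return values are modelled on SPDK's SPDK_POLLER_IDLE and SPDK_POLLER_BUSY; SplitQueue, guest_submit, and poll are hypothetical names, and real I/O completes asynchronously rather than inside the poller.

```python
POLLER_IDLE = 0
POLLER_BUSY = 1
U16_MASK = 0xFFFF


class SplitQueue:
    """Toy split virtqueue: only the fields the poller needs."""

    def __init__(self, size):
        self.size = size
        self.avail_ring = [0] * size
        self.avail_idx = 0        # written by the guest
        self.used_ring = [None] * size
        self.used_idx = 0         # written by the backend
        self.last_avail_idx = 0   # backend-private cursor


def guest_submit(q, head):
    """Guest side: publish one descriptor-chain head."""
    q.avail_ring[q.avail_idx % q.size] = head
    q.avail_idx = (q.avail_idx + 1) & U16_MASK


def poll(q, service):
    """Backend side: one reactor iteration. No eventfd is consulted."""
    count = (q.avail_idx - q.last_avail_idx) & U16_MASK
    if count == 0:
        return POLLER_IDLE  # the cheap case: one index comparison
    for _ in range(count):
        head = q.avail_ring[q.last_avail_idx % q.size]
        q.last_avail_idx = (q.last_avail_idx + 1) & U16_MASK
        written = service(head)  # synchronous here for brevity
        q.used_ring[q.used_idx % q.size] = (head, written)
        q.used_idx = (q.used_idx + 1) & U16_MASK
    return POLLER_BUSY


q = SplitQueue(4)
first = poll(q, lambda head: 512)
guest_submit(q, 3)
guest_submit(q, 1)
second = poll(q, lambda head: 512)
third = poll(q, lambda head: 512)
```

The idle case touches one shared field and returns, which is why polling every reactor iteration is affordable.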

How SPDK implements vhost-user

The vhost-user backend is split into three layers:

  • lib/vhost/vhost.c — global state, the device RB-tree, the spdk_vhost_lock() pthread mutex that protects it, the spdk_vhost_dev_* API. Look at lib/vhost/vhost.c:20 for g_vhost_mutex and lib/vhost/vhost.c:139 for the vhost_dev_register critical section.

  • lib/vhost/rte_vhost_user.c — the protocol layer. All the vhost-user control messages, the socket setup, the per-vhost-user spdk_vhost_user_dev state (with its own user_dev->lock pthread mutex), the connection lifecycle. This is the file the QMP-quit wedge lives in.

  • lib/vhost/vhost_scsi.c or lib/vhost/vhost_blk.c — the device-specific backend. Implements start_session, stop_session, alloc_vq_tasks, the bdev submission path, the poller that drains the avail ring.

How QEMU implements the front end

The QEMU side lives in hw/virtio/vhost-user.c (QEMU's tree, not SPDK's). It speaks the same vhost-user protocol over a Unix socket. The relevant entry point is vhost_user_backend_connect():

  1. QEMU connects to the SPDK process, which must already be running and listening. The socket file at e.g. /var/diskengine/vhost/12345/cntrl is what they connect over.

  2. The QEMU side sends SET_OWNER to mark itself as the front end that owns the session.

  3. The QEMU side sends GET_FEATURES, the backend replies with its feature mask, QEMU ANDs it with the guest's requirements and sends SET_FEATURES back.

  4. The QEMU side sends the memory table (SET_MEM_TABLE) and the per-virtqueue configuration (SET_VRING_NUM, SET_VRING_ADDR, SET_VRING_KICK, SET_VRING_CALL).

  5. The QEMU side sends SET_VRING_ENABLE on each queue. The backend's enable_device_vq function at lib/vhost/rte_vhost_user.c:1027 allocates the per-queue task pool and looks up the kickfd/callfd from the device, and the poller is now ready.
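
The steps above can be sketched as a message plan plus the feature AND. The feature bit numbers are from the virtio spec; the message order is simply the order this page lists them in, not a trace of what QEMU emits, and negotiate and handshake_plan are hypothetical helpers.

```python
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_RING_PACKED = 34


def bit(n):
    return 1 << n


def negotiate(backend_features, frontend_wanted):
    """Only features both sides support survive negotiation."""
    return backend_features & frontend_wanted


def handshake_plan(num_queues):
    """Control messages in the order listed on this page (an assumption)."""
    plan = ["SET_OWNER", "GET_FEATURES", "SET_FEATURES", "SET_MEM_TABLE"]
    for q in range(num_queues):
        plan += [
            f"SET_VRING_NUM[{q}]",
            f"SET_VRING_ADDR[{q}]",
            f"SET_VRING_KICK[{q}]",
            f"SET_VRING_CALL[{q}]",
        ]
    plan += [f"SET_VRING_ENABLE[{q}]" for q in range(num_queues)]
    return plan


# The backend offers packed rings but the guest driver does not ask for them,
# so the session ends up on the split layout.
backend_features = bit(VIRTIO_F_VERSION_1) | bit(VIRTIO_F_RING_PACKED)
frontend_wanted = bit(VIRTIO_F_VERSION_1)
acked = negotiate(backend_features, frontend_wanted)
uses_packed = bool(acked & bit(VIRTIO_F_RING_PACKED))
plan = handshake_plan(2)
```

Because the result is an intersection, a feature can be switched off by either side alone but switched on only by both.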

From here on, the data plane runs without the control socket. The control socket is used for live migration (the GET_VRING_BASE / SET_VRING_BASE handshake) and for configuration changes (resize, hotplug, etc.).

The connection lifecycle: connect, configure, start, stop, disconnect

STEP 01
vhost_register_unix_socket
SPDK creates the listening Unix socket at e.g. /var/diskengine/vhost/<id>/cntrl
STEP 02
QEMU connect()
QEMU connect(2)s to the socket; DPDK's rte_vhost library, running inside the SPDK process, accepts the connection and creates an internal vid
STEP 03
new_connection
The rte_vhost library calls SPDK's new_connection (rte_vhost_user.c:880); SPDK allocates a vsession and adds it to user_dev->vsessions while holding user_dev->lock
STEP 04
Control-plane handshake
QEMU runs SET_OWNER, GET_FEATURES, SET_MEM_TABLE, SET_VRING_NUM/ADDR/KICK/CALL on every virtqueue
STEP 05
start_device
QEMU sends SET_VRING_ENABLE; the rte_vhost library calls start_device (rte_vhost_user.c:1127); SPDK hops to vdev->thread, registers the poller
STEP 06
Data plane runs
Poller drains the avail ring on every reactor tick, services I/O, fills the used ring, signals the callfd
STEP 07
stop_device
QMP quit / live-migration pause / QEMU crash; the rte_vhost library calls SPDK's stop_device; SPDK unregisters the data-plane poller and registers a stop_poller that waits for in-flight I/O to drain
STEP 08
destroy_connection
The rte_vhost library frees the vid and calls SPDK's destroy_connection (rte_vhost_user.c:1189); SPDK frees the session
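
One way to keep the eight steps straight is to read them as a per-session state machine. The sketch below is a reading aid, not SPDK's actual bookkeeping: the state names and event strings are invented for this page.

```python
from enum import Enum


class S(Enum):
    CONNECTED = "connected"    # after new_connection (step 03)
    CONFIGURED = "configured"  # after the control-plane handshake (step 04)
    RUNNING = "running"        # after start_device (steps 05 and 06)
    STOPPING = "stopping"      # stop_device called, stop poller draining (step 07)
    STOPPED = "stopped"        # in-flight I/O has drained
    DESTROYED = "destroyed"    # after destroy_connection (step 08)


TRANSITIONS = {
    (S.CONNECTED, "handshake_done"): S.CONFIGURED,
    (S.CONFIGURED, "start_device"): S.RUNNING,
    (S.RUNNING, "stop_device"): S.STOPPING,
    (S.STOPPING, "drained"): S.STOPPED,
    (S.STOPPED, "start_device"): S.RUNNING,  # restart, e.g. after a new memory table
    (S.CONNECTED, "destroy_connection"): S.DESTROYED,
    (S.CONFIGURED, "destroy_connection"): S.DESTROYED,
    (S.STOPPED, "destroy_connection"): S.DESTROYED,
}


def step(state, event):
    """Apply one lifecycle event, rejecting anything out of order."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event} is not valid in state {state.name}") from None


state = S.CONNECTED
for event in ("handshake_done", "start_device", "stop_device",
              "drained", "destroy_connection"):
    state = step(state, event)

# There is deliberately no edge from RUNNING or STOPPING straight to DESTROYED.
try:
    step(S.RUNNING, "destroy_connection")
    rejected = False
except ValueError:
    rejected = True
```

The missing edge out of STOPPING is the interesting one: a session that never reports "drained" can never be destroyed, which is the shape of the wedge described in edge case 7.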

Edge cases & what trips people up

1. The QEMU process dies unceremoniously

When QEMU is killed with kill -9, its end of the socket closes. The rte_vhost library sees the disconnect and calls destroy_device and then destroy_connection on the backend, so the backend's stop_device path runs. The vsession is unregistered from the data-plane poller. If there was in-flight I/O when QEMU died, the stop poller waits for those bdev_ios to complete (the bdev layer will time them out via the NVMe driver). The user_dev->lock contention that happens here is the QMP-quit-wedge candidate: the completions have no guest left to be delivered to, yet the session can't be freed until they drain. The destroy_connection function then frees the vsession.

2. A malformed vhost-user message

The control-message dispatcher in extern_vhost_pre_msg_handler at lib/vhost/rte_vhost_user.c:1441 returns RTE_VHOST_MSG_RESULT_ERR for an unknown message ID, which causes the rte_vhost library to close the socket. The backend treats the close the same way it treats a QEMU death. The session is torn down. If the malformed message arrived during a state change (e.g. in the middle of a stop sequence), the stop poller may time out.

3. The socket file is reused

If a vhost controller is created with the same name as a previous one, the new vhost_register_unix_socket call will fail with EADDRINUSE: the old socket file is still on disk because the previous SPDK process was killed without running vhost_driver_unregister. The controller creation fails. The fix is to remove the stale socket file from /var/diskengine/vhost/<ctrlr> before retrying. diskengine has a recovery pass for this in lib/vhost/rte_vhost_user.c:1916 (vhost_driver_unregister during vhost_user_dev_unregister), but it only runs on a clean unregister, not a crash.
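
A pre-flight check for the stale file can be sketched as follows. remove_if_stale is a hypothetical helper, not part of SPDK or diskengine. It decides "stale" by probing with connect(2): a socket file nobody is listening on refuses the connection.

```python
import os
import socket
import stat
import tempfile


def remove_if_stale(path):
    """Unlink `path` if it is a Unix socket with no listener. Returns True if removed."""
    try:
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return False  # not a socket: leave it alone
    except FileNotFoundError:
        return False
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        os.unlink(path)  # nobody is listening: the file is a leftover
        return True
    finally:
        probe.close()
    return False  # something accepted the connection: the socket is live


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cntrl")

    # Simulate a crash: bind creates the file, close does not remove it.
    dead = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    dead.bind(path)
    dead.close()
    removed_stale = remove_if_stale(path)
    gone = not os.path.exists(path)

    # A live listener must be left untouched.
    live = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    live.bind(path)
    live.listen(1)
    removed_live = remove_if_stale(path)
    still_there = os.path.exists(path)
    live.close()
```

One caveat: probing a live vhost-user socket opens and closes a real connection, which the backend sees as a short-lived session. Run a check like this only before creating a controller, never against one that is in service.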

4. Live migration

Live migration is a controlled tear-down of the old session and a controlled bring-up of the new one. The source QEMU sends VHOST_USER_GET_VRING_BASE to ask the backend for the ring index it had reached on each virtqueue. The source QEMU serializes the guest's memory and sends it to the destination. The destination QEMU restarts the device with a new SET_MEM_TABLE (different guest memory regions). The backend's extern_vhost_pre_msg_handler handles SET_MEM_TABLE during a live session by tearing the session down and marking it for restart. This is the vsession->needs_restart = true branch of that handler. The new SET_VRING_ENABLE message is what actually re-starts the poller.

5. The "innocent file" trap

/var/diskengine/vhost/<ctrlr> is owned by the first SPDK process that created it. If a second SPDK process tries to bind(2) the same path, the bind fails. The fix is to either have the second SPDK process unlink(2) the stale socket first, or to put a process-specific component such as the PID in the path. diskengine keys the path on the VM ID, not the SPDK process PID, so two VMs with the same ID across a restart will collide.

6. The "I called SET_VRING_KICK without SET_MEM_TABLE" trap

The control-message ordering is not enforced by the protocol. The backend's extern_vhost_pre_msg_handler is best-effort. If the front end sends SET_VRING_KICK before SET_MEM_TABLE, the backend stores the eventfd, but the data-plane poller can't reach the rings because the guest memory hasn't been mapped. The poller will see no requests, the guest will see no completions, and the device will appear hung. The fix is to never start a data-plane poller until vsession->mem != NULL. The start_device function does this check explicitly.
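
The guard amounts to a precondition check before the poller is registered. A minimal sketch with hypothetical types: only the mem field mirrors a name used on this page, everything else is illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vq:
    kickfd: Optional[int] = None   # stored by SET_VRING_KICK
    callfd: Optional[int] = None   # stored by SET_VRING_CALL
    addrs_set: bool = False        # stored by SET_VRING_ADDR
    enabled: bool = False          # stored by SET_VRING_ENABLE


@dataclass
class Session:
    mem: Optional[object] = None   # set once SET_MEM_TABLE has been handled
    vqs: List[Vq] = field(default_factory=list)


def missing_for_start(session):
    """Return the reasons a data-plane poller must not be started yet."""
    problems = []
    if session.mem is None:
        problems.append("SET_MEM_TABLE not received")
    for i, vq in enumerate(session.vqs):
        if not vq.enabled:
            continue  # disabled queues are never polled
        if not vq.addrs_set:
            problems.append(f"vq {i}: SET_VRING_ADDR not received")
        if vq.kickfd is None:
            problems.append(f"vq {i}: SET_VRING_KICK not received")
        if vq.callfd is None:
            problems.append(f"vq {i}: SET_VRING_CALL not received")
    return problems


# The trap from this section: the kick eventfd arrived, the memory table did not.
early = Session(vqs=[Vq(kickfd=7, callfd=8, addrs_set=True, enabled=True)])
early_problems = missing_for_start(early)

ready = Session(mem=object(), vqs=[Vq(kickfd=7, callfd=8, addrs_set=True, enabled=True)])
ready_problems = missing_for_start(ready)
```

Returning the list of reasons rather than a bare boolean makes the "device appears hung" case diagnosable from a log line.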

7. The QMP-quit-wedge candidate

When the QEMU process exits via QMP quit, the rte_vhost library calls the backend's stop_device, which calls _stop_session. For the SCSI backend, that's lib/vhost/vhost_scsi.c:1542, which unregisters the requestq poller and registers a stop_poller that waits for in-flight bdev_ios to drain. If task_cnt stays above zero because the bdev layer has a long retry queue, the stop poller will spin on pthread_mutex_trylock(&user_dev->lock) and task_cnt checks for SPDK_VHOST_SESSION_STOP_RETRY_TIMEOUT_IN_SEC seconds. After that, the session is force-closed but the bdev_ios keep their references. This is the structural wedge that page 7.4 will dissect.
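
The drain-with-deadline pattern, reduced to its skeleton. This is a sketch of the idea, not SPDK's stop poller: in SPDK the check runs as a poller callback on the reactor rather than a blocking loop, and wait_for_drain and its parameters are hypothetical.

```python
import time


def wait_for_drain(get_task_cnt, timeout_sec, clock=time.monotonic,
                   sleep=time.sleep, interval=0.001):
    """Return True once in-flight work reaches zero, False if the deadline passes."""
    deadline = clock() + timeout_sec
    while True:
        if get_task_cnt() == 0:
            return True
        if clock() >= deadline:
            return False  # caller must decide what to do with the leftovers
        sleep(interval)


# Case 1: the outstanding I/Os complete on their own.
remaining = iter([2, 1, 0])
drained = wait_for_drain(lambda: next(remaining), 1.0, sleep=lambda _dt: None)

# Case 2: one I/O never completes. A fake clock keeps the example instant.
now = [0.0]


def fake_clock():
    return now[0]


def fake_sleep(dt):
    now[0] += dt


stuck = wait_for_drain(lambda: 1, 3.0, clock=fake_clock,
                       sleep=fake_sleep, interval=1.0)
```

The False branch is where the design question lives: closing the session at that point frees the session but not the I/Os that still reference it.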

Why it matters

The QMP quit wedge that costs you a service restart is a vhost-user protocol bug. The data path, the control path, and the lifecycle of the per-vhost-user device are all vhost-user. The next three pages go deeper:

  • 7.2 — the two flavours (vhost-blk and vhost-scsi) and when each is used.
  • 7.3 — the vfio-user alternative and how it differs in transport and cost.
  • 7.4 — the QMP quit wedge, in full, with the lock holding path and the teardown sequence the process is supposed to do.