vhost-user.
A virtio backend that lives outside QEMU. A Unix socket carrying a small fixed set of control messages. Two rings, one in each direction, mapped from a guest memory table. That's the whole protocol. Everything that goes wrong on a baremetal host — every QMP quit, every live-migration, every malformed message — happens inside this protocol. Read this page before you debug the QMP wedge on page 7.4.
- What vhost-user is, and what problem it solves
- The protocol at a glance: a Unix socket, two rings, a control grammar
- Split vs packed virtqueues
- The control plane: SET_MEM_TABLE, SET_VRING_*, GET_CONFIG, …
- The data plane: kick (guest → backend) and call (backend → guest)
- How SPDK implements vhost-user (lib/vhost/rte_vhost_user.c)
- How QEMU implements the front end (hw/virtio/vhost-user.c)
- The connection lifecycle: connect, configure, start, stop, disconnect
- Edge cases & what trips people up
What vhost-user is, and what problem it solves
A virtio device has two sides: a frontend inside the guest (a virtio-blk, virtio-scsi, or virtio-net driver) and a backend in the host that actually services the I/O. For QEMU, the most natural backend is QEMU itself: the device model runs inside the host's user-space QEMU process. That works fine until you want the backend to be a separate, more specialised process: an SPDK target, for example, that owns its own hugepages and threads.
vhost-user is the protocol that lets that separate backend process be the virtio backend. The frontend (still inside the guest) thinks it's talking to a QEMU-owned virtio-pci device. The backend (SPDK) thinks it's a standalone virtio device with two rings. The two sides coordinate over a Unix-domain socket, using a small fixed-grammar message protocol.
The original vhost was a kernel API. A character device at
/dev/vhost-net let the kernel back a virtio-net
device for QEMU. That was fast but not flexible: every
backend had to live in the kernel. vhost-user takes the
protocol out of the kernel and puts it on a Unix socket, so
any userspace process can speak it. The kernel vhost code
(drivers/vhost/vhost.c in the kernel tree) is not involved
at all: the backend process maps guest memory and services
the rings itself. The configuration plane (what memory the
guest gave us, where the rings are, which features are
negotiated) is a userspace protocol between two processes.
The protocol at a glance: a Unix socket, two rings, a control grammar
The protocol has two halves:
- The control plane. A reliable, ordered stream of fixed-format messages on a single Unix socket. The frontend sends SET_MEM_TABLE, SET_VRING_ADDR, SET_VRING_KICK, and friends. The backend answers the requests that expect a reply, such as GET_FEATURES and GET_VRING_BASE. The messages are small (a header plus a payload) and the wire format is defined by the vhost-user spec.
- The data plane. Two rings per virtqueue, mapped from guest memory. The frontend pushes descriptors into the avail ring. The backend consumes them. The backend pushes used entries into the used ring. The frontend consumes them. Each side "kicks" the other through an eventfd whose file descriptor is passed over the control socket.
flowchart LR
  subgraph guest["Guest (Linux virtio driver)"]
    GD["virtio-blk / virtio-scsi / virtio-net driver"]
  end
  subgraph qemu["QEMU process"]
    QE["vhost-user frontend (hw/virtio/vhost-user.c)"]
  end
  subgraph spdk["SPDK process (vhost target)"]
    SE["vhost-user backend (lib/vhost/rte_vhost_user.c)"]
    SC["vhost-scsi.c or vhost_blk.c"]
    SB["bdev / lvol / nvmf layer"]
  end
  GD -- "MMIO on virtio-pci" --> QE
  QE -- "control messages (Unix socket)" --> SE
  SE --> SC --> SB
  GD -.->|shared memory: avail ring, used ring, kickfd, callfd, guest memory table| SE
fig. 1 The three-process topology of vhost-user. The guest never knows it's not talking to QEMU-owned I/O; the control messages ride a Unix socket between QEMU and SPDK; the data plane rides two shared rings in guest memory.
The control-message grammar is small. The full list lives in the spec, but the high-traffic ones are:
| Message | Direction | What it does |
|---|---|---|
| SET_OWNER | QEMU → SPDK | Sets the process that "owns" the device (the one that will close the socket on shutdown). |
| GET_FEATURES / SET_FEATURES | QEMU → SPDK (SPDK replies to GET) | Negotiate the virtio feature bits. The two sides AND the requested features with their own host features. |
| SET_MEM_TABLE | QEMU → SPDK | Tells the backend the guest's physical memory regions, so the backend can mmap them and translate guest-physical addresses to host-virtual addresses. |
| SET_VRING_NUM | QEMU → SPDK | Size of a particular virtqueue (a power of two for split rings). |
| SET_VRING_ADDR | QEMU → SPDK | Addresses of the descriptor table, avail ring, and used ring for a virtqueue. They are sent as QEMU user-space addresses, which the backend translates through the memory table. |
| SET_VRING_KICK / SET_VRING_CALL | QEMU → SPDK | Passes the eventfds the backend uses to know "guest pushed an avail" (kick) and the guest uses to know "backend pushed a used" (call). The file descriptors travel as ancillary data (SCM_RIGHTS) on the Unix socket; they are not part of the memory table. |
| SET_VRING_ENABLE | QEMU → SPDK | Enable or disable a virtqueue. The backend refuses new submissions to disabled rings. |
| GET_CONFIG / SET_CONFIG | QEMU → SPDK (SPDK replies to GET) | Read or write the device-specific configuration space (capacity, geometry, queue counts, …). |
| GET_VRING_BASE | QEMU → SPDK | "I am stopping the device. What was the last avail/used index?" Used for live migration handoff. |
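Every message in the table shares the same small header. The following is a minimal sketch of that framing, not code from QEMU or SPDK: the header layout (request, flags, size as three 32-bit words) and the request codes follow the vhost-user spec, while the function names and the choice of native byte order are assumptions made for illustration.

```python
import struct

# Request codes as listed in the vhost-user specification.
VHOST_USER_GET_FEATURES = 1
VHOST_USER_SET_FEATURES = 2
VHOST_USER_SET_OWNER = 3
VHOST_USER_SET_MEM_TABLE = 5

VHOST_USER_VERSION = 0x1        # bits 0-1 of the flags word
VHOST_USER_REPLY_FLAG = 1 << 2  # set by the backend on replies

# request (u32), flags (u32), payload size (u32).
# Native byte order is an assumption of this sketch: both peers share a host.
HEADER = struct.Struct("=III")


def pack_message(request: int, payload: bytes = b"", reply: bool = False) -> bytes:
    """Frame one control message: fixed header followed by the payload."""
    flags = VHOST_USER_VERSION | (VHOST_USER_REPLY_FLAG if reply else 0)
    return HEADER.pack(request, flags, len(payload)) + payload


def unpack_message(data: bytes):
    """Split a framed message back into (request, flags, payload)."""
    request, flags, size = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated vhost-user message")
    return request, flags, payload
```

A SET_FEATURES message, for example, is the 12-byte header followed by one 64-bit feature mask: `pack_message(VHOST_USER_SET_FEATURES, struct.pack("=Q", mask))`. File descriptors (for SET_MEM_TABLE and the eventfd messages) travel beside the bytes as ancillary data and are not modeled here.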
Split vs packed virtqueues
The data plane has two physical layouts: split and
packed. The split layout is the original; the packed
layout is newer and denser. The packed layout is used only
when both sides negotiate VIRTIO_F_RING_PACKED at feature
time; otherwise the ring is split.
The split layout uses three separate arrays in guest memory:
- Descriptor table: a fixed-size array of vring_desc entries (16 bytes each: addr, len, flags, next). Chains via the next field.
- Avail ring: the guest posts head-descriptor indices here. The avail->idx field increments on every push.
- Used ring: the backend posts completed entries here. The used->idx field increments on every push.
The idx field is 16 bits and wraps freely. The backend
keeps its own 16-bit last_avail_idx and compares it with
avail->idx to find new entries, so wraparound needs no
special case.
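The 16-bit index arithmetic can be sketched as follows. This is an illustrative model in Python, not SPDK code: the function names are hypothetical, and the avail ring is modeled as a plain list of head-descriptor indices.

```python
def pending_avail(avail_idx: int, last_avail_idx: int) -> int:
    """Number of entries the guest has pushed that the backend has not seen.

    Both indices are free-running 16-bit counters, so the subtraction is
    done modulo 2**16 and stays correct across wraparound.
    """
    return (avail_idx - last_avail_idx) & 0xFFFF


def drain(avail_ring, avail_idx: int, last_avail_idx: int, queue_size: int):
    """Collect the new head-descriptor indices and advance last_avail_idx."""
    heads = []
    while last_avail_idx != avail_idx:
        # The counter is reduced modulo the queue size only to pick a slot.
        heads.append(avail_ring[last_avail_idx % queue_size])
        last_avail_idx = (last_avail_idx + 1) & 0xFFFF
    return heads, last_avail_idx
```

With a 4-entry queue, last_avail_idx at 0xFFFE and avail->idx at 2, there are four pending entries even though the guest's index is numerically smaller than the backend's.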
The packed layout uses a single ring of vring_packed_desc
entries (16 bytes each), with a wrap counter:
- The flags field carries both the descriptor flags and the avail/used bits (single-producer/single-consumer synchronization).
- A wrap_counter, which each side tracks and compares against those bits, tells the consumer which generation of descriptors it's reading. Replaces the avail->idx / used->idx split.
- Fewer cache misses than the split layout, because the descriptor, the avail marker, and the used marker all live next to each other.
The same code path is taken for packed rings; the descriptor
format is different but the consumer logic is the same:
find new requests, mark them consumed, signal the guest.
See lib/vhost/rte_vhost_user.c:1041
where the enable_device_vq function reads
vsession->negotiated_features & (1ULL <<
VIRTIO_F_RING_PACKED) and stores it on the
virtqueue for the poller to consult.
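The feature check and the packed-ring availability test are both a few bit operations. The sketch below models them in Python; the bit positions (feature bit 34, flag bits 7 and 15) come from the virtio specification, while the function names are hypothetical and not taken from SPDK.

```python
VIRTIO_F_RING_PACKED = 34          # feature bit number

PACKED_DESC_F_AVAIL = 1 << 7       # descriptor flags, packed layout
PACKED_DESC_F_USED = 1 << 15


def ring_is_packed(negotiated_features: int) -> bool:
    """Mirror of the negotiated_features & (1ULL << VIRTIO_F_RING_PACKED) test."""
    return bool(negotiated_features & (1 << VIRTIO_F_RING_PACKED))


def desc_is_avail(flags: int, wrap_counter: bool) -> bool:
    """True if the driver has made this descriptor available in the current lap.

    The driver sets AVAIL to its wrap counter and USED to the inverse; the
    device sets both bits equal when it marks the descriptor used.
    """
    avail = bool(flags & PACKED_DESC_F_AVAIL)
    used = bool(flags & PACKED_DESC_F_USED)
    return avail == wrap_counter and used != wrap_counter
```

The wrap counter flips each time a side passes the end of the ring, which is why the same flag pattern means "available" on one lap and "stale" on the next.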
The control plane in detail: SET_MEM_TABLE and friends
The most important control message is
VHOST_USER_SET_MEM_TABLE. It tells the backend
the guest's physical memory layout. Without it, the
backend can't translate a guest-physical address (GPA) into
a host-virtual address (HVA) and the data plane is dead.
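The translation itself is a range lookup over the regions from the memory table. A minimal sketch, assuming a simplified region record (the MemRegion class and gpa_to_hva function are hypothetical names, and host_virt_addr stands for wherever the backend mmap'ed the region):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemRegion:
    guest_phys_addr: int  # start of the region in guest-physical space
    size: int             # length of the region in bytes
    host_virt_addr: int   # where the backend mapped it (assumed field)


def gpa_to_hva(regions: List[MemRegion], gpa: int, length: int) -> Optional[int]:
    """Translate a guest-physical buffer to a host-virtual address.

    Returns None if the buffer is not fully contained in one region, which
    is how a backend detects a descriptor pointing outside guest memory.
    """
    for region in regions:
        start = region.guest_phys_addr
        end = start + region.size
        if start <= gpa and gpa + length <= end:
            return region.host_virt_addr + (gpa - start)
    return None
```

Before SET_MEM_TABLE arrives the region list is empty, so every lookup returns None and no descriptor can be followed.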
The full control-message dispatch is in
extern_vhost_pre_msg_handler at
lib/vhost/rte_vhost_user.c:1441.
The function handles 40+ message types, but most are
one-liners that just store a value on the
spdk_vhost_session for later.
The control messages that do work (allocate,
destroy, mmap, mlock) are: SET_MEM_TABLE
(mmap guest memory regions), the
SET_VRING_* family (store ring addresses
and eventfds on the virtqueue),
SET_FEATURES (negotiate which virtio
extensions are on), and the destroy handlers
(stop_device, destroy_connection).
Everything else is bookkeeping.
The data plane: kick (guest → backend) and call (backend → guest)
The data plane is two rings, one in each direction. The guest pushes descriptors into the avail ring; the backend pulls them out, services them, and pushes the completions into the used ring; the guest pulls them out.
To avoid polling, each ring has a paired
eventfd:
- Kick (guest → backend). The guest notifies the device whenever it pushes new descriptors into the avail ring, and the notification reaches the backend as a write to the kick eventfd (an eventfd write is an 8-byte counter increment, typically generated by KVM's ioeventfd). The backend's epoll sees the eventfd as readable, the reactor calls the backend's data-plane poller, the poller reads the avail ring and dispatches the I/Os.
- Call (backend → guest). The backend writes an 8-byte counter increment to the call eventfd when it pushes completions into the used ring. The guest's virtio driver is interrupted, drains the used ring, and hands the completed I/Os to the block layer.
flowchart TB
  A["Guest user-space reads /dev/vda"] --> B["Guest virtio-blk driver / pushes descs to avail ring"]
  B --> C["Guest notify signals kickfd (8-byte eventfd write)"]
  C -- "epoll" --> D["SPDK reactor poller / vhost_vq_avail_ring_get"]
  D --> E[Service I/O via bdev]
  E --> F[Push completions to used ring]
  F --> G["SPDK signals callfd (8-byte eventfd write)"]
  G -- "eventfd" --> H[Guest virtio driver wakes]
  H --> I[Guest copies data to user-space]
  I --> A
fig. 2 One read(2) on the guest. The hot path is
the two eventfd wakes and the two ring
writes; everything in between is in-memory DMA.
In poll mode (the SPDK default), the backend doesn't
actually wait for the kick eventfd. The reactor's
main loop calls the backend's poller on every
iteration, and the poller just reads
avail->idx. If there are no new
requests, the poll is cheap. If there are, the poller
services them. The eventfd is still passed via
SET_VRING_KICK for protocol correctness
and for live migration; it's just not in the fast path.
How SPDK implements vhost-user
The vhost-user backend is split into three files:
- lib/vhost/vhost.c: global state, the device RB-tree, the spdk_vhost_lock() pthread mutex that protects it, the spdk_vhost_dev_* API. Look at lib/vhost/vhost.c:20 for g_vhost_mutex and lib/vhost/vhost.c:139 for the vhost_dev_register critical section.
- lib/vhost/rte_vhost_user.c: the protocol layer. All the vhost-user control messages, the socket setup, the per-vhost-user spdk_vhost_user_dev state (with its own user_dev->lock pthread mutex), the connection lifecycle. This is the file the QMP-quit wedge lives in.
- lib/vhost/vhost_scsi.c or lib/vhost/vhost_blk.c: the device-specific backend. Implements start_session, stop_session, alloc_vq_tasks, the bdev submission path, the poller that drains the avail ring.
How QEMU implements the front end
The QEMU side lives in
hw/virtio/vhost-user.c (QEMU's tree, not
SPDK's). It speaks the same vhost-user protocol over a
Unix socket. The relevant entry point is
vhost_user_backend_connect():
1. QEMU connects to the Unix socket that the already-running SPDK process listens on; the socket file at e.g. /var/diskengine/vhost/12345/cntrl is what they connect over.
2. The QEMU side runs SET_OWNER first (so the backend knows the FD is owned by a process and will get cleaned up when it exits).
3. The QEMU side runs GET_FEATURES, the backend replies with its feature mask, QEMU ANDs it with the guest's requirements and runs SET_FEATURES back.
4. The QEMU side runs the memory table (SET_MEM_TABLE), then the per-virtqueue configuration (SET_VRING_NUM, SET_VRING_ADDR, SET_VRING_KICK, SET_VRING_CALL).
5. The QEMU side runs SET_VRING_ENABLE on each queue. The backend's enable_device_vq function at lib/vhost/rte_vhost_user.c:1027 allocates the per-queue task pool, looks up the kickfd/callfd from the device, and the poller is now ready.
From here on, the data plane runs without the control
socket. The control socket is used for live migration
(the GET_VRING_BASE /
SET_VRING_BASE handshake) and for
configuration changes (resize, hotplug, etc.).
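The GET_FEATURES / SET_FEATURES step of the handshake reduces to a bitwise AND over 64-bit masks. A minimal sketch, with feature bit numbers taken from the virtio specification and hypothetical function names:

```python
# Feature bit numbers from the virtio specification.
VIRTIO_RING_F_INDIRECT_DESC = 28
VIRTIO_RING_F_EVENT_IDX = 29
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_RING_PACKED = 34


def mask(*bits: int) -> int:
    """Build a 64-bit feature mask from bit numbers."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


def negotiate(backend_features: int, frontend_features: int) -> int:
    """Features that survive negotiation: only those both sides offer."""
    return backend_features & frontend_features


def has_feature(features: int, bit: int) -> bool:
    return bool(features & (1 << bit))
```

If the backend offers VERSION_1, EVENT_IDX and RING_PACKED but the frontend only wants VERSION_1 and EVENT_IDX, the negotiated mask drops RING_PACKED and the queues fall back to the split layout.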
The connection lifecycle: connect, configure, start, stop, disconnect
Edge cases & what trips people up
1. The QEMU process dies unceremoniously
When QEMU is kill -9ed, the rte_vhost library
(DPDK's vhost-user implementation, which SPDK builds on)
sees the socket close, which
causes it to call destroy_device then
destroy_connection on the backend. The
backend's stop_device path runs. The
vsession is unregistered from the data-plane poller.
If there was in-flight I/O when QEMU died, the stop
poller waits for those bdev_ios to complete (the
bdev layer will time them out via the NVMe driver).
The user_dev->lock contention that happens here is
the QMP-quit-wedge candidate: the completions have
nowhere to go, because the guest that issued the
I/O is gone. The destroy_connection
function then frees the vsession.
2. A malformed vhost-user message
The control-message dispatcher,
extern_vhost_pre_msg_handler,
returns RTE_VHOST_MSG_RESULT_ERR for an
unknown message ID, which causes the rte_vhost
library to close the socket. The backend treats the
close the same way it treats a QEMU death. The
session is torn down. If the malformed message
arrived during a state change (e.g. in
the middle of a stop sequence), the stop poller
is likely to time out.
3. The socket file is reused
If a vhost controller is created with the same name as
a previous one, the new
vhost_register_unix_socket call will fail
with EADDRINUSE when the old socket file is
still on disk because the previous SPDK process was
killed without running vhost_driver_unregister.
The controller creation fails. The fix is to remove
the stale socket file from
/var/diskengine/vhost/<ctrlr>
before retrying. diskengine has a recovery pass for
this (vhost_driver_unregister during
vhost_user_dev_unregister), but it only
runs on a clean unregister, not after a crash.
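One way to make the manual cleanup safe is to probe the socket before unlinking it. This is a sketch under stated assumptions, not part of diskengine or SPDK: remove_if_stale is a hypothetical helper, and it treats a refused connection as proof that no process is listening.

```python
import os
import socket
import stat


def remove_if_stale(path: str) -> bool:
    """Unlink a Unix socket file only if nothing is listening on it.

    Returns True if a stale file was removed, False if the path is absent
    or a live listener answered.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    if not stat.S_ISSOCK(st.st_mode):
        raise RuntimeError(f"{path} exists but is not a socket")

    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # The file outlived its owner: safe to remove before re-binding.
        os.unlink(path)
        return True
    finally:
        probe.close()
    return False
```

Note the side effect: when the listener is alive, the probe opens and immediately closes a real connection, which a vhost-user backend will observe as a short-lived client. That is acceptable for a recovery pass that runs before controller creation, but not for a periodic health check.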
4. Live migration
Live migration is a controlled tear-down of the old
session and a controlled bring-up of the new one.
The source QEMU sends
VHOST_USER_GET_VRING_BASE to ask the
backend for the last-used indices on each
virtqueue. The source QEMU serializes the guest's
memory and sends it to the destination. The
destination QEMU restarts the device with a new
SET_MEM_TABLE (different guest memory
regions). The backend's
extern_vhost_pre_msg_handler handles
SET_MEM_TABLE during a live session
by tearing the session down and marking it for
restart. This is the
vsession->needs_restart = true
branch of that handler. The new
SET_VRING_ENABLE message is what
actually re-starts the poller.
5. The "innocent file" trap
/var/diskengine/vhost/<ctrlr> is
owned by the first SPDK process that created it. If
a second SPDK process tries to bind(2)
the same path, the bind fails. The fix is to either
have the second SPDK process
unlink(2) the stale socket first, or to
put the SPDK process PID in the path. diskengine does
neither: it puts the VM ID in the path, not the
SPDK process PID, so two VMs with the same ID across
a restart will collide.
6. The "I called SET_VRING_KICK without SET_MEM_TABLE" trap
The control-message ordering is not enforced by the
protocol. The backend's
extern_vhost_pre_msg_handler is
best-effort. If the front end sends
SET_VRING_KICK before
SET_MEM_TABLE, the backend stores the
eventfd, but the data-plane poller can't reach the
rings because the guest memory hasn't been mapped. The
poller will see no requests, the guest will see no
completions, and the device will appear hung. The
fix is to never start a data-plane poller until
vsession->mem != NULL. The
start_device function does this check
explicitly.
7. The QMP-quit-wedge candidate
When the QEMU process exits via QMP quit,
the rte_vhost library calls the backend's
stop_device, which calls
_stop_session. For the SCSI backend,
that's
lib/vhost/vhost_scsi.c:1542,
which unregisters the requestq poller and registers a
stop_poller that waits for in-flight
bdev_ios to drain. If task_cnt stays above
zero because a bdev_io is stuck behind a long
retry queue in the bdev layer, the stop poller will spin on
pthread_mutex_trylock(&user_dev->lock)
and task_cnt checks for
SPDK_VHOST_SESSION_STOP_RETRY_TIMEOUT_IN_SEC
seconds. After that, the session is force-closed
but the bdev_ios keep their references. This is
the structural wedge that page 7.4 will dissect.
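The shape of that stop poller can be modeled in a few lines. This is a toy model of the behaviour described above, not a transcription of the SPDK code: stop_poller_tick and its return values are hypothetical, the lock stands in for user_dev->lock, and the deadline stands in for the retry timeout.

```python
import threading


def stop_poller_tick(lock: threading.Lock, task_cnt: int,
                     now: float, deadline: float) -> str:
    """One iteration of a modeled stop poller.

    Returns "drained" when the lock was taken and no tasks remain,
    "timeout" once the deadline has passed, and "busy" otherwise.
    """
    if not lock.acquire(blocking=False):   # trylock: never block the reactor
        return "timeout" if now >= deadline else "busy"
    try:
        if task_cnt == 0:
            return "drained"
        return "timeout" if now >= deadline else "busy"
    finally:
        lock.release()
```

The model makes the wedge visible: neither "busy" nor "timeout" does anything about the outstanding tasks. If task_cnt never reaches zero, the only exit is the timeout, and the I/O that kept the counter up is still outstanding afterwards.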
Why it matters
The QMP quit wedge that costs you a service restart is a vhost-user protocol bug. The data path, the control path, and the lifecycle of the per-vhost user device are all vhost-user. The next three pages go deeper: