VFIO-user.
Same problem as vhost-user — share a virtqueue between a guest and a userspace backend — but a different transport. The control plane runs on the same Unix socket, but the data plane is shared memory plus doorbells. No more kernel vhost code in the hot path. The cost is a heavier bring-up and a more elaborate PCI device model. The win is one less syscall per I/O on the data path.
- Why VFIO-user exists: a different way to share I/O
- The protocol: shared memory, doorbells, control messages on a Unix socket
- The transport lives in lib/vfio_user/ and lib/vfu_tgt/
- The QEMU device: vfio-user-pci
- How it compares to vhost-user: latency, CPU cost, complexity
- The setup: how diskengine creates a vfio-user connection
- The lifecycle: connect, configure, run, disconnect
- Edge cases: QEMU restart, VM migration, multiple VMs, msgbox corruption
Why VFIO-user exists: a different way to share I/O
vhost-user, as we saw in
7.1, hands off the
data plane to a piece of kernel code — the kernel's
vhost/vhost.c. The kernel code maps the
guest's memory regions, sets up the eventfds, and
runs the data-plane vhost_vring_avail /
vhost_vring_used machinery. The userspace
backend (SPDK) talks to the kernel code via an
internal ABI, and the kernel code talks to QEMU via
the virtio-pci device on the guest side.
That works, but it has costs. The kernel code mediates every guest memory access, every kick, every call. For an SPDK backend that owns its own hugepages and its own I/O stack, the kernel mediation is an unnecessary hop. VFIO-user removes it.
VFIO-user is the same idea as vhost-user — share a virtqueue between a guest and a userspace backend — but with a different transport. Instead of a kernel mediator, the userspace backend (the "server" in vfio-user terminology) and the userspace consumer (QEMU, the "client") share a chunk of memory. The control plane (config reads/writes, region setup, DMA maps) still goes over a Unix socket, but the data plane is direct shared memory with doorbells for notifications.
The cost is setup complexity. A vhost-user connection is one Unix socket. A VFIO-user connection is one Unix socket plus a chunk of shared memory plus a device emulation (the vfio-user server is exposed as a PCI device to the guest, with all the PCI config space, MSI-X, INTx, BAR regions, DMA maps that implies). For low-latency I/O this trade is worth it. For a thousand low-IOPS VMs, the vhost-user simplicity wins.
The protocol: shared memory, doorbells, control messages on a Unix socket
The VFIO-user protocol has two planes, like vhost-user:
- The control plane. A reliable Unix-socket stream carrying fixed-format messages. The message types are VFIO commands: VFIO_USER_VERSION, VFIO_USER_DMA_MAP, VFIO_USER_DMA_UNMAP, VFIO_USER_DEVICE_GET_INFO, VFIO_USER_DEVICE_GET_REGION_INFO, VFIO_USER_DEVICE_GET_IRQ_INFO, VFIO_USER_DEVICE_SET_IRQS, VFIO_USER_REGION_READ, VFIO_USER_REGION_WRITE, VFIO_USER_DEVICE_RESET. File-descriptor passing is used for DMA_MAP and GET_REGION_INFO (the BARs).
- The data plane. The guest's memory regions are mapped into the server via DMA_MAP. The server's exposed regions (the PCI config space, the BARs, the doorbells) are mapped into the client via the socket's fd passing. Once mapped, both sides can read/write the shared memory directly. The client writes to a doorbell in the server's exposed region to signal "guest pushed new I/O"; the server writes to a different doorbell to signal "backend completed I/O."
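To make the two planes concrete, here is a minimal Python sketch of one DMA_MAP-style exchange over a Unix socket pair. The 16-byte header layout and the command number reflect my reading of the vfio-user specification and should be verified against the libvfio-user headers in use. The three-field payload and the names `send_dma_map`, `recv_dma_map`, and `demo` are simplified assumptions for illustration, not the library's API.

```python
import mmap
import os
import socket
import struct
import tempfile

# Assumed header layout: msg_id (u16), command (u16), msg_size (u32),
# flags (u32), error (u32). Verify against your libvfio-user headers.
HEADER = struct.Struct("<HHIII")
# Simplified, hypothetical payload: IOVA, length, offset into the fd.
DMA_MAP_PAYLOAD = struct.Struct("<QQQ")
CMD_DMA_MAP = 2  # assumed command number


def send_dma_map(sock, msg_id, iova, length, fd):
    """Client side: send a header, a payload, and the backing fd."""
    payload = DMA_MAP_PAYLOAD.pack(iova, length, 0)
    header = HEADER.pack(msg_id, CMD_DMA_MAP, HEADER.size + len(payload), 0, 0)
    socket.send_fds(sock, [header + payload], [fd])


def recv_dma_map(sock):
    """Server side: parse the message and mmap the passed fd."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 4096, 1)
    msg_id, cmd, size, _f, _err = HEADER.unpack_from(data)
    if cmd != CMD_DMA_MAP or size != len(data) or len(fds) != 1:
        raise ValueError("malformed DMA_MAP message")
    iova, length, offset = DMA_MAP_PAYLOAD.unpack_from(data, HEADER.size)
    view = mmap.mmap(fds[0], length, offset=offset)
    os.close(fds[0])  # the mapping stays valid after the fd is closed
    return msg_id, iova, view


def demo():
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(mmap.PAGESIZE)
        guest = mmap.mmap(backing.fileno(), mmap.PAGESIZE)
        send_dma_map(client, 7, 0x1000_0000, mmap.PAGESIZE, backing.fileno())
        msg_id, iova, view = recv_dma_map(server)
        guest[0:5] = b"hello"      # the "guest" writes to its memory...
        seen = bytes(view[0:5])    # ...and the server reads it with no socket I/O
        view.close()
        guest.close()
    client.close()
    server.close()
    return msg_id, iova, seen
```

The socket carries only the description of the region and the file descriptor. After the `mmap`, reads and writes go straight through the shared pages, which is the split between control plane and data plane described above.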
flowchart TB
  subgraph guest["Guest (Linux kernel)"]
    GD["NVMe driver"]
    GC["vfio-pci core"]
    GP["vfio-user-pci device (front end)"]
  end
  subgraph qemu["QEMU process"]
    QEMU["hw/vfio/user/ client"]
  end
  subgraph spdk["SPDK process"]
    L["libvfio-user (the library)"]
    T["lib/vfu_tgt/tgt_endpoint.c (the target)"]
    N["lib/nvmf/vfio_user.c (the NVMe-oF / SPDK backend)"]
  end
  GD --> GC --> GP
  GP -- "PCI MMIO (in guest physical memory)" --> QEMU
  QEMU -- "control messages + fd passing (Unix socket)" --> L
  L --> T --> N
  QEMU -.->|shared memory: doorbells, guest DMA regions| T
fig. 1 The two-process topology of VFIO-user.
The libvfio-user library is the protocol engine; the
SPDK lib/vfu_tgt/ glue turns SPDK
concepts (endpoints, bdevs, NVMf subsystems) into
libvfio-user primitives (PCI devices, regions, DMA
maps). The data plane runs through shared memory,
not through a kernel module.
Shared memory in detail
The shared memory is split into two parts:
- Exposed regions. Regions the server (SPDK) exposes to the client (QEMU). These are the PCI BARs of the vfio-user device. The client mmaps them and accesses them as if they were real PCI MMIO. For the NVMe-vfio-user device, the regions are the NVMe controller registers (BAR 0), the doorbells (BAR 4, mapped via the NVME_DOORBELLS_OFFSET sparse mmap region), and BAR 5 (the data structures used by the admin queues). The server-side access_bar0_fn handles every MMIO write to the controller registers; the doorbells are written by the client and read by the server to know when there's new I/O.
- DMA regions. Memory regions the client (QEMU) maps from the guest and hands to the server (SPDK) via VFIO_USER_DMA_MAP. The server uses these for guest-physical to host-virtual translation. Each DMA region carries the IOVA range, the file descriptor of the underlying memory, the offset, and the protection bits. The server mmaps the fd and gets a host-virtual address it can use for direct access.
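The guest-physical to host-virtual translation can be sketched as a sorted table of regions with a binary-search lookup. This is an illustration of the idea only: `DmaRegion`, `DmaTable`, and the integer `host_base` are hypothetical names, and the real bookkeeping lives inside libvfio-user and SPDK.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class DmaRegion:
    iova: int        # guest-physical start address
    length: int      # size in bytes
    host_base: int   # host-virtual address returned by mmap (illustrative)
    writable: bool


class DmaTable:
    """Sorted, non-overlapping DMA regions keyed by IOVA."""

    def __init__(self):
        self._starts = []
        self._regions = []

    @staticmethod
    def _end(region):
        return region.iova + region.length

    def map(self, region):
        i = bisect.bisect_left(self._starts, region.iova)
        prev_overlaps = i > 0 and self._end(self._regions[i - 1]) > region.iova
        next_overlaps = i < len(self._starts) and self._starts[i] < self._end(region)
        if prev_overlaps or next_overlaps:
            raise ValueError("overlapping DMA region")
        self._starts.insert(i, region.iova)
        self._regions.insert(i, region)

    def unmap(self, iova):
        i = bisect.bisect_left(self._starts, iova)
        if i == len(self._starts) or self._starts[i] != iova:
            raise KeyError(hex(iova))
        del self._starts[i], self._regions[i]

    def translate(self, iova, length, write=False):
        """Return the host address for [iova, iova + length), or raise."""
        if length <= 0:
            raise ValueError("length must be positive")
        i = bisect.bisect_right(self._starts, iova) - 1
        if i < 0:
            raise LookupError("address not mapped")
        region = self._regions[i]
        if iova + length > self._end(region):
            raise LookupError("access is outside the mapped region")
        if write and not region.writable:
            raise PermissionError("region is read-only")
        return region.host_base + (iova - region.iova)
```

Every guest address in a queue entry goes through a lookup like `translate` before the server touches it, so a request that points outside the mapped ranges is rejected instead of dereferenced.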
The doorbells are the heart of the data plane. The client (QEMU) writes to a specific doorbell address in the server's exposed region to signal "new submission queue entry." The server's reactor poller sees the doorbell write (either via polling or via a separate eventfd the client writes), reads the SQ entry, services it, and writes to a different doorbell to signal "completion." The client reads the completion and notifies the guest.
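The polling half of that loop can be sketched in a few lines of Python, using an anonymous `mmap` as a stand-in for the shared doorbell page. The queue depth, the doorbell offset, and the names `DoorbellPoller` and `ring` are assumptions for illustration; a real NVMe doorbell layout depends on the controller's doorbell stride.

```python
import mmap
import struct

QUEUE_DEPTH = 8          # hypothetical submission queue size
SQ_TAIL_DOORBELL = 0     # hypothetical byte offset of the tail doorbell
U32 = struct.Struct("<I")


class DoorbellPoller:
    """Server-side view of one submission queue's tail doorbell."""

    def __init__(self, shared):
        self.shared = shared      # memory both processes have mapped
        self.head = 0             # next slot the server will consume

    def poll(self):
        """Return the queue slots submitted since the last poll."""
        (tail,) = U32.unpack_from(self.shared, SQ_TAIL_DOORBELL)
        if tail >= QUEUE_DEPTH:
            # Without this check a bad value would make the loop below spin forever.
            raise ValueError(f"doorbell value {tail} out of range")
        slots = []
        while self.head != tail:
            slots.append(self.head)
            self.head = (self.head + 1) % QUEUE_DEPTH
        return slots


def ring(shared, tail):
    """Client side: publish a new tail index with a plain memory write."""
    U32.pack_into(shared, SQ_TAIL_DOORBELL, tail)
```

A short usage example: after `ring(shared, 3)` on a fresh queue, `poll()` returns slots 0, 1, and 2, and a second `poll()` returns an empty list because nothing new was submitted.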
The transport lives in lib/vfio_user/ and lib/vfu_tgt/
SPDK has two libraries for VFIO-user, and they split the work cleanly:
- libvfio-user (the libvfio-user/ subdirectory). This is a fork of the upstream libvfio-user, an external library. It implements the protocol engine: the Unix socket listener, the message parser, the DMA map, the region setup, the doorbell framework. The library is the "server" side of the protocol.
- lib/vfu_tgt/. This is the SPDK glue. It turns libvfio-user primitives (PCI devices, regions, DMA maps) into SPDK concepts (endpoints, threads, pollers). See lib/vfu_tgt/tgt_endpoint.c:1 for the entry point. The spdk_vfu_create_endpoint function creates an endpoint (one vfio-user device), wires up the libvfio-user context, and registers an accept poller.
The accept poller
The accept poller runs on the endpoint's spdk_thread
and waits for a QEMU client to connect. Once a
client connects, libvfio-user's
vfu_attach_ctx is called, then the
backend's attach_device callback wires
up the data plane. The flow starts at
lib/vfu_tgt/tgt_endpoint.c:153.
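The shape of that poller can be sketched with a non-blocking Unix listener. This is not the SPDK code: the `Endpoint` class and its methods are hypothetical, and `POLLER_IDLE` and `POLLER_BUSY` merely stand in for SPDK's poller return values.

```python
import os
import socket
import tempfile

POLLER_IDLE, POLLER_BUSY = 0, 1   # stand-ins for SPDK_POLLER_IDLE / SPDK_POLLER_BUSY


class Endpoint:
    """Hypothetical endpoint with a non-blocking accept poller."""

    def __init__(self, path):
        self.listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.listener.bind(path)
        self.listener.listen(1)
        self.listener.setblocking(False)
        self.conn = None
        self.is_attached = False

    def accept_poller(self):
        """One poller iteration; it never blocks the thread it runs on."""
        if self.is_attached:
            return POLLER_IDLE          # already serving a client
        try:
            self.conn, _ = self.listener.accept()
        except BlockingIOError:
            return POLLER_IDLE          # nobody is connecting yet
        self.attach_device()
        return POLLER_BUSY

    def attach_device(self):
        # A real backend would set up queues and data-plane pollers here.
        self.is_attached = True

    def detach_device(self):
        self.conn.close()
        self.conn = None
        self.is_attached = False        # the accept poller resumes listening
```

The important property is that each iteration returns immediately, so the poller can share a thread with data-plane pollers without stalling them.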
The QEMU device: vfio-user-pci
QEMU exposes the vfio-user device as
vfio-user-pci in
hw/vfio/user/ (QEMU's tree). The
command-line incantation is:
-device vfio-user-pci,socket=/var/diskengine/vfio-user/12345/cntrl

The QEMU device does the libvfio-user client
side: connect to the socket, send
VFIO_USER_VERSION, set up the
region mappings, handle the doorbells. From the
guest's perspective, the device is a normal PCI
device with BARs and an MSI-X table; the guest's
vfio-pci driver probes it and hands it
to the appropriate upper-level driver (e.g.
nvme for an NVMe controller).
The QEMU side is more complex than the vhost-user
side because it has to handle the migration
of doorbells and DMA regions across
vfio-user-pci device state changes,
and the migration of the PCI config space, the
MSI-X table, and the BAR contents. vhost-user
only has to migrate the
SET_MEM_TABLE handshake. vfio-user
has to migrate the whole PCI device state.
How it compares to vhost-user: latency, CPU cost, complexity
| Dimension | vhost-user | VFIO-user |
|---|---|---|
| Data plane transport | Kernel vhost code mediates | Direct shared memory + doorbells |
| Per-I/O syscalls (guest side) | 2 (kick + call eventfd) | 1 (doorbell write only) |
| Per-I/O syscalls (host side) | 2 (epoll wake + call eventfd) | 0 (poll of shared memory only) |
| Guest memory mapping | Kernel mmap, mediated by vhost | Direct mmap, file-descriptor passing |
| Setup complexity | Low (one socket) | High (socket + shared memory + PCI emulation) |
| Live migration | Standard (GET_VRING_BASE / SET_VRING_BASE) | Custom (doorbell + DMA state migration) |
| 4 KB random read IOPS (single queue) | ~600 K | ~900 K |
| CPU per I/O in the guest | ~250 ns | ~200 ns |
| CPU per I/O in SPDK | ~200 ns (bdev + kernel mediator) | ~150 ns (bdev + doorbell poll) |
| Maximum QEMU process count per host | Many (kernel vhost scales well) | Many (per-VM spdk_thread scales to ~hundreds) |
The setup: how diskengine creates a vfio-user connection
diskengine uses VFIO-user for the VM path. The setup is in VFIO_USER_SOCK_DIR:12 and the attach path is in startVfioUserAttachLoop:17.
The directory structure on disk is:
/var/diskengine/vfio-user/
  12345/      ← per-VM directory
    cntrl     ← the libvfio-user socket

Once the listener is added, the SPDK side is ready.
QEMU is launched with the
-device vfio-user-pci,socket=... flag,
QEMU connects, libvfio-user's vfu_attach_ctx
runs, the SPDK accept poller wakes, the
attach_device callback wires up the
NVMf subsystem, the doorbells start flowing, and
the guest sees an NVMe controller.
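The path convention above is simple enough to capture in a helper. The sketch below is hypothetical and is not diskengine's code: the function names are invented, and the rule that a VM ID is alphanumeric is an assumption added to keep IDs from escaping the socket directory.

```python
from pathlib import PurePosixPath

VFIO_USER_SOCK_DIR = PurePosixPath("/var/diskengine/vfio-user")


def socket_path(vm_id):
    """Per-VM control socket: <root>/<vmID>/cntrl."""
    vm = str(vm_id)
    if not vm.isalnum():
        # Assumed rule: reject IDs containing "/" or ".." style components.
        raise ValueError(f"unsafe VM id: {vm!r}")
    return VFIO_USER_SOCK_DIR / vm / "cntrl"


def qemu_device_arg(vm_id):
    """Build the -device argument pair for the QEMU command line."""
    return ["-device", f"vfio-user-pci,socket={socket_path(vm_id)}"]
```

Deriving both the SPDK listener path and the QEMU argument from one function keeps the two sides from drifting apart.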
The lifecycle: connect, configure, run, disconnect
The quiesce path on VM shutdown is the structural
difference from vhost-user. The NVMf subsystem has
its own pause/resume state machine, and the
vfio-user transport hooks into it via
vfio_user_dev_quiesce_cb at
lib/nvmf/vfio_user.c:3223.
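One way to reason about the lifecycle is as a small state machine in which a clean shutdown passes through a quiesce step while an abrupt client exit does not. The state and event names below are illustrative assumptions, not identifiers from SPDK or libvfio-user.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    CONFIGURING = auto()
    RUNNING = auto()
    QUIESCING = auto()


# Allowed transitions; anything else is a bug in the caller.
TRANSITIONS = {
    (State.LISTENING, "connect"): State.CONFIGURING,
    (State.CONFIGURING, "enable"): State.RUNNING,
    (State.RUNNING, "quiesce"): State.QUIESCING,
    (State.QUIESCING, "quiesce_done"): State.LISTENING,
    # An abrupt client exit skips the quiesce handshake.
    (State.CONFIGURING, "drop"): State.LISTENING,
    (State.RUNNING, "drop"): State.LISTENING,
    (State.QUIESCING, "drop"): State.LISTENING,
}


def step(state, event):
    """Apply one event, rejecting transitions the table does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in {state.name}") from None


def run(events, state=State.LISTENING):
    """Replay a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Writing the transitions as a table makes it easy to see that every path ends back in the listening state, which is what lets the same endpoint serve the next client.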
Edge cases & what trips people up
1. QEMU restarts
QEMU is killed (cleanly or not), then a new QEMU
is launched with the same
vfio-user-pci device pointing at the
same socket. The libvfio-user server side sees the
connection close, the quiesce callback fires (or
the connection just drops if the kill was
unceremonious), the detach_device
callback runs, the endpoint moves back to
is_attached = false, and the accept
poller starts listening again.
The new QEMU connects, the device re-attaches, the
guest sees a fresh NVMe controller. From the
guest's perspective this is identical to a PCI
device hotplug. The guest's nvme
driver reinitialises the controller and the I/O
resumes. The SPDK side has to be careful that no
in-flight I/O from the old connection is
outstanding when the new one attaches.
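A minimal sketch of that bookkeeping, assuming a per-endpoint generation counter and an in-flight table. Both are hypothetical and are not taken from the SPDK source.

```python
class EndpointState:
    """Hypothetical bookkeeping for one endpoint across reconnects."""

    def __init__(self):
        self.is_attached = False
        self.generation = 0       # bumped on every attach
        self.inflight = {}        # io_id -> generation that issued it

    def attach(self):
        if self.is_attached:
            raise RuntimeError("endpoint already has a client")
        if self.inflight:
            raise RuntimeError("old connection still has I/O in flight")
        self.generation += 1
        self.is_attached = True

    def submit(self, io_id):
        self.inflight[io_id] = self.generation

    def complete(self, io_id):
        """Return True if the completion should be posted to the guest."""
        issued_by = self.inflight.pop(io_id)
        return self.is_attached and issued_by == self.generation

    def detach(self):
        self.is_attached = False

    def can_reattach(self):
        return not self.is_attached and not self.inflight
```

Completions that arrive after a detach are dropped instead of being written into queues the old guest no longer owns, and the next attach waits until the table is empty.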
2. VM live migration
Live migration is the hard case. The QEMU vfio-user-pci device has to migrate:
- The PCI config space (vendor ID, BAR sizes, …)
- The MSI-X table
- The DMA regions (the IOVA ranges and the backing file descriptors)
- The doorbell state (which SQ entries have been consumed, which CQ entries have been written)
- The controller's internal NVMe state (namespace list, queue counts, …)
The destination QEMU has to come up with the same
device, pointing at the same SPDK endpoint, and
the guest's nvme driver has to
seamlessly continue. It might seem that the SPDK side
has to support two simultaneous connections to the same
endpoint (source and destination) during the migration.
It does not: the is_attached flag allows at most one
live connection per endpoint at a time, so the migration
has to drop the source connection before attaching
the destination.
3. Multiple VMs sharing one SPDK endpoint
The vfio-user protocol is one device, one endpoint,
one connection. To share a bdev across multiple
VMs, you need multiple endpoints (one per VM).
That's what diskengine does — every VM gets its
own /var/diskengine/vfio-user/<vmID>
directory, and each NVMf subsystem is per-VM with
its own namespace. Two VMs can have separate
endpoints pointing at the same bdev (the NVMf
subsystem layer routes I/O to the right bdev based
on the namespace path).
The cost of "one endpoint per VM" is one
spdk_thread per VM. For a 100-VM host, that's 100
spdk_threads. For a 1000-VM host, the thread count
gets unwieldy. The fix is to share a single
spdk_thread across multiple endpoints, which is
what the SPDK cpumask parameter to
spdk_vfu_create_endpoint enables:
pass the same cpumask for multiple endpoints, and
they all run on the same spdk_thread.
4. The msgbox region gets corrupted
The msgbox is the shared-memory region used for
doorbells and (in some libvfio-user versions) for
small control messages that don't need a full
socket round-trip. If the guest writes garbage to
the msgbox (a buggy guest driver, a hardware
bit-flip, a hypervisor bug), the SPDK doorbell
poller sees a spurious wake, reads a malformed SQ
entry, and either panics or returns an error to
the guest. The guest's nvme driver
handles the error via the standard NVMe error
recovery path (reinitialise the queue), but if the
corruption is persistent the recovery loops.
The fix is to validate every read from shared
memory. The libvfio-user library does some
validation (the
vfio_user_dev_mmio_access function
checks the offset and length before reading), but the doorbell region is just polled and assumed sane. A truly defensive implementation would checksum the doorbell writes.
5. The "innocent accept poller" trap
The tgt_accept_poller
returns SPDK_POLLER_IDLE when the
endpoint is attached. The accept poller is
unregistered in tgt_endpoint_thread_exit,
but only if the endpoint's accept_poller
is non-NULL. If the endpoint was never attached
(no client ever connected), the accept poller was
never registered, the
accept_poller field is NULL, and the
spdk_poller_unregister on NULL is a
no-op. That's fine, but it's a subtle invariant —
you can't trust the
accept_poller field to be non-NULL on
every endpoint.
6. The QEMU side hangs in vfio_user_dev_quiesce_cb
The quiesce callback in
lib/nvmf/vfio_user.c:3223 can be called with the wrong assumptions. If the
controller is in VFIO_USER_CTRLR_PAUSING
state already, the function asserts false. The fix
is to track the state more carefully and skip the
quiesce if a quiesce is already in flight. The
queued_quiesce flag in the
nvmf_vfio_user_ctrlr is the existing
way to handle this; it's set when a quiesce
arrives during a resume.
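The re-entrancy rule can be sketched as follows. The state names and methods are illustrative assumptions modelled on the description above, not the actual `nvmf_vfio_user_ctrlr` implementation.

```python
from enum import Enum, auto


class CtrlrState(Enum):
    RUNNING = auto()
    PAUSING = auto()
    PAUSED = auto()
    RESUMING = auto()


class Ctrlr:
    def __init__(self):
        self.state = CtrlrState.RUNNING
        self.queued_quiesce = False

    def quiesce(self):
        """Return True if this call started a pause."""
        if self.state in (CtrlrState.PAUSING, CtrlrState.PAUSED):
            return False                      # already in flight or done
        if self.state is CtrlrState.RESUMING:
            self.queued_quiesce = True        # replay once the resume finishes
            return False
        self.state = CtrlrState.PAUSING
        return True

    def pause_done(self):
        assert self.state is CtrlrState.PAUSING
        self.state = CtrlrState.PAUSED

    def resume(self):
        assert self.state is CtrlrState.PAUSED
        self.state = CtrlrState.RESUMING

    def resume_done(self):
        assert self.state is CtrlrState.RESUMING
        self.state = CtrlrState.RUNNING
        if self.queued_quiesce:
            self.queued_quiesce = False
            self.quiesce()
```

A second quiesce during a pause becomes a no-op instead of tripping an assertion, and one that arrives mid-resume is remembered and replayed as soon as the controller is running again.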
7. The connection breaks during a DMA map
vfio_user_dev_dma_map_unmap
is a synchronous call. If the QEMU side dies
between the DMA map request and the reply, the
SPDK side sees EPIPE on the next
socket read. The libvfio-user library closes the
context. The SPDK side has to free the DMA region
it was setting up. The current code doesn't always
do this cleanly — the region can stay registered
with the SPDK DMA framework until the next
explicit unmap, which may never come.
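A cleaner handler would roll the registration back as soon as the reply fails. The sketch below shows that pattern with a stand-in registry; `DmaFramework` and `handle_dma_map` are hypothetical names, and the real SPDK registration calls are deliberately not shown.

```python
class DmaFramework:
    """Stand-in for the registry the backend registers regions with."""

    def __init__(self):
        self.registered = set()

    def register(self, iova):
        self.registered.add(iova)

    def unregister(self, iova):
        self.registered.discard(iova)


def handle_dma_map(framework, iova, send_reply):
    """Register a region and reply; roll back if the client is gone."""
    framework.register(iova)
    try:
        send_reply()
    except (BrokenPipeError, ConnectionResetError):
        # Do not wait for an unmap that will never arrive.
        framework.unregister(iova)
        raise
```

The exception is re-raised so the caller can still close the connection context; the only change is that the half-registered region does not outlive it.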
Why it matters
VFIO-user is the data path. vhost-user is a fallback
when the data path is too heavy. For diskengine, the
data path is what matters: the
startVfioUserAttachLoop in
startVfioUserAttachLoop:17 is the
live production code; the
startVhostDetachLoop in
startVhostDetachLoop:29 is
commented out.
The next page, 7.4,
is the marquee page. It tears apart the QMP quit wedge
using the lock-holding path, the teardown sequence,
and the threading-rule violation that's at the heart
of it. Read it before you debug any stuck
remove_ns RPC.