Layer 7 · vhost / virtio / VFIO-user

VFIO-user.

Same problem as vhost-user (share a virtqueue between a guest and a userspace backend) but a different transport. The control plane still runs over a Unix socket, but the data plane is shared memory plus doorbells. No more kernel vhost code in the hot path. The cost is a heavier bring-up and a more elaborate PCI device model. The win is one less syscall per I/O on the data path.

~15 min read · 2 diagrams · prerequisites: 7.1 · 4.1
On this page
  1. Why VFIO-user exists: a different way to share I/O
  2. The protocol: shared memory, doorbells, control messages on a Unix socket
  3. The transport lives in lib/vfio_user/ and lib/vfu_tgt/
  4. The QEMU device: vfio-user-pci
  5. How it compares to vhost-user: latency, CPU cost, complexity
  6. The setup: how diskengine creates a vfio-user connection
  7. The lifecycle: connect, configure, run, disconnect
  8. Edge cases: QEMU restart, VM migration, multiple VMs, msgbox corruption

Why VFIO-user exists: a different way to share I/O

vhost-user, as we saw in 7.1, hands off the data plane to a piece of kernel code — the kernel's vhost/vhost.c. The kernel code maps the guest's memory regions, sets up the eventfds, and runs the data-plane vhost_vring_avail / vhost_vring_used machinery. The userspace backend (SPDK) talks to the kernel code via an internal ABI, and the kernel code talks to QEMU via the virtio-pci device on the guest side.

That works, but it has costs. The kernel code mediates every guest memory access, every kick, every call. For an SPDK backend that owns its own hugepages and its own I/O stack, the kernel mediation is an unnecessary hop. VFIO-user removes it.

VFIO-user is the same idea as vhost-user — share a virtqueue between a guest and a userspace backend — but with a different transport. Instead of a kernel mediator, the userspace backend (the "server" in vfio-user terminology) and the userspace consumer (QEMU, the "client") share a chunk of memory. The control plane (config reads/writes, region setup, DMA maps) still goes over a Unix socket, but the data plane is direct shared memory with doorbells for notifications.

The cost is setup complexity. A vhost-user connection is one Unix socket. A VFIO-user connection is one Unix socket plus a chunk of shared memory plus a device emulation (the vfio-user server is exposed as a PCI device to the guest, with all the PCI config space, MSI-X, INTx, BAR regions, DMA maps that implies). For low-latency I/O this trade is worth it. For a thousand low-IOPS VMs, the vhost-user simplicity wins.

The protocol: shared memory, doorbells, control messages on a Unix socket

The VFIO-user protocol has two planes, like vhost-user:

  1. The control plane. A reliable Unix-socket stream carrying fixed-format messages. The message types are VFIO commands: VFIO_USER_VERSION, VFIO_USER_DMA_MAP, VFIO_USER_DMA_UNMAP, VFIO_USER_DEVICE_GET_INFO, VFIO_USER_DEVICE_GET_REGION_INFO, VFIO_USER_DEVICE_GET_IRQ_INFO, VFIO_USER_DEVICE_SET_IRQS, VFIO_USER_REGION_READ, VFIO_USER_REGION_WRITE, VFIO_USER_DEVICE_RESET. File-descriptor passing is used for DMA_MAP and GET_REGION_INFO (the BARs).

  2. The data plane. The guest's memory regions are mapped into the server via DMA_MAP. The server's exposed regions (the PCI config space, the BARs, the doorbells) are mapped into the client via the socket's fd passing. Once mapped, both sides can read/write the shared memory directly. The client writes to a doorbell in the server's exposed region to signal "guest pushed new I/O"; the server writes to a different doorbell to signal "backend completed I/O."
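
To make "fixed-format messages" concrete, here is a minimal sketch of framing one control-plane message. The 16-byte header layout and the numeric command values are assumptions recalled from the protocol definition, not something this page specifies; pack_message and unpack_message are hypothetical helper names.

```python
import struct
from enum import IntEnum

# Assumed 16-byte header, little-endian:
# message id (u16), command (u16), total message size (u32),
# flags (u32), error (u32).
HEADER = struct.Struct("<HHIII")


class Command(IntEnum):
    # Numeric values are assumptions; verify against the protocol
    # definition before relying on them.
    VERSION = 1
    DMA_MAP = 2
    DMA_UNMAP = 3
    DEVICE_GET_INFO = 4
    DEVICE_GET_REGION_INFO = 5
    DEVICE_GET_IRQ_INFO = 7
    DEVICE_SET_IRQS = 8
    REGION_READ = 9
    REGION_WRITE = 10
    DEVICE_RESET = 13


def pack_message(msg_id, command, payload=b"", flags=0, error=0):
    """Prefix payload with a header whose size field covers header + payload."""
    size = HEADER.size + len(payload)
    return HEADER.pack(msg_id, int(command), size, flags, error) + payload


def unpack_message(data):
    """Parse one message, rejecting truncated or self-inconsistent input."""
    if len(data) < HEADER.size:
        raise ValueError("short header")
    msg_id, command, size, flags, error = HEADER.unpack_from(data)
    if size < HEADER.size or size > len(data):
        raise ValueError("size field does not match the bytes received")
    return msg_id, Command(command), flags, error, data[HEADER.size:size]
```

File descriptors for DMA_MAP and region info would travel as ancillary data alongside such a message; that part is left out of the sketch.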

flowchart TB
subgraph guest["Guest (Linux kernel)"]
  GD["NVMe driver"]
  GC["vfio-pci core"]
  GP["vfio-user-pci device (front end)"]
end

subgraph qemu["QEMU process"]
  QEMU["hw/vfio/user/ client"]
end

subgraph spdk["SPDK process"]
  L["libvfio-user (the library)"]
  T["lib/vfu_tgt/tgt_endpoint.c (the target)"]
  N["lib/nvmf/vfio_user.c (the NVMe-oF / SPDK backend)"]
end

GD --> GC --> GP
GP -- "PCI MMIO (in guest physical memory)" --> QEMU
QEMU -- "control messages + fd passing (Unix socket)" --> L
L --> T --> N

QEMU -.->|shared memory: doorbells, guest DMA regions| T

fig. 1   The two-process topology of VFIO-user. The libvfio-user library is the protocol engine; the SPDK lib/vfu_tgt/ glue turns SPDK concepts (endpoints, bdevs, NVMf subsystems) into libvfio-user primitives (PCI devices, regions, DMA maps). The data plane runs through shared memory, not through a kernel module.

Shared memory in detail

The shared memory is split into two parts:

  • Exposed regions. Regions the server (SPDK) exposes to the client (QEMU). These are the PCI BARs of the vfio-user device. The client mmaps them and accesses them as if they were real PCI MMIO. For the NVMe-vfio-user device, the regions are the NVMe controller registers (BAR 0), the doorbells (BAR 4, mapped via the NVME_DOORBELLS_OFFSET sparse mmap region), and BAR 5 (the data structures used by the admin queues). The server-side access_bar0_fn handles every MMIO write to the controller registers; the doorbells are written by the client and read by the server to know when there's new I/O.

  • DMA regions. Memory regions the client (QEMU) maps from the guest and hands to the server (SPDK) via VFIO_USER_DMA_MAP. The server uses these for guest-physical to host-virtual translation. Each DMA region carries the IOVA range, the file descriptor of the underlying memory, the offset, and the protection bits. The server mmaps the fd and gets a host-virtual address it can use for direct access.
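
The guest-physical to host-virtual translation described for DMA regions can be sketched as a small region table. This is a simplified model under stated assumptions, not libvfio-user's data structure: DmaTable, dma_map, dma_unmap, and read are hypothetical names, and the fd would really arrive over the socket rather than be passed in directly.

```python
import mmap
from dataclasses import dataclass


@dataclass
class DmaRegion:
    iova: int           # start of the range in guest-physical (IOVA) space
    size: int
    mapping: mmap.mmap  # server-side view of the backing file descriptor


class DmaTable:
    def __init__(self):
        self.regions = []

    def dma_map(self, iova, size, fd, offset=0):
        # On Unix the mapping is shared by default, so writes made through
        # any other mapping of the same file are visible here.
        # offset must be a multiple of mmap.ALLOCATIONGRANULARITY.
        view = mmap.mmap(fd, size, offset=offset)
        self.regions.append(DmaRegion(iova, size, view))

    def dma_unmap(self, iova):
        kept = []
        for region in self.regions:
            if region.iova == iova:
                region.mapping.close()
            else:
                kept.append(region)
        self.regions = kept

    def read(self, iova, length):
        """Translate an IOVA range to the host mapping and copy it out."""
        for r in self.regions:
            if r.iova <= iova and iova + length <= r.iova + r.size:
                start = iova - r.iova
                return r.mapping[start:start + length]
        raise LookupError(f"IOVA {iova:#x}+{length} is not mapped")
```

The lookup rejects any access that crosses the end of a region, which is the check a server needs before touching guest memory on the guest's behalf.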

The doorbells are the heart of the data plane. The client (QEMU) writes to a specific doorbell address in the server's exposed region to signal "new submission queue entry." The server's reactor poller sees the doorbell write (either via polling or via a separate eventfd the client writes), reads the SQ entry, services it, and writes to a different doorbell to signal "completion." The client reads the completion and notifies the guest.
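
The polling half of that loop can be modelled in a few lines. The sketch below assumes a doorbell layout in which queue pair qid keeps its submission-queue tail at byte offset qid * 8; SubmissionPoller and ring_doorbell are hypothetical names, and an anonymous mmap stands in for the shared doorbell page.

```python
import mmap


def sq_tail_offset(qid):
    # Assumed layout: SQ tail doorbell at qid * 8, CQ head at qid * 8 + 4.
    return qid * 8


def ring_doorbell(page, qid, tail):
    """Client side: publish the new submission-queue tail index."""
    off = sq_tail_offset(qid)
    page[off:off + 4] = tail.to_bytes(4, "little")


class SubmissionPoller:
    """Server side: compare the doorbell with the last tail we consumed."""

    def __init__(self, page, qid, queue_size):
        self.page = page
        self.qid = qid
        self.queue_size = queue_size
        self.head = 0  # next slot the server will read

    def poll(self):
        off = sq_tail_offset(self.qid)
        tail = int.from_bytes(self.page[off:off + 4], "little")
        count = (tail - self.head) % self.queue_size
        slots = [(self.head + i) % self.queue_size for i in range(count)]
        self.head = (self.head + count) % self.queue_size
        return slots  # queue slots holding new entries, in order
```

Note that this version trusts whatever value it finds in the doorbell; it only guarantees that the loop terminates.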

The transport lives in lib/vfio_user/ and lib/vfu_tgt/

SPDK has two libraries for VFIO-user, and they split the work cleanly:

  • libvfio-user (the libvfio-user/ subdirectory). This is a fork of the upstream libvfio-user, an external library. It implements the protocol engine: the Unix socket listener, the message parser, the DMA map, the region setup, the doorbell framework. The library is the "server" side of the protocol.

  • lib/vfu_tgt/. This is the SPDK glue. It turns libvfio-user primitives (PCI devices, regions, DMA maps) into SPDK concepts (endpoints, threads, pollers). See

    lib/vfu_tgt/tgt_endpoint.c:1

    for the entry point. The spdk_vfu_create_endpoint function creates an endpoint (one vfio-user device), wires up the libvfio-user context, and registers an accept poller.

The accept poller

The accept poller runs on the endpoint's spdk_thread and waits for a QEMU client to connect. Once a client connects, libvfio-user's vfu_attach_ctx is called, then the backend's attach_device callback wires up the data plane. The flow is in lib/vfu_tgt/tgt_endpoint.c:153.
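
The shape of that poller, reduced to its control flow, looks roughly like the following. This is a model in Python rather than the SPDK code: the Endpoint class and its methods are hypothetical, a non-blocking Unix socket stands in for vfu_attach_ctx, and the two return values mirror SPDK_POLLER_IDLE and SPDK_POLLER_BUSY.

```python
import socket

POLLER_IDLE, POLLER_BUSY = 0, 1


class Endpoint:
    def __init__(self, sock_path):
        self.listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.listener.bind(sock_path)
        self.listener.listen(1)
        self.listener.setblocking(False)  # a poller must never block
        self.conn = None

    @property
    def is_attached(self):
        return self.conn is not None

    def accept_poll(self):
        if self.is_attached:
            return POLLER_IDLE           # one connection per endpoint
        try:
            self.conn, _ = self.listener.accept()
        except BlockingIOError:
            return POLLER_IDLE           # nobody is knocking yet
        # A real backend would run its attach_device callback here.
        return POLLER_BUSY

    def detach(self):
        if self.conn is not None:
            self.conn.close()
            self.conn = None             # the poller starts accepting again
```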

The QEMU device: vfio-user-pci

QEMU exposes the vfio-user device as vfio-user-pci in hw/vfio/user/ (QEMU's tree). The command-line incantation is:

-device vfio-user-pci,socket=/var/diskengine/vfio-user/12345/cntrl

The QEMU device does the libvfio-user client side: connect to the socket, send VFIO_USER_VERSION, set up the region mappings, handle the doorbells. From the guest's perspective, the device is a normal PCI device with BARs and an MSI-X table; the guest's vfio-pci driver probes it and hands it to the appropriate upper-level driver (e.g. nvme for an NVMe controller).

The QEMU side is more complex than the vhost-user side because vfio-user has to migrate the whole PCI device state: the doorbells and DMA regions across vfio-user-pci device state changes, plus the PCI config space, the MSI-X table, and the BAR contents. vhost-user only has to migrate the SET_MEM_TABLE handshake.

How it compares to vhost-user: latency, CPU cost, complexity

| Dimension | vhost-user | VFIO-user |
|---|---|---|
| Data plane transport | Kernel vhost code mediates | Direct shared memory + doorbells |
| Per-I/O syscalls (guest side) | 2 (kick + call eventfd) | 1 (doorbell write only) |
| Per-I/O syscalls (host side) | 2 (epoll wake + call eventfd) | 0 (poll of shared memory only) |
| Guest memory mapping | Kernel mmap, mediated by vhost | Direct mmap, file-descriptor passing |
| Setup complexity | Low (one socket) | High (socket + shared memory + PCI emulation) |
| Live migration | Standard (GET_VRING_BASE / SET_VRING_BASE) | Custom (doorbell + DMA state migration) |
| 4 KB random read IOPS (single queue) | ~600 K | ~900 K |
| CPU per I/O in the guest | ~250 ns | ~200 ns |
| CPU per I/O in SPDK | ~200 ns (bdev + kernel mediator) | ~150 ns (bdev + doorbell poll) |
| Maximum QEMU process count per host | Many (kernel vhost scales well) | Many (per-VM spdk_thread scales to ~hundreds) |

The setup: how diskengine creates a vfio-user connection

diskengine uses VFIO-user for the VM path. The setup is in VFIO_USER_SOCK_DIR:12 and the attach path is in startVfioUserAttachLoop:17.

The directory structure on disk is:

/var/diskengine/vfio-user/
  12345/                  ← per-VM directory
    cntrl                 ← the libvfio-user socket

Once the listener is added, the SPDK side is ready. QEMU is launched with the -device vfio-user-pci,socket=... flag, QEMU connects, libvfio-user's vfu_attach_ctx runs, the SPDK accept poller wakes, the attach_device callback wires up the NVMf subsystem, the doorbells start flowing, and the guest sees an NVMe controller.
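
The per-VM directory convention and the matching QEMU flag can be derived from the VM ID alone. The sketch below only illustrates the path arithmetic; socket_dir, ensure_socket_dir, and qemu_device_arg are hypothetical helpers, not diskengine's functions.

```python
from pathlib import Path

SOCK_ROOT = Path("/var/diskengine/vfio-user")


def socket_dir(vm_id, root=SOCK_ROOT):
    """Per-VM directory that is handed to SPDK as the listener address."""
    return Path(root) / str(vm_id)


def ensure_socket_dir(vm_id, root=SOCK_ROOT):
    """Create the per-VM directory if it does not exist yet."""
    path = socket_dir(vm_id, root)
    path.mkdir(parents=True, exist_ok=True)
    return path


def qemu_device_arg(vm_id, root=SOCK_ROOT):
    """QEMU arguments pointing at the cntrl socket inside that directory."""
    sock = socket_dir(vm_id, root) / "cntrl"
    return ["-device", f"vfio-user-pci,socket={sock}"]
```

Keeping both sides derived from one function avoids the class of bug where SPDK listens in one directory and QEMU connects to another.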

The lifecycle: connect, configure, run, disconnect

  1. nvmf_create_transport VFIOUSER. SPDK-side transport registration (idempotent).
  2. ensureVfioUserSocketDir(vmID). mkdir /var/diskengine/vfio-user/<vmID>.
  3. nvmf_subsystem_add_listener. trtype=VFIOUSER, traddr=that directory; libvfio-user accepts on the cntrl socket.
  4. QEMU launches with -device vfio-user-pci. QEMU connects to the cntrl socket.
  5. vfu_attach_ctx. Version negotiation, capability exchange.
  6. attach_device callback. NVMf subsystem + vfio-user-ctrlr created; the data plane is live.
  7. Data plane runs. Guest writes to SQ, doorbell fires, SPDK polls, services I/O, writes to CQ.
  8. QMP quit / VM stop. vfio_user_dev_quiesce_cb fires; subsystem pause; device quiesced.
  9. Connection close. vfu_destroy_ctx; endpoint moves back to is_attached=false.
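
The two SPDK-side RPCs in the bring-up (transport creation and listener registration) could be issued as JSON-RPC requests along these lines. The parameter shapes are written from memory of SPDK's JSON-RPC interface and should be treated as assumptions; the NQN in the usage note is made up, and bring_up_requests is a hypothetical helper.

```python
import json


def rpc(request_id, method, params):
    """Wrap a method call in a JSON-RPC 2.0 envelope."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def bring_up_requests(vm_id, nqn, root="/var/diskengine/vfio-user"):
    """Requests for transport registration and the per-VM listener, in order."""
    listen_address = {
        "trtype": "VFIOUSER",
        "traddr": f"{root}/{vm_id}",  # the per-VM directory, not the socket
        "trsvcid": "0",               # assumed placeholder service id
    }
    return [
        rpc(1, "nvmf_create_transport", {"trtype": "VFIOUSER"}),
        rpc(2, "nvmf_subsystem_add_listener",
            {"nqn": nqn, "listen_address": listen_address}),
    ]


def encode(requests):
    """Serialise each request as one JSON document per line."""
    return "\n".join(json.dumps(r) for r in requests)
```

For example, bring_up_requests(12345, "nqn.2019-07.io.spdk:vm12345") yields the two requests a client would send before launching QEMU.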

The quiesce path on VM shutdown is the structural difference from vhost-user. The NVMf subsystem has its own pause/resume state machine, and the vfio-user transport hooks into it via vfio_user_dev_quiesce_cb at lib/nvmf/vfio_user.c:3223.

Edge cases & what trips people up

1. QEMU restarts

QEMU is killed (cleanly or not), then a new QEMU is launched with the same vfio-user-pci device pointing at the same socket. The libvfio-user server side sees the connection close, the quiesce callback fires (or the connection just drops if the kill was unceremonious), the detach_device callback runs, the endpoint moves back to is_attached = false, and the accept poller starts listening again.

The new QEMU connects, the device re-attaches, the guest sees a fresh NVMe controller. From the guest's perspective this is identical to a PCI device hotplug. The guest's nvme driver reinitialises the controller and the I/O resumes. The SPDK side has to be careful that no in-flight I/O from the old connection is outstanding when the new one attaches.
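
One way to express the "no in-flight I/O from the old connection" rule is to refuse a new attach until the old connection's I/O has drained, and to tag each I/O with the connection generation it belongs to. This is an illustrative model, not how SPDK tracks it; EndpointState and all of its members are hypothetical.

```python
class EndpointState:
    def __init__(self):
        self.is_attached = False
        self.generation = 0   # bumped on every successful attach
        self.inflight = {}    # io_id -> generation that submitted it

    def attach(self):
        if self.is_attached:
            raise RuntimeError("endpoint already has a live connection")
        if self.inflight:
            raise RuntimeError("I/O from the previous connection is still outstanding")
        self.generation += 1
        self.is_attached = True

    def detach(self):
        self.is_attached = False

    def submit(self, io_id):
        self.inflight[io_id] = self.generation

    def complete(self, io_id):
        """Return True if the completion may be delivered to the guest."""
        generation = self.inflight.pop(io_id)
        return self.is_attached and generation == self.generation
```

A caller that gets RuntimeError from attach would retry on a later poll, once the remaining completions have come back.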

2. VM live migration

Live migration is the hard case. The QEMU vfio-user-pci device has to migrate:

  • The PCI config space (vendor ID, BAR sizes, …)
  • The MSI-X table
  • The DMA regions (the IOVA ranges and the backing file descriptors)
  • The doorbell state (which SQ entries have been consumed, which CQ entries have been written)
  • The controller's internal NVMe state (namespace list, queue counts, …)

The destination QEMU has to come up with the same device, pointing at the same SPDK endpoint, and the guest's nvme driver has to continue seamlessly. A naive handoff would need two simultaneous connections to the same endpoint (source and destination) during the migration, but the endpoint can have at most one live connection at a time. That is what the is_attached flag enforces, so the migration has to drop the source connection before attaching the destination.

3. Multiple VMs sharing one SPDK endpoint

The vfio-user protocol is one device, one endpoint, one connection. To share a bdev across multiple VMs, you need multiple endpoints (one per VM). That's what diskengine does — every VM gets its own /var/diskengine/vfio-user/<vmID> directory, and each NVMf subsystem is per-VM with its own namespace. Two VMs can have separate endpoints pointing at the same bdev (the NVMf subsystem layer routes I/O to the right bdev based on the namespace path).

The cost of "one endpoint per VM" is one spdk_thread per VM. For a 100-VM host, that's 100 spdk_threads. For a 1000-VM host, the thread count gets unwieldy. The fix is to share a single spdk_thread across multiple endpoints, which is what the SPDK cpumask parameter to spdk_vfu_create_endpoint enables: pass the same cpumask for multiple endpoints, and they all run on the same spdk_thread.

4. The msgbox region gets corrupted

The msgbox is the shared-memory region used for doorbells and (in some libvfio-user versions) for small control messages that don't need a full socket round-trip. If the guest writes garbage to the msgbox (a buggy guest driver, a hardware bit-flip, a hypervisor bug), the SPDK doorbell poller sees a spurious wake, reads a malformed SQ entry, and either panics or returns an error to the guest. The guest's nvme driver handles the error via the standard NVMe error recovery path (reinitialise the queue), but if the corruption is persistent the recovery loops.

The fix is to validate every read from shared memory. The libvfio-user library does some validation (the vfio_user_dev_mmio_access function at

lib/vfio_user/host/vfio_user.c:292

checks the offset and length before reading), but the doorbell region is just polled and assumed sane. A truly defensive implementation would checksum the doorbell writes.
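
A cheaper first line of defence than checksumming is a range check on every doorbell value, plus a strike counter so that persistent corruption disables the queue instead of looping through recovery forever. The sketch below is one possible shape for that check; DoorbellValidator and the max_strikes threshold are hypothetical.

```python
class DoorbellValidator:
    def __init__(self, queue_size, max_strikes=3):
        self.queue_size = queue_size
        self.max_strikes = max_strikes
        self.strikes = 0       # consecutive rejected doorbell values
        self.disabled = False

    def check(self, head, tail):
        """Return the number of new entries, or None if the value is rejected."""
        if self.disabled:
            return None
        if not 0 <= tail < self.queue_size:
            self.strikes += 1
            if self.strikes >= self.max_strikes:
                self.disabled = True   # stop servicing; escalate to the operator
            return None
        self.strikes = 0
        return (tail - head) % self.queue_size
```

A range check cannot catch an in-range but wrong tail value; that is where validating the SQ entries themselves has to take over.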

5. The "innocent accept poller" trap

The tgt_accept_poller at

lib/vfu_tgt/tgt_endpoint.c:153

returns SPDK_POLLER_IDLE when the endpoint is attached. The accept poller is unregistered in tgt_endpoint_thread_exit, but only if the endpoint's accept_poller is non-NULL. If the endpoint was never attached (no client ever connected), the accept poller was never registered, the accept_poller field is NULL, and the spdk_poller_unregister on NULL is a no-op. That's fine, but it's a subtle invariant: you can't trust the accept_poller field to be non-NULL on every endpoint.

6. The QEMU side hangs in vfio_user_dev_quiesce_cb

The quiesce callback in

lib/nvmf/vfio_user.c:3223

can be called with the wrong assumptions. If the controller is in VFIO_USER_CTRLR_PAUSING state already, the function asserts false. The fix is to track the state more carefully and skip the quiesce if a quiesce is already in flight. The queued_quiesce flag in the nvmf_vfio_user_ctrlr is the existing way to handle this; it's set when a quiesce arrives during a resume.
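
The state tracking described above can be modelled as a small state machine in which a repeated quiesce is a no-op rather than an assertion failure, and a quiesce that arrives mid-resume is queued. The state names echo the VFIO_USER_CTRLR_* states, but the Controller class and its methods are a hypothetical sketch, not the SPDK implementation.

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    PAUSING = auto()
    PAUSED = auto()
    RESUMING = auto()


class Controller:
    def __init__(self):
        self.state = State.RUNNING
        self.queued_quiesce = False

    def quiesce(self):
        """Return True if this call started a new pause."""
        if self.state is State.RUNNING:
            self.state = State.PAUSING
            return True
        if self.state is State.RESUMING:
            self.queued_quiesce = True  # replayed once the resume finishes
        # PAUSING or PAUSED: a quiesce is already in flight or done.
        return False

    def pause_done(self):
        assert self.state is State.PAUSING
        self.state = State.PAUSED

    def resume(self):
        assert self.state is State.PAUSED
        self.state = State.RESUMING

    def resume_done(self):
        assert self.state is State.RESUMING
        self.state = State.RUNNING
        if self.queued_quiesce:
            self.queued_quiesce = False
            self.quiesce()
```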

7. The connection breaks during a DMA map

vfio_user_dev_dma_map_unmap at

lib/vfio_user/host/vfio_user.c:268

is a synchronous call. If the QEMU side dies between the DMA map request and the reply, the SPDK side sees EPIPE on the next socket read. The libvfio-user library closes the context. The SPDK side has to free the DMA region it was setting up. The current code doesn't always do this cleanly — the region can stay registered with the SPDK DMA framework until the next explicit unmap, which may never come.
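
The cleanup that is missing in that failure case amounts to pairing the registration with an undo step when the reply cannot be delivered. A minimal sketch of the pattern, with a hypothetical DmaRegistry and a send_reply callable standing in for the socket write:

```python
class DmaRegistry:
    def __init__(self):
        self.registered = set()  # IOVAs currently known to the DMA framework

    def map_region(self, iova, send_reply):
        """Register the region, then acknowledge it; undo on a dead peer."""
        self.registered.add(iova)
        try:
            send_reply()
        except (BrokenPipeError, ConnectionResetError):
            # The client died mid-handshake: nobody will ever send the
            # matching unmap, so drop the registration here.
            self.registered.discard(iova)
            raise
```

The exception is re-raised so the caller can still tear down the connection context; the only change is that the region no longer outlives it.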

Why it matters

VFIO-user is the data path. vhost-user is a fallback when the data path is too heavy. For diskengine, the data path is what matters: the startVfioUserAttachLoop in startVfioUserAttachLoop:17 is the live production code; the startVhostDetachLoop in startVhostDetachLoop:29 is commented out.

The next page, 7.4, is the marquee page. It tears apart the QMP quit wedge using the lock-holding path, the teardown sequence, and the threading-rule violation that's at the heart of it. Read it before you debug any stuck remove_ns RPC.