Layer 6 · nvmf

RDMA, TCP, and the strange third option.

The NVMe-oF protocol doesn't care how the bytes get from initiator to target. It defines capsules and connection setup; it leaves the framing to the transport. SPDK implements three: RDMA, TCP, and VFIO-user. Each one is a complete transport in its own right — its own spdk_nvmf_transport_ops struct, its own poll loop, its own failure modes. This page is about what's inside each of them.

~20 min read · 3 diagrams · prerequisites: 6.1 · 6.2
On this page
  1. What a transport does in SPDK
  2. RDMA: kernel bypass, verbs, and queue pairs
  3. RDMA: data-in-place vs data-in-buffer
  4. RDMA: the state machine, the SRQ, memory registration
  5. TCP: the PDU, the socket, and kTLS offload
  6. TCP: data-in-buffer, the wire format, throughput
  7. VFIO-user: the third option that is not a network at all
  8. VFIO-user vs vhost-user
  9. Comparison: latency, CPU, deployment complexity
  10. Edge cases: which transport breaks first under what load

What a transport does in SPDK

Every transport implements the same struct, spdk_nvmf_transport_ops, declared in include/spdk/nvmf_transport.h. The struct has roughly two dozen function pointers covering lifecycle (create, destroy), listening (listen, stop_listen, discover), poll group management (create, destroy, add, remove, poll), request handling (req_free, req_complete, req_get_buffers_done), and qpair operations (get_peer_trid, get_local_trid, get_listen_trid, abort_request).

The transport registry at lib/nvmf/transport.c:42 stores spdk_nvmf_transport_ops in a global TAILQ. The names are "RDMA", "TCP", and "VFIOUSER". The same struct, three completely different verbs underneath.

RDMA: kernel bypass, verbs, and queue pairs

RDMA (Remote Direct Memory Access) moves data between two machines without involving the kernel, the remote CPU, or (mostly) the cache. The HCA (Host Channel Adapter, the RDMA NIC) reads from memory on machine A and writes to memory on machine B. The verbs are RDMA_READ and RDMA_WRITE for one-sided transfers and SEND / RECV for short two-sided messages. NVMe-oF over RDMA uses all four.

The transport is implemented in lib/nvmf/rdma.c. The verbs interface is wrapped by SPDK's spdk_internal/rdma_provider.h and spdk_internal/rdma_utils.h, which lets the same code target plain libibverbs or a vendor-specific provider such as mlx5_dv.

The mapping from NVMe queues to RDMA queue pairs is direct:

| NVMe-oF | RDMA | Notes |
|---|---|---|
| Admin submission queue | RDMA Send Queue (SQ) + Receive Queue (RQ) | Command and response capsules travel as SEND/RECV. |
| Admin completion queue | RDMA RQ (for unsolicited completions) | One-shot. The initiator pre-posts receive work requests (WRs). |
| I/O submission queue (reads) | RDMA_WRITE (target writes into host memory) | For reads, the host advertises a registered buffer in the command's SGL; the target RDMA_WRITEs the data into it. |
| I/O submission queue (writes) | RDMA_READ (target reads from host memory) | For writes, the host advertises a registered buffer in the command's SGL; the target RDMA_READs the data from it. Small writes can ride inside the command capsule instead. |

So one NVMe I/O qpair is backed by one RDMA queue pair, which consists of one send queue and one receive queue, with completions reported through completion queues. The transport pre-posts a large number of receive WRs to the SRQ (Shared Receive Queue) so it always has somewhere to land an incoming capsule.

RDMA: data-in-place vs data-in-buffer

Here's the design choice that defines RDMA performance. The NVMe-oF spec has two ways to do an I/O:

  • Data-in-buffer. The data travels through a buffer that the target owns. The host puts the data in the buffer (via SEND or WRITE), the target processes the I/O against the buffer, and the target sends the data back (via SEND or READ).

  • Data-in-place. The data goes directly between the host's user buffer and the target's backing bdev. For reads: the target bdev DMAs the data into a pinned host buffer. For writes: the host's pinned buffer is read by the bdev.

In SPDK, "data-in-place" is enabled when the zcopy option is set on the transport (the nvmf_get_transports RPC reports the value in effect). When it is set, and the underlying bdev supports zero-copy, the I/O path goes through the nvmf zcopy request calls (spdk_nvmf_request_zcopy_start / spdk_nvmf_request_zcopy_end) instead of staging through a buffer. The data sits in either the host's pinned memory or the bdev's DMA region for the entire I/O lifetime.

RDMA: the state machine, the SRQ, memory registration

The transport's request state machine (enum spdk_nvmf_rdma_request_state in lib/nvmf/rdma.c) has 14 states. The interesting transitions are:

  1. NEW → NEED_BUFFER. A new capsule arrived. The request needs a data buffer. The transport either has one immediately (in the iobuf pool) or has to wait.

  2. NEED_BUFFER → HAVE_BUFFER. An iobuf was acquired. The request is now ready to either RECEIVE data (for writes) or to EXECUTE the I/O (for reads).

  3. HAVE_BUFFER → TRANSFERRING_HOST_TO_CONTROLLER. A write. The target issues an RDMA_READ to pull the data from the host's registered buffer into its own buffer, then waits for the READ to complete.

  4. READY_TO_EXECUTE → EXECUTING. The bdev is doing the I/O. Completion is asynchronous; the poller eventually gets the bdev_io done callback.

  5. EXECUTED → DATA_TRANSFER_TO_HOST_PENDING. A read. The bdev has filled the buffer; now the target needs to RDMA_WRITE it into the host's pinned buffer.
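
The transitions listed above can be sketched as a small table-driven state machine. This is a simplified model, not the SPDK enum: it uses only the state names from the list, adds a hypothetical COMPLETED terminal state, and ignores the waiting and error paths that the real 14-state machine handles.

```python
from enum import Enum, auto


class ReqState(Enum):
    NEW = auto()
    NEED_BUFFER = auto()
    HAVE_BUFFER = auto()
    TRANSFERRING_HOST_TO_CONTROLLER = auto()
    READY_TO_EXECUTE = auto()
    EXECUTING = auto()
    EXECUTED = auto()
    DATA_TRANSFER_TO_HOST_PENDING = auto()
    COMPLETED = auto()  # hypothetical terminal state for this sketch


def next_state(state: ReqState, is_write: bool) -> ReqState:
    """Return the state that follows `state` for a read or a write."""
    s = ReqState
    if state is s.NEW:
        return s.NEED_BUFFER
    if state is s.NEED_BUFFER:
        return s.HAVE_BUFFER
    if state is s.HAVE_BUFFER:
        # Writes must first move data from the host; reads can execute now.
        return s.TRANSFERRING_HOST_TO_CONTROLLER if is_write else s.READY_TO_EXECUTE
    if state is s.TRANSFERRING_HOST_TO_CONTROLLER:
        return s.READY_TO_EXECUTE
    if state is s.READY_TO_EXECUTE:
        return s.EXECUTING
    if state is s.EXECUTING:
        return s.EXECUTED
    if state is s.EXECUTED:
        # Reads still have to move data to the host; writes are done.
        return s.COMPLETED if is_write else s.DATA_TRANSFER_TO_HOST_PENDING
    if state is s.DATA_TRANSFER_TO_HOST_PENDING:
        return s.COMPLETED
    raise ValueError(f"no transition out of {state.name}")


def walk(is_write: bool) -> list:
    """Return the names of every state a request passes through."""
    path = [ReqState.NEW]
    while path[-1] is not ReqState.COMPLETED:
        path.append(next_state(path[-1], is_write))
    return [st.name for st in path]
```

Calling walk(True) and walk(False) side by side shows where the two paths diverge: writes take a data-transfer step before execution, reads take one after.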

The SRQ (Shared Receive Queue) is a performance optimization. Instead of having a separate RQ per qpair, all qpairs in a poll group share one RQ. The transport pre-posts receive WRs to the SRQ; when a capsule arrives, the HCA consumes a receive WR from the shared pool. The WR ID identifies the receive buffer, the queue pair number in the work completion identifies which qpair the capsule belongs to, and the request is born.
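
A toy model of the shared pool may help. It assumes each completion carries both a buffer identifier and a queue pair number (the verbs work completion, ibv_wc, has wr_id and qp_num fields that play these roles); SharedReceiveQueue and Completion are hypothetical names, and nothing here touches real hardware.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Completion:
    wr_id: int   # which pre-posted receive buffer was consumed
    qp_num: int  # which queue pair the capsule arrived on


class SharedReceiveQueue:
    """Toy SRQ: one pool of receive buffers shared by many qpairs."""

    def __init__(self, depth: int):
        # wr_ids of the receive WRs that are currently posted.
        self.posted = deque(range(depth))

    def deliver(self, qp_num: int) -> Completion:
        """Simulate a capsule arriving on `qp_num`: consume one posted WR."""
        if not self.posted:
            raise RuntimeError("SRQ empty: no receive buffer available")
        return Completion(wr_id=self.posted.popleft(), qp_num=qp_num)

    def repost(self, wr_id: int) -> None:
        """Return a buffer to the pool once its request has completed."""
        self.posted.append(wr_id)
```

The point of the model is the failure mode: with a shared pool, one busy qpair can drain the buffers every other qpair depends on, which is why the transport posts a large number of WRs and reposts them promptly.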

Memory registration is the part that gets you. Every buffer that RDMA touches must be registered with the HCA — that means the HCA knows the physical address mapping and has pinned the page. ibv_reg_mr is expensive (microseconds to milliseconds). The RDMA transport keeps a pool of pre-registered buffers (the num_shared_buffers and buf_cache_size options) so that the common case — incoming capsules with small data — never has to register at I/O time.
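
The pre-registration idea fits in a few lines. In this sketch the register callback is a stand-in for an expensive call such as ibv_reg_mr, and RegisteredBufferPool and fake_register are hypothetical names; it only shows that the registration cost is paid once per buffer at startup, not once per I/O.

```python
class RegisteredBufferPool:
    """Toy pool: pay the registration cost once, up front."""

    def __init__(self, num_buffers: int, buf_size: int, register):
        # `register` stands in for an expensive call such as ibv_reg_mr.
        self._free = [register(bytearray(buf_size)) for _ in range(num_buffers)]

    def get(self):
        """Return a registered buffer, or None if the pool is exhausted."""
        return self._free.pop() if self._free else None

    def put(self, buf) -> None:
        """Hand a buffer back so later requests can reuse it."""
        self._free.append(buf)


registrations = []


def fake_register(buf: bytearray) -> bytearray:
    """Pretend to register `buf`, recording that a registration happened."""
    registrations.append(len(buf))
    return buf
```

After building a pool of 4 buffers and cycling through get/put a thousand times, registrations still has exactly 4 entries. When get returns None, the request has to wait for a buffer, which is the NEED_BUFFER case described earlier.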

TCP: the PDU, the socket, and kTLS offload

RDMA has a hard requirement: an RDMA-capable HCA and a network that can carry RDMA traffic (RoCE v2 with PFC, InfiniBand, or iWARP). When you don't have that — most data centers, all public clouds — you fall back to TCP.

TCP is the universal transport. Any Ethernet NIC works. Any network works. The cost is CPU: every byte goes through the kernel TCP stack (unless you use sendfile or similar offload, which TCP doesn't really give you for arbitrary user buffers), and the kernel TCP stack is not cheap.

SPDK's TCP transport implementation is in lib/nvmf/tcp.c. The connection looks like a regular socket pair: the target calls accept(), the initiator calls connect(), and then the two sides exchange NVMe-oF PDUs (Protocol Data Units) over the socket.

The wire format: PDUs over TCP

A PDU starts with a fixed 8-byte common header, followed by a type-specific header and an optional variable-length data payload. The common header contains the PDU type (ICReq, ICResp, CapsuleCmd, CapsuleResp, H2CData, C2HData, R2T, and so on), flags, the header length, the data offset, and the total PDU length. The transport's poll_group_poll reads from the socket, demultiplexes by PDU type, and hands the body to the nvmf library.

The extra states relative to RDMA are around R2T (Ready to Transfer). For large writes, the host sends a write command, the target replies with an R2T saying "I need bytes X through Y now," the host sends them. R2T is a flow-control mechanism; without it, the host would either over-commit memory (sending the full write at once) or under-commit (sending in tiny chunks). The R2T protocol lets the target pace the host.
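
The pacing can be sketched as a generator that yields the byte ranges the target would ask for, one R2T at a time. max_r2t_len is a hypothetical parameter standing in for whatever per-transfer limit the two sides have negotiated, and the sketch assumes one outstanding R2T at a time.

```python
def r2t_schedule(write_len: int, max_r2t_len: int):
    """Yield (offset, length) windows the target would request in turn."""
    if write_len < 0 or max_r2t_len <= 0:
        raise ValueError("write_len must be >= 0 and max_r2t_len > 0")
    offset = 0
    while offset < write_len:
        # Ask for at most max_r2t_len bytes, or whatever is left.
        length = min(max_r2t_len, write_len - offset)
        yield offset, length
        offset += length
```

For a 300 KiB write with a 128 KiB limit this yields three windows: two full ones and a 44 KiB tail. The target only needs buffer space for one window at a time, which is the memory-commitment benefit described above.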

TCP: data-in-buffer, throughput

The TCP transport is always data-in-buffer. There's no RDMA-style direct path because there's no RDMA. The data always passes through a target-owned buffer. This means:

  • Every I/O involves at least one copy. For reads, the bdev fills the buffer, then send() copies the buffer to the socket. For writes, recv() fills the buffer, then the bdev reads from it.

  • The CPU cost of TCP is dominated by the socket I/O. The transport uses SPDK's spdk_sock abstraction (POSIX sockets, or io_uring if the user compiled with it) underneath. The number of context switches is non-zero, but most of the work is in memcpy and the kernel's TCP send/receive path.

  • Throughput scales with core count: more poll groups, more TCP connections, more parallel memcpy. A well-tuned TCP target with 16 cores can push 1-2 million IOPS for 4KB random reads.
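
To put the 16-core figure in perspective, a quick back-of-the-envelope conversion helps. The helpers below are plain arithmetic using the IOPS range quoted above and decimal gigabytes; they say nothing about what any particular system will achieve.

```python
def bandwidth_gb_per_s(iops: float, io_size_bytes: int) -> float:
    """Convert an IOPS figure into GB/s (decimal gigabytes)."""
    return iops * io_size_bytes / 1e9


def iops_per_core(iops: float, cores: int) -> float:
    """Spread an aggregate IOPS figure evenly across cores."""
    return iops / cores
```

With 4096-byte I/Os, 1 to 2 million IOPS is about 4.1 to 8.2 GB/s in aggregate, or 62,500 to 125,000 IOPS per core on 16 cores.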

kTLS / SSL offload

TLS is supported via the secure_channel option on the listener. The transport uses OpenSSL on the data path. kTLS (kernel TLS) is a Linux kernel feature that moves TLS record encryption into the kernel, which can in turn offload it to a capable NIC; the application hands over the plaintext and the keys. SPDK reaches kTLS through the spdk_sock abstraction's impl selector: if the underlying socket implementation supports kTLS, the transport uses it transparently.

VFIO-user: the third option that is not a network at all

Here's the curve ball. VFIO-user is not a network transport. There is no IP, no port, no RDMA QP, no TCP socket. VFIO-user uses shared memory and a Unix domain socket to expose an emulated PCI device to a QEMU VM. It's a way for a QEMU guest to access a host SPDK bdev as if it were a local NVMe device, with no kernel involvement on either side and no network stack in the middle.

flowchart LR
  subgraph "QEMU guest VM"
    GUEST[Guest NVMe driver]
    QEMUDev["QEMU vfio-user<br/>device emulation"]
  end
  subgraph "Host (SPDK)"
    SOCK["Unix domain<br/>socket (no network)"]
    SHM["Shared memory<br/>(doorbells, queues)"]
    NVMeUser["lib/nvmf/vfio_user.c<br/>target-side transport"]
    Bdev["bdev stack"]
  end
  GUEST --> QEMUDev
  QEMUDev -- "control messages (Unix socket)" --> SOCK
  QEMUDev -- "MMIO into doorbells" --> SHM
  SOCK --> NVMeUser
  SHM --> NVMeUser
  NVMeUser --> Bdev
fig. 1 · VFIO-user architecture

fig. 1   The "fabric" is a Unix domain socket plus a shared memory region. The guest's NVMe driver thinks it's talking to a local PCIe NVMe controller; in reality, the doorbells are MMIOs into shared memory, and the admin/IO queues are SPDK's normal qpair data structures.

VFIO-user reuses the vfio-user protocol (libvfio-user). The transport is implemented in lib/nvmf/vfio_user.c. The other side is in lib/vfu_tgt/tgt_endpoint.c, which is a target-side library for the "endpoint" half of the protocol. The QEMU side lives in QEMU itself (hw/vfio/user.c).

The key idea: the guest's NVMe driver writes a doorbell by MMIO-ing into a BAR. The BAR is backed by shared memory. The host (SPDK) is polling that shared memory in its poll_group_poll. When the doorbell changes, SPDK knows the guest has submitted an I/O. The admin and I/O submission queues themselves live in the shared memory region. The protocol is a PCIe NVMe emulation; the "fabric" is the shared memory.
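
The doorbell check itself is simple ring arithmetic. The sketch below assumes the usual NVMe convention that the submission queue tail doorbell holds the index of the next free slot and wraps at the queue size; new_entries is a hypothetical helper, and the real poller also has to read the queue entries out of guest memory.

```python
def new_entries(last_tail: int, doorbell_tail: int, queue_size: int) -> list:
    """Return the SQ slot indexes submitted since the last poll."""
    if not (0 <= last_tail < queue_size and 0 <= doorbell_tail < queue_size):
        raise ValueError("tail values must lie within the queue")
    # Modular distance handles the case where the tail has wrapped around.
    count = (doorbell_tail - last_tail) % queue_size
    return [(last_tail + i) % queue_size for i in range(count)]
```

With an 8-entry queue, a poller that last saw tail 6 and now reads 2 gets slots [6, 7, 0, 1]. If the doorbell has not changed, the list is empty and the poll costs one memory read.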

VFIO-user vs vhost-user

Layer 7 is about vhost-user and VFIO-user in detail. The short version: vhost-user is a virtio device, VFIO-user is a PCIe device. The difference matters:

| Aspect | vhost-user (virtio) | VFIO-user (PCIe/NVMe) |
|---|---|---|
| What the guest sees | virtio-blk / virtio-scsi device | Real NVMe controller |
| What's on the wire | virtqueue descriptors (avail ring, used ring) | NVMe submission/completion queues, doorbells |
| Driver in the guest | virtio-pci / virtio-mmio | Standard NVMe driver (nvme.ko) |
| Performance characteristic | Lowest possible overhead; designed for VM workloads | Nearly as low; emulates a real device, but with all the NVMe features (namespaces, ANA, reservations) |
| When to use | Generic VM block I/O; you don't need NVMe semantics | The guest needs real NVMe: namespaces, ANA, multipath, NVMe-specific features |

VFIO-user is the right answer for "I want this VM to see SPDK-backed storage as a real NVMe device, with all the NVMe feature set, and I want it to be fast." vhost-user is the right answer for "I want a paravirtualized block device for a KVM guest, and I don't need NVMe semantics."

Comparison: latency, CPU, deployment complexity

| Aspect | RDMA | TCP | VFIO-user |
|---|---|---|---|
| 4KB read latency | ~5 µs (good RDMA NIC) | ~30-50 µs (kernel-bypass socket) / ~80 µs (regular socket) | ~3-5 µs (in-host) |
| CPU per GB/s | ~0.5 core | ~2-3 cores (regular socket) / ~1 core (io_uring) | ~0.3 core (in-host) |
| Topology requirement | Dedicated RoCE v2 network with PFC, or InfiniBand fabric, or iWARP | Any Ethernet | Same host (Unix socket + shared memory) |
| Encryption support | IPsec (out of band) | TLS / kTLS in-band | None (in-host only; trusted boundary) |
| Distance | Datacenter-scale (RoCE has reach limits; IB has range) | Anywhere TCP/IP runs | Same machine |
| Hardware cost | RDMA HCA, ~$500-$2000 per port | Standard NIC, ~$50 | None |
| Deployment complexity | High (PFC, ECN, MTU, MR tuning, firmware) | Low (just open a port) | Medium (QEMU + SPDK + vfio-user server setup) |

Latency is the headline. RDMA on a good HCA can hit 5 µs end-to-end for a 4KB read. TCP over kernel-bypass sockets hits 30-50 µs. VFIO-user, because it's in-host shared memory, hits 3-5 µs. For a single connection with low queue depth, this ordering holds.

CPU follows the same pattern. RDMA's verbs are light; TCP's socket path is the heaviest (mitigated by io_uring and kTLS). VFIO-user is the lightest of all because there's no network at all.

Deployment complexity flips the order. TCP is the easiest: you just need a routable IP. RDMA is the hardest: you need an RDMA-capable fabric, properly configured PFC and ECN on RoCE or a working subnet manager on InfiniBand, and HCA-specific tuning. VFIO-user is in the middle: it's "in-host only" by design, but the QEMU integration has many small pieces that all have to line up.

Edge cases & what trips people up

RDMA: MTU issues

RDMA has a notion of MTU independent of the Ethernet MTU. The default is usually 2048 (2K), but the practical max is 4096 (4K) for most HCAs. The "max inline data" setting on the queue pair controls how much data can ride in a SEND WR. If the data doesn't fit, it goes as a separate RDMA transfer, which is fine but costs extra round trips.

The thing that catches you: your I/O size must be at most max_io_size, which is constrained by io_unit_size and the buffer pool. If the application issues a 512KB read and the transport only has 128KB buffers, the I/O has to be split. The transport handles splitting, but it costs an extra copy in some cases.
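
The buffer arithmetic behind that split is worth seeing once. This sketch assumes an I/O larger than max_io_size is cut into max_io_size pieces and that each piece needs enough io_unit_size buffers to cover it; plan_io is a hypothetical helper, and the 8 KiB unit size in the example is illustrative only.

```python
def plan_io(io_size: int, io_unit_size: int, max_io_size: int) -> list:
    """Return the number of io_unit_size buffers needed by each split piece."""
    if min(io_size, io_unit_size, max_io_size) <= 0:
        raise ValueError("all sizes must be positive")
    counts = []
    remaining = io_size
    while remaining > 0:
        chunk = min(remaining, max_io_size)
        counts.append(-(-chunk // io_unit_size))  # ceiling division
        remaining -= chunk
    return counts
```

A 512 KiB read against a 128 KiB limit with 8 KiB units becomes four pieces of 16 buffers each. A 100 KiB read stays in one piece but needs 13 buffers, because the last unit is only half used.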

TCP: TCP RST

A TCP RST is an abrupt connection termination. The kernel sends one if the application closes a socket with data unread; it can also be sent by an intermediary. The transport's poll_group_poll reads the disconnection, marks the qpair on that connection as disconnected, and reports the failure to the nvmf library. All in-flight I/O on the qpair is failed with NVME_SC_HOST_PATH_ERROR.

What you do about it: usually nothing. The host's NVMe driver will see the failure, retry the I/O if appropriate, and reconnect. If you see RSTs in a healthy system, you have a process-exit-with-unclean-TCP-state problem on the host side; check your SO_LINGER settings.
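
For checking the linger setting on a socket you control, the standard Python socket module is enough. The sketch assumes a Linux-style struct linger of two C ints (l_onoff, l_linger); set_linger and get_linger are hypothetical helper names. Enabling linger with a zero timeout is the classic way to make close() send an RST instead of a normal FIN.

```python
import socket
import struct

_LINGER = struct.Struct("ii")  # l_onoff, l_linger (seconds)


def set_linger(sock: socket.socket, enabled: bool, timeout_s: int) -> None:
    """Set SO_LINGER; enabled=True with timeout_s=0 makes close() abortive."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    _LINGER.pack(int(enabled), timeout_s))


def get_linger(sock: socket.socket):
    """Return (enabled, timeout_s) as currently configured on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, _LINGER.size)
    onoff, timeout_s = _LINGER.unpack(raw)
    return bool(onoff), timeout_s
```

If get_linger reports (True, 0) on a host-side socket, every close of that socket resets the connection, which would explain RSTs showing up at the target.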

RDMA: path migration

RoCE supports path migration: rerouting a QP to a different physical path without tearing it down. The HCA detects the path failure and notifies userspace. SPDK's transport has limited support for this; the default behavior is to fail the qpair and let the host reconnect. If you need hot-path migration, you're in advanced territory — set the HCA's roce_path sysfs knob and tune the path's retransmit_timeout.

VFIO-user: QEMU restarts

Suppose QEMU is killed and restarted. The vfio-user transport detects the disconnect (the Unix socket closes), marks the controller as broken, and tears down the qpair. The transport's state machine has explicit handling for this: see lib/nvmf/vfio_user.c:161, where the controller state goes from RUNNING to PAUSING to PAUSED on disconnect.

The bdev underneath is unaffected — SPDK's nvmf_tgt continues to run, the bdev continues to exist. When QEMU reconnects, the vfio-user transport re-creates the controller. The new controller has a fresh cntlid and sees the same bdev. The guest's NVMe driver will re-enumerate the namespaces.

RDMA: connection migration under kernel-bypass

A subtle failure mode: the SPDK target's poll group is pinned to a specific core. The HCA's IRQ is on a different core. Every received WR triggers a CQ notification, which is an inter-processor interrupt. The transport has to bounce from the IRQ's core to the poll group's core. If your topology is wrong (poll group on core 0, IRQ on core 31), you get cache-line ping-ponging. Set get_optimal_poll_group in the transport ops to do the right thing — for RDMA, that's "the same core the HCA's CQ completion vector is on."

What to take away

Three transports, three completely different worlds. RDMA uses verbs, pre-registered memory, and a 14-state request state machine to deliver latency of around 5 µs. TCP uses sockets and PDUs, with kTLS for encryption, and is always data-in-buffer. VFIO-user is the outlier: not a network at all, just shared memory and doorbells, the lowest-latency option for QEMU guests. The transport ops struct is the unified interface; under the hood, the implementations are completely unrelated.

The next page — provisioning flow traced — is about how diskengine takes a single database record and walks it through the full SPDK stack from lvol to NVMe-oF to host.