RDMA, TCP, and the strange third option.
The NVMe-oF protocol doesn't care how the bytes get from
initiator to target. It defines capsules and connection setup;
it leaves the framing to the transport. SPDK implements three:
RDMA, TCP, and VFIO-user. Each one is a complete transport in
its own right — its own spdk_nvmf_transport_ops
struct, its own poll loop, its own failure modes. This page is
about what's inside each of them.
- What a transport does in SPDK
- RDMA: kernel bypass, verbs, and queue pairs
- RDMA: data-in-place vs data-in-buffer
- RDMA: the state machine, the SRQ, memory registration
- TCP: the PDU, the socket, and kTLS offload
- TCP: data-in-buffer, the wire format, throughput
- VFIO-user: the third option that is not a network at all
- VFIO-user vs vhost-user
- Comparison: latency, CPU, deployment complexity
- Edge cases: which transport breaks first under what load
What a transport does in SPDK
Every transport implements the same struct,
spdk_nvmf_transport_ops, declared in
include/spdk/nvmf_transport.h.
The struct has roughly two dozen function pointers covering
lifecycle (create, destroy), listening (listen, stop_listen,
discover), poll group management (create, destroy, add, remove,
poll), request handling (req_free, req_complete, req_get_buffers_done),
and qpair introspection (get_peer_trid, get_local_trid,
get_listen_trid, abort_request).
The transport registry at
lib/nvmf/transport.c:42 stores
spdk_nvmf_transport_ops in a global TAILQ. The
names are "RDMA", "TCP", and
"VFIOUSER". The same struct, three completely
different verbs underneath.
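A minimal sketch of the registry idea, assuming a trimmed-down ops struct with a single callback. The names demo_transport_ops, demo_register_ops, and demo_find_ops are hypothetical and are not SPDK's API, and the case-insensitive lookup is an assumption of this sketch. It only shows the shape: one struct type, a global TAILQ, lookup by name.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>
#include <sys/queue.h>

/* Hypothetical, trimmed-down stand-in for spdk_nvmf_transport_ops. */
struct demo_transport_ops {
    const char *name;                     /* e.g. "RDMA", "TCP", "VFIOUSER" */
    int (*poll_group_poll)(void);         /* one of the many callbacks */
    TAILQ_ENTRY(demo_transport_ops) link; /* registry linkage */
};

/* Global registry: a tail queue of registered ops structs. */
static TAILQ_HEAD(, demo_transport_ops) g_demo_ops =
    TAILQ_HEAD_INITIALIZER(g_demo_ops);

/* Look up ops by name (case-insensitive in this sketch); NULL if absent. */
static struct demo_transport_ops *demo_find_ops(const char *name)
{
    struct demo_transport_ops *ops;

    TAILQ_FOREACH(ops, &g_demo_ops, link) {
        if (strcasecmp(ops->name, name) == 0) {
            return ops;
        }
    }
    return NULL;
}

/* Register an ops struct; returns 0 on success, -1 on a duplicate name. */
static int demo_register_ops(struct demo_transport_ops *ops)
{
    assert(ops != NULL && ops->name != NULL);
    if (demo_find_ops(ops->name) != NULL) {
        return -1;
    }
    TAILQ_INSERT_TAIL(&g_demo_ops, ops, link);
    return 0;
}

/* Two placeholder poll callbacks standing in for real transports. */
static int demo_rdma_poll(void) { return 1; }
static int demo_tcp_poll(void) { return 2; }
```

A caller would register one struct per transport at startup and then resolve a transport by name when a listener is created. The generic layer only ever calls through the function pointers.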
RDMA: kernel bypass, verbs, and queue pairs
RDMA (Remote Direct Memory Access) moves data between two
machines without involving the kernel, the CPU, or (mostly) the
cache. The HCA (Host Channel Adapter, the RDMA NIC) reads from
memory on machine A and writes to memory on machine B. The
verbs are RDMA_READ and RDMA_WRITE for one-sided transfers and
RDMA_SEND / RDMA_RECV for short
two-sided messages. NVMe-oF over RDMA uses all four.
The transport is implemented in
lib/nvmf/rdma.c. The verbs
interface is wrapped by SPDK's
spdk_internal/rdma_provider.h and
spdk_internal/rdma_utils.h, which lets the same
code target plain libibverbs or a vendor-specific provider
such as the mlx5 direct-verbs (mlx5_dv) one.
The mapping from NVMe queues to RDMA queue pairs is direct:
| NVMe-oF | RDMA | Notes |
|---|---|---|
| Command capsule (host to target, admin or I/O) | SEND on the host's Send Queue (SQ), landing in a receive WR the target pre-posted on its Receive Queue (RQ) | Small writes can carry their data inside the capsule. |
| Response capsule (target to host) | SEND on the target's SQ, landing in a receive WR the host pre-posted | Carries the completion queue entry. |
| Data for a read (target to host) | RDMA_WRITE issued by the target | The target writes into the host buffer described by the command's SGL. |
| Data for a write (host to target) | RDMA_READ issued by the target | The target pulls from the host buffer; the host never issues the one-sided operations itself. |
So one NVMe qpair (a submission queue plus its completion queue) is backed by one RDMA queue pair, which has exactly one Send Queue and one Receive Queue. The transport pre-posts a large number of receive WRs (to the SRQ, the Shared Receive Queue, when one is in use) so it always has somewhere to land an incoming capsule.
RDMA: data-in-place vs data-in-buffer
Here's the design choice that defines RDMA performance. The NVMe-oF spec has two ways to do an I/O:
Data-in-buffer. The data travels through a buffer that the target owns. For a write, the data reaches that buffer either in-capsule (via SEND) or because the target pulls it with RDMA_READ; the target then processes the I/O against the buffer. For a read, the target fills the buffer and pushes it to the host with RDMA_WRITE.
Data-in-place. The data goes directly between the host's user buffer and the target's backing bdev. For reads: the target bdev DMAs the data into a pinned host buffer. For writes: the host's pinned buffer is read by the bdev.
In SPDK, "data-in-place" is enabled when the
zcopy option is set on the transport (default
true for RDMA). When set, the I/O path uses
spdk_nvme_rdma_zcopy_start /
spdk_nvme_rdma_zcopy_commit instead of staging
through a buffer. The data sits in either the host's pinned
memory or the bdev's DMA region for the entire I/O lifetime.
RDMA: the state machine, the SRQ, memory registration
The request state machine (enum spdk_nvmf_rdma_request_state in lib/nvmf/rdma.c) has 14 states. The interesting transitions are:
NEW → NEED_BUFFER. A new capsule arrived. The request needs a data buffer. The transport either has one immediately (in the iobuf pool) or has to wait.
NEED_BUFFER → HAVE_BUFFER. An iobuf was acquired. The request is now ready to either RECEIVE data (for writes) or to EXECUTE the I/O (for reads).
HAVE_BUFFER → TRANSFERRING_HOST_TO_CONTROLLER. A write. The target has issued an RDMA_READ to pull the data from the host's buffer into its own, and waits for that READ to complete.
READY_TO_EXECUTE → EXECUTING. The bdev is doing the I/O. Completion is asynchronous; the poller eventually gets the bdev_io done callback.
EXECUTED → DATA_TRANSFER_TO_HOST_PENDING. A read. The bdev has filled the buffer; now the target needs to RDMA_WRITE it into the host's pinned buffer.
The SRQ (Shared Receive Queue) is a performance optimization. Instead of having a separate RQ per qpair, all qpairs in a poll group share one RQ. The transport pre-posts receive WRs to the SRQ; when a capsule arrives, the HCA consumes a receive WR from the shared pool. The WR ID in the completion identifies which receive buffer was used, the QP number identifies which qpair the capsule arrived on, and the request is born.
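A minimal sketch of the demultiplexing step, assuming a reduced completion record. libibverbs work completions do carry both a wr_id and a qp_num; dispatching on qp_num is this sketch's assumption, and the structs and the linear search are hypothetical simplifications, since a real transport would use a faster lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a receive work completion. */
struct demo_wc {
    uint64_t wr_id;  /* which pre-posted receive buffer was consumed */
    uint32_t qp_num; /* which queue pair the message arrived on */
};

/* Hypothetical per-connection state. */
struct demo_qpair {
    uint32_t qp_num;
    uint32_t capsules_received;
};

/*
 * Route a completion drained from the shared receive queue to the
 * qpair it belongs to. Returns NULL when no qpair matches.
 */
static struct demo_qpair *demo_srq_dispatch(struct demo_qpair *qpairs,
                                            size_t count,
                                            const struct demo_wc *wc)
{
    assert(wc != NULL);
    for (size_t i = 0; i < count; i++) {
        if (qpairs[i].qp_num == wc->qp_num) {
            qpairs[i].capsules_received++;
            return &qpairs[i];
        }
    }
    return NULL;
}
```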
Memory registration is the part that gets
you. Every buffer that RDMA touches must be registered with
the HCA — that means the HCA knows the physical address
mapping and has pinned the page. ibv_reg_mr is
expensive (microseconds to milliseconds). The RDMA transport
keeps a pool of pre-registered buffers (the
num_shared_buffers and buf_cache_size
options) so that the common case — incoming capsules with
small data — never has to register at I/O time.
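The pre-registration idea can be sketched as a fixed pool that pays the registration cost once, at setup. This is an illustration under stated assumptions: demo_register is a stand-in that only counts calls (a real transport would call ibv_reg_mr there), and the pool layout is hypothetical, not SPDK's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POOL_MAX 64

/* Counts simulated registrations; stands in for the cost of ibv_reg_mr. */
static int g_demo_reg_calls;

static void demo_register(void *buf, size_t len)
{
    (void)buf;
    (void)len;
    g_demo_reg_calls++;
}

struct demo_buf_pool {
    void *free_list[DEMO_POOL_MAX]; /* stack of idle, registered buffers */
    size_t free_count;
};

/* Free every buffer currently sitting in the pool. */
static void demo_pool_fini(struct demo_buf_pool *p)
{
    while (p->free_count > 0) {
        free(p->free_list[--p->free_count]);
    }
}

/* Allocate and register `count` buffers up front. Returns 0 or -1. */
static int demo_pool_init(struct demo_buf_pool *p, size_t count, size_t buf_size)
{
    assert(p != NULL && count <= DEMO_POOL_MAX);
    p->free_count = 0;
    for (size_t i = 0; i < count; i++) {
        void *buf = malloc(buf_size);
        if (buf == NULL) {
            demo_pool_fini(p);
            return -1;
        }
        demo_register(buf, buf_size); /* expensive step, done once */
        p->free_list[p->free_count++] = buf;
    }
    return 0;
}

/* I/O-time path: no registration, just a pop. NULL means "wait". */
static void *demo_pool_get(struct demo_buf_pool *p)
{
    return p->free_count > 0 ? p->free_list[--p->free_count] : NULL;
}

/* Return a buffer to the pool once the I/O has completed. */
static void demo_pool_put(struct demo_buf_pool *p, void *buf)
{
    assert(p->free_count < DEMO_POOL_MAX);
    p->free_list[p->free_count++] = buf;
}
```

The point of the design is that demo_pool_get never touches the registration path; an empty pool turns into back-pressure instead of a registration at I/O time.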
TCP: the PDU, the socket, and kTLS offload
RDMA has a hard requirement: an RDMA-capable HCA and a network that can carry RDMA traffic (RoCE v2 with PFC, InfiniBand, or iWARP). When you don't have that, as in many data centers and most public-cloud networks, you fall back to TCP.
TCP is the universal transport. Any Ethernet NIC works. Any
network works. The cost is CPU: every byte goes through the
kernel TCP stack (unless you use sendfile or
similar offload, which TCP doesn't really give you for
arbitrary user buffers), and the kernel TCP stack is not
cheap.
SPDK's TCP transport implementation is in
lib/nvmf/tcp.c. The connection
looks like a regular socket pair: the target calls
accept(), the initiator calls
connect(), and then the two sides exchange
NVMe-oF PDUs (Protocol Data Units) over the
socket.
The wire format: PDUs over TCP
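Every PDU starts with a fixed 8-byte common header, followed by a
type-specific header and an optional variable-length data payload.
The common header contains the PDU type (ICReq, ICResp, Capsule
Command, Capsule Response, H2CData, C2HData, R2T for Ready to
Transfer, and the two termination requests), a flags byte, the
header length, the data offset, and the total PDU length. There is
no sequence number; ordering comes from the TCP stream itself. The
transport's poll_group_poll
reads from the socket, demultiplexes by PDU type, and hands
the body to the nvmf library.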
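A sketch of decoding the first 8 bytes of a PDU, assuming the NVMe/TCP common-header layout: one byte each for PDU type, flags, header length, and data offset, then a 4-byte little-endian total length. The demo_ names are hypothetical and the validation is deliberately minimal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PDU type codes as assigned by the NVMe/TCP transport specification. */
enum demo_pdu_type {
    DEMO_PDU_ICREQ        = 0x00,
    DEMO_PDU_ICRESP       = 0x01,
    DEMO_PDU_H2C_TERM_REQ = 0x02,
    DEMO_PDU_C2H_TERM_REQ = 0x03,
    DEMO_PDU_CAPSULE_CMD  = 0x04,
    DEMO_PDU_CAPSULE_RESP = 0x05,
    DEMO_PDU_H2C_DATA     = 0x06,
    DEMO_PDU_C2H_DATA     = 0x07,
    DEMO_PDU_R2T          = 0x09
};

/* Decoded form of the 8-byte common header. */
struct demo_pdu_common_hdr {
    uint8_t pdu_type;
    uint8_t flags;
    uint8_t hlen;  /* length of the full PDU header */
    uint8_t pdo;   /* offset of the data within the PDU */
    uint32_t plen; /* total PDU length, header included */
};

/*
 * Decode the common header from raw socket bytes.
 * Returns 0 on success, -1 if the bytes are short or inconsistent.
 */
static int demo_parse_common_hdr(const uint8_t *buf, size_t len,
                                 struct demo_pdu_common_hdr *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 8) {
        return -1; /* need more bytes from the socket */
    }
    out->pdu_type = buf[0];
    out->flags = buf[1];
    out->hlen = buf[2];
    out->pdo = buf[3];
    /* plen is little-endian on the wire. */
    out->plen = (uint32_t)buf[4] | ((uint32_t)buf[5] << 8) |
                ((uint32_t)buf[6] << 16) | ((uint32_t)buf[7] << 24);
    if (out->plen < out->hlen) {
        return -1; /* a PDU cannot be shorter than its own header */
    }
    return 0;
}
```

Once plen is known, the reader knows exactly how many more bytes belong to this PDU, which is what makes framing on a byte stream possible.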
The extra states relative to RDMA are around R2T (Ready to Transfer). For large writes, the host sends a write command, the target replies with an R2T saying "I need bytes X through Y now," the host sends them. R2T is a flow-control mechanism; without it, the host would either over-commit memory (sending the full write at once) or under-commit (sending in tiny chunks). The R2T protocol lets the target pace the host.
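The pacing described above can be sketched as a planner that carves one large write into R2T windows, each naming an offset and a length. The max_r2t_len parameter and the demo_ names are hypothetical; how large a window a real target grants, and how many it keeps outstanding, depends on its buffers and negotiated limits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "send me bytes [offset, offset + length)" request to the host. */
struct demo_r2t {
    uint32_t offset;
    uint32_t length;
};

/*
 * Plan the R2T windows needed to cover a write of total_len bytes when
 * the target accepts at most max_r2t_len bytes per window.
 * Returns the number of windows written to `out`, or -1 if `out` is
 * too small or max_r2t_len is zero.
 */
static int demo_plan_r2t(uint32_t total_len, uint32_t max_r2t_len,
                         struct demo_r2t *out, size_t out_cap)
{
    assert(out != NULL);
    if (max_r2t_len == 0) {
        return -1;
    }

    size_t n = 0;
    uint32_t off = 0;

    while (off < total_len) {
        if (n == out_cap) {
            return -1;
        }
        uint32_t remaining = total_len - off;
        uint32_t chunk = remaining < max_r2t_len ? remaining : max_r2t_len;

        out[n].offset = off;
        out[n].length = chunk;
        n++;
        off += chunk;
    }
    return (int)n;
}
```

For a 300 KiB write with 128 KiB windows this yields three windows: two full ones and a 44 KiB tail.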
TCP: data-in-buffer, throughput
The TCP transport is always data-in-buffer. There's no RDMA-style direct path because there's no RDMA. The data always passes through a target-owned buffer. This means:
- Every I/O involves at least one copy. For reads, the bdev fills the buffer, then send() copies the buffer to the socket. For writes, recv() fills the buffer, then the bdev reads from it.
- The CPU cost of TCP is dominated by the socket I/O. The transport uses SPDK's spdk_sock abstraction (POSIX sockets, or io_uring if SPDK was built with it) underneath. The number of context switches is non-zero, but most of the work is in memcpy and the kernel's TCP send/receive path.
- Throughput scales with core count: more poll groups, more TCP connections, more parallel memcpy. A well-tuned TCP target with 16 cores can push 1-2 million IOPS for 4KB random reads.
kTLS / SSL offload
TLS is supported via the secure_channel option on
the listener. The transport uses OpenSSL on the data path.
kTLS (kernel TLS) is a Linux kernel feature that moves TLS
record encryption into the kernel, which can in turn offload it
to a capable NIC; the application completes the handshake in
userspace, hands the kernel the session keys, and from then on
writes plaintext. SPDK reaches kTLS through the spdk_sock
abstraction's impl selector: when the selected socket
implementation supports kTLS and has it enabled, the transport
uses it without further changes.
VFIO-user: the third option that is not a network at all
Here's the curve ball. VFIO-user is not a network transport. There is no IP, no port, no RDMA QP, no TCP socket. VFIO-user uses shared memory and a Unix domain socket to expose an emulated PCI device to a QEMU VM. It's a way for a QEMU guest to access a host SPDK bdev as if it were a local NVMe device, with no kernel involvement on either side and no network stack in the middle.
flowchart LR
  subgraph "QEMU guest VM"
    GUEST[Guest NVMe driver]
    QEMUDev["QEMU vfio-user device emulation"]
  end
  subgraph "Host (SPDK)"
    SOCK["Unix domain socket (no network)"]
    SHM["Shared memory (doorbells, queues)"]
    NVMeUser["lib/nvmf/vfio_user.c target-side transport"]
    Bdev["bdev stack"]
  end
  GUEST --> QEMUDev
  QEMUDev -- "control messages (Unix socket)" --> SOCK
  QEMUDev -- "MMIO into doorbells" --> SHM
  SOCK --> NVMeUser
  SHM --> NVMeUser
  NVMeUser --> Bdev
fig. 1 The "fabric" is a Unix domain socket plus a shared memory region. The guest's NVMe driver thinks it's talking to a local PCIe NVMe controller; in reality, the doorbells are MMIOs into shared memory, and the admin/IO queues are SPDK's normal qpair data structures.
VFIO-user reuses the vfio-user protocol
(libvfio-user). The transport is implemented in
lib/nvmf/vfio_user.c. The
other side is in
lib/vfu_tgt/tgt_endpoint.c,
which is a target-side library for the "endpoint" half of
the protocol. The QEMU side lives in QEMU itself
(hw/vfio/user.c).
The key idea: the guest's NVMe driver writes a doorbell by
MMIO-ing into a BAR. The BAR is backed by shared memory. The
host (SPDK) is polling that shared memory in its
poll_group_poll. When the doorbell changes, SPDK
knows the guest has submitted an I/O. The admin and I/O
submission queues themselves live in the shared memory
region. The protocol is a PCIe NVMe emulation; the
"fabric" is the shared memory.
VFIO-user vs vhost-user
Layer 7 is about vhost-user and VFIO-user in detail. The short version: vhost-user is a virtio device, VFIO-user is a PCIe device. The difference matters:
| Aspect | vhost-user (virtio) | VFIO-user (PCIe/NVMe) |
|---|---|---|
| What the guest sees | virtio-blk / virtio-scsi device | Real NVMe controller |
| What's on the wire | virtqueue descriptors (avail ring, used ring) | NVMe submission/completion queues, doorbells |
| Driver in the guest | virtio-pci / virtio-mmio | Standard NVMe driver (nvme.ko) |
| Performance characteristic | Lowest overhead: designed for VM workloads | Nearly as low: emulates a real device, but with all the NVMe features (namespaces, ANA, reservations) |
| When to use | Generic VM block I/O; you don't need NVMe semantics | The guest needs real NVMe: namespaces, ANA, multipath, NVMe-specific features |
VFIO-user is the right answer for "I want this VM to see SPDK-backed storage as a real NVMe device, with all the NVMe feature set, and I want it to be fast." vhost-user is the right answer for "I want a paravirtualized block device for a KVM guest, and I don't need NVMe semantics."
Comparison: latency, CPU, deployment complexity
| | RDMA | TCP | VFIO-user |
|---|---|---|---|
| 4KB read latency | ~5 µs (good RDMA NIC) | ~30-50 µs (kernel-bypass socket) / ~80 µs (regular socket) | ~3-5 µs (in-host) |
| CPU per GB/s | ~0.5 core | ~2-3 cores (regular socket) / ~1 core (io_uring) | ~0.3 core (in-host) |
| Topology requirement | Dedicated RoCE v2 network with PFC, or InfiniBand fabric, or iWARP | Any Ethernet | Same host (Unix socket + shared memory) |
| Encryption support | IPsec (out of band) | TLS / kTLS in-band | None (in-host only; trusted boundary) |
| Distance | Datacenter-scale (RoCE has reach limits; IB has range) | Anywhere TCP/IP runs | Same machine |
| Hardware cost | RDMA HCA, ~$500-$2000 per port | Standard NIC, ~$50 | None |
| Deployment complexity | High (PFC, ECN, MTU, MR tuning, firmware) | Low (just open a port) | Medium (QEMU + SPDK + vfio-user server setup) |
Latency is the headline. RDMA on a good HCA can hit 5 µs end-to-end for a 4KB read. TCP over kernel-bypass sockets hits 30-50 µs. VFIO-user, because it's in-host shared memory, hits 3-5 µs. For a single connection with low queue depth, this ordering holds.
CPU follows the same pattern. RDMA's verbs path is light; TCP's socket path is the heaviest (mitigated by io_uring and kTLS). VFIO-user is the lightest of all because there's no network at all.
Deployment complexity flips the order. TCP is the easiest: you just need a routable IP. RDMA is the hardest: you need an RDMA-capable fabric, properly configured PFC and ECN on RoCE or a working subnet manager on InfiniBand, and HCA-specific tuning. VFIO-user is in the middle: it's "in-host only" by design, but the QEMU integration has many small pieces that all have to line up.
Edge cases & what trips people up
RDMA: MTU issues
RDMA has its own path MTU, independent of the Ethernet MTU. The default is usually 2048 (2K), but the practical max is 4096 (4K) for most HCAs. A separate limit, the in-capsule data size (in_capsule_data_size on the transport), controls how much write data can ride along in the command SEND. If the data doesn't fit, the target fetches it with a separate RDMA_READ, which is fine but costs an extra round trip.
The thing that catches you: your I/O size must be at most
max_io_size, which is constrained by
io_unit_size and the buffer pool. If the
application issues a 512KB read and the transport only has
128KB buffers, the I/O has to be split. The transport
handles splitting, but it costs an extra copy in some
cases.
TCP: TCP RST
A TCP RST is an abrupt connection termination. The kernel
sends one if the application closes a socket with data
unread; it can also be sent by an intermediary. Each NVMe/TCP
qpair has its own TCP connection, so the
transport's poll_group_poll reads the
disconnection, marks the qpair that owns the connection as
disconnected, and reports the failure to the nvmf library.
All in-flight I/O on the qpair is failed with
NVME_SC_HOST_PATH_ERROR.
What you do about it: usually nothing. The host's NVMe
driver will see the failure, retry the I/O if appropriate,
and reconnect. If you see RSTs in a healthy system, you
have a process-exit-with-unclean-TCP-state problem on the
host side; check your SO_LINGER settings.
RDMA: path migration
RoCE supports path migration: rerouting a QP to a
different physical path without tearing it down. The HCA
detects the path failure and notifies userspace. SPDK's
transport has limited support for this; the default
behavior is to fail the qpair and let the host reconnect.
If you need hot-path migration, you're in advanced
territory — set the HCA's roce_path sysfs
knob and tune the path's retransmit_timeout.
VFIO-user: QEMU restarts
QEMU is killed and restarted. The vfio-user transport
detects the disconnect (the Unix socket closes), marks the
controller as broken, and tears down the qpair. The
transport's state machine has explicit handling for
this — look at
lib/nvmf/vfio_user.c:161 where
the controller state goes from RUNNING to
PAUSING to PAUSED on disconnect.
The bdev underneath is unaffected — SPDK's nvmf_tgt
continues to run, the bdev continues to exist. When QEMU
reconnects, the vfio-user transport re-creates the
controller. The new controller has a fresh
cntlid and sees the same bdev. The guest's
NVMe driver will re-enumerate the namespaces.
RDMA: connection migration under kernel-bypass
A subtle failure mode: the SPDK target's poll group is
pinned to a specific core. The HCA's IRQ is on a different
core. Every received WR triggers a CQ notification, which
is an inter-processor interrupt. The transport has to
bounce from the IRQ's core to the poll group's core. If
your topology is wrong (poll group on core 0, IRQ on core
31), you get cache-line ping-ponging. The
get_optimal_poll_group callback in the transport ops
is where the transport decides which poll group a new qpair
lands on; for RDMA, the right answer is "the same core
the HCA's CQ completion vector is on."
What to take away
Three transports, three completely different worlds. RDMA uses verbs, pre-registered memory, and a 14-state request state machine to deliver roughly 5 µs latency. TCP uses sockets and PDUs, with kTLS for encryption, and is always data-in-buffer. VFIO-user is the outlier: not a network at all, just shared memory and doorbells, the lowest-latency option for QEMU guests. The transport ops struct is the unified interface; under the hood, the implementations are completely unrelated.
The next page — provisioning flow traced — is about how diskengine takes a single database record and walks it through the full SPDK stack from lvol to NVMe-oF to host.