Layer 4 · The bdev framework

One I/O, end to end.

A bdev_io lives a long life for a structure that gets recycled from a mempool. It's allocated when you call spdk_bdev_read, it survives a bounce-buffer copy, a possible split into child I/Os, a queue wait for memory, a trip through the module's dispatch, an NVMe round-trip or two, an accel sequence, and then a free back to the pool. This page traces every step.

~20 min read · 2 diagrams · prerequisites: 4.1 · 4.2
On this page
  1. The end-to-end path in nine steps
  2. Step 1-2: the caller's submit and the framework's allocation
  3. Step 3: the per-thread bdev_io cache and the global pool
  4. Step 4: the framework dispatches to the module
  5. Step 5-6: the module dispatches and waits
  6. Step 7-8: completion and the caller callback
  7. Step 9: the bdev_io returns to the pool
  8. The split: read-modify-write optimization
  9. The bounce buffer: alignment handling
  10. Edge cases: what if the module forgets to complete?

The end-to-end path in nine steps

Here's the whole thing in one diagram. We'll go through each step in detail below.

STEP 01
Caller has desc and channel
Both must be on the same thread
STEP 02
Caller invokes spdk_bdev_read/write/flush/...
Public API, bdev.h
STEP 03
Framework allocates bdev_io
From per-thread cache, then mempool
STEP 04
Framework dispatches to module
submit_request with channel and bdev_io
STEP 05
Module dispatches to device
NVMe queue, AIO, memcpy
STEP 06
Module waits for completion
Polled or interrupt-driven
STEP 07
Module calls spdk_bdev_io_complete
With a status code
STEP 08
Framework calls caller's callback
On the same thread that submitted
STEP 09
Framework returns bdev_io to pool
Per-thread cache first, then mempool
flowchart TB
A["caller: spdk_bdev_read_blocks_ext"] --> B["bdev_io in per-thread cache?"]
B -- "yes" --> D["Take from cache"]
B -- "no, no waiters" --> E["mempool_get"]
B -- "no, but waiters exist" --> F["Return -ENOMEM"]
E --> G["Initialize bdev_io fields"]
D --> G
G --> H["Bounce needed?"]
H -- "yes" --> I["accel copy to aligned buffer"]
H -- "no" --> J["Call fn_table->submit_request"]
I --> J
J --> K["Module dispatches to device"]
K --> L["Wait for completion"]
L --> M["Module calls spdk_bdev_io_complete"]
M --> N["Complete from submit context?"]
N -- "yes" --> O["spdk_thread_send_msg to defer"]
N -- "no" --> P["Update stats, run caller cb"]
O --> P
P --> Q["Caller cb returns"]
Q --> R["spdk_bdev_free_io"]
R --> S["Cache full?"]
S -- "no" --> T["Return to per-thread cache"]
S -- "yes" --> U[mempool_put]

classDef fail fill:#f5d6e0,stroke:#8a1c4f;
classDef ok fill:#d6f5d6,stroke:#2a6f2a;
classDef sync fill:#cfe1ff,stroke:#1c4f8a;
class F fail
class T,U ok
class A,B,D,E,G,H,I,J,K,L,M,N,O,P,Q,R,S sync

fig. 1   The full bdev_io lifecycle. The pink box is the failure-mode branch. The two green boxes are the free-back-to-pool paths. Everything else is the normal happy path.

Step 1-2: the caller's submit and the framework's allocation

The caller has a spdk_bdev_desc * and a spdk_io_channel *. Both must be valid and on the same thread. The caller then invokes one of the public I/O functions, e.g. spdk_bdev_read_blocks:

Three things to notice:

  1. The validation happens first. bdev_io_valid_blocks checks that offset_blocks + num_blocks does not wrap around and does not run past the end of the bdev. If the check fails, the function returns -EINVAL and the caller never gets a callback. This is the first place a caller bug shows up.

  2. The bdev_io is allocated before the I/O is submitted. If bdev_channel_get_io returns NULL (the per-thread cache is empty and either waiters are already queued or the global pool is exhausted), the function returns -ENOMEM and, again, the caller never gets a callback. The caller is expected to handle this. (The "waiters exist" branch is part of the fairness design; see Step 3.)

  3. All the I/O parameters go in u.bdev. Not into a malloc'd sidecar. The bdev_io is a single allocation, and the data fits in the union.
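The three points above can be sketched as a small stand-alone model. This is not the SPDK source: every `submit_model_*` name is hypothetical, the allocator is reduced to a pointer that is either NULL or a free bdev_io, and the range check covers only wrap-around and the end of the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the real SPDK structures. */
struct submit_model_bdev { uint64_t blockcnt; };

struct submit_model_io {
	/* All I/O parameters live inside the one allocation. */
	struct { void *buf; uint64_t offset_blocks; uint64_t num_blocks; } bdev;
	void (*cb)(struct submit_model_io *io, bool success, void *cb_arg);
	void *cb_arg;
};

/* Range check: reject wrap-around and anything past the end of the device. */
static bool submit_model_valid_blocks(const struct submit_model_bdev *bdev,
				      uint64_t offset_blocks, uint64_t num_blocks)
{
	if (offset_blocks + num_blocks < offset_blocks) {
		return false;
	}
	return offset_blocks + num_blocks <= bdev->blockcnt;
}

/* 'free_io' stands in for the allocator: NULL means nothing is available. */
static int submit_model_read_blocks(const struct submit_model_bdev *bdev,
				    struct submit_model_io *free_io, void *buf,
				    uint64_t offset_blocks, uint64_t num_blocks,
				    void (*cb)(struct submit_model_io *, bool, void *),
				    void *cb_arg)
{
	if (!submit_model_valid_blocks(bdev, offset_blocks, num_blocks)) {
		return -EINVAL;   /* no callback will ever fire */
	}
	if (free_io == NULL) {
		return -ENOMEM;   /* no callback will ever fire */
	}
	free_io->bdev.buf = buf;
	free_io->bdev.offset_blocks = offset_blocks;
	free_io->bdev.num_blocks = num_blocks;
	free_io->cb = cb;
	free_io->cb_arg = cb_arg;
	return 0;             /* from here on, exactly one callback is owed */
}
```

The design point the model tries to capture: a non-zero return means the I/O was never accepted, so the caller must not wait for a completion that will not come.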

Step 3: the per-thread bdev_io cache and the global pool

The bdev_io mempool is created during spdk_bdev_initialize:

The size is the size of the bdev_io header plus the largest get_ctx_size() across all loaded modules. The default pool size is 65535; you can change it with the bdev_set_options RPC. Each element is spdk_bdev_io + module scratch space, cache-line aligned.

Per-thread, the framework maintains a small cache:

Three branches, in order:

  1. Per-thread cache hit. Fast. The default cache size is 256 (SPDK_BDEV_IO_CACHE_SIZE, the initial value of g_bdev_opts.bdev_io_cache_size in bdev.c:147). This is the common case at steady state.

  2. Cache miss, but waiters exist. Return NULL. The fairness rule: a new arrival must not jump ahead of callers that are already waiting for a bdev_io. This is what makes spdk_bdev_queue_io_wait work: if there are waiters, you must join them.

  3. Cache miss, no waiters. Pull from the global mempool. The mempool's spdk_mempool_get is lock-free for the common case (it has its own per-thread cache too).
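The three branches can be modelled in a few lines. This is a sketch under simplifying assumptions: the cache and the pool are plain arrays used as stacks, the wait queue is reduced to a counter, and all `alloc_model_*` names are hypothetical rather than SPDK identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define ALLOC_MODEL_CACHE_CAP 4

struct alloc_model_io { int id; };

struct alloc_model_channel {
	struct alloc_model_io *cache[ALLOC_MODEL_CACHE_CAP]; /* per-thread cache */
	size_t cache_count;
	size_t waiters;                 /* length of the wait queue */
	struct alloc_model_io **pool;   /* global pool, shared in real life */
	size_t pool_count;
};

static struct alloc_model_io *alloc_model_get_io(struct alloc_model_channel *ch)
{
	if (ch->cache_count > 0) {
		return ch->cache[--ch->cache_count];   /* 1. cache hit */
	}
	if (ch->waiters > 0) {
		return NULL;                           /* 2. do not jump the line */
	}
	if (ch->pool_count > 0) {
		return ch->pool[--ch->pool_count];     /* 3. fall back to the pool */
	}
	return NULL;                                   /* pool exhausted */
}
```

Note that branch 2 returns NULL even when the pool still has elements; that is the whole point of the fairness rule.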

Step 4: the framework dispatches to the module

After the bdev_io is allocated and its fields are filled in, the framework calls bdev_io_submit, which calls _bdev_io_submit, which calls bdev_io_do_submit, which finally calls bdev_submit_request:

The interesting bits:

  • The in_submit_request flag is set around the call. This is the recursion guard: if the module's submit_request synchronously calls spdk_bdev_io_complete, the framework defers the actual completion to avoid stack blowup (see Step 7).

  • If there are nomem-queued I/Os, the new submission is queued behind them. This preserves ordering: a new I/O cannot overtake I/Os that are already waiting for the module to free up resources.

  • The actual module call is one line: bdev_submit_request(bdev, ch, bdev_io), which does bdev->fn_table->submit_request(ioch, bdev_io). The whole framework-to-module interface is that one line.
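As a rough illustration of the flag and the one-line interface, here is a toy dispatch through a function table. The `dispatch_model_*` names and the toy module are hypothetical; the nomem queue is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dispatch_model_io;

/* A module exposes its entry points through a table of function pointers. */
struct dispatch_model_fn_table {
	void (*submit_request)(void *ch, struct dispatch_model_io *io);
};

struct dispatch_model_bdev { const struct dispatch_model_fn_table *fn_table; };

struct dispatch_model_io {
	struct dispatch_model_bdev *bdev;
	bool in_submit_request;
	bool flag_seen_by_module;   /* recorded by the toy module below */
};

static void dispatch_model_submit(void *ch, struct dispatch_model_io *io)
{
	io->in_submit_request = true;                 /* guard goes up */
	io->bdev->fn_table->submit_request(ch, io);   /* the one-line interface */
	io->in_submit_request = false;                /* guard comes down */
}

/* A toy module: it only records what it observed during the call. */
static void dispatch_model_toy_submit(void *ch, struct dispatch_model_io *io)
{
	(void)ch;
	io->flag_seen_by_module = io->in_submit_request;
}

static const struct dispatch_model_fn_table dispatch_model_toy_table = {
	.submit_request = dispatch_model_toy_submit,
};
```

Anything the module does inside its submit_request, including completing the I/O, happens while the flag is still set, which is what lets the completion path detect re-entry.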

Step 5-6: the module dispatches and waits

From the module's perspective, the call is just fn_table->submit_request(ch, bdev_io). What happens inside is entirely up to the module. For NVMe, the module:

  1. Translates the bdev_io to a NVMe command (read, write, flush, etc.).

  2. Picks the NVMe queue pair that belongs to this I/O channel. Each channel has its own queue pair, so the submit path needs no locks.

  3. Writes the command into the submission queue and rings the doorbell register to tell the controller about it.

  4. Returns. The I/O is now in flight on the device.

For malloc:

  1. Picks the source/destination address (the module's own malloc_buf).

  2. Submits an accel copy with the channel's accel_channel.

  3. Returns. The accel framework will call malloc_done when the copy completes.

For AIO (Linux kernel aio):

  1. Calls io_submit on the channel's io_context.

  2. Returns. The channel's poller later reaps the completion (in interrupt mode the kernel signals it through an eventfd), and the module calls spdk_bdev_io_complete.

The key idea: all three modules return immediately. The I/O is in flight somewhere outside the module's stack. The module is now in "wait" state, and the framework is free to do other work.
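The submit-now, complete-later shape shared by all three modules can be sketched with a toy device. Everything here is an assumption made for illustration: the `async_model_*` names are hypothetical, and the toy device finishes every in-flight I/O by the next poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ASYNC_MODEL_QUEUE_DEPTH 8

struct async_model_io { bool completed; };

struct async_model_device {
	struct async_model_io *inflight[ASYNC_MODEL_QUEUE_DEPTH];
	size_t count;
};

/* Submit: record the I/O as in flight and return at once. */
static int async_model_submit(struct async_model_device *dev, struct async_model_io *io)
{
	if (dev->count == ASYNC_MODEL_QUEUE_DEPTH) {
		return -1;   /* queue full; a real module would report NOMEM */
	}
	io->completed = false;
	dev->inflight[dev->count++] = io;
	return 0;
}

/* Poller: runs later from the thread's poll loop and reaps finished I/Os.
 * Returns how many completions it processed. */
static size_t async_model_poll(struct async_model_device *dev)
{
	size_t reaped = dev->count;

	for (size_t i = 0; i < dev->count; i++) {
		dev->inflight[i]->completed = true;   /* stands in for the completion call */
	}
	dev->count = 0;
	return reaped;
}
```

Between the submit and the poll, nothing on the stack refers to the I/O; the thread is free to submit more work or run other pollers.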

The malloc example in detail

For the malloc module, the submit path goes through the accel framework. Here's the read path:

The key call is spdk_accel_append_copy: the malloc module is asking the accel framework to copy from malloc_buf + offset (the bdev's storage) into bdev_io->u.bdev.iovs (the caller's buffer). When the copy completes, the accel framework calls malloc_sequence_done, which calls malloc_done, which calls spdk_bdev_io_complete.
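To make the data movement concrete, here is a minimal sketch of "copy from storage + offset into the caller's iovs" using plain memcpy. It deliberately bypasses any accel framework, and the `ramdisk_model_*` names are hypothetical; only `struct iovec` is a real (POSIX) type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Copy 'len' bytes starting at src into the scatter list.
 * Returns the number of bytes actually copied. */
static size_t ramdisk_model_copy_to_iovs(const uint8_t *src, size_t len,
					 const struct iovec *iovs, int iovcnt)
{
	size_t done = 0;

	for (int i = 0; i < iovcnt && done < len; i++) {
		size_t n = iovs[i].iov_len;

		if (n > len - done) {
			n = len - done;
		}
		memcpy(iovs[i].iov_base, src + done, n);
		done += n;
	}
	return done;
}

/* Read path of a toy RAM disk: storage base plus block offset. */
static size_t ramdisk_model_read(const uint8_t *storage, uint32_t block_size,
				 uint64_t offset_blocks, uint64_t num_blocks,
				 const struct iovec *iovs, int iovcnt)
{
	return ramdisk_model_copy_to_iovs(storage + offset_blocks * block_size,
					  (size_t)(num_blocks * block_size),
					  iovs, iovcnt);
}
```

A write would be the mirror image: walk the iovs and copy into storage at the same offset.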

Step 7-8: completion and the caller callback

When the device finishes, the module calls spdk_bdev_io_complete:

Several things happen here, in order:

  1. Double-completion assert. If the bdev_io is no longer in the pending state, the framework asserts. A module that calls spdk_bdev_io_complete twice for the same bdev_io crashes a debug build on the spot. (This is intentional: the alternative is silent resource corruption.)

  2. Decrement the channel's io_outstanding. This drives the queue depth poller and the QoS logic.

  3. Run the accel sequence (if any). If the bdev_io has an accel_sequence attached and the I/O was successful, the framework runs the sequence (e.g. decompression, decryption) before the caller's callback.

  4. Push the bounce buffer (if any). If the I/O used a bounce buffer, the framework copies the aligned data back to the user's possibly-misaligned buffer.

  5. Handle NOMEM retry. If this I/O itself returned NOMEM and is the kind of I/O that can be retried, queue it for later.

  6. Call the caller's callback. This happens in bdev_io_complete:

Two important things:

  1. The recursion guard. If we're still inside the original submit_request call, the completion is deferred via spdk_thread_send_msg. This prevents unbounded stack growth when the user's completion callback issues a new I/O.

  2. Latency tracking. The framework measures tsc_diff = tsc_now - submit_tsc and uses it for the per-bdev I/O statistics and the optional histogram.
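The recursion guard can be sketched as follows. This is a simplified model, not the framework's code: a small array stands in for the thread's message queue, and the `defer_model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_MODEL_MSG_CAP 8

struct defer_model_io {
	bool in_submit_request;
	int cb_calls;   /* how many times the caller's callback ran */
};

/* A toy per-thread message queue; stands in for sending a thread message. */
struct defer_model_thread {
	struct defer_model_io *msgs[DEFER_MODEL_MSG_CAP];
	size_t count;
};

static void defer_model_run_callback(struct defer_model_io *io)
{
	io->cb_calls++;
}

static void defer_model_complete(struct defer_model_thread *t, struct defer_model_io *io)
{
	if (io->in_submit_request) {
		assert(t->count < DEFER_MODEL_MSG_CAP);
		t->msgs[t->count++] = io;   /* defer: let the submit stack unwind first */
		return;
	}
	defer_model_run_callback(io);
}

/* A module that completes synchronously from inside its submit path. */
static void defer_model_submit_sync(struct defer_model_thread *t, struct defer_model_io *io)
{
	io->in_submit_request = true;
	defer_model_complete(t, io);
	io->in_submit_request = false;
}

/* Drain deferred completions; run from the thread's poll loop. */
static size_t defer_model_drain(struct defer_model_thread *t)
{
	size_t n = t->count;

	t->count = 0;
	for (size_t i = 0; i < n; i++) {
		defer_model_run_callback(t->msgs[i]);
	}
	return n;
}
```

The stack depth stays constant no matter how many I/Os complete synchronously, because each callback runs from the poll loop rather than from inside the submit that triggered it.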

Finally, the caller's callback is invoked. The framework does this in _bdev_io_complete:

The callback signature is spdk_bdev_io_completion_cb(bdev_io, success, cb_arg):

  • success is a bool, not a status code. If you need the specific status (e.g. NVME_ERROR vs MISCOMPARE), use spdk_bdev_io_get_nvme_status() or spdk_bdev_io_get_scsi_status().

  • cb_arg is the caller's private context, whatever they passed to spdk_bdev_read.

  • The bdev_io is passed in too — the caller can read the results of a read (the data is in the iovs that were submitted), inspect the status, or chain another I/O.

Step 9: the bdev_io returns to the pool

The caller's callback is done. The bdev_io needs to go back to the mempool. This is spdk_bdev_free_io:

Three branches:

  1. If the bdev_io has a buffer attached (because spdk_bdev_io_get_buf or spdk_bdev_io_set_buf attached one on the I/O's behalf), free the buffer first. This is the "you allocated a buffer for me" path.

  2. Per-thread cache has space. Push the bdev_io onto the cache. This is the common case. Also, while the cache isn't full and there are waiters, wake one of them up — the bdev_io that was just freed might let a waiter proceed.

  3. Per-thread cache is full. Put the bdev_io back into the global mempool. The framework asserts there are no waiters (because the only way to be in this branch is if the cache is full, which means waiters have been serviced).

The waiters loop is what makes the cache-and-pool design actually fair. If one consumer is parked on the io_wait_queue and another consumer on the same thread finishes an I/O and frees a bdev_io, the act of freeing wakes the waiter. This is "give back" semantics: you took an io, you must give it back, and the act of giving it back can unblock someone else.
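A compact model of the free path, covering the cache branch, the waiter wake-up, and the pool branch. It assumes that every woken waiter immediately retries and takes one bdev_io out of the cache; the `free_model_*` names are hypothetical and the wait queue is reduced to counters.

```c
#include <assert.h>
#include <stddef.h>

#define FREE_MODEL_CACHE_CAP 2
#define FREE_MODEL_POOL_CAP  8

struct free_model_io { int id; };

struct free_model_channel {
	struct free_model_io *cache[FREE_MODEL_CACHE_CAP];
	size_t cache_count;
	size_t waiters;   /* callers parked on the wait queue */
	size_t woken;     /* how many waiters have been notified */
	struct free_model_io *pool[FREE_MODEL_POOL_CAP];
	size_t pool_count;
};

static void free_model_free_io(struct free_model_channel *ch, struct free_model_io *io)
{
	if (ch->cache_count < FREE_MODEL_CACHE_CAP) {
		ch->cache[ch->cache_count++] = io;
		/* Wake waiters while the cache has something to offer them. */
		while (ch->cache_count > 0 && ch->waiters > 0) {
			ch->waiters--;
			ch->woken++;
			ch->cache_count--;   /* assumption: the waiter's retry takes one io */
		}
	} else {
		assert(ch->waiters == 0);   /* a full cache implies nobody is waiting */
		assert(ch->pool_count < FREE_MODEL_POOL_CAP);
		ch->pool[ch->pool_count++] = io;
	}
}
```

The assert in the else branch mirrors the invariant described above: waiters drain the cache, so a full cache and a non-empty wait queue cannot coexist.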

The split: read-modify-write optimization

Sometimes a single submit would exceed the bdev's maximum I/O size, or cross its optimal I/O boundary, or both. The framework handles this by splitting the I/O into child bdev_ios, each within limits, and treating the parent as a "wait for all children" object.

The split is gated by a flag in spdk_bdev_io.internal.f.split:

The split logic is complex — it has to walk the iov array, accumulate child I/Os up to various limits, and account for metadata. But the high-level idea is simple: a 4 MB read on a bdev with a 1 MB max_rw_size becomes 4 child reads, each 1 MB, all in parallel. The parent's callback only fires when all 4 children have completed.

Each child is a separate bdev_io; the iovecs that describe the children's buffers live in the parent's child_iov[] array (32 elements, defined by SPDK_BDEV_IO_NUM_CHILD_IOV). The parent tracks internal.split.outstanding, a counter that decrements as each child completes. The parent's callback fires when the counter hits zero and nothing is left to split.
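The arithmetic of the split and the outstanding counter can be checked with a small model. It ignores iovecs, metadata and boundary alignment and only splits on a maximum size; the `split_model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct split_model_parent {
	uint64_t offset_blocks;
	uint64_t num_blocks;
	uint32_t outstanding;   /* children submitted but not yet completed */
	int cb_calls;           /* how often the parent's callback fired */
};

/* Carve the parent into children of at most max_blocks each.
 * Returns the number of children; each one bumps 'outstanding'. */
static uint32_t split_model_split(struct split_model_parent *p, uint64_t max_blocks)
{
	uint64_t remaining = p->num_blocks;
	uint32_t children = 0;

	assert(max_blocks > 0);
	while (remaining > 0) {
		uint64_t n = remaining < max_blocks ? remaining : max_blocks;

		/* A real child would cover the next n blocks of the parent. */
		remaining -= n;
		p->outstanding++;
		children++;
	}
	return children;
}

/* Called once per finished child; the last one fires the parent callback. */
static void split_model_child_done(struct split_model_parent *p)
{
	assert(p->outstanding > 0);
	if (--p->outstanding == 0) {
		p->cb_calls++;
	}
}
```

With 512-byte blocks, the 4 MB read from the text is 8192 blocks and a 1 MB limit is 2048 blocks, so the model produces four children and fires the parent callback only after the fourth completes.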

The bounce buffer: alignment handling

Some bdevs have alignment requirements (required_alignment > 0). If the caller's buffer doesn't meet them, the framework automatically:

  1. Allocates an aligned buffer from the iobuf pool.

  2. For a read: issues the read into the aligned buffer, then copies the data back to the caller's buffer before the caller's callback.

  3. For a write: copies the caller's data into the aligned buffer, issues the write from there, then frees the aligned buffer in the completion path.

The flag internal.f.has_bounce_buf is set on the bdev_io when this path is taken. The framework checks it in spdk_bdev_io_complete (around line 8096; see lib/bdev/bdev.c:8070) and routes the completion through _bdev_io_push_bounce_data_buffer for the copy-back step.

The cost is one memory copy per misaligned I/O. At millions of IOPS that cost is significant, so most SPDK applications pre-allocate aligned buffers.
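A minimal sketch of the read-side bounce, assuming a 64-byte alignment requirement and using the C11 `aligned_alloc` in place of the iobuf pool. The `bounce_model_*` names are hypothetical and the toy device just fills its destination with a pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* True when 'buf' is aligned to 'alignment' bytes (a power of two). */
static bool bounce_model_is_aligned(const void *buf, size_t alignment)
{
	return ((uintptr_t)buf & (alignment - 1)) == 0;
}

/* Toy device read: insists on an aligned destination, fills it with 0x5A. */
static void bounce_model_device_read(uint8_t *dst, size_t len)
{
	assert(bounce_model_is_aligned(dst, 64));
	memset(dst, 0x5A, len);
}

/* Read into 'user_buf', bouncing through an aligned buffer when needed.
 * Returns true if a bounce buffer was used, i.e. an extra copy was paid. */
static bool bounce_model_read(uint8_t *user_buf, size_t len)
{
	if (bounce_model_is_aligned(user_buf, 64)) {
		bounce_model_device_read(user_buf, len);
		return false;
	}
	/* aligned_alloc wants the size to be a multiple of the alignment. */
	size_t padded = (len + 63) & ~(size_t)63;
	uint8_t *bounce = aligned_alloc(64, padded);

	assert(bounce != NULL);
	bounce_model_device_read(bounce, len);
	memcpy(user_buf, bounce, len);   /* the copy-back before the callback */
	free(bounce);
	return true;
}
```

The return value makes the cost visible: an aligned caller buffer takes the zero-copy branch, a misaligned one pays for an allocation and a memcpy.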

Edge cases: what if the module forgets to complete?

These are the things that have actually broken production SPDK applications.

What if submit_request returns without completing?

The framework cannot detect this. The bdev_io_do_submit function at lib/bdev/bdev.c:2944 calls the module's submit_request directly and trusts it. If the module returns without calling spdk_bdev_io_complete and without queuing async work that eventually will, the bdev_io is leaked: it has been taken from the cache but is never freed. The io_outstanding counter is never decremented, which eventually blocks new I/O. The process appears to hang with no obvious cause.

The defense: every code path in a module's submit_request, error paths included, must either call spdk_bdev_io_complete() or hand the bdev_io to asynchronous work that will. The malloc module does this: the switch statement in _bdev_malloc_submit_request has a default case that completes with FAILED. The passthru module does this too: every submit returns rc, and rc != 0 triggers a complete with FAILED.

What if submit_request completes synchronously from inside itself?

This is fine — and common. NVMe modules often complete immediately if the queue is full (returning NOMEM so the framework can retry). Malloc completes synchronously for read with NULL buffer, for reset, for flush, for abort, for zcopy start, etc.

The in_submit_request recursion guard defers the actual caller's callback to a spdk_thread_send_msg, which is what makes this safe.

What if the mempool is exhausted?

The submit returns -ENOMEM. The caller must handle it. For consumers that submit many I/Os in a loop, the standard pattern is to use spdk_bdev_queue_io_wait: register a callback that retries the submit, and the framework will fire it when a bdev_io is freed. The passthru module uses exactly this pattern (see module/bdev/passthru/vbdev_passthru.c:200).
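The retry pattern can be sketched as a self-contained model: a submit that can fail with -ENOMEM, a wait entry carrying a callback, and a free path that wakes the oldest waiter. The `wait_model_*` names are hypothetical; the real API's types and signatures differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define WAIT_MODEL_CAP 4

struct wait_model_entry {
	void (*cb_fn)(void *cb_arg);
	void *cb_arg;
};

struct wait_model_channel {
	int free_ios;                                   /* bdev_ios available */
	struct wait_model_entry *waitq[WAIT_MODEL_CAP]; /* FIFO of parked callers */
	size_t wait_count;
};

struct wait_model_request {
	struct wait_model_channel *ch;
	struct wait_model_entry wait;
	int submitted;
};

static int wait_model_submit(struct wait_model_channel *ch)
{
	if (ch->free_ios == 0) {
		return -ENOMEM;
	}
	ch->free_ios--;
	return 0;
}

/* Try to submit; on -ENOMEM, park and ask to be called again later. */
static void wait_model_try_submit(void *arg)
{
	struct wait_model_request *req = arg;

	if (wait_model_submit(req->ch) == -ENOMEM) {
		req->wait.cb_fn = wait_model_try_submit;
		req->wait.cb_arg = req;
		assert(req->ch->wait_count < WAIT_MODEL_CAP);
		req->ch->waitq[req->ch->wait_count++] = &req->wait;
		return;
	}
	req->submitted = 1;
}

/* Freeing an io wakes the oldest waiter, which retries its submit. */
static void wait_model_free_io(struct wait_model_channel *ch)
{
	ch->free_ios++;
	if (ch->wait_count > 0) {
		struct wait_model_entry *e = ch->waitq[0];

		for (size_t i = 1; i < ch->wait_count; i++) {
			ch->waitq[i - 1] = ch->waitq[i];
		}
		ch->wait_count--;
		e->cb_fn(e->cb_arg);
	}
}
```

The wait entry lives inside the request, so parking needs no allocation; that matters because the reason for parking is that memory has run out.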

What if two threads try to complete the same bdev_io?

The framework asserts. Modules are required to call spdk_bdev_io_complete exactly once. The assertion is in lib/bdev/bdev.c:8077: assert(bdev_io->internal.status == SPDK_BDEV_IO_STATUS_PENDING).

What if the caller's callback submits a new I/O?

Allowed, but it must be done on the same thread. The framework's recursion guard defers the caller's callback if it would otherwise re-enter the submit path. This is what makes polling feasible: an nvmf target's callback can submit a new bdev_io, which can complete in the same reactor iteration, and the reactor never blows its stack.

What if the bdev is removed mid-I/O?

The bdev's internal.status moves to SPDK_BDEV_STATUS_REMOVING, and the framework rejects new submissions (returns -ENODEV or -EAGAIN). In-flight I/Os continue. Their callbacks fire as normal. The desc is closed by the consumer in response to the SPDK_BDEV_EVENT_REMOVE event.

What if the caller's thread changes between submit and complete?

Can't happen. The bdev_io's callback is invoked on the thread that submitted the I/O. The framework asserts this: spdk_get_thread() == spdk_bdev_io_get_thread(bdev_io). If you want to handle completion on a different thread, you have to marshal it with spdk_thread_send_msg in your own callback.

What if the caller's callback takes a long time?

The framework doesn't care, but your reactor will. The callback runs on the same thread that submitted the I/O, which is the same thread that's polling the reactor. If the callback does 100 µs of work, the reactor can't process other I/Os during that time. This is why SPDK applications avoid syscalls and sleeps in completion callbacks.

What to take away

A bdev_io is born in a per-thread cache, lives through one or more device round-trips, dies in a callback, and goes back to the cache. The framework handles the mempool, the queue depth tracking, the latency measurement, the double-completion assertion, the recursion guard, the alignment bounce, the split, and the NOMEM retry. The module handles exactly two things: the type-specific dispatch in submit_request and the eventual spdk_bdev_io_complete. The contract is small and the framework does the rest.

Next: the bdev hierarchy — what happens when a bdev sits on top of another bdev, and how the I/O flows through the stack.