One I/O, end to end.
A bdev_io lives a long life for a structure that gets
recycled from a mempool. It's allocated when you call
spdk_bdev_read, it survives a bounce-buffer
copy, a possible split into child I/Os, a queue wait for
memory, a trip through the module's dispatch, an NVMe
round-trip or two, an accel sequence, and then a free
back to the pool. This page traces every step.
- The end-to-end path in nine steps
- Step 1-2: the caller's submit and the framework's allocation
- Step 3: the per-thread bdev_io cache and the global pool
- Step 4: the framework dispatches to the module
- Step 5-6: the module dispatches and waits
- Step 7-8: completion and the caller callback
- Step 9: the bdev_io returns to the pool
- The split: read-modify-write optimization
- The bounce buffer: alignment handling
- Edge cases: what if the module forgets to complete?
The end-to-end path in nine steps
Here's the whole thing in one diagram. We'll go through each step in detail below.
flowchart TB
  A["caller: spdk_bdev_read_blocks_ext"] --> B["bdev_io in per-thread cache?"]
  B -- "yes" --> D["Take from cache"]
  B -- "no, no waiters" --> E["mempool_get"]
  B -- "no, but waiters exist" --> F["Return -ENOMEM"]
  E --> G["Initialize bdev_io fields"]
  D --> G
  G --> H["Bounce needed?"]
  H -- "yes" --> I["accel copy to aligned buffer"]
  H -- "no" --> J["Call fn_table->submit_request"]
  I --> J
  J --> K["Module dispatches to device"]
  K --> L["Wait for completion"]
  L --> M["Module calls spdk_bdev_io_complete"]
  M --> N["Complete from submit context?"]
  N -- "yes" --> O["spdk_thread_send_msg to defer"]
  N -- "no" --> P["Update stats, run caller cb"]
  O --> P
  P --> Q["Caller cb returns"]
  Q --> R["spdk_bdev_free_io"]
  R --> S["Cache full?"]
  S -- "no" --> T["Return to per-thread cache"]
  S -- "yes" --> U[mempool_put]
  classDef fail fill:#f5d6e0,stroke:#8a1c4f;
  classDef ok fill:#d6f5d6,stroke:#2a6f2a;
  classDef sync fill:#cfe1ff,stroke:#1c4f8a;
  class F fail
  class T,U ok
  classDefault sync
fig. 1 The full bdev_io lifecycle. The pink box is the failure branch (the -ENOMEM return). The two green boxes are the free-back-to-pool paths. Everything else is the normal happy path.
Step 1-2: the caller's submit and the framework's allocation
The caller has a spdk_bdev_desc * and a
spdk_io_channel *. Both must be valid and
on the same thread. The caller then invokes one of the
public I/O functions, e.g. spdk_bdev_read_blocks:
Three things to notice:
- The validation happens first. bdev_io_valid_blocks checks that offset_blocks and num_blocks are within the bdev's range and that num_blocks is non-zero. If the check fails, the function returns -EINVAL and the caller never gets a callback. This is the first place a bug shows up.
- The bdev_io is allocated before the I/O is submitted. If bdev_channel_get_io returns NULL (because the per-thread cache is empty and either waiters exist or the global pool is exhausted), the function returns -ENOMEM and the caller never gets a callback. The caller is expected to handle this. (The "waiters exist" branch is part of the fairness design; see Step 3.)
- All the I/O parameters go in u.bdev. Not into a malloc'd sidecar. The bdev_io is a single allocation, and the data fits in the union.
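The three observations above can be sketched in a few lines of standalone C. This is a toy model, not the SPDK source: `model_bdev`, `model_io`, `model_channel`, `model_valid_blocks`, and `model_read_blocks` are hypothetical names, and the range check simply follows the description in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real SPDK structures. */
struct model_bdev {
    uint64_t blockcnt;              /* size of the device in blocks */
};

struct model_io {
    /* Plays the role of u.bdev: the parameters live inside the I/O itself. */
    struct {
        void *buf;
        uint64_t offset_blocks;
        uint64_t num_blocks;
    } bdev;
};

struct model_channel {
    struct model_io *free_io;       /* NULL models "no bdev_io available" */
};

/* Range check as the text describes it: non-zero, no wraparound, in range. */
static bool model_valid_blocks(const struct model_bdev *bdev,
                               uint64_t offset_blocks, uint64_t num_blocks)
{
    if (num_blocks == 0) {
        return false;
    }
    if (offset_blocks + num_blocks < offset_blocks) {
        return false;               /* the sum wrapped around */
    }
    return offset_blocks + num_blocks <= bdev->blockcnt;
}

/* Returns 0 once the I/O is accepted; a negative errno means no callback. */
static int model_read_blocks(struct model_bdev *bdev, struct model_channel *ch,
                             void *buf, uint64_t offset_blocks, uint64_t num_blocks)
{
    struct model_io *io;

    assert(bdev != NULL && ch != NULL);
    if (!model_valid_blocks(bdev, offset_blocks, num_blocks)) {
        return -EINVAL;             /* validation comes first */
    }
    io = ch->free_io;
    if (io == NULL) {
        return -ENOMEM;             /* allocation comes before submission */
    }
    ch->free_io = NULL;
    io->bdev.buf = buf;             /* everything goes into the union-like slot */
    io->bdev.offset_blocks = offset_blocks;
    io->bdev.num_blocks = num_blocks;
    return 0;
}
```

The point to take from the model: both failure returns happen before anything is in flight, so the caller's completion callback is never involved.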
Step 3: the per-thread bdev_io cache and the global pool
The bdev_io mempool is created during
spdk_bdev_initialize:
The size is the size of the bdev_io header plus the
largest get_ctx_size() across all loaded
modules. The default pool size is 65535; you can change
it with the bdev_set_options RPC. Each
element is spdk_bdev_io + module scratch
space, cache-line aligned.
Per-thread, the framework maintains a small cache:
Three branches, in order:
- Per-thread cache hit. Fast. The default cache size is 64 (set by g_bdev_opts.bdev_io_cache_size in bdev.c:147). This is the common case at steady state.
- Cache miss, but waiters exist. Return NULL. The fairness rule: a thread that is going to wait shouldn't have new arrivals jump in front of it. This is what makes spdk_bdev_queue_io_wait work: if there are waiters, you must join them.
- Cache miss, no waiters. Pull from the global mempool. The mempool's spdk_mempool_get is lock-free for the common case (it has its own per-thread cache too).
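The three branches can be modeled as follows. This is a minimal sketch with hypothetical types (`model_cache`, `model_pool`) and a plain linked list standing in for both the per-thread cache and the global mempool; the real structures and locking are not reproduced.

```c
#include <assert.h>
#include <stddef.h>

struct model_io {
    struct model_io *next;
};

/* Hypothetical per-thread state: a free list plus a count of waiters. */
struct model_cache {
    struct model_io *head;
    unsigned count;
    unsigned waiters;               /* entries parked on the wait queue */
};

/* Hypothetical stand-in for the global mempool. */
struct model_pool {
    struct model_io *head;
};

static struct model_io *model_pool_get(struct model_pool *pool)
{
    struct model_io *io = pool->head;

    if (io != NULL) {
        pool->head = io->next;
        io->next = NULL;
    }
    return io;
}

static struct model_io *model_get_io(struct model_cache *cache, struct model_pool *pool)
{
    struct model_io *io;

    /* Branch 1: per-thread cache hit. */
    if (cache->head != NULL) {
        assert(cache->count > 0);
        io = cache->head;
        cache->head = io->next;
        cache->count--;
        io->next = NULL;
        return io;
    }
    /* Branch 2: cache miss, but someone is already waiting. Do not jump the queue. */
    if (cache->waiters > 0) {
        return NULL;
    }
    /* Branch 3: cache miss, no waiters. Fall back to the global pool. */
    return model_pool_get(pool);
}
```

Note that branch 2 returns NULL even though the pool may still hold elements; that is the fairness rule, not an exhaustion check.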
Step 4: the framework dispatches to the module
After the bdev_io is allocated and its fields are
filled in, the framework calls
bdev_io_submit which calls
_bdev_io_submit which calls
bdev_io_do_submit which calls
bdev_submit_request:
The interesting bits:
- The in_submit_request flag is set around the call. This is the recursion guard: if the module's submit_request synchronously calls spdk_bdev_io_complete, the framework defers the actual completion to avoid stack blowup (see Step 7).
- If there are nomem-queued I/Os, the new submission is queued behind them. This prevents a low-priority consumer from monopolizing the I/O path when a high-priority consumer is starved of memory.
- The actual module call is one line: bdev_submit_request(bdev, ch, bdev_io), which does bdev->fn_table->submit_request(ioch, bdev_io). The whole framework-to-module interface is that one line.
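A rough model of those three points, assuming a simplified function table and a counter in place of the real nomem queue. All names here (`model_fn_table`, `model_io_submit`, `toy_submit_request`, the `saw_guard` test hook) are illustrative and do not exist in SPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_io;

/* The module's side of the contract: one function pointer. */
struct model_fn_table {
    void (*submit_request)(void *ioch, struct model_io *io);
};

struct model_bdev {
    const struct model_fn_table *fn_table;
};

struct model_channel {
    unsigned nomem_queued;          /* I/Os already parked waiting for memory */
};

struct model_io {
    bool in_submit_request;         /* recursion guard */
    bool queued;                    /* parked behind earlier NOMEM I/Os */
    bool saw_guard;                 /* test hook: what the module observed */
};

static void model_io_submit(struct model_bdev *bdev, struct model_channel *ch,
                            void *ioch, struct model_io *io)
{
    assert(bdev->fn_table != NULL);
    if (ch->nomem_queued > 0) {
        /* Keep ordering: new work goes behind what is already waiting. */
        ch->nomem_queued++;
        io->queued = true;
        return;
    }
    io->in_submit_request = true;
    bdev->fn_table->submit_request(ioch, io);   /* the whole module interface */
    io->in_submit_request = false;
}

/* A toy module that only records whether the guard was set during the call. */
static void toy_submit_request(void *ioch, struct model_io *io)
{
    (void)ioch;
    io->saw_guard = io->in_submit_request;
}

static const struct model_fn_table toy_fn_table = {
    .submit_request = toy_submit_request,
};
```

The guard is only visible from inside the module call, which is exactly the window in which a synchronous completion would otherwise recurse.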
Step 5-6: the module dispatches and waits
From the module's perspective, the call is just
fn_table->submit_request(ch, bdev_io).
What happens inside is entirely up to the module.
For NVMe, the module:
- Translates the bdev_io to an NVMe command (read, write, flush, etc.).
- Picks an NVMe submission queue (usually round-robin across queues, then pinned to a queue for the same channel).
- Writes the command into the submission queue and rings the queue's doorbell register.
- Returns. The I/O is now in flight on the device.
For malloc:
- Picks the source/destination address (the module's own malloc_buf).
- Submits an accel copy with the channel's accel_channel.
- Returns. The accel framework will call malloc_done when the copy completes.
For AIO (Linux kernel aio):
- Calls io_submit on the channel's io_context.
- Returns. The kernel will signal completion via an eventfd, the channel's poller will pick it up, and the module will call spdk_bdev_io_complete.
The key idea: all three modules return immediately. The I/O is in flight somewhere outside the module's stack. The module is now in "wait" state, and the framework is free to do other work.
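The submit-then-poll shape shared by all three modules can be sketched with a toy device. This is an assumption-laden illustration: `toy_dev`, `toy_submit`, and `toy_poll` are made-up names, and the "device" finishes everything on the next poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 4

struct toy_io {
    bool done;
};

/* Hypothetical device: holds pointers to I/Os that are in flight. */
struct toy_dev {
    struct toy_io *inflight[TOY_QUEUE_DEPTH];
    unsigned count;
};

/* Submit only queues the I/O and returns; nothing completes here. */
static int toy_submit(struct toy_dev *dev, struct toy_io *io)
{
    if (dev->count == TOY_QUEUE_DEPTH) {
        return -1;                  /* queue full */
    }
    io->done = false;
    dev->inflight[dev->count++] = io;
    return 0;
}

/* The poller runs later on the same thread and reaps completions. */
static unsigned toy_poll(struct toy_dev *dev)
{
    unsigned i, completed = dev->count;

    for (i = 0; i < completed; i++) {
        assert(dev->inflight[i] != NULL);
        dev->inflight[i]->done = true;   /* where the module would complete the bdev_io */
        dev->inflight[i] = NULL;
    }
    dev->count = 0;
    return completed;
}
```

Between `toy_submit` returning and `toy_poll` running, the calling thread is free to submit more work or service other pollers.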
The malloc example in detail
For the malloc module, the submit path goes through the accel framework. Here's the read path:
The key call is spdk_accel_append_copy:
the malloc module is asking the accel framework to
copy from malloc_buf + offset (the
bdev's storage) into bdev_io->u.bdev.iovs
(the caller's buffer). When the copy completes, the
accel framework calls malloc_sequence_done,
which calls malloc_done, which calls
spdk_bdev_io_complete.
Step 7-8: completion and the caller callback
When the device finishes, the module calls
spdk_bdev_io_complete:
Several things happen here, in order:
- Double-completion assert. If the bdev_io is already in a non-pending state, the framework asserts. Modules that call spdk_bdev_io_complete twice for the same bdev_io crash the process. (This is intentional; the alternative is silent resource corruption.)
- Decrement the channel's io_outstanding. This drives the queue depth poller and the QoS logic.
- Run the accel sequence (if any). If the bdev_io has an accel_sequence attached and the I/O was successful, the framework runs the sequence (e.g. decompression, decryption) before the caller's callback.
- Push the bounce buffer (if any). If the I/O used a bounce buffer, the framework copies the aligned data back to the user's possibly-misaligned buffer.
- Handle NOMEM retry. If this I/O itself returned NOMEM and is the kind of I/O that can be retried, queue it for later.
- Call the caller's callback. This happens in bdev_io_complete:
Two important things:
- The recursion guard. If we're still inside the original submit_request call, the completion is deferred via spdk_thread_send_msg. This prevents unbounded stack growth when the user's completion callback issues a new I/O.
- Latency tracking. The framework measures tsc_diff = tsc_now - submit_tsc and uses it for the per-bdev I/O statistics and the optional histogram.
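Both points fit in a small model. The sketch below assumes a fixed-size message array in place of spdk_thread_send_msg and a manually set timestamp in place of the TSC; `model_thread`, `model_io_complete`, `model_thread_poll`, and `count_cb` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_MSGS 8

struct model_io;
typedef void (*model_cb)(struct model_io *io, bool success, void *cb_arg);

struct model_io {
    bool in_submit_request;         /* true while the module's submit is on the stack */
    bool success;
    uint64_t submit_tsc;
    uint64_t tsc_diff;              /* latency recorded at completion */
    model_cb cb;
    void *cb_arg;
};

/* Hypothetical thread: a message queue plus a fake timestamp counter. */
struct model_thread {
    struct model_io *msgs[MODEL_MAX_MSGS];
    unsigned count;
    uint64_t tsc_now;
};

static void model_finish(struct model_thread *t, struct model_io *io)
{
    io->tsc_diff = t->tsc_now - io->submit_tsc;
    io->cb(io, io->success, io->cb_arg);
}

static void model_io_complete(struct model_thread *t, struct model_io *io, bool success)
{
    io->success = success;
    if (io->in_submit_request) {
        /* Still inside submit_request: defer instead of calling back now. */
        assert(t->count < MODEL_MAX_MSGS);
        t->msgs[t->count++] = io;
        return;
    }
    model_finish(t, io);
}

/* Runs deferred completions from a clean stack. */
static void model_thread_poll(struct model_thread *t)
{
    struct model_io *batch[MODEL_MAX_MSGS];
    unsigned i, n = t->count;

    memcpy(batch, t->msgs, n * sizeof(batch[0]));
    t->count = 0;                   /* callbacks may queue new messages safely */
    for (i = 0; i < n; i++) {
        model_finish(t, batch[i]);
    }
}

/* Example callback: counts invocations through cb_arg. */
static void count_cb(struct model_io *io, bool success, void *cb_arg)
{
    (void)io;
    (void)success;
    (*(int *)cb_arg)++;
}
```

The deferred path costs one extra trip through the message queue, which is the price of keeping the stack depth bounded.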
Finally, the caller's callback is invoked. The
framework does this in _bdev_io_complete:
The callback signature is
spdk_bdev_io_completion_cb(bdev_io, success, cb_arg):
- success is a bool, not a status code. If you need the specific status (e.g. NVME_ERROR vs MISCOMPARE), use spdk_bdev_io_get_nvme_status() or spdk_bdev_io_get_scsi_status().
- cb_arg is the caller's private context, whatever they passed to spdk_bdev_read.
- The bdev_io is passed in too, so the caller can read the results of a read (the data is in the iovs that were submitted), inspect the status, or chain another I/O.
Step 9: the bdev_io returns to the pool
The caller's callback is done. The bdev_io needs to
go back to the mempool. This is
spdk_bdev_free_io:
Three branches:
- If the bdev_io has a buffer attached (because the caller called spdk_bdev_io_get_buf or spdk_bdev_io_set_buf), free the buffer first. This is the "you allocated a buffer for me" path.
- Per-thread cache has space. Push the bdev_io onto the cache. This is the common case. Also, while the cache isn't full and there are waiters, wake one of them up: the bdev_io that was just freed might let a waiter proceed.
- Per-thread cache is full. Put the bdev_io back into the global mempool. The framework asserts there are no waiters (because the only way to be in this branch is if the cache is full, which means waiters have been serviced).
The waiters loop is what makes the cache-and-pool design actually fair. If a high-priority thread is starved and sleeping on the io_wait_queue, and a lower-priority thread finishes an I/O and frees a bdev_io, the lower-priority thread does the work of waking the waiter. This is "give back" semantics: you took an io, you must give it back, and the act of giving it back can unblock someone else.
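The cache-or-pool decision and the waiter wake-up can be modeled like this. It is a sketch under a simplifying assumption: a woken waiter immediately takes one cached bdev_io, where the real waiter callback would typically resubmit its I/O. The types and the `woken` counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_io {
    struct model_io *next;
};

/* Hypothetical per-thread cache with a capacity and a waiter count. */
struct model_cache {
    struct model_io *head;
    unsigned count;
    unsigned size;                  /* capacity of the per-thread cache */
    unsigned waiters;
    unsigned woken;                 /* how many waiters this thread has serviced */
};

struct model_pool {
    unsigned free_count;            /* stands in for the global mempool */
};

static void model_free_io(struct model_cache *cache, struct model_pool *pool,
                          struct model_io *io)
{
    if (cache->count < cache->size) {
        /* Common case: keep the bdev_io on this thread. */
        io->next = cache->head;
        cache->head = io;
        cache->count++;

        /* Give back: each cached element can unblock one waiter. */
        while (cache->count > 0 && cache->waiters > 0) {
            cache->waiters--;
            cache->woken++;
            cache->head = cache->head->next;    /* the waiter consumes it */
            cache->count--;
        }
    } else {
        /* Cache full: waiters must already have been serviced. */
        assert(cache->waiters == 0);
        pool->free_count++;
    }
}
```

Freeing an I/O is therefore never just bookkeeping; it is also the event that drives the wait queue forward.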
The split: read-modify-write optimization
Sometimes a single submit would exceed the bdev's maximum size, or cross an alignment boundary, or both. The framework handles this by splitting the I/O into child bdev_ios, each within limits, and treating the parent as a "wait for all children" object.
The split is gated by a flag in
spdk_bdev_io.internal.f.split:
The split logic is complex — it has to walk the iov array, accumulate child I/Os up to various limits, and account for metadata. But the high-level idea is simple: a 4 MB read on a bdev with a 1 MB max_rw_size becomes 4 child reads, each 1 MB, all in parallel. The parent's callback only fires when all 4 children have completed.
The children's iovecs are built in the parent's
child_iov[] array (32 elements, defined
by SPDK_BDEV_IO_NUM_CHILD_IOV); each child is
itself a separate bdev_io. The parent tracks
internal.split.outstanding, a counter that
decrements as each child completes. The parent's
callback fires when the counter hits zero.
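The counting scheme can be shown with a short model that ignores iovecs and metadata and splits on size alone. `model_parent`, `model_split_submit`, and `model_child_done` are hypothetical; only the "outstanding counter reaches zero" idea is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parent I/O: tracks how many children are still in flight. */
struct model_parent {
    uint64_t next_offset;           /* where the next child would start */
    unsigned outstanding;
    bool cb_fired;
};

/* Carves [offset, offset + len) into children of at most max_len bytes. */
static unsigned model_split_submit(struct model_parent *parent, uint64_t offset,
                                   uint64_t len, uint64_t max_len)
{
    unsigned children = 0;

    assert(max_len > 0);
    parent->outstanding = 0;
    parent->cb_fired = false;
    while (len > 0) {
        uint64_t chunk = len < max_len ? len : max_len;

        /* A real child would be submitted to the module here. */
        offset += chunk;
        len -= chunk;
        parent->outstanding++;
        children++;
    }
    parent->next_offset = offset;
    return children;
}

/* Called once per child completion; the last one completes the parent. */
static void model_child_done(struct model_parent *parent)
{
    assert(parent->outstanding > 0);
    if (--parent->outstanding == 0) {
        parent->cb_fired = true;
    }
}
```

With a 4 MB request and a 1 MB limit this produces four children, and the parent's callback flag flips only after the fourth `model_child_done`.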
The bounce buffer: alignment handling
Some bdevs have alignment requirements
(required_alignment > 0). If the
caller's buffer doesn't meet them, the framework
automatically:
Allocates an aligned buffer from the iobuf pool.
For a read: issues the read into the aligned buffer, then copies the data back to the caller's buffer before the caller's callback.
For a write: copies the caller's data into the aligned buffer, issues the write from there, then frees the aligned buffer in the completion path.
The flag internal.f.has_bounce_buf is
set on the bdev_io when this path is taken. The
framework checks it in
spdk_bdev_io_complete (around line 8096
in lib/bdev/bdev.c:8070) and
routes the completion through
_bdev_io_push_bounce_data_buffer for the
copy-back step.
The cost is one extra memory copy per misaligned I/O. At millions of IOPS that cost is significant, so most SPDK applications allocate aligned buffers up front.
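A minimal sketch of the read-side bounce, assuming the alignment is given in bytes as a power of two and using plain malloc and memcpy instead of the iobuf pool and the accel framework. `model_align_up`, `model_is_aligned`, and `model_read` are illustrative helpers, not SPDK functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Rounds a pointer up to the next multiple of alignment (a power of two). */
static void *model_align_up(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;

    return (void *)((addr + alignment - 1) & ~((uintptr_t)alignment - 1));
}

static bool model_is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & ((uintptr_t)alignment - 1)) == 0;
}

/*
 * Reads len bytes from dev into user_buf.
 * Returns true if a bounce buffer was needed, false if the read went direct.
 */
static bool model_read(const uint8_t *dev, void *user_buf, size_t len, size_t alignment)
{
    void *raw, *bounce;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (model_is_aligned(user_buf, alignment)) {
        memcpy(user_buf, dev, len);         /* device writes straight to the caller */
        return false;
    }
    raw = malloc(len + alignment);          /* over-allocate, then align inside it */
    assert(raw != NULL);
    bounce = model_align_up(raw, alignment);
    memcpy(bounce, dev, len);               /* device writes to the aligned buffer */
    memcpy(user_buf, bounce, len);          /* copy-back before the caller's callback */
    free(raw);
    return true;
}
```

The misaligned path does twice the copying of the aligned one, which is the overhead the paragraph above is warning about.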
Edge cases: what if the module forgets to complete?
These are the things that have actually broken production SPDK applications.
What if submit_request returns without completing?
The framework cannot detect this on its own. The
bdev_io_do_submit function at
lib/bdev/bdev.c:2944 calls
the module's submit_request directly. If
the module returns without calling
spdk_bdev_io_complete and without
queuing async work, the bdev_io is leaked: it's
removed from the cache but never freed. The
io_outstanding counter is never
decremented, which eventually blocks new I/O. The
process appears to hang with no obvious cause.
The defense: every code path in a module's
submit_request, including error paths, must either
call spdk_bdev_io_complete() or hand the bdev_io to
asynchronous work that eventually will. The malloc
module does this: the switch statement in
_bdev_malloc_submit_request has a
default case that completes with FAILED. The passthru
module does it too: every submit returns rc, and rc !=
0 triggers a complete with FAILED.
What if submit_request completes synchronously from inside itself?
This is fine — and common. NVMe modules often complete immediately if the queue is full (returning NOMEM so the framework can retry). Malloc completes synchronously for read with NULL buffer, for reset, for flush, for abort, for zcopy start, etc.
The in_submit_request recursion guard
defers the actual caller's callback to a
spdk_thread_send_msg, which is what
makes this safe.
What if the mempool is exhausted?
The submit returns -ENOMEM. The caller
must handle it. For consumers that submit many I/Os in
a loop, the standard pattern is to use
spdk_bdev_queue_io_wait: register a
callback that retries the submit, and the framework
will fire it when a bdev_io is freed. The passthru
module uses exactly this pattern (see
module/bdev/passthru/vbdev_passthru.c:200).
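The retry pattern looks roughly like this in a standalone model. The wait entry mirrors the cb_fn/cb_arg shape described above, but everything else is an assumption made to keep the sketch short: a single waiter slot, a counter for the pool, and the hypothetical names `model_submit`, `model_queue_io_wait`, `model_free_io`, and `model_try_submit`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef void (*model_wait_cb)(void *cb_arg);

struct model_wait_entry {
    model_wait_cb cb_fn;
    void *cb_arg;
};

/* Hypothetical channel: a pool counter and room for one waiter. */
struct model_channel {
    unsigned free_ios;
    struct model_wait_entry *waiter;
};

static int model_submit(struct model_channel *ch)
{
    if (ch->free_ios == 0) {
        return -ENOMEM;
    }
    ch->free_ios--;
    return 0;
}

static void model_queue_io_wait(struct model_channel *ch, struct model_wait_entry *entry)
{
    assert(ch->waiter == NULL);     /* the sketch supports a single waiter */
    ch->waiter = entry;
}

/* Freeing an I/O is what fires the registered retry callback. */
static void model_free_io(struct model_channel *ch)
{
    struct model_wait_entry *entry = ch->waiter;

    ch->free_ios++;
    if (entry != NULL) {
        ch->waiter = NULL;
        entry->cb_fn(entry->cb_arg);
    }
}

/* Caller side: one request that keeps its wait entry inline. */
struct model_request {
    struct model_channel *ch;
    struct model_wait_entry wait;
    int submitted;
};

static void model_try_submit(void *arg)
{
    struct model_request *req = arg;
    int rc = model_submit(req->ch);

    if (rc == -ENOMEM) {
        /* Register ourselves as the retry callback and return. */
        req->wait.cb_fn = model_try_submit;
        req->wait.cb_arg = req;
        model_queue_io_wait(req->ch, &req->wait);
    } else if (rc == 0) {
        req->submitted++;
    }
}
```

Keeping the wait entry inside the request avoids a second allocation on a path that is already short of memory.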
What if two threads try to complete the same bdev_io?
The framework asserts. Modules are required to call
spdk_bdev_io_complete exactly once. The
assertion is in
lib/bdev/bdev.c:8077:
assert(bdev_io->internal.status == SPDK_BDEV_IO_STATUS_PENDING).
What if the caller's callback submits a new I/O?
Allowed, but it must be done on the same thread. The framework's recursion guard defers the caller's callback if it would otherwise re-enter the submit path. This is what makes polling feasible: an nvmf target's callback can submit a new bdev_io, which can complete in the same reactor iteration, and the reactor never blows its stack.
What if the bdev is removed mid-I/O?
The bdev's internal.status moves to
SPDK_BDEV_STATUS_REMOVING, and the
framework rejects new submissions (returns -ENODEV
or -EAGAIN). In-flight I/Os continue. Their callbacks
fire as normal. The desc is closed by the consumer
in response to the
SPDK_BDEV_EVENT_REMOVE event.
What if the caller's thread changes between submit and complete?
Can't happen. The bdev_io's callback is invoked on
the thread that submitted the I/O. The framework
asserts this: spdk_get_thread() == spdk_bdev_io_get_thread(bdev_io).
If you want to handle completion on a different
thread, you have to marshal it with
spdk_thread_send_msg in your own
callback.
What if the caller's callback takes a long time?
The framework doesn't care, but your reactor will. The callback runs on the same thread that submitted the I/O, which is the same thread that's polling the reactor. If the callback does 100 µs of work, the reactor can't process other I/Os during that time. This is why SPDK applications avoid syscalls and sleeps in completion callbacks.
What to take away
A bdev_io is born in a per-thread cache, lives
through one or more device round-trips, dies in a
callback, and goes back to the cache. The framework
handles the mempool, the queue depth tracking, the
latency measurement, the double-completion assertion,
the recursion guard, the alignment bounce, the
split, and the NOMEM retry. The module handles
exactly two things: the type-specific dispatch in
submit_request and the eventual
spdk_bdev_io_complete. The contract is
small and the framework does the rest.
Next: the bdev hierarchy — what happens when a bdev sits on top of another bdev, and how the I/O flows through the stack.