Layer 4 · The bdev framework

One I/O, end to end.

A bdev_io lives a long life for a structure that gets recycled from a mempool. It's allocated when you call spdk_bdev_read, it survives a bounce-buffer copy, a possible split into child I/Os, a queue wait for memory, a trip through the module's dispatch, an NVMe round-trip or two, an accel sequence, and then a free back to the pool. This page traces every step.

~20 min read · 2 diagrams · prerequisites: 4.1 · 4.2

On this page

The end-to-end path in nine steps
Step 1-2: the caller's submit and the framework's allocation
Step 3: the per-thread bdev_io cache and the global pool
Step 4: the framework dispatches to the module
Step 5-6: the module dispatches and waits
Step 7-8: completion and the caller callback
Step 9: the bdev_io returns to the pool
The split: read-modify-write optimization
The bounce buffer: alignment handling
Edge cases: what if the module forgets to complete?

The end-to-end path in nine steps

Here's the whole thing in one diagram. We'll go through each step in detail below.

STEP 01: Caller has desc and channel. Both must be on the same thread.
STEP 02: Caller invokes spdk_bdev_read/write/flush/... Public API, bdev.h.
STEP 03: Framework allocates bdev_io. From per-thread cache, then mempool.
STEP 04: Framework dispatches to module. submit_request with channel and bdev_io.
STEP 05: Module dispatches to device. NVMe queue, AIO, memcpy.
STEP 06: Module waits for completion. Polled or interrupt-driven.
STEP 07: Module calls spdk_bdev_io_complete. With a status code.
STEP 08: Framework calls caller's callback. On the same thread that submitted.
STEP 09: Framework returns bdev_io to pool. Per-thread cache first, then mempool.

flowchart TB
A["caller: spdk_bdev_read_blocks_ext"] --> B["bdev_io in per-thread cache?"]
B -- "yes" --> D["Take from cache"]
B -- "no, no waiters" --> E["mempool_get"]
B -- "no, but waiters exist" --> F["Return -ENOMEM"]
E --> G["Initialize bdev_io fields"]
D --> G
G --> H["Bounce needed?"]
H -- "yes" --> I["accel copy to aligned buffer"]
H -- "no" --> J["Call fn_table->submit_request"]
I --> J
J --> K["Module dispatches to device"]
K --> L["Wait for completion"]
L --> M["Module calls spdk_bdev_io_complete"]
M --> N["Complete from submit context?"]
N -- "yes" --> O["spdk_thread_send_msg to defer"]
N -- "no" --> P["Update stats, run caller cb"]
O --> P
P --> Q["Caller cb returns"]
Q --> R["spdk_bdev_free_io"]
R --> S["Cache full?"]
S -- "no" --> T["Return to per-thread cache"]
S -- "yes" --> U[mempool_put]

classDef fail fill:#f5d6e0,stroke:#8a1c4f;
classDef ok fill:#d6f5d6,stroke:#2a6f2a;
classDef sync fill:#cfe1ff,stroke:#1c4f8a;
class F fail
class T,U ok
class A,B,D,E,G,H,I,J,K,L,M,N,O,P,Q,R,S sync

fig. 1 The full bdev_io lifecycle. The pink box is the one failure branch (the -ENOMEM return). The two green boxes are the free-back-to-pool paths. Everything else is the normal happy path.

Step 1-2: the caller's submit and the framework's allocation

The caller has a spdk_bdev_desc * and a spdk_io_channel *. Both must be valid and on the same thread. The caller then invokes one of the public I/O functions, e.g. spdk_bdev_read_blocks:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 5870-5910 bdev_read_blocks_with_md — the framework's read path

static int
bdev_read_blocks_with_md(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch, void *buf,
                         void *md_buf, uint64_t offset_blocks, uint64_t num_blocks,
                         spdk_bdev_io_completion_cb cb, void *cb_arg)
{
    struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);
    struct spdk_bdev_io *bdev_io;
    struct spdk_bdev_channel *channel = __io_ch_to_bdev_ch(ch);

    if (!bdev_io_valid_blocks(bdev, offset_blocks, num_blocks)) {
        return -EINVAL;
    }

    bdev_io = bdev_channel_get_io(channel);
    if (!bdev_io) {
        return -ENOMEM;
    }

    bdev_io->internal.ch = channel;
    bdev_io->internal.desc = desc;
    bdev_io->type = SPDK_BDEV_IO_TYPE_READ;
    bdev_io->u.bdev.iovs = &bdev_io->iov;
    bdev_io->u.bdev.iovs[0].iov_base = buf;
    bdev_io->u.bdev.iovs[0].iov_len = num_blocks * bdev_desc_get_block_size(desc);
    bdev_io->u.bdev.iovcnt = 1;
    bdev_io->u.bdev.md_buf = md_buf;
    bdev_io->u.bdev.num_blocks = num_blocks;
    bdev_io->u.bdev.offset_blocks = offset_blocks;
    bdev_io->u.bdev.memory_domain = NULL;
    bdev_io->internal.caller_ctx = cb_arg;
    bdev_io->internal.cb = cb;

    // ... accel sequence, dif init, etc.

    bdev_io_submit(bdev_io);
    return 0;
}

Three things to notice:

The validation happens first. bdev_io_valid_blocks checks that offset_blocks + num_blocks neither wraps around nor runs past the end of the bdev. If the check fails, the function returns -EINVAL and the caller never gets a callback, so a submit path that ignores the return value waits forever for a completion that was never scheduled.
The bdev_io is allocated before the I/O is submitted. If bdev_channel_get_io returns NULL (because the global pool is exhausted, or because other consumers are already waiting for a bdev_io), the function returns -ENOMEM and, again, no callback fires. The caller is expected to handle this. (The "waiters exist" branch is part of the fairness design; see Step 3.)
All the I/O parameters go in u.bdev. Not into a malloc'd sidecar. The bdev_io is a single allocation, and the data fits in the union.
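
The range check from the first point can be sketched on its own. This is a standalone model, not the SPDK source: the name valid_blocks and its flat blockcnt parameter are stand-ins chosen for illustration, and it assumes the check is a wrap-around test followed by an end-of-device test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the framework's range check: true when
 * [offset_blocks, offset_blocks + num_blocks) fits inside the device. */
static bool
valid_blocks(uint64_t blockcnt, uint64_t offset_blocks, uint64_t num_blocks)
{
    /* Unsigned addition wraps, so a sum smaller than one of its
     * operands means the request ran past UINT64_MAX. */
    if (offset_blocks + num_blocks < offset_blocks) {
        return false;
    }

    /* The request must end at or before the last block of the device. */
    return offset_blocks + num_blocks <= blockcnt;
}
```

The wrap-around test has to come first. Without it, a huge offset plus a small count can wrap to a small sum that passes the end-of-device comparison.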

Step 3: the per-thread bdev_io cache and the global pool

The bdev_io mempool is created during spdk_bdev_initialize:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 2360-2370 The bdev_io mempool is created at init time

g_bdev_mgr.bdev_io_pool = spdk_mempool_create(mempool_name,
              g_bdev_opts.bdev_io_pool_size,
              sizeof(struct spdk_bdev_io) +
              bdev_module_get_max_ctx_size(),
              0,
              SPDK_ENV_NUMA_ID_ANY);

Each element holds a struct spdk_bdev_io followed by module scratch space, sized for the largest get_ctx_size() across all loaded modules and cache-line aligned. The default pool size is 65535; you can change it with the bdev_set_options RPC.
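
The sizing rule is simple arithmetic, and a small model makes the consequence visible: one module with a large per-I/O context raises the element size for every bdev_io in the pool. The function names below are hypothetical, and the 512-byte header used in the usage note is an assumed figure, not the real sizeof(struct spdk_bdev_io).

```c
#include <assert.h>
#include <stddef.h>

/* Largest per-I/O context requested by any module (0 when there are none). */
static size_t
model_max_ctx_size(const size_t *ctx_sizes, size_t n)
{
    size_t max = 0;

    assert(ctx_sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ctx_sizes[i] > max) {
            max = ctx_sizes[i];
        }
    }
    return max;
}

/* Payload bytes the pool needs, ignoring allocator and alignment overhead. */
static size_t
model_pool_bytes(size_t header_size, size_t max_ctx, size_t pool_size)
{
    return (header_size + max_ctx) * pool_size;
}
```

With an assumed 512-byte header, a largest context of 256 bytes, and the default 65535 elements, model_pool_bytes gives 50,330,880 bytes, roughly 48 MiB before overhead.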

Per-thread, the framework maintains a small cache:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 2658-2679 bdev_channel_get_io — the per-thread fast path

struct spdk_bdev_io *
bdev_channel_get_io(struct spdk_bdev_channel *channel)
{
    struct spdk_bdev_mgmt_channel *ch = channel->shared_resource->mgmt_ch;
    struct spdk_bdev_io *bdev_io;

    if (ch->per_thread_cache_count > 0) {
        bdev_io = STAILQ_FIRST(&ch->per_thread_cache);
        STAILQ_REMOVE_HEAD(&ch->per_thread_cache, internal.buf_link);
        ch->per_thread_cache_count--;
    } else if (spdk_unlikely(!TAILQ_EMPTY(&ch->io_wait_queue))) {
        /*
         * Don't try to look for bdev_ios in the global pool if there are
         * waiters on bdev_ios - we don't want this caller to jump the line.
         */
        bdev_io = NULL;
    } else {
        bdev_io = spdk_mempool_get(g_bdev_mgr.bdev_io_pool);
    }

    return bdev_io;
}

Three branches, in order:

Per-thread cache hit. Fast. The default cache size is 256 entries per thread (SPDK_BDEV_IO_CACHE_SIZE, the initial value of g_bdev_opts.bdev_io_cache_size in bdev.c:147). This is the common case at steady state.
Cache miss, but waiters exist. Return NULL. The fairness rule: new arrivals must not jump ahead of consumers that are already waiting for a bdev_io. This is what makes spdk_bdev_queue_io_wait work: if there are waiters, you must join them.
Cache miss, no waiters. Pull from the global mempool. The pool is created with a mempool cache size of 0 (the fourth argument in the spdk_mempool_create call above), so every spdk_mempool_get reaches the shared pool; the bdev layer's own per-thread cache is what keeps this branch rare at steady state.
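
The branch order can be exercised with a counter-based model. This is a sketch under simplifying assumptions: plain counters replace the real STAILQ, TAILQ, and mempool, and every name starting with model_ is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counters stand in for the real queues and the global mempool. */
struct model_mgmt_ch {
    unsigned cache_count;   /* bdev_ios in this thread's cache */
    unsigned waiters;       /* entries on this thread's io_wait_queue */
    unsigned pool_count;    /* bdev_ios left in the global pool */
};

enum model_source { FROM_CACHE, FROM_POOL, FROM_NOWHERE };

/* Same order as the listing: cache first, then the waiter check, then the pool. */
static enum model_source
model_get_io(struct model_mgmt_ch *ch)
{
    assert(ch != NULL);

    if (ch->cache_count > 0) {
        ch->cache_count--;
        return FROM_CACHE;
    }
    if (ch->waiters > 0) {
        return FROM_NOWHERE;    /* someone is already waiting: do not jump the line */
    }
    if (ch->pool_count > 0) {
        ch->pool_count--;
        return FROM_POOL;
    }
    return FROM_NOWHERE;        /* global pool exhausted */
}
```

Both FROM_NOWHERE cases surface to the caller as the same -ENOMEM. In the waiter case the pool is left untouched, even when it still has elements.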

Step 4: the framework dispatches to the module

After the bdev_io is allocated and its fields are filled in, the framework walks a short call chain: bdev_io_submit calls _bdev_io_submit, which calls bdev_io_do_submit, which calls bdev_submit_request:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 2943-2984 bdev_io_do_submit — the actual module call

static inline void
bdev_io_do_submit(struct spdk_bdev_channel *bdev_ch, struct spdk_bdev_io *bdev_io)
{
    struct spdk_bdev *bdev = bdev_io->bdev;
    struct spdk_io_channel *ch = bdev_ch->channel;
    struct spdk_bdev_shared_resource *shared_resource = bdev_ch->shared_resource;

    if (spdk_unlikely(bdev_io->type == SPDK_BDEV_IO_TYPE_ABORT)) {
        // ... abort special-case ...
    }

    if (spdk_unlikely(bdev_io->type == SPDK_BDEV_IO_TYPE_WRITE &&
              bdev_io->bdev->split_on_write_unit &&
              bdev_io->u.bdev.num_blocks < bdev_io->bdev->write_unit_size)) {
        SPDK_ERRLOG("IO num_blocks %lu does not match the write_unit_size %u\n",
                    bdev_io->u.bdev.num_blocks, bdev_io->bdev->write_unit_size);
        _bdev_io_complete_in_submit(bdev_ch, bdev_io, SPDK_BDEV_IO_STATUS_FAILED);
        return;
    }

    if (spdk_likely(TAILQ_EMPTY(&shared_resource->nomem_io))) {
        bdev_io_increment_outstanding(bdev_ch, shared_resource);
        bdev_io->internal.f.in_submit_request = true;
        bdev_submit_request(bdev, ch, bdev_io);
        bdev_io->internal.f.in_submit_request = false;
    } else {
        bdev_queue_nomem_io_tail(shared_resource, bdev_io, BDEV_IO_RETRY_STATE_SUBMIT);
        // ...
    }
}

The interesting bits:

The in_submit_request flag is set around the call. This is the recursion guard: if the module's submit_request synchronously calls spdk_bdev_io_complete, the framework defers the actual completion to avoid stack blowup (see Step 7).
If there are nomem-queued I/Os, the new submission is queued behind them. This preserves submission order: a new I/O cannot overtake I/Os that were already bounced with NOMEM and are waiting to be retried.
The actual module call is one line: bdev_submit_request(bdev, ch, bdev_io), which does bdev->fn_table->submit_request(ioch, bdev_io). The whole framework-to-module interface is that one line.

Step 5-6: the module dispatches and waits

From the module's perspective, the call is just fn_table->submit_request(ch, bdev_io). What happens inside is entirely up to the module. For NVMe, the module:

Translates the bdev_io to a NVMe command (read, write, flush, etc.).
Picks the NVMe queue pair that belongs to this I/O channel (each channel, and so each thread, has its own queue pair, which is why the submit path needs no lock).
Writes the command into the submission queue and rings the doorbell register.
Returns. The I/O is now in flight on the device.

For malloc:

Picks the source/destination address (the module's own malloc_buf).
Submits an accel copy with the channel's accel_channel.
Returns. When the copy completes, the accel framework calls back into the module (malloc_sequence_done, which in turn calls malloc_done).

For AIO (Linux kernel aio):

Calls io_submit on the channel's io_context.
Returns. The channel's poller picks up the completion later (in interrupt mode the kernel signals it through an eventfd), and the module then calls spdk_bdev_io_complete.

The key idea: all three modules return immediately. The I/O is in flight somewhere outside the module's stack. The module is now in "wait" state, and the framework is free to do other work.

The malloc example in detail

For the malloc module, the submit path goes through the accel framework. Here's the read path:

spdk_v26_01_migration/module/bdev/malloc/bdev_malloc.c · lines 362-413 bdev_malloc_readv — submit a read

static void
bdev_malloc_readv(struct malloc_disk *mdisk, struct spdk_io_channel *ch,
                  struct malloc_task *task, struct spdk_bdev_io *bdev_io)
{
    uint64_t len, offset;
    int res = 0;

    len = bdev_io->u.bdev.num_blocks * bdev_io->bdev->blocklen;
    offset = bdev_io->u.bdev.offset_blocks * bdev_io->bdev->blocklen;

    // ... iov length sanity check ...

    task->status = SPDK_BDEV_IO_STATUS_SUCCESS;
    task->num_outstanding = 0;
    task->iov.iov_base = mdisk->malloc_buf + offset;
    task->iov.iov_len = len;

    task->num_outstanding++;
    res = spdk_accel_append_copy(&bdev_io->u.bdev.accel_sequence, ch,
                                 bdev_io->u.bdev.iovs, bdev_io->u.bdev.iovcnt,
                                 bdev_io->u.bdev.memory_domain,
                                 bdev_io->u.bdev.memory_domain_ctx,
                                 &task->iov, 1, NULL, NULL, NULL, NULL);
    if (spdk_unlikely(res != 0)) {
        malloc_sequence_fail(task, res);
        return;
    }

    spdk_accel_sequence_reverse(bdev_io->u.bdev.accel_sequence);
    spdk_accel_sequence_finish(bdev_io->u.bdev.accel_sequence, malloc_sequence_done, task);

    // ... metadata path ...
}

The key call is spdk_accel_append_copy: the malloc module is asking the accel framework to copy from malloc_buf + offset (the bdev's storage) into bdev_io->u.bdev.iovs (the caller's buffer). When the copy completes, the accel framework calls malloc_sequence_done, which calls malloc_done, which calls spdk_bdev_io_complete.
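
The address arithmetic in bdev_malloc_readv is easier to check in isolation. The sketch below is a model, not the module: it uses plain memcpy in place of the accel framework, and model_malloc_read is a hypothetical name. It converts block units to bytes the same way the listing does, then scatters the data across the caller's iovecs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Copy num_blocks starting at offset_blocks out of a flat backing buffer
 * into the caller's iovec list. Returns the number of bytes copied. */
static size_t
model_malloc_read(const uint8_t *backing, uint32_t blocklen,
                  uint64_t offset_blocks, uint64_t num_blocks,
                  const struct iovec *iovs, int iovcnt)
{
    uint64_t len = num_blocks * blocklen;                     /* bytes to read */
    const uint8_t *src = backing + offset_blocks * blocklen;  /* first byte */
    size_t copied = 0;

    assert(backing != NULL && iovs != NULL);

    for (int i = 0; i < iovcnt && copied < len; i++) {
        size_t n = iovs[i].iov_len;

        if (n > len - copied) {
            n = len - copied;       /* last iovec may be larger than needed */
        }
        memcpy(iovs[i].iov_base, src + copied, n);
        copied += n;
    }
    return copied;
}
```

A return value smaller than num_blocks * blocklen means the iovecs were too short. That is the condition the elided iov length sanity check in the real function guards against.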

Step 7-8: completion and the caller callback

When the device finishes, the module calls spdk_bdev_io_complete:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 8070-8111 spdk_bdev_io_complete — the framework's completion entry point

void
spdk_bdev_io_complete(struct spdk_bdev_io *bdev_io, enum spdk_bdev_io_status status)
{
    struct spdk_bdev *bdev = bdev_io->bdev;
    struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;
    struct spdk_bdev_shared_resource *shared_resource = bdev_ch->shared_resource;

    if (spdk_unlikely(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_PENDING)) {
        SPDK_ERRLOG("Unexpected completion on IO from %s module, status was %s\n",
                    spdk_bdev_get_module_name(bdev),
                    bdev_io_status_get_string(bdev_io->internal.status));
        assert(false);
    }
    bdev_io->internal.status = status;

    if (spdk_unlikely(bdev_io->type == SPDK_BDEV_IO_TYPE_RESET)) {
        // ... reset special-case ...
    } else {
        bdev_io_decrement_outstanding(bdev_ch, shared_resource);
        if (spdk_likely(status == SPDK_BDEV_IO_STATUS_SUCCESS)) {
            if (bdev_io_needs_sequence_exec(bdev_io)) {
                bdev_io_exec_sequence(bdev_io, bdev_io_complete_sequence_cb);
                return;
            } else if (spdk_unlikely(bdev_io->internal.f.has_bounce_buf &&
                                     !bdev_io_use_accel_sequence(bdev_io))) {
                _bdev_io_push_bounce_data_buffer(bdev_io,
                                                 _bdev_io_complete_push_bounce_done);
                return;
            }
        }

        if (spdk_unlikely(_bdev_io_handle_no_mem(bdev_io, BDEV_IO_RETRY_STATE_SUBMIT))) {
            return;
        }
    }

    bdev_io_complete(bdev_io);
}

Several things happen here, in order:

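Double-completion check. If the bdev_io is already in a non-pending state, the framework logs an error and hits assert(false). In a debug build, a module that calls spdk_bdev_io_complete twice for the same bdev_io aborts the process; in a build with NDEBUG defined the assert compiles away and only the log line remains. (Aborting is intentional: the alternative is silent resource corruption.)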
Decrement the channel's io_outstanding. This drives the queue depth poller and the QoS logic.
Run the accel sequence (if any). If the bdev_io has an accel_sequence attached and the I/O was successful, the framework runs the sequence (e.g. decompression, decryption) before the caller's callback.
Push the bounce buffer (if any). If the I/O used a bounce buffer, the framework copies the aligned data back to the user's possibly-misaligned buffer.
Handle NOMEM retry. If this I/O itself returned NOMEM and is the kind of I/O that can be retried, queue it for later.
Call the caller's callback. This happens in bdev_io_complete:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 7939-7975 bdev_io_complete — the caller's callback

static inline void
bdev_io_complete(void *ctx)
{
    struct spdk_bdev_io *bdev_io = ctx;
    struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;
    uint64_t tsc, tsc_diff;

    if (spdk_unlikely(bdev_io->internal.f.in_submit_request)) {
        /*
         * Defer completion to avoid potential infinite recursion if the
         * user's completion callback issues a new I/O.
         */
        spdk_thread_send_msg(spdk_bdev_io_get_thread(bdev_io),
                             bdev_io_complete, bdev_io);
        return;
    }

    tsc = spdk_get_ticks();
    tsc_diff = tsc - bdev_io->internal.submit_tsc;

    bdev_ch_remove_from_io_submitted(bdev_io);
    spdk_trace_record_tsc(tsc, TRACE_BDEV_IO_DONE, bdev_ch->trace_id, 0, (uintptr_t)bdev_io,
                          bdev_io->internal.caller_ctx, bdev_ch->queue_depth);

    if (bdev_ch->histogram) {
        spdk_histogram_data_tally(bdev_ch->histogram, tsc_diff);
    }

    bdev_io_update_io_stat(bdev_io, tsc_diff);
    _bdev_io_complete(bdev_io);
}

Two important things:

The recursion guard. If we're still inside the original submit_request call, the completion is deferred via spdk_thread_send_msg. This prevents unbounded stack growth when the user's completion callback issues a new I/O.
Latency tracking. The framework measures tsc_diff = tsc_now - submit_tsc and uses it for the per-bdev I/O statistics and the optional histogram.
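
The recursion guard is easier to see in a stripped-down model. The sketch below is not SPDK code: a one-slot pointer stands in for the thread's message queue, and all guard_ names are hypothetical. It shows a module completing synchronously inside submit, the completion being parked, and the callback running only after submit has unwound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct guard_io {
    bool in_submit_request;   /* set while the module's submit runs */
    bool deferred;            /* a completion message was queued */
    int  cb_calls;            /* times the caller's callback ran */
    int  cb_depth_seen;       /* submit nesting depth seen by the callback */
};

static int g_submit_depth;        /* current nesting inside submit */
static struct guard_io *g_msg;    /* one-slot stand-in for the thread's message queue */

static void
guard_complete(struct guard_io *io)
{
    if (io->in_submit_request) {
        /* Still inside submit: park the completion instead of calling back. */
        io->deferred = true;
        g_msg = io;
        return;
    }
    io->cb_depth_seen = g_submit_depth;   /* "caller's callback" runs here */
    io->cb_calls++;
}

/* A module that completes synchronously from inside submit. */
static void
guard_submit(struct guard_io *io)
{
    g_submit_depth++;
    io->in_submit_request = true;
    guard_complete(io);
    io->in_submit_request = false;
    g_submit_depth--;
}

/* What the thread's poll loop would do: run the parked message. */
static void
guard_poll(void)
{
    struct guard_io *io = g_msg;

    g_msg = NULL;
    if (io != NULL) {
        guard_complete(io);
    }
}
```

After guard_submit returns, cb_calls is still 0. The callback runs on the next guard_poll, at nesting depth 0, which is why a callback that submits another I/O cannot grow the stack.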

Finally, the caller's callback is invoked. The framework does this in _bdev_io_complete:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 7922-7937 _bdev_io_complete — the actual caller's callback

static inline void
_bdev_io_complete(void *ctx)
{
    struct spdk_bdev_io *bdev_io = ctx;

    if (spdk_unlikely(bdev_io_use_accel_sequence(bdev_io))) {
        assert(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_SUCCESS);
        spdk_accel_sequence_abort(bdev_io->internal.accel_sequence);
    }

    assert(bdev_io->internal.cb != NULL);
    assert(spdk_get_thread() == spdk_bdev_io_get_thread(bdev_io));

    bdev_io->internal.cb(bdev_io, bdev_io->internal.status == SPDK_BDEV_IO_STATUS_SUCCESS,
                         bdev_io->internal.caller_ctx);
}

The callback signature is spdk_bdev_io_completion_cb(bdev_io, success, cb_arg):

success is a bool, not a status code. If you need the specific status (e.g. NVME_ERROR vs MISCOMPARE), use spdk_bdev_io_get_nvme_status() or spdk_bdev_io_get_scsi_status().
cb_arg is the caller's private context, whatever they passed to spdk_bdev_read.
The bdev_io is passed in too — the caller can read the results of a read (the data is in the iovs that were submitted), inspect the status, or chain another I/O.

Step 9: the bdev_io returns to the pool

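The framework does not free the bdev_io on its own. Once the callback fires, the caller owns the bdev_io and must hand it back with spdk_bdev_free_io, typically from inside the callback after reading whatever it needs: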

spdk_v26_01_migration/lib/bdev/bdev.c · lines 2681-2710 spdk_bdev_free_io — return to the pool

void
spdk_bdev_free_io(struct spdk_bdev_io *bdev_io)
{
    struct spdk_bdev_mgmt_channel *ch;

    assert(bdev_io != NULL);
    assert(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_PENDING);

    ch = bdev_io->internal.ch->shared_resource->mgmt_ch;

    if (bdev_io->internal.f.has_buf) {
        bdev_io_put_buf(bdev_io);
    }

    if (ch->per_thread_cache_count < ch->bdev_io_cache_size) {
        ch->per_thread_cache_count++;
        STAILQ_INSERT_HEAD(&ch->per_thread_cache, bdev_io, internal.buf_link);
        while (ch->per_thread_cache_count > 0 && !TAILQ_EMPTY(&ch->io_wait_queue)) {
            struct spdk_bdev_io_wait_entry *entry;

            entry = TAILQ_FIRST(&ch->io_wait_queue);
            TAILQ_REMOVE(&ch->io_wait_queue, entry, link);
            entry->cb_fn(entry->cb_arg);
        }
    } else {
        /* We should never have a full cache with entries on the io wait queue. */
        assert(TAILQ_EMPTY(&ch->io_wait_queue));
        spdk_mempool_put(g_bdev_mgr.bdev_io_pool, (void *)bdev_io);
    }
}

One preliminary step, then two branches:

If the bdev_io still holds a buffer that the framework allocated for it (for example through spdk_bdev_io_get_buf), release the buffer first.
Per-thread cache has space. Push the bdev_io onto the cache. This is the common case. Then, while the cache is non-empty and there are waiters, pop the oldest waiter and run its callback; the bdev_io that was just freed is what lets that waiter's retry succeed.
Per-thread cache is full. Put the bdev_io back into the global mempool. The framework asserts there are no waiters (because the only way to be in this branch is if the cache is full, which means waiters have been serviced).

The waiters loop is what makes the cache-and-pool design fair. Both the cache and the io_wait_queue hang off the per-thread management channel, so the waiter being woken is another consumer on the same SPDK thread, one that earlier got -ENOMEM and registered with spdk_bdev_queue_io_wait. Whoever frees a bdev_io does the work of waking it. This is "give back" semantics: you took a bdev_io, you must give it back, and the act of giving it back can unblock someone else.
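
The free path and its waiter wake-up can be modeled with a small FIFO. This is a sketch, not the SPDK implementation: counters replace the cache and pool, a fixed ring replaces the TAILQ, and every model_ name is hypothetical. The retry callback stands in for a consumer re-issuing its submit and taking the freed bdev_io.

```c
#include <assert.h>

#define MODEL_MAX_WAITERS 8

typedef void (*model_wait_cb)(void *arg);

struct model_wait_entry {
    model_wait_cb fn;
    void *arg;
};

struct model_wait_ch {
    unsigned cache_count;   /* bdev_ios in the per-thread cache */
    unsigned cache_size;    /* cache capacity */
    unsigned pool_count;    /* bdev_ios returned to the global pool */
    struct model_wait_entry waiters[MODEL_MAX_WAITERS];
    unsigned head, tail;    /* FIFO indices into waiters[] */
};

static void
model_queue_wait(struct model_wait_ch *ch, model_wait_cb fn, void *arg)
{
    assert(ch->tail - ch->head < MODEL_MAX_WAITERS);
    ch->waiters[ch->tail % MODEL_MAX_WAITERS].fn = fn;
    ch->waiters[ch->tail % MODEL_MAX_WAITERS].arg = arg;
    ch->tail++;
}

static void
model_free_io(struct model_wait_ch *ch)
{
    if (ch->cache_count < ch->cache_size) {
        ch->cache_count++;
        /* Wake waiters oldest-first while there is something to hand out. */
        while (ch->cache_count > 0 && ch->head != ch->tail) {
            struct model_wait_entry e = ch->waiters[ch->head % MODEL_MAX_WAITERS];

            ch->head++;
            e.fn(e.arg);
        }
    } else {
        assert(ch->head == ch->tail);   /* full cache implies no waiters */
        ch->pool_count++;
    }
}

struct model_retry {
    struct model_wait_ch *ch;
    int got_io;
};

/* Waiter callback: retry the submit, which takes one bdev_io from the cache. */
static void
model_retry_submit(void *arg)
{
    struct model_retry *r = arg;

    if (r->ch->cache_count > 0) {
        r->ch->cache_count--;
        r->got_io = 1;
    }
}
```

With two waiters queued and an empty cache, the first model_free_io wakes only the first waiter, because its retry drains the cache and ends the loop. The second waiter is served by the next free.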

The split: read-modify-write optimization

Sometimes a single submit would exceed one of the bdev's size limits, or cross its optimal I/O boundary, or both. The framework handles this by splitting the I/O into child bdev_ios, each within limits, and treating the parent as a "wait for all children" object.

The split is gated by a flag in spdk_bdev_io.internal.f.split:

spdk_v26_01_migration/lib/bdev/bdev.c · lines 3317-3376 _bdev_rw_split — the split path

static void
_bdev_rw_split(void *_bdev_io)
{
    struct spdk_bdev_io *bdev_io = _bdev_io;
    struct spdk_bdev *bdev = bdev_io->bdev;
    // ...
    uint32_t io_boundary;

    if (bdev_io->type == SPDK_BDEV_IO_TYPE_WRITE && bdev->split_on_write_unit) {
        io_boundary = bdev->write_unit_size;
    } else if (bdev->split_on_optimal_io_boundary) {
        io_boundary = bdev->optimal_io_boundary;
    } else {
        io_boundary = UINT32_MAX;
    }

    // ... walk iovs, accumulate child bdev_ios up to
    //     max_segment_size / max_num_segments / io_boundary ...
}

The split logic is complex — it has to walk the iov array, accumulate child I/Os up to various limits, and account for metadata. But the high-level idea is simple: a 4 MB read on a bdev with a 1 MB max_rw_size becomes 4 child reads, each 1 MB, all in parallel. The parent's callback only fires when all 4 children have completed.

The children are ordinary bdev_ios drawn from the same pool. What the parent carries is the child_iov[] array (32 iovec entries, defined by SPDK_BDEV_IO_NUM_CHILD_IOV), from which each child's iovec list is built. The parent tracks internal.split.outstanding, a counter that decrements as each child completes. The parent's callback fires when the counter hits zero.
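
The boundary part of the split reduces to a short loop. The sketch below counts how many children an I/O needs when no child may cross a multiple of a boundary, measured in blocks. It is a model only: model_split_count is a hypothetical name, and it ignores the segment-size and segment-count limits the real code also honors.

```c
#include <assert.h>
#include <stdint.h>

/* Number of child I/Os needed so that no child crosses a multiple of
 * `boundary` blocks. */
static uint64_t
model_split_count(uint64_t offset_blocks, uint64_t num_blocks, uint32_t boundary)
{
    uint64_t children = 0;

    assert(boundary > 0);

    while (num_blocks > 0) {
        /* Blocks left before the next boundary. */
        uint64_t room = boundary - (offset_blocks % boundary);
        uint64_t take = num_blocks < room ? num_blocks : room;

        offset_blocks += take;
        num_blocks -= take;
        children++;
    }
    return children;
}
```

With 512-byte blocks, a 4 MB I/O is 8192 blocks and a 1 MB boundary is 2048 blocks. An aligned start gives 4 children, matching the example above. Starting 1024 blocks into a boundary gives 5, because the first and last children are each half-size.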

The bounce buffer: alignment handling

Some bdevs have alignment requirements (required_alignment > 0). If the caller's buffer doesn't meet them, the framework automatically:

Allocates an aligned buffer from the iobuf pool.
For a read: issues the read into the aligned buffer, then copies the data back to the caller's buffer before the caller's callback.
For a write: copies the caller's data into the aligned buffer, issues the write from there, then frees the aligned buffer in the completion path.

The flag internal.f.has_bounce_buf is set on the bdev_io when this path is taken. The framework checks it in spdk_bdev_io_complete (around line 8096 of lib/bdev/bdev.c) and routes the completion through _bdev_io_push_bounce_data_buffer for the copy-back step.

The cost is one extra memory copy per misaligned I/O. At millions of IOPS that adds up, so most SPDK applications allocate their buffers aligned up front.
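
The decision itself is a mask test, and the write-side bounce is one allocation plus one copy. The sketch below models both with standard C11 calls. It is not the framework's code: aligned_alloc stands in for the iobuf pool, the alignment is given in bytes and assumed to be a power of two, and the model_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static bool
model_is_aligned(const void *buf, size_t alignment)
{
    /* The mask test only works for power-of-two alignments. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)buf & (alignment - 1)) == 0;
}

/* Write path: return the buffer the device should read from. When a bounce
 * buffer is needed it is returned through bounce_out and the caller frees it
 * on completion. Returns NULL if the allocation fails. */
static void *
model_prepare_write(const void *buf, size_t len, size_t alignment, void **bounce_out)
{
    size_t padded;
    void *bounce;

    *bounce_out = NULL;
    if (model_is_aligned(buf, alignment)) {
        return (void *)buf;             /* fast path: no copy */
    }

    /* aligned_alloc wants a size that is a multiple of the alignment. */
    padded = (len + alignment - 1) / alignment * alignment;
    bounce = aligned_alloc(alignment, padded);
    if (bounce == NULL) {
        return NULL;
    }
    memcpy(bounce, buf, len);           /* the extra copy the text warns about */
    *bounce_out = bounce;
    return bounce;
}
```

A read would reverse the order: issue the I/O into the bounce buffer, then copy into the caller's buffer before the callback runs.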

Edge cases: what if the module forgets to complete?

These are the things that have actually broken production SPDK applications.

What if `submit_request` returns without completing?

The framework cannot tell. bdev_io_do_submit (lib/bdev/bdev.c:2944) calls the module's submit_request directly, and returning with the I/O still pending is the normal asynchronous case. But if the module returns without calling spdk_bdev_io_complete and without handing the I/O to anything that will complete it later, the bdev_io is leaked: it has left the cache and is never freed. The io_outstanding counter is never decremented, the caller's callback never fires, and the leaked bdev_ios eventually exhaust the pool, which blocks new I/O. The process appears to hang with no obvious cause.

The defense: every code path in a module's submit_request, error paths included, must end in one of two ways: a call to spdk_bdev_io_complete(), or a handoff to something that will call it later. The malloc module does this: the switch statement in _bdev_malloc_submit_request has a default case that completes with FAILED. The passthru module does this too: every submit returns rc, and rc != 0 triggers a complete with FAILED.
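
That rule can be written down as an invariant and checked against a toy module. The sketch below is a model with hypothetical contract_ names, not a real bdev module. Its submit function either queues the I/O for asynchronous completion or completes it inline, and the default case covers unsupported types.

```c
#include <assert.h>
#include <stdbool.h>

enum contract_type { CONTRACT_READ, CONTRACT_WRITE, CONTRACT_FLUSH, CONTRACT_UNMAP };
enum contract_status { CONTRACT_PENDING, CONTRACT_SUCCESS, CONTRACT_FAILED };

struct contract_io {
    enum contract_type type;
    enum contract_status status;
    bool queued_async;      /* handed to a device; completion arrives later */
};

static void
contract_complete(struct contract_io *io, enum contract_status st)
{
    assert(io->status == CONTRACT_PENDING);   /* complete exactly once */
    io->status = st;
}

/* Stand-in for a device submit that can fail (for example, a full queue). */
static int
contract_start_transfer(struct contract_io *io, bool device_ok)
{
    if (!device_ok) {
        return -1;
    }
    io->queued_async = true;
    return 0;
}

static void
contract_submit_request(struct contract_io *io, bool device_ok)
{
    switch (io->type) {
    case CONTRACT_READ:
    case CONTRACT_WRITE:
        if (contract_start_transfer(io, device_ok) != 0) {
            contract_complete(io, CONTRACT_FAILED);   /* error path still completes */
        }
        return;
    case CONTRACT_FLUSH:
        contract_complete(io, CONTRACT_SUCCESS);      /* nothing to do: complete inline */
        return;
    default:
        contract_complete(io, CONTRACT_FAILED);       /* unsupported I/O type */
        return;
    }
}

/* The invariant: after submit returns, the I/O is either completed or
 * owned by something that will complete it. */
static bool
contract_accounted_for(const struct contract_io *io)
{
    return io->status != CONTRACT_PENDING || io->queued_async;
}
```

A leak is any path where contract_accounted_for would return false after submit. In this model the only way to create one is to delete a contract_complete call or the default case.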

What if `submit_request` completes synchronously from inside itself?

This is fine — and common. NVMe modules often complete immediately if the queue is full (returning NOMEM so the framework can retry). Malloc completes synchronously for read with NULL buffer, for reset, for flush, for abort, for zcopy start, etc.

The in_submit_request recursion guard defers the actual caller's callback to a spdk_thread_send_msg, which is what makes this safe.

What if the mempool is exhausted?

The submit returns -ENOMEM. The caller must handle it. For consumers that submit many I/Os in a loop, the standard pattern is to use spdk_bdev_queue_io_wait: register a callback that retries the submit, and the framework will fire it when a bdev_io is freed. The passthru module uses exactly this pattern (see module/bdev/passthru/vbdev_passthru.c:200).

What if two threads try to complete the same bdev_io?

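The framework catches a second completion, not the race itself. Modules are required to call spdk_bdev_io_complete exactly once. The check is at lib/bdev/bdev.c:8077: if internal.status is no longer SPDK_BDEV_IO_STATUS_PENDING, the framework logs "Unexpected completion on IO" and hits assert(false). The bdev_io carries no locking, so two threads completing it at the same moment is a data race that this check may or may not observe.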

What if the caller's callback submits a new I/O?

Allowed, but it must be done on the same thread. The framework's recursion guard defers the caller's callback if it would otherwise re-enter the submit path. This is what makes polling feasible: an nvmf target's callback can submit a new bdev_io, which can complete in the same reactor iteration, and the reactor never blows its stack.

What if the bdev is removed mid-I/O?

The bdev's internal.status moves to SPDK_BDEV_STATUS_REMOVING, and the framework rejects new submissions (returns -ENODEV or -EAGAIN). In-flight I/Os continue. Their callbacks fire as normal. The desc is closed by the consumer in response to the SPDK_BDEV_EVENT_REMOVE event.

What if the caller's thread changes between submit and complete?

Can't happen. The bdev_io's callback is invoked on the thread that submitted the I/O. The framework asserts this: spdk_get_thread() == spdk_bdev_io_get_thread(bdev_io). If you want to handle completion on a different thread, you have to marshal it with spdk_thread_send_msg in your own callback.

What if the caller's callback takes a long time?

The framework doesn't care, but your reactor will. The callback runs on the same thread that submitted the I/O, which is the same thread that's polling the reactor. If the callback does 100 µs of work, the reactor can't process other I/Os during that time. This is why SPDK applications avoid syscalls and sleeps in completion callbacks.

What to take away

A bdev_io is born in a per-thread cache, lives through one or more device round-trips, dies in a callback, and goes back to the cache. The framework handles the mempool, the queue depth tracking, the latency measurement, the double-completion assertion, the recursion guard, the alignment bounce, the split, and the NOMEM retry. The module handles exactly two things: the type-specific dispatch in submit_request and the eventual spdk_bdev_io_complete. The contract is small and the framework does the rest.

Next: the bdev hierarchy — what happens when a bdev sits on top of another bdev, and how the I/O flows through the stack.