Layer 4 · The bdev framework

The four things you need to know by name.

The bdev framework looks like a wall of structs if you open the headers cold. But four of them do almost all the work, and once you can picture them and their relationships, every other bdev page in this curriculum reads in half the time. This is that picture.

~15 min read · 2 diagrams · prerequisites: 2.3 · 2.2
On this page
  1. The big idea: noun, open, context, work
  2. spdk_bdev — the thing itself
  3. spdk_bdev_desc — the open handle
  4. spdk_bdev_channel — per-thread state
  5. spdk_bdev_io — one request
  6. How they relate: the lifecycle in five lines
  7. The capability system: what a bdev can do
  8. Why the split exists: lock-free I/O on the hot path
  9. Edge cases & what trips people up

The big idea: noun, open, context, work

Most storage stacks have one core data type: the I/O request. SPDK intentionally splits that into four so each piece can do one thing well. The four names are:

| Type | What it is | Created by | Created per |
| --- | --- | --- | --- |
| spdk_bdev | The block device. The "thing." | The module that owns the device | One per device |
| spdk_bdev_desc | An open handle on a bdev. | spdk_bdev_open_ext() | One per consumer |
| spdk_bdev_channel | Per-thread state for one desc on one bdev. | spdk_bdev_get_io_channel() | One per (desc, thread) pair |
| spdk_bdev_io | A single I/O request in flight. | The framework's mempool | One per I/O |

A useful way to remember them: the bdev is the noun ("the disk"), the desc is the verb "to open," the channel is the per-thread context, and the bdev_io is the work being done. Each one has a well-defined lifetime, a well-defined owner, and a specific job.

flowchart TB
  subgraph "Module (e.g. malloc, nvme, lvol)"
    BDEV["spdk_bdev<br/>'Malloc0'<br/>blocklen=4096, blockcnt=1024"]
  end
  subgraph "Consumer A (your code)"
    DESC_A["spdk_bdev_desc<br/>write=true<br/>thread=A"]
    CH_A1["spdk_bdev_channel<br/>thread=A"]
    CH_A2["spdk_bdev_channel<br/>thread=B"]
  end
  subgraph "Consumer B (nvmf target)"
    DESC_B["spdk_bdev_desc<br/>write=false<br/>thread=B"]
  end
  DESC_A -- "holds ref to" --> BDEV
  DESC_B -- "holds ref to" --> BDEV
  DESC_A -- "get_io_channel" --> CH_A1
  DESC_A -- "get_io_channel from a different thread" --> CH_A2
  CH_A1 -- "allocates from" --> IO1["spdk_bdev_io (READ)"]
  CH_A1 -- "allocates from" --> IO2["spdk_bdev_io (WRITE)"]
  CH_A2 -- "allocates from" --> IO3["spdk_bdev_io (FLUSH)"]
  classDef bdev fill:#f5e6c8,stroke:#a17f1a;
  classDef desc fill:#cfe1ff,stroke:#1c4f8a;
  classDef channel fill:#d6f5d6,stroke:#2a6f2a;
  classDef io fill:#f5d6e0,stroke:#8a1c4f;
  class BDEV bdev
  class DESC_A,DESC_B desc
  class CH_A1,CH_A2 channel
  class IO1,IO2,IO3 io
fig. 1 — the four core types, in their natural relationships

fig. 1   One spdk_bdev (Malloc0). Two open handles (spdk_bdev_desc) on it — yours, read-write; the nvmf target's, read-only. Your desc has two channels because you grabbed one on thread A and one on thread B. Each channel has its own queue of in-flight spdk_bdev_io requests.

spdk_bdev — the thing itself

An spdk_bdev is a single block device. It's the in-memory representation of "a disk" — real or virtual. The struct itself is defined in include/spdk/bdev_module.h:420 and is nearly 400 lines long. Most of those lines are knobs (block size, alignment hints, max I/O size, DIF configuration, NUMA id, UUID…). A bdev is created and registered by a module (the concept of a module is the subject of 4.2).

Here are the fields you'll actually look at, annotated one by one:

  1. ctxt — a back-pointer to whatever the module put behind this bdev. For the malloc module, this is struct malloc_disk *. For the NVMe module, it's struct spdk_nvme_ns *. The framework never dereferences this; it's pure pass-through for the module's convenience.

  2. name — the device's unique name. This is what bdev_get_bdevs takes as an argument and what shows up in JSON-RPC. The malloc module names them Malloc0, Malloc1, … The NVMe module derives names from the controller name and namespace ID, such as Nvme0n1.

  3. product_name — a type name, not a unique one. Every malloc bdev has product_name = "Malloc disk"; every lvol has "Logical Volume"; every passthru has "passthru". This is the string used by bdev_get_bdevs -p <product_name> to filter.

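  4. blocklen — logical block size in bytes. Almost always 512 or 4096; when metadata is interleaved with the data, the size includes the metadata bytes (520 or 4104, for example), so don't assume a power of 2. The framework refuses byte-addressed I/O whose offset or length isn't a multiple of this size.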

  5. phys_blocklen — physical block size. The smallest unit the device can atomically write. May equal blocklen.

  6. blockcnt — number of logical blocks. Multiply by blocklen to get the total byte capacity.

  7. io_type_supported — a bitmap, ORed from the SPDK_BDEV_IO_TYPE_* values. The framework will reject I/O of an unsupported type at submit time. Modules can also expose a more dynamic io_type_supported() function pointer for cases where capability depends on something runtime-checkable.

  8. required_alignment — exponent of the alignment requirement. required_alignment = 12 means buffers must be 4096-byte aligned. The framework will automatically double-buffer (copy through a bounce buffer) any I/O that violates this. See 4.3 for the bounce path.

  9. module — a back-pointer to the spdk_bdev_module that registered this bdev. Used for things like calling module_init_done on async modules and looking up module-level config.

  10. fn_table — the function table. The "vtable." Points at the struct that contains submit_request, destruct, io_type_supported, get_io_channel, and so on. This is the seam between the framework and a module. The full thing is in include/spdk/bdev_module.h:299.
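To make the size fields concrete, here is a minimal sketch of how blocklen and blockcnt combine. The struct sketch_bdev type and both helper functions are hypothetical stand-ins written for this page, not SPDK definitions; only the field names come from the list above, and the field types are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, heavily abridged stand-in for struct spdk_bdev.
 * Field names follow the list above; the types are assumptions. */
struct sketch_bdev {
	void       *ctxt;          /* module's private pointer, opaque to the framework */
	const char *name;          /* unique, e.g. "Malloc0" */
	const char *product_name;  /* type name, e.g. "Malloc disk" */
	uint32_t    blocklen;      /* logical block size in bytes */
	uint64_t    blockcnt;      /* number of logical blocks */
};

/* Total capacity in bytes: blockcnt multiplied by blocklen. */
static uint64_t
sketch_bdev_capacity_bytes(const struct sketch_bdev *bdev)
{
	return bdev->blockcnt * (uint64_t)bdev->blocklen;
}

/* Convert a byte range into a block range. A range whose offset or
 * length is not a whole number of blocks is rejected with -EINVAL
 * (the error code is this sketch's choice). */
static int
sketch_bytes_to_blocks(const struct sketch_bdev *bdev,
		       uint64_t offset_bytes, uint64_t num_bytes,
		       uint64_t *offset_blocks, uint64_t *num_blocks)
{
	assert(bdev->blocklen != 0);
	if (offset_bytes % bdev->blocklen != 0 || num_bytes % bdev->blocklen != 0) {
		return -EINVAL;
	}
	*offset_blocks = offset_bytes / bdev->blocklen;
	*num_blocks = num_bytes / bdev->blocklen;
	return 0;
}
```

With blocklen=4096 and blockcnt=1024 (the Malloc0 of fig. 1) the capacity works out to 4 MiB, and a 4096-byte read at byte offset 8192 becomes one block at block offset 2.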

spdk_bdev_desc — the open handle

A spdk_bdev_desc is what you get when you open a bdev. Think of it as a file descriptor. The public header has only a forward declaration (struct spdk_bdev_desc;); the full struct is declared in lib/bdev/bdev.c:344 and contains the framework's bookkeeping. The public API only hands out pointers to it.

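To open a bdev, you call spdk_bdev_open_ext (or its async sibling). The public declaration is in include/spdk/bdev.h:498. It takes the bdev name, a write flag, an event callback and its context pointer, and an out-parameter that receives the desc.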

Three things to notice:

  1. write is sticky. A read-only desc can never be upgraded to read-write in place. If you need write access, open it with write=true.

  2. You must pass an event callback. The framework will use it to notify you of hot-removal, resize, and media management events. There is no "I don't care" option. If you don't care, pass a function that just logs.

  3. The desc is bound to a thread. The spdk_bdev_close() call must happen on the same spdk_thread that called spdk_bdev_open_ext(). This is enforced and asserted; the framework uses the desc's thread to schedule unregister notifications.

The desc is the level at which write access is enforced. Two consumers can each open the same bdev — one read-only, one read-write. Both get their own desc; both can do I/O; the framework arbitrates through the desc list on the bdev.
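The three rules above can be sketched as a toy model. Everything here is hypothetical (struct sketch_desc, the sketch_* functions, and the integer thread ids), and the errno values are this sketch's own choice. The one deliberate difference: where the text says the real framework asserts on a wrong-thread close, the toy returns an error so the behavior can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*sketch_event_cb)(int event, void *event_ctx);

/* Hypothetical open handle: write access and owning thread are fixed at open. */
struct sketch_desc {
	bool            write;      /* sticky: never upgraded in place */
	int             owner_tid;  /* thread that opened, and must close */
	sketch_event_cb event_cb;   /* mandatory */
	void           *event_ctx;
	bool            open;
};

static int
sketch_open(struct sketch_desc *desc, bool write, int tid,
	    sketch_event_cb event_cb, void *event_ctx)
{
	assert(desc != NULL);
	if (event_cb == NULL) {
		return -EINVAL;     /* there is no "I don't care" option */
	}
	desc->write = write;
	desc->owner_tid = tid;
	desc->event_cb = event_cb;
	desc->event_ctx = event_ctx;
	desc->open = true;
	return 0;
}

/* Write access is checked against the desc, not the device. */
static int
sketch_write(const struct sketch_desc *desc)
{
	return desc->write ? 0 : -EBADF;
}

/* Close only succeeds on the thread that opened the desc. */
static int
sketch_close(struct sketch_desc *desc, int tid)
{
	if (tid != desc->owner_tid) {
		return -EPERM;
	}
	desc->open = false;
	return 0;
}

/* The "I don't care" callback: accept the event and do nothing. */
static void
sketch_ignore_event(int event, void *event_ctx)
{
	(void)event;
	(void)event_ctx;
}
```

A read-only open followed by sketch_write() fails, and the only way to get write access is a second open with write=true.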

spdk_bdev_channel — per-thread state

A spdk_bdev_channel is the per-thread context for a desc on a bdev. You get one by calling spdk_bdev_get_io_channel(desc) from a specific thread. The full struct lives in lib/bdev/bdev.c:278. Here is the annotated core:

  1. bdev — a back-pointer to the bdev this channel is for. Mostly redundant (you can get to it through the desc) but convenient on the hot path.

  2. channel — the module's per-thread channel, obtained from the module's get_io_channel() callback. This is where NVMe's per-thread submission queue lives, where malloc's accel_channel lives, where AIO's per-thread completion state lives. The framework keeps it inside the spdk_bdev_channel and passes this module channel, not the enclosing spdk_bdev_channel, to your submit_request.

  3. accel_channel — a separate channel into the accel framework. The bdev layer needs it for bounce buffers, accel sequences, and DMA-from-memcpy work that the framework itself initiates. Every channel gets one automatically.

  4. shared_resource — a sneaky one. Two channels on the same thread, opened against the same module's bdev, share a single set of nomem queues, io_outstanding counter, and a few other fields. This is what makes it cheap to have lots of channels: the heavy state is per-module-per-thread, not per-channel.

  5. stat — per-channel I/O statistics (bytes read, bytes written, latencies, min/max). Aggregated into the bdev's stats on channel destroy. What bdev_get_iostat reports.

  6. io_outstanding — count of in-flight I/O on this channel. Drives the queue depth poller and QoS decisions.

  7. io_submitted — a TAILQ of every bdev_io that has been handed to the module and not yet completed. The framework uses this for cleanup if the channel goes away while I/O is in flight (see the "what trips people up" section below).

  8. io_locked — I/Os waiting because they target a locked LBA range (acquired via spdk_bdev_quiesce_range).

  9. io_accel_exec — I/Os that have an accel sequence being executed.

  10. io_memory_domain — I/Os waiting on a memory domain pull/push.

  11. qos_queued_io — I/Os held back because a QoS rate limit has been reached. The QoS poller drains this periodically.
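The "one channel per thread" idea can be sketched with a fixed table of per-thread slots. This is a toy model: struct sketch_device, the integer thread ids, and the reference count are assumptions made for illustration, and the real channel carries far more state than the single counter shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_THREADS 4

/* Hypothetical per-thread channel: only its owning thread touches it. */
struct sketch_channel {
	int      refs;            /* outstanding get() calls on this thread */
	uint64_t io_outstanding;  /* in-flight I/O on this channel */
};

/* One slot per thread for a single device. */
struct sketch_device {
	struct sketch_channel per_thread[SKETCH_MAX_THREADS];
};

/* Return this thread's channel, creating it on first use. */
static struct sketch_channel *
sketch_get_channel(struct sketch_device *dev, int tid)
{
	struct sketch_channel *ch;

	assert(tid >= 0 && tid < SKETCH_MAX_THREADS);
	ch = &dev->per_thread[tid];
	if (ch->refs == 0) {
		ch->io_outstanding = 0;   /* fresh channel */
	}
	ch->refs++;
	return ch;
}

/* Drop one reference; the slot is reusable once refs reaches zero. */
static void
sketch_put_channel(struct sketch_channel *ch)
{
	assert(ch->refs > 0);
	ch->refs--;
}
```

Asking twice from the same thread returns the same object, while a different thread gets its own, which is why nothing on the I/O path needs a lock.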

spdk_bdev_io — one request

A spdk_bdev_io is a single I/O operation. Read, write, flush, unmap, write-zeros, reset, compare, compare-and-write, copy, NVMe passthrough, zcopy, abort, seek — they're all bdev_io with different type values. The struct is defined in include/spdk/bdev_module.h:1125 and is a textbook tagged union.

  1. bdev — which device this request targets. Set by the framework before submit_request is called. The module does not set it.

  2. type — the discriminant of the tagged union. One of the SPDK_BDEV_IO_TYPE_* values. The module's submit_request dispatches on this. The enum is at include/spdk/bdev.h:103 — 22 distinct types as of v26.01.

  3. num_retries — how many times this request has been re-submitted. Used by telemetry and by the in_submit_request recursion guard. Don't touch it from a module.

  4. iov — a single embedded iovec for the common case (one buffer, no scatter-gather). The u.bdev field's iovs pointer is initialized to point at this when there's only one iovec. Saves a separate allocation for the simple case.

  5. child_iov[] — 32-element array used by the framework's split machinery when a request is broken into child I/Os (e.g. crossing an optimal_io_boundary). Modules do not touch this; the framework manages it during the split.

  6. u — the tagged union. u.bdev for read/write/flush/unmap/write-zeroes/compare/copy/zcopy/zone/etc. u.reset for reset. u.abort for abort. u.nvme_passthru for raw NVMe commands. u.zone_mgmt for zoned commands. The block_params struct (the most common one) is at include/spdk/bdev_module.h:848 and contains iovs, iovcnt, offset_blocks, num_blocks, md_buf, dif_*, memory_domain, and more.

  7. internal — framework-private. Holds ch (the channel), desc (the desc), cb (the caller callback), status, submit_tsc, split state, bounce buffer state, accel sequence pointers, retry state, and various flag bits. A module must not touch this struct.

  8. driver_ctx[] — the module's per-IO scratch space. The size of this region is determined at init time by the module's get_ctx_size() callback (see 4.2). For malloc it's sizeof(struct malloc_task). For passthru it's sizeof(struct passthru_bdev_io). The module reads it directly as bdev_io->driver_ctx; spdk_bdev_io_from_ctx() goes the other way, from a scratch pointer back to the bdev_io.
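The type/u pairing in items 2 and 6 is easier to see in miniature. The sketch below is a hypothetical four-type version: the enum values, the struct sketch_io layout, and the io_to_abort field name are illustrative assumptions, and the "work" is reduced to returning a block count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much smaller cousin of the SPDK_BDEV_IO_TYPE_* enum. */
enum sketch_io_type {
	SKETCH_IO_READ,
	SKETCH_IO_WRITE,
	SKETCH_IO_RESET,
	SKETCH_IO_ABORT
};

struct sketch_io {
	enum sketch_io_type type;           /* the discriminant */
	union {
		struct {                    /* valid for READ and WRITE */
			uint64_t offset_blocks;
			uint64_t num_blocks;
		} bdev;
		struct {                    /* valid for ABORT */
			struct sketch_io *io_to_abort;
		} abort;
	} u;
};

/* A module-side dispatcher: look at type, then read only the matching
 * union member. Returns the number of blocks the request covers, or a
 * negative errno for a type this toy module does not handle. */
static int64_t
sketch_submit_request(const struct sketch_io *io)
{
	assert(io != NULL);
	switch (io->type) {
	case SKETCH_IO_READ:
	case SKETCH_IO_WRITE:
		return (int64_t)io->u.bdev.num_blocks;
	case SKETCH_IO_RESET:
		return 0;                   /* a reset moves no data */
	default:
		return -ENOTSUP;
	}
}
```

The rule the switch enforces is the whole point of a tagged union: never read a union member unless type says it is the live one.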

The auxiliary-data escape hatch

The u union only covers fields the framework knows about. Sometimes a module needs to attach module-specific state to a bdev_io. For that, the framework provides two helpers:

This is a CONTAINEROF trick. driver_ctx is declared as a zero-length array, so the module's scratch space lives just past the end of the bdev_io struct. From a pointer to your scratch you can get back to the bdev_io header. Both directions are valid.
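Here is the trick in isolation, as a minimal sketch. struct sketch_io, struct sketch_task, and the sketch_* helpers are hypothetical; the sketch uses a C99 flexible array member where the text describes a zero-length array, which gives the same layout in this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request header with trailing module scratch space. */
struct sketch_io {
	uint64_t type;
	uint64_t status;
	uint8_t  driver_ctx[];   /* scratch begins right after the header */
};

/* Hypothetical module state stored in the scratch space. */
struct sketch_task {
	uint64_t bytes_copied;
};

/* Allocate header plus ctx_size bytes of scratch in one block. */
static struct sketch_io *
sketch_io_alloc(size_t ctx_size)
{
	return calloc(1, sizeof(struct sketch_io) + ctx_size);
}

/* Forward direction: header to scratch. */
static void *
sketch_io_ctx(struct sketch_io *io)
{
	assert(io != NULL);
	return io->driver_ctx;
}

/* Backward direction: scratch to header, by subtracting the offset
 * of driver_ctx within the header struct. */
static struct sketch_io *
sketch_io_from_ctx(void *ctx)
{
	assert(ctx != NULL);
	return (struct sketch_io *)((uint8_t *)ctx - offsetof(struct sketch_io, driver_ctx));
}
```

Because the scratch size is fixed when the pool is created, one allocation serves both the framework's header and the module's state, and neither side needs a separate lookup to find the other.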

How they relate: the lifecycle in five lines

Here's the most common path through the four types, end to end. It is what every higher-level consumer (nvmf, vhost, lvol) does underneath.

  1. spdk_bdev_open_ext(): get a desc. Bound to your thread.
  2. spdk_bdev_get_io_channel(): get a per-thread channel. Cached internally.
  3. spdk_bdev_read() etc.: the framework grabs a bdev_io from the mempool.
  4. submit_request(): the module sees a bdev_io on its channel.
  5. spdk_bdev_io_complete(): the module reports status and the framework runs your completion callback, where you return the bdev_io to the pool with spdk_bdev_free_io().

The point: at no point does anyone hold a global lock. The channel is per-thread; the bdev_io is per-operation; the desc is the only thing that crosses thread boundaries (via the event callback).

sequenceDiagram
  participant App as Application thread
  participant Fwk as bdev framework
  participant Mod as bdev module<br/>(e.g. malloc)
  participant Dev as backing device<br/>(DMA, AIO, NVMe)
  App->>Fwk: spdk_bdev_open_ext("Malloc0", true, event_cb, NULL, &desc)
  Fwk-->>App: desc (bound to this thread)
  App->>Fwk: spdk_bdev_get_io_channel(desc)
  Fwk->>Fwk: alloc spdk_bdev_channel<br/>(per-thread state)
  Fwk-->>App: ch (spdk_io_channel *)
  App->>Fwk: spdk_bdev_read(desc, ch, buf, 0, 4096, cb, arg)
  Fwk->>Fwk: alloc bdev_io from mempool
  Fwk->>Fwk: fill u.bdev with iovs, offset_blocks, num_blocks
  Fwk->>Fwk: increment ch->io_outstanding
  Fwk->>Mod: fn_table->submit_request(ch, bdev_io)
  Mod->>Dev: copy from malloc_buf to buf
  Dev-->>Mod: complete
  Mod->>Fwk: spdk_bdev_io_complete(bdev_io, SUCCESS)
  Fwk->>Fwk: decrement ch->io_outstanding
  Fwk->>App: cb(bdev_io, success=true, arg)
  App->>Fwk: spdk_bdev_free_io(bdev_io)
  Fwk->>Fwk: return bdev_io to per-thread cache
fig. 2 — the I/O path through the four types

fig. 2   The minimum I/O path. No locks. The framework and module communicate through function pointers and an in-memory struct. The completion callback runs on the same thread that submitted the I/O, which is what makes SPDK polling work.
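The submit and complete half of that path can be modeled in a few lines. This is a sketch under stated assumptions: every sketch_* name is hypothetical, the request is caller-provided instead of coming from a mempool, and the toy module completes inline where a real one would usually complete later from a poller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_io;
typedef void (*sketch_io_cb)(struct sketch_io *io, bool success, void *cb_arg);

struct sketch_channel {
	uint64_t io_outstanding;          /* per-thread, so no lock needed */
};

struct sketch_io {
	struct sketch_channel *ch;
	sketch_io_cb           cb;
	void                  *cb_arg;
};

/* The seam between framework and module: a table of function pointers. */
struct sketch_fn_table {
	void (*submit_request)(struct sketch_channel *ch, struct sketch_io *io);
};

/* Step 5: the module reports completion; the framework updates its
 * counter and calls the consumer back on the same thread. */
static void
sketch_io_complete(struct sketch_io *io, bool success)
{
	assert(io->ch->io_outstanding > 0);
	io->ch->io_outstanding--;
	io->cb(io, success, io->cb_arg);
}

/* Step 4: a trivial module that succeeds immediately. */
static void
sketch_module_submit(struct sketch_channel *ch, struct sketch_io *io)
{
	(void)ch;
	sketch_io_complete(io, true);
}

/* Step 3: the framework fills in the request and hands it to the module. */
static void
sketch_read(const struct sketch_fn_table *fn, struct sketch_channel *ch,
	    struct sketch_io *io, sketch_io_cb cb, void *cb_arg)
{
	io->ch = ch;
	io->cb = cb;
	io->cb_arg = cb_arg;
	ch->io_outstanding++;
	fn->submit_request(ch, io);
}

/* Example consumer callback: store 1 on success, -1 on failure. */
static void
sketch_record_result(struct sketch_io *io, bool success, void *cb_arg)
{
	(void)io;
	*(int *)cb_arg = success ? 1 : -1;
}
```

Nothing in the model is shared between threads: the channel counter and the request both belong to the submitting thread from start to finish.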

The capability system: what a bdev can do

Not every bdev supports every I/O type. malloc supports read, write, flush, unmap, write-zeroes, zcopy, abort, copy, reset. The split module (which exposes a subset of an underlying bdev) supports whatever its base supports. The null bdev supports only a small set (it completes reads and writes without touching any storage). The framework needs to know, ahead of time, what a given bdev can do, and it needs to know it in a way that doesn't require poking the module on the hot path.

That's why spdk_bdev has a bitmap field, io_type_supported. The module sets it once at register time:

The framework will OR all the "true" types into the io_type_supported bitmap. Public callers can query it without going through the module:

Some capabilities aren't static. The passthru module delegates to its base bdev:

The passthru bdev's io_type_supported bitmap is always set to "everything" at register time, but the function pointer defers the actual decision. This is the way to expose dynamic capabilities.
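The static bitmap and the dynamic override described above can be sketched together. This is a toy model that follows the description on this page: the enum, struct sketch_bdev, and all sketch_* functions are hypothetical, and the hook signature is an assumption and not the real fn_table entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical I/O types; each value is a bit position in the mask. */
enum sketch_io_type {
	SKETCH_IO_READ = 1,
	SKETCH_IO_WRITE,
	SKETCH_IO_UNMAP,
	SKETCH_IO_FLUSH,
	SKETCH_IO_NUM_TYPES
};

struct sketch_bdev {
	uint64_t io_type_mask;              /* static capabilities */
	bool (*io_type_supported)(const struct sketch_bdev *bdev,
				  enum sketch_io_type type); /* optional dynamic hook */
	const struct sketch_bdev *base;     /* underlying bdev, NULL for a leaf */
};

/* OR every supported type into one mask, as done once at register time. */
static uint64_t
sketch_build_mask(const enum sketch_io_type *types, int count)
{
	uint64_t mask = 0;

	for (int i = 0; i < count; i++) {
		assert(types[i] < SKETCH_IO_NUM_TYPES);
		mask |= UINT64_C(1) << types[i];
	}
	return mask;
}

/* Query: cheap mask test first, then the dynamic hook if one is set. */
static bool
sketch_io_type_supported(const struct sketch_bdev *bdev, enum sketch_io_type type)
{
	if (((bdev->io_type_mask >> type) & 1) == 0) {
		return false;
	}
	if (bdev->io_type_supported != NULL) {
		return bdev->io_type_supported(bdev, type);
	}
	return true;
}

/* Passthru-style hook: defer the decision to the base bdev. */
static bool
sketch_passthru_io_type_supported(const struct sketch_bdev *bdev, enum sketch_io_type type)
{
	return sketch_io_type_supported(bdev->base, type);
}
```

A passthru-style bdev registers with an all-ones mask plus the hook, so its answer always tracks whatever sits underneath it.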

Why the split exists: lock-free I/O on the hot path

It's worth dwelling on this. The four-way split is not accidental. Each piece exists because the alternative would either need a global lock or would require the framework to pre-allocate things it can't pre-allocate.

Compare this with the obvious single-struct design: one spdk_io_request with a bdev pointer, a thread pointer, and the I/O data. That would require the framework to look up thread-specific state on every I/O, and "look up" means a hash table or a global lock or both. Either kills performance.

Instead:

  • The bdev is immutable after register. It lives in a global list. The pointer to it is your starting point; it never changes.

  • The desc is the only thing that crosses threads. One desc per consumer, one set of "this consumer's callbacks." The event callback is invoked on the desc's thread, which is the only place any cross-thread notification is needed.

  • The channel is per-thread. After spdk_bdev_get_io_channel(), every subsequent I/O on that (desc, thread) pair touches only thread-local data. No locks.

  • The bdev_io is per-I/O. Allocated from a thread-cached mempool (spdk_mempool with a per-thread STAILQ cache). Allocation and free are lock-free on the hot path. See 4.3 for the full lifecycle.
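The last bullet, a shared pool fronted by a per-thread cache, looks roughly like this. It is a simplified sketch with hypothetical names throughout: both lists are plain singly linked free lists, the cache size is arbitrary, and the shared pool is left unsynchronized because the example is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_SIZE 4

struct sketch_io {
	struct sketch_io *next;       /* free-list link */
};

/* Shared pool: in a real system, touching it needs synchronization. */
struct sketch_pool {
	struct sketch_io *free_list;
	unsigned          shared_gets; /* how often the slow path ran */
};

/* Per-thread cache: owned by one thread, so no synchronization. */
struct sketch_cache {
	struct sketch_io *free_list;
	unsigned          count;
};

/* Allocate: try the thread's own cache first, then the shared pool. */
static struct sketch_io *
sketch_io_get(struct sketch_cache *cache, struct sketch_pool *pool)
{
	struct sketch_io *io = cache->free_list;

	if (io != NULL) {             /* fast path: thread-local only */
		cache->free_list = io->next;
		cache->count--;
		return io;
	}
	io = pool->free_list;         /* slow path: shared state */
	if (io != NULL) {
		pool->free_list = io->next;
		pool->shared_gets++;
	}
	return io;
}

/* Free: refill the thread's cache, spilling to the pool when full. */
static void
sketch_io_put(struct sketch_cache *cache, struct sketch_pool *pool, struct sketch_io *io)
{
	assert(io != NULL);
	if (cache->count < SKETCH_CACHE_SIZE) {
		io->next = cache->free_list;
		cache->free_list = io;
		cache->count++;
	} else {
		io->next = pool->free_list;
		pool->free_list = io;
	}
}
```

In steady state a thread keeps recycling the same few requests through its own cache, so the shared pool is touched only during warm-up or bursts.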

The performance difference between this design and a "reasonable" single-struct design is roughly an order of magnitude at high queue depth. SPDK's published benchmarks report more than 10 million IOPS from a single CPU core, spread across several NVMe devices; the design that makes that possible is the one described above.

Edge cases & what trips people up

The data structure story looks clean from 10,000 feet. At ground level there are several places where a normal-looking line of code can leak a desc, lose an I/O, or trigger a double-free. This section collects the ones we've actually hit.

Closing a desc while I/O is in flight

You submit a write on thread A, then you call spdk_bdev_close(desc) from thread B. The framework asserts; the process aborts. The close must happen on the same thread that opened the desc. But what about I/O submitted but not yet completed?

The framework handles it: spdk_bdev_close waits for all in-flight I/O on the desc to complete before freeing the desc. It does this by iterating the io_submitted list of every channel that was opened through this desc. The wait is done with spdk_thread_send_msg and a counter; you don't write any of this yourself. Just don't close from a different thread.

Destroying a bdev while a channel is open

The reverse direction is harder. A bdev can be hot-removed while consumers still have open descs and channels. The framework will not force-close the desc; instead, it fires the desc's event_cb with SPDK_BDEV_EVENT_REMOVE on the desc's thread. The consumer is responsible for:

  1. Stop submitting new I/O.
  2. Wait for in-flight I/O to complete.
  3. Call spdk_bdev_close(desc).

Only after the last desc on the bdev is closed does the unregister actually take effect. This is why spdk_bdev_unregister's callback fires asynchronously — the framework returns immediately and the callback runs when every consumer has released their desc. In the meantime, the bdev is in SPDK_BDEV_STATUS_REMOVING state and rejects new spdk_bdev_open_ext calls.
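The three consumer duties on hot-remove amount to a small state machine. The sketch below is a toy: struct sketch_consumer and the sketch_* functions are hypothetical, and setting closed stands in for the spdk_bdev_close(desc) call that a real consumer would make on the desc's thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_event {
	SKETCH_EVENT_REMOVE,
	SKETCH_EVENT_RESIZE
};

/* Hypothetical consumer-side bookkeeping for one open desc. */
struct sketch_consumer {
	bool     removing;        /* REMOVE seen: no new submissions */
	bool     closed;          /* desc has been closed */
	uint64_t io_outstanding;  /* in-flight I/O */
};

/* Close once removal was requested and nothing is in flight. */
static void
sketch_try_close(struct sketch_consumer *c)
{
	if (c->removing && !c->closed && c->io_outstanding == 0) {
		c->closed = true; /* stands in for spdk_bdev_close(desc) */
	}
}

/* Event callback: on REMOVE, stop submitting and close if already idle. */
static void
sketch_on_event(enum sketch_event event, struct sketch_consumer *c)
{
	if (event == SKETCH_EVENT_REMOVE) {
		c->removing = true;
		sketch_try_close(c);
	}
}

/* Submission gate: refuse new I/O after REMOVE. */
static bool
sketch_submit(struct sketch_consumer *c)
{
	if (c->removing) {
		return false;
	}
	c->io_outstanding++;
	return true;
}

/* Completion path: the last completion after REMOVE triggers the close. */
static void
sketch_io_done(struct sketch_consumer *c)
{
	assert(c->io_outstanding > 0);
	c->io_outstanding--;
	sketch_try_close(c);
}
```

The close can be triggered from either of two places, the event callback or the final completion, and checking in both is what keeps the desc from leaking.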

Concurrent spdk_for_each_bdev calls

The bdev module list is protected by a spinlock; the spdk_for_each_bdev walker takes it briefly to find the bdev, then drops it before invoking the user callback. The user callback is therefore called without holding the bdev-manager spinlock — so it is safe to do almost anything, including calling spdk_bdev_open_ext from inside the callback. (If you do call spdk_bdev_unregister from inside, you may cause the iteration to skip a bdev or visit one that is mid-removal; the framework documents this and asks you not to do it.)

Channel vs desc: which one owns what?

A common mistake is to think the channel "owns" the desc. It doesn't. A desc can have many channels (one per thread). A channel is a per-(desc, thread) object that you release yourself with spdk_put_io_channel(), on the thread that obtained it; release every channel before you close the desc. The cleanest mental model: desc is for ownership, channel is for I/O.

required_alignment and double-buffering

If the caller's buffer is misaligned, the framework automatically routes the I/O through a bounce buffer in the accel framework. The user-visible API is unchanged: you call spdk_bdev_read(desc, ch, buf, ...) with buf misaligned, the framework detects it, copies the data through DMA-aligned memory, and reports success. The cost is one extra memory copy per misaligned I/O. If you care, allocate aligned buffers. See 4.3 for the bounce-buffer path.
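The detection step is a single mask test. A minimal sketch, assuming the alignment is stored as a log2 exponent as described for required_alignment earlier on this page; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when buf is aligned to 2^required_alignment bytes.
 * required_alignment is an exponent: 12 means 4096-byte alignment,
 * 0 means no alignment requirement at all. */
static bool
sketch_buf_is_aligned(const void *buf, uint8_t required_alignment)
{
	uintptr_t mask;

	assert(required_alignment < sizeof(uintptr_t) * 8);
	mask = ((uintptr_t)1 << required_alignment) - 1;
	return ((uintptr_t)buf & mask) == 0;
}
```

A buffer at address 0x2200 passes for an exponent of 9 (512-byte alignment) and fails for 12, so the same pointer may or may not take the bounce path depending on the bdev it is sent to.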

What you can and can't pass across thread boundaries

A spdk_bdev * pointer is fine to pass around; it's effectively immutable. A spdk_bdev_desc * needs more care: any thread may use it to get its own channel and submit I/O, but it must be closed on the thread that opened it. A spdk_io_channel * is bound most tightly: it can only be used from the thread that created it. Using a channel from another thread, or closing a desc from one, without spdk_thread_send_msg marshalling is a bug waiting to assert.

What to take away

The bdev framework is built around four named data types. spdk_bdev is the device. spdk_bdev_desc is your open handle on it. spdk_bdev_channel is your per-thread context for the desc. spdk_bdev_io is one request in flight. Each exists to do one thing well, and the design is what makes SPDK's lock-free hot path possible.

With these four types in your head, the next page — the bdev module interface — is just "the vtable that defines a bdev type." The page after that — the bdev_io lifecycle — is just "what happens to one of those bdev_ios between submit and complete." And the hierarchy page is just "what happens when a bdev sits on top of another bdev."