The four things you need to know by name.
The bdev framework looks like a wall of structs if you open the headers cold. But four of them do almost all the work, and once you can picture them and their relationships, every other bdev page in this curriculum reads in half the time. This is that picture.
- The big idea: noun, open, context, work
- spdk_bdev: the thing itself
- spdk_bdev_desc: the open handle
- spdk_bdev_channel: per-thread state
- spdk_bdev_io: one request
- How they relate: the lifecycle in five lines
- The capability system: what a bdev can do
- Why the split exists: lock-free I/O on the hot path
- Edge cases & what trips people up
The big idea: noun, open, context, work
Most storage stacks have one core data type: the I/O request. SPDK intentionally splits that into four so each piece can do one thing well. The four names are:
| Type | What it is | Created by | Created per |
|---|---|---|---|
| spdk_bdev | The block device. The "thing." | The module that owns the device | One per device |
| spdk_bdev_desc | An open handle on a bdev. | spdk_bdev_open_ext() | One per consumer |
| spdk_bdev_channel | Per-thread state for one desc on one bdev. | spdk_bdev_get_io_channel() | One per (desc, thread) pair |
| spdk_bdev_io | A single I/O request in flight. | The framework's mempool | One per I/O |
A useful way to remember them: the bdev is the noun ("the disk"), the desc is the verb "to open," the channel is the per-thread context, and the bdev_io is the work being done. Each one has a well-defined lifetime, a well-defined owner, and a specific job.
flowchart TB
    subgraph "Module (e.g. malloc, nvme, lvol)"
        BDEV["spdk_bdev<br/>'Malloc0'<br/>blocklen=4096, blockcnt=1024"]
    end
    subgraph "Consumer A (your code)"
        DESC_A["spdk_bdev_desc<br/>write=true<br/>thread=A"]
        CH_A1["spdk_bdev_channel<br/>thread=A"]
        CH_A2["spdk_bdev_channel<br/>thread=B"]
    end
    subgraph "Consumer B (nvmf target)"
        DESC_B["spdk_bdev_desc<br/>write=false<br/>thread=B"]
    end
    DESC_A -- "holds ref to" --> BDEV
    DESC_B -- "holds ref to" --> BDEV
    DESC_A -- "get_io_channel" --> CH_A1
    DESC_A -- "get_io_channel from a different thread" --> CH_A2
    CH_A1 -- "allocates from" --> IO1["spdk_bdev_io (READ)"]
    CH_A1 -- "allocates from" --> IO2["spdk_bdev_io (WRITE)"]
    CH_A2 -- "allocates from" --> IO3["spdk_bdev_io (FLUSH)"]
    classDef bdev fill:#f5e6c8,stroke:#a17f1a;
    classDef desc fill:#cfe1ff,stroke:#1c4f8a;
    classDef channel fill:#d6f5d6,stroke:#2a6f2a;
    classDef io fill:#f5d6e0,stroke:#8a1c4f;
    class BDEV bdev
    class DESC_A,DESC_B desc
    class CH_A1,CH_A2 channel
    class IO1,IO2,IO3 io
fig. 1 One spdk_bdev (Malloc0). Two open handles
(spdk_bdev_desc) on it — yours, read-write; the nvmf
target's, read-only. Your desc has two channels because you grabbed
one on thread A and one on thread B. Each channel has its own queue
of in-flight spdk_bdev_io requests.
spdk_bdev — the thing itself
An spdk_bdev is a single block device. It's the
in-memory representation of "a disk" — real or virtual. The struct
itself is defined in
include/spdk/bdev_module.h:420 and is
nearly 400 lines long. Most of those lines are knobs (block size,
alignment hints, max I/O size, DIF configuration, NUMA id, UUID…).
A bdev is created and registered by a module (the
concept of a module is the subject of 4.2).
Here are the fields you'll actually look at:
The annotations on the numbered lines:
- ctxt: a back-pointer to whatever the module put behind this bdev. For the malloc module, this is struct malloc_disk *. For the NVMe module, it's struct spdk_nvme_ns *. The framework never dereferences this; it's pure pass-through for the module's convenience.
- name: the device's unique name. This is what bdev_get_bdevs takes as an argument and what shows up in JSON-RPC. The malloc module names them Malloc0, Malloc1, … The NVMe module builds names from the controller name and the namespace ID (for example Nvme0n1).
- product_name: a type name, not a unique one. Every malloc bdev has product_name = "Malloc disk"; every lvol has "Logical Volume"; every passthru has "passthru". This is the string used by bdev_get_bdevs -p <product_name> to filter.
- blocklen: logical block size in bytes. Always a power of 2, almost always 512 or 4096. The framework refuses I/O that isn't aligned to this size.
- phys_blocklen: physical block size. The smallest unit the device can atomically write. May equal blocklen.
- blockcnt: number of logical blocks. Multiply by blocklen to get the total byte capacity.
- io_type_supported: a bitmap, ORed from the SPDK_BDEV_IO_TYPE_* values. The framework will reject I/O of an unsupported type at submit time. Modules can also expose a more dynamic io_type_supported() function pointer for cases where capability depends on something runtime-checkable.
- required_alignment: exponent (log2) of the alignment requirement. required_alignment = 12 means buffers must be 4096-byte aligned. The framework will automatically double-buffer (copy through a bounce buffer) any I/O that violates this. See 4.3 for the bounce path.
- module: a back-pointer to the spdk_bdev_module that registered this bdev. Used for things like calling module_init_done on async modules and looking up module-level config.
- fn_table: the function table. The "vtable." Points at the struct that contains submit_request, destruct, io_type_supported, get_io_channel, and so on. This is the seam between the framework and a module. The full thing is in include/spdk/bdev_module.h:299.
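To make the arithmetic behind these fields concrete, here is a minimal sketch. It uses a hypothetical, trimmed-down `struct mock_bdev` rather than the real `struct spdk_bdev`; every `mock_` name is made up for illustration, and the 4 KiB alignment on Malloc0 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields described above.
 * This is NOT the real struct spdk_bdev. */
struct mock_bdev {
    void       *ctxt;               /* module-private back-pointer */
    const char *name;               /* unique name, e.g. "Malloc0" */
    const char *product_name;       /* type name, e.g. "Malloc disk" */
    uint32_t    blocklen;           /* logical block size in bytes */
    uint32_t    phys_blocklen;      /* physical block size in bytes */
    uint64_t    blockcnt;           /* number of logical blocks */
    uint8_t     required_alignment; /* log2 of the buffer alignment */
};

/* Total capacity in bytes: blockcnt * blocklen. */
uint64_t mock_bdev_capacity_bytes(const struct mock_bdev *b)
{
    return b->blockcnt * (uint64_t)b->blocklen;
}

/* Alignment in bytes implied by the exponent (12 gives 4096). */
size_t mock_bdev_alignment_bytes(const struct mock_bdev *b)
{
    return (size_t)1 << b->required_alignment;
}

/* True when a byte range starts and ends on a logical block boundary. */
bool mock_bdev_range_is_block_aligned(const struct mock_bdev *b,
                                      uint64_t offset, uint64_t nbytes)
{
    return (offset % b->blocklen) == 0 && (nbytes % b->blocklen) == 0;
}

/* The Malloc0 device from fig. 1; the alignment value is assumed. */
struct mock_bdev mock_malloc0(void)
{
    struct mock_bdev b = {
        .ctxt = NULL,
        .name = "Malloc0",
        .product_name = "Malloc disk",
        .blocklen = 4096,
        .phys_blocklen = 4096,
        .blockcnt = 1024,
        .required_alignment = 12,
    };
    return b;
}
```

With the values from fig. 1, `mock_bdev_capacity_bytes()` gives 1024 × 4096 bytes, which is 4 MiB.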
spdk_bdev_desc — the open handle
A spdk_bdev_desc is what you get when you
open a bdev. Think of it as a file descriptor. The
header has only a forward declaration:
Users see only the forward declaration. The full struct is declared in lib/bdev/bdev.c:344 and contains the framework's bookkeeping. The public API only hands out pointers to it.
To open a bdev, you call spdk_bdev_open_ext (or its
async sibling). The relevant public declaration is in
include/spdk/bdev.h:498 :
Three things to notice:
- write is sticky. A read-only desc can never be upgraded to read-write in place. If you need write access, open it with write=true.
- You must pass an event callback. The framework will use it to notify you of hot-removal, resize, and media management events. There is no "I don't care" option. If you don't care, pass a function that just logs.
- The desc is bound to a thread. The spdk_bdev_close() call must happen on the same spdk_thread that called spdk_bdev_open_ext(). This is enforced and asserted; the framework uses the desc's thread to schedule unregister notifications.
The desc is the level at which write access is enforced. Two consumers can each open the same bdev — one read-only, one read-write. Both get their own desc; both can do I/O; the framework arbitrates through the desc list on the bdev.
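The three rules can be modelled in a few lines. The sketch below is a self-contained mock, not SPDK code: the `mock_` types, the integer thread IDs, and the specific error codes (`-EINVAL`, `-EBADF`, `-EPERM`) are assumptions chosen for illustration. In particular, the text above says the real framework asserts on a wrong-thread close rather than returning an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event types; only the shape mirrors the text above. */
enum mock_event_type { MOCK_EVENT_REMOVE, MOCK_EVENT_RESIZE, MOCK_EVENT_MEDIA_MANAGEMENT };

typedef void (*mock_event_cb)(enum mock_event_type type, const char *bdev_name, void *ctx);

struct mock_desc {
    const char   *bdev_name;
    bool          write;      /* fixed at open time, never upgraded */
    mock_event_cb event_cb;
    void         *event_ctx;
    int           thread_id;  /* thread that opened the desc */
};

/* The "I don't care" callback: it only logs and counts the event. */
void mock_log_event(enum mock_event_type type, const char *bdev_name, void *ctx)
{
    int *seen = ctx;

    if (seen != NULL) {
        (*seen)++;
    }
    fprintf(stderr, "bdev %s: event %d\n", bdev_name, (int)type);
}

/* Open: the event callback is mandatory (assumed error code). */
int mock_open(const char *name, bool write, mock_event_cb cb, void *ctx,
              int thread_id, struct mock_desc *desc)
{
    if (name == NULL || cb == NULL || desc == NULL) {
        return -EINVAL;
    }
    desc->bdev_name = name;
    desc->write = write;
    desc->event_cb = cb;
    desc->event_ctx = ctx;
    desc->thread_id = thread_id;
    return 0;
}

/* Write access is decided by the desc, not by the bdev (assumed error code). */
int mock_write_allowed(const struct mock_desc *desc)
{
    return desc->write ? 0 : -EBADF;
}

/* Close must come from the opening thread; this mock returns an error
 * where the real framework is described as asserting. */
int mock_close(struct mock_desc *desc, int calling_thread_id)
{
    if (calling_thread_id != desc->thread_id) {
        return -EPERM;
    }
    desc->event_cb = NULL;
    return 0;
}
```

Two consumers would each call `mock_open()` and get their own desc; the read-only one fails `mock_write_allowed()` while the read-write one passes, and neither affects the other.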
spdk_bdev_channel — per-thread state
A spdk_bdev_channel is the per-thread context for a
desc on a bdev. You get one by calling
spdk_bdev_get_io_channel(desc) from a specific
thread. The full struct lives in
lib/bdev/bdev.c:278 . Here is the
annotated core:
- bdev: a back-pointer to the bdev this channel is for. Mostly redundant (you can get to it through the desc) but convenient on the hot path.
- channel: the module's per-thread channel, obtained from the module's get_io_channel() callback. This is where NVMe's per-thread submission queue lives, where malloc's accel_channel lives, where AIO's per-thread completion state lives. The framework hides it inside the spdk_bdev_channel and passes the whole thing to your submit_request.
- accel_channel: a separate channel into the accel framework. The bdev layer needs it for bounce buffers, accel sequences, and DMA-from-memcpy work that the framework itself initiates. Every channel gets one automatically.
- shared_resource: a sneaky one. Two channels on the same thread, opened against the same module's bdev, share a single set of nomem queues, io_outstanding counter, and a few other fields. This is what makes it cheap to have lots of channels: the heavy state is per-module-per-thread, not per-channel.
- stat: per-channel I/O statistics (bytes read, bytes written, latencies, min/max). Aggregated into the bdev's stats on channel destroy. What bdev_get_iostat reports.
- io_outstanding: count of in-flight I/O on this channel. Drives the queue depth poller and QoS decisions.
- io_submitted: a TAILQ of every bdev_io that has been handed to the module and not yet completed. The framework uses this for cleanup if the channel goes away while I/O is in flight (see the "what trips people up" section below).
- io_locked: I/Os waiting because they target a locked LBA range (acquired via spdk_bdev_quiesce_range).
- io_accel_exec: I/Os that have an accel sequence being executed.
- io_memory_domain: I/Os waiting on a memory domain pull/push.
- qos_queued_io: I/Os held back because a QoS rate limit has been reached. The QoS poller drains this periodically.
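The pairing of `io_outstanding` with the `io_submitted` TAILQ is easy to picture with the standard `<sys/queue.h>` macros. This is a rough sketch with hypothetical `mock_` types, not the framework's bookkeeping: it only shows an I/O being linked on submit, unlinked on completion, and the list being walked during cleanup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical request: just enough to sit on a list. */
struct mock_io {
    uint64_t offset_blocks;
    TAILQ_ENTRY(mock_io) link;   /* linkage for io_submitted */
};

/* Hypothetical channel: the counter and the list travel together. */
struct mock_channel {
    uint32_t io_outstanding;              /* in-flight count */
    TAILQ_HEAD(, mock_io) io_submitted;   /* handed to the module, not yet done */
};

void mock_channel_init(struct mock_channel *ch)
{
    ch->io_outstanding = 0;
    TAILQ_INIT(&ch->io_submitted);
}

/* Submit: count the I/O and remember it. */
void mock_channel_submit(struct mock_channel *ch, struct mock_io *io)
{
    ch->io_outstanding++;
    TAILQ_INSERT_TAIL(&ch->io_submitted, io, link);
}

/* Complete: forget the I/O and drop the count. */
void mock_channel_complete(struct mock_channel *ch, struct mock_io *io)
{
    assert(ch->io_outstanding > 0);
    TAILQ_REMOVE(&ch->io_submitted, io, link);
    ch->io_outstanding--;
}

/* Cleanup walk: complete whatever is still listed; returns how many. */
uint32_t mock_channel_drain(struct mock_channel *ch)
{
    uint32_t drained = 0;
    struct mock_io *io;

    while ((io = TAILQ_FIRST(&ch->io_submitted)) != NULL) {
        mock_channel_complete(ch, io);
        drained++;
    }
    return drained;
}
```

Because every field here belongs to one channel, and a channel belongs to one thread, none of these operations needs a lock.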
spdk_bdev_io — one request
A spdk_bdev_io is a single I/O operation. Read,
write, flush, unmap, write-zeroes, reset, compare,
compare-and-write, copy, NVMe passthrough, zcopy, abort, seek —
they're all bdev_io with different type values.
The struct is defined in
include/spdk/bdev_module.h:1125 and
is a textbook tagged union.
- bdev: which device this request targets. Set by the framework before submit_request is called. The module does not set it.
- type: the discriminant of the tagged union. One of the SPDK_BDEV_IO_TYPE_* values. The module's submit_request dispatches on this. The enum is at include/spdk/bdev.h:103 (22 distinct types as of v26.01).
- num_retries: how many times this request has been re-submitted. Used by telemetry and by the in_submit_request recursion guard. Don't touch it from a module.
- iov: a single embedded iovec for the common case (one buffer, no scatter-gather). The u.bdev field's iovs pointer is initialized to point at this when there's only one iovec. Saves a separate allocation for the simple case.
- child_iov[]: 32-element array used by the framework's split machinery when a request is broken into child I/Os (e.g. crossing an optimal_io_boundary). Modules do not touch this; the framework manages it during the split.
- u: the tagged union. u.bdev for read/write/flush/unmap/write-zeroes/compare/copy/zcopy/zone/etc. u.reset for reset. u.abort for abort. u.nvme_passthru for raw NVMe commands. u.zone_mgmt for zoned commands. The block_params struct (the most common one) is at include/spdk/bdev_module.h:848 and contains iovs, iovcnt, offset_blocks, num_blocks, md_buf, dif_*, memory_domain, and more.
- internal: framework-private. Holds ch (the channel), desc (the desc), cb (the caller callback), status, submit_tsc, split state, bounce buffer state, accel sequence pointers, retry state, and various flag bits. A module must not touch this struct.
- driver_ctx[]: the module's per-IO scratch space. The size of this region is determined at init time by the module's get_ctx_size() callback (see 4.2). For malloc it's sizeof(struct malloc_task). For passthru it's sizeof(struct passthru_bdev_io). Accessed with spdk_bdev_io_from_ctx().
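For readers who have not met a tagged union in C, the pattern looks roughly like this. The sketch uses a hypothetical `struct mock_io` with three union arms and a made-up return convention; the real `spdk_bdev_io` has more arms and more fields, and a module's `submit_request` returns nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of I/O types. */
enum mock_io_type {
    MOCK_IO_READ,
    MOCK_IO_WRITE,
    MOCK_IO_FLUSH,
    MOCK_IO_RESET,
    MOCK_IO_ABORT
};

struct mock_io {
    enum mock_io_type type;   /* discriminant: says which arm of u is valid */
    union {
        struct {              /* valid for READ / WRITE / FLUSH */
            uint64_t offset_blocks;
            uint64_t num_blocks;
        } bdev;
        struct {              /* valid for RESET */
            int force;
        } reset;
        struct {              /* valid for ABORT */
            struct mock_io *io_to_abort;
        } abort;
    } u;
};

/* Dispatch on the discriminant and read only the matching arm.
 * Returns the number of blocks touched, or -1 when the request is malformed
 * (a convention invented for this sketch). */
int64_t mock_submit_request(const struct mock_io *io)
{
    switch (io->type) {
    case MOCK_IO_READ:
    case MOCK_IO_WRITE:
    case MOCK_IO_FLUSH:
        return (int64_t)io->u.bdev.num_blocks;
    case MOCK_IO_RESET:
        return 0;
    case MOCK_IO_ABORT:
        return io->u.abort.io_to_abort != NULL ? 0 : -1;
    }
    return -1;
}
```

Nothing stops code from reading the wrong arm of `u`; the `type` field is a convention that every reader has to honour, which is why the dispatch lives in one `switch`.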
The auxiliary-data escape hatch
The u union only covers fields the framework
knows about. Sometimes a module needs to attach
module-specific state to a bdev_io. For that, the framework
provides two helpers:
This is a CONTAINEROF trick. driver_ctx
is declared as a zero-length array, so the module's scratch
space lives just past the end of the bdev_io struct. From a
pointer to your scratch you can get back to the bdev_io
header. Both directions are valid.
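The pointer arithmetic is short enough to show in full. This is a sketch on a hypothetical `struct mock_bdev_io`; it uses a C99 flexible array member where the text describes a zero-length array, and `mock_task` is an invented scratch type. Only the technique is meant to carry over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header followed directly by module scratch space. */
struct mock_bdev_io {
    uint64_t type;
    uint64_t offset_blocks;
    uint64_t num_blocks;
    uint8_t  driver_ctx[];   /* scratch starts here, sized at allocation */
};

/* Hypothetical per-I/O module state that lives in driver_ctx. */
struct mock_task {
    uint64_t bytes_done;
    int      status;
};

/* The scratch area must start at an offset suitable for the task type. */
_Static_assert(offsetof(struct mock_bdev_io, driver_ctx) % _Alignof(struct mock_task) == 0,
               "driver_ctx must be aligned for struct mock_task");

/* Allocate header plus ctx_size bytes of scratch in one block. */
struct mock_bdev_io *mock_bdev_io_alloc(size_t ctx_size)
{
    return calloc(1, sizeof(struct mock_bdev_io) + ctx_size);
}

/* Forward direction: bdev_io to scratch. */
void *mock_ctx_from_bdev_io(struct mock_bdev_io *io)
{
    return io->driver_ctx;
}

/* Reverse direction: scratch to bdev_io, by subtracting the member offset. */
struct mock_bdev_io *mock_bdev_io_from_ctx(void *ctx)
{
    return (struct mock_bdev_io *)((char *)ctx - offsetof(struct mock_bdev_io, driver_ctx));
}
```

A module would typically cast the scratch pointer to its own task type in `submit_request`, pass that pointer to its backend, and use the reverse helper in the backend's completion path to find the request again.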
How they relate: the lifecycle in five lines
Here's the most common path through the four types, end to end. It is what every higher-level consumer (nvmf, vhost, lvol) does underneath.
- spdk_bdev_open_ext()
- spdk_bdev_get_io_channel()
- spdk_bdev_read() etc.
- submit_request()
- spdk_bdev_io_complete()
The point: at no point does anyone hold a global lock. The channel is per-thread; the bdev_io is per-operation; the desc is the only thing that crosses thread boundaries (via the event callback).
sequenceDiagram
    participant App as Application thread
    participant Fwk as bdev framework
    participant Mod as bdev module<br/>(e.g. malloc)
    participant Dev as backing device<br/>(DMA, AIO, NVMe)
    App->>Fwk: spdk_bdev_open_ext("Malloc0", true, event_cb, event_ctx, &desc)
    Fwk-->>App: desc (bound to this thread)
    App->>Fwk: spdk_bdev_get_io_channel(desc)
    Fwk->>Fwk: alloc spdk_bdev_channel<br/>(per-thread state)
    Fwk-->>App: ch (spdk_io_channel *)
    App->>Fwk: spdk_bdev_read(desc, ch, buf, 0, 4096, cb, arg)
    Fwk->>Fwk: alloc bdev_io from mempool
    Fwk->>Fwk: fill u.bdev with iovs, offset_blocks, num_blocks
    Fwk->>Fwk: increment ch->io_outstanding
    Fwk->>Mod: fn_table->submit_request(ch, bdev_io)
    Mod->>Dev: copy from malloc_buf to buf
    Dev-->>Mod: complete
    Mod->>Fwk: spdk_bdev_io_complete(bdev_io, SUCCESS)
    Fwk->>Fwk: decrement ch->io_outstanding
    Fwk->>Fwk: free bdev_io back to per-thread cache
    Fwk->>App: cb(bdev_io, success=true, arg)
fig. 2 The minimum I/O path. No locks. The framework and module communicate through function pointers and an in-memory struct. The completion callback runs on the same thread that submitted the I/O, which is what makes SPDK polling work.
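The same five steps can be traced in a toy model. Everything below is a hypothetical mock of a malloc-style device: the `mock_` names, the stack-allocated request (the real framework allocates from a pool), and the synchronous completion are simplifications made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_BLOCKLEN 512u
#define MOCK_BLOCKCNT 8u

struct mock_bdev_io;
typedef void (*mock_io_cb)(struct mock_bdev_io *io, bool success, void *arg);

struct mock_bdev    { uint8_t data[MOCK_BLOCKLEN * MOCK_BLOCKCNT]; };
struct mock_desc    { struct mock_bdev *bdev; bool write; };
struct mock_channel { struct mock_bdev *bdev; uint32_t io_outstanding; };

struct mock_bdev_io {
    struct mock_channel *ch;
    void       *buf;
    uint64_t    offset;
    uint64_t    nbytes;
    mock_io_cb  cb;
    void       *cb_arg;
};

/* Step 5: the module reports completion; the framework side updates the
 * channel's accounting and then runs the caller's callback. */
void mock_io_complete(struct mock_bdev_io *io, bool success)
{
    io->ch->io_outstanding--;
    io->cb(io, success, io->cb_arg);
}

/* Step 4: the module's submit_request. For a malloc-style bdev a read
 * is a memcpy out of the backing buffer. */
void mock_submit_request(struct mock_channel *ch, struct mock_bdev_io *io)
{
    memcpy(io->buf, ch->bdev->data + io->offset, io->nbytes);
    mock_io_complete(io, true);
}

/* Step 1: open gives back a desc that refers to the bdev. */
void mock_open(struct mock_bdev *bdev, bool write, struct mock_desc *desc)
{
    desc->bdev = bdev;
    desc->write = write;
}

/* Step 2: a channel is the calling thread's private state for that desc. */
void mock_get_io_channel(struct mock_desc *desc, struct mock_channel *ch)
{
    ch->bdev = desc->bdev;
    ch->io_outstanding = 0;
}

/* Step 3: build the request, account for it, hand it to the module. */
int mock_read(struct mock_desc *desc, struct mock_channel *ch, void *buf,
              uint64_t offset, uint64_t nbytes, mock_io_cb cb, void *cb_arg)
{
    const uint64_t size = sizeof(ch->bdev->data);
    struct mock_bdev_io io = {
        .ch = ch, .buf = buf, .offset = offset, .nbytes = nbytes,
        .cb = cb, .cb_arg = cb_arg,
    };

    if (desc->bdev != ch->bdev || offset > size || nbytes > size - offset) {
        return -1;
    }
    ch->io_outstanding++;
    mock_submit_request(ch, &io);   /* completes synchronously in this mock */
    return 0;
}

/* Example completion callback: counts successful completions. */
void mock_count_done(struct mock_bdev_io *io, bool success, void *arg)
{
    (void)io;
    if (success) {
        (*(int *)arg)++;
    }
}
```

Note that every function touches only the desc, the channel, or the request it was given; there is no shared table to look up and therefore nothing to lock.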
The capability system: what a bdev can do
Not every bdev supports every I/O type. malloc supports read, write, flush, unmap, write-zeroes, zcopy, abort, copy, reset. The split module (which exposes a subset of an underlying bdev) supports whatever its base supports. The null bdev supports nothing. The framework needs to know, ahead of time, what a given bdev can do — and it needs to know it in a way that doesn't require poking the module on the hot path.
That's why spdk_bdev has a bitmap field,
io_type_supported. The module sets it once at
register time:
The framework will OR all the "true" types into the
io_type_supported bitmap. Public callers can
query it without going through the module:
Some capabilities aren't static. The passthru module delegates to its base bdev:
The passthru bdev's io_type_supported bitmap is
always set to "everything" at register time, but the
function pointer defers the actual decision. This is the
way to expose dynamic capabilities.
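As a rough model of the two mechanisms, the sketch below builds a bitmap at register time and layers an optional runtime hook on top for passthru-style delegation. All names are hypothetical and the four-entry enum is invented; it mirrors the description above rather than the SPDK sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of I/O types; values double as bit positions. */
enum mock_io_type { MOCK_IO_READ, MOCK_IO_WRITE, MOCK_IO_UNMAP, MOCK_IO_FLUSH, MOCK_IO_NUM_TYPES };

struct mock_bdev;
typedef bool (*mock_type_supported_fn)(const struct mock_bdev *bdev, enum mock_io_type t);

struct mock_bdev {
    uint64_t                io_type_supported; /* bit i set: type i supported */
    mock_type_supported_fn  dynamic;           /* optional runtime check */
    const struct mock_bdev *base;              /* underlying bdev, if any */
};

/* Register-time step: ask once per type and OR every "true" into the bitmap. */
void mock_register(struct mock_bdev *bdev, mock_type_supported_fn ask)
{
    bdev->io_type_supported = 0;
    for (int t = 0; t < MOCK_IO_NUM_TYPES; t++) {
        if (ask(bdev, (enum mock_io_type)t)) {
            bdev->io_type_supported |= UINT64_C(1) << t;
        }
    }
}

/* Query: the bitmap answers first; a dynamic hook, if present, decides last. */
bool mock_io_type_supported(const struct mock_bdev *bdev, enum mock_io_type t)
{
    if ((bdev->io_type_supported & (UINT64_C(1) << t)) == 0) {
        return false;
    }
    return bdev->dynamic == NULL || bdev->dynamic(bdev, t);
}

/* A made-up device that supports read, write and flush, but not unmap. */
bool mock_basic_supports(const struct mock_bdev *bdev, enum mock_io_type t)
{
    (void)bdev;
    return t == MOCK_IO_READ || t == MOCK_IO_WRITE || t == MOCK_IO_FLUSH;
}

/* Passthru-style registration: claim everything up front. */
bool mock_everything(const struct mock_bdev *bdev, enum mock_io_type t)
{
    (void)bdev;
    (void)t;
    return true;
}

/* Passthru-style dynamic hook: defer the real decision to the base bdev. */
bool mock_passthru_supports(const struct mock_bdev *bdev, enum mock_io_type t)
{
    return bdev->base != NULL && mock_io_type_supported(bdev->base, t);
}
```

The static path costs one bit test; the dynamic path costs one indirect call, and only for bdevs that opted into it.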
Why the split exists: lock-free I/O on the hot path
It's worth dwelling on this. The four-way split is not accidental. Each piece exists because the alternative would either need a global lock or would require the framework to pre-allocate things it can't pre-allocate.
Compare to the obvious single-struct design: one
spdk_io_request with a bdev pointer, a thread
pointer, and the I/O data. That would require the framework
to look up a thread-specific state on every I/O — and
"look up" means a hash table or a global lock or both.
Either kills performance.
Instead:
- The bdev is immutable after register. It lives in a global list. The pointer to it is your starting point; it never changes.
- The desc is the only thing that crosses threads. One desc per consumer, one set of "this consumer's callbacks." The event callback is invoked on the desc's thread, which is the only place any cross-thread notification is needed.
- The channel is per-thread. After spdk_bdev_get_io_channel(), every subsequent I/O on that (desc, thread) pair touches only thread-local data. No locks.
- The bdev_io is per-I/O. Allocated from a thread-cached mempool (spdk_mempool with a per-thread STAILQ cache). Allocation and free are lock-free on the hot path. See 4.3 for the full lifecycle.
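The per-thread cache in the last bullet is the least obvious of the four, so here is a simplified model of the idea. It is a sketch with hypothetical `mock_` types: a shared pool that would need synchronisation in a real system, and a per-thread STAILQ in front of it that does not. It is not the `spdk_mempool` implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical request object; only the list linkage matters here. */
struct mock_io {
    STAILQ_ENTRY(mock_io) link;
};
STAILQ_HEAD(mock_io_list, mock_io);

/* Shared pool: touching it is the slow path. */
struct mock_pool {
    struct mock_io_list free_list;
    unsigned            slow_path_hits;
};

/* Per-thread cache: only its owning thread ever touches it. */
struct mock_cache {
    struct mock_io_list free_list;
    struct mock_pool   *pool;
};

void mock_pool_init(struct mock_pool *p, struct mock_io *ios, size_t n)
{
    STAILQ_INIT(&p->free_list);
    p->slow_path_hits = 0;
    for (size_t i = 0; i < n; i++) {
        STAILQ_INSERT_TAIL(&p->free_list, &ios[i], link);
    }
}

void mock_cache_init(struct mock_cache *c, struct mock_pool *p)
{
    STAILQ_INIT(&c->free_list);
    c->pool = p;
}

/* Get: try the thread-local list first, fall back to the shared pool. */
struct mock_io *mock_io_get(struct mock_cache *c)
{
    struct mock_io *io = STAILQ_FIRST(&c->free_list);

    if (io != NULL) {
        STAILQ_REMOVE_HEAD(&c->free_list, link);   /* fast path: no sharing */
        return io;
    }
    io = STAILQ_FIRST(&c->pool->free_list);
    if (io != NULL) {
        STAILQ_REMOVE_HEAD(&c->pool->free_list, link);
        c->pool->slow_path_hits++;
    }
    return io;
}

/* Put: always return to the thread-local list. */
void mock_io_put(struct mock_cache *c, struct mock_io *io)
{
    STAILQ_INSERT_HEAD(&c->free_list, io, link);
}
```

In steady state a thread keeps recycling the same few objects through its own list, so the shared pool is touched only while the cache warms up.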
The performance difference between this design and a "reasonable" single-struct design is roughly an order of magnitude at high queue depth. SPDK has reported more than 10 million IOPS from a single CPU core; the design that makes that possible is the one described above.
Edge cases & what trips people up
The data structure story looks clean from 10,000 feet. At ground level there are several places where a normal-looking line of code can leak a desc, lose an I/O, or trigger a double-free. This section collects the ones we've actually hit.
Closing a desc while I/O is in flight
You submit a write on thread A, then you call
spdk_bdev_close(desc) from thread B. The
framework asserts; the process aborts. The close must
happen on the same thread that opened the desc. But what
about I/O submitted but not yet completed?
The framework handles it: spdk_bdev_close
waits for all in-flight I/O on the desc to complete before
freeing the desc. It does this by iterating the
io_submitted list of every channel that was
opened through this desc. The wait is done with
spdk_thread_send_msg and a counter; you don't
write any of this yourself. Just don't close from a
different thread.
Destroying a bdev while a channel is open
The reverse direction is harder. A bdev can be hot-removed
while consumers still have open descs and channels. The
framework will not force-close the desc; instead, it
fires the desc's event_cb with
SPDK_BDEV_EVENT_REMOVE on the desc's thread.
The consumer is responsible for:
- Stopping new I/O submissions.
- Waiting for in-flight I/O to complete.
- Calling spdk_bdev_close(desc).
Only after the last desc on the bdev is closed does the
unregister actually take effect. This is why
spdk_bdev_unregister's callback fires
asynchronously — the framework returns
immediately and the callback runs when every consumer has
released their desc. In the meantime, the bdev is in
SPDK_BDEV_STATUS_REMOVING state and rejects
new spdk_bdev_open_ext calls.
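A consumer's side of this handshake is a small state machine. The sketch below is a hypothetical mock: `mock_consumer`, its flags, and the `-ENODEV` return value are invented, and setting `closed` stands in for the point where real code would call `spdk_bdev_close(desc)`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event types. */
enum mock_event { MOCK_EVENT_REMOVE, MOCK_EVENT_RESIZE };

/* Hypothetical consumer state, owned by the desc's thread. */
struct mock_consumer {
    bool     removing;      /* REMOVE has been seen */
    bool     closed;        /* the desc has been closed */
    uint32_t io_in_flight;
};

/* Close only once removal was requested and nothing is in flight. */
static void mock_try_close(struct mock_consumer *c)
{
    if (c->removing && !c->closed && c->io_in_flight == 0) {
        c->closed = true;   /* stand-in for spdk_bdev_close(desc) */
    }
}

/* Event callback: runs on the desc's thread. */
void mock_event_cb(enum mock_event e, void *ctx)
{
    struct mock_consumer *c = ctx;

    if (e == MOCK_EVENT_REMOVE) {
        c->removing = true;   /* step 1: stop submitting */
        mock_try_close(c);    /* step 3 happens now if already idle */
    }
}

/* Submission path: refuse new work once removal has started. */
int mock_submit(struct mock_consumer *c)
{
    if (c->removing || c->closed) {
        return -ENODEV;
    }
    c->io_in_flight++;
    return 0;
}

/* Completion path: step 2, the last completion triggers the close. */
void mock_io_done(struct mock_consumer *c)
{
    assert(c->io_in_flight > 0);
    c->io_in_flight--;
    mock_try_close(c);
}
```

The close can be triggered from two places, the event callback or the last completion, which is why the check is factored into one helper.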
Concurrent spdk_for_each_bdev calls
The bdev module list is protected by a spinlock; the
spdk_for_each_bdev walker takes it briefly to
find the bdev, then drops it before invoking the user
callback. The user callback is therefore called without
holding the bdev-manager spinlock — so it is safe to do
almost anything, including calling spdk_bdev_open_ext
from inside the callback. (If you do call
spdk_bdev_unregister from inside, you may
cause the iteration to skip a bdev or visit one that is
mid-removal; the framework documents this and asks you
not to do it.)
Channel vs desc: which one owns what?
A common mistake is to think the channel "owns" the desc. It doesn't. A desc can have many channels (one per thread). A channel is a per-(desc, thread) object; closing the desc destroys all its channels. The cleanest mental model: desc is for ownership, channel is for I/O.
required_alignment and double-buffering
If the caller's buffer is misaligned, the framework
automatically routes the I/O through a bounce buffer
in the accel framework. The user-visible API is unchanged
— you call spdk_bdev_read(desc, ch, buf, ...)
with buf misaligned, the framework detects
it, copies the data through DMA-aligned memory, and
reports success. The cost is one extra memory copy per
misaligned I/O. If that cost matters, allocate aligned buffers.
See 4.3 for the
bounce-buffer path.
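What "route through a bounce buffer" means can be shown with plain libc. This is a hypothetical mock of the idea only: the device is a static array, the aligned allocation uses C11 `aligned_alloc` rather than DMA-capable memory, and none of the `mock_` names exist in SPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_ALIGN_SHIFT 12u   /* required_alignment = 12, i.e. 4096 bytes */

static uint8_t mock_device[8192];
unsigned mock_bounce_count;    /* how many reads needed a bounce buffer */

/* Fill the fake device with one byte value so reads are checkable. */
void mock_device_fill(uint8_t value)
{
    memset(mock_device, value, sizeof(mock_device));
}

/* True when buf is aligned to 2^shift bytes. */
bool mock_is_aligned(const void *buf, unsigned shift)
{
    return ((uintptr_t)buf & (((uintptr_t)1 << shift) - 1)) == 0;
}

/* The "device" read: insists on an aligned destination. */
static void mock_device_read(void *aligned_buf, size_t offset, size_t nbytes)
{
    assert(mock_is_aligned(aligned_buf, MOCK_ALIGN_SHIFT));
    memcpy(aligned_buf, mock_device + offset, nbytes);
}

/* The caller-facing read: same API either way, a bounce copy when needed. */
int mock_read(void *buf, size_t offset, size_t nbytes)
{
    const size_t align = (size_t)1 << MOCK_ALIGN_SHIFT;
    size_t padded;
    void *bounce;

    if (offset > sizeof(mock_device) || nbytes > sizeof(mock_device) - offset) {
        return -1;
    }
    if (mock_is_aligned(buf, MOCK_ALIGN_SHIFT)) {
        mock_device_read(buf, offset, nbytes);      /* direct path */
        return 0;
    }
    /* aligned_alloc wants a size that is a non-zero multiple of align. */
    padded = (nbytes + align - 1) & ~(align - 1);
    if (padded == 0) {
        padded = align;
    }
    bounce = aligned_alloc(align, padded);
    if (bounce == NULL) {
        return -1;
    }
    mock_device_read(bounce, offset, nbytes);
    memcpy(buf, bounce, nbytes);                    /* the one extra copy */
    free(bounce);
    mock_bounce_count++;
    return 0;
}
```

Both calls look identical to the caller; only the counter reveals that the second path paid for an allocation and a copy.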
Why you can't pass a desc or channel across thread boundaries
A spdk_bdev * pointer is fine to pass around
— it's effectively immutable. A spdk_bdev_desc *
is not: it's bound to a thread. A
spdk_io_channel * is bound even more tightly:
it can only be used from the thread that created it.
Sharing either of the latter two across threads without
spdk_thread_send_msg marshalling is a bug
waiting to assert.
What to take away
The bdev framework is built around four named data types.
spdk_bdev is the device. spdk_bdev_desc
is your open handle on it. spdk_bdev_channel
is your per-thread context for the desc. spdk_bdev_io
is one request in flight. Each exists to do one thing well,
and the design is what makes SPDK's lock-free hot path
possible.
With these four types in your head, the next page — the bdev module interface — is just "the vtable that defines a bdev type." The page after that — the bdev_io lifecycle — is just "what happens to one of those bdev_ios between submit and complete." And the hierarchy page is just "what happens when a bdev sits on top of another bdev."