What makes a bdev a bdev.
A "bdev" is whatever the framework says it is. A bdev module
is a struct of function pointers plus some lifecycle
metadata. Once the framework has a struct that matches the
shape it expects, the module is a first-class citizen: it
gets I/O submitted to it, it gets hot-remove events, it
shows up in bdev_get_bdevs, and it can be a base
for a virtual bdev on top. This page is the contract.
- Two structs: spdk_bdev_module and spdk_bdev_fn_table
- The 12 (give or take) things every module must implement
- Walkthrough: bdev_malloc.c (the simplest module)
- Walkthrough: vbdev_passthru.c (a more realistic module)
- Registration: how a module declares itself
- The instance vs the type: spdk_bdev vs spdk_bdev_module
- Edge cases: missing implementations, shared state, expected behavior
Two structs: module and fn_table
A bdev type is defined by two structs, not one. They capture the two levels of what a module is:
| Struct | Represents | How many per process |
|---|---|---|
| spdk_bdev_module | The type ("malloc", "nvme", "lvol", "passthru", …) | One per loaded module |
| spdk_bdev_fn_table | The function pointers every bdev of this type responds to | One per bdev module |
The split is deliberate. The spdk_bdev_module
carries lifecycle (init, fini, examine, config serialization)
— things that happen once per module per process.
The spdk_bdev_fn_table carries the
per-instance hot-path operations (submit_request, destruct,
get_io_channel) — things that happen per-bdev and per-I/O.
An spdk_bdev points at both: module
for lifecycle, fn_table for I/O.
- module_init: called by the framework during spdk_bdev_initialize. Do per-module setup here. Return 0 on success, <0 on error. If async_init is true, return 0 and call spdk_bdev_module_init_done() later.
- init_complete: optional. Called when the bdev subsystem has finished initialization. By this point all other modules have been initialized and all configuration has been applied. Good place to auto-create virtual bdevs.
- fini_start: optional. Called before the framework starts unregistering bdevs. The module should release any claims on bdevs it doesn't have a virtual bdev on top of.
- module_fini: called after all bdevs of all modules are unregistered. Final cleanup.
- config_json: optional. Write module-level config (or per-bdev creation RPCs) to the JSON stream. Either this or fn_table->write_config_json can be used for per-bdev config, not both.
- name: the module's name. Used in JSON-RPC (bdev_get_bdevs -p malloc), in the bdev list, in error messages. Must be unique within the process.
- get_ctx_size: return the number of bytes the framework should reserve for the module's per-IO scratch space. Malloc returns sizeof(struct malloc_task). Passthru returns sizeof(struct passthru_bdev_io). NVMe is larger because it carries queue ids, etc.
- examine_config: virtual bdevs only. Called for every bdev that gets registered. The module decides synchronously whether to claim the bdev and create a virtual bdev on top.
- examine_disk: virtual bdevs only. Called after examine_config for bdevs that were claimed. May do I/O and finish asynchronously. Call spdk_bdev_module_examine_done() when done.
- async_init / async_fini / async_fini_start: flags. If true, the module's init/fini functions can return 0 and complete asynchronously via the corresponding _done() API.
Now the function table. This is the per-bdev hot path.
- destruct: required. Called when the bdev is being unregistered. Free the module's per-bdev context (ctx). Return 0 on synchronous completion, or 1 and call spdk_bdev_destruct_done() later for async.
- submit_request: required. The hot path. Module dispatches the bdev_io to its backing device (NVMe queue, AIO, malloc memcpy, …). When done, call spdk_bdev_io_complete(). See 4.3 for the full lifecycle.
- io_type_supported: required. Capability check. Returns true if this bdev supports a given I/O type. The framework calls this both at submit time and when populating the io_type_supported bitmap at register time.
- get_io_channel: required. Return a per-thread spdk_io_channel for this bdev. Almost always implemented as spdk_get_io_channel(ctx) for an io_device registered in module_init.
- dump_info_json: optional. Add module-specific fields to the bdev_get_bdevs output.
- write_config_json: optional. Output the RPC to recreate this specific bdev. Mutually exclusive with the module-level config_json for per-bdev output.
- get_spin_time: optional. Microseconds the thread spent spinning waiting for I/O last sample. Used by vtune integration and disk utilization.
- get_module_ctx: optional. Return a module-specific context pointer for this bdev. Set at register time; rarely changed.
- get_memory_domains: optional. Report which memory domains (DMA-capable memory regions) this bdev can work with. Used for zero-copy I/O across memory domain boundaries.
- reset_device_stat: optional. Reset the module's per-bdev I/O statistics.
- dump_device_stat_json: optional. Output per-bdev I/O statistics to a JSON stream.
- accel_sequence_supported: optional. Return true if the bdev can handle an spdk_accel_sequence in front of (or after) a given I/O type. Used for compressed reads, encryption, etc.
The (roughly) twelve things every module must implement
The style guide for this curriculum (and the actual SPDK docs) talks about "the 7 things every module must implement" — that count is the count of required function pointers. The full picture has more, but the must-implement set is:
| # | Function pointer | Where | Required? |
|---|---|---|---|
| 1 | submit_request | fn_table | Yes — the hot path |
| 2 | destruct | fn_table | Yes — cleanup |
| 3 | io_type_supported | fn_table | Yes — capability |
| 4 | get_io_channel | fn_table | Yes — per-thread state |
| 5 | module_init | spdk_bdev_module | Yes — startup |
| 6 | get_ctx_size | spdk_bdev_module | Yes — driver context size |
| 7 | write_config_json or config_json | fn_table or module | Strongly recommended (config save/load) |
Plus the optional ones (dump_info_json,
get_spin_time, get_memory_domains,
accel_sequence_supported, etc.) and the
virtual-bdev-specific ones (examine_config,
examine_disk, fini_start). The
"12" number comes from adding in init_complete,
module_fini, get_module_ctx,
reset_device_stat, and dump_device_stat_json.
Walkthrough: bdev_malloc.c — the simplest module
The malloc module is what every SPDK tutorial starts from, because it's the only module where the backing storage is ordinary memory. There's no device, no DMA, no interrupt. You submit, the framework copies bytes. Done.
Here's the entire spdk_bdev_module definition:
Four fields. That's it. No async_init, no examine_config,
no config_json. The malloc module doesn't auto-create bdevs
(you call bdev_malloc_create over RPC), it
doesn't claim any base bdevs, and it doesn't need async
init. The two struct fields you might miss: init_complete
and fini_start are NULL, and that's allowed —
they're optional.
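To make the "four fields, the rest NULL" shape concrete, here is a minimal, self-contained sketch. It is not the SPDK source: every `sketch_` type and function is a stand-in invented for this example, and only the field names (name, module_init, module_fini, get_ctx_size, init_complete) come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for spdk_bdev_module, trimmed to the fields discussed here. */
struct sketch_bdev_module {
	const char *name;
	int (*module_init)(void);
	void (*init_complete)(void);	/* optional, may be NULL */
	void (*module_fini)(void);
	int (*get_ctx_size)(void);
};

/* Hypothetical per-I/O scratch struct, playing the role of malloc_task. */
struct sketch_task {
	int status;
	int num_outstanding;
};

static int g_init_calls;

static int sketch_malloc_init(void)
{
	g_init_calls++;
	return 0;
}

static void sketch_malloc_fini(void)
{
}

static int sketch_malloc_get_ctx_size(void)
{
	return (int)sizeof(struct sketch_task);
}

/* Four fields set; every other pointer is implicitly NULL. */
static struct sketch_bdev_module sketch_malloc_if = {
	.name = "malloc",
	.module_init = sketch_malloc_init,
	.module_fini = sketch_malloc_fini,
	.get_ctx_size = sketch_malloc_get_ctx_size,
};

/* Framework-side startup in this sketch: optional hooks are skipped when NULL. */
int sketch_start_module(struct sketch_bdev_module *m)
{
	int rc = m->module_init();

	if (rc == 0 && m->init_complete != NULL) {
		m->init_complete();
	}
	return rc;
}
```

The designated initializer is what makes "optional" cheap: any field you leave out is zero-initialized, so the caller only needs a NULL check before invoking it.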
Now the function table:
Seven of the twelve possible function pointers filled in.
The missing five (dump_info_json,
get_spin_time, get_module_ctx,
reset_device_stat, dump_device_stat_json)
are all optional. None of them would do anything useful for
malloc anyway — there's no "spin time" for a memcpy, no
"reset stats" beyond the framework's built-in.
The module init: register an io_device
The interesting thing in module_init is the
spdk_io_device_register call. This is what
makes get_io_channel work:
The io_device is the tailq g_malloc_disks (see
line 253: static TAILQ_HEAD(, malloc_disk) g_malloc_disks).
The framework creates one channel per thread, on that
thread's first call to spdk_get_io_channel(&g_malloc_disks);
later calls from the same thread return the same channel
with its reference count incremented. Each channel carries
sizeof(struct malloc_channel) bytes of module context (set
in the fourth argument), and the framework invokes
malloc_create_channel_cb on channel create and
malloc_destroy_channel_cb on channel destroy.
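The register-then-get mechanics can be sketched in plain C. This is a toy model under stated assumptions, not SPDK's thread library: all `sketch_` names are invented, there is only one io_device, and the calling thread is passed as an explicit thread_id instead of being implicit. It keeps two things from the text: ctx_size is the fourth argument, and the module's bytes sit directly behind the channel header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_THREADS 4

typedef int (*sketch_create_cb)(void *io_device, void *ctx_buf);
typedef void (*sketch_destroy_cb)(void *io_device, void *ctx_buf);

/* Channel header; ctx_size bytes of module context follow it in memory. */
struct sketch_io_channel {
	uint64_t ref;
};

/* A single registered io_device; a real registry would hold many. */
static struct {
	void *io_device;	/* any unique address, e.g. the module's disk list */
	sketch_create_cb create_cb;
	sketch_destroy_cb destroy_cb;
	uint32_t ctx_size;
	struct sketch_io_channel *per_thread[SKETCH_MAX_THREADS];
} g_dev;

void sketch_io_device_register(void *io_device, sketch_create_cb create_cb,
			       sketch_destroy_cb destroy_cb, uint32_t ctx_size)
{
	g_dev.io_device = io_device;
	g_dev.create_cb = create_cb;
	g_dev.destroy_cb = destroy_cb;
	g_dev.ctx_size = ctx_size;
}

/* The module's bytes start right after the channel header. */
void *sketch_io_channel_get_ctx(struct sketch_io_channel *ch)
{
	return (uint8_t *)ch + sizeof(*ch);
}

/* Return the given thread's channel, creating it on first use. */
struct sketch_io_channel *sketch_get_io_channel(void *io_device, int thread_id)
{
	struct sketch_io_channel *ch;

	if (io_device != g_dev.io_device || thread_id < 0 ||
	    thread_id >= SKETCH_MAX_THREADS) {
		return NULL;
	}
	ch = g_dev.per_thread[thread_id];
	if (ch != NULL) {
		ch->ref++;	/* same thread asks again: same channel */
		return ch;
	}
	ch = calloc(1, sizeof(*ch) + g_dev.ctx_size);
	if (ch == NULL) {
		return NULL;
	}
	if (g_dev.create_cb(io_device, sketch_io_channel_get_ctx(ch)) != 0) {
		free(ch);
		return NULL;
	}
	ch->ref = 1;
	g_dev.per_thread[thread_id] = ch;
	return ch;
}

/* Module side (hypothetical names). */
struct sketch_channel_ctx {
	int completions;
};

static int g_disks;		/* stands in for the global disk list */
static int g_create_calls;

static int sketch_create_channel_cb(void *io_device, void *ctx_buf)
{
	struct sketch_channel_ctx *ctx = ctx_buf;

	(void)io_device;
	ctx->completions = 0;
	g_create_calls++;
	return 0;
}

static void sketch_destroy_channel_cb(void *io_device, void *ctx_buf)
{
	(void)io_device;
	(void)ctx_buf;
}

void sketch_module_init(void)
{
	sketch_io_device_register(&g_disks, sketch_create_channel_cb,
				  sketch_destroy_channel_cb,
				  sizeof(struct sketch_channel_ctx));
}
```

Note that the io_device key is only ever compared by address. That is why a module can use whatever global it already has (here a dummy int, in malloc the disk list) as the key.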
get_io_channel: trivial
And the channel getter, for comparison:
Three lines, including the signature. The
spdk_get_io_channel() call looks up (or
creates) a per-thread channel for the io_device and
returns it. The framework keeps it inside its own
spdk_bdev_channel and passes the module's
spdk_io_channel to your submit_request as its first
argument.
submit_request: the dispatch table
The submit path is a switch on bdev_io type. For each type, the module either completes immediately (reset, flush, abort) or kicks off a real operation (read, write, unmap, write-zeros, copy, zcopy).
Note the call to spdk_io_channel_get_ctx(ch).
This is how you go from the framework's wrapped channel
(spdk_io_channel *) to your module's
per-thread state (struct malloc_channel *).
The framework gives you a generic channel pointer; the
io_device machinery gives you the bytes you asked for at
register time.
And the dispatch:
Every case ends by either calling
spdk_bdev_io_complete() (or a helper that
does so) or scheduling async work that will eventually
call it. The default case returns -1 from the module's
internal dispatch helper, and the fn_table entry point
that wraps it completes the I/O with FAILED. That matters
because the fn_table's submit_request returns void: the
framework has no way to notice an I/O that was neither
completed nor queued, so the module must complete every
I/O exactly once or it will hang.
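The shape of that dispatch can be sketched as follows. This is an illustration, not the code from bdev_malloc.c: the `sketch_` types are invented, a bdev_io is reduced to one flat buffer instead of iovecs, and the entry point takes the disk directly where the real callback takes a channel and a bdev_io.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sketch_io_type { SKETCH_IO_READ, SKETCH_IO_WRITE, SKETCH_IO_FLUSH, SKETCH_IO_OTHER };
enum sketch_io_status { SKETCH_IO_PENDING, SKETCH_IO_SUCCESS, SKETCH_IO_FAILED };

/* Stand-in for spdk_bdev_io. */
struct sketch_bdev_io {
	enum sketch_io_type type;
	uint64_t offset;
	uint64_t len;
	void *buf;
	enum sketch_io_status status;
};

/* Stand-in for a malloc disk: the backing store is plain memory. */
struct sketch_disk {
	uint8_t data[64];
};

static void sketch_io_complete(struct sketch_bdev_io *io, enum sketch_io_status st)
{
	io->status = st;
}

/* Internal dispatch: returns 0 if the I/O was completed, -1 if it was rejected. */
static int sketch_dispatch(struct sketch_disk *disk, struct sketch_bdev_io *io)
{
	switch (io->type) {
	case SKETCH_IO_READ:
	case SKETCH_IO_WRITE:
		if (io->offset > sizeof(disk->data) ||
		    io->len > sizeof(disk->data) - io->offset) {
			return -1;
		}
		if (io->type == SKETCH_IO_READ) {
			memcpy(io->buf, disk->data + io->offset, io->len);
		} else {
			memcpy(disk->data + io->offset, io->buf, io->len);
		}
		sketch_io_complete(io, SKETCH_IO_SUCCESS);
		return 0;
	case SKETCH_IO_FLUSH:
		/* Nothing to flush for memory: complete immediately. */
		sketch_io_complete(io, SKETCH_IO_SUCCESS);
		return 0;
	default:
		return -1;
	}
}

/* Entry point with a void return: it has to complete every I/O itself. */
void sketch_submit_request(struct sketch_disk *disk, struct sketch_bdev_io *io)
{
	if (sketch_dispatch(disk, io) != 0) {
		sketch_io_complete(io, SKETCH_IO_FAILED);
	}
}
```

Splitting the work into an int-returning helper and a void wrapper keeps the "fail anything I didn't handle" rule in exactly one place.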
Walkthrough: vbdev_passthru.c — a more realistic module
Malloc is a leaf module — it owns a chunk of memory and that memory is the bdev. Passthru is a virtual module — it sits on top of another bdev and forwards every operation to it. This is the "vbdev" pattern and it underlies lvol, raid, split, gpt, and most of the composition you'll do.
The struct itself: smaller
The passthru module struct is similar but not identical:
Three differences from malloc:
- examine_config is set. This is what makes passthru a "virtual" module. The framework calls it for every bdev that gets registered; passthru decides whether to attach itself on top of that bdev.
- config_json is set, not write_config_json in the fn_table. The passthru module's bdevs are created by the RPC handler, and the module-level config_json walks the global list of vbdev_passthru nodes to emit one bdev_passthru_create RPC per bdev.
- No init_complete. Passthru can complete its setup during the examine path, so it doesn't need post-init notification.
submit_request: forward to the base
The passthru submit path is a 1:1 forwarding of every I/O type. It uses the public bdev API to submit a new bdev_io on the base bdev, with a private completion callback that translates the result back to the original I/O:
Three things to notice:
- The base bdev is reached via pt_node->base_desc and pt_ch->base_ch. The passthru opened the base in vbdev_passthru_register() and saved the desc; the per-thread base channel was grabbed in pt_bdev_ch_create_cb() at channel create time.
- The completion callback is _pt_complete_io. It does the inverse translation: completes the original bdev_io with the status from the new one, then frees the new bdev_io with spdk_bdev_free_io. This is the "submit a child bdev_io and bridge it to the parent" pattern: the bdev_io you submit and the bdev_io you receive are different objects, even if their data represents the same operation.
- The -ENOMEM case: if the framework can't allocate a child bdev_io for the base, passthru queues the parent's bdev_io for later retry. This is vbdev_passthru_queue_io, which uses spdk_bdev_queue_io_wait to wake the I/O when an io frees up. Modules that allocate from the bdev_io mempool need to handle this.
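Those three points (forward, bridge the completion, park on -ENOMEM) fit in a small toy model. Everything here is a stand-in: the `sketch_` names are invented, the "base bdev" is a one-element pool that runs dry after a single submit, and the wake-up that SPDK's framework performs is folded into sketch_free_io.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_status { SKETCH_PENDING, SKETCH_SUCCESS, SKETCH_FAILED };

struct sketch_io;
typedef void (*sketch_io_cb)(struct sketch_io *child, enum sketch_status st, void *cb_arg);

/* Stand-in bdev_io; a child carries the callback that bridges to its parent. */
struct sketch_io {
	enum sketch_status status;
	sketch_io_cb cb;
	void *cb_arg;
	struct sketch_io *next_waiter;
};

static struct sketch_io g_pool[1];	/* child I/O pool of size one */
static int g_pool_free = 1;
static struct sketch_io *g_inflight;	/* child the base is working on */
static struct sketch_io *g_waiters;	/* parents parked after -ENOMEM */

void sketch_pt_submit(struct sketch_io *parent);

/* Plays the role of a submit call on the base bdev. */
static int sketch_base_submit(sketch_io_cb cb, void *cb_arg)
{
	if (g_pool_free == 0) {
		return -ENOMEM;
	}
	g_pool_free--;
	g_pool[0] = (struct sketch_io){ .status = SKETCH_PENDING, .cb = cb, .cb_arg = cb_arg };
	g_inflight = &g_pool[0];
	return 0;
}

/* Framework role: return the child to the pool, then retry one parked parent. */
static void sketch_free_io(struct sketch_io *child)
{
	struct sketch_io *waiter = g_waiters;

	(void)child;
	g_pool_free++;
	if (waiter != NULL) {
		g_waiters = waiter->next_waiter;
		sketch_pt_submit(waiter);
	}
}

/* Completion bridge: copy the child's status to the parent, then free the child. */
static void sketch_pt_complete(struct sketch_io *child, enum sketch_status st, void *cb_arg)
{
	struct sketch_io *parent = cb_arg;

	parent->status = st;
	sketch_free_io(child);
}

/* Passthru submit: forward to the base, park the parent if no child is available. */
void sketch_pt_submit(struct sketch_io *parent)
{
	int rc = sketch_base_submit(sketch_pt_complete, parent);

	if (rc == -ENOMEM) {
		parent->next_waiter = g_waiters;
		g_waiters = parent;
	} else if (rc != 0) {
		parent->status = SKETCH_FAILED;
	}
}

/* Test hook: the base device finishes its in-flight child. */
void sketch_base_finish(enum sketch_status st)
{
	struct sketch_io *child = g_inflight;

	if (child == NULL) {
		return;
	}
	g_inflight = NULL;
	child->cb(child, st, child->cb_arg);
}
```

The parent's status is written before the child is freed. Once the child is back in the pool it can be reused immediately (here by the retried waiter), so nothing may read it afterwards.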
destruct: the opposite of register
The destruct path mirrors the register path. For passthru:
The comment is gold: "It is important to follow this exact sequence of steps for destroying a vbdev…" That's not hyperbole. Get the order wrong and you can double-free, leak a desc, or deadlock on a thread message.
1. Remove from your module's list first. This way no other code path in your module can find this node and try to use it.
2. Release the claim on the base bdev. Otherwise the base bdev can't be unregistered (it thinks you still own it).
3. Close the base desc on its opening thread. spdk_bdev_close is thread-bound; if you're on a different thread, send a message.
4. Unregister the io_device. After this, the framework will drain any channels that are still open and call your destroy callback.
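One way to keep that order honest in your own module is to encode each step's precondition. The sketch below does this with flags and asserts. The `sketch_` names are invented, the asserts are this example's device rather than checks SPDK performs, and the cross-thread close runs inline here where real code would send a thread message.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in vbdev node: flags replace the real list, claim, desc, and io_device. */
struct sketch_vbdev {
	bool on_module_list;
	bool base_claimed;
	bool base_desc_open;
	bool io_device_registered;
	int open_thread;		/* thread that opened the base desc */
};

static int g_messages_sent;

static void sketch_release_claim(struct sketch_vbdev *n)
{
	assert(!n->on_module_list);	/* step 1 must already be done */
	n->base_claimed = false;
}

static void sketch_close_desc(struct sketch_vbdev *n)
{
	assert(!n->base_claimed);	/* step 2 must already be done */
	n->base_desc_open = false;
}

static void sketch_unregister_io_device(struct sketch_vbdev *n)
{
	assert(!n->base_desc_open);	/* step 3 must already be done (inline in this sketch) */
	n->io_device_registered = false;
}

/* Tear the node down in the listed order; returns 0 like a synchronous destruct. */
int sketch_vbdev_destruct(struct sketch_vbdev *n, int current_thread)
{
	n->on_module_list = false;			/* 1. unlink from the module's list */
	sketch_release_claim(n);			/* 2. release the claim on the base */
	if (current_thread != n->open_thread) {
		g_messages_sent++;			/* real code: message the opening thread */
	}
	sketch_close_desc(n);				/* 3. close the base desc */
	sketch_unregister_io_device(n);			/* 4. unregister the io_device */
	return 0;
}
```

If someone later reorders the calls in sketch_vbdev_destruct, a debug build trips an assert immediately instead of producing a rare shutdown bug.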
Registration: how a module declares itself
Once your spdk_bdev_module struct is
defined, you make the framework aware of it with one
line:
This is an __attribute__((constructor))
function. The compiler places a pointer to it in the
module's .init_array section (.ctors on older
toolchains), the runtime startup code runs it before
main(), and it calls
spdk_bdev_module_list_add(module) to add
the module to the framework's global list. By the time
spdk_bdev_initialize is called, after
main() has started, the list is fully populated.
For malloc:
For passthru:
Same macro, different struct. Both modules are now part
of the framework. The framework walks the list during
spdk_bdev_initialize and calls
module_init on each.
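The constructor trick is easy to reproduce outside SPDK. The sketch below assumes GCC or Clang (the constructor attribute is a compiler extension), and the `SKETCH_MODULE_REGISTER` macro and every `sketch_` name are hypothetical stand-ins, not SPDK's macro or structs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_module {
	const char *name;
	int (*module_init)(void);
	struct sketch_module *next;
};

static struct sketch_module *g_modules;	/* global module list */
static int g_inits;

void sketch_module_list_add(struct sketch_module *m)
{
	m->next = g_modules;
	g_modules = m;
}

/* Hypothetical registration macro: emits a constructor that runs before main(). */
#define SKETCH_MODULE_REGISTER(tag, mod)					\
	static void __attribute__((constructor)) sketch_register_##tag(void)	\
	{									\
		sketch_module_list_add(mod);					\
	}

static int sketch_count_init(void)
{
	g_inits++;
	return 0;
}

static struct sketch_module sketch_malloc_if = {
	.name = "malloc",
	.module_init = sketch_count_init,
};

static struct sketch_module sketch_passthru_if = {
	.name = "passthru",
	.module_init = sketch_count_init,
};

SKETCH_MODULE_REGISTER(malloc, &sketch_malloc_if)
SKETCH_MODULE_REGISTER(passthru, &sketch_passthru_if)

/* Look a module up by name on the global list. */
struct sketch_module *sketch_find(const char *name)
{
	for (struct sketch_module *m = g_modules; m != NULL; m = m->next) {
		if (strcmp(m->name, name) == 0) {
			return m;
		}
	}
	return NULL;
}

/* Framework startup: walk the list and initialize every registered module. */
int sketch_initialize(void)
{
	for (struct sketch_module *m = g_modules; m != NULL; m = m->next) {
		int rc = m->module_init();

		if (rc != 0) {
			return rc;
		}
	}
	return 0;
}
```

Nothing in the program ever calls sketch_register_malloc or sketch_register_passthru by name. Both modules are on the list by the time main() starts, which is why adding a module is a link-line change rather than a code change in the framework.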
flowchart LR
  A[Module .c file] -->|constructor| B[spdk_bdev_module_list_add]
  B --> C[Global module list]
  C --> D[spdk_bdev_initialize]
  D --> E[module_init on each]
  E -->|registers io_device| F[spdk_io_device_register]
  F --> G[Module ready to receive I/O]
fig. 1 The module gets added to the global list at
program load, before main() runs (via a constructor).
spdk_bdev_initialize walks the list and calls
module_init on each. After that, the module can
register bdevs, which is the next step.
The instance vs the type: spdk_bdev vs spdk_bdev_module
A common confusion: how is the malloc module different from a single malloc bdev? The answer is the same as "class vs object":
- spdk_bdev_module malloc_if is the class. There is exactly one of it. It has the module_init and module_fini functions. It's about lifecycle.
- spdk_bdev pt_bdev is an object. There can be many of them. Each has its own name, blockcnt, ctxt back-pointer, and fn_table pointer. The fn_table entries (like submit_request) are the same for all bdevs of the same module; what differs is the ctxt they operate on. Callbacks such as destruct and get_io_channel receive ctxt as their first argument, while submit_request reaches it through bdev_io->bdev->ctxt, and that's how the module knows which instance to operate on.
Concretely: passthru defines one spdk_bdev_module
passthru_if and one spdk_bdev_fn_table
vbdev_passthru_fn_table. But every
spdk_bdev pt_bdev has its own
struct vbdev_passthru *pt_node as its
ctxt, and that node is what the
submit_request callback uses to find its
base bdev.
The framework is what connects the two. When a bdev is
registered, the framework records the
spdk_bdev_module *module field. When an I/O
is submitted, the framework reads
spdk_bdev->fn_table->submit_request
and calls it. When an event fires, the framework
consults spdk_bdev->module for any
module-wide behavior (like the examine chain).
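The class-vs-object split looks like this in miniature. The `sketch_` structs are simplified stand-ins (the real ones have many more fields), and the bdev names are made up. Two bdevs share one module and one function table, and the callback finds its instance through the ctxt pointer.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_bdev;

struct sketch_module {
	const char *name;
};

/* Stand-in for a bdev_io: it knows which bdev it targets. */
struct sketch_io {
	struct sketch_bdev *bdev;
	void *handled_by;		/* filled in by submit_request */
};

struct sketch_fn_table {
	void (*submit_request)(struct sketch_io *io);
};

struct sketch_bdev {
	const char *name;
	void *ctxt;				/* per-instance context */
	struct sketch_module *module;		/* shared: the type */
	const struct sketch_fn_table *fn_table;	/* shared: the operations */
};

/* Per-instance context, in the role of struct vbdev_passthru. */
struct sketch_pt_node {
	const char *base_name;
};

/* One function serves every instance; ctxt tells it which one. */
static void sketch_pt_submit(struct sketch_io *io)
{
	struct sketch_pt_node *node = io->bdev->ctxt;

	io->handled_by = node;
}

static struct sketch_module sketch_passthru_if = { .name = "passthru" };
static const struct sketch_fn_table sketch_pt_fn_table = {
	.submit_request = sketch_pt_submit,
};

static struct sketch_pt_node sketch_node_a = { "Base0" };
static struct sketch_pt_node sketch_node_b = { "Base1" };

static struct sketch_bdev sketch_pt_a = {
	"PT0", &sketch_node_a, &sketch_passthru_if, &sketch_pt_fn_table
};
static struct sketch_bdev sketch_pt_b = {
	"PT1", &sketch_node_b, &sketch_passthru_if, &sketch_pt_fn_table
};

/* Framework role: dispatch through the bdev's function table. */
void sketch_bdev_submit(struct sketch_io *io)
{
	io->bdev->fn_table->submit_request(io);
}
```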
Edge cases: missing implementations, shared state, expected behavior
What if you forget io_type_supported?
You can't get away with it: the framework's I/O submit path calls it regardless of whether the bitmap has the type set, and a NULL function pointer will segfault on the first call. The compiler won't catch the omission (a missing field is just NULL), but in practice every in-tree module sets it.
The more interesting failure is the wrong
implementation. The passthru module's approach
(delegate to the base) is correct. A naive implementation
that returns true for everything will
accept I/O types the underlying device can't service, and
the failure mode is a generic SPDK_BDEV_IO_STATUS_FAILED
that doesn't tell you why.
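The difference between delegating and claiming everything is small in code. In this sketch the `sketch_` types are invented and the base device is a fixed capability table; a real vbdev would ask the base bdev through the public bdev API instead.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_io_type { SKETCH_READ, SKETCH_WRITE, SKETCH_UNMAP, SKETCH_NUM_TYPES };

/* Stand-in base device with a fixed capability set. */
struct sketch_base {
	bool supports[SKETCH_NUM_TYPES];
};

/* Stand-in vbdev context: only a pointer to its base. */
struct sketch_pt_node {
	struct sketch_base *base;
};

/* Plays the role of asking the base bdev whether it supports an I/O type. */
static bool sketch_base_io_type_supported(const struct sketch_base *base,
					  enum sketch_io_type t)
{
	return (unsigned)t < (unsigned)SKETCH_NUM_TYPES && base->supports[t];
}

/* Delegating version: report exactly what the base can do. */
bool sketch_pt_io_type_supported(void *ctx, enum sketch_io_type t)
{
	struct sketch_pt_node *node = ctx;

	return sketch_base_io_type_supported(node->base, t);
}

/* Naive version: accepts everything, so unsupported I/O only fails later. */
bool sketch_naive_io_type_supported(void *ctx, enum sketch_io_type t)
{
	(void)ctx;
	(void)t;
	return true;
}
```

With the delegating version, a caller learns up front that UNMAP is unavailable. With the naive one, it finds out from a failed I/O.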
What if your get_ctx_size returns 0?
That's fine. The malloc module returns
sizeof(struct malloc_task), but a module
that has no per-IO state (very simple wrappers, for
example) can return 0. The framework allocates
sizeof(spdk_bdev_io) + max_ctx_size as the
mempool element size; if your size is 0, you get just
the bdev_io header.
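The sizing rule is simple arithmetic: take the largest get_ctx_size over all modules and add it to the header size. The sketch below uses invented `sketch_` types and a deliberately tiny bdev_io header. The real struct is much larger, but the calculation has the same shape.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny stand-in for spdk_bdev_io; module scratch space follows the header. */
struct sketch_bdev_io {
	int type;
	int status;
	unsigned char driver_ctx[];
};

struct sketch_module {
	const char *name;
	int (*get_ctx_size)(void);
};

/* Hypothetical per-I/O scratch struct for a malloc-like module. */
struct sketch_task {
	long long lba;
	int status;
};

static int sketch_zero_ctx_size(void)
{
	return 0;
}

static int sketch_task_ctx_size(void)
{
	return (int)sizeof(struct sketch_task);
}

static struct sketch_module sketch_wrapper_mod = { "wrapper", sketch_zero_ctx_size };
static struct sketch_module sketch_malloc_mod = { "malloc", sketch_task_ctx_size };

/* Largest per-I/O context any registered module asks for. */
size_t sketch_max_ctx_size(struct sketch_module *const *mods, size_t n)
{
	size_t max = 0;

	for (size_t i = 0; i < n; i++) {
		int sz = mods[i]->get_ctx_size();

		if (sz > 0 && (size_t)sz > max) {
			max = (size_t)sz;
		}
	}
	return max;
}

/* One pool serves every module, so each element is sized for the largest request. */
size_t sketch_pool_element_size(struct sketch_module *const *mods, size_t n)
{
	return sizeof(struct sketch_bdev_io) + sketch_max_ctx_size(mods, n);
}
```

A consequence worth knowing: because the pool is shared, one module with a large per-I/O context raises the element size for every bdev_io in the process, including those submitted to modules that asked for 0.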
What if your module_init does long synchronous work?
It'll block the app thread (the one that called
spdk_bdev_initialize). That thread isn't
doing I/O, but it is doing reactor bookkeeping. If
module_init takes 100ms, the whole process
stalls for 100ms. The fix is to set async_init =
true, return 0 immediately, and call
spdk_bdev_module_init_done() when ready.
What if you share state across bdevs incorrectly?
The classic bug: a module registers multiple bdevs and uses a single global list or a single lock for them. That's not wrong per se, but it conflicts with the framework's "one channel per thread per bdev" model. The framework lets each thread submit to a bdev through its own channel without synchronizing with other threads; if your I/O path takes a global lock, you serialize those threads and give up the scalability the channel model was built for.
The correct pattern: per-bdev state in
spdk_bdev.ctxt (or in a struct hung off
it), per-thread state in your channel struct, and no
global locks on the I/O path.
What if your destruct is async and you forget to call spdk_bdev_destruct_done?
The bdev will never be unregistered. The framework's
unregister callback never fires. Your application will
hang at shutdown waiting for it. (This is one of the
very few "silently hangs" failures in SPDK. The fix
is one line: a spdk_bdev_destruct_done(bdev, 0)
call when your async work is done.)
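The return-1-then-call-done handshake can be modeled in a few lines. All `sketch_` names are invented; the framework side is reduced to a flag that flips when the unregister completes, which is enough to show what "never fires" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bdev {
	int (*destruct)(void *ctx);
	void *ctxt;
	bool unregistered;	/* set once the unregister has completed */
	int unregister_rc;
};

/* Framework role: finish the unregister and record the result. */
void sketch_destruct_done(struct sketch_bdev *bdev, int rc)
{
	bdev->unregister_rc = rc;
	bdev->unregistered = true;
}

/* Framework role: a positive return means "async, wait for destruct_done". */
void sketch_bdev_unregister(struct sketch_bdev *bdev)
{
	int rc = bdev->destruct(bdev->ctxt);

	if (rc > 0) {
		return;
	}
	sketch_destruct_done(bdev, rc);
}

/* Module side: per-bdev context with some teardown that finishes later. */
struct sketch_ctx {
	struct sketch_bdev *bdev;
	bool teardown_pending;
};

static int sketch_async_destruct(void *ctx)
{
	struct sketch_ctx *c = ctx;

	c->teardown_pending = true;	/* e.g. waiting on another thread */
	return 1;
}

/* Runs when the module's async teardown completes. */
void sketch_teardown_finished(struct sketch_ctx *c)
{
	c->teardown_pending = false;
	/* Leaving this call out is the bug: the unregister never completes. */
	sketch_destruct_done(c->bdev, 0);
}
```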
What if you claim a bdev you don't have a vbdev for?
spdk_bdev_module_claim_bdev is documented
as for vbdev modules that "create virtual bdevs on top"
of the base. If you claim a bdev and never create the
vbdev (e.g. examine_config returns without doing
anything), the base bdev can never be unregistered by
another module. The fix is to either create the vbdev
or not claim the bdev.
What if you register two bdevs with the same name?
spdk_bdev_register returns -EEXIST. The second bdev
is not registered, and the framework does not clean up
after you: releasing the bdev and its ctxt on the
failure path is the module's job. The check happens in
the framework; the second name is the loser.
What if your channel create callback fails partway?
The framework treats your channel create callback as all-or-nothing: if it returns non-zero, the channel is considered uninitialized and is discarded without your destroy callback being called. If you've allocated state inside the callback, you must free it on the failure path. The malloc module does this carefully: for example, on a failed accel_channel registration it explicitly unregisters the poller.
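A create callback that acquires two things has exactly one tricky path: the second acquisition fails after the first succeeded. The sketch below shows the unwind. The two resources, their order, and all `sketch_` names are hypothetical; g_fail_poller is a test knob, not something a real module would have.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

static int g_accel_refs;	/* outstanding references on resource one */
static int g_pollers;		/* registered pollers (resource two) */
static bool g_fail_poller;	/* test knob: make the second step fail */

struct sketch_channel {
	bool has_accel;
	bool has_poller;
};

static bool sketch_get_accel(void)
{
	g_accel_refs++;
	return true;
}

static void sketch_put_accel(void)
{
	g_accel_refs--;
}

static bool sketch_poller_register(void)
{
	if (g_fail_poller) {
		return false;
	}
	g_pollers++;
	return true;
}

/* Channel create callback: either fully set up, or nothing left behind. */
int sketch_create_channel_cb(void *io_device, void *ctx_buf)
{
	struct sketch_channel *ch = ctx_buf;

	(void)io_device;
	if (!sketch_get_accel()) {
		return -ENOMEM;
	}
	ch->has_accel = true;

	if (!sketch_poller_register()) {
		/* Undo step one here: no destroy callback will do it for us. */
		sketch_put_accel();
		ch->has_accel = false;
		return -ENOMEM;
	}
	ch->has_poller = true;
	return 0;
}
```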
What to take away
A bdev module is two structs: a
spdk_bdev_module for the type, and a
spdk_bdev_fn_table for the per-instance
operations. The module struct is about lifecycle. The
fn_table is about the hot path. You register with a
constructor macro, the framework finds your module at
init time, and you become a first-class bdev type.
The "12 things" (or "7 must-haves") are the function
pointers. The simplest leaf module is
bdev_malloc.c. A more realistic virtual
module is vbdev_passthru.c. The next page,
the bdev_io lifecycle,
picks up from submit_request and walks
what happens between "I just got an I/O" and "I/O is
done."