Layer 4 · The bdev framework

What makes a bdev a bdev.

A "bdev" is whatever the framework says it is. A bdev module is a struct of function pointers plus some lifecycle metadata. Once the framework has a struct that matches the shape it expects, the module is a first-class citizen: it gets I/O submitted to it, it gets hot-remove events, it shows up in bdev_get_bdevs, and it can be a base for a virtual bdev on top. This page is the contract.

~20 min read · 2 diagrams · prerequisites: 4.1
On this page
  1. Two structs: spdk_bdev_module and spdk_bdev_fn_table
  2. The 12 (give or take) things every module must implement
  3. Walkthrough: bdev_malloc.c — the simplest module
  4. Walkthrough: vbdev_passthru.c — a more realistic module
  5. Registration: how a module declares itself
  6. The instance vs the type: spdk_bdev vs spdk_bdev_module
  7. Edge cases: missing implementations, shared state, expected behavior

Two structs: module and fn_table

A bdev type is defined by two structs, not one. They capture the two levels of what a module is:

| Struct | Represents | How many per process |
|---|---|---|
| spdk_bdev_module | The type ("malloc", "nvme", "lvol", "passthru", …) | One per loaded module |
| spdk_bdev_fn_table | The function pointers every bdev of this type responds to | One per bdev module |

The split is deliberate. The spdk_bdev_module carries lifecycle (init, fini, examine, config serialization): things that happen once per module per process. The spdk_bdev_fn_table carries the per-instance hot-path operations (submit_request, destruct, get_io_channel): things that happen per bdev and per I/O. An spdk_bdev points at both: module for lifecycle, fn_table for I/O. The module struct's members come first:

  1. module_init — called by the framework during spdk_bdev_initialize. Do per-module setup here. Return 0 on success, <0 on error. If async_init is true, return 0 and call spdk_bdev_module_init_done() later.

  2. init_complete — optional. Called when the bdev subsystem has finished initialization. By this point all other modules have been initialized and all configuration has been applied. Good place to auto-create virtual bdevs.

  3. fini_start — optional. Called before the framework starts unregistering bdevs. The module should release any claims on bdevs it doesn't have a virtual bdev on top of.

  4. module_fini — called after all bdevs of all modules are unregistered. Final cleanup.

  5. config_json — optional. Write module-level config (or per-bdev creation RPCs) to the JSON stream. Either this or fn_table->write_config_json can be used for per-bdev config, not both.

  6. name: the module's name. Used to look the module up in the framework's module list and to identify it in log and error messages. Must be unique within the process.

  7. get_ctx_size — return the number of bytes the framework should reserve for the module's per-IO scratch space. Malloc returns sizeof(struct malloc_task). Passthru returns sizeof(struct passthru_bdev_io). NVMe is larger because it carries queue ids, etc.

  8. examine_config — virtual bdevs only. Called for every bdev that gets registered. The module decides synchronously whether to claim the bdev and create a virtual bdev on top.

  9. examine_disk — virtual bdevs only. Called after examine_config for bdevs that were claimed. May do I/O and finish asynchronously. Call spdk_bdev_module_examine_done() when done.

  10. async_init / async_fini / async_fini_start — flags. If true, the module's init/fini functions can return 0 and complete asynchronously via the corresponding _done() API.

Now the function table. This is the per-bdev hot path.

  1. destruct — required. Called when the bdev is being unregistered. Free the module's per-bdev context (ctx). Return 0 on synchronous completion, or 1 and call spdk_bdev_destruct_done() later for async.

  2. submit_request — required. The hot path. Module dispatches the bdev_io to its backing device (NVMe queue, AIO, malloc memcpy, …). When done, call spdk_bdev_io_complete(). See 4.3 for the full lifecycle.

  3. io_type_supported — required. Capability check. Returns true if this bdev supports a given I/O type. The framework calls this both at submit time and when populating the io_type_supported bitmap at register time.

  4. get_io_channel — required. Return a per-thread spdk_io_channel for this bdev. Almost always implemented as spdk_get_io_channel(ctx) for an io_device registered in module_init.

  5. dump_info_json — optional. Add module-specific fields to the bdev_get_bdevs output.

  6. write_config_json — optional. Output the RPC to recreate this specific bdev. Mutually exclusive with the module-level config_json for per-bdev output.

  7. get_spin_time — optional. Microseconds the thread spent spinning waiting for I/O last sample. Used by vtune integration and disk utilization.

  8. get_module_ctx — optional. Return a module-specific context pointer for this bdev. Set at register time; rarely changed.

  9. get_memory_domains — optional. Report which memory domains (DMA-capable memory regions) this bdev can work with. Used for zero-copy I/O across memory domain boundaries.

  10. reset_device_stat — optional. Reset the module's per-bdev I/O statistics.

  11. dump_device_stat_json — optional. Output per-bdev I/O statistics to a JSON stream.

  12. accel_sequence_supported — optional. Return true if the bdev can handle an spdk_accel_sequence in front of (or after) a given I/O type. Used for compressed reads, encryption, etc.

The (roughly) twelve things every module must implement

The style guide for this curriculum (and the actual SPDK docs) talks about "the 7 things every module must implement". That count is six required function pointers plus one strongly recommended config hook. The full picture has more, but the must-implement set is:

| # | Function pointer | Where | Required? |
|---|---|---|---|
| 1 | submit_request | fn_table | Yes (the hot path) |
| 2 | destruct | fn_table | Yes (cleanup) |
| 3 | io_type_supported | fn_table | Yes (capability) |
| 4 | get_io_channel | fn_table | Yes (per-thread state) |
| 5 | module_init | spdk_bdev_module | Yes (startup) |
| 6 | get_ctx_size | spdk_bdev_module | Yes (driver context size) |
| 7 | write_config_json or config_json | fn_table or module | Strongly recommended (config save/load) |

Plus the optional ones (dump_info_json, get_spin_time, get_memory_domains, accel_sequence_supported, etc.) and the virtual-bdev-specific ones (examine_config, examine_disk, fini_start). The "12" number comes from adding in init_complete, module_fini, get_module_ctx, reset_device_stat, and dump_device_stat_json.

Walkthrough: bdev_malloc.c — the simplest module

The malloc module is what every SPDK tutorial starts from, because it's the only module where the backing storage is ordinary memory. There's no device, no DMA, no interrupt. You submit, the framework copies bytes. Done.

Here's the entire spdk_bdev_module definition:
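A sketch of that definition, built against a stand-in struct rather than the real SPDK header so it compiles on its own. The callback names (bdev_malloc_initialize, bdev_malloc_finish, bdev_malloc_get_ctx_size) and the malloc_task_model layout are assumptions for illustration; only the four field names come from this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for spdk_bdev_module: only members named on this page. */
struct bdev_module_model {
	const char *name;
	int (*module_init)(void);
	void (*init_complete)(void);
	void (*fini_start)(void);
	void (*module_fini)(void);
	int (*get_ctx_size)(void);
	bool async_init;
};

/* Stand-in for the module's per-I/O scratch struct (layout is hypothetical). */
struct malloc_task_model {
	int status;
	int num_outstanding;
};

/* Callback names are assumed, not confirmed by this page. */
static int bdev_malloc_initialize(void) { return 0; }
static void bdev_malloc_finish(void) { }
static int bdev_malloc_get_ctx_size(void)
{
	return (int)sizeof(struct malloc_task_model);
}

/* The four fields. Every member not listed is zero-initialized (NULL or false). */
static struct bdev_module_model malloc_if = {
	.name = "malloc",
	.module_init = bdev_malloc_initialize,
	.module_fini = bdev_malloc_finish,
	.get_ctx_size = bdev_malloc_get_ctx_size,
};

/* True when the module-level hooks the table marks as required are present. */
static bool module_has_required_hooks(const struct bdev_module_model *m)
{
	assert(m != NULL);
	return m->name != NULL && m->module_init != NULL && m->get_ctx_size != NULL;
}
```

The designated initializer is what makes "optional" cheap: anything you leave out is NULL, and the framework is expected to check before calling.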

Four fields. That's it. No async_init, no examine_config, no config_json. The malloc module doesn't auto-create bdevs (you call bdev_malloc_create over RPC), it doesn't claim any base bdevs, and it doesn't need async init. The two struct fields you might miss: init_complete and fini_start are NULL, and that's allowed — they're optional.

Now the function table:
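A sketch of the table's shape, again with a stand-in type. The twelve slot names are the ones listed earlier on this page; every slot shares one generic pointer type here to keep the model short, which the real spdk_bdev_fn_table does not do. The seven filled slots are inferred from the five that the text says are missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Generic slot type: real signatures differ per slot. */
typedef void (*slot_fn)(void);

/* Stand-in for spdk_bdev_fn_table with the twelve slots named on this page. */
struct fn_table_model {
	slot_fn destruct;
	slot_fn submit_request;
	slot_fn io_type_supported;
	slot_fn get_io_channel;
	slot_fn dump_info_json;
	slot_fn write_config_json;
	slot_fn get_spin_time;
	slot_fn get_module_ctx;
	slot_fn get_memory_domains;
	slot_fn reset_device_stat;
	slot_fn dump_device_stat_json;
	slot_fn accel_sequence_supported;
};

static void stub(void) { }

/* Seven slots set; the other five stay NULL. */
static const struct fn_table_model malloc_fn_table_model = {
	.destruct = stub,
	.submit_request = stub,
	.io_type_supported = stub,
	.get_io_channel = stub,
	.write_config_json = stub,
	.get_memory_domains = stub,
	.accel_sequence_supported = stub,
};

/* Counts the non-NULL slots. */
static int fn_table_count_set(const struct fn_table_model *t)
{
	assert(t != NULL);
	const slot_fn slots[] = {
		t->destruct, t->submit_request, t->io_type_supported,
		t->get_io_channel, t->dump_info_json, t->write_config_json,
		t->get_spin_time, t->get_module_ctx, t->get_memory_domains,
		t->reset_device_stat, t->dump_device_stat_json,
		t->accel_sequence_supported,
	};
	int n = 0;
	for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
		n += (slots[i] != NULL);
	}
	return n;
}

/* True when the four fn_table entries marked required are present. */
static bool fn_table_has_required(const struct fn_table_model *t)
{
	return t->destruct && t->submit_request &&
	       t->io_type_supported && t->get_io_channel;
}
```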

Seven of the twelve possible function pointers filled in. The missing five (dump_info_json, get_spin_time, get_module_ctx, reset_device_stat, dump_device_stat_json) are all optional. None of them would do anything useful for malloc anyway — there's no "spin time" for a memcpy, no "reset stats" beyond the framework's built-in.

The module init: register an io_device

The interesting thing in module_init is the spdk_io_device_register call. This is what makes get_io_channel work:

The io_device is the address of the tailq g_malloc_disks (see line 253: static TAILQ_HEAD(, malloc_disk) g_malloc_disks); any unique pointer works as the key. The framework creates one channel per thread, on that thread's first call to spdk_get_io_channel(&g_malloc_disks); later calls from the same thread return the same channel with its reference count incremented. Each channel carries sizeof(struct malloc_channel) bytes of context (set in the fourth argument), and the framework invokes malloc_create_channel_cb on channel create and malloc_destroy_channel_cb on channel destroy.

get_io_channel: trivial

And the channel getter, for comparison:

Three lines, including the signature. The spdk_get_io_channel() call looks up (or creates) a per-thread channel for the io_device and returns it. The framework stores that channel inside its own spdk_bdev_channel and hands the same spdk_io_channel back to your submit_request as its first argument.
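To make the register-then-get relationship concrete, here is a small single-process model of it. Nothing here is the SPDK API: device_register, get_channel, and put_channel are hypothetical stand-ins, threads are plain integer ids, and the only behaviour modelled is "look up or create, one context per thread, sized at register time".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_THREADS 4

typedef int (*create_cb)(void *io_device, void *ctx_buf);
typedef void (*destroy_cb)(void *io_device, void *ctx_buf);

/* One registered io_device plus its per-thread channel contexts. */
struct io_device_model {
	void *io_device;          /* unique key, e.g. the address of a global list */
	create_cb create;
	destroy_cb destroy;
	uint32_t ctx_size;        /* bytes of per-thread state requested */
	void *ctx[MAX_THREADS];
	int ref[MAX_THREADS];
};

static void device_register(struct io_device_model *d, void *io_device,
			    create_cb create, destroy_cb destroy, uint32_t ctx_size)
{
	assert(create != NULL && destroy != NULL && ctx_size > 0);
	*d = (struct io_device_model){
		.io_device = io_device, .create = create,
		.destroy = destroy, .ctx_size = ctx_size,
	};
}

/* Look up the thread's channel, creating it on first use. NULL on failure. */
static void *get_channel(struct io_device_model *d, int thread)
{
	assert(thread >= 0 && thread < MAX_THREADS);
	if (d->ref[thread] == 0) {
		void *buf = calloc(1, d->ctx_size);
		if (buf == NULL) {
			return NULL;
		}
		if (d->create(d->io_device, buf) != 0) {
			free(buf);
			return NULL;
		}
		d->ctx[thread] = buf;
	}
	d->ref[thread]++;
	return d->ctx[thread];
}

/* Drop one reference; the last put runs the destroy callback. */
static void put_channel(struct io_device_model *d, int thread)
{
	assert(thread >= 0 && thread < MAX_THREADS && d->ref[thread] > 0);
	if (--d->ref[thread] == 0) {
		d->destroy(d->io_device, d->ctx[thread]);
		free(d->ctx[thread]);
		d->ctx[thread] = NULL;
	}
}

/* Example per-thread state and callbacks (hypothetical). */
struct malloc_channel_model { int outstanding; };
static int g_creates, g_destroys;

static int create_ch(void *dev, void *buf)
{
	(void)dev;
	((struct malloc_channel_model *)buf)->outstanding = 0;
	g_creates++;
	return 0;
}

static void destroy_ch(void *dev, void *buf)
{
	(void)dev;
	(void)buf;
	g_destroys++;
}
```

Two gets on the same thread yield the same context, and the create callback runs once per thread, not once per call.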

submit_request: the dispatch table

The submit path is a switch on bdev_io type. For each type, the module either completes immediately (reset, flush, abort) or kicks off a real operation (read, write, unmap, write-zeros, copy, zcopy).

Note the call to spdk_io_channel_get_ctx(ch). This is how you go from the framework's wrapped channel (spdk_io_channel *) to your module's per-thread state (struct malloc_channel *). The framework gives you a generic channel pointer; the io_device machinery gives you the bytes you asked for at register time.

And the dispatch:

Every case ends by either calling spdk_bdev_io_complete() (or a helper that does so) or scheduling async work that will eventually call it. The default case returns -1 from the module's internal dispatch helper, and the module's submit_request wrapper responds by completing the I/O with FAILED. That is the contract to keep: the fn_table submit_request returns void, so the framework cannot detect an I/O that was neither completed nor queued. Such an I/O never completes.
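The dispatch shape described above can be sketched in a few lines. The enum values, struct, and function names are stand-ins invented for this model, and the "real operation" cases complete inline where a real module would start a copy and complete later.

```c
#include <assert.h>
#include <stddef.h>

enum io_type_model { IO_READ, IO_WRITE, IO_UNMAP, IO_FLUSH, IO_RESET, IO_UNKNOWN };
enum io_status_model { IO_STATUS_PENDING, IO_STATUS_SUCCESS, IO_STATUS_FAILED };

struct bdev_io_model {
	enum io_type_model type;
	enum io_status_model status;
};

/* Stand-in for the framework's completion call. */
static void io_complete_model(struct bdev_io_model *io, enum io_status_model s)
{
	io->status = s;
}

/* Inner dispatch: 0 if the I/O was completed or queued, -1 if unsupported. */
static int submit_inner_model(struct bdev_io_model *io)
{
	switch (io->type) {
	case IO_READ:
	case IO_WRITE:
	case IO_UNMAP:
		/* A real module kicks off work here; the model completes inline. */
		io_complete_model(io, IO_STATUS_SUCCESS);
		return 0;
	case IO_FLUSH:
	case IO_RESET:
		/* Nothing to do for a memory-backed device: complete immediately. */
		io_complete_model(io, IO_STATUS_SUCCESS);
		return 0;
	default:
		return -1;
	}
}

/* Outer entry point: returns void, so it must fail the I/O itself. */
static void submit_request_model(struct bdev_io_model *io)
{
	assert(io != NULL);
	if (submit_inner_model(io) != 0) {
		io_complete_model(io, IO_STATUS_FAILED);
	}
}
```

The split into an int-returning helper and a void wrapper keeps the failure path in one place: no case can forget to complete an unsupported I/O.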

Walkthrough: vbdev_passthru.c — a more realistic module

Malloc is a leaf module — it owns a chunk of memory and that memory is the bdev. Passthru is a virtual module — it sits on top of another bdev and forwards every operation to it. This is the "vbdev" pattern and it underlies lvol, raid, split, gpt, and most of the composition you'll do.

The struct itself: two more fields

The passthru module struct is similar but not identical:

Three things to note relative to malloc:

  • examine_config is set. This is what makes passthru a "virtual" module. The framework calls it for every bdev that gets registered; passthru decides whether to attach itself on top of that bdev.

  • config_json is set, not write_config_json in the fn_table. The passthru module's bdevs are created by the RPC handler, and the module-level config_json walks the global list of vbdev_passthru nodes to emit one bdev_passthru_create RPC per bdev.

  • No init_complete, same as malloc. Passthru can complete its setup during the examine path, so it doesn't need post-init notification.

submit_request: forward to the base

The passthru submit path is a 1:1 forwarding of every I/O type. It uses the public bdev API to submit a new bdev_io on the base bdev, with a private completion callback that translates the result back to the original I/O:

Three things to notice:

  1. The base bdev is reached via pt_node->base_desc and pt_ch->base_ch. The passthru opened the base in vbdev_passthru_register() and saved the desc; the per-thread base channel was grabbed in pt_bdev_ch_create_cb() at channel create time.

  2. The completion callback is _pt_complete_io. It does the inverse translation: completes the original bdev_io with the status from the new one, then frees the new bdev_io with spdk_bdev_free_io. This is the "submit a child bdev_io and bridge it to the parent" pattern — the bdev_io you submit and the bdev_io you receive are different objects, even if their data represents the same operation.

  3. The -ENOMEM case: if the framework can't allocate a child bdev_io for the base, passthru queues the parent's bdev_io for later retry. This is vbdev_passthru_queue_io, which uses spdk_bdev_queue_io_wait to resubmit the I/O once a bdev_io is returned to the pool. Modules that allocate from the bdev_io mempool need to handle this.
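The three points above fit in one small model: a child I/O that points back at its parent, a completion bridge that copies the status and frees the child, and a retry when the pool runs dry. All names are hypothetical, the pool holds a single element so exhaustion is easy to trigger, and the wait queue is a single slot rather than a real list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum status_model { ST_PENDING, ST_SUCCESS, ST_FAILED };

struct io_model {
	enum status_model status;
	struct io_model *parent;   /* set on child I/Os */
	struct io_model *child;    /* set on parent I/Os, for inspection */
	bool in_use;
};

#define POOL_SIZE 1
static struct io_model g_pool[POOL_SIZE];  /* stand-in for the bdev_io mempool */
static struct io_model *g_waiting;         /* one-slot wait queue */

static int pt_submit_model(struct io_model *parent);

static struct io_model *pool_get(void)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (!g_pool[i].in_use) {
			g_pool[i].in_use = true;
			return &g_pool[i];
		}
	}
	return NULL;
}

/* Return a child to the pool and retry a waiting parent, if any. */
static void pool_put(struct io_model *child)
{
	child->in_use = false;
	if (g_waiting != NULL) {
		struct io_model *retry = g_waiting;
		g_waiting = NULL;
		(void)pt_submit_model(retry);
	}
}

/* Forward the parent to the base by allocating a child that points back. */
static int pt_submit_model(struct io_model *parent)
{
	struct io_model *child = pool_get();
	if (child == NULL) {
		assert(g_waiting == NULL);
		g_waiting = parent;        /* park it until a child frees up */
		return -ENOMEM;
	}
	child->status = ST_PENDING;
	child->parent = parent;
	parent->child = child;
	return 0;
}

/* Completion bridge: translate the child's result, then free the child. */
static void pt_complete_io_model(struct io_model *child, bool success)
{
	child->parent->status = success ? ST_SUCCESS : ST_FAILED;
	pool_put(child);
}
```

Note the order in the bridge: the parent is completed before the child is freed, so the retry triggered by the free never observes a half-finished parent.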

destruct: the opposite of register

The destruct path mirrors the register path. For passthru:

The comment is gold: "It is important to follow this exact sequence of steps for destroying a vbdev…" That's not hyperbole. Get the order wrong and you can double-free, leak a desc, or deadlock on a thread message.

  1. Remove from your module's list first. This way no other code path in your module can find this node and try to use it.

  2. Release the claim on the base bdev. Otherwise the base bdev can't be unregistered (it thinks you still own it).

  3. Close the base desc on its opening thread. spdk_bdev_close is thread-bound; if you're on a different thread, send a message.

  4. Unregister the io_device. After this, the framework will drain any channels that are still open and call your destroy callback.
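The four steps can be written down as a model that records what it does, which makes the ordering testable. The struct, the step enum, and the integer thread ids are all invented for this sketch; the real calls are the ones named in the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum destruct_step {
	STEP_LIST_REMOVE = 1,
	STEP_RELEASE_CLAIM,
	STEP_CLOSE_DESC,
	STEP_SEND_CLOSE_MSG,
	STEP_UNREGISTER_IO_DEVICE,
};

/* Stand-in for the per-vbdev node. */
struct pt_node_model {
	int open_thread;     /* thread that opened the base desc */
	bool on_list;
	bool claimed;
	bool desc_open;
};

#define MAX_STEPS 8
static enum destruct_step g_steps[MAX_STEPS];
static int g_step_count;

static void record(enum destruct_step s)
{
	assert(g_step_count < MAX_STEPS);
	g_steps[g_step_count++] = s;
}

static void reset_steps(void)
{
	g_step_count = 0;
}

/* Tear down in the order described above. Returns 0 (synchronous). */
static int pt_destruct_model(struct pt_node_model *n, int current_thread)
{
	/* 1. Unlink first so no other path can find the node. */
	n->on_list = false;
	record(STEP_LIST_REMOVE);

	/* 2. Release the claim on the base. */
	n->claimed = false;
	record(STEP_RELEASE_CLAIM);

	/* 3. Close the desc on the thread that opened it. */
	if (n->open_thread != current_thread) {
		record(STEP_SEND_CLOSE_MSG);   /* close happens later, elsewhere */
	} else {
		n->desc_open = false;
		record(STEP_CLOSE_DESC);
	}

	/* 4. Unregister the io_device last. */
	record(STEP_UNREGISTER_IO_DEVICE);
	return 0;
}
```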

Registration: how a module declares itself

Once your spdk_bdev_module struct is defined, you make the framework aware of it with one line:

This expands to an __attribute__((constructor)) function. The compiler places a pointer to it in the object's constructor section (.init_array or .ctors, depending on the toolchain), the startup code runs it before main(), and it calls spdk_bdev_module_list_add(module) to add the module to the framework's global list. By the time spdk_bdev_initialize is called, the list is fully populated.

For malloc:

For passthru:

Same macro, different struct. Both modules are now part of the framework. The framework walks the list during spdk_bdev_initialize and calls module_init on each.
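The constructor trick is easy to reproduce outside SPDK. The sketch below uses a hypothetical MODULE_REGISTER_MODEL macro with the same shape as the registration described above, and stand-in module structs. __attribute__((constructor)) is a GCC/Clang extension, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for spdk_bdev_module with an intrusive list link. */
struct module_model {
	const char *name;
	int (*module_init)(void);
	struct module_model *next;
};

static struct module_model *g_module_list;

static void module_list_add(struct module_model *m)
{
	m->next = g_module_list;
	g_module_list = m;
}

/* One constructor per module; each runs before main(). */
#define MODULE_REGISTER_MODEL(name, module) \
	static void __attribute__((constructor)) module_register_##name(void) \
	{ \
		module_list_add(module); \
	}

static int g_inits;
static int model_init(void) { g_inits++; return 0; }

static struct module_model malloc_model = { .name = "malloc", .module_init = model_init };
static struct module_model passthru_model = { .name = "passthru", .module_init = model_init };

MODULE_REGISTER_MODEL(malloc, &malloc_model)
MODULE_REGISTER_MODEL(passthru, &passthru_model)

static struct module_model *module_find(const char *name)
{
	for (struct module_model *m = g_module_list; m != NULL; m = m->next) {
		if (strcmp(m->name, name) == 0) {
			return m;
		}
	}
	return NULL;
}

/* Walk the list and initialize every module; stop at the first error. */
static int bdev_initialize_model(void)
{
	for (struct module_model *m = g_module_list; m != NULL; m = m->next) {
		int rc = m->module_init();
		if (rc != 0) {
			return rc;
		}
	}
	return 0;
}
```

No central table of modules exists anywhere in this file: linking a module's object into the binary is what registers it.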

flowchart LR
A[Module .c file] -->|constructor| B[spdk_bdev_module_list_add]
B --> C[Global module list]
C --> D[spdk_bdev_initialize]
D --> E[module_init on each]
E -->|registers io_device| F[spdk_io_device_register]
F --> G[Module ready to receive I/O]
fig. 1: module registration and discovery. The module is added to the global list at program startup, before main() runs (via a constructor). spdk_bdev_initialize walks the list and calls module_init on each. After that, the module can register bdevs, which is the next step.

The instance vs the type: spdk_bdev vs spdk_bdev_module

A common confusion: how is the malloc module different from a single malloc bdev? The answer is the same as "class vs object":

  • spdk_bdev_module malloc_if is the class. There is exactly one of it. It has the module_init and module_fini functions. It's about lifecycle.

  • spdk_bdev pt_bdev is an object. There can be many of them. Each has its own name, blockcnt, ctxt back-pointer, and fn_table pointer. The fn_table functions (like submit_request) are the same for all bdevs of the same module. Most of them receive ctxt as their first argument; submit_request receives the channel and the bdev_io instead and reaches the instance through bdev_io->bdev. Either way, that is how the module knows which instance to operate on.

Concretely: passthru defines one spdk_bdev_module passthru_if and one spdk_bdev_fn_table vbdev_passthru_fn_table. But every spdk_bdev pt_bdev has its own struct vbdev_passthru *pt_node as its ctxt, and that node is what the submit_request callback uses to find its base bdev.

The framework is what connects the two. When a bdev is registered, the framework records the spdk_bdev_module *module field. When an I/O is submitted, the framework reads spdk_bdev->fn_table->submit_request and calls it. When an event fires, the framework consults spdk_bdev->module for any module-wide behavior (like the examine chain).
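The class-versus-object split looks like this in miniature. The types are stand-ins (the real spdk_bdev has many more members), and only destruct is modelled, since it is a callback that takes ctxt directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct module_model { const char *name; };

struct fn_table_model {
	int (*destruct)(void *ctxt);
};

/* Stand-in for spdk_bdev: per-instance data plus pointers to the shared parts. */
struct bdev_model {
	const char *name;
	void *ctxt;
	const struct fn_table_model *fn_table;
	const struct module_model *module;
};

/* Stand-in for the module's per-bdev node. */
struct pt_node_model {
	const char *base_name;
	bool destroyed;
};

/* One function serves every instance; ctxt says which one. */
static int pt_destruct_model(void *ctxt)
{
	struct pt_node_model *node = ctxt;
	node->destroyed = true;
	return 0;
}

/* Exactly one of each per module type. */
static const struct module_model passthru_if_model = { .name = "passthru" };
static const struct fn_table_model passthru_fn_table_model = {
	.destruct = pt_destruct_model,
};

static void bdev_init_model(struct bdev_model *b, const char *name,
			    struct pt_node_model *node)
{
	b->name = name;
	b->ctxt = node;
	b->fn_table = &passthru_fn_table_model;
	b->module = &passthru_if_model;
}

/* What the framework does: go through the table, pass the instance's ctxt. */
static int bdev_unregister_model(struct bdev_model *b)
{
	return b->fn_table->destruct(b->ctxt);
}
```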

Edge cases: missing implementations, shared state, expected behavior

What if you forget io_type_supported?

You can't. The framework's I/O submit path calls it regardless of whether the bitmap has the type set, and a NULL function pointer will segfault. The compiler won't flag the omission, so in practice every working module sets it.

The more interesting failure is the wrong implementation. The passthru module's approach (delegate to the base) is correct. A naive implementation that returns true for everything will accept I/O types the underlying device can't service, and the failure mode is a generic SPDK_BDEV_IO_STATUS_FAILED that doesn't tell you why.
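A minimal contrast between the two implementations, using invented stand-in types. The base device here advertises its capabilities as a boolean array, which is an assumption of the sketch and not how a real base bdev reports them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum io_type_model { IO_READ, IO_WRITE, IO_UNMAP, IO_ZCOPY, IO_TYPE_COUNT };

/* Stand-in for a base bdev with a fixed capability set. */
struct base_bdev_model {
	bool supports[IO_TYPE_COUNT];
};

/* Stand-in for the vbdev node that sits on top of the base. */
struct pt_node_model {
	const struct base_bdev_model *base;
};

static bool base_io_type_supported(const struct base_bdev_model *b,
				   enum io_type_model t)
{
	return t < IO_TYPE_COUNT && b->supports[t];
}

/* Delegating version: the vbdev supports exactly what its base supports. */
static bool pt_io_type_supported_model(void *ctx, enum io_type_model t)
{
	struct pt_node_model *node = ctx;
	assert(node != NULL && node->base != NULL);
	return base_io_type_supported(node->base, t);
}

/* Naive version: accepts everything, including types the base will reject. */
static bool naive_io_type_supported_model(void *ctx, enum io_type_model t)
{
	(void)ctx;
	(void)t;
	return true;
}
```

With the naive version, an unmap against a read/write-only base is accepted at the capability check and only fails later, at submit time, with no hint about the cause.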

What if your get_ctx_size returns 0?

That's fine. The malloc module returns sizeof(struct malloc_task), but a module that has no per-IO state (very simple wrappers, for example) can return 0. The framework allocates sizeof(spdk_bdev_io) + max_ctx_size as the mempool element size; if your size is 0, you get just the bdev_io header.
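The sizing rule is simple arithmetic: one header plus the largest context any module asks for. The header struct and the three example callbacks below are hypothetical; treating a NULL callback as size 0 is a choice made by this model.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the fixed part of a bdev_io. */
struct bdev_io_hdr_model {
	int type;
	int status;
	void *cb_arg;
};

typedef int (*get_ctx_size_fn)(void);

/* Largest per-I/O context requested by any module in the list. */
static size_t max_ctx_size_model(const get_ctx_size_fn *fns, size_t n)
{
	size_t max = 0;
	for (size_t i = 0; i < n; i++) {
		if (fns[i] == NULL) {
			continue;          /* model choice: missing callback means 0 */
		}
		int sz = fns[i]();
		assert(sz >= 0);
		if ((size_t)sz > max) {
			max = (size_t)sz;
		}
	}
	return max;
}

/* Pool element size: header plus the largest context. */
static size_t io_element_size_model(const get_ctx_size_fn *fns, size_t n)
{
	return sizeof(struct bdev_io_hdr_model) + max_ctx_size_model(fns, n);
}

/* Example modules with different scratch-space needs. */
static int ctx_zero(void) { return 0; }
static int ctx_small(void) { return 16; }
static int ctx_large(void) { return 64; }
```

One consequence worth noticing: a single module with a large context raises the element size for every I/O in the process, including I/O to modules that asked for nothing.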

What if your module_init does long synchronous work?

It'll block the app thread (the one that called spdk_bdev_initialize). That thread isn't doing I/O, but it is doing reactor bookkeeping. If module_init takes 100ms, the whole process stalls for 100ms. The fix is to set async_init = true, return 0 immediately, and call spdk_bdev_module_init_done() when ready.

What if you share state across bdevs incorrectly?

The classic bug: a module registers multiple bdevs and uses a single global list or a single lock for them. That's not wrong per se, but it conflicts with the framework's "one channel per thread per bdev" model. The framework can pin a thread and submit to a single bdev; if your code path takes a global lock, you've killed that bdev's scalability.

The correct pattern: per-bdev state in spdk_bdev.ctxt (or in a struct hung off it), per-thread state in your channel struct, and no global locks on the I/O path.

What if your destruct is async and you forget to call spdk_bdev_destruct_done?

The bdev will never be unregistered. The framework's unregister callback never fires. Your application will hang at shutdown waiting for it. (This is one of the very few "silently hangs" failures in SPDK. The fix is one line: a spdk_bdev_destruct_done(bdev, 0) call when your async work is done.)
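A model of why the hang happens. The return-value convention (0 for synchronous, 1 for asynchronous) is the one stated earlier on this page; the struct and the function names ending in _model are invented stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a registered bdev and its unregister state. */
struct bdev_model {
	int (*destruct)(void *ctxt);
	void *ctxt;
	bool registered;
	bool destruct_pending;
	bool unregister_cb_fired;
	int unregister_status;
};

static void finish_unregister(struct bdev_model *b, int rc)
{
	b->registered = false;
	b->destruct_pending = false;
	b->unregister_status = rc;
	b->unregister_cb_fired = true;
}

/* Framework side: call destruct, then either finish now or wait. */
static void bdev_unregister_model(struct bdev_model *b)
{
	int rc = b->destruct(b->ctxt);
	if (rc > 0) {
		/* Async: nothing more happens until the module reports back. */
		b->destruct_pending = true;
		return;
	}
	finish_unregister(b, rc);
}

/* Module side: the one call that un-sticks an async destruct. */
static void bdev_destruct_done_model(struct bdev_model *b, int bdeverrno)
{
	assert(b->destruct_pending);
	finish_unregister(b, bdeverrno);
}

static int sync_destruct(void *ctxt) { (void)ctxt; return 0; }
static int async_destruct(void *ctxt) { (void)ctxt; return 1; }
```

In the async case the bdev stays registered and the callback stays unfired for as long as the done call is missing, which is exactly the shutdown hang described above.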

What if you claim a bdev you don't have a vbdev for?

spdk_bdev_module_claim_bdev is documented as for vbdev modules that "create virtual bdevs on top" of the base. If you claim a bdev and never create the vbdev (e.g. examine_config returns without doing anything), the base bdev can never be unregistered by another module. The fix is to either create the vbdev or not claim the bdev.

What if you register two bdevs with the same name?

spdk_bdev_register returns -EEXIST. The check happens in the framework, and the second registration is the loser. The framework does not take ownership of a bdev that failed to register, so your failure path has to free the bdev and its ctxt.

What if your channel create callback fails partway?

The framework assumes your channel create callback is atomic: if it returns non-zero, the framework treats the channel as uninitialized and frees it without calling your destroy callback. If you've allocated state inside the callback, you must free it on the failure path. The malloc module does this carefully: for example, when registering its completion poller fails, it explicitly releases the accel channel it had already acquired.
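The cleanup-on-failure rule in sketch form. The two resources (an accel channel and a poller) are named after the ones mentioned above, but acquire, release, the failure switches, and the order of acquisition are all assumptions of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the module's per-thread channel state. */
struct malloc_channel_model {
	void *accel_channel;
	void *poller;
};

static bool g_fail_accel, g_fail_poller;   /* test switches */
static int g_live_resources;               /* leak detector */

/* Hypothetical resource helpers: count what is currently held. */
static void *acquire(bool fail)
{
	if (fail) {
		return NULL;
	}
	g_live_resources++;
	return &g_live_resources;
}

static void release(void *r)
{
	if (r != NULL) {
		g_live_resources--;
	}
}

/* Create callback: on any failure, undo everything acquired so far. */
static int create_channel_cb_model(void *io_device, void *ctx_buf)
{
	struct malloc_channel_model *ch = ctx_buf;
	(void)io_device;

	ch->accel_channel = acquire(g_fail_accel);
	if (ch->accel_channel == NULL) {
		return -ENOMEM;
	}

	ch->poller = acquire(g_fail_poller);
	if (ch->poller == NULL) {
		/* The destroy callback will not run for us: clean up here. */
		release(ch->accel_channel);
		ch->accel_channel = NULL;
		return -ENOMEM;
	}
	return 0;
}

/* Destroy callback: only ever sees fully created channels. */
static void destroy_channel_cb_model(void *io_device, void *ctx_buf)
{
	struct malloc_channel_model *ch = ctx_buf;
	(void)io_device;
	release(ch->poller);
	release(ch->accel_channel);
}
```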

What to take away

A bdev module is two structs: an spdk_bdev_module for the type, and an spdk_bdev_fn_table for the per-instance operations. The module struct is about lifecycle. The fn_table is about the hot path. You register with a constructor macro, the framework finds your module at init time, and you become a first-class bdev type.

The "12 things" (or "7 must-haves") are the function pointers. The simplest realistic module is bdev_malloc.c. The most realistic virtual module is vbdev_passthru.c. The next page, the bdev_io lifecycle, picks up from submit_request and walks what happens between "I just got an I/O" and "I/O is done."