Layer 8 · Write a bdev module

Reading passthru, the right way.

The vbdev_passthru module is 788 lines of C and the most useful file in the SPDK tree for learning how bdev modules work. It's a virtual module: it opens a base bdev, registers itself on top, and forwards every I/O 1:1. There is no transformation, no caching, no policy. That makes the code boring in the best possible way: nothing is hidden behind a clever algorithm. This page walks the file in the order a real reader should read it — module struct first, lifecycle second, construction third, hot path last — and annotates the why for every non-obvious line.

~25 min read · 2 diagrams · prerequisites: 4.2 · 4.3
On this page
  1. File map: 13 things to look at, in order
  2. The module struct: passthru_if (lines 34-43)
  3. The struct zoo: vbdev_passthru, pt_io_channel, passthru_bdev_io
  4. Module init: vbdev_passthru_init (lines 500-503)
  5. Examine: vbdev_passthru_examine (lines 781-786)
  6. Register: vbdev_passthru_register (lines 589-717)
  7. Hot path: vbdev_passthru_submit_request (lines 271-352)
  8. The per-IO struct: how a request flows through passthru_bdev_io
  9. Completion: _pt_complete_io and the bridge
  10. Destruct: vbdev_passthru_destruct (lines 115-140)
  11. Persistence: vbdev_passthru_config_json
  12. Hot remove and event handling
  13. RPCs: vbdev_passthru_rpc.c companion file
  14. Edge cases: what breaks in passthru

File map: 13 things to look at, in order

The full file is 788 lines. If you read it top to bottom you'll get lost in the per-IO state struct and the NOMEM-queue handling. Here's the order that makes the concepts click:

  1. Module struct (lines 34-43: passthru_if)
  2. The struct zoo (lines 57-87: three structs to keep straight)
  3. The IO device registration (lines 681-682: spdk_io_device_register)
  4. Module init / finish (lines 500-517)
  5. The bdev construction (lines 589-717: vbdev_passthru_register)
  6. The channel create/destroy (lines 433-454)
  7. get_io_channel (lines 372-387)
  8. submit_request (lines 271-352)
  9. The completion path (lines 146-189)
  10. The NOMEM queue (lines 192-217)
  11. The destruct (lines 115-140)
  12. The hot-remove callback (lines 559-584)
  13. The examine callback (lines 781-786)

1. The module struct: passthru_if

Six fields are set; the rest of the struct is zero-initialized. Three optional callbacks (init_complete, fini_start, examine_disk) are left unset, which is allowed. The interesting choices:

  1. name = "passthru" — the module's name in JSON-RPC (bdev_get_bdevs -p passthru), in error logs, in bdev_get_bdevs output. Must be unique.

  2. examine_config — set, which is what makes this a "virtual" module. The framework calls this for every bdev that gets registered; passthru decides whether to attach itself on top.

  3. config_json at the module level — not fn_table->write_config_json. Passthru walks a global list of vbdev_passthru nodes and emits one bdev_passthru_create RPC per node. This pattern is right when your module has a global list of bdevs; fn_table->write_config_json is right when each bdev is self-contained.

The SPDK_BDEV_MODULE_REGISTER(passthru, &passthru_if) line is the magic: it expands to a function marked __attribute__((constructor)), which the C runtime runs at program load, before main(). The function calls spdk_bdev_module_list_add(), which adds the module to a global list. By the time spdk_bdev_initialize() runs, the list is complete and the framework iterates it, calling module_init on each.
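The registration mechanism can be sketched without SPDK at all. The following is a minimal model, assuming GCC or Clang; every demo_ name is hypothetical and stands in for the SPDK type or macro named in its comment, so treat it as an illustration of the constructor trick rather than the framework's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct spdk_bdev_module (only what the sketch needs). */
struct demo_module {
    const char *name;
    int (*module_init)(void);
    struct demo_module *next;
};

/* Global module list; it is populated before main() starts. */
static struct demo_module *g_demo_modules;
static int g_demo_init_calls;

/* Stand-in for spdk_bdev_module_list_add(). This sketch prepends for brevity. */
static void demo_module_list_add(struct demo_module *m)
{
    assert(m != NULL && m->name != NULL);
    m->next = g_demo_modules;
    g_demo_modules = m;
}

/* Stand-in for SPDK_BDEV_MODULE_REGISTER: emit a constructor that adds the module. */
#define DEMO_MODULE_REGISTER(tag, mod) \
    static void __attribute__((constructor)) demo_register_##tag(void) \
    { \
        demo_module_list_add(mod); \
    }

static int demo_passthru_init(void)
{
    g_demo_init_calls++;
    return 0;
}

static struct demo_module demo_passthru_if = {
    .name = "passthru",
    .module_init = demo_passthru_init,
};

DEMO_MODULE_REGISTER(passthru, &demo_passthru_if)

/* Look a module up by its unique name. */
static struct demo_module *demo_find_module(const char *name)
{
    for (struct demo_module *m = g_demo_modules; m != NULL; m = m->next) {
        if (strcmp(m->name, name) == 0) {
            return m;
        }
    }
    return NULL;
}

/* Stand-in for the framework's init pass: call module_init on every module. */
static int demo_initialize_all(void)
{
    int count = 0;

    for (struct demo_module *m = g_demo_modules; m != NULL; m = m->next) {
        if (m->module_init != NULL && m->module_init() != 0) {
            return -1;
        }
        count++;
    }
    return count;
}
```

By the time any ordinary function runs, demo_find_module("passthru") already returns the module, because the constructor fired during program load. No explicit registration call appears anywhere in the application.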

2. The struct zoo

Before reading further, internalize three structs. They show up throughout the file and the relationship between them is the whole architecture.

Struct | Lives in | Represents
struct vbdev_passthru | One per passthru bdev, on g_pt_nodes | A single passthru bdev and its relationship to a base bdev
struct pt_io_channel | One per (thread, passthru bdev) pair | Per-thread state: here, just the cached base channel
struct passthru_bdev_io | One per I/O in flight | Per-IO scratch space; here, just a marker byte and a queue entry

Three structs, three lifetimes: vbdev_passthru lives for the life of the bdev; pt_io_channel is created and destroyed by the framework on a per-thread basis; passthru_bdev_io is recycled from a mempool with every I/O. Confusing them is the source of most passthru-shaped bugs.

flowchart LR
g[vbdev_passthru g_pt_nodes] --> n1[vbdev_passthru node 1]
g --> n2[vbdev_passthru node 2]
n1 -->|"spdk_bdev pt_bdev"| bdev1[pt_bdev A]
n1 -->|base_bdev + base_desc| base1[base Malloc0]
n2 -->|"spdk_bdev pt_bdev"| bdev2[pt_bdev B]
n2 -->|base_bdev + base_desc| base2[base Malloc1]
bdev1 --> fn[vbdev_passthru_fn_table]
bdev2 --> fn
fn --> sr[vbdev_passthru_submit_request]
sr --> io[passthru_bdev_io per I/O]
sr --> ch1[pt_io_channel on thread 1]
sr --> ch2[pt_io_channel on thread 2]
fig. 1: the three structs and their relationships. Two passthru bdevs share one vbdev_passthru_fn_table, but each has its own vbdev_passthru node and its own set of per-thread channels. Each I/O carries a passthru_bdev_io for the duration of the operation.

3. Module init: vbdev_passthru_init

Empty. Why? Because passthru has no module-wide state to set up. The malloc module uses module_init to call spdk_io_device_register() for its module-wide io_device, but passthru registers one io_device per passthru bdev (in vbdev_passthru_register). That happens during the examine path, not at module init.

The corresponding module_fini is similarly trivial: it frees g_bdev_names, a list of (base_bdev_name, vbdev_name) tuples the module accumulated during config-file parsing. The actual vbdev_passthru nodes are gone by this point: they were freed in their destruct callbacks, which run before module_fini. What remains is the names list, which may still hold entries for passthru bdevs that were never built because their base bdev never appeared. Always free what you allocated; this is what that looks like.

4. Examine: vbdev_passthru_examine

The framework calls this for every bdev that gets registered, once per module that has examine_config set. Passthru uses it to ask: "is this bdev in my config file's vbdev_names list? if so, build a passthru on top of it." Then it calls spdk_bdev_module_examine_done(), which tells the framework "I'm done with this bdev; you can call the next module's examine."

5. Register: vbdev_passthru_register

The big one. 130 lines that build a passthru bdev from a base bdev name. Read in five passes:

Pass 1: walk the names list (lines 603-606)

The list of (base, vbdev) pairs comes from the SPDK config file (parsed by the app framework) or from an earlier bdev_passthru_create RPC. The comparison is exact-match: if bdev_name matches one of the base_bdev_names in the list, build a passthru on top of it. If not, exit silently — examine is called for every bdev, and passthru is only interested in the ones it was told to care about.

Pass 2: allocate and open (lines 609-639)

Four steps, each with a clean failure path: allocate pt_node, allocate the name string, open the base bdev descriptor, and look up the base bdev from that descriptor. The descriptor is opened with write=true (second argument) so passthru can forward writes, and the event callback vbdev_passthru_base_bdev_event_cb is what the framework calls when the base bdev disappears. The -ENODEV error is handled specially: the base bdev isn't there yet, which is fine, because we'll try again the next time a bdev shows up.

Pass 3: copy base bdev properties (lines 657-672)

A passthru bdev has the same geometry as its base: same block size, same block count, same DIF/DIX metadata layout, same NUMA node. If you forget to copy required_alignment, your passthru bdev will misreport its alignment requirements and the framework will (or won't) double-buffer, depending on whether the caller's iov happens to be aligned. Always copy.

Pass 4: register the io_device (lines 676-687)

Three critical lines:

  1. pt_node->pt_bdev.ctxt = pt_node — every callback the framework makes to the bdev (submit_request, destruct, etc.) gets the ctxt. By pointing it at our vbdev_passthru node, we can find our state from any callback. This is the instance lookup in the "class vs object" distinction.

  2. pt_node->pt_bdev.fn_table = &vbdev_passthru_fn_table — every bdev of type passthru shares this vtable. The submit_request pointer is the same for all passthru bdevs, but the ctxt is different, so the callback knows which bdev it's serving.

  3. spdk_io_device_register(pt_node, ...) — the magic that makes get_io_channel work. Each call to spdk_get_io_channel(pt_node) from a thread will allocate (or look up) a pt_io_channel for that thread. We capture the current thread because the base bdev desc was opened on it, and spdk_bdev_close() must run on the same thread.
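The "class vs object" split in the first two points can be sketched in plain C. This is a simplified model with hypothetical demo_ types: the real submit_request takes an I/O channel and a bdev_io rather than a bdev pointer, and the real structs carry many more fields.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bdev;

/* Hypothetical, trimmed-down stand-in for the bdev function table. */
struct demo_fn_table {
    int (*submit_request)(struct demo_bdev *bdev);
};

/* Hypothetical, trimmed-down stand-in for struct spdk_bdev. */
struct demo_bdev {
    const char *name;
    void *ctxt;                           /* per-instance state: the "object" */
    const struct demo_fn_table *fn_table; /* shared callbacks: the "class" */
};

/* Stand-in for struct vbdev_passthru. */
struct demo_pt_node {
    int id;
    struct demo_bdev pt_bdev;
};

/* One callback serves every instance; ctxt tells it which node it is serving. */
static int demo_submit_request(struct demo_bdev *bdev)
{
    struct demo_pt_node *node = bdev->ctxt;

    return node->id;
}

static const struct demo_fn_table demo_pt_fn_table = {
    .submit_request = demo_submit_request,
};

static void demo_pt_node_init(struct demo_pt_node *node, int id, const char *name)
{
    assert(node != NULL);
    node->id = id;
    node->pt_bdev.name = name;
    node->pt_bdev.ctxt = node;                  /* instance lookup */
    node->pt_bdev.fn_table = &demo_pt_fn_table; /* shared vtable */
}
```

Two nodes initialized this way point at the same demo_pt_fn_table, yet calling submit_request through either one reaches that node's own state.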

Pass 5: claim and register (lines 689-712)

spdk_bdev_module_claim_bdev() says "this module is now the exclusive writer on this base bdev, so no other vbdev on top of it will write." It's a marker that prevents weird shared-write races. The error path undoes every step: close the desc, remove from the global list, unregister the io_device, free the allocations.

spdk_bdev_register() makes the bdev visible to the framework. After this call, other modules (e.g. nvmf) can open pt_bdev. The error path is similar but adds spdk_bdev_module_release_bdev() to undo the claim. Note the order: release the claim first, then close the desc. Get the order wrong and you have a window where the base bdev can't be claimed by another module but the desc isn't closed yet.
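One common C idiom for keeping such error paths in order is a ladder of goto labels, where each failure point jumps to the label that undoes exactly what has been acquired so far. The sketch below is an assumption about style, not a copy of the passthru source (which may well write its cleanup inline); the demo_ names are hypothetical and the "undo" steps only record a letter so the order is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_fail {
    DEMO_FAIL_NONE,
    DEMO_FAIL_OPEN,
    DEMO_FAIL_CLAIM,
    DEMO_FAIL_REGISTER
};

/* Records the teardown order, one letter per step. */
static char g_undo_log[8];

static void demo_undo(char step)
{
    size_t n = strlen(g_undo_log);

    assert(n + 1 < sizeof(g_undo_log));
    g_undo_log[n] = step;
    g_undo_log[n + 1] = '\0';
}

/* Simulates the register sequence and fails at the requested step. */
static int demo_register(enum demo_fail fail_at)
{
    g_undo_log[0] = '\0';

    if (fail_at == DEMO_FAIL_OPEN) {
        goto err_free;      /* only the node was allocated */
    }
    if (fail_at == DEMO_FAIL_CLAIM) {
        goto err_close;     /* desc open, io_device registered, no claim */
    }
    if (fail_at == DEMO_FAIL_REGISTER) {
        goto err_release;   /* claim taken as well */
    }
    return 0;

err_release:
    demo_undo('R');         /* release the claim first */
err_close:
    demo_undo('C');         /* close the descriptor */
    demo_undo('L');         /* remove from the global list */
    demo_undo('U');         /* unregister the io_device */
err_free:
    demo_undo('F');         /* free the allocations */
    return -1;
}
```

A failure at the register step leaves "RCLUF" in the log, a failure at the claim step leaves "CLUF", and a failed open leaves just "F".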

6. The channel create/destroy

The framework allocates the channel bytes (we said sizeof(struct pt_io_channel) at spdk_io_device_register time) and hands the buffer to pt_bdev_ch_create_cb. We need to initialize the channel — and the only thing to initialize is the base channel. spdk_bdev_get_io_channel() gets (or creates) the base bdev's per-thread channel, and we cache it. On destroy, we put it back.

The interesting property: the per-thread channel is created lazily. The first time a thread submits to the passthru bdev, the framework calls get_io_channel, which calls spdk_get_io_channel, which calls our create_cb. Subsequent calls on the same thread return the cached channel. The framework handles the refcounting; we just have to make create and destroy symmetric.
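The lazy, refcounted behavior can be modeled in a few lines. This is a sketch under simplifying assumptions: threads are plain integer ids, the "base channel" is an integer, and all demo_ names are hypothetical. The real framework keeps this bookkeeping internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_THREADS 4

/* Stand-in for struct pt_io_channel. */
struct demo_pt_channel {
    int base_ch; /* stand-in for the cached base channel */
};

/* Per-device bookkeeping that the framework would normally own. */
struct demo_io_device {
    struct demo_pt_channel *ch[DEMO_MAX_THREADS];
    int refs[DEMO_MAX_THREADS];
    int creates;
    int destroys;
};

/* First call on a thread creates the channel; later calls reuse it. */
static struct demo_pt_channel *demo_get_channel(struct demo_io_device *dev, int thread_id)
{
    assert(thread_id >= 0 && thread_id < DEMO_MAX_THREADS);
    if (dev->ch[thread_id] == NULL) {
        struct demo_pt_channel *ch = calloc(1, sizeof(*ch)); /* zeroed bytes */

        if (ch == NULL) {
            return NULL;
        }
        ch->base_ch = 100 + thread_id; /* the create callback's job */
        dev->ch[thread_id] = ch;
        dev->creates++;
    }
    dev->refs[thread_id]++;
    return dev->ch[thread_id];
}

/* The last put on a thread destroys the channel. */
static void demo_put_channel(struct demo_io_device *dev, int thread_id)
{
    assert(thread_id >= 0 && thread_id < DEMO_MAX_THREADS);
    assert(dev->refs[thread_id] > 0);
    if (--dev->refs[thread_id] == 0) {
        free(dev->ch[thread_id]); /* the destroy callback's job */
        dev->ch[thread_id] = NULL;
        dev->destroys++;
    }
}
```

Two gets on thread 0 return the same pointer and trigger a single create; the destroy only runs on the second put. That symmetry is the only thing a module's create and destroy callbacks have to preserve.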

7. get_io_channel: the trivial callback

The framework's spdk_bdev_get_io_channel() call lands here with the bdev's ctxt, which is our vbdev_passthru node. We pass the node to spdk_get_io_channel(), which triggers our io_device's create_cb. That's it. The framework wraps this channel in a spdk_bdev_channel and that's what the submit path gets.

8. vbdev_passthru_submit_request: THE hot path

The function you've been waiting for. 82 lines, a switch on I/O type. The pattern: for each I/O type, translate it into a bdev_io on the base bdev, with a completion callback that bridges the base bdev_io back to the original.

Three pointers, one setup. SPDK_CONTAINEROF recovers our vbdev_passthru node from the spdk_bdev pointer that the framework hands us. spdk_io_channel_get_ctx(ch) gets the per-thread state, which is where the cached base channel lives. bdev_io->driver_ctx is the per-IO scratch space the framework reserved for us (we set the size in get_ctx_size). The test = 0x5a assignment is, as the comment in the source explains, a marker: set at submit and verified at completion, just to show that the passthru_bdev_io is round-tripped.
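The container-of step is the least obvious of the three, so here is a minimal sketch of it together with the marker round-trip. DEMO_CONTAINEROF is a local, hypothetical equivalent written for this example, and the demo_ structs are trimmed-down stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local equivalent of a container-of macro: step back from a member to its struct. */
#define DEMO_CONTAINEROF(ptr, type, member) \
    ((type *)((uintptr_t)(ptr) - offsetof(type, member)))

/* Stand-in for struct spdk_bdev. */
struct demo_bdev {
    const char *name;
};

/* Stand-in for struct vbdev_passthru: the bdev is embedded, not pointed to. */
struct demo_pt_node {
    int base_id;
    struct demo_bdev pt_bdev;
};

/* Stand-in for the per-IO context that lives in driver_ctx. */
struct demo_io_ctx {
    uint8_t test;
};

/* Recover the enclosing node from the embedded bdev pointer. */
static struct demo_pt_node *demo_node_from_bdev(struct demo_bdev *bdev)
{
    assert(bdev != NULL);
    return DEMO_CONTAINEROF(bdev, struct demo_pt_node, pt_bdev);
}

/* Set the marker at submit time. */
static void demo_mark_io(struct demo_io_ctx *io_ctx)
{
    io_ctx->test = 0x5a;
}

/* Verify the marker at completion time. */
static int demo_check_io(const struct demo_io_ctx *io_ctx)
{
    return io_ctx->test == 0x5a;
}
```

The macro only works because pt_bdev is embedded inside the node. If the node held a pointer to a separately allocated bdev, there would be nothing to subtract from.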

The switch, in detail

Nine I/O types, all handled, plus a default that completes the I/O as FAILED. The pattern for each non-read type is identical: call a public bdev API on the base bdev, passing the base's desc, the base's per-thread base_ch, the same offset and length, the completion callback _pt_complete_io, and the original bdev_io as the callback's argument.

The READ case is special. The caller might pass NULL in the iov (a "give me a buffer" read), and we need a buffer before we can forward the read to the base. spdk_bdev_io_get_buf() is the framework's "give this bdev_io a buffer" API. When the buffer is ready, the framework calls pt_read_get_buf_cb, which does the actual spdk_bdev_readv_blocks_ext call.

The error path: NOMEM

The bdev_io mempool can run out of entries under load. When that happens, spdk_bdev_writev_blocks_ext returns -ENOMEM instead of allocating a child bdev_io. Passthru doesn't fail the I/O; it queues it for retry. vbdev_passthru_queue_io() uses spdk_bdev_queue_io_wait(), which adds the I/O to a list that the framework will re-fire when an io frees up. This is the back-pressure mechanism in the bdev layer. The alternative — failing the I/O — would lose data on a write or return an error to the user on a read, both of which are worse than just slowing down.
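The queue-and-retry behavior can be modeled with a one-entry pool and a small FIFO. This is a sketch with hypothetical demo_ names and a deliberately tiny pool; in SPDK the wait list lives inside the framework and the retry is driven by spdk_bdev_queue_io_wait().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 1
#define DEMO_MAX_WAITERS 8

/* Stand-in for a parent I/O; 'submitted' counts successful forwards. */
struct demo_io {
    int id;
    int submitted;
};

static int g_pool_free = DEMO_POOL_SIZE;          /* free child I/Os */
static struct demo_io *g_waiters[DEMO_MAX_WAITERS];
static int g_wait_head, g_wait_tail;

/* Stand-in for a base-bdev submit call that needs one child I/O. */
static int demo_base_submit(struct demo_io *io)
{
    if (g_pool_free == 0) {
        return -ENOMEM;
    }
    g_pool_free--;
    io->submitted++;
    return 0;
}

/* Park the I/O until a child is returned to the pool. */
static void demo_queue_io(struct demo_io *io)
{
    assert(g_wait_tail - g_wait_head < DEMO_MAX_WAITERS);
    g_waiters[g_wait_tail++ % DEMO_MAX_WAITERS] = io;
}

/* On -ENOMEM the I/O is queued, not failed. */
static int demo_submit_request(struct demo_io *io)
{
    int rc = demo_base_submit(io);

    if (rc == -ENOMEM) {
        demo_queue_io(io);
        return 0;
    }
    return rc;
}

/* A child completed: return it to the pool, then retry the oldest waiter. */
static void demo_child_done(void)
{
    g_pool_free++;
    if (g_wait_head != g_wait_tail) {
        struct demo_io *io = g_waiters[g_wait_head++ % DEMO_MAX_WAITERS];

        demo_submit_request(io);
    }
}
```

With a pool of one, the second submit is parked rather than failed, and it goes out as soon as the first child completes. The caller only sees added latency.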

9. The per-IO struct: how a request flows through

The "passthru_bdev_io" struct is intentionally tiny:

Three fields, each for a specific reason:

  1. test — sanity check. Set to 0x5a in submit_request, verified in the completion callback. The comment in the source says "just for fun," and it is: the field proves the round-trip bdev_io pointer passed as cb_arg is the same one we got at submit.

  2. ch — the io channel of the original I/O. Needed for the NOMEM queue: when the queue eventually fires vbdev_passthru_resubmit_io, we need to know which channel to call submit_request on. We can't recover it from the bdev_io (it doesn't have a back-pointer to the channel that submitted it).

  3. bdev_io_wait — the wait queue entry. spdk_bdev_queue_io_wait fills this in. When the framework frees a child bdev_io, it walks the wait queue and fires the callback. vbdev_passthru_resubmit_io is that callback.

10. The completion path: _pt_complete_io

The bridge. The base bdev finishes the child bdev_io; this callback runs. We do two things:

  1. spdk_bdev_io_complete_base_io_status(orig_io, bdev_io) — completes the original (parent) bdev_io with the status from the child. The _base_io_status variant copies the status enum (and the error info, like NVMe cdw0 or SCSI sense) from the child to the parent, so the caller of the passthru bdev sees the real error info, not a generic FAILED.

  2. spdk_bdev_free_io(bdev_io) — the child bdev_io came from a mempool; this returns it. If you forget this, you leak one bdev_io per I/O. The bdev_io mempool is large but not infinite, and a slow leak looks exactly like a memory leak in production.

The pattern is the same for every I/O type. The _pt_complete_zcopy_io variant is slightly different because zero-copy needs to set the buffer on the original I/O before completing it. The discipline is the same: complete the original, free the child.
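The bridge can be reduced to a few lines of plain C. In this sketch the demo_ types are hypothetical stand-ins; only the callback shape (child I/O, success flag, cb_arg) mirrors the completion callback signature used by the bdev layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum demo_status { DEMO_PENDING = 0, DEMO_SUCCESS = 1, DEMO_FAILED = -1 };

struct demo_io;
typedef void (*demo_io_cb)(struct demo_io *io, bool success, void *cb_arg);

/* Stand-in for a bdev_io: a status plus the completion callback and its argument. */
struct demo_io {
    enum demo_status status;
    demo_io_cb cb;
    void *cb_arg;
};

static int g_children_live; /* outstanding child I/Os; nonzero at idle means a leak */

/* Allocate a child I/O that remembers the original as cb_arg. */
static struct demo_io *demo_child_alloc(demo_io_cb cb, void *cb_arg)
{
    struct demo_io *child = calloc(1, sizeof(*child));

    if (child == NULL) {
        return NULL;
    }
    child->cb = cb;
    child->cb_arg = cb_arg;
    g_children_live++;
    return child;
}

static void demo_child_free(struct demo_io *child)
{
    assert(child != NULL);
    free(child);
    g_children_live--;
}

/* The bridge: complete the original with the child's status, then free the child. */
static void demo_pt_complete_io(struct demo_io *child, bool success, void *cb_arg)
{
    struct demo_io *orig = cb_arg;

    (void)success; /* the status is copied from the child instead */
    orig->status = child->status;
    demo_child_free(child);
}

/* Stand-in for the base bdev finishing the child I/O. */
static void demo_base_finish(struct demo_io *child, enum demo_status status)
{
    child->status = status;
    child->cb(child, status == DEMO_SUCCESS, child->cb_arg);
}
```

Copying the status before freeing matters: after demo_child_free the child's memory is gone, so the two steps cannot be swapped.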

sequenceDiagram
participant Caller
participant Framework
participant Passthru as passthru submit_request
participant Base as base submit_request
participant Disk
Caller->>Framework: spdk_bdev_write(...)
Framework->>Passthru: submit_request(orig_io)
Passthru->>Base: spdk_bdev_writev_blocks_ext(cb_arg=orig_io)
Base->>Disk: NVMe write
Disk-->>Base: completion
Base->>Framework: _pt_complete_io(new_io, cb_arg=orig_io)
Note over Framework: spdk_bdev_io_complete_base_io_status(orig_io, new_io), then spdk_bdev_free_io(new_io)
Framework->>Caller: user's cb(orig_io)
fig. 2: the bridge pattern. The lifecycle of one write. The original bdev_io is created by the framework for the user; the child bdev_io is created by the framework for passthru to talk to the base. The two bdev_ios are different objects; the cb_arg links them. The bridge callback completes the original and frees the child.

11. The destruct: vbdev_passthru_destruct

The order-matters callback. Read it slowly.

  1. TAILQ_REMOVE(&g_pt_nodes, pt_node, link) — first. Remove from the global list. This way no other code path in passthru (an in-flight RPC, a re-entrant callback) can find this node and use it. The comment "It is important to follow this exact sequence" is serious.

  2. spdk_bdev_module_release_bdev(pt_node->base_bdev) — second. Release the claim. If another module was waiting to claim the base bdev, this lets it. (In practice, the base bdev is also about to be unregistered, so no one is waiting. But the order matters: if you unregister the io_device first, the base bdev might already be gone, and the release could be a no-op or worse.)

  3. spdk_bdev_close(pt_node->base_desc) — third, but on the right thread. spdk_bdev_close() is thread-bound: the desc was opened on pt_node->thread, and it must be closed on the same thread. If the destruct is being called from a different thread (which happens when a hot-remove callback fires asynchronously), we send a message to the opening thread and close on that one. The spdk_thread_send_msg() call is non-blocking; the actual close happens later.

  4. spdk_io_device_unregister(pt_node, _device_unregister_cb) — fourth. The framework will drain any channels that are still open. Each drain calls our pt_bdev_ch_destroy_cb, which puts the cached base channel. The framework then calls _device_unregister_cb with the io_device pointer (which is pt_node), and that's where we free pt_node->pt_bdev.name and pt_node itself.
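The thread-bound close in step 3 is worth a small model. The sketch below uses hypothetical demo_ types, a single-slot mailbox per thread, and a global standing in for "the current thread"; it illustrates the decision only and is not SPDK's messaging code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*demo_msg_fn)(void *ctx);

/* Stand-in for an SPDK thread with a one-message mailbox. */
struct demo_thread {
    int id;
    demo_msg_fn pending_fn;
    void *pending_ctx;
};

/* Stand-in for a bdev descriptor that remembers its opening thread. */
struct demo_desc {
    struct demo_thread *opened_on;
    bool closed;
    int closed_on; /* id of the thread that ran the close */
};

static struct demo_thread *g_current; /* stand-in for "the thread we are running on" */

/* The close must run on the opening thread. */
static void demo_close(void *ctx)
{
    struct demo_desc *desc = ctx;

    assert(g_current == desc->opened_on);
    desc->closed = true;
    desc->closed_on = g_current->id;
}

/* Non-blocking: park the message; the target thread runs it later. */
static void demo_send_msg(struct demo_thread *t, demo_msg_fn fn, void *ctx)
{
    assert(t->pending_fn == NULL);
    t->pending_fn = fn;
    t->pending_ctx = ctx;
}

/* The target thread's event loop picks the message up. */
static void demo_thread_poll(struct demo_thread *t)
{
    demo_msg_fn fn = t->pending_fn;

    g_current = t;
    if (fn != NULL) {
        t->pending_fn = NULL;
        fn(t->pending_ctx);
    }
}

/* The destruct-time decision: close inline, or hand the close to the owner. */
static void demo_destruct_close(struct demo_desc *desc)
{
    if (desc->opened_on != g_current) {
        demo_send_msg(desc->opened_on, demo_close, desc);
    } else {
        demo_close(desc);
    }
}
```

When the destruct runs on a different thread, the descriptor is still open on return; it closes only once the owning thread polls its mailbox. Anything that depends on the close having happened has to account for that delay.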

12. Persistence: vbdev_passthru_config_json

The output of this function is exactly what save_config writes to the JSON file. When you replay that file, the framework calls bdev_passthru_create for each entry, which calls bdev_passthru_create_disk at module/bdev/passthru/vbdev_passthru.c:720, which calls vbdev_passthru_insert_name (to add to g_bdev_names) and vbdev_passthru_register (to build the bdev). The whole pipeline restarts.

The shape of the JSON is: an object with a method field (the RPC name) and a params field (the RPC's parameters). That matches what the RPC handler at module/bdev/passthru/vbdev_passthru_rpc.c:29 expects. Keep these in sync.
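To make the shape concrete, here is a sketch that emits one method/params object per node with snprintf. It is an illustration only: the real function uses SPDK's JSON writer, may emit additional parameters, and handles escaping. The demo_ names are hypothetical, and the sketch assumes names contain no characters that need JSON escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the two names a passthru node is built from. */
struct demo_pt_node {
    const char *base_bdev_name;
    const char *name;
};

/*
 * Writes one JSON object per node, one per line.
 * Returns the number of bytes written, or -1 if the buffer is too small.
 */
static int demo_config_json(const struct demo_pt_node *nodes, size_t count,
                            char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int n = snprintf(buf + used, len - used,
                         "{\"method\":\"bdev_passthru_create\",\"params\":"
                         "{\"base_bdev_name\":\"%s\",\"name\":\"%s\"}}\n",
                         nodes[i].base_bdev_name, nodes[i].name);

        if (n < 0 || (size_t)n >= len - used) {
            return -1; /* output was truncated */
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

The field names inside params are exactly the names the create RPC's decoder looks for, which is why the writer and the handler have to change together.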

13. Hot remove and event handling

The framework calls vbdev_passthru_base_bdev_event_cb when something happens to the base bdev. Today, only SPDK_BDEV_EVENT_REMOVE is implemented: when the base bdev goes away (e.g. an NVMe controller hot-unplug), we walk the global list of passthru bdevs and unregister any whose base matches. The framework then calls our destruct for each, and the teardown happens in the correct order.

TAILQ_FOREACH_SAFE is the safe version of TAILQ_FOREACH: it gives you a tmp pointer that's pre-advanced, so you can safely remove the current entry from the list while iterating. Using TAILQ_FOREACH here would be a bug.
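A minimal sketch of remove-while-iterating follows. It uses the system sys/queue.h and defines TAILQ_FOREACH_SAFE locally when the header does not provide it (glibc's version does not; the BSDs' does). The demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/queue.h>

/* Fallback for headers that lack the _SAFE variant. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar) \
    for ((var) = TAILQ_FIRST(head); \
         (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1); \
         (var) = (tvar))
#endif

/* Stand-in for a passthru node on the global list. */
struct demo_pt_node {
    const char *base_name;
    TAILQ_ENTRY(demo_pt_node) link;
};
TAILQ_HEAD(demo_pt_list, demo_pt_node);

/* Append a node that sits on top of the named base. */
static int demo_add(struct demo_pt_list *list, const char *base_name)
{
    struct demo_pt_node *n = calloc(1, sizeof(*n));

    if (n == NULL) {
        return -1;
    }
    n->base_name = base_name;
    TAILQ_INSERT_TAIL(list, n, link);
    return 0;
}

/* Remove every node whose base matches; returns how many were removed. */
static int demo_remove_by_base(struct demo_pt_list *list, const char *base_name)
{
    struct demo_pt_node *n, *tmp;
    int removed = 0;

    assert(list != NULL && base_name != NULL);
    TAILQ_FOREACH_SAFE(n, list, link, tmp) {
        if (strcmp(n->base_name, base_name) == 0) {
            TAILQ_REMOVE(list, n, link); /* safe: tmp already holds the successor */
            free(n);
            removed++;
        }
    }
    return removed;
}
```

With plain TAILQ_FOREACH, the loop increment would read n's link field after n was freed. The _SAFE variant reads the successor before the loop body runs.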

14. The RPCs: vbdev_passthru_rpc.c companion file

The RPC handlers are in a separate file. There are two of them: bdev_passthru_create and bdev_passthru_delete.

Three pieces: a struct for the parameters, a decoder table, and the handler. The decoder table is the convention: each row says "the JSON field with this name, decoded by this function, lands at this offset in the struct." The framework's spdk_json_decode_object walks the table and fills the struct. The true at the end of the uuid row says "this field is optional" — the framework will leave it zero-initialized if the JSON doesn't have it.
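The decoder-table convention can be sketched without a JSON parser. In this simplified model the input is an array of already-parsed key/value strings, every field is a string (including uuid), and all demo_ names are hypothetical. What carries over is the table itself: name, offset, decode function, optional flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pre-parsed input; real code would walk JSON values. */
struct demo_kv {
    const char *key;
    const char *value;
};

/* Parameter struct filled in by the decoder. */
struct demo_create_params {
    char *base_bdev_name;
    char *name;
    char *uuid;
};

/* One table row: field name, where it lands, how to decode it, whether it may be absent. */
struct demo_decoder {
    const char *name;
    size_t offset;
    int (*decode)(const char *value, void *out);
    bool optional;
};

/* Copy a string value into a char * field. */
static int demo_decode_string(const char *value, void *out)
{
    char **dst = out;
    size_t n = strlen(value) + 1;
    char *copy = malloc(n);

    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, value, n);
    free(*dst);
    *dst = copy;
    return 0;
}

static const struct demo_decoder demo_create_decoders[] = {
    { "base_bdev_name", offsetof(struct demo_create_params, base_bdev_name), demo_decode_string, false },
    { "name", offsetof(struct demo_create_params, name), demo_decode_string, false },
    { "uuid", offsetof(struct demo_create_params, uuid), demo_decode_string, true },
};

static void demo_free_params(struct demo_create_params *p)
{
    free(p->base_bdev_name);
    free(p->name);
    free(p->uuid);
    memset(p, 0, sizeof(*p));
}

/* Walk the table, fill the struct, reject input that lacks a required field. */
static int demo_decode(const struct demo_kv *kv, size_t nkv, struct demo_create_params *out)
{
    size_t ndec = sizeof(demo_create_decoders) / sizeof(demo_create_decoders[0]);

    assert(out != NULL);
    memset(out, 0, sizeof(*out)); /* absent optional fields stay zeroed */
    for (size_t d = 0; d < ndec; d++) {
        const struct demo_decoder *dec = &demo_create_decoders[d];
        void *field = (char *)out + dec->offset;
        bool found = false;

        for (size_t i = 0; i < nkv && !found; i++) {
            if (strcmp(kv[i].key, dec->name) == 0) {
                if (dec->decode(kv[i].value, field) != 0) {
                    demo_free_params(out);
                    return -1;
                }
                found = true;
            }
        }
        if (!found && !dec->optional) {
            demo_free_params(out);
            return -1;
        }
    }
    return 0;
}
```

Adding a parameter then means adding one struct field and one table row; the walking code does not change.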

The handler follows the same five-step pattern as every JSON-RPC handler in SPDK: decode, validate, do the work, build the response, clean up. The SPDK_RPC_REGISTER at the bottom is the constructor that adds the method to the RPC server's list.

The companion bdev_passthru_delete RPC is even simpler: take a name, look up the bdev, unregister it, wait for the framework's callback, and send a success/error response. See module/bdev/passthru/vbdev_passthru_rpc.c:96.

Edge cases: what breaks in passthru

The base bdev doesn't exist yet

spdk_bdev_open_ext returns -ENODEV. The handler in vbdev_passthru_register at module/bdev/passthru/vbdev_passthru.c:628 treats -ENODEV as a soft failure: it frees the half-constructed node but doesn't return an error from the RPC. The trick is in bdev_passthru_create_disk at module/bdev/passthru/vbdev_passthru.c:735: it returns success if -ENODEV happens after the name was added to g_bdev_names, because the framework will call examine_config again when the base bdev shows up. The create RPC is effectively "remember this; create it later."

Two passthru bdevs on the same base

Works: each vbdev_passthru node opens its own base_desc. The claim is shared (only one exclusive writer at a time), but multiple readers are fine. If you want to mix two passthrus that both write to the same base, you'll need to use the newer SPDK_BDEV_CLAIM_READ_MANY_WRITE_SHARED claim type.

Hot remove while I/O is in flight

The framework will close all open descriptors on the base bdev. Each spdk_bdev_writev_blocks_ext in flight is completed (probably as FAILED) by the base bdev module, our bridge runs, the original bdev_io is completed with that failure, and the caller's callback fires. Then the framework calls our destruct, and the teardown happens.

The "double-call" of the channel-destroy

If your channel create callback fails partway (e.g. spdk_bdev_get_io_channel returns NULL), the framework still calls your destroy callback. The destroy in this file is unconditional (spdk_put_io_channel(pt_ch->base_ch)); if base_ch is NULL because the create failed, the put is a no-op. The discipline: always initialize the channel bytes to zero before populating them. spdk_io_device_register does this for you — the framework zeroes the bytes before calling create.

The reclaim race

The destruct releases the claim. If a concurrent examine_config was already in flight and saw the claim, the next claim is fine. If the examine raced and the claim is gone before it gets a chance to test, it just won't claim. The TAILQ_REMOVE at the top of destruct prevents the new examine from finding the node, but it doesn't prevent a base-bdev open that the old node already had. The fix: the open's lifetime is bounded by the spdk_bdev_close in destruct, so the race is closed.

The 0x5a test

The io_ctx->test = 0x5a line is the only "fancy" thing in the per-IO struct. If you see "Error, original IO device_ctx is wrong!" in the log, one of three things happened: a memory corruption (rare), a use-after-free in the bdev_io path (catastrophic), or the test was set by a different module that re-used the bdev_io. The last one is unlikely in production, so this error usually means real corruption.

What to take away

Passthru is the smallest virtual bdev module in the SPDK tree, and once you see how the pieces fit, every other vbdev (lvol, raid, split, gpt, error, delay) is a variation on this template. The bridge pattern in _pt_complete_io is the defining technique: you submit a child bdev_io on the base, with the original as cb_arg, and the bridge callback completes the original and frees the child. The destruct order is the defining footgun: list, claim, desc, io_device, in that order. Get those two patterns right and you can read the rest of the SPDK tree.

The next page covers what you need to do outside the .c file: the Makefile, the modules list, the configure script, the RPC file, and a complete skeleton you can copy to start your own module.