Reading passthru, the right way.
The vbdev_passthru module is 788 lines of C
and the most useful file in the SPDK tree for learning how
bdev modules work. It's a virtual module: it
opens a base bdev, registers itself on top, and forwards
every I/O 1:1. There is no transformation, no caching, no
policy. That makes the code boring in the best possible
way: nothing is hidden behind a clever algorithm. This
page walks the file in the order a real reader should
read it — module struct first, lifecycle second,
construction third, hot path last — and annotates the
why for every non-obvious line.
- File map: 13 things to look at, in order
- The module struct: passthru_if (lines 34-43)
- The struct zoo: vbdev_passthru, pt_io_channel, passthru_bdev_io
- Module init: vbdev_passthru_init (lines 500-503)
- Examine: vbdev_passthru_examine (lines 781-786)
- Register: vbdev_passthru_register (lines 589-717)
- Hot path: vbdev_passthru_submit_request (lines 271-352)
- The per-IO struct: how a request flows through passthru_bdev_io
- Completion: _pt_complete_io and the bridge
- Destruct: vbdev_passthru_destruct (lines 115-140)
- Persistence: vbdev_passthru_config_json
- Hot remove and event handling
- RPCs: vbdev_passthru_rpc.c companion file
- Edge cases: what breaks in passthru
File map: 13 things to look at, in order
The full file is 788 lines. If you read it top to bottom you'll get lost in the per-IO state struct and the NOMEM-queue handling. Here's the order that makes the concepts click:
1. The module struct: passthru_if
Six fields are set; the rest of the struct is zero-initialized.
Three optional fields (init_complete,
fini_start, examine_disk) are
absent, which is allowed. The interesting choices:
- name = "passthru": the module's name in JSON-RPC (bdev_get_bdevs -p passthru), in error logs, in bdev_get_bdevs output. Must be unique.
- examine_config: set, which is what makes this a "virtual" module. The framework calls this for every bdev that gets registered; passthru decides whether to attach itself on top.
- config_json at the module level, not fn_table->write_config_json. Passthru walks a global list of vbdev_passthru nodes and emits one bdev_passthru_create RPC per node. This pattern is right when your module has a global list of bdevs; fn_table->write_config_json is right when each bdev is self-contained.
The SPDK_BDEV_MODULE_REGISTER(passthru, &passthru_if)
line is the magic: it expands to a function marked
__attribute__((constructor)), which runs
before main(). The function calls
spdk_bdev_module_list_add(), which appends the
module to a global list. By the time
spdk_bdev_initialize() runs, the list is
complete and the framework iterates it, calling
module_init on each.
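The registration trick can be tried outside SPDK. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler; demo_module, DEMO_MODULE_REGISTER, and demo_initialize_all are hypothetical stand-ins for the real struct, macro, and framework walk, not SPDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct spdk_bdev_module; only what the sketch needs. */
struct demo_module {
	const char *name;
	int (*module_init)(void);
	TAILQ_ENTRY(demo_module) link;
};

/* Global module list, filled in before main() starts. */
static TAILQ_HEAD(, demo_module) g_demo_modules =
	TAILQ_HEAD_INITIALIZER(g_demo_modules);

static void
demo_module_list_add(struct demo_module *m)
{
	assert(m != NULL && m->name != NULL);
	TAILQ_INSERT_TAIL(&g_demo_modules, m, link);
}

/* Expands to a constructor function, so registration happens before main(). */
#define DEMO_MODULE_REGISTER(tag, module_ptr) \
	static void __attribute__((constructor)) demo_register_##tag(void) \
	{ \
		demo_module_list_add(module_ptr); \
	}

static int g_init_calls;

static int
demo_passthru_init(void)
{
	g_init_calls++;
	return 0;
}

static struct demo_module demo_passthru_if = {
	.name = "passthru",
	.module_init = demo_passthru_init,
};
DEMO_MODULE_REGISTER(passthru, &demo_passthru_if)

/* The later framework step: walk the list, call module_init on each entry.
 * Returns the number of modules initialized, or -1 on the first failure. */
static int
demo_initialize_all(void)
{
	struct demo_module *m;
	int count = 0;

	TAILQ_FOREACH(m, &g_demo_modules, link) {
		if (m->module_init() != 0) {
			return -1;
		}
		count++;
	}
	return count;
}
```

Calling demo_initialize_all() from main() finds the module already on the list, even though no code in main() ever added it.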
2. The struct zoo
Before reading further, internalize three structs. They show up throughout the file and the relationship between them is the whole architecture.
| Struct | Lives in | Represents |
|---|---|---|
| struct vbdev_passthru | One per passthru bdev, on g_pt_nodes | A single passthru bdev and its relationship to a base bdev |
| struct pt_io_channel | One per (thread, passthru bdev) pair | Per-thread state: here, just the cached base channel |
| struct passthru_bdev_io | One per I/O in flight | Per-IO scratch space; here, just a marker byte and a queue entry |
Three structs, three lifetimes:
vbdev_passthru lives for the life of the
bdev; pt_io_channel is created and destroyed
by the framework on a per-thread basis;
passthru_bdev_io is recycled from a mempool
with every I/O. Confusing them is the source of most
passthru-shaped bugs.
flowchart LR
  g[vbdev_passthru g_pt_nodes] --> n1[vbdev_passthru node 1]
  g --> n2[vbdev_passthru node 2]
  n1 -->|"spdk_bdev pt_bdev"| bdev1[pt_bdev A]
  n1 -->|base_bdev + base_desc| base1[base Malloc0]
  n2 -->|"spdk_bdev pt_bdev"| bdev2[pt_bdev B]
  n2 -->|base_bdev + base_desc| base2[base Malloc1]
  bdev1 --> fn[vbdev_passthru_fn_table]
  bdev2 --> fn
  fn --> sr[vbdev_passthru_submit_request]
  sr --> io[passthru_bdev_io per I/O]
  sr --> ch1[pt_io_channel on thread 1]
  sr --> ch2[pt_io_channel on thread 2]
fig. 1 Two passthru bdevs share one
vbdev_passthru_fn_table but each has its own
vbdev_passthru node and its own set of
per-thread channels. Each I/O carries a
passthru_bdev_io for the duration of the
operation.
3. Module init: vbdev_passthru_init
Empty. Why? Because passthru has no module-wide state to
set up. The malloc module uses module_init to
call spdk_io_device_register() for its
module-wide io_device, but passthru registers one io_device
per passthru bdev (in vbdev_passthru_register).
That happens during the examine path, not at module init.
The corresponding module_fini is similarly
trivial. It is the cleanup of g_bdev_names, a list
of (base_bdev_name, vbdev_name) tuples the module
accumulated during config-file parsing. The actual
vbdev_passthru nodes are gone by this point:
they've been freed in their destruct callbacks, which run
before module_fini. What remains is the names
list, which keeps references to bdevs that may never have
shown up (because their base bdev never appeared). Always
free what you allocated; this is what that looks like.
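The cleanup described above can be sketched with a plain TAILQ. This is a self-contained illustration, not the module's code: demo_names, demo_insert_name, and demo_finish are hypothetical names, and the field layout is an assumption modeled on the (base_bdev_name, vbdev_name) tuples mentioned in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/queue.h>

/* Stand-in for one (base bdev name, vbdev name) pair; field names are assumptions. */
struct demo_names {
	char *vbdev_name;
	char *bdev_name;
	TAILQ_ENTRY(demo_names) link;
};

static TAILQ_HEAD(, demo_names) g_demo_names =
	TAILQ_HEAD_INITIALIZER(g_demo_names);

/* Portable string duplicate so the sketch does not depend on strdup(). */
static char *
demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy != NULL) {
		memcpy(copy, s, len);
	}
	return copy;
}

/* Remember a pair for later examine calls. Returns 0 or -ENOMEM. */
static int
demo_insert_name(const char *bdev_name, const char *vbdev_name)
{
	struct demo_names *n = calloc(1, sizeof(*n));

	if (n == NULL) {
		return -ENOMEM;
	}
	n->bdev_name = demo_strdup(bdev_name);
	n->vbdev_name = demo_strdup(vbdev_name);
	if (n->bdev_name == NULL || n->vbdev_name == NULL) {
		free(n->bdev_name);
		free(n->vbdev_name);
		free(n);
		return -ENOMEM;
	}
	TAILQ_INSERT_TAIL(&g_demo_names, n, link);
	return 0;
}

/* module_fini-style cleanup: pop entries until the list is empty.
 * Returns how many entries were freed. */
static int
demo_finish(void)
{
	struct demo_names *n;
	int freed = 0;

	while ((n = TAILQ_FIRST(&g_demo_names)) != NULL) {
		TAILQ_REMOVE(&g_demo_names, n, link);
		free(n->vbdev_name);
		free(n->bdev_name);
		free(n);
		freed++;
	}
	assert(TAILQ_EMPTY(&g_demo_names));
	return freed;
}
```

Popping from the head until the list is empty avoids iterating over a list that is being modified, and a second call is harmless because the list is already empty.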
4. Examine: vbdev_passthru_examine
The framework calls this for every bdev that gets
registered, once per module that has examine_config
set. Passthru uses it to ask: "is this bdev in my config
file's vbdev_names list? if so, build a
passthru on top of it." Then it calls
spdk_bdev_module_examine_done(), which
tells the framework "I'm done with this bdev; you can
call the next module's examine."
5. Register: vbdev_passthru_register
The big one. 130 lines that build a passthru bdev from a base bdev name. Read in five passes:
Pass 1: walk the names list (lines 603-606)
The list of (base, vbdev) pairs comes from the SPDK
config file (parsed by the app framework) or from an
earlier bdev_passthru_create RPC. The
comparison is exact-match: if bdev_name
matches one of the base_bdev_names in the
list, build a passthru on top of it. If not, exit
silently — examine is called for every bdev, and
passthru is only interested in the ones it was told to
care about.
Pass 2: allocate and open (lines 609-639)
Four allocations, each with a clean failure path:
pt_node, the name string, the base bdev
descriptor, and the bdev lookup. The
descriptor is opened with write=true
(second argument) so passthru can forward writes, and
the event callback vbdev_passthru_base_bdev_event_cb
is what the framework calls when the base bdev
disappears. The -ENODEV error is handled
specially: the base bdev isn't there yet, which is
fine — we'll try again next time a bdev shows up.
Pass 3: copy base bdev properties (lines 657-672)
A passthru bdev has the same geometry as its base: same
block size, same block count, same DIF/DIX metadata
layout, same NUMA node. If you forget to copy
required_alignment, your passthru bdev will
misreport its alignment requirements and the framework
will (or won't) double-buffer, depending on whether the
caller's iov happens to be aligned. Always copy.
Pass 4: register the io_device (lines 676-687)
Three critical lines:
- pt_node->pt_bdev.ctxt = pt_node: every callback the framework makes to the bdev (submit_request, destruct, etc.) gets the ctxt. By pointing it at our vbdev_passthru node, we can find our state from any callback. This is the instance lookup in the "class vs object" distinction.
- pt_node->pt_bdev.fn_table = &vbdev_passthru_fn_table: every bdev of type passthru shares this vtable. The submit_request pointer is the same for all passthru bdevs, but the ctxt is different, so the callback knows which bdev it's serving.
- spdk_io_device_register(pt_node, ...): the magic that makes get_io_channel work. Each call to spdk_get_io_channel(pt_node) from a thread will allocate (or look up) a pt_io_channel for that thread. We capture the current thread because the base bdev desc was opened on it, and spdk_bdev_close() must run on the same thread.
Pass 5: claim and register (lines 689-712)
spdk_bdev_module_claim_bdev() says "this
module is now the exclusive writer on this base bdev, so
no other vbdev on top of it will write." It's a marker
that prevents weird shared-write races. The error path
undoes every step: close the desc, remove from the
global list, unregister the io_device, free the
allocations.
spdk_bdev_register() makes the bdev visible
to the framework. After this call, other modules (e.g.
nvmf) can open pt_bdev. The error path is
similar but adds
spdk_bdev_module_release_bdev() to undo the
claim. Note the order: release the claim first, then
close the desc. Get the order wrong and you have
a window where the base bdev can't be claimed by
another module but the desc isn't closed yet.
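The unwinding described in this pass follows the usual C goto-cleanup shape. The sketch below records step names in a trace string instead of calling SPDK, so the undo order for each failure point can be inspected. demo_register, the demo_fail values, and the specific error codes are illustrative assumptions; the step order follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Which step of the simulated registration should fail. */
enum demo_fail { FAIL_NONE, FAIL_OPEN, FAIL_CLAIM, FAIL_REGISTER };

/* Comma-separated record of the steps taken and undone. */
static char g_trace[128];

static void
demo_trace(const char *what)
{
	size_t used = strlen(g_trace);

	assert(used + strlen(what) + 2 <= sizeof(g_trace));
	if (used > 0) {
		strcat(g_trace, ",");
	}
	strcat(g_trace, what);
}

/* Simulated register: each successful step is traced, and a failure
 * jumps to the label that undoes everything acquired so far. */
static int
demo_register(enum demo_fail fail)
{
	int rc = 0;

	g_trace[0] = '\0';
	demo_trace("alloc");

	if (fail == FAIL_OPEN) {
		rc = -ENODEV;           /* base bdev not there yet */
		goto err_free;
	}
	demo_trace("open");
	demo_trace("list_insert");
	demo_trace("io_device_register");

	if (fail == FAIL_CLAIM) {
		rc = -EPERM;            /* illustrative error code */
		goto err_close;
	}
	demo_trace("claim");

	if (fail == FAIL_REGISTER) {
		rc = -EEXIST;           /* illustrative error code */
		goto err_release;
	}
	demo_trace("bdev_register");
	return 0;

err_release:
	demo_trace("release_claim");    /* release the claim first ... */
err_close:
	demo_trace("close");            /* ... then close the desc */
	demo_trace("list_remove");
	demo_trace("io_device_unregister");
err_free:
	demo_trace("free");
	return rc;
}
```

Because the labels fall through, a later failure undoes a superset of what an earlier failure undoes, and a new step only needs one new label.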
6. The channel create/destroy
The framework allocates the channel bytes (we said
sizeof(struct pt_io_channel) at
spdk_io_device_register time) and hands the
buffer to pt_bdev_ch_create_cb. We need
to initialize the channel — and the only thing to
initialize is the base channel. spdk_bdev_get_io_channel()
gets (or creates) the base bdev's per-thread channel,
and we cache it. On destroy, we put it back.
The interesting property: the per-thread channel is
created lazily. The first time a thread submits to the
passthru bdev, the framework calls
get_io_channel, which calls
spdk_get_io_channel, which calls our
create_cb. Subsequent calls on the same thread return
the cached channel. The framework handles the
refcounting; we just have to make create and destroy
symmetric.
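The lazy, refcounted behavior can be modeled in a few lines. This is a toy model under loud assumptions: threads are plain integer ids, the channel table is a fixed array, and demo_get_channel / demo_put_channel are hypothetical stand-ins that only mimic the shape of the get/put pair described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_THREADS 4

typedef int (*demo_create_cb)(void *io_device, void *ctx_buf);
typedef void (*demo_destroy_cb)(void *io_device, void *ctx_buf);

/* One slot per thread: a reference count and the module's per-thread state. */
struct demo_channel {
	int ref;
	void *ctx;
};

struct demo_io_device {
	void *io_device;
	demo_create_cb create_cb;
	demo_destroy_cb destroy_cb;
	size_t ctx_size;
	struct demo_channel ch[DEMO_MAX_THREADS];
};

/* Get the channel for thread_id, creating it on first use. NULL on failure. */
static void *
demo_get_channel(struct demo_io_device *dev, int thread_id)
{
	struct demo_channel *ch;

	assert(thread_id >= 0 && thread_id < DEMO_MAX_THREADS);
	ch = &dev->ch[thread_id];
	if (ch->ref == 0) {
		ch->ctx = calloc(1, dev->ctx_size);   /* zeroed before create runs */
		if (ch->ctx == NULL) {
			return NULL;
		}
		if (dev->create_cb(dev->io_device, ch->ctx) != 0) {
			free(ch->ctx);
			ch->ctx = NULL;
			return NULL;
		}
	}
	ch->ref++;
	return ch->ctx;
}

/* Drop one reference; the last put runs the destroy callback. */
static void
demo_put_channel(struct demo_io_device *dev, int thread_id)
{
	struct demo_channel *ch;

	assert(thread_id >= 0 && thread_id < DEMO_MAX_THREADS);
	ch = &dev->ch[thread_id];
	assert(ch->ref > 0);
	if (--ch->ref == 0) {
		dev->destroy_cb(dev->io_device, ch->ctx);
		free(ch->ctx);
		ch->ctx = NULL;
	}
}

/* Passthru-like per-thread state: just a flag for the cached base channel. */
struct demo_pt_channel {
	int base_ch_held;
};

static int g_creates;
static int g_destroys;

static int
demo_pt_create(void *io_device, void *ctx_buf)
{
	struct demo_pt_channel *pt_ch = ctx_buf;

	(void)io_device;
	pt_ch->base_ch_held = 1;    /* "get" the base channel */
	g_creates++;
	return 0;
}

static void
demo_pt_destroy(void *io_device, void *ctx_buf)
{
	struct demo_pt_channel *pt_ch = ctx_buf;

	(void)io_device;
	pt_ch->base_ch_held = 0;    /* "put" the base channel */
	g_destroys++;
}
```

Two gets on the same thread id return the same context and run create once; destroy runs only when the matching number of puts has been made.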
7. get_io_channel: the trivial callback
The framework's spdk_bdev_get_io_channel()
call lands here with the bdev's ctxt, which
is our vbdev_passthru node. We pass the
node to spdk_get_io_channel(), which
triggers our io_device's create_cb. That's it. The
framework wraps this channel in a
spdk_bdev_channel and that's what the
submit path gets.
8. vbdev_passthru_submit_request: THE hot path
The function you've been waiting for. 82 lines, a switch on I/O type. The pattern: for each I/O type, translate it into a bdev_io on the base bdev, with a completion callback that bridges the base bdev_io back to the original.
Three pointers, one setup. The
SPDK_CONTAINEROF recovers our
vbdev_passthru node from the
spdk_bdev pointer that the framework hands
us. spdk_io_channel_get_ctx(ch) gets the
per-thread state, which is where the cached base
channel lives. bdev_io->driver_ctx is
the per-IO scratch space the framework reserved for us
(we set the size in get_ctx_size). The
test = 0x5a line is, per the comment in the
source, a marker: set at submit and verified at
completion, just to show that the
passthru_bdev_io is round-tripped.
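The container-of step is plain pointer arithmetic. Here is a minimal sketch of the idea; DEMO_CONTAINEROF, demo_bdev, and demo_passthru are hypothetical names that mirror the roles of SPDK_CONTAINEROF, spdk_bdev, and vbdev_passthru without reproducing their definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step back from a pointer to a member to the struct that contains it. */
#define DEMO_CONTAINEROF(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))

/* Stand-in for the framework's bdev object. */
struct demo_bdev {
	const char *name;
	void *ctxt;
};

/* Stand-in for the module's node: the bdev is embedded, not pointed to. */
struct demo_passthru {
	int id;
	struct demo_bdev pt_bdev;
};

/* Recover the node from the embedded bdev the framework hands back. */
static struct demo_passthru *
demo_node_from_bdev(struct demo_bdev *bdev)
{
	assert(bdev != NULL);
	return DEMO_CONTAINEROF(bdev, struct demo_passthru, pt_bdev);
}
```

This only works because pt_bdev is embedded by value in the node; if the node held a pointer to a separately allocated bdev, the subtraction would land on unrelated memory.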
The switch, in detail
Nine I/O types, all handled, plus a default that
completes the I/O as FAILED. The pattern for each
non-read type is identical: call a public bdev API on
the base bdev, passing the base's desc, the
base's per-thread base_ch, the same offset
and length, the completion callback
_pt_complete_io, and the original
bdev_io as the callback's argument.
The READ case is special. The caller might pass
NULL in the iov (a "give me a buffer"
read), and we need a buffer before we can forward the
read to the base. spdk_bdev_io_get_buf()
is the framework's "give this bdev_io a buffer" API.
When the buffer is ready, the framework calls
pt_read_get_buf_cb, which does the actual
spdk_bdev_readv_blocks_ext call.
The error path: NOMEM
The bdev_io mempool can run out of entries under load.
When that happens, spdk_bdev_writev_blocks_ext
returns -ENOMEM instead of allocating a
child bdev_io. Passthru doesn't fail the I/O; it queues
it for retry. vbdev_passthru_queue_io()
uses spdk_bdev_queue_io_wait(), which
adds the I/O to a list that the framework will re-fire
when an io frees up. This is the back-pressure
mechanism in the bdev layer. The alternative — failing
the I/O — would lose data on a write or return an
error to the user on a read, both of which are worse
than just slowing down.
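The queue-and-retry idea can be shown with a pool of size one. This is a toy model, not the bdev layer: g_pool_free, demo_base_write, and demo_free_child are hypothetical, and the real wait mechanism has more bookkeeping than a single list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/queue.h>

/* A wait-queue entry: what to call, and with what, once a child frees up. */
struct demo_wait_entry {
	void (*cb_fn)(void *cb_arg);
	void *cb_arg;
	TAILQ_ENTRY(demo_wait_entry) link;
};

static int g_pool_free = 1;   /* child I/O objects currently available */
static TAILQ_HEAD(, demo_wait_entry) g_waiters =
	TAILQ_HEAD_INITIALIZER(g_waiters);

/* Per-I/O state; the wait entry lives in the I/O's own scratch space. */
struct demo_io {
	int submitted;
	struct demo_wait_entry wait;
};

/* Forward to the "base": takes one child from the pool or reports -ENOMEM. */
static int
demo_base_write(void)
{
	if (g_pool_free == 0) {
		return -ENOMEM;
	}
	g_pool_free--;
	return 0;
}

static void demo_submit(struct demo_io *io);

/* Wait-queue callback: simply try the submit again. */
static void
demo_resubmit(void *cb_arg)
{
	demo_submit(cb_arg);
}

static void
demo_submit(struct demo_io *io)
{
	int rc = demo_base_write();

	if (rc == -ENOMEM) {
		/* Do not fail the I/O; park it until a child is returned. */
		io->wait.cb_fn = demo_resubmit;
		io->wait.cb_arg = io;
		TAILQ_INSERT_TAIL(&g_waiters, &io->wait, link);
		return;
	}
	assert(rc == 0);
	io->submitted++;
}

/* Returning a child to the pool wakes the oldest waiter, if any. */
static void
demo_free_child(void)
{
	struct demo_wait_entry *w = TAILQ_FIRST(&g_waiters);

	g_pool_free++;
	if (w != NULL) {
		TAILQ_REMOVE(&g_waiters, w, link);
		w->cb_fn(w->cb_arg);
	}
}
```

Because the wait entry is embedded in the per-I/O struct, queuing never allocates, which matters when the trigger for queuing is an allocation failure.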
9. The per-IO struct: how a request flows through
The "passthru_bdev_io" struct is intentionally tiny:
Three fields, each for a specific reason:
- test: sanity check. Set to 0x5a in submit_request, verified in the completion callback. The comment in the source says "just for fun," and it is: the field proves the round-trip bdev_io pointer passed as cb_arg is the same one we got at submit.
- ch: the io channel of the original I/O. Needed for the NOMEM queue: when the queue eventually fires vbdev_passthru_resubmit_io, we need to know which channel to call submit_request on. We can't recover it from the bdev_io (it doesn't have a back-pointer to the channel that submitted it).
- bdev_io_wait: the wait queue entry. spdk_bdev_queue_io_wait fills this in. When the framework frees a child bdev_io, it walks the wait queue and fires the callback. vbdev_passthru_resubmit_io is that callback.
10. The completion path: _pt_complete_io
The bridge. The base bdev finishes the child bdev_io; this callback runs. We do two things:
- spdk_bdev_io_complete_base_io_status(orig_io, bdev_io): completes the original (parent) bdev_io with the status from the child. The _base_io_status variant copies the status enum (and the error info, like NVMe cdw0 or SCSI sense) from the child to the parent, so the caller of the passthru bdev sees the real error info, not a generic FAILED.
- spdk_bdev_free_io(bdev_io): the child bdev_io came from a mempool; this returns it. If you forget this, you leak one bdev_io per I/O. The bdev_io mempool is large but not infinite, and a slow leak looks exactly like a memory leak in production.
The pattern is the same for every I/O type. The
_pt_complete_zcopy_io variant is slightly
different because zero-copy needs to set the buffer on
the original I/O before completing it. The discipline is
the same: complete the original, free the child.
sequenceDiagram
  participant Caller
  participant Framework
  participant Passthru as passthru submit_request
  participant Base as base submit_request
  participant Disk
  Caller->>Framework: spdk_bdev_write(...)
  Framework->>Passthru: submit_request(orig_io)
  Passthru->>Base: spdk_bdev_writev_blocks_ext(cb_arg=orig_io)
  Base->>Disk: NVMe write
  Disk-->>Base: completion
  Base->>Framework: _pt_complete_io(new_io, cb_arg=orig_io)
  Note over Framework: spdk_bdev_io_complete_base_io_status(orig_io, new_io), then spdk_bdev_free_io(new_io)
  Framework->>Caller: user's cb(orig_io)
fig. 2 The lifecycle of one write. The original
bdev_io is created by the framework for the user; the
child bdev_io is created by the framework for passthru
to talk to the base. The two bdev_ios are different
objects; the cb_arg links them. The bridge
callback completes the original and frees the child.
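The bridge can be reduced to two objects and one callback. The sketch below is a simplified model with hypothetical types (demo_bdev_io, demo_status) and a heap allocation standing in for the mempool; it shows only the linkage through cb_arg and the complete-then-free discipline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum demo_status { DEMO_PENDING, DEMO_SUCCESS, DEMO_FAILED };

struct demo_bdev_io;
typedef void (*demo_io_cb)(struct demo_bdev_io *io, bool success, void *cb_arg);

/* One type plays both roles here: the original I/O and the child I/O. */
struct demo_bdev_io {
	enum demo_status status;
	demo_io_cb cb;
	void *cb_arg;
	uint8_t test;        /* round-trip marker, set on the original */
};

static int g_children_live;  /* children allocated but not yet freed */
static int g_marker_errors;  /* times the marker check failed */

/* Bridge: complete the original with the child's status, then free the child. */
static void
demo_complete_io(struct demo_bdev_io *child, bool success, void *cb_arg)
{
	struct demo_bdev_io *orig = cb_arg;

	(void)success;
	assert(orig != NULL);
	if (orig->test != 0x5a) {
		g_marker_errors++;
	}
	orig->status = child->status;
	free(child);
	g_children_live--;
}

/* Submit: stamp the marker and create a child whose cb_arg is the original. */
static struct demo_bdev_io *
demo_submit(struct demo_bdev_io *orig)
{
	struct demo_bdev_io *child = calloc(1, sizeof(*child));

	if (child == NULL) {
		return NULL;
	}
	orig->test = 0x5a;
	child->cb = demo_complete_io;
	child->cb_arg = orig;
	g_children_live++;
	return child;
}

/* What the base does when the device finishes: set status, fire the callback. */
static void
demo_base_finish(struct demo_bdev_io *child, bool success)
{
	child->status = success ? DEMO_SUCCESS : DEMO_FAILED;
	child->cb(child, success, child->cb_arg);
}
```

After demo_base_finish() returns, the child pointer is dangling by design; only the original I/O survives the callback.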
11. The destruct: vbdev_passthru_destruct
The order-matters callback. Read it slowly.
- TAILQ_REMOVE(&g_pt_nodes, pt_node, link): first. Remove from the global list. This way no other code path in passthru (an in-flight RPC, a re-entrant callback) can find this node and use it. The comment "It is important to follow this exact sequence" is serious.
- spdk_bdev_module_release_bdev(pt_node->base_bdev): second. Release the claim. If another module was waiting to claim the base bdev, this lets it. (In practice, the base bdev is also about to be unregistered, so no one is waiting. But the order matters: if you unregister the io_device first, the base bdev might already be gone, and the release could be a no-op or worse.)
- spdk_bdev_close(pt_node->base_desc): third, but on the right thread. spdk_bdev_close() is thread-bound: the desc was opened on pt_node->thread, and it must be closed on the same thread. If the destruct is being called from a different thread (which happens when a hot-remove callback fires asynchronously), we send a message to the opening thread and close on that one. The spdk_thread_send_msg() call is non-blocking; the actual close happens later.
- spdk_io_device_unregister(pt_node, _device_unregister_cb): fourth. The framework will drain any channels that are still open. Each drain calls our pt_bdev_ch_destroy_cb, which puts the cached base channel. The framework then calls _device_unregister_cb with the io_device pointer (which is pt_node), and that's where we free pt_node->pt_bdev.name and pt_node itself.
12. Persistence: vbdev_passthru_config_json
The output of this function is exactly what
save_config writes to the JSON file. When
you replay that file, the framework calls
bdev_passthru_create for each entry, which
calls bdev_passthru_create_disk at
module/bdev/passthru/vbdev_passthru.c:720,
which calls vbdev_passthru_insert_name (to
add to g_bdev_names) and
vbdev_passthru_register (to build the
bdev). The whole pipeline restarts.
The shape of the JSON is: an object with a
method field (the RPC name) and a
params field (the RPC's parameters). That
matches what the RPC handler expects. Keep these in sync.
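To make the shape concrete, here is a small sketch that formats one such entry with snprintf. The parameter keys base_bdev_name and name are assumptions chosen for illustration; check them against the RPC's decoder table. demo_config_entry is a hypothetical helper, and it does no JSON escaping, which a real writer would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format one save_config-style entry into buf.
 * Returns 0 on success, -1 if the output did not fit.
 * Assumption: names contain no characters that need JSON escaping. */
static int
demo_config_entry(char *buf, size_t len,
		  const char *base_bdev_name, const char *name)
{
	int n;

	assert(buf != NULL && base_bdev_name != NULL && name != NULL);
	n = snprintf(buf, len,
		     "{\"method\":\"bdev_passthru_create\","
		     "\"params\":{\"base_bdev_name\":\"%s\",\"name\":\"%s\"}}",
		     base_bdev_name, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The point of the round trip is that whatever keys the writer emits are exactly the keys the create handler decodes, so a change on one side has to be mirrored on the other.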
13. Hot remove and event handling
The framework calls vbdev_passthru_base_bdev_event_cb
when something happens to the base bdev. Today, only
SPDK_BDEV_EVENT_REMOVE is implemented:
when the base bdev goes away (e.g. an NVMe controller
hot-unplug), we walk the global list of passthru bdevs
and unregister any whose base matches. The framework
then calls our destruct for each, and the
teardown happens in the correct order.
TAILQ_FOREACH_SAFE is the safe version
of TAILQ_FOREACH: it gives you a
tmp pointer that's pre-advanced, so you
can safely remove the current entry from the list
while iterating. Using TAILQ_FOREACH here
would be a bug.
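The difference is easy to demonstrate on a plain list. The sketch below removes matching entries directly while iterating, which is a simplification of the flow described above (there, removal happens later in destruct). Some libc versions of sys/queue.h, glibc's among them, do not provide the _SAFE variants, so the sketch defines one when it is missing; demo_node and its helpers are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/queue.h>

/* Fallback for headers that lack the _SAFE iteration macros. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar) \
	for ((var) = TAILQ_FIRST((head)); \
	     (var) && ((tvar) = TAILQ_NEXT((var), field), 1); \
	     (var) = (tvar))
#endif

/* Stand-in for a passthru node: only the base name it is stacked on. */
struct demo_node {
	const char *base_name;
	TAILQ_ENTRY(demo_node) link;
};

static TAILQ_HEAD(, demo_node) g_demo_nodes =
	TAILQ_HEAD_INITIALIZER(g_demo_nodes);

static int
demo_add_node(const char *base_name)
{
	struct demo_node *n = calloc(1, sizeof(*n));

	if (n == NULL) {
		return -1;
	}
	n->base_name = base_name;
	TAILQ_INSERT_TAIL(&g_demo_nodes, n, link);
	return 0;
}

/* Remove every node stacked on base_name; returns how many were removed. */
static int
demo_base_removed(const char *base_name)
{
	struct demo_node *n, *tmp;
	int removed = 0;

	assert(base_name != NULL);
	TAILQ_FOREACH_SAFE(n, &g_demo_nodes, link, tmp) {
		if (strcmp(n->base_name, base_name) == 0) {
			TAILQ_REMOVE(&g_demo_nodes, n, link);
			free(n);   /* n is gone; tmp already holds the next entry */
			removed++;
		}
	}
	return removed;
}
```

With plain TAILQ_FOREACH, the loop's advance step would read the link field of the node that was just freed.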
14. The RPCs: vbdev_passthru_rpc.c companion file
The RPC handlers are in a separate file. There are two
of them: bdev_passthru_create and
bdev_passthru_delete.
Three pieces: a struct for the parameters, a decoder
table, and the handler. The decoder table is the
convention: each row says "the JSON field with this
name, decoded by this function, lands at this offset
in the struct." The framework's
spdk_json_decode_object walks the table
and fills the struct. The true at the end
of the uuid row says "this field is
optional" — the framework will leave it zero-initialized
if the JSON doesn't have it.
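The table-driven idea can be sketched without a JSON parser by treating the input as already-split key/value pairs. Everything here is a hypothetical miniature (demo_decoder, demo_decode_object, demo_create_params), not the SPDK JSON API; unlike a strict decoder, this sketch silently ignores unknown keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One already-parsed JSON member, reduced to strings for the sketch. */
struct demo_kv {
	const char *key;
	const char *value;
};

typedef int (*demo_decode_fn)(const char *value, void *out);

/* One row: field name, where it lands in the struct, how to decode it. */
struct demo_decoder {
	const char *name;
	size_t offset;
	demo_decode_fn decode;
	bool optional;
};

/* Parameter struct the rows point into; field names are assumptions. */
struct demo_create_params {
	const char *base_bdev_name;
	const char *name;
	const char *uuid;
};

static int
demo_decode_string(const char *value, void *out)
{
	*(const char **)out = value;
	return 0;
}

static const struct demo_decoder demo_create_decoders[] = {
	{"base_bdev_name", offsetof(struct demo_create_params, base_bdev_name), demo_decode_string, false},
	{"name", offsetof(struct demo_create_params, name), demo_decode_string, false},
	{"uuid", offsetof(struct demo_create_params, uuid), demo_decode_string, true},
};

/* Walk the table; for each row, find the matching key and decode it into
 * out + offset. A missing key is an error unless the row is optional. */
static int
demo_decode_object(const struct demo_kv *kv, size_t nkv,
		   const struct demo_decoder *dec, size_t ndec, void *out)
{
	size_t i, j;

	assert(out != NULL);
	for (i = 0; i < ndec; i++) {
		bool found = false;

		for (j = 0; j < nkv; j++) {
			if (strcmp(kv[j].key, dec[i].name) == 0) {
				if (dec[i].decode(kv[j].value, (char *)out + dec[i].offset) != 0) {
					return -1;
				}
				found = true;
				break;
			}
		}
		if (!found && !dec[i].optional) {
			return -1;
		}
	}
	return 0;
}
```

The struct must be zero-initialized before decoding, since an absent optional field is simply never written.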
The handler follows the same five-step pattern as
every JSON-RPC handler in SPDK: decode, validate,
do the work, build the response, clean up. The
SPDK_RPC_REGISTER at the bottom is the
constructor that adds the method to the RPC server's
list.
The companion bdev_passthru_delete RPC is
even simpler: take a name, look up the bdev,
unregister it, wait for the framework's callback, and
send a success/error response. See
module/bdev/passthru/vbdev_passthru_rpc.c:96.
Edge cases: what breaks in passthru
The base bdev doesn't exist yet
spdk_bdev_open_ext returns
-ENODEV. The handler in
vbdev_passthru_register
treats -ENODEV as a soft failure: it
frees the half-constructed node but doesn't return
an error from the RPC. The trick is in
bdev_passthru_create_disk at
module/bdev/passthru/vbdev_passthru.c:735:
it returns success if -ENODEV happens
after the name was added to g_bdev_names,
because the framework will call examine_config
again when the base bdev shows up. The create RPC is
effectively "remember this; create it later."
Two passthru bdevs on the same base
Works: each vbdev_passthru node opens its
own base_desc. The claim is shared (only
one exclusive writer at a time), but multiple readers
are fine. If you want to mix two passthrus that both
write to the same base, you'll need to use the newer
SPDK_BDEV_CLAIM_READ_MANY_WRITE_SHARED
claim type.
Hot remove while I/O is in flight
The framework will close all open descriptors on the
base bdev. Each spdk_bdev_writev_blocks_ext
in flight is completed (probably as FAILED) by the
base bdev module, our bridge runs, the original
bdev_io is completed with that failure, and the
caller's callback fires. Then the framework calls
our destruct, and the teardown happens.
The "double-call" of the channel-destroy
If your channel create callback fails partway (e.g.
spdk_bdev_get_io_channel returns NULL),
the framework still calls your destroy callback. The
destroy in this file is unconditional
(spdk_put_io_channel(pt_ch->base_ch));
if base_ch is NULL because the create
failed, the put is a no-op. The discipline: always
initialize the channel bytes to zero before
populating them. spdk_io_device_register
does this for you — the framework zeroes the bytes
before calling create.
The reclaim race
The destruct releases the claim. If a concurrent
examine_config was already in flight and
saw the claim, the next claim is fine. If the examine
raced and the claim is gone before it gets a chance to
test, it just won't claim. The TAILQ_REMOVE
at the top of destruct prevents the new examine from
finding the node, but it doesn't prevent a base-bdev
open that the old node already had. The fix: the
open's lifetime is bounded by the
spdk_bdev_close in destruct, so the race
is closed.
The 0x5a test
The io_ctx->test = 0x5a line is the
only "fancy" thing in the per-IO struct. If you see
"Error, original IO device_ctx is wrong!" in the log,
one of three things happened: a memory corruption
(rare), a use-after-free in the bdev_io path
(catastrophic), or the test was set by a different
module that re-used the bdev_io. The last one is
unlikely in production, so this error usually means
real corruption.
What to take away
Passthru is the smallest virtual bdev module in the SPDK
tree, and once you see how the pieces fit, every other
vbdev (lvol, raid, split, gpt, error, delay) is a
variation on this template. The bridge pattern in
_pt_complete_io is the
defining technique: you submit a child bdev_io
on the base, with the original as cb_arg,
and the bridge callback completes the original and
frees the child. The destruct order is the
defining footgun: list, claim, desc,
io_device, in that order. Get those two patterns right
and you can read the rest of the SPDK tree.
The next page covers what you need to do
outside the .c file: the
Makefile, the modules list, the configure script, the
RPC file, and a complete skeleton you can copy to
start your own module.