bdevs all the way down.
A bdev doesn't have to own a device. It can be a virtual bdev: a bdev that sits on top of another bdev, and forwards every I/O to it. lvol sits on a malloc or NVMe. RAID sits on multiple NVMes. Passthru sits on whatever you tell it. Split sits on a base bdev, exposing a sub-range. The same bdev framework treats all of them as bdevs. This page is about the stack.
- The "vbdev" pattern: a bdev on top of a bdev
- How a vbdev discovers its base
- The "open base" relationship
- How I/O flows: top-down
- Real examples from diskengine
- Concrete modules: passthru, lvol, raid, split, gpt
- Edge cases: what if the base goes away?
The "vbdev" pattern: a bdev on top of a bdev
A vbdev (virtual bdev) is just a bdev. Same struct, same fields, same registration process. The difference is conceptual: a vbdev's backing storage is another bdev, not a piece of hardware. It might transform the I/O (RAID does XOR), translate the offsets (split does arithmetic), or just pass it through (passthru).
The framework doesn't know or care which is which.
A consumer doing spdk_bdev_read on a
passthru bdev or an lvol bdev or a RAID bdev issues
exactly the same call. The framework routes it
through the module's submit_request,
which is the one place where the type-specific
behavior lives.
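To make the dispatch idea concrete, here is a minimal sketch in plain C. It does not use SPDK's real structs: `toy_bdev`, `toy_fn_table`, and `toy_read` are hypothetical stand-ins, and reads are reduced to fetching one 64-bit value per block. It only illustrates that the caller makes the same call whether the target is a leaf or a virtual bdev.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_BLOCKS 8

/* Hypothetical stand-ins for the real SPDK structs. */
struct toy_bdev;

struct toy_fn_table {
    /* The one place where type-specific behavior lives. */
    int (*submit_request)(struct toy_bdev *bdev, uint64_t offset_blocks,
                          uint64_t *out);
};

struct toy_bdev {
    const char *name;
    const struct toy_fn_table *fn_table;
    struct toy_bdev *base;            /* NULL for a leaf bdev */
    uint64_t blocks[TOY_NUM_BLOCKS];  /* backing data, used by leaves only */
};

/* Generic entry point: the same call for every bdev type. */
static int toy_read(struct toy_bdev *bdev, uint64_t offset_blocks,
                    uint64_t *out);

/* Leaf: serves the read from its own storage. */
static int toy_leaf_submit(struct toy_bdev *bdev, uint64_t offset_blocks,
                           uint64_t *out)
{
    if (offset_blocks >= TOY_NUM_BLOCKS) {
        return -1;
    }
    *out = bdev->blocks[offset_blocks];
    return 0;
}

/* Virtual: forwards the read unchanged to its base. */
static int toy_passthru_submit(struct toy_bdev *bdev, uint64_t offset_blocks,
                               uint64_t *out)
{
    assert(bdev->base != NULL);
    return toy_read(bdev->base, offset_blocks, out);
}

static const struct toy_fn_table toy_leaf_fn_table = { toy_leaf_submit };
static const struct toy_fn_table toy_passthru_fn_table = { toy_passthru_submit };

static int toy_read(struct toy_bdev *bdev, uint64_t offset_blocks,
                    uint64_t *out)
{
    /* The caller never inspects the bdev type; the function table decides. */
    return bdev->fn_table->submit_request(bdev, offset_blocks, out);
}
```

Calling `toy_read` on a passthru-style bdev stacked on a leaf returns the same data as calling it on the leaf directly, which is the property the real framework gives you.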
flowchart TB
  subgraph App["Application / nvmf target"]
    A["spdk_bdev_read on<br/>'lvols/deadbeef'"]
  end
  subgraph Stack["The bdev stack"]
    L["lvol bdev<br/>'lvols/deadbeef'<br/>product_name='Logical Volume'"]
    R["RAID bdev<br/>'raid1'<br/>product_name='raid'"]
    N1["NVMe bdev<br/>'nvme0n1'"]
    N2["NVMe bdev<br/>'nvme1n1'"]
  end
  A -->|submit_request| L
  L -->|spdk_bdev_read from base| R
  R -->|spdk_bdev_read| N1
  R -->|spdk_bdev_read| N2
  classDef vbdev fill:#f5e6c8,stroke:#a17f1a;
  classDef leaf fill:#d6f5d6,stroke:#2a6f2a;
  class L,R vbdev
  class N1,N2 leaf
fig. 1 A read on an lvol name. The framework dispatches to the lvol module. The lvol module reads from its base (the RAID bdev). The RAID bdev reads from its bases (the two NVMe namespaces). Each layer is an independent bdev with its own desc, channel, queue depth, and statistics.
Three things to note about this stack:
- Each layer is a real bdev. bdev_get_bdevs lists all four of them. You can open raid1 directly without going through the lvol. The layers are composable.
- Each layer adds latency. A read to the lvol bdev takes the lvol's read time, plus the RAID's read time, plus the NVMe's read time. For latency-sensitive applications, fewer layers is better.
- Each layer is a separate module. The lvol module, RAID module, and NVMe module are three different spdk_bdev_module structs. They can be loaded and unloaded independently (modulo the order of module_init).
How a vbdev discovers its base
There are two ways a vbdev can come into existence:
- Configuration-based. The user wrote a config file (or sent an RPC) saying "create an lvol named foo on top of bdev bar." The lvol module stores this request and waits for bar to appear.
- Auto-discovery. When a new bdev is registered, the framework iterates through all loaded vbdev modules and calls each one's examine_config (and possibly examine_disk) callback with the new bdev. The module can decide to create a vbdev on top of it.
Passthru uses both. The RPC bdev_passthru_create
<base_bdev_name> <vbdev_name>
records a pending creation in a global list
(g_bdev_names). When the base bdev
shows up — either already present at start, or
registered later by the NVMe module — passthru's
examine_config callback is invoked with
the base bdev; passthru checks the pending list and
creates the vbdev if there's a match.
The callback itself is three lines long. The framework calls it for every
bdev that gets registered. vbdev_passthru_register
walks the pending list and creates a passthru
vbdev for every match. The
spdk_bdev_module_examine_done call
tells the framework "I'm done examining this bdev
for this module." The framework tracks the
examine-in-progress count for each module and only
moves to the next module's examine when this one is
done.
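The pending-list idea can be sketched in plain C. This is a toy model, not the passthru source: `pend_request`, `pend_record`, and `pend_examine` are hypothetical names, and "creating a vbdev" is reduced to setting a flag. It assumes a fixed-size list for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PEND_MAX 8

/* One recorded "create vbdev V on base B" request (hypothetical layout). */
struct pend_request {
    const char *vbdev_name;
    const char *base_name;
    int created;              /* set once the vbdev has been created */
};

static struct pend_request g_pend_list[PEND_MAX];
static size_t g_pend_count;

/* RPC side: remember the request; the base may not exist yet. */
static int pend_record(const char *base_name, const char *vbdev_name)
{
    assert(base_name != NULL && vbdev_name != NULL);
    if (g_pend_count == PEND_MAX) {
        return -1;
    }
    g_pend_list[g_pend_count].vbdev_name = vbdev_name;
    g_pend_list[g_pend_count].base_name = base_name;
    g_pend_list[g_pend_count].created = 0;
    g_pend_count++;
    return 0;
}

/* Examine side: called once per newly registered bdev.
 * Returns the number of vbdevs created on top of it. */
static int pend_examine(const char *new_bdev_name)
{
    int created = 0;

    for (size_t i = 0; i < g_pend_count; i++) {
        struct pend_request *req = &g_pend_list[i];

        if (!req->created && strcmp(req->base_name, new_bdev_name) == 0) {
            req->created = 1;  /* stand-in for open + claim + register */
            created++;
        }
    }
    return created;
}
```

The order of events does not matter in this model: a request recorded before the base appears is satisfied by the later examine call, which mirrors why passthru can use both the configuration path and auto-discovery.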
The "open base" relationship
A vbdev must open its base bdev to do I/O
on it. This is what
spdk_bdev_open_ext is for. The vbdev
gets back a spdk_bdev_desc * that it
uses for every I/O submission to the base.
The open happens during vbdev creation. For passthru:
After the open:
- pt_node->base_desc: the desc for the base. Bound to the thread that called spdk_bdev_open_ext (the one that invoked vbdev_passthru_register).
- pt_node->base_bdev: the base's spdk_bdev pointer. Used for capability queries, name printing, etc.
- pt_node->thread: saved separately, so the destruct path knows which thread to send the close to.
The claim
A vbdev that opens its base also claims it. The claim tells the framework "this base is mine; don't let any other module create a competing vbdev on top of it." The claim is the framework's mechanism for preventing two lvstores from both being created on the same malloc bdev (which would corrupt the metadata).
The claim is released in the destruct path with
spdk_bdev_module_release_bdev (see
module/bdev/passthru/vbdev_passthru.c:127).
For lvol, the equivalent is
spdk_bdev_module_release_bdev in the
lvol destruct callback.
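The claim semantics can be modeled in a few lines of plain C. This is a sketch under assumptions: `claim_bdev`, `claim_module`, `claim_bdev_take`, and `claim_bdev_release` are hypothetical names, and the model covers only a single exclusive claim per bdev, returning -EPERM to a second claimant.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a module and a base bdev. */
struct claim_module {
    const char *name;
};

struct claim_bdev {
    const char *name;
    const struct claim_module *claimed_by;  /* NULL while unclaimed */
};

/* Take an exclusive claim; a second claimant is refused. */
static int claim_bdev_take(struct claim_bdev *bdev,
                           const struct claim_module *module)
{
    assert(bdev != NULL && module != NULL);
    if (bdev->claimed_by != NULL) {
        return -EPERM;  /* someone else already owns this base */
    }
    bdev->claimed_by = module;
    return 0;
}

/* Release the claim in the destruct path so the base can be reused. */
static void claim_bdev_release(struct claim_bdev *bdev)
{
    assert(bdev != NULL && bdev->claimed_by != NULL);
    bdev->claimed_by = NULL;
}
```

A failed claim leaves the first owner untouched, and a later claim succeeds only after the release, which is why forgetting the release in a destruct path leaves the base unusable by other modules.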
How I/O flows: top-down
When an application submits a read on
lvols/deadbeef, the framework dispatches
it to the lvol module. The lvol module:
Each layer adds its own bdev_io, so one application I/O holds at least one bdev_io per layer while it is in flight. The framework's mempool is shared, so a pool of 65,535 bdev_ios can support roughly 65535 / 3 = 21,845 application I/Os in flight through a 3-layer stack, and fewer when a layer such as RAID fans out into several child I/Os. In practice, the number actually in flight is much smaller because each I/O completes quickly.
This is the pattern: the completion callback receives the new bdev_io (the one submitted to the base), translates the status, completes the original bdev_io (the one received from the framework), then frees the new bdev_io (which the vbdev owns). The framework will free the original bdev_io as part of its standard completion path.
The spdk_bdev_io_complete_base_io_status
helper is the right thing to call here. It copies
the error status (NVMe, SCSI, AIO, or generic) from
the new bdev_io to the original one, so the caller's
callback can inspect it with
spdk_bdev_io_get_nvme_status() or
similar. The original success/fail boolean is
derived from the status.
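The ownership rule in that pattern (copy the status up, then free the child) can be shown with a toy model in plain C. The names `cpl_io`, `cpl_submit_to_base`, and `cpl_child_done` are hypothetical, the status is reduced to a three-value enum, and a global counter stands in for the bdev_io pool so that leaks are visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum cpl_status { CPL_PENDING = 0, CPL_SUCCESS = 1, CPL_FAILED = -1 };

/* Hypothetical stand-in for a bdev_io. */
struct cpl_io {
    enum cpl_status status;
    struct cpl_io *parent;  /* original I/O this child serves, or NULL */
};

/* Number of child I/Os currently allocated; stands in for pool usage. */
static int g_cpl_live_children;

/* The vbdev allocates a new I/O for the base and remembers the original. */
static struct cpl_io *cpl_submit_to_base(struct cpl_io *orig)
{
    struct cpl_io *child = calloc(1, sizeof(*child));

    if (child == NULL) {
        return NULL;
    }
    child->parent = orig;
    g_cpl_live_children++;
    return child;
}

/* Completion callback for the child I/O. */
static void cpl_child_done(struct cpl_io *child, bool success)
{
    assert(child != NULL && child->parent != NULL);
    child->status = success ? CPL_SUCCESS : CPL_FAILED;

    /* 1. Carry the base's status over to the original I/O. */
    child->parent->status = child->status;

    /* 2. The vbdev owns the child, so the vbdev frees it. */
    free(child);
    g_cpl_live_children--;
}
```

The original I/O is never freed here: in this model, as in the text above, it belongs to whoever submitted it.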
sequenceDiagram
  participant App as App
  participant Lvol as lvol module
  participant Raid as RAID module
  participant Nvme as NVMe module
  App->>Lvol: spdk_bdev_read(bdev_io_1)
  Lvol->>Lvol: alloc child bdev_io_2
  Lvol->>Raid: spdk_bdev_read(child bdev_io_2)
  Raid->>Raid: alloc 2 child bdev_io_3, bdev_io_4
  Raid->>Nvme: spdk_bdev_read(bdev_io_3)
  Raid->>Nvme: spdk_bdev_read(bdev_io_4)
  Nvme->>Nvme: submit to NVMe queue
  Nvme-->>Raid: bdev_io_3 complete
  Nvme-->>Raid: bdev_io_4 complete
  Raid->>Raid: XOR, then complete bdev_io_2
  Raid-->>Lvol: bdev_io_2 complete
  Lvol->>Lvol: copy / metadata, complete bdev_io_1
  Lvol-->>App: bdev_io_1 complete
fig. 2 A read on a 3-layer stack. Each layer allocates its own bdev_io on the base (orange arrows). Completions flow back up (blue arrows). The application sees only the top-most bdev_io; everything else is implementation detail.
Real examples from diskengine
diskengine is built on top of the SPDK bdev hierarchy. The provisioning loop in
provisionLvol:74 creates an lvol bdev, attaches it to an NVMe-oF subsystem, and exposes it to the world. The lvol sits on a RAID bdev that sits on a malloc bdev (for tests) or NVMe namespaces (in production).
The key RPC call is
bdev_lvol_create:
From the Go side this is one RPC. The SPDK side is
two layers of bdev: the lvol module creates an
lvol on top of the lvstore's bdev, which is
itself a bdev pointing at the underlying
malloc/nvme/whatever the lvstore was built on.
The framework's hierarchy makes this transparent:
the RPC just says "create an lvol," and the lvol
module figures out the rest from the
lvstore UUID.
The clone path in
DDCloneSnapshotToRaid:49 goes the other direction: an NBD export of a
snapshot bdev, an NBD export of the destination
RAID bdev, and a dd from one to the
other. The skip-bytes logic accounts for the
RAID superblock, which is just SPDK metadata
stored at the start of each base bdev. The
point: even operations that look like "copy
between block devices" are just I/O on the
bdev hierarchy.
Concrete modules: passthru, lvol, raid, split, gpt
Five concrete examples of the vbdev pattern. All are real modules in the SPDK tree. All do their work by sitting on top of another bdev and forwarding (or transforming) I/O.
Passthru
The simplest. A passthru bdev is a 1:1 wrapper around its base. Every I/O type is forwarded unchanged. The only transformation is in metadata handling (DIF / DIX is propagated but not modified).
Every case ends with _pt_complete_io
as the completion callback. The new bdev_io (the
one submitted to the base) is freed in that
callback.
lvol
Logical volumes from the lvol store. The lvol bdev sits on a "lvstore bdev" (an lvol-internal representation of the underlying device). Reads are translated from logical offsets to cluster addresses; writes go through the blobstore metadata; copy-on-write is implemented here. See 5.1 for the full layout.
The interesting thing about lvol is that the "base" bdev is a real SPDK bdev (malloc, nvme, aio, etc.) but the lvol module treats it specially — it doesn't claim the base; instead, the lvstore module claims the base and creates the lvstore bdev. The lvol bdev is then a vbdev-on-vbdev.
raid
The RAID module implements concat, raid0, raid1, and raid5f. The RAID bdev sits on top of N base bdevs (N=2 for RAID1, N=4 for RAID5, etc.). For a RAID1 read, it picks one base. For a RAID1 write, it writes to all bases; raid5f also writes parity. The submission paths allocate one child bdev_io per base involved, and the RAID module waits for all of them to complete before completing the parent bdev_io.
Multiple bases means multiple descs, multiple channels, multiple per-thread base_channels in the RAID bdev's channel struct. The RAID module tracks them all in a per-channel array.
split
A split bdev exposes a sub-range of another
bdev. You tell it "expose blocks 100-200 of
Malloc0 as a new bdev called
Malloc0_part1." The split module
rewrites the offset on every I/O and forwards to
the base. The base bdev is unaware that the split
exists.
The interesting design choice: split is
implemented using
spdk_bdev_part_base_construct and
spdk_bdev_part_construct in
include/spdk/bdev_module.h:1659,
a higher-level API designed for "expose a
sub-range of an existing bdev." The framework
handles the offset arithmetic and the channel
forwarding; the split module is mostly glue.
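The offset arithmetic itself is small. Here is a sketch in plain C, assuming a partition described by a start offset and a length in blocks. `part_range` and `part_translate` are hypothetical names for illustration and are not part of the spdk_bdev_part API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a sub-range of a base bdev. */
struct part_range {
    uint64_t offset_blocks;  /* first block of the part on the base */
    uint64_t num_blocks;     /* length of the part in blocks */
};

/* Translate an I/O on the part into an offset on the base.
 * Rejects any I/O that would run past the end of the part. */
static int part_translate(const struct part_range *part,
                          uint64_t offset_blocks, uint64_t num_blocks,
                          uint64_t *base_offset_blocks)
{
    assert(part != NULL && base_offset_blocks != NULL);

    /* Written this way so the bounds check cannot overflow. */
    if (offset_blocks > part->num_blocks ||
        num_blocks > part->num_blocks - offset_blocks) {
        return -EINVAL;
    }
    *base_offset_blocks = part->offset_blocks + offset_blocks;
    return 0;
}
```

For a part that starts at block 100 and is 100 blocks long, block 0 of the part maps to block 100 of the base, and a two-block I/O starting at block 99 of the part is rejected.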
gpt
A GPT bdev exposes a partition from a GUID Partition Table. The GPT module reads the partition table at construction time, finds the named partition, and creates a bdev spanning exactly that partition's LBA range. Functionally similar to split, but the offsets come from the GPT header on the disk rather than from user configuration.
GPT is also a "disposable" vbdev in the sense that it gets recreated every time the base bdev is examined. There's no "create a GPT partition named X" RPC; you have to write the GPT header to the disk first, then the GPT module notices it and creates the bdev.
Other vbdev modules
The SPDK tree has more:
null — a bdev that returns zeros on read and silently drops writes. Useful for testing.
error — a bdev that returns errors on every I/O. Also for testing.
delay — adds a configurable delay between submit and complete. For testing timeouts.
crypto — encrypts / decrypts on the way through.
ocf — integrates with the Open CAS Framework for caching.
Edge cases: what if the base goes away?
Hot-removal of the base
A base bdev can be unregistered while vbdevs sit on top of it. The framework calls each vbdev's base-event callback. For passthru:
The pattern: react to the event by unregistering
the vbdev. The framework handles the rest
(closing descs, releasing claims, draining
channels). The vbdev effectively disappears; new
I/Os submitted to it will fail because the
unregister puts it in
SPDK_BDEV_STATUS_REMOVING.
The 3-layer stack: cascading hot-removal
If a 3-layer stack (lvol → raid → nvme) has its NVMe base hot-removed, the cascade is:
1. NVMe module unregisters nvme0n1.
2. Framework notifies the RAID module: nvme0n1 is going away. RAID unregisters raid1.
3. Framework notifies the lvol module: raid1 is going away. lvol unregisters lvols/deadbeef.
4. Framework notifies the nvmf target: lvol is going away. nvmf detaches the namespace.
Each step happens asynchronously on the appropriate thread. The total time for the cascade is bounded by the cost of each individual unregister, which for a clean shutdown is microseconds.
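The cascade can be pictured with a toy model in plain C. It is deliberately simplified: `hr_bdev` and `hr_unregister` are hypothetical names, each bdev has at most one vbdev on top of it, and the whole chain runs synchronously in one loop, whereas the real framework delivers each removal event asynchronously on the appropriate thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each bdev knows the single vbdev stacked on it. */
struct hr_bdev {
    const char *name;
    struct hr_bdev *upper;  /* vbdev sitting on this bdev, or NULL */
    bool registered;
};

/* Unregister a bdev and let the removal ripple up the stack.
 * Returns how many bdevs were unregistered in total. */
static int hr_unregister(struct hr_bdev *bdev)
{
    int count = 0;

    assert(bdev != NULL);
    while (bdev != NULL && bdev->registered) {
        bdev->registered = false;  /* this layer goes away */
        count++;
        bdev = bdev->upper;        /* the layer above reacts to the event */
    }
    return count;
}
```

Removing the bottom of a three-layer chain unregisters all three layers, and a repeated removal is a no-op because nothing is registered any more.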
Configuration reload
A "config reload" is a process where the running SPDK application is asked to apply a new config file without restarting. The bdev hierarchy makes this tractable:
New bdevs are created first (malloc, NVMe, etc.) so that vbdevs that depend on them can be created next.
vbdevs are created in dependency order: lvols after lvstores after RAIDs after NVMe.
Stale vbdevs (those that were in the old config but not the new) are unregistered.
nvmf subsystems and namespaces are updated.
The framework's examine_config
callback is the mechanism for the "new bdev"
part. The framework's unregister API is the
mechanism for the "stale bdev" part. The
orchestrating code (typically
bdev_rpc.c for the JSON-RPC side)
walks the config file and issues the right
creation/destruction calls.
What if two modules want to claim the same base?
The second claim fails with -EPERM (or -EBUSY, depending on the claim type). The vbdev creation fails. The base bdev is still owned by the first claimant. This is the framework's mechanism for "only one vbdev on top of a base" — without it, you'd get corruption from competing metadata updates.
What if a vbdev is created on a base that's being removed?
spdk_bdev_open_ext returns -ENODEV
(the bdev is in SPDK_BDEV_STATUS_REMOVING).
The vbdev creation fails. The user retries later
when a fresh bdev shows up.
What if the base bdev changes size (resize)?
The base bdev fires
SPDK_BDEV_EVENT_RESIZE. The
framework notifies the vbdev's event callback.
The vbdev can choose to:
Update its own blockcnt to match (passthru does this; lvol does this on cluster boundaries).
Ignore the event (the vbdev keeps its old size; the user has to recreate it).
Refuse the resize and unregister.
For most vbdevs, "update my blockcnt" is the
right answer. The framework provides
spdk_bdev_notify_blockcnt_change for
this — the bdev module updates its block count
and the framework re-broadcasts.
What to take away
A bdev doesn't have to own a device. A vbdev is a bdev that sits on top of another bdev and forwards I/O to it. lvol, raid, split, gpt, passthru — all vbdevs. The framework doesn't distinguish between leaf and virtual bdevs; both are bdevs, both have descs and channels, both can be exposed through nvmf or vhost.
The hierarchy is the foundation. Every layer in the stack is an independent bdev, and the framework's mempool, statistics, hot-remove events, and capability checks all work the same way for every layer. diskengine's provisioning flow ends in an lvol bdev, but the lvol is sitting on a RAID which is sitting on NVMe — three layers, each one a real bdev, each one doing exactly what the bdev framework says a bdev does.
This is the end of the bdev framework section. The next layer, 5, digs into the lvol module specifically — the metadata layout, the copy-on-write, the snapshotting. The bdev hierarchy you've just learned is what makes all of that work.