Layer 4 · The bdev framework

bdevs all the way down.

A bdev doesn't have to own a device. It can be a virtual bdev: a bdev that sits on top of another bdev, and forwards every I/O to it. lvol sits on a malloc or NVMe. RAID sits on multiple NVMes. Passthru sits on whatever you tell it. Split sits on a base bdev, exposing a sub-range. The same bdev framework treats all of them as bdevs. This page is about the stack.

~15 min read · 2 diagrams · prerequisites: 4.1 · 4.2 · 4.3
On this page
  1. The "vbdev" pattern: a bdev on top of a bdev
  2. How a vbdev discovers its base
  3. The "open base" relationship
  4. How I/O flows: top-down
  5. Real examples from diskengine
  6. Concrete modules: passthru, lvol, raid, split, gpt
  7. Edge cases: what if the base goes away?

The "vbdev" pattern: a bdev on top of a bdev

A vbdev (virtual bdev) is just a bdev. Same struct, same fields, same registration process. The difference is conceptual: a vbdev's backing storage is another bdev, not a piece of hardware. It might transform the I/O (RAID does XOR), translate the offsets (split does arithmetic), or just pass it through (passthru).

The framework doesn't know or care which is which. A consumer doing spdk_bdev_read on a passthru bdev or an lvol bdev or a RAID bdev issues exactly the same call. The framework routes it through the module's submit_request, which is the one place where the type-specific behavior lives.
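The dispatch described above can be modeled in a few lines of plain C. This is a toy sketch, not SPDK code: `toy_bdev`, `toy_read`, and the two submit functions are hypothetical names, and the real framework passes an I/O channel and a bdev_io rather than a bare offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of per-module dispatch. None of these are real SPDK types. */
struct toy_bdev;
typedef int (*toy_submit_fn)(struct toy_bdev *bdev, uint64_t offset_blocks);

struct toy_bdev {
    const char *name;
    toy_submit_fn submit_request; /* the one place type-specific behavior lives */
    struct toy_bdev *base;        /* NULL for a leaf bdev */
    uint64_t reads_seen;          /* per-layer statistic */
};

/* The consumer-facing call: identical for every kind of bdev. */
static int toy_read(struct toy_bdev *bdev, uint64_t offset_blocks)
{
    assert(bdev != NULL && bdev->submit_request != NULL);
    bdev->reads_seen++;
    return bdev->submit_request(bdev, offset_blocks);
}

/* Leaf module: would talk to hardware; here it simply succeeds. */
static int toy_leaf_submit(struct toy_bdev *bdev, uint64_t offset_blocks)
{
    (void)bdev;
    (void)offset_blocks;
    return 0;
}

/* Passthru-like module: forwards the same request to its base bdev. */
static int toy_passthru_submit(struct toy_bdev *bdev, uint64_t offset_blocks)
{
    return toy_read(bdev->base, offset_blocks);
}
```

The caller only ever uses `toy_read`; whether the request stops at a leaf or travels down to a base is decided entirely by the function pointer stored in the bdev.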

flowchart TB
  subgraph App["Application / nvmf target"]
    A["spdk_bdev_read on<br/>'lvols/deadbeef'"]
  end
  subgraph Stack["The bdev stack"]
    L["lvol bdev<br/>'lvols/deadbeef'<br/>product_name='Logical Volume'"]
    R["RAID bdev<br/>'raid1'<br/>product_name='raid'"]
    N1["NVMe bdev<br/>'nvme0n1'"]
    N2["NVMe bdev<br/>'nvme1n1'"]
  end
  A -->|submit_request| L
  L -->|spdk_bdev_write to base| R
  R -->|XOR + 2 spdk_bdev_writes| N1
  R -->|XOR + 2 spdk_bdev_writes| N2
  classDef vbdev fill:#f5e6c8,stroke:#a17f1a;
  classDef leaf fill:#d6f5d6,stroke:#2a6f2a;
  class L,R vbdev
  class N1,N2 leaf
fig. 1: a typical bdev stack

fig. 1   A read on an lvol name. The framework dispatches to the lvol module. The lvol module reads from its base (the RAID bdev). The RAID bdev reads from its bases (the two NVMe namespaces). Each layer is an independent bdev with its own desc, channel, queue depth, and statistics.

Three things to note about this stack:

  • Each layer is a real bdev. bdev_get_bdevs lists all four of them. You can open raid1 directly without going through the lvol. The layers are composable.

  • Each layer adds latency. A read to the lvol bdev takes the lvol's read time, plus the RAID's read time, plus the NVMe's read time. For latency-sensitive applications, fewer layers is better.

  • Each layer is a separate module. The lvol module, RAID module, and NVMe module are three different spdk_bdev_module structs. They can be loaded and unloaded independently (modulo the order of module_init).

How a vbdev discovers its base

There are two ways a vbdev can come into existence:

  1. Configuration-based. The user wrote a config file (or sent an RPC) saying "create an lvol named foo on top of bdev bar." The lvol module stores this request and waits for bar to appear.

  2. Auto-discovery. When a new bdev is registered, the framework iterates through all loaded vbdev modules and calls each one's examine_config (and possibly examine_disk) callback with the new bdev. The module can decide to create a vbdev on top of it.

Passthru uses both. The RPC bdev_passthru_create <base_bdev_name> <vbdev_name> records a pending creation in a global list (g_bdev_names). When the base bdev shows up — either already present at start, or registered later by the NVMe module — passthru's examine_config callback is invoked with the base bdev; passthru checks the pending list and creates the vbdev if there's a match.

The examine callback is only three lines long. The framework calls it for every bdev that gets registered. It calls vbdev_passthru_register, which walks the pending list and creates a passthru vbdev for every match, and then calls spdk_bdev_module_examine_done, which tells the framework "I'm done examining this bdev for this module." The framework tracks the examine-in-progress count for each module and only moves to the next module's examine when this one is done.
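The pending-list flow can be sketched as a self-contained toy model. This is not the passthru source: every `toy_` name is hypothetical, vbdev creation is reduced to setting a flag, and the examine callback returns a count (the real callback returns nothing) so that the behavior is easy to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PENDING 8

/* One pending "create vbdev_name on top of base_name" request. */
struct toy_pending {
    const char *base_name;
    const char *vbdev_name;
    bool created;
};

static struct toy_pending g_toy_pending[TOY_MAX_PENDING];
static size_t g_toy_pending_count;
static int g_toy_examine_outstanding; /* framework-side bookkeeping */

/* RPC side: remember the request; the base may not exist yet. */
static int toy_passthru_create(const char *base_name, const char *vbdev_name)
{
    if (g_toy_pending_count == TOY_MAX_PENDING) {
        return -1;
    }
    g_toy_pending[g_toy_pending_count++] =
        (struct toy_pending){ base_name, vbdev_name, false };
    return 0;
}

/* Walk the pending list; create a vbdev for every not-yet-created match. */
static int toy_passthru_register(const char *bdev_name)
{
    int created = 0;
    for (size_t i = 0; i < g_toy_pending_count; i++) {
        struct toy_pending *p = &g_toy_pending[i];
        if (!p->created && strcmp(p->base_name, bdev_name) == 0) {
            p->created = true; /* stand-in for opening the base and registering */
            created++;
        }
    }
    return created;
}

/* Module side: tell the framework this module is finished with the bdev. */
static void toy_examine_done(void)
{
    assert(g_toy_examine_outstanding > 0);
    g_toy_examine_outstanding--;
}

/* Examine callback: try to register, then report done. */
static int toy_passthru_examine(const char *bdev_name)
{
    int created = toy_passthru_register(bdev_name);
    toy_examine_done();
    return created;
}

/* Framework side: called once for every newly registered bdev. */
static int toy_framework_on_register(const char *bdev_name)
{
    g_toy_examine_outstanding++;
    return toy_passthru_examine(bdev_name);
}
```

Recording the request first and matching it later is what lets the RPC succeed even when the base bdev has not been registered yet.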

The "open base" relationship

A vbdev must open its base bdev to do I/O on it. This is what spdk_bdev_open_ext is for. The vbdev gets back a spdk_bdev_desc * that it uses for every I/O submission to the base.

The open happens during vbdev creation. For passthru, the open leaves three fields on the node:

  • pt_node->base_desc — the desc for the base. Bound to the thread that called spdk_bdev_open_ext (the one that invoked vbdev_passthru_register).

  • pt_node->base_bdev — the base's spdk_bdev pointer. Used for capability queries, name printing, etc.

  • pt_node->thread — saved separately, so the destruct path knows which thread to send the close to.

The claim

A vbdev that opens its base also claims it. The claim tells the framework "this base is mine — don't let any other module create a competing vbdev on top of it." The claim is the framework's mechanism for preventing two lvols from both being created on the same malloc bdev (which would corrupt the metadata).

The claim is released in the destruct path with spdk_bdev_module_release_bdev (see module/bdev/passthru/vbdev_passthru.c:127). lvol makes the same call from its own destruct callback.
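A claim behaves like a single-owner flag on the base bdev. The sketch below models only that rule; `toy_claim` and `toy_release` are hypothetical stand-ins rather than SPDK functions, and the -EPERM return value follows this page's description of a rejected second claim.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_module {
    const char *name;
};

struct toy_bdev {
    const char *name;
    const struct toy_module *claimed_by; /* NULL while unclaimed */
};

/* Claim the base for one module; a second claimant is refused. */
static int toy_claim(struct toy_bdev *base, const struct toy_module *module)
{
    assert(base != NULL && module != NULL);
    if (base->claimed_by != NULL) {
        return -EPERM;
    }
    base->claimed_by = module;
    return 0;
}

/* Release the claim, as a vbdev's destruct path would. */
static void toy_release(struct toy_bdev *base)
{
    assert(base != NULL);
    base->claimed_by = NULL;
}
```

Because the check and the assignment happen in one place, a module cannot build a vbdev on a base without first learning whether someone else already owns it.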

How I/O flows: top-down

When an application submits a read on lvols/deadbeef, the framework dispatches it to the lvol module. The lvol module:

  1. Receive the bdev_io on the lvol bdev: submit_request is called with ch and bdev_io.

  2. Translate offsets and sizes: lvol does the cluster / lvol-block math.

  3. Submit a new bdev_io on the base (RAID) via spdk_bdev_readv_blocks_ext.

  4. Wait for the new bdev_io to complete in the lvol's completion callback.

  5. Complete the original bdev_io with the new status via spdk_bdev_io_complete_base_io_status.

Each layer adds its own bdev_io, so a single application I/O holds one bdev_io per layer while it is in flight (more where a layer fans out to several bases, as RAID does). The framework's bdev_io mempool is shared by all layers, so a pool of 65535 entries supports at most 65535 / 3 = 21845 application I/Os in flight on a 3-layer stack without fan-out. In practice the number in flight is much smaller, because each I/O completes quickly and returns its bdev_ios to the pool.
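The pool arithmetic is easy to check with a small helper. This is an illustrative calculation only: the function names are hypothetical, and the fan-out array is an assumed way of describing how many child bdev_ios each virtual layer submits per I/O it receives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * bdev_ios held by one application I/O. fanout[i] is how many child
 * bdev_ios virtual layer i submits to the layer below for each I/O it
 * receives (1 for a passthru-like layer, 2 for a layer that hits two bases).
 */
static uint64_t toy_bdev_ios_per_app_io(const unsigned *fanout, size_t n_virtual_layers)
{
    uint64_t total = 1; /* the bdev_io the application itself submitted */
    uint64_t width = 1; /* bdev_ios alive at the current layer */
    for (size_t i = 0; i < n_virtual_layers; i++) {
        assert(fanout[i] >= 1);
        width *= fanout[i];
        total += width;
    }
    return total;
}

/* Upper bound on concurrent application I/Os for a shared pool. */
static uint64_t toy_max_app_ios(uint64_t pool_size, const unsigned *fanout, size_t n)
{
    return pool_size / toy_bdev_ios_per_app_io(fanout, n);
}
```

With two virtual layers and no fan-out, each application I/O holds 3 bdev_ios, so a 65535-entry pool allows 21845 of them. If the lower virtual layer submits to two bases, each application I/O holds 4 bdev_ios and the bound drops to 16383.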

This is the pattern: the completion callback receives the new bdev_io (the one submitted to the base), translates the status, completes the original bdev_io (the one received from the framework), then frees the new bdev_io (which the vbdev owns). The framework will free the original bdev_io as part of its standard completion path.

The spdk_bdev_io_complete_base_io_status helper is the right thing to call here. It copies the error status (NVMe, SCSI, AIO, or generic) from the new bdev_io to the original one, so the caller's callback can inspect it with spdk_bdev_io_get_nvme_status() or similar. The original success/fail boolean is derived from the status.
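The ownership rules in this completion pattern can be modeled without SPDK. In the sketch below, all `toy_` names are hypothetical; `toy_io_complete` stands in for the framework completing an I/O and invoking its callback, and the in-flight counter exists only to make the alloc/free pairing visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum toy_status { TOY_PENDING, TOY_SUCCESS, TOY_FAILED };

struct toy_io;
typedef void (*toy_io_cb)(struct toy_io *io, bool success, void *cb_arg);

struct toy_io {
    enum toy_status status;
    toy_io_cb cb;
    void *cb_arg;
};

static int g_toy_ios_in_flight;

static struct toy_io *toy_io_alloc(toy_io_cb cb, void *cb_arg)
{
    struct toy_io *io = calloc(1, sizeof(*io));
    if (io == NULL) {
        return NULL;
    }
    io->status = TOY_PENDING;
    io->cb = cb;
    io->cb_arg = cb_arg;
    g_toy_ios_in_flight++;
    return io;
}

static void toy_io_free(struct toy_io *io)
{
    g_toy_ios_in_flight--;
    free(io);
}

/* Record the status and invoke the submitter's callback, if any. */
static void toy_io_complete(struct toy_io *io, enum toy_status status)
{
    io->status = status;
    if (io->cb != NULL) {
        io->cb(io, status == TOY_SUCCESS, io->cb_arg);
    }
}

/* vbdev completion callback: base_io went to the base, cb_arg is the original. */
static void toy_vbdev_complete(struct toy_io *base_io, bool success, void *cb_arg)
{
    struct toy_io *orig_io = cb_arg;
    (void)success; /* the full status is copied instead of the boolean */
    toy_io_complete(orig_io, base_io->status); /* propagate the status upward */
    toy_io_free(base_io);                      /* the vbdev owns the base I/O */
}
```

The callback frees only the I/O it submitted itself. The original I/O is handed back to whoever submitted it, who is responsible for releasing it.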

sequenceDiagram
participant App as App
participant Lvol as lvol module
participant Raid as RAID module
participant Nvme as NVMe module

App->>Lvol: spdk_bdev_read(bdev_io_1)
Lvol->>Lvol: alloc child bdev_io_2
Lvol->>Raid: spdk_bdev_read(child bdev_io_2)
Raid->>Raid: alloc 2 child bdev_io_3, bdev_io_4
Raid->>Nvme: spdk_bdev_read(bdev_io_3)
Raid->>Nvme: spdk_bdev_read(bdev_io_4)
Nvme->>Nvme: submit to NVMe queue
Nvme-->>Raid: bdev_io_3 complete
Nvme-->>Raid: bdev_io_4 complete
Raid->>Raid: XOR, then complete bdev_io_2
Raid-->>Lvol: bdev_io_2 complete
Lvol->>Lvol: copy / metadata, complete bdev_io_1
Lvol-->>App: bdev_io_1 complete
fig. 2: I/O flowing down the stack

fig. 2   A read on a 3-layer stack. Each layer allocates its own bdev_io on the base (solid arrows). Completions flow back up (dashed arrows). The application sees only the top-most bdev_io; everything else is implementation detail.

Real examples from diskengine

diskengine is built on top of the SPDK bdev hierarchy. The provisioning loop in provisionLvol:74 creates an lvol bdev, attaches it to an NVMe-oF subsystem, and exposes it to the world. The lvol sits on a RAID bdev that sits on a malloc bdev (for tests) or NVMe namespaces (in production).

The key RPC call is bdev_lvol_create. From the Go side this is one RPC. The SPDK side is two layers of bdev: the lvol module creates an lvol on top of the lvstore's bdev, which is itself a bdev pointing at the underlying malloc/nvme/whatever the lvstore was built on. The framework's hierarchy makes this transparent: the RPC just says "create an lvol," and the lvol module figures out the rest from the lvstore UUID.

The clone path in DDCloneSnapshotToRaid:49 goes the other direction: an NBD export of a snapshot bdev, an NBD export of the destination RAID bdev, and a dd from one to the other. The skip-bytes logic accounts for the RAID superblock, which is just SPDK metadata stored at the start of each base bdev. The point: even operations that look like "copy between block devices" are just I/O on the bdev hierarchy.

Concrete modules: passthru, lvol, raid, split, gpt

Five concrete examples of the vbdev pattern. All are real modules in the SPDK tree. All do their work by sitting on top of another bdev and forwarding (or transforming) I/O.

Passthru

The simplest. A passthru bdev is a 1:1 wrapper around its base. Every I/O type is forwarded unchanged. The only transformation is in metadata handling (DIF / DIX is propagated but not modified).

Every I/O type that passthru forwards is submitted with _pt_complete_io as the completion callback. The new bdev_io (the one submitted to the base) is freed in that callback.

lvol

Logical volumes from the lvol store. The lvol bdev sits on a "lvstore bdev" (an lvol-internal representation of the underlying device). Reads are translated from logical offsets to cluster addresses; writes go through the blobstore metadata; copy-on-write is implemented here. See 5.1 for the full layout.

The interesting thing about lvol is that the "base" bdev is a real SPDK bdev (malloc, nvme, aio, etc.) but the lvol module treats it specially — it doesn't claim the base; instead, the lvstore module claims the base and creates the lvstore bdev. The lvol bdev is then a vbdev-on-vbdev.

raid

The RAID module provides several levels (concat, raid0, raid1, raid5f). The RAID bdev sits on top of N base bdevs (N=2 for the RAID1 in fig. 1). For a mirrored level such as RAID1, a read goes to one base and a write goes to every base; raid5f also maintains parity. The submission paths allocate one child bdev_io per base involved, and the RAID module waits for all of them to complete before completing the parent bdev_io.

Multiple bases means multiple descs, multiple channels, multiple per-thread base_channels in the RAID bdev's channel struct. The RAID module tracks them all in a per-channel array.

split

A split bdev exposes a sub-range of another bdev. You tell it "expose blocks 100-200 of Malloc0 as a new bdev called Malloc0_part1." The split module rewrites the offset on every I/O and forwards to the base. The base bdev is unaware that the split exists.

The interesting design choice: split is implemented using spdk_bdev_part_base_construct and spdk_bdev_part_construct in include/spdk/bdev_module.h:1659, a higher-level API designed for "expose a sub-range of an existing bdev." The framework handles the offset arithmetic and the channel forwarding; the split module is mostly glue.

gpt

A GPT bdev exposes a partition from a GUID Partition Table. The GPT module reads the partition table at construction time, finds the named partition, and creates a bdev spanning exactly that partition's LBA range. Functionally similar to split, but the offsets come from the GPT header on the disk rather than from user configuration.

GPT is also a "disposable" vbdev in the sense that it gets recreated every time the base bdev is examined. There's no "create a GPT partition named X" RPC; you have to write the GPT header to the disk first, then the GPT module notices it and creates the bdev.

Other vbdev modules

The SPDK tree has more:

  • null — a bdev that returns zeros on read and silently drops writes. Useful for testing.

  • error — a bdev that returns errors on every I/O. Also for testing.

  • delay — adds a configurable delay between submit and complete. For testing timeouts.

  • crypto — encrypts / decrypts on the way through.

  • ocf — integrates with the Open CAS Framework for caching.

Edge cases: what if the base goes away?

Hot-removal of the base

A base bdev can be unregistered while vbdevs sit on top of it. When that happens, the framework calls the base-event callback of each vbdev that has the base open; passthru's callback is a representative example.

The pattern: react to the event by unregistering the vbdev. The framework handles the rest (closing descs, releasing claims, draining channels). The vbdev effectively disappears; new I/Os submitted to it will fail because the unregister puts it in SPDK_BDEV_STATUS_REMOVING.

The 3-layer stack: cascading hot-removal

If a 3-layer stack (lvol → raid → nvme) has its NVMe base hot-removed, the cascade is:

  1. NVMe module unregisters nvme0n1.

  2. Framework notifies the RAID module: nvme0n1 is going away. RAID unregisters raid1.

  3. Framework notifies the lvol module: raid1 is going away. lvol unregisters lvols/deadbeef.

  4. Framework notifies the nvmf target: lvol is going away. nvmf detaches the namespace.

Each step happens asynchronously on the appropriate thread. The total time for the cascade is bounded by the cost of each individual unregister, which for a clean shutdown is microseconds.
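The cascade can be modeled as each layer reacting to its base's removal by unregistering itself. This toy version uses hypothetical names and runs synchronously through recursion, whereas the text above describes each step happening asynchronously on its own thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_bdev {
    const char *name;
    struct toy_bdev *upper; /* the vbdev sitting on top, or NULL */
    bool removing;
    int removal_seq;        /* 0 until unregistered, then 1, 2, 3, ... */
};

static int g_toy_removal_counter;

static void toy_unregister(struct toy_bdev *bdev);

/* Remove-event handler of the layer above: react by unregistering itself. */
static void toy_on_base_removed(struct toy_bdev *vbdev)
{
    toy_unregister(vbdev);
}

static void toy_unregister(struct toy_bdev *bdev)
{
    assert(bdev != NULL);
    if (bdev->removing) {
        return; /* already on its way out */
    }
    bdev->removing = true;
    bdev->removal_seq = ++g_toy_removal_counter;
    if (bdev->upper != NULL) {
        toy_on_base_removed(bdev->upper); /* notify whoever sits on top */
    }
}
```

Removing one NVMe base takes the RAID bdev and then the lvol with it, while the second NVMe base stays registered: the cascade travels upward only.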

Configuration reload

A "config reload" is a process where the running SPDK application is asked to apply a new config file without restarting. The bdev hierarchy makes this tractable:

  1. New bdevs are created first (malloc, NVMe, etc.) so that vbdevs that depend on them can be created next.

  2. vbdevs are created in dependency order: lvols after lvstores after RAIDs after NVMe.

  3. Stale vbdevs (those that were in the old config but not the new) are unregistered.

  4. nvmf subsystems and namespaces are updated.

The framework's examine_config callback is the mechanism for the "new bdev" part. The framework's unregister API is the mechanism for the "stale bdev" part. The orchestrating code (typically bdev_rpc.c for the JSON-RPC side) walks the config file and issues the right creation/destruction calls.
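The "create in dependency order" step can be sketched as a repeated pass over the wanted bdevs. This is an assumed illustration of the ordering idea, not how any SPDK tool is known to implement it: the `toy_` names are hypothetical, and each spec has a single base, which ignores multi-base vbdevs such as RAID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_spec {
    const char *name;
    const char *base; /* NULL for a leaf bdev */
    bool created;
};

static bool toy_is_created(const struct toy_spec *specs, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (specs[i].created && strcmp(specs[i].name, name) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * Create every spec whose base already exists, repeating until a pass makes
 * no progress. Writes names to `order` in creation order and returns how
 * many were created; specs whose base never appears stay uncreated.
 */
static size_t toy_create_in_order(struct toy_spec *specs, size_t n, const char **order)
{
    size_t created = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < n; i++) {
            struct toy_spec *s = &specs[i];
            if (s->created) {
                continue;
            }
            if (s->base == NULL || toy_is_created(specs, n, s->base)) {
                s->created = true;
                order[created++] = s->name;
                progress = true;
            }
        }
    }
    return created;
}
```

The order in which the specs are listed does not matter: a vbdev listed before its base is simply picked up on a later pass, and a vbdev whose base is missing is left for a later attempt.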

What if two modules want to claim the same base?

The second claim fails with -EPERM (or -EBUSY, depending on the claim type). The vbdev creation fails. The base bdev is still owned by the first claimant. This is the framework's mechanism for "only one vbdev on top of a base" — without it, you'd get corruption from competing metadata updates.

What if a vbdev is created on a base that's being removed?

spdk_bdev_open_ext returns -ENODEV (the bdev is in SPDK_BDEV_STATUS_REMOVING). The vbdev creation fails. The user retries later when a fresh bdev shows up.

What if the base bdev changes size (resize)?

The base bdev fires SPDK_BDEV_EVENT_RESIZE. The framework notifies the vbdev's event callback. The vbdev can choose to:

  • Update its own blockcnt to match (passthru does this; lvol does this on cluster boundaries).

  • Ignore the event (the vbdev keeps its old size; the user has to recreate it).

  • Refuse the resize and unregister.

For most vbdevs, "update my blockcnt" is the right answer. The framework provides spdk_bdev_notify_blockcnt_change for this — the bdev module updates its block count and the framework re-broadcasts.

What to take away

A bdev doesn't have to own a device. A vbdev is a bdev that sits on top of another bdev and forwards I/O to it. lvol, raid, split, gpt, passthru — all vbdevs. The framework doesn't distinguish between leaf and virtual bdevs; both are bdevs, both have descs and channels, both can be exposed through nvmf or vhost.

The hierarchy is the foundation. Every layer in the stack is an independent bdev, and the framework's mempool, statistics, hot-remove events, and capability checks all work the same way for every layer. diskengine's provisioning flow ends in an lvol bdev, but the lvol is sitting on a RAID which is sitting on NVMe — three layers, each one a real bdev, each one doing exactly what the bdev framework says a bdev does.

This is the end of the bdev framework section. The next layer, 5, digs into the lvol module specifically — the metadata layout, the copy-on-write, the snapshotting. The bdev hierarchy you've just learned is what makes all of that work.