vhost-scsi vs vhost-blk.
Same wire protocol, two device personalities. vhost-blk exposes a single block device — one virtio-blk disk, one bdev on the backend. vhost-scsi exposes an entire SCSI target — one virtio-scsi HBA, many LUNs, each a bdev on the backend. The trade is simplicity against flexibility.
- The two flavours and the personality they expose to the guest
- vhost-blk: one bdev, one disk, no SCSI in the guest
- vhost-scsi: one controller, many LUNs, SCSI in the guest
- How SPDK implements each (lib/vhost/vhost_blk.c, lib/vhost/vhost_scsi.c)
- The bdev backing: a vhost-scsi controller as a multi-bdev LUN container
- Performance trade-offs
- What diskengine uses and why
- Edge cases: LUN hotremove, persistent reservations, multipath
The two flavours
A vhost-user socket carries I/O for one virtio device. The protocol doesn't care what kind of device. The control messages are the same. The data plane is the same. The only thing that changes is the device "personality" — the way the backend interprets the I/O requests and turns them into bdev operations.
vhost-blk and vhost-scsi are the two personalities SPDK
ships. Both live in lib/vhost/. Both
register a spdk_vhost_user_dev_backend via
the dispatch table at
lib/vhost/vhost.c:434. Both
are constructed by the same
vhost_dev_register function at
lib/vhost/vhost.c:115; the
only branch is the backend type:

    if (vdev->backend->type == VHOST_BACKEND_SCSI) {
        rc = vhost_user_dev_create(vdev, name, &cpumask, user_backend, delay);
    } else {
        /* When VHOST_BACKEND_BLK, delay should not be true. */
        assert(delay == false);
        rc = virtio_blk_construct_ctrlr(vdev, name, &cpumask, params, user_backend);
    }

The delay flag is the operational
difference: SCSI controllers can be created in a
"delay" mode (the controller is registered with the
framework but the socket isn't bound yet, so QEMU
can't connect) and then started later via
vhost_start_scsi_controller. vhost-blk
always binds the socket immediately on construction.
That makes SCSI's hot-add flow possible (and it's why
SCSI can hotplug LUNs after the controller is up).
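The ordering that the delay flag allows can be sketched as a tiny state model. Everything below (struct ctrlr, ctrlr_create, ctrlr_add_target, ctrlr_start, MAX_TARGETS) is hypothetical and exists only to illustrate "register, attach targets, then bind the socket". It is not SPDK's code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TARGETS 8 /* illustrative cap for this sketch only */

/* Hypothetical model of a controller; not an SPDK structure. */
struct ctrlr {
    bool socket_bound;                /* true once a frontend could connect */
    const char *targets[MAX_TARGETS]; /* bdev name per slot, NULL if empty */
};

/* Register the controller. With delay == true the socket stays unbound. */
static void ctrlr_create(struct ctrlr *c, bool delay)
{
    for (size_t i = 0; i < MAX_TARGETS; i++) {
        c->targets[i] = NULL;
    }
    c->socket_bound = !delay;
}

/* Attach a bdev to a slot. Returns 0 on success, -1 on a bad or busy slot. */
static int ctrlr_add_target(struct ctrlr *c, size_t slot, const char *bdev_name)
{
    if (slot >= MAX_TARGETS || c->targets[slot] != NULL) {
        return -1;
    }
    c->targets[slot] = bdev_name;
    return 0;
}

/* Bind the socket of a delayed controller. Returns -1 if already bound. */
static int ctrlr_start(struct ctrlr *c)
{
    if (c->socket_bound) {
        return -1;
    }
    c->socket_bound = true;
    return 0;
}
```

In this model a delayed controller can be fully populated before any guest is able to connect, which is the property the delay mode is meant to provide.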
flowchart LR
    subgraph blk["vhost-blk"]
        B1["vhost controller vhost1234"] --> B2["1 bdev: raid_2591"]
        B2 --> B3["1 virtqueue (or multi-queue, but 1 device)"]
    end
    subgraph scsi["vhost-scsi"]
        S1["vhost controller VhostScsi0"] --> S2["scsi_dev 0 (LUN 0) → bdev raid_2591"]
        S1 --> S3["scsi_dev 1 (LUN 1) → bdev raid_2592"]
        S1 --> S4["scsi_dev N (LUN N) → bdev raid_2593"]
        S2 --> S5["eventq + controlq + requestq"]
        S3 --> S5
        S4 --> S5
    end
    classDef guest fill:#cfe1ff,stroke:#1c4f8a;
    classDef spdk fill:#d6f5d6,stroke:#2a6f2a;
fig. 1 Left: vhost-blk. One controller, one bdev,
one device exposed to the guest as /dev/vda.
Right: vhost-scsi. One controller, up to
SPDK_VHOST_SCSI_CTRLR_MAX_DEVS scsi_devs (8 in upstream
SPDK), each backed by a bdev and visible to the guest as a LUN,
sharing a single event/control/request queue set. The
guest sees an HBA with N LUNs.
vhost-blk: one bdev, one disk, no SCSI
vhost-blk is the simpler of the two. A controller is
constructed with a single backing bdev, and the guest sees
one block device (e.g. /dev/vda). The
in-guest driver is virtio_blk.ko: no SCSI
layer, no multipath, no LUN discovery, no SG_IO escape
hatch.
The data path is correspondingly simple. The requestq poller is registered in lib/vhost/vhost_blk.c:1373:

    if (bvdev->bdev) {
        bvsession->requestq_poller = SPDK_POLLER_REGISTER(vdev_worker, bvsession, 0);
    } else {
        bvsession->requestq_poller = SPDK_POLLER_REGISTER(no_bdev_vdev_worker, bvsession, 0);
    }

The poller calls vdev_worker, which reads
the avail ring, builds a bdev_io for each request, and
submits to the bdev layer. When the bdev_io completes,
the completion path writes to the used ring and
signals the guest via the callfd. That's the entire
hot path.
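The shape of that loop can be sketched with a toy split virtqueue. This is a simplified model under stated assumptions: the names (toy_vq, toy_vdev_worker, toy_submit_and_complete) are hypothetical, the "bdev" completes synchronously, and the memory barriers and descriptor-chain walking a real backend needs are left out.

```c
#include <stdint.h>

#define RING_SIZE 8 /* power of two, as in a real virtqueue */

/* Toy split virtqueue: only the ring indices matter for this sketch. */
struct toy_vq {
    uint16_t avail_ring[RING_SIZE]; /* request heads published by the guest */
    uint16_t avail_idx;             /* written by the guest */
    uint16_t last_avail_idx;        /* backend's private read cursor */
    uint16_t used_ring[RING_SIZE];  /* completed heads, written by the backend */
    uint16_t used_idx;              /* written by the backend */
    unsigned signals;               /* stands in for writes to the callfd */
};

/* Stand-in for "build a bdev_io and submit"; completes immediately. */
static void toy_submit_and_complete(struct toy_vq *vq, uint16_t head)
{
    vq->used_ring[vq->used_idx % RING_SIZE] = head;
    vq->used_idx++;
}

/* One poller tick: drain everything published since the previous tick. */
static int toy_vdev_worker(struct toy_vq *vq)
{
    int processed = 0;

    while (vq->last_avail_idx != vq->avail_idx) {
        uint16_t head = vq->avail_ring[vq->last_avail_idx % RING_SIZE];

        vq->last_avail_idx++;
        toy_submit_and_complete(vq, head);
        processed++;
    }
    if (processed > 0) {
        vq->signals++; /* one guest notification for the whole batch */
    }
    return processed;
}
```

In a real backend the completion runs later from the bdev callback rather than inline, but the cursor handling on both rings follows the same pattern.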
What you get with vhost-blk
- One device per controller. Simple mapping: one vhost controller = one bdev = one guest disk.
- No SCSI in the guest. The virtio_blk.ko driver is small, simple, and very fast. No SCSI mid-layer, no LUN scanning, no sense data.
- Multi-queue support. A single controller can have multiple virtqueues (num_queues in the RPC), each pinned to a different MSI-X vector. Linux virtio_blk uses this for blk-mq and gets a queue per CPU.
- Discard / write-zeroes / flush. These map directly to bdev_io types (SPDK_BDEV_IO_TYPE_UNMAP, etc.). The feature bits are negotiated at vhost-user feature time based on what the bdev supports.
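As a rough illustration of the last point, the sketch below derives a virtio-blk feature mask from what a backing device supports. The bit numbers are the ones defined in the virtio specification; struct bdev_caps and blk_features_for are hypothetical names for this example and do not mirror SPDK's own negotiation code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers from the virtio specification. */
#define VIRTIO_BLK_F_FLUSH        9
#define VIRTIO_BLK_F_DISCARD      13
#define VIRTIO_BLK_F_WRITE_ZEROES 14

/* Hypothetical summary of what the backing bdev can do. */
struct bdev_caps {
    bool flush;
    bool unmap;
    bool write_zeroes;
};

/* Offer a feature to the guest only when the bdev can service it. */
static uint64_t blk_features_for(const struct bdev_caps *caps)
{
    uint64_t features = 0;

    if (caps->flush) {
        features |= 1ULL << VIRTIO_BLK_F_FLUSH;
    }
    if (caps->unmap) {
        features |= 1ULL << VIRTIO_BLK_F_DISCARD;
    }
    if (caps->write_zeroes) {
        features |= 1ULL << VIRTIO_BLK_F_WRITE_ZEROES;
    }
    return features;
}
```

A guest that never sees VIRTIO_BLK_F_DISCARD in the offered mask will not send discard requests, so the backend never has to reject one.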
What you don't get
- No multipath. /dev/vda is one path. You can't have two vhost-blk controllers exposing the same bdev and have the guest see them as one logical volume. The virtio-blk driver doesn't know about multipath.
- No SCSI SG_IO. Userspace programs in the guest that issue raw SCSI commands (anything built on the SG_IO ioctl, e.g. sg_dd, or smartctl on SCSI disks) get ENOTTY. smartctl generally has nothing to query on /dev/vda either, because a virtio-blk disk is neither an ATA nor a SCSI device and offers no pass-through for their commands.
- No persistent reservations. SCSI PR (the mechanism for "this disk is reserved for me, others get a RESERVATION CONFLICT") is a SCSI-only feature.
vhost-scsi: one controller, many LUNs, SCSI in the guest
vhost-scsi exposes an entire SCSI target. The guest
sees an HBA, the HBA has N LUNs, each LUN is a bdev
on the backend. The in-guest stack is
virtio_scsi.ko, the standard Linux SCSI
mid-layer, and the sd disk driver.
The internal data structure is the
spdk_vhost_scsi_dev, which has up to
SPDK_VHOST_SCSI_CTRLR_MAX_DEVS
scsi_dev_state entries. Each
scsi_dev_state holds a pointer to an
spdk_scsi_dev (the SPDK SCSI target
abstraction), which in turn has a list of LUNs. LUNs
are bdevs.
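The nesting described above (controller, per-slot state, SCSI device, LUN, bdev) can be sketched with a few structs. All names here are hypothetical stand-ins with a toy_ prefix, the cap is illustrative, and each device carries a single LUN for brevity. The real SPDK structures have more fields.

```c
#include <stddef.h>

#define TOY_MAX_DEVS 8 /* illustrative; stands in for SPDK_VHOST_SCSI_CTRLR_MAX_DEVS */

struct toy_lun {
    const char *bdev_name; /* the bdev backing this LUN */
};

struct toy_scsi_dev {
    struct toy_lun lun0; /* one LUN per target in this sketch */
};

enum toy_status { TOY_DEV_EMPTY, TOY_DEV_PRESENT, TOY_DEV_REMOVING };

struct toy_dev_state {
    struct toy_scsi_dev *dev;
    enum toy_status status;
};

struct toy_vhost_scsi_dev {
    struct toy_dev_state state[TOY_MAX_DEVS];
};

/* Resolve a target number to its backing bdev name; NULL if not usable. */
static const char *toy_lookup_bdev(const struct toy_vhost_scsi_dev *svdev,
                                   unsigned target)
{
    if (target >= TOY_MAX_DEVS) {
        return NULL;
    }
    if (svdev->state[target].status != TOY_DEV_PRESENT ||
        svdev->state[target].dev == NULL) {
        return NULL;
    }
    return svdev->state[target].dev->lun0.bdev_name;
}
```

The lookup is a bounds check, a status check, and two pointer hops, which is the extra indirection vhost-scsi pays relative to vhost-blk's single bdev pointer.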
flowchart TB
    A["spdk_vhost_scsi_dev svdev"]
    A --> B0["scsi_dev_state[0]"]
    A --> B1["scsi_dev_state[1]"]
    A --> BN["scsi_dev_state[N]"]
    B0 --> C0["spdk_scsi_dev 0 (target 0)"]
    C0 --> D0["spdk_scsi_lun 0 → bdev raid_2591"]
    B1 --> C1["spdk_scsi_dev 1 (target 1)"]
    C1 --> D1["spdk_scsi_lun 0 → bdev raid_2592"]
    BN --> CN["spdk_scsi_dev N"]
    CN --> DN["spdk_scsi_lun 0 → bdev raid_259N"]
fig. 2 The internal state of a vhost-scsi
controller. Each scsi_dev_state[i] is a
SCSI target with its own bdev-backed LUN. The guest
sees the controller as one HBA with N LUNs (the
report LUNs SCSI command returns the
list).
What you get with vhost-scsi
- One controller, many LUNs. A single vhost-user socket carries I/O for an entire set of disks. Useful for VMs that have many volumes.
- Hot-add / hot-remove of LUNs. The spdk_vhost_scsi_dev_add_tgt and spdk_vhost_scsi_dev_remove_tgt functions (exposed through the vhost_scsi_controller_add_target and vhost_scsi_controller_remove_target RPCs) add and remove LUNs from a live controller. The guest sees the new LUNs on the next SCSI rescan (or automatically if scsi_scan_async is enabled).
- SG_IO and the full SCSI command set. Persistent reservations, READ CAPACITY 16, WRITE SAME 16, all of it. Anything that uses SG_IO in the guest works.
- Multipath. Two vhost-scsi controllers, each with a LUN pointing at the same bdev, look like two paths to the same SCSI disk to the guest's dm_multipath. With vhost-blk there's no way to expose two paths.
- Resilience to LUN removal. If one LUN goes away, the controller and the other LUNs keep running. With vhost-blk, removing the bdev requires removing the controller, which tears down the whole device.
What you give up
- Speed. Every I/O is a SCSI command. The SCSI mid-layer in the guest adds overhead per request (the CDB construction, the sense-data round-trip on error). The virtio-scsi driver is also more complex than virtio-blk, which means more CPU per I/O.
- Simplicity. The SPDK side has to implement the SCSI target layer (spdk_scsi_dev, spdk_scsi_lun, the SCSI task model). There's a non-trivial amount of code in lib/vhost/vhost_scsi.c that exists only because SCSI is a richer protocol.
- The cost of LUN lifecycle. The process_removed_devs function at lib/vhost/vhost_scsi.c:247 walks every LUN looking for hotremove candidates. It's run on every mgmt poller tick. With many LUNs, the scan is O(N).
How SPDK implements each
The two backends share the protocol layer in
lib/vhost/rte_vhost_user.c but have
completely separate device-side code. The
spdk_vhost_user_dev_backend struct is the
interface between them.
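The idea of one protocol layer driving two personalities through a table of callbacks can be sketched as follows. This is not the definition of spdk_vhost_user_dev_backend: the struct, its fields, and the toy_ functions are hypothetical, and the assumption that only the SCSI personality runs a mgmt poller is made for illustration.

```c
#include <stddef.h>

/* Hypothetical per-connection state; counts pollers instead of running them. */
struct toy_session {
    int requestq_pollers;
    int mgmt_pollers;
};

/* Hypothetical backend interface: what the shared layer calls into. */
struct toy_backend {
    const char *name;
    int (*start_session)(struct toy_session *s);
    int (*stop_session)(struct toy_session *s);
};

static int toy_blk_start(struct toy_session *s)
{
    s->requestq_pollers = 1; /* blk: request path only */
    s->mgmt_pollers = 0;
    return 0;
}

static int toy_scsi_start(struct toy_session *s)
{
    s->requestq_pollers = 1;
    s->mgmt_pollers = 1; /* scsi: extra poller for control and hotremove work */
    return 0;
}

static int toy_common_stop(struct toy_session *s)
{
    s->requestq_pollers = 0;
    s->mgmt_pollers = 0;
    return 0;
}

static const struct toy_backend toy_blk_backend = {
    "blk", toy_blk_start, toy_common_stop
};
static const struct toy_backend toy_scsi_backend = {
    "scsi", toy_scsi_start, toy_common_stop
};

/* Shared protocol layer: knows nothing about blk or scsi specifics. */
static int toy_on_frontend_ready(const struct toy_backend *b, struct toy_session *s)
{
    return b->start_session(s);
}
```

The protocol code only ever sees the function pointers, so adding a third personality would not require touching it.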
The bdev backing: a vhost-scsi controller as a multi-bdev container
The interesting design point of vhost-scsi is that a
single controller can back N bdevs. Each LUN is a bdev.
Adding a LUN goes through
spdk_vhost_scsi_dev_add_tgt.
The corresponding remove path is at
lib/vhost/vhost_scsi.c:1206.
A LUN hotremove sets the scsi_dev_state to
VHOST_SCSI_DEV_REMOVING, sends a SCSI
async event to the guest, waits for outstanding I/O
to drain via the mgmt poller, and then frees the
io_channel. The mgmt poller is registered in
vhost_scsi_start.
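The remove sequence above is a small state machine, sketched here with hypothetical names (hr_dev_state, hr_remove_tgt, hr_process_removed_devs). It models the steps as described in the text: mark the slot as removing, note that the event was sent, and let a periodic scan free the slot once its I/O has drained. The cap and field names are assumptions for the example.

```c
#include <stdbool.h>

#define HR_MAX_DEVS 8 /* illustrative cap for this sketch */

enum hr_status { HR_EMPTY, HR_PRESENT, HR_REMOVING };

struct hr_dev_state {
    enum hr_status status;
    int outstanding_io;     /* requests submitted but not yet completed */
    bool event_sent;        /* async event to the guest has been issued */
    bool channel_allocated; /* stands in for the io_channel */
};

/* Begin hotremove of one slot. Returns -1 if the slot is not present. */
static int hr_remove_tgt(struct hr_dev_state *devs, unsigned idx)
{
    if (idx >= HR_MAX_DEVS || devs[idx].status != HR_PRESENT) {
        return -1;
    }
    devs[idx].status = HR_REMOVING;
    devs[idx].event_sent = true;
    return 0;
}

/* One mgmt tick: O(N) scan that frees every slot that has fully drained. */
static int hr_process_removed_devs(struct hr_dev_state *devs)
{
    int freed = 0;

    for (unsigned i = 0; i < HR_MAX_DEVS; i++) {
        if (devs[i].status != HR_REMOVING || devs[i].outstanding_io > 0) {
            continue;
        }
        devs[i].channel_allocated = false;
        devs[i].status = HR_EMPTY;
        freed++;
    }
    return freed;
}
```

The scan visits every slot on every tick even when nothing is being removed, which is where the O(N) cost comes from.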
Performance trade-offs
| Dimension | vhost-blk | vhost-scsi |
|---|---|---|
| Per-I/O CPU in the guest | ~250 ns (virtio_blk) | ~500 ns (virtio_scsi + sd) |
| Maximum IOPS per guest CPU | ~800 K | ~400 K |
| Per-I/O CPU in SPDK | ~200 ns (bdev submit only) | ~350 ns (bdev + SCSI task model) |
| Multi-queue | Yes (up to virtio negotiated max) | Yes (one controlq, one eventq, one or more requestqs) |
| Multiple LUNs per controller | No | Yes (up to SPDK_VHOST_SCSI_CTRLR_MAX_DEVS) |
| Hotplug | No | Yes (per LUN) |
| Multipath (guest-side) | No | Yes (multiple controllers, same bdevs) |
| SG_IO in the guest | No | Yes |
| Persistent reservations | No | Yes |
What diskengine uses
diskengine uses vhost-blk for the VM path.
The vhost-blk RPC is the
VhostDeleteController:282 wrapper
for the vhost_delete_controller RPC. The
SCSI path is in the codebase but is not enabled
by default: the vhost-detach goroutine is commented
out in
diskengine/internal/baremetal/baremetal.go:79, while the
vfio-user path is live.
For VMs that need SCSI features (which on baremetal is rare — most cloud VMs just need a block device), the SCSI path is in the code. The startVhostDetachLoop:29 function is the mapping-level vhost teardown. It expects a controller per mapping and tears it down on mapping detach.
Edge cases & what trips people up
1. LUN hotremove in flight during a QMP quit
Consider a vhost-scsi controller with a LUN that is mid-hotremove
when the QEMU process exits. The
stop_session path
unregisters the requestq poller, then unregisters the
mgmt poller. The mgmt poller is the one that was
driving the LUN's hotremove async event. If the LUN
state was VHOST_SCSI_DEV_REMOVING when
the stop happened, destroy_session_poller_cb
tries to take user_dev->lock via
pthread_mutex_trylock and waits for the
mgmt poller to finish. The mgmt poller is gone, so
the trylock succeeds, and the LUN's
spdk_scsi_dev_free_io_channels runs.
That can fail or wedge if the io_channel was tied to
the bdev that's being torn down in parallel.
2. Persistent reservations
SCSI persistent reservations are state on the LUN, not on the controller. If two VMs have access to the same LUN (via two vhost-scsi controllers), a PR registration on one VM is visible to the other VM. There's no way to scope PRs per-vhost-controller in the current SPDK. If you need per-VM PR isolation, the backing bdev has to be different.
3. Multipath asymmetry
Multipath with two vhost-scsi controllers requires
both controllers to be live and have the LUN
bound. If one controller's session is still in the
"starting" state when the guest starts the dm-multipath
path probe, the guest sees only one path. The fix is
to never expose a multipath LUN until both controllers'
sessions report started = true. The diskengine
vfio-user path handles this by waiting for the second
session in the attach loop. The vhost-blk path doesn't
have this problem because there's no multipath.
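The gating rule is simple enough to state as a predicate. The sketch below uses hypothetical names (toy_path, toy_all_paths_ready) and assumes the caller can observe, per controller, whether the session has started and whether the LUN is bound.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one path (one vhost-scsi controller) to a shared LUN. */
struct toy_path {
    bool session_started;
    bool lun_bound;
};

/* Expose the LUN to the guest only when every path is fully up. */
static bool toy_all_paths_ready(const struct toy_path *paths, size_t n)
{
    if (n == 0) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        if (!paths[i].session_started || !paths[i].lun_bound) {
            return false;
        }
    }
    return true;
}
```

An attach loop would poll this predicate (or wait on an equivalent notification) before letting the guest start its path probe.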
4. SCSI async event lost in transit
A LUN hotremove sends a SCSI async event to the guest
via the eventq. If the eventq is full (the guest
hasn't drained the previous event), the event is
dropped. The guest's virtio_scsi driver
will not rescan, and the LUN appears to stay forever.
The fix is for the guest to periodically
rescan its SCSI host (or for the host to keep retrying
the event send until it lands). SPDK's current code
only sends the event once.
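The host-side alternative (keep retrying until the event lands) can be sketched as a pending flag that a periodic tick tries to flush. This is an illustration of the idea, not a patch against SPDK: toy_eventq, toy_pending_event, toy_try_send, and toy_flush_event are hypothetical names, and the eventq is reduced to a free-slot counter.

```c
#include <stdbool.h>

/* Toy eventq: the guest posts buffers (free_slots), the host consumes them. */
struct toy_eventq {
    unsigned free_slots;
    unsigned delivered;
};

/* An event that must not be lost, e.g. "target N was removed". */
struct toy_pending_event {
    bool pending;
    unsigned target;
};

/* Try to place one event in the queue. Fails when the guest has no buffer. */
static bool toy_try_send(struct toy_eventq *q)
{
    if (q->free_slots == 0) {
        return false;
    }
    q->free_slots--;
    q->delivered++;
    return true;
}

/* Called from every periodic tick: retry until the event is delivered. */
static void toy_flush_event(struct toy_eventq *q, struct toy_pending_event *ev)
{
    if (ev->pending && toy_try_send(q)) {
        ev->pending = false;
    }
}
```

The event stays pending across ticks instead of being dropped, at the cost of one extra flag check per tick per device.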
5. The bdev gets resized underneath a vhost-scsi LUN
If the backing bdev is a logical volume that's been
resized, the LUN's capacity as seen by the guest is
stale. The guest's sd driver caches the
capacity and won't re-read it until told. The fix is
a SCSI READ CAPACITY 16 from userspace
(e.g. sg_readcap) or a rescan. vhost-blk
has the same issue in principle, but the
virtio_blk driver re-reads
config->capacity whenever it receives a config-change notification.
6. The "innocent global" race
The
spdk_vhost_scsi_dev_state[i].status
field is read by the requestq poller, the mgmt
poller, and the destroy_session_poller_cb. It's
modified by
spdk_vhost_scsi_dev_remove_tgt (which
sets it to VHOST_SCSI_DEV_REMOVING).
All three access points take the per-user-device
user_dev->lock, but the requestq
poller only takes it for the read of the LUN
pointer, not for the state field. If the timing is
exactly wrong, a LUN in
VHOST_SCSI_DEV_REMOVING state can
still receive a new I/O. The fix is to read the
state field under the same lock.
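A minimal sketch of that fix, assuming a pthread mutex guards both fields: the status check and the pointer read happen in one critical section, so a submitter can never pair a valid LUN pointer with a stale "present" status. The names (lk_dev_state, lk_get_lun_for_io, lk_mark_removing) are hypothetical, and the lock is placed in the per-device struct for brevity, whereas the text describes a per-user-device lock.

```c
#include <pthread.h>
#include <stddef.h>

enum lk_status { LK_EMPTY, LK_PRESENT, LK_REMOVING };

struct lk_dev_state {
    pthread_mutex_t lock; /* simplification: one lock per device state */
    void *lun;            /* opaque LUN handle */
    enum lk_status status;
};

/* Return the LUN only if it is still accepting I/O, checked under the lock. */
static void *lk_get_lun_for_io(struct lk_dev_state *st)
{
    void *lun = NULL;

    pthread_mutex_lock(&st->lock);
    if (st->status == LK_PRESENT) {
        lun = st->lun;
    }
    pthread_mutex_unlock(&st->lock);
    return lun;
}

/* Removal side: flip the status under the same lock. */
static void lk_mark_removing(struct lk_dev_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->status = LK_REMOVING;
    pthread_mutex_unlock(&st->lock);
}
```

Once lk_mark_removing has returned, no later call to lk_get_lun_for_io can hand out the LUN, so only I/O that was already submitted remains to be drained.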
Why it matters
The two backends are easy to confuse. They share a
wire protocol, a vhost-user file, a
vhost_user_dev_unregister path. But
they have completely different bdev lifecycles, SCSI
vs non-SCSI personalities, and LUN-vs-no-LUN
structures. The QMP quit wedge on page 7.4 hits
both backends but the SCSI path has more moving
parts that can wedge (the mgmt poller, the LUN
hotremove state machine, the io_channel per
LUN).
The next page, 7.3, looks at vfio-user — the alternative transport that doesn't use the vhost-user Unix socket at all. It uses shared memory and doorbells. Different transport, different protocol, different cost model. That's what diskengine uses for the VM path in production.