Layer 7 · vhost / virtio / VFIO-user

vhost-scsi vs vhost-blk.

Same wire protocol, two device personalities. vhost-blk exposes a single block device — one virtio-blk disk, one bdev on the backend. vhost-scsi exposes an entire SCSI target — one virtio-scsi HBA, many LUNs, each a bdev on the backend. The trade is simplicity against flexibility.

~12 min read · 2 diagrams · prerequisite: 7.1
On this page
  1. The two flavours and the personality they expose to the guest
  2. vhost-blk: one bdev, one disk, no SCSI in the guest
  3. vhost-scsi: one controller, many LUNs, SCSI in the guest
  4. How SPDK implements each (lib/vhost/vhost_blk.c, lib/vhost/vhost_scsi.c)
  5. The bdev backing: a vhost-scsi controller as a multi-bdev LUN container
  6. Performance trade-offs
  7. What diskengine uses and why
  8. Edge cases: LUN hotremove, persistent reservations, multipath

The two flavours

A vhost-user socket carries I/O for one virtio device. The protocol doesn't care what kind of device. The control messages are the same. The data plane is the same. The only thing that changes is the device "personality" — the way the backend interprets the I/O requests and turns them into bdev operations.

vhost-blk and vhost-scsi are the two personalities SPDK ships. Both live in lib/vhost/. Both register a spdk_vhost_user_dev_backend via the dispatch table at lib/vhost/vhost.c:434. Both are constructed by the same vhost_dev_register function at lib/vhost/vhost.c:115; the only branch is the backend type:

if (vdev->backend->type == VHOST_BACKEND_SCSI) {
    rc = vhost_user_dev_create(vdev, name, &cpumask, user_backend, delay);
} else {
    /* When VHOST_BACKEND_BLK, delay should not be true. */
    assert(delay == false);
    rc = virtio_blk_construct_ctrlr(vdev, name, &cpumask, params, user_backend);
}

The delay flag is the operational difference. A SCSI controller can be created in "delay" mode: the controller is registered with the framework but its socket isn't bound yet, so QEMU can't connect until the controller is started later via vhost_start_scsi_controller. vhost-blk always binds the socket immediately on construction. The deferred start is what lets a SCSI controller be populated with targets before any guest connects; hotplugging LUNs after the controller is up goes through the per-target add and remove calls.
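The create-then-start split can be modelled in a few lines. This is a minimal sketch, not SPDK code: the `ctrlr` struct and the `ctrlr_create`, `ctrlr_add_target`, and `ctrlr_start` functions are hypothetical names invented for illustration, and "binding the socket" is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a vhost controller; not an SPDK type. */
enum backend_type { BACKEND_BLK, BACKEND_SCSI };

struct ctrlr {
    enum backend_type type;
    bool socket_bound;   /* true once a guest could connect */
    int num_targets;
};

/* Create a controller. Only the SCSI personality may defer its socket. */
static int ctrlr_create(struct ctrlr *c, enum backend_type type, bool delay)
{
    assert(c != NULL);
    if (type == BACKEND_BLK && delay) {
        return -1;               /* blk always binds immediately */
    }
    c->type = type;
    c->num_targets = 0;
    c->socket_bound = !delay;
    return 0;
}

/* Targets can be attached whether or not the socket is bound yet. */
static int ctrlr_add_target(struct ctrlr *c)
{
    assert(c != NULL);
    if (c->type != BACKEND_SCSI) {
        return -1;
    }
    c->num_targets++;
    return 0;
}

/* Bind the socket of a delayed controller; starting twice is an error. */
static int ctrlr_start(struct ctrlr *c)
{
    assert(c != NULL);
    if (c->socket_bound) {
        return -1;
    }
    c->socket_bound = true;
    return 0;
}
```

In this model a delayed SCSI controller accepts targets while `socket_bound` is still false, and a blk controller rejects `delay` outright, which mirrors the assert in the excerpt above.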

flowchart LR
subgraph blk["vhost-blk"]
  B1["vhost controller vhost1234"] --> B2["1 bdev: raid_2591"]
  B2 --> B3["1 virtqueue (or multi-queue, but 1 device)"]
end

subgraph scsi["vhost-scsi"]
  S1["vhost controller VhostScsi0"] --> S2["scsi_dev 0 (LUN 0) → bdev raid_2591"]
  S1 --> S3["scsi_dev 1 (LUN 1) → bdev raid_2592"]
  S1 --> S4["scsi_dev N (LUN N) → bdev raid_2593"]
  S2 --> S5["eventq + controlq + requestq"]
  S3 --> S5
  S4 --> S5
end

classDef guest fill:#cfe1ff,stroke:#1c4f8a;
classDef spdk fill:#d6f5d6,stroke:#2a6f2a;
fig. 1: the two device topologies

fig. 1   Left: vhost-blk. One controller, one bdev, one device exposed to the guest as /dev/vda. Right: vhost-scsi. One controller, up to SPDK_VHOST_SCSI_CTRLR_MAX_DEVS scsi_devs (SPDK's hard cap, 8 in upstream SPDK), each a bdev, each a LUN to the guest, sharing a single event/control/request queue set. The guest sees an HBA with N LUNs.

vhost-blk: one bdev, one disk, no SCSI

vhost-blk is the simpler of the two. A controller is constructed with a single backing bdev, and the guest sees one block device per controller (e.g. /dev/vda). The in-guest driver is virtio_blk.ko: no SCSI layer, no multipath, no LUN discovery, no SG_IO escape hatch.

The data path is correspondingly simple. The requestq poller is registered in lib/vhost/vhost_blk.c:1373:

if (bvdev->bdev) {
    bvsession->requestq_poller = SPDK_POLLER_REGISTER(vdev_worker, bvsession, 0);
} else {
    bvsession->requestq_poller = SPDK_POLLER_REGISTER(no_bdev_vdev_worker, bvsession, 0);
}

The poller calls vdev_worker, which reads the avail ring, builds a bdev_io for each request, and submits to the bdev layer. When the bdev_io completes, the completion path writes to the used ring and signals the guest via the callfd. That's the entire hot path.
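The shape of that loop can be shown with a toy split virtqueue. This is a simplified sketch, not SPDK's vdev_worker: the `toy_vq` struct and its functions are hypothetical, descriptor chains and memory barriers are omitted, the bdev submission is completed inline instead of asynchronously, and the callfd write is replaced by a counter.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   /* power of two, as split rings require */

/* Simplified split virtqueue: indices only, no descriptor table. */
struct toy_vq {
    uint16_t avail_ring[RING_SIZE];
    uint16_t avail_idx;        /* written by the guest */
    uint16_t last_avail_idx;   /* backend's private read cursor */
    uint16_t used_ring[RING_SIZE];
    uint16_t used_idx;         /* written by the backend */
    unsigned kicks;            /* stands in for writes to the callfd */
};

/* Guest side: publish one request, identified by its descriptor head. */
static void guest_submit(struct toy_vq *vq, uint16_t head)
{
    /* The guest must not overrun requests the backend has not read yet. */
    assert((uint16_t)(vq->avail_idx - vq->last_avail_idx) < RING_SIZE);
    vq->avail_ring[vq->avail_idx % RING_SIZE] = head;
    vq->avail_idx++;
}

/* Completion path: write the used ring, then signal the guest. */
static void complete_request(struct toy_vq *vq, uint16_t head)
{
    vq->used_ring[vq->used_idx % RING_SIZE] = head;
    vq->used_idx++;
    vq->kicks++;
}

/* One poller iteration: drain everything the guest made available.
 * A real backend would submit a bdev I/O here and complete it later.
 * Returns the number of requests handled. */
static int poll_once(struct toy_vq *vq)
{
    int handled = 0;

    while (vq->last_avail_idx != vq->avail_idx) {
        uint16_t head = vq->avail_ring[vq->last_avail_idx % RING_SIZE];
        vq->last_avail_idx++;
        complete_request(vq, head);
        handled++;
    }
    return handled;
}
```

The poller never blocks: an empty avail ring simply makes `poll_once` return 0, which is why the backend can run it in a busy loop on a dedicated core.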

What you get with vhost-blk

  • One device per controller. Simple mapping: one vhost controller = one bdev = one guest disk.

  • No SCSI in the guest. The virtio_blk.ko driver is small, simple, and very fast. No SCSI mid-layer, no LUN scanning, no sense data.

  • Multi-queue support. A single controller can have multiple virtqueues (num_queues in the RPC), each pinned to a different MSI-X vector. Linux virtio_blk uses this for blk-mq and gets a queue per CPU.

  • Discard / write-zeroes / flush. These map directly to bdev_io types (SPDK_BDEV_IO_TYPE_UNMAP, etc.). The feature bits are negotiated at vhost-user feature time based on what the bdev supports.
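The feature negotiation in the last two bullets can be sketched as a pair of bitmask operations. The feature bit numbers below come from the virtio-blk specification; the `bdev_caps` struct and both functions are hypothetical helpers for illustration and do not reproduce SPDK's actual feature code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers from the virtio-blk device specification. */
#define VIRTIO_BLK_F_FLUSH        9
#define VIRTIO_BLK_F_MQ           12
#define VIRTIO_BLK_F_DISCARD      13
#define VIRTIO_BLK_F_WRITE_ZEROES 14

/* Hypothetical summary of what the backing bdev can do. */
struct bdev_caps {
    bool flush;
    bool unmap;
    bool write_zeroes;
};

/* Offer only the features the backing device can actually serve. */
static uint64_t blk_offered_features(const struct bdev_caps *caps,
                                     unsigned num_queues)
{
    uint64_t f = 0;

    assert(caps != NULL && num_queues >= 1);
    if (caps->flush)        f |= 1ULL << VIRTIO_BLK_F_FLUSH;
    if (caps->unmap)        f |= 1ULL << VIRTIO_BLK_F_DISCARD;
    if (caps->write_zeroes) f |= 1ULL << VIRTIO_BLK_F_WRITE_ZEROES;
    if (num_queues > 1)     f |= 1ULL << VIRTIO_BLK_F_MQ;
    return f;
}

/* The negotiated set is the intersection of both sides. */
static uint64_t negotiate(uint64_t offered, uint64_t guest_acked)
{
    return offered & guest_acked;
}
```

A bdev without unmap support never has the discard bit offered, so the guest driver never issues discards that the backend would have to fail.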

What you don't get

  • No multipath. /dev/vda is one path. You can't have two vhost-blk controllers exposing the same bdev and have the guest see them as one logical volume. The virtio-blk driver doesn't know about multipath.

  • No SCSI SG_IO. Userspace programs in the guest that issue raw SCSI commands through the SG_IO ioctl (e.g. sg_dd, or smartctl against SCSI disks) get ENOTTY. A virtio-blk disk sits behind neither libata nor the SCSI mid-layer, so there is no ATA or SCSI pass-through for such tools to use on /dev/vda.

  • No persistent reservations. SCSI PR (the mechanism for "this disk is reserved for me, others get a RESERVATION CONFLICT") is a SCSI feature with no virtio-blk equivalent.

vhost-scsi: one controller, many LUNs, SCSI in the guest

vhost-scsi exposes an entire SCSI target. The guest sees an HBA, the HBA has N LUNs, each LUN is a bdev on the backend. The in-guest stack is virtio_scsi.ko, the standard Linux SCSI mid-layer, and the sd disk driver.

The internal data structure is the spdk_vhost_scsi_dev, which has up to SPDK_VHOST_SCSI_CTRLR_MAX_DEVS scsi_dev_state entries. Each scsi_dev_state holds a pointer to an spdk_scsi_dev (the SPDK SCSI target abstraction), which in turn has a list of LUNs. Each LUN is backed by a bdev.

flowchart TB
A["spdk_vhost_scsi_dev svdev"]
A --> B0["scsi_dev_state[0]"]
A --> B1["scsi_dev_state[1]"]
A --> BN["scsi_dev_state[N]"]
B0 --> C0["spdk_scsi_dev 0 (target 0)"]
C0 --> D0["spdk_scsi_lun 0 → bdev raid_2591"]
B1 --> C1["spdk_scsi_dev 1 (target 1)"]
C1 --> D1["spdk_scsi_lun 0 → bdev raid_2592"]
BN --> CN["spdk_scsi_dev N"]
CN --> DN["spdk_scsi_lun 0 → bdev raid_259N"]
fig. 2: the vhost-scsi internal state

fig. 2   The internal state of a vhost-scsi controller. Each scsi_dev_state[i] is a SCSI target with its own bdev-backed LUN. The guest sees the controller as one HBA with N LUNs (the REPORT LUNS SCSI command returns the list).
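The containment shown in fig. 2 amounts to a fixed array of slots, each optionally pointing at a target whose LUN 0 names a bdev. The sketch below uses hypothetical `toy_` types and is not the SPDK layout; `MAX_DEVS` stands in for SPDK_VHOST_SCSI_CTRLR_MAX_DEVS and its value here is only illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8   /* illustrative stand-in for the SPDK cap */

struct toy_lun        { const char *bdev_name; };
struct toy_scsi_dev   { struct toy_lun lun0; };          /* one target, LUN 0 */
struct toy_dev_state  { struct toy_scsi_dev *dev; };     /* NULL when empty */
struct toy_ctrlr      { struct toy_dev_state slots[MAX_DEVS]; };

/* Attach a target to a free slot; fails if the slot is taken or invalid. */
static int ctrlr_attach(struct toy_ctrlr *c, unsigned target,
                        struct toy_scsi_dev *dev)
{
    assert(c != NULL && dev != NULL);
    if (target >= MAX_DEVS || c->slots[target].dev != NULL) {
        return -1;
    }
    c->slots[target].dev = dev;
    return 0;
}

/* Resolve a request's target number to its backing bdev name, or NULL. */
static const char *ctrlr_route(const struct toy_ctrlr *c, unsigned target)
{
    assert(c != NULL);
    if (target >= MAX_DEVS || c->slots[target].dev == NULL) {
        return NULL;
    }
    return c->slots[target].dev->lun0.bdev_name;
}
```

Every request on the shared requestq carries a target number, so routing is one bounds check and one pointer dereference before the I/O reaches the bdev layer.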

What you get with vhost-scsi

  • One controller, many LUNs. A single vhost-user socket carries I/O for an entire set of disks. Useful for VMs that have many volumes.

  • Hot-add / hot-remove of LUNs. The spdk_vhost_scsi_dev_add_tgt and spdk_vhost_scsi_dev_remove_tgt calls (reached through the vhost_scsi_controller_add_target and vhost_scsi_controller_remove_target RPCs) add and remove LUNs from a live controller. The guest sees the new LUNs on the next SCSI rescan (or automatically if scsi_scan_async is enabled).

  • SG_IO and the full SCSI command set. SG_IO passthrough, persistent reservations, READ CAPACITY 16, WRITE SAME 16, all of it. Anything that uses SG_IO in the guest works.

  • Multipath. Two vhost-scsi controllers, each with a LUN pointing at the same bdev, look like two paths to the same SCSI disk to the guest's dm_multipath. With vhost-blk there's no way to expose two paths.

  • Resilience to LUN removal. If one LUN goes away, the controller and the other LUNs keep running. With vhost-blk, removing the bdev requires removing the controller, which tears down the whole device.

What you give up

  • Speed. Every I/O is a SCSI command. The SCSI mid-layer in the guest adds overhead per request (the CDB construction, the sense-data round-trip on error). The virtio-scsi driver is also more complex than virtio-blk, which means more CPU per I/O.

  • Simplicity. The SPDK side has to implement the SCSI target layer (spdk_scsi_dev, spdk_scsi_lun, the SCSI task model). There's a non-trivial amount of code in lib/vhost/vhost_scsi.c:1 that exists only because SCSI is a richer protocol.

  • The cost of LUN lifecycle. The process_removed_devs function at lib/vhost/vhost_scsi.c:247 walks every LUN looking for hotremove candidates. It's run on every mgmt poller tick. With many LUNs, the scan is O(N).

How SPDK implements each

The two backends share the protocol layer in lib/vhost/rte_vhost_user.c but have completely separate device-side code. The spdk_vhost_user_dev_backend struct is the interface between them.

The bdev backing: a vhost-scsi controller as a multi-bdev container

The interesting design point of vhost-scsi is that a single controller can be backed by N bdevs. Each LUN is a bdev. Adding a LUN goes through spdk_vhost_scsi_dev_add_tgt.

The corresponding remove path is at lib/vhost/vhost_scsi.c:1206. A LUN hotremove sets the scsi_dev_state to VHOST_SCSI_DEV_REMOVING, sends a SCSI async event to the guest, waits for outstanding I/O to drain via the mgmt poller, and then frees the io_channel. The mgmt poller is at lib/vhost/vhost_scsi.c:1471 (the registration call in vhost_scsi_start).
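The remove sequence is a small state machine driven by a periodic scan. The following is a hedged sketch with hypothetical names (`slot`, `slot_remove`, `mgmt_tick`); it models the drain-then-free ordering described above and is not a copy of the SPDK implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum slot_status { SLOT_EMPTY, SLOT_PRESENT, SLOT_REMOVING };

struct slot {
    enum slot_status status;
    unsigned inflight;     /* I/Os submitted but not yet completed */
    bool event_sent;       /* hot-remove event queued for the guest */
    bool channel_open;     /* stands in for the LUN's io_channel */
};

/* Management request: begin removing a target that is present. */
static int slot_remove(struct slot *s)
{
    if (s->status != SLOT_PRESENT) {
        return -1;
    }
    s->status = SLOT_REMOVING;
    s->event_sent = true;
    return 0;
}

/* An I/O on this slot finished. */
static void slot_io_done(struct slot *s)
{
    assert(s->inflight > 0);
    s->inflight--;
}

/* Management poller tick: O(n) scan over all slots. A slot is freed
 * only once it is REMOVING and fully drained. Returns slots freed. */
static int mgmt_tick(struct slot *slots, unsigned n)
{
    int freed = 0;

    for (unsigned i = 0; i < n; i++) {
        struct slot *s = &slots[i];
        if (s->status != SLOT_REMOVING || s->inflight > 0) {
            continue;
        }
        s->channel_open = false;   /* release the channel last */
        s->status = SLOT_EMPTY;
        freed++;
    }
    return freed;
}
```

Because the free happens in the poller and not in the remove call, a remove request returns immediately and the slot lingers in REMOVING for as many ticks as the slowest outstanding I/O needs.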

Performance trade-offs

Dimension | vhost-blk | vhost-scsi
Per-I/O CPU in the guest | ~250 ns (virtio_blk) | ~500 ns (virtio_scsi + sd)
Maximum IOPS per guest CPU | ~800 K | ~400 K
Per-I/O CPU in SPDK | ~200 ns (bdev submit only) | ~350 ns (bdev + SCSI task model)
Multi-queue | Yes (up to virtio negotiated max) | Single multiqueue eventq/controlq/requestq set
Multiple LUNs per controller | No | Yes (up to SPDK_VHOST_SCSI_CTRLR_MAX_DEVS)
Hotplug | No | Yes (per LUN)
Multipath (guest-side) | No | Yes (multiple controllers, same bdevs)
SG_IO in the guest | No | Yes
Persistent reservations | No | Yes
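The IOPS and per-I/O rows can be related with simple arithmetic. The helpers below are a hypothetical sketch that takes the table's approximate figures at face value: an IOPS ceiling on one CPU implies a total time budget per I/O, and a component's per-I/O cost is some fraction of that budget.

```c
#include <assert.h>

/* Per-I/O time budget in nanoseconds implied by an IOPS figure
 * sustained on a single CPU. */
static double ns_per_io(double iops)
{
    assert(iops > 0.0);
    return 1e9 / iops;
}

/* Fraction of that budget consumed by a component that costs
 * cost_ns nanoseconds per I/O. */
static double budget_share(double cost_ns, double iops)
{
    return cost_ns / ns_per_io(iops);
}
```

With the table's numbers, `ns_per_io(800e3)` is 1250 ns and `ns_per_io(400e3)` is 2500 ns, so the guest driver cost is about a fifth of the per-I/O budget in both columns (250 of 1250 ns, 500 of 2500 ns). On these figures the rest of the budget is spent outside the driver proper.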

What diskengine uses

On the vhost side, diskengine uses vhost-blk for the VM path. Controller teardown goes through the VhostDeleteController:282 wrapper around the vhost_delete_controller RPC. The SCSI path is in the codebase but is not enabled by default: the vhost-detach goroutine is commented out in diskengine/internal/baremetal/baremetal.go:79, while the vfio-user path is live.

VMs that need SCSI features are rare on baremetal, since most cloud VMs just need a block device; for those that do, the SCSI path is in the code. The startVhostDetachLoop:29 function is the mapping-level vhost teardown: it expects one controller per mapping and tears it down on mapping detach.

Edge cases & what trips people up

1. LUN hotremove in flight during a QMP quit

Consider a vhost-scsi controller with a LUN that is mid-hotremove when the QEMU process exits. The stop_session path at lib/vhost/vhost_scsi.c:1542 unregisters the requestq poller, then unregisters the mgmt poller. The mgmt poller is the one that was driving the LUN's hotremove async event. If the LUN state was VHOST_SCSI_DEV_REMOVING when the stop happened, the destroy_session_poller_cb at lib/vhost/vhost_scsi.c:1477 tries to take user_dev->lock via pthread_mutex_trylock and waits for the mgmt poller to finish. The mgmt poller is gone, so the trylock succeeds, and the LUN's spdk_scsi_dev_free_io_channels runs. That can fail or wedge if the io_channel was tied to the bdev that's being torn down in parallel.

2. Persistent reservations

SCSI persistent reservations are state on the LUN, not on the controller. If two VMs have access to the same LUN (via two vhost-scsi controllers), a PR registration on one VM is visible to the other VM. There's no way to scope PRs per-vhost-controller in the current SPDK. If you need per-VM PR isolation, the backing bdev has to be different.
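Why the registration leaks across VMs follows directly from where the state lives. The sketch below is a hypothetical model (the `toy_lun_pr` and `toy_ctrlr_view` types are invented for illustration): the key table hangs off the LUN, and each controller only holds a pointer to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGS 4

/* Reservation state lives on the LUN, so every controller that maps
 * the LUN shares it. A key of 0 marks an unused entry. */
struct toy_lun_pr {
    uint64_t keys[MAX_REGS];
};

/* What one vhost-scsi controller holds: just a reference to the LUN. */
struct toy_ctrlr_view {
    struct toy_lun_pr *lun;
};

/* Register a key through one controller. Returns 0, or -1 if full. */
static int pr_register(struct toy_ctrlr_view *v, uint64_t key)
{
    assert(v != NULL && v->lun != NULL && key != 0);
    for (size_t i = 0; i < MAX_REGS; i++) {
        if (v->lun->keys[i] == 0) {
            v->lun->keys[i] = key;
            return 0;
        }
    }
    return -1;
}

/* Check for a key through any controller that maps the same LUN. */
static int pr_is_registered(const struct toy_ctrlr_view *v, uint64_t key)
{
    assert(v != NULL && v->lun != NULL);
    for (size_t i = 0; i < MAX_REGS; i++) {
        if (v->lun->keys[i] == key) {
            return 1;
        }
    }
    return 0;
}
```

Two views that point at the same `toy_lun_pr` see each other's registrations; giving each VM its own backing object is the only way to separate them in this model, which matches the advice above.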

3. Multipath asymmetry

Multipath with two vhost-scsi controllers requires both controllers to be live and have the LUN bound. If one controller's session is still in the "starting" state when the guest starts the dm-multipath path probe, the guest sees only one path. The fix is to never expose a multipath LUN until both controllers' sessions report started = true. The diskengine vfio-user path handles this by waiting for the second session in the attach loop. The vhost-blk path doesn't have this problem because there's no multipath.
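The gating rule is a conjunction over all paths. A minimal sketch, with a hypothetical `path` struct and `multipath_ready` helper that are not part of SPDK or diskengine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path view: one entry per vhost-scsi controller. */
struct path {
    bool session_started;
    bool lun_bound;
};

/* A multipath LUN is safe to expose only when every path is fully up. */
static bool multipath_ready(const struct path *paths, size_t n)
{
    assert(paths != NULL);
    if (n < 2) {
        return false;          /* fewer than two paths is not multipath */
    }
    for (size_t i = 0; i < n; i++) {
        if (!paths[i].session_started || !paths[i].lun_bound) {
            return false;
        }
    }
    return true;
}
```

An attach loop would poll this predicate and only then make the LUN visible to the guest, so the first path probe already finds every path.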

4. SCSI async event lost in transit

A LUN hotremove sends a SCSI async event to the guest via the eventq. If the eventq is full (the guest hasn't drained the previous event), the event is dropped. The guest's virtio_scsi driver will not rescan, and the LUN appears to stay forever. The fix is for the guest to periodically trigger a rescan (or for the host to keep retrying the event send until it lands). SPDK's current code only sends the event once.
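The host-side alternative, retrying until the event lands, needs only one bit of state per LUN. This is a sketch of that idea with hypothetical names (`eventq`, `lun_slot`, `notify_remove`, `retry_pending`); it is not how SPDK behaves today, which per the text above sends the event once.

```c
#include <assert.h>
#include <stdbool.h>

#define EVENTQ_DEPTH 2u

struct eventq   { unsigned free_bufs; };    /* buffers the guest has posted */
struct lun_slot { bool event_pending; };    /* event still owed to the guest */

/* Try to place one event; fails when the guest has posted no buffer. */
static bool eventq_try_send(struct eventq *q)
{
    assert(q->free_bufs <= EVENTQ_DEPTH);
    if (q->free_bufs == 0) {
        return false;
    }
    q->free_bufs--;
    return true;
}

/* Request a hot-remove event; remember it if the queue is full
 * so that it is not silently dropped. */
static void notify_remove(struct eventq *q, struct lun_slot *s)
{
    s->event_pending = !eventq_try_send(q);
}

/* Called from a periodic poller: retry an event that did not land. */
static void retry_pending(struct eventq *q, struct lun_slot *s)
{
    if (s->event_pending && eventq_try_send(q)) {
        s->event_pending = false;
    }
}
```

The retry costs one branch per LUN per tick while nothing is pending, so it fits into an existing management scan without a separate timer.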

5. The bdev gets resized underneath a vhost-scsi LUN

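If the backing bdev is a logical volume that's been resized, the LUN's capacity as seen by the guest is stale. The guest's sd driver caches the capacity and won't re-read it until told. The fix is a device rescan in the guest (for example writing 1 to /sys/block/sdX/device/rescan); a READ CAPACITY 16 from userspace (e.g. sg_readcap) shows the new size the target reports but does not by itself update the cached value. vhost-blk is less exposed, because the virtio_blk driver re-reads config->capacity when the device signals a config change.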

6. The "innocent global" race

The spdk_vhost_scsi_dev_state[i].status field is read by the requestq poller, the mgmt poller, and the destroy_session_poller_cb. It's modified by spdk_vhost_scsi_dev_remove_tgt (which sets it to VHOST_SCSI_DEV_REMOVING). All three access points take the per-user-device user_dev->lock, but the requestq poller only takes it for the read of the LUN pointer, not for the state field. If the timing is exactly wrong, a LUN in VHOST_SCSI_DEV_REMOVING state can still receive a new I/O. The fix is to read the state field under the same lock.
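The proposed fix is to read the status and the LUN pointer inside one critical section. A minimal sketch with hypothetical names (`tgt_slot`, `tgt_get_for_io`, `tgt_begin_remove`), using a plain pthread mutex in place of user_dev->lock:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum tgt_status { TGT_EMPTY, TGT_PRESENT, TGT_REMOVING };

struct tgt_slot {
    pthread_mutex_t lock;
    enum tgt_status status;
    void *lun;               /* opaque handle to the backing LUN */
};

/* Read status and pointer under the same lock, so a concurrent remove
 * cannot slip in between the two reads. Returns the LUN, or NULL when
 * new I/O must be rejected. */
static void *tgt_get_for_io(struct tgt_slot *s)
{
    void *lun = NULL;

    pthread_mutex_lock(&s->lock);
    if (s->status == TGT_PRESENT) {
        lun = s->lun;
    }
    pthread_mutex_unlock(&s->lock);
    return lun;
}

/* Management path: flip the slot to REMOVING under the lock. */
static void tgt_begin_remove(struct tgt_slot *s)
{
    pthread_mutex_lock(&s->lock);
    assert(s->status == TGT_PRESENT);
    s->status = TGT_REMOVING;
    pthread_mutex_unlock(&s->lock);
}
```

This only closes the admission window. I/O that was admitted before the status flipped is still in flight and has to be drained before the io_channel is released.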

Why it matters

The two backends are easy to confuse. They share a wire protocol, a vhost-user file, a vhost_user_dev_unregister path. But they have completely different bdev lifecycles, SCSI vs non-SCSI personalities, and LUN-vs-no-LUN structures. The QMP quit wedge on page 7.4 hits both backends but the SCSI path has more moving parts that can wedge (the mgmt poller, the LUN hotremove state machine, the io_channel per LUN).

The next page, 7.3, looks at vfio-user — the alternative transport that doesn't use the vhost-user Unix socket at all. It uses shared memory and doorbells. Different transport, different protocol, different cost model. That's what diskengine uses for the VM path in production.