The slow write, the fast write, and why they differ.
A clone's first write to a cluster shared with its parent snapshot is a copy-on-write: the cluster is allocated from the free pool, the user's data is written to the new cluster, and the clone's extent table is updated to point at the new cluster. From the clone's perspective, the cluster now has its own copy. From the snapshot's perspective, nothing changed. This page is the inside of that operation: the structures, the metadata writes, and the failure modes.
- The CoW data structure: the extent table
- What happens on first write to a shared cluster
- What happens on read of an unwritten cluster
- The metadata write: a small md_page
- Random-write performance: why snapshots are slower
- CoW and delete: refcounting clusters
- Inflate, decouple, and the snapshot-merge story
- Edge cases & what trips people up
The CoW data structure: the extent table
The CoW machinery is built on a single data structure: the blob's extent table. The extent table maps the lvol's logical cluster indices to physical cluster indices in the lvstore. A "0" entry in the table means "unallocated" (the cluster doesn't have a backing cluster yet). A non-zero entry means "this logical cluster is backed by physical cluster N."
The extent table is run-length-encoded (RLE) on disk. The on-disk descriptor is SPDK_MD_DESCRIPTOR_TYPE_EXTENT_RLE (lib/blob/blobstore.h:283), an array of (cluster_idx, length) pairs. A snapshot's extent table is a copy of the parent's. A clone's extent table is initially empty; it grows as CoW happens.
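The run-length encoding described above can be sketched as a toy model. This is an illustration of the idea only, not the on-disk layout: the function names are hypothetical, and it assumes a run of allocated clusters covers consecutive physical indices while a run of unallocated clusters is stored with cluster_idx 0.

```python
def rle_encode(clusters):
    """Collapse a per-logical-cluster list of physical indices into
    (cluster_idx, length) runs. A value of 0 means unallocated."""
    runs = []
    for phys in clusters:
        if runs:
            start, length = runs[-1]
            zero_run = start == 0 and phys == 0
            next_in_run = start != 0 and phys == start + length
            if zero_run or next_in_run:
                runs[-1] = (start, length + 1)  # extend the current run
                continue
        runs.append((phys, 1))  # start a new run
    return runs


def rle_decode(runs):
    """Expand (cluster_idx, length) runs back into a flat list."""
    clusters = []
    for start, length in runs:
        for i in range(length):
            clusters.append(0 if start == 0 else start + i)
    return clusters


table = [5, 6, 7, 0, 0, 12]
encoded = rle_encode(table)  # three runs: 5..7, two unallocated, 12
```

A mostly-empty clone encodes to very few runs, which is why the RLE form stays small until CoW starts scattering allocations.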
For very large lvols, the extent table itself doesn't fit in one metadata page. The blobstore handles this with two more descriptor types: SPDK_MD_DESCRIPTOR_TYPE_EXTENT_TABLE (lib/blob/blobstore.h:288) and SPDK_MD_DESCRIPTOR_TYPE_EXTENT_PAGE (lib/blob/blobstore.h:293). The table points to multiple extent pages; each extent page holds a chunk of the cluster index array. The choice of RLE vs. extent_table is a per-blob decision and is detected automatically based on size (see use_extent_table in the blob struct).
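The two-level layout can be pictured as a table-then-page lookup. The sketch below is a simplified model under stated assumptions: ENTRIES_PER_PAGE is a made-up tiny value, the dictionary of pages and the function name are hypothetical, and treating a table entry of 0 as "no extent page yet" is an assumption of this sketch.

```python
ENTRIES_PER_PAGE = 4  # hypothetical; kept tiny so the example is easy to follow


def lookup(extent_table, extent_pages, logical_cluster):
    """Resolve a logical cluster via table -> page -> entry.
    Returns the physical cluster index, or 0 for unallocated."""
    page_slot, offset = divmod(logical_cluster, ENTRIES_PER_PAGE)
    page_id = extent_table[page_slot]
    if page_id == 0:
        # Assumption: no extent page exists, so the whole chunk is unallocated.
        return 0
    return extent_pages[page_id][offset]


extent_table = [3, 0]                   # slot 0 -> page 3, slot 1 -> no page
extent_pages = {3: [10, 11, 0, 40]}     # page 3 holds logical clusters 0..3
```

Here logical cluster 1 resolves to physical cluster 11, while logical cluster 5 lands in a slot with no extent page and resolves to 0.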
What happens on first write to a shared cluster
The bdev layer's lvol_write function (module/bdev/lvol/vbdev_lvol.c:941) hands the request to the blobstore. The blobstore's spdk_blob_io_writev_ext is the actual workhorse.
For an unallocated cluster in a thin blob, the write has to allocate a cluster before any data can land.
The first write to a shared cluster (i.e. a cluster that the clone is reading from its parent snapshot) takes a slightly different path. The clone's own extent table entry is still 0, so reads of that cluster resolve to the snapshot's cluster, but the act of writing means the clone needs its own copy. The sequence is: allocate a cluster from the free pool, write the data, update the extent table.
Once the cluster is allocated, the user's data is DMA'd to it. Then the metadata update: the clone's blob md_page that contains the affected extent table entry is queued for a write. The metadata write is a single 4 KiB page write to the metadata area. After the metadata write completes, the I/O is reported as complete.
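The allocate, write, update sequence can be modelled in a few lines. This is a toy in-memory sketch, not the blobstore's code path: the Clone class, its free_pool list, and the md_writes counter are all hypothetical names used for illustration.

```python
import errno


class Clone:
    """Toy model of a clone's write path; 0 in extents means unallocated."""

    def __init__(self, num_clusters, free_pool):
        self.extents = [0] * num_clusters  # logical -> physical cluster
        self.free_pool = free_pool         # free physical cluster indices
        self.data = {}                     # physical cluster -> payload
        self.md_writes = 0                 # metadata page writes issued

    def write(self, logical, payload):
        if self.extents[logical] == 0:
            if not self.free_pool:
                raise OSError(errno.ENOSPC, "no free clusters")
            phys = self.free_pool.pop(0)   # 1. allocate from the free pool
            self.data[phys] = payload      # 2. write the user's data
            self.extents[logical] = phys   # 3. point the extent table at it
            self.md_writes += 1            #    persisted as one metadata write
        else:
            # Already private: plain overwrite, no metadata update.
            self.data[self.extents[logical]] = payload


clone = Clone(num_clusters=8, free_pool=[100, 101])
clone.write(7, b"first")   # triggers CoW
clone.write(7, b"second")  # hits the already-allocated cluster
```

Only the first write to cluster 7 bumps md_writes; the second write reuses physical cluster 100.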
What happens on read of an unwritten cluster
A read of a cluster that the clone has never written to falls through to the parent snapshot. The blobstore handles this in spdk_blob_io_readv_ext: if the extent table entry is 0 (unallocated), the read is forwarded to the parent and the parent's data is returned.
In other words, the clone is a sparse view of the parent. Reads from unallocated regions of the clone read from the parent. Reads from allocated regions of the clone read from the clone's own cluster. The framework doesn't have to know this is happening — it's all internal to the blobstore.
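The fall-through rule is small enough to state as a function. This is a simplified sketch with hypothetical names; for brevity it indexes the parent's data by logical cluster rather than modelling the parent's own extent table.

```python
def read_cluster(clone_extents, clone_data, parent_data, logical):
    """Return the payload for a logical cluster of the clone."""
    phys = clone_extents[logical]
    if phys == 0:
        # Unallocated in the clone: the read is served by the parent.
        return parent_data[logical]
    # Allocated in the clone: the read is served by the clone's own cluster.
    return clone_data[phys]


clone_extents = [0, 100]            # cluster 0 shared, cluster 1 private
clone_data = {100: b"clone-1"}
parent_data = [b"snap-0", b"snap-1"]
```

Reading cluster 0 returns the snapshot's bytes, and reading cluster 1 returns the clone's private copy.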
[Figure 1: flowchart with two panels. Before the first write to cluster 7: snapshot 'snap1' (extents [1..256]) maps cluster 7 to physical cluster 7 (shared, read-only); clone 'cl1' (extents empty, all reads fall through to snap1) reads cluster 7 from that same physical cluster. After the first write to cluster 7: snapshot 'snap1' (extents [1..256]) is unchanged and still maps cluster 7 to physical cluster 7; clone 'cl1' maps cluster 7 to physical cluster 100, its private copy.]
fig. 1 Before the CoW, the clone has no private copy of cluster 7 and reads from the snapshot. After the CoW, the clone owns physical cluster 100 (allocated from the free pool) and reads/writes go there. The snapshot is unchanged.
The metadata write: a small md_page
The metadata update for a CoW is tiny — typically a single 4 KiB md_page. The page contains the affected blob's metadata chain, of which the extent table is one descriptor. The exact size depends on the extent table's layout: a small lvol may have its entire extent table in one md_page; a large lvol with an extent_table has the table in one md_page and the actual extents in many extent_pages.
The metadata write is part of the bdev_io's lifetime. The blobstore doesn't return success to the bdev layer until both the data write and the metadata write have completed. The bdev layer doesn't return success to the original submitter until the blobstore's callback fires. The full chain is described in lib/blob/blobstore.c:4535.
In diskengine's free-bytes sync, this matters for one reason: a CoW that fails partway through (data write succeeds, metadata write fails) leaves the lvstore in an inconsistent state. The cluster is allocated to the clone (so FreeClusters has been decremented) but the extent table doesn't reflect the new cluster. On the next load, the cluster will be reclaimed by the used_clusters bit pool rebuild. The clean-shutdown bit in the super block is cleared in this case, so the load is forced to validate everything.
Random-write performance: why snapshots are slower
A random write to a non-shared cluster in a regular lvol is cheap: one cluster allocation, one data write, one metadata write. A random write to a shared cluster in a clone costs the same (one allocation, one data write, one metadata write); the only difference is that the data DMA goes to a newly allocated cluster owned by the clone, not a pre-allocated one.
Where CoW hurts is when most of the writes land on previously-unwritten clusters. A workload that touches every cluster of a 1 TiB clone will eventually allocate 256K clusters (which corresponds to a 4 MiB cluster size), and the first write to each of those clusters pays the allocation and the metadata update. Subsequent writes to an already-allocated cluster are fast.
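The 256K figure follows from simple division. The sketch below assumes a 4 MiB cluster size, which is the value implied by 1 TiB mapping to 256K clusters; the helper name is hypothetical.

```python
TIB = 1 << 40
MIB = 1 << 20
CLUSTER_SIZE = 4 * MIB  # assumption: the size implied by "1 TiB = 256K clusters"


def clusters_in(volume_bytes, cluster_bytes=CLUSTER_SIZE):
    """Number of clusters needed to cover a volume (rounded up)."""
    return -(-volume_bytes // cluster_bytes)


first_writes = clusters_in(TIB)  # one CoW per cluster, worst case
```

In the worst case, a full sweep of the clone therefore pays the first-write cost 262,144 times, once per cluster.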
CoW and delete: refcounting clusters
A cluster is freed only when the last blob that references it is deleted. The blobstore's used_clusters bit pool is the source of truth: a cluster is in the pool iff at least one blob references it. The deletion of a blob's cluster is implemented in spdk_bs_free_cluster (lib/blob/blobstore.c:150), which clears the cluster's bit in the pool.
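The "freed only when the last reference goes away" rule can be modelled with sets. This is a toy sketch of the invariant as stated above, not of the blobstore's internals: the blobs dictionary, the used_clusters set, and the function name are hypothetical.

```python
def delete_blob(blobs, used_clusters, name):
    """Remove a blob and clear the bit of every cluster that no
    remaining blob references."""
    clusters = blobs.pop(name)
    still_referenced = set().union(*blobs.values())
    for cluster in clusters - still_referenced:
        used_clusters.discard(cluster)  # analogous to clearing the pool bit


blobs = {"a": {1, 2}, "b": {2, 3}}
used_clusters = {1, 2, 3}
delete_blob(blobs, used_clusters, "a")  # cluster 2 survives: "b" still uses it
```

After deleting "a", only cluster 1 is released; cluster 2 stays marked until "b" is deleted as well.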
The lvol layer's spdk_lvol_deletable (lib/lvol/lvol.c:1870) checks for clones. The bdev-layer _vbdev_lvol_destroy (module/bdev/lvol/vbdev_lvol.c:650) calls spdk_blob_get_clones (module/bdev/lvol/vbdev_lvol.c:663) and refuses to delete if the clone count is > 1. A clone count of 1 means "this blob has no clones": the count includes the blob itself.
Diskengine's delete path relies on this: deleting a snapshot whose children have been deleted is the same operation as deleting a regular lvol. The blobstore handles the refcounting.
Inflate, decouple, and the snapshot-merge story
Two RPCs give you a way out of the "my clone's clusters are shared with a snapshot" situation:
| RPC | What it does | Cost |
|---|---|---|
| bdev_lvol_inflate | For a thin lvol, allocates a private cluster for every currently-unallocated cluster in the extent table. After this, the lvol is effectively thick (no fall-through to parent for reads). | O(size): every unallocated cluster triggers a CoW. Can be slow for large lvols. |
| bdev_lvol_decouple_parent | Removes the parent link. Subsequent reads of unallocated clusters return zeros (or unmapped behavior), not the parent's data. | Cheap: just clears metadata. |
Inflate is the "I want a real copy, fully independent" path. After inflate, no reads fall through; the lvol owns clusters for every index. This is what you call before a destructive operation (re-encrypt, scrub, reformat) on a clone that came from a snapshot.
Decouple is the "I want to forget the parent ever existed" path. The lvol stays thin, but its parent reference is gone. Reads of unallocated clusters return zeros. This is the lighter-weight alternative to inflate when the goal is just to break the link.
There is no native "merge" or "flatten" operation that combines a snapshot's data into a single lvol. The closest equivalent is to inflate the snapshot and then delete the original, but that doubles the storage cost briefly, and the lvstore needs to have enough free clusters to hold the inflated snapshot. For an in-place merge of large snapshots, the standard pattern is: create a new empty thick lvol, copy the data over with spdk_lvol_shallow_copy or external dd over NBD, then delete the snapshot.
Edge cases & what trips people up
Concurrent writes to the same cluster in parent and child
The parent is read-only, so there are no concurrent writes to the parent. Concurrent writes to the same cluster in the child are serialized by the blobstore's per-channel lock: the lvol's spdk_io_channel is per-thread, and the blobstore ensures that at most one write to a given cluster is in flight per channel. The first write triggers CoW; subsequent writes to the same cluster hit the already-allocated cluster and don't.
CoW metadata runs out of space
The CoW needs to write the new extent table to a metadata page. If the metadata area is full, the allocation fails with -ENOSPC. The lvol's data is still consistent (the cluster was allocated, the data was written, but the metadata wasn't updated). The next load may report the lvol as inconsistent. The fix is to grow the lvstore's metadata area (spdk_lvs_grow, requires the underlying bdev to have spare space) or delete some lvols to free up metadata pages.
Read-only snapshots
A snapshot is read-only by metadata. The vbdev_lvol_io_type_supported function (module/bdev/lvol/vbdev_lvol.c:843) consults spdk_blob_is_read_only to reject writes, unmaps, and write-zeroes. Reads, resets, seeks are all allowed. The framework will fail a write at submit time with SPDK_BDEV_IO_STATUS_FAILED; the request never reaches the blobstore.
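The gating logic described above amounts to a small predicate. The sketch below is a Python model of that rule, not a transcription of vbdev_lvol_io_type_supported: the IoType enum and its member names are hypothetical stand-ins for the bdev I/O types mentioned in the text.

```python
from enum import Enum


class IoType(Enum):
    READ = "read"
    WRITE = "write"
    UNMAP = "unmap"
    WRITE_ZEROES = "write_zeroes"
    RESET = "reset"
    SEEK_DATA = "seek_data"
    SEEK_HOLE = "seek_hole"


# I/O types that would modify the blob and are rejected on a snapshot.
MUTATING = {IoType.WRITE, IoType.UNMAP, IoType.WRITE_ZEROES}


def io_type_supported(io_type, blob_is_read_only):
    """Mutating I/O is supported only when the blob is writable."""
    if io_type in MUTATING:
        return not blob_is_read_only
    return True  # reads, resets, and seeks are always allowed
```

Because the check runs before submission, a rejected write on a snapshot never has to be unwound inside the blobstore.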
Crash recovery: the clean-shutdown bit
A clean shutdown sets the clean bit in the super block to 1 (lib/blob/blobstore.c:574). A crash leaves it 0. On load, the blobstore doesn't immediately fail; it just marks the lvstore as dirty and may rebuild the used_clusters bit pool from scratch. The rebuild is in bs_recover (lib/blob/blobstore.c:9880) and is safe: it walks every metadata page and re-marks every cluster that's referenced by any blob.
What is not safe is a torn metadata page. If the crash happened during a CoW metadata write, the affected page may have a torn CRC, and the load will reject the page. The blobstore's response is conservative: the affected blob is marked as having an inconsistent metadata chain, and reads of clusters owned by that blob may return errors. The safe answer is to delete the affected blob and restore from a snapshot.
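The rebuild of the used_clusters pool can be sketched as a walk over every blob's extent table. This is a toy model of the idea only; the function name and the dictionary of extent lists are hypothetical, and 0 is treated as unallocated as elsewhere on this page.

```python
def rebuild_used_clusters(blob_extents):
    """Re-derive the set of used physical clusters from metadata alone."""
    used = set()
    for extents in blob_extents.values():
        used.update(phys for phys in extents if phys != 0)
    return used


blob_extents = {
    "snap1": [1, 2, 3],
    "cl1": [0, 100, 0],  # cluster 101 was allocated but its metadata never landed
}
used = rebuild_used_clusters(blob_extents)
```

Cluster 101 is referenced by no extent table, so it is not re-marked and becomes free again, which is the reclamation behaviour described for a CoW whose metadata write failed.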
Inflate of a deeply-shared clone: cluster exhaustion
Inflate of a clone of size N on an lvstore with fewer than N free clusters will fail midway. The clusters allocated so far are owned by the clone (the inflate is doing the CoW as it goes), so the lvstore is now in a state where some clusters are owned by the partially-inflated clone. The inflate cannot be resumed; the only options are to delete the clone (which frees the inflated clusters) or to grow the lvstore. Diskengine's provisioning loop is aware of this; the resize path (processInPlaceResizes:43) checks free cluster counts before issuing the resize.
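The partial-failure behaviour, and the value of checking free clusters first, can be shown with a toy model. Everything here is a hypothetical sketch: the function names, the list-based free pool, and the guard are illustrations, not diskengine's or SPDK's code.

```python
import errno


def clusters_needed(extents):
    """Inflate allocates one cluster per unallocated (0) entry."""
    return sum(1 for phys in extents if phys == 0)


def inflate(extents, free_pool):
    """Allocate as it goes; if the pool runs dry it raises midway and
    keeps whatever it has already allocated."""
    for logical, phys in enumerate(extents):
        if phys == 0:
            if not free_pool:
                raise OSError(errno.ENOSPC, "free pool exhausted during inflate")
            extents[logical] = free_pool.pop(0)


def safe_inflate(extents, free_pool):
    """Hypothetical guard: refuse up front instead of failing partway."""
    if clusters_needed(extents) > len(free_pool):
        return False
    inflate(extents, free_pool)
    return True
```

With extents [0, 9, 0, 0] and only two free clusters, inflate raises after consuming both, leaving the clone half-inflated, while safe_inflate returns False and touches nothing.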
What to take away
CoW is a per-cluster allocation + metadata update. The expensive part is the metadata page write, not the data write — the data DMA is the same cost as a regular write. A workload that touches every cluster of a clone pays a one-time metadata cost per cluster; subsequent writes are cheap. The blobstore refcounts clusters in its used_clusters bit pool; a cluster is freed only when the last blob that references it is deleted. Read-only snapshots are enforced at the bdev layer's io_type_supported check, not in the blobstore itself. Inflate and decouple are the two ways out of a shared state; there is no native merge. The next page, 5.4 — diskengine lvol operations traced, is the diskengine side: how the Go code drives all of this from outside.