Layer 9 · Operations & debugging

The bugs that actually broke production.

Every other page in this curriculum is forward-looking: here is the design, here is the contract, here is the code path. This page is backward-looking. It covers the four real bug classes (plus one bonus SEGV) that hit baremetal ex9 between 2026-05-18 and 2026-05-25: the symptoms we saw, the root causes we found, the tools we used to find them, and the rules to follow so they do not happen again. If you come back to this curriculum after a bad day on a target, start here. The page ends with a one-page debug playbook: a flowchart you can run on a wedged target without thinking.

~20 min read · 1 diagram (the playbook) · prerequisites: 4.3 · 7.4 · 9.1

On this page

The four bug classes — at a glance
Bug 1: The VFIO/QMP quit wedge
Bug 2: The forkbomb + terminate race
Bug 3: The crash recovery flow
Bug 4: Live migration issues
Bug 5: The bdev_get_bdevs SEGV (bonus)
The debug playbook: SPDK is wedged, what now?
Edge cases & what trips people up

The four bug classes — at a glance

Bug class	What you see	Curriculum layer	Recovery
VFIO/QMP quit wedge	Stale vfio-user export; `nvmf_subsystem_remove_ns` blocks; fresh VMs stuck in `CREATING`	7.4	SPDK service restart
Forkbomb + terminate race	Live QEMU, API `TERMINATING`; no escalation; QEMU never dies	7.3	QMP `quit` with deadline
SPDK crash recovery	All vhost-user sockets die; VMs hung forever; no auto-reconnect	7.1	systemd marker + restart vmengine
Live migration	VFIO migration data copy hangs; large pages cannot iterate	7.4	vfio-user migration; or no-migration
`bdev_get_bdevs` SEGV (bonus)	SPDK segfaults on a read-only RPC after QEMU death under active IO	4.2	Patch `nvme_rdma_ctrlr_get_memory_domains`

Bug 1: The VFIO/QMP quit wedge

What happened

On 2026-05-22, the following sequence on baremetal ex9 wedged the entire create path. The repro, in full, is at VFIO/QMP quit wedge repro:1:

Started direct root-disk IO inside VM 11263.
Triggered guest kernel panic via sysrq.
Sent QMP quit to the QEMU monitor socket.
Watched diskengine try to detach the VM's SPDK NVMf/vfio-user subsystem.
Tried to create a new VM. The root RAID came online; the vfio-user export was never created; /var/diskengine/vfio-user/<vm_id>/cntrl never appeared; QEMU never launched.

The result, in diskengine's logs:

raid detach: vol 2591 spdk gate failed (namespace raid_2591 still present
  in nqn.2026-02.io.excloud:vm:11263); deferring

And the stuck VM in the API:

11265 / i-e53f8f6d80a9  CREATING
/var/diskengine/vfio-user/11265/cntrl: missing
/var/vmengine/vms/11265: missing
QEMU for 11265: missing
nqn.2026-02.io.excloud:vm:11265: missing

Why it happened

The working theory, captured in the incident report:

guest panic + active root IO + QMP quit
-> vfio-user endpoint disconnect while IO accounting is still outstanding
-> NVMf subsystem pause never completes
-> remove_ns/delete_subsystem RPC blocks
-> diskengine VFIO detach pass never returns
-> fresh attach/export work is starved
-> new VM reaches RAID online but no cntrl socket/QEMU

The root cause, in the curriculum's terms, is a missing pause-counter accounting path. The NVMf subsystem pause in lib/nvmf/nvmf.c waits while management IO or namespace IO is outstanding. When QEMU dies while IO is outstanding, the vfio-user transport's tran_sock_get_request_header returns ENOTCONN, and the quiesce path in module/vfu_device/vfu_virtio.c returns -EBUSY while io_outstanding is non-zero. The bdev IO that the count is waiting for never completes because the QEMU side is gone. The pause callback never fires. The remove_ns RPC never returns.

The diskengine side compounds this: it does not have a timeout on the JSON-RPC call, so a single stuck RPC pins a goroutine forever, and the next iteration of the vfio-user attach loop is starved behind it.
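To make the missing timeout concrete, here is a minimal sketch of a deadline-bounded JSON-RPC call over a Unix socket. It is not diskengine's real client: `callWithDeadline`, `rpcRequest`, `rpcResponse`, and the stand-in server in `hungCallTimesOut` are hypothetical names, and the 200 ms timeout is chosen only to keep the demo short. The stand-in server accepts the connection and never replies, which is the behaviour described for the wedged `nvmf_subsystem_remove_ns`.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
	"time"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params,omitempty"`
}

// rpcResponse carries either a result or an error object.
type rpcResponse struct {
	Result json.RawMessage `json:"result"`
	Error  *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// callWithDeadline is a hypothetical helper, not diskengine's real client.
// The context deadline bounds the dial, the write and the read, so a hung
// RPC returns an error instead of pinning the calling goroutine.
func callWithDeadline(ctx context.Context, sock, method string, params interface{}) (json.RawMessage, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "unix", sock)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", sock, err)
	}
	defer conn.Close()
	if dl, ok := ctx.Deadline(); ok {
		if err := conn.SetDeadline(dl); err != nil {
			return nil, err
		}
	}
	req := rpcRequest{JSONRPC: "2.0", ID: 1, Method: method, Params: params}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return nil, fmt.Errorf("%s: write: %w", method, err)
	}
	var resp rpcResponse
	if err := json.NewDecoder(conn).Decode(&resp); err != nil {
		return nil, fmt.Errorf("%s: read: %w", method, err)
	}
	if resp.Error != nil {
		return nil, fmt.Errorf("%s: rpc error %d: %s", method, resp.Error.Code, resp.Error.Message)
	}
	return resp.Result, nil
}

// hungCallTimesOut runs a stand-in server that accepts a connection and never
// replies, then reports whether the call gave up with a deadline error.
func hungCallTimesOut() bool {
	dir, err := os.MkdirTemp("", "rpc")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "spdk.sock")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return false
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go io.Copy(io.Discard, c) // swallow the request, never answer
		}
	}()
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	params := map[string]interface{}{"nqn": "nqn.2026-02.io.excloud:vm:11263", "nsid": 1}
	_, err = callWithDeadline(ctx, sock, "nvmf_subsystem_remove_ns", params)
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("hung RPC returned a deadline error:", hungCallTimesOut())
}
```

The deadline is applied to the connection itself rather than by racing a goroutine against a timer, so nothing is left blocked behind the call once it returns.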

How to detect it in the future

Three signals, in order of reliability:

diskengine's "spdk gate failed" log line. The namespace-gate deferral is the surface symptom and is always logged. Watch for it repeating every few seconds.
Direct SPDK RPC check: scripts/rpc.py nvmf_subsystem_remove_ns <nqn> <nsid> hangs. Other read-only RPCs (nvmf_get_subsystems, bdev_raid_get_bdevs) still respond — so SPDK is not globally dead, only this one call.
spdk_top — POLLERS tab: the bdev module poller is at 0% (no completions are coming through) but the threads are at 100% (something is pinned). This is the “reactor at 100% but doing nothing” pattern from 9.1.

How to prevent it in code

Three rules. The first is the only one that fixes the root cause; the second and third are the diskengine side:

On vfio-user endpoint disconnect, force-clear the outstanding-IO counter. The NVMf subsystem pause callback should treat a client disconnect as a quiesce boundary and zero out io_outstanding. This is an SPDK change in lib/nvmf/vfio_user.c; until it lands, the bug recurs.
Put deadlines on every JSON-RPC call. A context.WithTimeout around every SPDK RPC, and a config knob for the timeout (default 30 s). The internal/spdkclient/client.go call path already has deadline code present but commented out — un-comment it and set a sane default.
Decouple attach and detach goroutines. A wedged detach must not block a new attach. The vfio_user_attach.go and vfio_user_detach.go loops are already separate files; make sure they run on independent goroutines with no shared mutex.
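The third rule can be illustrated with a small sketch, assuming each loop is a plain ticker-driven goroutine. `runLoop` and `attachPassesWhileDetachWedged` are hypothetical names, not the contents of vfio_user_attach.go or vfio_user_detach.go. The detach step below blocks until shutdown, standing in for a stuck remove_ns, while the attach step keeps making progress.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// runLoop calls step every interval on its own goroutine until ctx is
// cancelled. Each loop owns its goroutine and shares no mutex with others.
func runLoop(ctx context.Context, interval time.Duration, step func(context.Context)) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				step(ctx)
			}
		}
	}()
}

// attachPassesWhileDetachWedged simulates a detach pass that never returns
// and counts how many attach passes still complete within the window.
func attachPassesWhileDetachWedged(window time.Duration) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	var attached int64
	runLoop(ctx, 10*time.Millisecond, func(ctx context.Context) {
		<-ctx.Done() // wedged detach: blocks until shutdown
	})
	runLoop(ctx, 10*time.Millisecond, func(context.Context) {
		atomic.AddInt64(&attached, 1) // attach pass completes normally
	})
	time.Sleep(window)
	return atomic.LoadInt64(&attached)
}

func main() {
	n := attachPassesWhileDetachWedged(200 * time.Millisecond)
	fmt.Println("attach passes completed while detach was wedged:", n)
}
```

If both steps ran in one loop, or took the same mutex, the attach counter would stay at zero for as long as the detach was stuck.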

How to recover when it happens

The recovery, per the incident report: systemctl restart spdk.service. The chain that follows:

STEP 01 · spdk.service stop: systemd kills the nvmf_tgt
STEP 02 · diskengine auto-killed: BindsTo=spdk.service
STEP 03 · spdk.service start: fresh PID, clean state
STEP 04 · diskengine auto-start: PartOf=spdk.service
STEP 05 · vmengine restart: from ExecStopPost marker (see Bug 3)

After the restart, the stale NQN is gone, the new VM (11265) reaches RUNNING, and the affected VM (11263) is reported as STOPPED with a residual inconsistency in vmengine local state (the /var/vmengine/vms/11263/vm.state file remains RUNNING). The SPDK side is fully recovered. The local state is a separate vmengine bug (see “Stale local state on the operating layer” under Edge cases).

Bug 2: The forkbomb + terminate race

What happened

On 2026-05-19, the 10-VM immediate-terminate repro reproduced this bug at scale. From VM_LIFECYCLE repro log:780 and Bug 1:33:

If a VM is terminated immediately after it first reaches RUNNING, vmengine sends one QMP system_powerdown, writes local vm.state=TERMINATING, and then never escalates. The guest may ignore/delay early ACPI shutdown, leaving QEMU alive and the API stuck in TERMINATING.

Concretely, for VMs 10916–10925:

- All ten reproduced API TERMINATING with live QEMU still running.
- Local ex9 VM state files were TERMINATING.
- Scoped QMP quit to only these ten test VMs generated host-qmp-quit
  shutdown events and allowed vmengine cleanup to finalize all ten
  to TERMINATED.

Why it happened

The vmengine terminate path sends one QMP system_powerdown, writes vm.state=TERMINATING, and then on later reconciliation passes “immediately returns if local state file is already TERMINATING.” There is no timeout/escalation path analogous to the STOPPING path.

The QMP client is also missing read/write deadlines. The qmp.Run function uses net.DialTimeout but no per-operation deadline. When the QEMU monitor socket is alive but QEMU is hung inside the guest, the read can block forever.
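A sketch of what per-operation deadlines could look like on a QMP connection follows. It is not vmengine's qmp.Run, which the source does not show: `qmpClient`, `handshake`, `execute`, and the `fakeHungQEMU` stand-in are hypothetical. The fake monitor completes the greeting and capability negotiation and then stops answering, which models a live socket in front of a hung QEMU.

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// qmpClient is a hypothetical minimal client; it is not vmengine's qmp.Run.
type qmpClient struct {
	conn      net.Conn
	r         *bufio.Reader
	opTimeout time.Duration // budget for one command: write plus reply
}

// handshake reads the QMP greeting and negotiates capabilities.
func (c *qmpClient) handshake() error {
	if err := c.conn.SetReadDeadline(time.Now().Add(c.opTimeout)); err != nil {
		return err
	}
	if _, err := c.r.ReadBytes('\n'); err != nil {
		return fmt.Errorf("greeting: %w", err)
	}
	_, err := c.execute("qmp_capabilities")
	return err
}

// execute sends one argument-less command and waits for its reply. A single
// deadline covers the write and every read, so a stream of asynchronous
// events cannot extend the wait.
func (c *qmpClient) execute(cmd string) (json.RawMessage, error) {
	deadline := time.Now().Add(c.opTimeout)
	payload, err := json.Marshal(map[string]string{"execute": cmd})
	if err != nil {
		return nil, err
	}
	if err := c.conn.SetWriteDeadline(deadline); err != nil {
		return nil, err
	}
	if _, err := c.conn.Write(append(payload, '\r', '\n')); err != nil {
		return nil, fmt.Errorf("%s: write: %w", cmd, err)
	}
	if err := c.conn.SetReadDeadline(deadline); err != nil {
		return nil, err
	}
	for {
		line, err := c.r.ReadBytes('\n')
		if err != nil {
			return nil, fmt.Errorf("%s: read: %w", cmd, err)
		}
		var msg map[string]json.RawMessage
		if err := json.Unmarshal(line, &msg); err != nil {
			return nil, fmt.Errorf("%s: decode: %w", cmd, err)
		}
		if ret, ok := msg["return"]; ok {
			return ret, nil
		}
		if qerr, ok := msg["error"]; ok {
			return nil, fmt.Errorf("%s: qmp error: %s", cmd, qerr)
		}
		// Anything else is an asynchronous event; keep reading.
	}
}

// fakeHungQEMU answers the handshake, then reads commands and never replies.
func fakeHungQEMU(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	fmt.Fprint(conn, "{\"QMP\":{\"version\":{},\"capabilities\":[]}}\r\n")
	r.ReadBytes('\n') // consume qmp_capabilities
	fmt.Fprint(conn, "{\"return\":{}}\r\n")
	io.Copy(io.Discard, r) // swallow everything else
}

// hungQueryTimesOut reports whether query-status against the hung fake
// monitor fails with a deadline error instead of blocking forever.
func hungQueryTimesOut() bool {
	client, server := net.Pipe()
	defer client.Close()
	go fakeHungQEMU(server)
	c := &qmpClient{conn: client, r: bufio.NewReader(client), opTimeout: 200 * time.Millisecond}
	if err := c.handshake(); err != nil {
		return false
	}
	_, err := c.execute("query-status")
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("hung query-status returned a deadline error:", hungQueryTimesOut())
}
```

A dial timeout alone would not help here: the connection succeeds, and only the read after it hangs.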

How to detect it

The diagnostic is the API state:

exc compute list | awk '$1=="<id>"'
# STATE: TERMINATING for more than 30 seconds with a live QEMU

And the QMP check:

ssh ex9 'printf "{\"execute\":\"qmp_capabilities\"}\r\n{\"execute\":\"query-status\"}\r\n" \
  | sudo -n socat - UNIX-CONNECT:/var/vmengine/vms/<id>/monitor.sock'
# status=running, running=true

If the API says TERMINATING, the local vm.state file says TERMINATING, but the QEMU is alive and reporting status=running, you have a stuck terminate.

How to prevent it in code

Two rules, both implemented in the working tree at the time of the incident:

Track terminatingSince like stoppingSince. Persist it (or derive from vm.state mtime) so escalation survives vmengine restarts. If the VM remains TERMINATING with QEMU alive past the deadline, escalate: graceful system_powerdown → QMP quit → OS signal.
Add QMP per-operation deadlines. The qmp.Run function should call SetReadDeadline / SetWriteDeadline on the Unix socket connection before each operation. The QEMU monitor socket can accept a connection and then never answer while the guest is in a stuck shutdown state.
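The escalation ladder in the first rule can be sketched as a pure decision function. Everything here is an assumption for illustration: the `action` names, the 30 s and 60 s grace periods, and the helper `terminatingSince`, which derives the start time from the vm.state mtime as the rule suggests. The source does not give the real thresholds.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

type action string

const (
	actWait     action = "wait"      // system_powerdown already sent; give the guest time
	actQuit     action = "qmp-quit"  // first escalation
	actSignal   action = "os-signal" // final fallback
	actFinalize action = "finalize"  // QEMU is gone; mark TERMINATED
)

// Hypothetical grace periods; the source does not state the real values.
var (
	powerdownGrace = 30 * time.Second
	quitGrace      = 60 * time.Second
)

// nextAction decides what one reconciliation pass should do for a VM that
// has been TERMINATING for the given time.
func nextAction(elapsed time.Duration, qemuAlive bool) action {
	switch {
	case !qemuAlive:
		return actFinalize
	case elapsed < powerdownGrace:
		return actWait
	case elapsed < quitGrace:
		return actQuit
	default:
		return actSignal
	}
}

// terminatingSince derives the start of termination from the vm.state mtime,
// so the clock survives a vmengine restart.
func terminatingSince(stateFile string) (time.Time, error) {
	fi, err := os.Stat(stateFile)
	if err != nil {
		return time.Time{}, err
	}
	return fi.ModTime(), nil
}

func main() {
	for _, e := range []time.Duration{5 * time.Second, 45 * time.Second, 2 * time.Minute} {
		fmt.Printf("%-6v alive=true  -> %s\n", e, nextAction(e, true))
	}
	fmt.Printf("%-6v alive=false -> %s\n", 45*time.Second, nextAction(45*time.Second, false))
}
```

The key difference from the buggy path is that a pass over an already-TERMINATING VM evaluates the elapsed time instead of returning immediately.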

How to recover when it happens

The recovery is scoped QMP quit:

ssh ex9 'printf "{\"execute\":\"qmp_capabilities\"}\r\n{\"execute\":\"quit\"}\r\n" \
  | sudo -n socat - UNIX-CONNECT:/var/vmengine/vms/<id>/monitor.sock'
# SHUTDOWN reason=host-qmp-quit

The vmengine loop will then finalize the VM to TERMINATED on the next reconciliation tick. This was the recovery used in all of the 10-VM repros; it is fast (sub-second) and does not require any service restart.

Bug 3: The crash recovery flow

What happened

This is a recurring scenario rather than a one-time bug: when SPDK crashes or restarts, every vhost-user socket dies. QEMU VMs that were using those sockets hang forever: the file descriptor is dead and there is no auto-reconnect. The diskengine's reconciliation loops do retry, but with no escalation, and vmengine's forceStopQemuVM cannot bring up dead QEMUs that need their fds refreshed.

The bug class is “no defined recovery path for ‘SPDK restarted under a running QEMU’.” Ungraceful BM reboots are not a problem — the loops rebuild everything. The problem is a service restart while QEMU is alive.

Why it happened

The original design assumed QEMU's vhost-user-blk socket is a stable file descriptor across SPDK restarts. It is not. vhost-user-blk uses a Unix socket fd obtained at QEMU boot. When SPDK crashes, the fd becomes a dead reference to a no-longer-listening socket. The VM is wedged even after SPDK comes back with a fresh socket at the same path.

How to detect it

Two signals:

spdk_top is empty or the JSON-RPC socket is missing. Check with ls -l /var/tmp/spdk.sock and systemctl status spdk.service.
VMs are still RUNNING in the API but QEMU is hung on a dead socket. The vmengine log will show repeated dial unix /var/vmengine/vms/<id>/monitor.sock: connect: connection refused or vfio_user_device_io_region_read: timed out waiting for reply: Connection timed out.

How to prevent it in code

The recovery design is at SPDK crash recovery plan:13. The rules:

Use systemd ExecStopPost to drop marker files. Two markers: /var/vmengine/SPDK_RESTART and /var/diskengine/SPDK_RESTART. The systemd unit drops them in ExecStopPost:
```
ExecStopPost=/bin/touch /var/vmengine/SPDK_RESTART
ExecStopPost=/bin/touch /var/diskengine/SPDK_RESTART
ExecStopPost=/bin/systemctl restart vmengine
ExecStopPost=/bin/systemctl restart diskengine
```
Order matters: the markers must exist before vmengine/diskengine start.
Clean stale vhost sockets in the SPDK init script. The spdk-baremetal-init.sh must run rm -f /var/diskengine/vhost* before SPDK opens its RPC socket. Stale socket files confuse diskengine's ensureVhost() checks.
vmengine on startup: SIGINT every RUNNING QEMU and set it to STARTING. Check the marker on startup. If present, query DB for RUNNING/STARTING/RESTARTING/UPGRADING/MIGRATING VMs on this BM, send SIGINT to each (best-effort), set the DB state to STARTING, and remove the marker. Then proceed to the normal loop.
diskengine on startup: clean DISKENGINE_CLEANUP markers. The existing reconcile loops are idempotent — they see DB state and rebuild whatever is missing in SPDK. No special recovery logic is needed; just remove the marker.
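The vmengine startup rule above can be sketched as follows. `recoverAfterSPDKRestart` is a hypothetical function; `listVMs`, `interrupt`, and `setStarting` stand in for the DB query, the SIGINT, and the DB update, none of which are shown in the source. Keeping the marker when a state update fails is a design assumption of this sketch, so that the next start retries.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// recoverAfterSPDKRestart checks the marker and, if present, interrupts every
// listed VM (best-effort), sets it to STARTING, and removes the marker.
// It returns the number of VMs that were reset.
func recoverAfterSPDKRestart(marker string, listVMs func() ([]int, error),
	interrupt func(vmID int) error, setStarting func(vmID int) error) (int, error) {
	if _, err := os.Stat(marker); errors.Is(err, fs.ErrNotExist) {
		return 0, nil // no SPDK restart since the last run; use the normal loop
	} else if err != nil {
		return 0, err
	}
	vms, err := listVMs()
	if err != nil {
		return 0, err
	}
	for _, id := range vms {
		_ = interrupt(id) // best-effort: QEMU may already be gone
		if err := setStarting(id); err != nil {
			// Marker is kept so the next startup retries (assumption).
			return 0, fmt.Errorf("vm %d: %w", id, err)
		}
	}
	if err := os.Remove(marker); err != nil {
		return 0, err
	}
	return len(vms), nil
}

// markerDemo runs the recovery against a temporary marker and in-memory
// state, and reports the reset count and whether everything ended up clean.
func markerDemo() (int, bool) {
	dir, err := os.MkdirTemp("", "vmengine")
	if err != nil {
		return 0, false
	}
	defer os.RemoveAll(dir)
	marker := filepath.Join(dir, "SPDK_RESTART")
	if err := os.WriteFile(marker, nil, 0o644); err != nil {
		return 0, false
	}
	states := map[int]string{11263: "RUNNING", 11265: "RUNNING"}
	n, err := recoverAfterSPDKRestart(marker,
		func() ([]int, error) { return []int{11263, 11265}, nil },
		func(int) error { return errors.New("no such process") }, // ignored
		func(id int) error { states[id] = "STARTING"; return nil })
	_, statErr := os.Stat(marker)
	clean := err == nil && errors.Is(statErr, fs.ErrNotExist) &&
		states[11263] == "STARTING" && states[11265] == "STARTING"
	return n, clean
}

func main() {
	n, clean := markerDemo()
	fmt.Println("VMs reset:", n, "marker removed and states updated:", clean)
}
```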

How to recover when it happens

The recovery is automatic: the systemd unit ordering drives a sequence of service restarts, with no manual steps:

STEP 01 · SPDK crashes (or is restarted for any reason)
STEP 02 · systemd ExecStopPost fires: markers + restart vmengine
STEP 03 · diskengine auto-killed: BindsTo=spdk.service
STEP 04 · SPDK restarts: Restart=always, RestartSec=3
STEP 05 · init script cleans vhost sockets: rm -f /var/diskengine/vhost*
STEP 06 · vmengine sees marker: SIGINT RUNNING QEMUs, set to STARTING
STEP 07 · diskengine sees marker: clean DISKENGINE_CLEANUP, then normal loops
STEP 08 · diskengine loops converge: RAIDs rebuilt, vhosts recreated
STEP 09 · vmengine Apply() boots VMs: wait for vhost socket, launch QEMU

The crucial line: VMs must be killed. QEMU's vhost-user-blk has no auto-reconnect. The running QEMU holds a dead fd even after diskengine recreates the socket at the same path. There is no way to make a running QEMU use a fresh fd; you have to kill it and re-launch.
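The "wait for vhost socket" part of the last step could look like the sketch below. `waitForSocket` is a hypothetical helper, not vmengine's Apply(). It only stats the path and checks the socket mode bit, and deliberately does not connect, on the assumption that a probe connection to a vhost-user listener is best avoided.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// waitForSocket polls until path exists and is a Unix socket, or ctx expires.
// It never connects, so it cannot disturb the listener.
func waitForSocket(ctx context.Context, path string, every time.Duration) error {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		if fi, err := os.Stat(path); err == nil && fi.Mode()&os.ModeSocket != 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %s: %w", path, ctx.Err())
		case <-t.C:
		}
	}
}

// socketAppears creates a listener shortly after the wait starts and reports
// whether waitForSocket noticed it before the timeout.
func socketAppears() bool {
	dir, err := os.MkdirTemp("", "vhost")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "vhost.0")
	lnCh := make(chan net.Listener, 1)
	go func() {
		time.Sleep(50 * time.Millisecond) // the socket shows up late
		ln, err := net.Listen("unix", path)
		if err != nil {
			lnCh <- nil
			return
		}
		lnCh <- ln
	}()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err = waitForSocket(ctx, path, 10*time.Millisecond)
	if ln := <-lnCh; ln != nil {
		ln.Close()
	}
	return err == nil
}

func main() {
	fmt.Println("socket detected before timeout:", socketAppears())
}
```

A stat-only check cannot tell a live socket from a stale file left by a dead SPDK, which is one more reason the init script removes stale vhost sockets before SPDK starts.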

Bug 4: Live migration issues

What happened

Live migration with vhost-user-blk on SPDK is possible but has narrow timing constraints. The vfio-user path (the new path) is designed for live migration, but the on-the-wire protocol is large and state-heavy. The plan is at VFIO user migration plan:1.

The current state on ex9: ex9 uses vfio-user NVMe, not vhost-user-blk. Live migration is not yet production-enabled. Validation status, from the plan:

QEMU's vfio-user-pci path is supported, but SPDK namespace add/remove semantics are version-sensitive: the programming guide requires paused or inactive subsystems for namespace changes, while newer changelog notes say add/remove can be done with more limited pause scopes. This implementation uses the direct RPC path and should be tested against the deployed SPDK build before production rollout.

Why it happened

Two independent issues:

Namespace mutation semantics. SPDK requires a paused subsystem for nvmf_subsystem_add_ns or nvmf_subsystem_remove_ns in some versions; the RPC will reject the call mid-IO otherwise. For a live migration, the source VM is actively doing I/O. Pausing the subsystem stops the I/O, which is the migration boundary. If diskengine is in the middle of adding a hot-plugged disk during a migration, add_ns runs twice: once for the hot-plug and once more for the migration. Both calls need to be pause-safe.
vfio-user state serialization. The vfio-user protocol carries the device's full state across the migration — pending completions, mapped memory, etc. The page directory (for DMA) must be rebuilt on the target. If a guest has pinned hugepages in a configuration the target doesn't have, the migration hangs in the “precopy” phase trying to iterate pages it cannot address.

How to detect it

A migration hang looks like:

QEMU info migrate reports postcopy not started, or precopy iteration count growing slowly.
diskengine vfio-user export for the migrating VM is in paused state (the NVMf subsystem was paused for the namespace operation). It must come back to active on the target.
The QMP query-migrate blocks indefinitely. The monitor socket is alive but QEMU is stuck.

How to prevent it in code

Three rules, ordered by importance:

Make the namespace mutation pause-scope explicit. Use a dedicated nvmf_subsystem_pause call with a short timeout, do the namespace operation, then nvmf_subsystem_resume. The current design in internal/baremetal/vfio_user_attach.go relies on the default pause behaviour, which is version-dependent. Pin the behaviour.
Test the migration end-to-end on the deployed SPDK build before enabling it in production. The vfio-user migration path depends on both QEMU and SPDK behaving consistently about the pause/resume semantics.
Pre-pin the source BM's hugepage configuration to match the target. The migration data copy moves memory pages; if the target has fewer hugepages or different NUMA layout, the page iteration hangs.
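The first rule, an explicit pause scope with a short timeout, can be sketched as a wrapper. The `subsystem` interface is hypothetical: it stands for whatever pause/resume mechanism the deployed SPDK build exposes, which should be confirmed before relying on it. `withPaused` and `fakeSub` are likewise illustrative names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// subsystem is a hypothetical interface over the pause/resume mechanism.
type subsystem interface {
	Pause(ctx context.Context) error
	Resume(ctx context.Context) error
}

// withPaused pauses sub under pauseTimeout, runs op, and always resumes
// after a successful pause. If the pause does not complete in time, op is
// never run against an unpaused subsystem.
func withPaused(ctx context.Context, sub subsystem, pauseTimeout time.Duration,
	op func(context.Context) error) error {
	pctx, cancel := context.WithTimeout(ctx, pauseTimeout)
	defer cancel()
	if err := sub.Pause(pctx); err != nil {
		return fmt.Errorf("pause: %w", err)
	}
	opErr := op(ctx)
	if err := sub.Resume(ctx); err != nil {
		return errors.Join(opErr, fmt.Errorf("resume: %w", err))
	}
	return opErr
}

// fakeSub records calls; pauseDelay simulates a pause that takes too long.
type fakeSub struct {
	pauseDelay time.Duration
	log        []string
}

func (f *fakeSub) Pause(ctx context.Context) error {
	select {
	case <-time.After(f.pauseDelay):
		f.log = append(f.log, "pause")
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (f *fakeSub) Resume(context.Context) error {
	f.log = append(f.log, "resume")
	return nil
}

// pauseScopeDemo returns the call order on a healthy subsystem and whether
// the operation ran on a subsystem whose pause never completed.
func pauseScopeDemo() (string, bool) {
	healthy := &fakeSub{}
	err := withPaused(context.Background(), healthy, time.Second, func(context.Context) error {
		healthy.log = append(healthy.log, "add_ns")
		return nil
	})
	if err != nil {
		return "", false
	}
	wedged := &fakeSub{pauseDelay: time.Hour}
	opRan := false
	_ = withPaused(context.Background(), wedged, 50*time.Millisecond, func(context.Context) error {
		opRan = true
		return nil
	})
	return strings.Join(healthy.log, ","), opRan
}

func main() {
	order, opRan := pauseScopeDemo()
	fmt.Println("healthy order:", order, "| op ran on wedged pause:", opRan)
}
```

After a pause timeout the subsystem state is unknown in this sketch; a real implementation would need to decide whether to attempt a resume or fail the migration.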

How to recover when it happens

The recovery is to not try to recover the migration. Kill the source QEMU, kill the target QEMU, restart the VM fresh on the target. The vfio-user protocol does not support mid-flight resumption — once the precopy loop is hung, the migration is lost. The cleanest exit is to fail fast and let the caller retry.

Bug 5: The `bdev_get_bdevs` SEGV (bonus)

What happened

On 2026-05-20, an ex9 SPDK process segfaulted when an operator called the read-only RPC bdev_get_bdevs after a batch of VMs was killed via SIGKILL while their root-volume IO was active. The crash was reproducible, with this stack (from the core file at Bug 6:501):

#0  nvme_rdma_ctrlr_get_memory_domains
#1  bdev_nvme_get_memory_domains
#2  rpc_dump_bdev_info
#3  spdk_for_each_bdev
#4  rpc_bdev_get_bdevs
#5  parse_single_request
#6  jsonrpc_server_conn_recv
#7  rpc_subsystem_poll_servers
#8  thread_execute_timed_poller
#9  spdk_thread_poll
#10 _reactor_run

The fault address in the kernel log was 0x10, consistent with a NULL/invalid nested pointer in nvme_rdma_ctrlr_get_memory_domains after the controller/qpair was torn down.

Why it happened

The crashing SPDK function dereferences the RDMA admin qpair path without local null/state guards:

/* lib/nvme/nvme_rdma.c:3670 */
nvme_rdma_ctrlr_get_memory_domains(...)
    rqpair = nvme_rdma_qpair(ctrlr->adminq)
    domains[0] = rqpair->rdma_qp->domain  // <-- crash here

The flow was: a forced QEMU death left the NVMe/RDMA controller in a torn-down state. bdev_get_bdevs walked the bdev list, hit the bdev backed by the broken controller, called spdk_nvme_ctrlr_get_memory_domains, which dereferenced a now-invalid rdma_qp pointer.

How to detect it

The crash is detectable by:

The systemd journal records a Signal 11 (SEGV) for the SPDK process. Look for spdk.service ... Main PID changed (old=N, new=M) immediately after the SEGV.
A core file appears at /var/lib/systemd/coredump/core.reactor_*.<pid>.<ts>.zst. systemd-coredump only captures it; extract it with coredumpctl, then run gdb on the core.

How to prevent it in code

Two rules, with the first being the actual fix:

Patch nvme_rdma_ctrlr_get_memory_domains to guard against NULL/invalid state:

nvme_rdma_ctrlr_get_memory_domains(ctrlr, ...)
{
    if (!ctrlr || !ctrlr->adminq)
        return 0;  /* no memory domain available */
    rqpair = nvme_rdma_qpair(ctrlr->adminq);
    if (!rqpair || !rqpair->rdma_qp)
        return 0;
    domains[0] = rqpair->rdma_qp->domain;
    ...
}

The existing code trusts that the controller is in a good state; the fix is the early return.

Don't call bdev_get_bdevs on a baremetal/SPDK target during controller reset. The diskengine code already documents this in internal/baremetal/utils.go:146: “This intentionally avoids calling bdev_get_bdevs, which can crash SPDK during NVMe controller reset (SEGV observed in production).” The same rule applies to operator scripts and to any other code path.
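One way to enforce the second rule in operator tooling is a guard in front of the RPC client. This is a hypothetical sketch: `checkMethod`, `blockedMethods`, and `errUnsafeRPC` are not diskengine names. The alternatives in the hint are the ones this page recommends.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsafeRPC marks methods this sketch refuses to send.
var errUnsafeRPC = errors.New("rpc blocked on baremetal SPDK target")

// blockedMethods maps a refused method to a hint about what to use instead.
var blockedMethods = map[string]string{
	"bdev_get_bdevs": "use bdev_raid_get_bdevs or bdev_nvme_get_controllers",
}

// checkMethod returns an error for any method on the block list, and nil
// otherwise. Call it before sending the request.
func checkMethod(method string) error {
	if hint, blocked := blockedMethods[method]; blocked {
		return fmt.Errorf("%w: %s (%s)", errUnsafeRPC, method, hint)
	}
	return nil
}

func main() {
	for _, m := range []string{"bdev_raid_get_bdevs", "bdev_get_bdevs"} {
		fmt.Printf("%-22s -> %v\n", m, checkMethod(m))
	}
}
```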

How to recover when it happens

systemd auto-restarts SPDK with Restart=always and RestartSec=3. The new SPDK process comes up, diskengine re-establishes the vfio-user exports, vmengine sees the marker from Bug 3's recovery flow and brings the VMs back. The whole recovery is automatic — the operator doesn't have to do anything. But the bug class will recur until the nvme_rdma_ctrlr_get_memory_domains patch lands.

The debug playbook: SPDK is wedged, what now?

The flowchart below is the one-page summary. Print it. Put it on the runbook page. Use it the next time the alert fires.

flowchart TD
Q0["SPDK target is wedged.
What now?"] --> Q1["Q1: Is the
process alive?"]
Q1 -- "no" --> A1["Check coredump
gdb on core
then gdb_macros
spdk_print_bdevs"]
Q1 -- "yes" --> Q2["Q2: Is the JSON-RPC
socket responding?"]
Q2 -- "no" --> A2["systemctl status spdk.service
ls -l on the RPC socket
journalctl -u spdk --since 5m"]
Q2 -- "yes" --> Q3["Q3: spdk_top THREADS tab:
one reactor at 100%,
others idle?"]
Q3 -- "yes" --> A3["spdk_top POLLERS tab
sort by Run count desc
find the runaway poller"]
Q3 -- "no" --> Q4["Q4: spdk_top POLLERS tab:
a single RPC handler
poller pinned?"]
Q4 -- "yes" --> A4["strace -p PID
trace=read,write
to see the syscall hang"]
Q4 -- "no" --> Q5["Q5: Is a specific RPC
hung but others
still work?"]
Q5 -- "yes" --> A5["spdk_trace -s app -p pid
check the tracepoint
just before the hang"]
Q5 -- "no" --> Q6["Q6: Is a bdev module
poller at high busy count
but reactor at low busy?"]
Q6 -- "yes" --> A6["Backend saturated
bdev_get_iostat for queue depth
app-side limit, not SPDK bug"]
Q6 -- "no" --> Q7["Q7: Is a vfio-user
NQN stuck on remove_ns?"]
Q7 -- "yes" --> A7["VFIO/QMP quit wedge
recover: systemctl restart
spdk.service (see Bug 1)"]
Q7 -- "no" --> Q8["Q8: Are you debugging
a production target
or a reproducer?"]
Q8 -- "production" --> A8["bpftrace on bdev_io submit
to confirm IOPS,
then escalate (see 9.2)"]
Q8 -- "reproducer" --> A9["valgrind --tool=memcheck
on a minimal repro
(see 9.2)"]

classDef recov fill:#d6f5d6,stroke:#2a6f2a;
classDef tool fill:#cfe1ff,stroke:#1c4f8a;
classDef tip fill:#fdf2cf,stroke:#8a6f1a;
classDef bug fill:#f5d6e0,stroke:#8a1c4f;
class A1,A2,A3,A4,A5,A6,A8,A9 tool
class A7 bug
class Q0 tip

fig. 1 · SPDK-is-wedged debug playbook

fig. 1 The eight questions that cover every wedged-SPDK incident we have seen. The first three are non-negotiable checks: is the process alive, is the RPC socket alive, is one reactor pinned. Past that, the answer is the specific tool. The answer nodes are blue (tool) or pink (the one bug class with a known recovery: restart spdk.service).

Edge cases & what trips people up

Reproducibility

The QMP quit wedge is the most-likely-to-miss-the-repro bug in this set. The same sequence on a clean baremetal often does not reproduce it. From the incident report, the 2026-05-25 BM rerun reproduced the symptom (stale vfio-user/NQN condition) but not the attach wedge:

The clean BM did not reproduce the SPDK attach wedge in three QMP-quit attempts, including a higher-IO run. The original hang still looks real, but it likely depends on a dirtier/stale detach condition or a narrower timing window where nvmf_subsystem_remove_ns is called while outstanding vfio-user IO accounting cannot drain.

The lesson: the bug is real, the conditions to trigger it are narrow, and you may not be able to reproduce it on demand. The recovery (restart spdk.service) is the production-grade fix. Patch the root cause; fall back to the restart until the patch lands.

Stale local state on the operating layer

After Bug 1's recovery, the affected VM is STOPPED in the API but /var/vmengine/vms/<id>/vm.state is still RUNNING. This is a vmengine bug that the recovery does not fix. The path to fix it is at VM lifecycle plan:29: move resource release from StopVM to the cleanup script, and use FOR UPDATE SKIP LOCKED to avoid races between StartVM and cleanup. The local vm.state file should be the source of truth for vmengine reconciliation, not the API.

What trips people up

“SPDK was restarted, but the VMs are still hung.” You have Bug 3. The recovery requires vmengine to be restarted too, which requires the ExecStopPost marker. If the marker is not in the systemd unit, the recovery is incomplete. Verify with systemctl cat spdk.service | grep ExecStopPost.
“I sent QMP quit and the VM still didn't terminate.” The QMP monitor socket may be wedged on a stuck read. The fix is the QMP deadline (Bug 2's second rule) plus SIGKILL as a final fallback. Make sure the SIGKILL is scoped to a specific test VM (matched on -name vm<id>,) before pulling that trigger.
“bdev_get_bdevs crashed SPDK.” You have Bug 5. Don't call that RPC on a baremetal SPDK target. Use bdev_raid_get_bdevs all or bdev_nvme_get_controllers instead.
“Live migration is hung in precopy.” You have Bug 4. Don't wait for it to complete. Kill the source QEMU, the target QEMU, and restart the VM on the target. Pause-scope the namespace operation before the next attempt.
“The diskengine log shows ‘spdk gate failed’ but I can't find the cause.” This is Bug 1's surface symptom. Run scripts/rpc.py nvmf_subsystem_remove_ns <nqn> <nsid> directly. If it hangs, the bug is in the NVMf pause path. If it returns immediately, the bug is in the diskengine gate logic.
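The scoping advice for the SIGKILL fallback (match on `-name vm<id>,`) is worth a small sketch, because the trailing comma is what keeps vm1126 from matching vm11263. `cmdlineNamesVM` is a hypothetical helper; it assumes the input is the NUL-separated content of /proc/<pid>/cmdline and that the VM name is the first field of the `-name` argument.

```go
package main

import (
	"fmt"
	"strings"
)

// cmdlineNamesVM reports whether a NUL-separated process command line has a
// "-name" argument whose value is exactly vm<id> or starts with "vm<id>,".
func cmdlineNamesVM(cmdline string, vmID int) bool {
	args := strings.Split(strings.TrimRight(cmdline, "\x00"), "\x00")
	want := fmt.Sprintf("vm%d", vmID)
	for i := 0; i+1 < len(args); i++ {
		if args[i] != "-name" {
			continue
		}
		name := args[i+1]
		if name == want || strings.HasPrefix(name, want+",") {
			return true
		}
	}
	return false
}

func main() {
	cmdline := "qemu-system-x86_64\x00-name\x00vm11263,debug-threads=on\x00-m\x004096\x00"
	fmt.Println("matches 11263:", cmdlineNamesVM(cmdline, 11263))
	fmt.Println("matches 1126: ", cmdlineNamesVM(cmdline, 1126))
}
```

A plain substring search for `vm1126` would match both VMs, which is exactly the mistake to avoid before sending a signal.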

What to take away

The four bug classes on this page are not exhaustive, but they cover every wedged-SPDK incident on ex9 in the May 2026 window. Three rules cover the root causes for all of them:

Every SPDK RPC has a deadline. Without one, a single stuck RPC pins a goroutine forever. Add the deadline at the internal/spdkclient/client.go layer for Bug 1; Bug 2 needs the same discipline on the QMP side (per-operation deadlines plus terminate escalation).
vfio-user endpoint disconnect is a quiesce boundary, not a pause-and-wait. When the QEMU side goes away, the NVMf subsystem must force-clear its outstanding-IO counters. Until that is patched, the recovery is systemctl restart spdk.service.
Use systemd ExecStopPost for crash signalling. Marker files are atomic, crash-safe, zero-latency, and don't depend on the DB. This is the design that makes Bug 3's recovery automatic instead of manual.

You have now finished the curriculum. The next time a target is wedged, start at the playbook flowchart. The next time a new bug class appears, write a retrospective like this one — “what happened, why it happened, how to detect it, how to prevent it, how to recover.” The retrospective is the artefact; the rules are what survive across incidents.
