Layer 9 · Operations & debugging

The right tool for the right kind of broken.

spdk_top is great for the steady-state “is anything on fire?” view. It is bad at three things: sequence (which event came before which),
post-mortem (the process already crashed and there's only a core file), and production overhead (you can't run ncurses on a 16-core target under load). This page covers the three tools that fill those gaps: tracepoints, USDT, and gdb macros. Plus perf, valgrind, strace, and /proc/PID/maps. Then a decision tree that maps a symptom to a tool.

~15 min read2 diagramsprerequisites: 2.1 · 3.1 · 4.3
On this page
  1. The four kinds of "I need to know more"
  2. Tracing: lib/trace/ and the spdk_trace_* API
  3. Enabling tracepoints: -e flag, masks, post-mortem
  4. Concrete example: trace a bdev_io end-to-end
  5. USDT: static probes, bpftrace, zero overhead
  6. Concrete example: count bdev_io submits per second
  7. gdb macros: pretty-printers, spdk_print_bdevs
  8. Concrete example: inspect a bdev_io from a core
  9. perf, valgrind, strace, /proc
  10. The decision tree: symptom → tool
  11. Edge cases: production overhead, containers, stripped binaries

The four kinds of "I need to know more"

When a target misbehaves, you almost always want to answer one of four questions. Each maps to a different tool:

QuestionTool
“What sequence of events led to this single hung RPC?”Tracepoints (lib/trace/)
“How often does this code path fire, in production, on a live target?”USDT probes (bpftrace)
“What is this core file pointing at, and what are the linked structures doing?”gdb macros (doc/gdb_macros.md)
“Where is the process spending its CPU time?”perf

The first three are SPDK-specific. perf works on anything compiled with frame pointers (which SPDK is). The tools below this line — valgrind, strace, /proc — are general-purpose and the section at the end explains why most of them are not useful for SPDK and when they are.

Tracing: lib/trace/ and the spdk_trace_* API

The SPDK trace framework is a circular buffer in shared memory. The target writes into it from hot code paths. Other processes (or the target itself, post-mortem) read it back and produce a timestamped log of events. The buffer is shared memory so that spdk_trace can attach to a running target without forcing the target to dump.

The three pieces are:

  1. Tracepoints are calls in C code at well-known places. They look like spdk_trace_record(TRACE_FOO, tsc, 0, arg1, arg2). Each tracepoint has a registered name and argument schema. The compiled-in tracepoints are listed by the application's -e flag.

  2. Trace history is the per-lcore circular buffer. One buffer per reactor lcore, plus extra slots for user threads. The size is set at startup (-e all uses the default, -s overrides).

  3. The shared memory file is exposed at /dev/shm/<app>_trace.pid<N> and is readable by any process with file permissions. This is the on-disk artefact you copy to a debug host for offline analysis.

The whole framework is in one file: lib/trace/trace.c and its sibling header lib/trace/trace_internal.h. The interesting entry point is _spdk_trace_record:

Three things to notice. (1) The first thing it does is look up the current lcore. There is one buffer per lcore, so the lookup is just an index. (2) If the current lcore is SPDK_ENV_LCORE_ID_ANY and the thread is not a registered user thread, the function returns without recording anything. This means a thread that has not been initialised properly into the SPDK threading model is silently invisible to tracing. (3) The argument count is asserted — if the tracepoint definition in trace_defs.h says three arguments and the call site passes two, the process asserts in a debug build. This is intentional: it is a build-time check on the tracepoint contracts.

Enabling tracepoints: the -e flag

The application is responsible for parsing the -e flag and registering the tracepoints. A target that wants “all tracepoints on” starts with:

/var/diskengine/spdk/build/bin/nvmf_tgt -e all

At startup the framework logs:

app.c: 527:spdk_app_setup_trace: *NOTICE*: Tracepoint Group Mask 0xFFFF specified.
app.c: 531:spdk_app_setup_trace: *NOTICE*: Use 'spdk_trace -s nvmf -p 24147' to capture a snapshot of events at runtime.
app.c: 533:spdk_app_setup_trace: *NOTICE*: Or copy /dev/shm/nvmf_trace.pid24147 for offline analysis/debug.

That second line is the key — it tells you the SHM name and the PID. The SHM file is at /dev/shm/nvmf_trace.pid24147. The flag is -e nvmf_rdma to enable only the RDMA transport group, -e nvmf_rdma:B to enable specific tracepoints inside the group, and -e nvmf_rdma:2,thread to combine.

The group mask is a 16-bit value. A common pattern is -e nvmf_rdma:FF to enable all RDMA tracepoints but skip others. The available groups are listed by nvmf_tgt -e help.

How to dump traces

The dump tool is app/trace/spdk_trace.c. Run it against the SHM file:

/var/diskengine/spdk/build/bin/spdk_trace -s nvmf -p 24147

If you forgot the PID, run it with no arguments on the same host and it will use the most recent SHM file in /dev/shm. The output is a table of trace events, one per line, with the tracepoint name, timestamp in TSC ticks converted to microseconds, and the argument values. Sample output from the SPDK docs (RDMA I/O state machine):

28:   6026.658 ( 12656064)     RDMA_REQ_NEED_BUFFER              id:    r3622  time:  0.019
28:   6026.694 ( 12656140)     RDMA_REQ_RDY_TO_EXECUTE           id:    r3622  time:  0.055
28:   6026.820 ( 12656406)     RDMA_REQ_EXECUTING                id:    r3622  time:  0.182
28:   6026.992 ( 12656766)     RDMA_REQ_EXECUTED                 id:    r3477  time:  228.510

The id: column is the “object” the tracepoint refers to. The same id across tracepoints is the same request, so you can read down the lines and see one request's full lifecycle. The time: column is the delta from the previous tracepoint on the same id, which gives you the dwell time at each state. The (12656064) is the raw TSC value, useful for cross-referencing with a different log source.

Post-mortem: the file is not deleted

This is the part that makes the trace framework useful for crash analysis. From the source, in spdk_trace_cleanup:

The SHM file is unlinked only if the first TSC value in every lcore's history is 0 — i.e. nothing was ever recorded. As soon as one tracepoint fires, the file persists past process exit. This is the post-mortem hook: a crashed target still has its trace history in /dev/shm. Just copy it out and analyse it on a debug host.

Concrete example: trace a bdev_io end-to-end

The bdev framework emits tracepoints at the major lifecycle points (see 4.3 for the full lifecycle). To trace one bdev_io through its entire life, the minimal -e flag is:

nvmf_tgt -e bdev

Then run a single I/O, dump the trace, and look for the bdev_io pointer (it appears in the object_id column) across events. The shape you'll see is:

BDEV_IO_SUBMIT        id: <bdev_io_ptr>  arg0: <desc_ptr>  arg1: <ch_ptr>
BDEV_IO_DISPATCH      id: <bdev_io_ptr>  arg0: <ch_ptr>     arg1: 0
BDEV_IO_DONE          id: <bdev_io_ptr>  arg0: <ch_ptr>     arg1: <queue_depth>

The times between these three lines are the I/O latency at the bdev layer. If the bdev is on a fast NVMe drive, the DISPATCH-to-DONE delta is on the order of 10 µs. If it is on a slow backend, it is 1–10 ms. If the I/O never produces a DONE line, it is stuck somewhere — that is the bug, and you go look at the poller that is responsible.

USDT: static probes, bpftrace, zero overhead

USDT is “userspace DTrace probes.” They are compile-time markers that, when the binary is built with --with-usdt, end up as ENABLED instructions in the binary that are no-ops until something attaches to them. DTrace, bpftrace, and SystemTap can attach to a probe and execute a script on every fire. The cost of an attached probe is on the order of a function call; the cost of an unattached probe is effectively zero.

The SPDK USDT probes are at include/spdk/usdt.h and are emitted at strategic hot-path locations: the bdev submit path, the nvmf state transitions, the spdk_thread_send_msg site. The companion script is scripts/bpftrace.sh, which handles the path substitution that USDT requires.

To use USDT, the target must be built with --with-usdt:

./configure --with-usdt

The bpftrace installation needs care. From doc/usdt.md:1 :

We have found issues with the packaged bpftrace on both Ubuntu 20.04 and Fedora 33. So bpftrace should be built and installed from source.

After that, run a probe script. The convention is to write a .bt file and pass it to bpftrace.sh:

scripts/bpftrace.sh `pidof nvmf_tgt` scripts/bpf/nvmf.bt

The script will print probe fires as they happen. For example, the shipped scripts/bpf/nvmf.bt prints state transitions on subsystems:

2110.935735: nvmf_tgt reached state NONE
2110.954316: nvmf_tgt reached state CREATE_TARGET
2110.967905: nvmf_tgt reached state CREATE_POLL_GROUPS
2111.235982: nvmf_tgt reached state START_SUBSYSTEMS
2111.253560: nqn.2014-08.org.nvmexpress.discovery change state from INACTIVE to ACTIVE start
...

This is zero-overhead monitoring in production. The probes fire as the I/O is submitted; bpftrace aggregates and prints. You can leave it running for hours.

Concrete example: count bdev_io submits per second

The bdev submit path has a USDT probe at spdk_bdev_io_submit. The probe fires once per bdev_io, so counting the fires is counting the IOPS. A one-liner bpftrace script:

# count_submits.bt
usdt:./build/bin/nvmf_tgt:spdk:spdk_bdev_io_submit {
    @count++;
}

interval:s:1 {
    printf("submits/sec: %d\n", @count);
    delete(@count);
}

Run it:

scripts/bpftrace.sh `pidof nvmf_tgt` count_submits.bt

The output prints the number of bdev_io submits in the last second, every second. spdk_bdev_io_submit is the one place that catches every I/O regardless of the module, so the number is the global submit rate of the target.

gdb macros: pretty-printers for live debugging

The gdb macros are Python helpers that walk SPDK's linked-list structures (which are BSD-style TAILQ lists) and present them as if they were arrays. The reason they exist: a bdev list is a TAILQ, which means each element has a link field that points to the next element. To walk a list of 200 bdevs by hand in gdb, you type p bdev->internal->link->tqe_next->name 200 times. The macros collapse that into a single command.

The macros are documented in doc/gdb_macros.md:1 . The user-facing commands are:

CommandWhat it prints
spdk_load_macrosLoad the helpers (call once after gdb starts)
spdk_print_bdevsOne line per bdev: address, name
spdk_find_bdev NAMESubstring search across bdev names
spdk_print_io_devicesAll io_devices with ref counts
spdk_print_nvmf_subsystemsOne line per nvmf subsystem: NQN, ID
spdk_print_threadsOne line per spdk_thread: name, IO channels with their devices and ref counts

Loading the macros

The macros ship in the SPDK source tree as a Python module. Load them from gdb:

(gdb) python
>import sys
>sys.path.append('/var/diskengine/spdk')
>import gdb_macros
>end
(gdb) spdk_load_macros

Or add a .gdbinit so this happens automatically on gdb start. A typical one:

source /var/diskengine/spdk/scripts/gdb_macros.py

But you still have to call spdk_load_macros manually after gdb has loaded the binary, because the macros need the SPDK globals (g_bdev_mgr, g_threads, etc.) to be available. From doc/gdb_macros.md:219 : “The reason is that the macros need to use globals provided by spdk in order to iterate the spdk lists and build iterable representations of the list objects.” If you call them before the binary's symbol table is loaded, gdb prints RuntimeError: No symbol table is loaded and the macros refuse to load.

Concrete example: inspect a bdev_io from a core

The macros are not very useful for examining a single spdk_bdev_io because that is a structure, not a list. For a single bdev_io you use plain gdb. The flow is:

(gdb) frame 0
#0  bdev_io_complete (bdev_io=0x7f7dcc2c7c08, ...) at bdev.c:7960
(gdb) p *bdev_io
$1 = struct spdk_bdev_io {
    bdev = 0x7f7dcc2c7000,
    internal = {
        ch = 0x7f7dcc2c7c20,
        ...
    },
    type = SPDK_BDEV_IO_TYPE_READ,
    u = { ... }
}
(gdb) p bdev_io->bdev->name
$2 = 0x7f7dcc0b21b0 "raid_2591"

This is a typical post-mortem scenario: you have a core from a crashed target, you find yourself in bdev_io_complete, and you want to know which bdev_io you are. The answer is the bdev pointer's name field.

The macros come in for “list” questions. After the crash above, the natural follow-up is “what other bdevs does this target have?”. The macro:

(gdb) spdk_print_bdevs

SPDK object of type struct spdk_bdev at 0x7f7dcc1642a8
((struct spdk_bdev*) 0x7f7dcc1642a8)
name 0x7f7dcc0b21b0 "raid_2591"

---------------

SPDK object of type struct spdk_bdev at 0x7f7dcc216008
((struct spdk_bdev*) 0x7f7dcc216008)
name 0x7f7dcc0b21a70 "raid_2592"
...

Now you can correlate the crashed bdev_io with the list of all bdevs. The first one in the list is the one that was active at the time of the crash.

perf, valgrind, strace, /proc

These are general-purpose Linux tools. Most are not useful for SPDK. A few are. Here is the honest accounting.

perf record / perf report — useful for hot-path profiling

perf is the right tool when you have a “the target is slow but I don't know why” question that spdk_top cannot answer. Record with -g to capture call graphs, sample at 99 Hz to avoid perturbing the target. The output is a flame graph you can read with perf report.

perf record -F 99 -p $SPDK_PID -g -- sleep 30
perf report

The trap: SPDK runs on dedicated lcores, so the call graph tends to be flat and dominated by your poller callbacks. If perf shows the same 3 functions for 95% of samples, the target is healthy and the call chain is just “reactor -> poller -> nvme_qpair_process_completions -> bdev_io_complete” repeated millions of times. That's not a bug; that's how it's supposed to look.

perf is not useful for “why is this RPC hung” — for that you need a trace dump (sequence of events) or a core file (current state).

valgrind — useful, but at 10-50x slowdown

valgrind --tool=memcheck catches use-after-free, uninitialized reads, and double-frees. It is the right tool for “this bdev_io was used after free” bugs. The problem: it slows the target down by 10-50x. You cannot run it in production. You can run it in CI, or in a staging test that exercises the buggy path.

A common pattern: write a small C program that exercises the bdev module's submit path 10,000 times, link against the SPDK libraries, run under valgrind. If there is a use after free, valgrind prints the exact line where the freed memory was last touched and the exact line of the free.

strace — mostly useless for SPDK

strace traces syscalls. The reactor model deliberately avoids syscalls on the hot path, so strace will show you a lot of epoll_wait calls and almost nothing else. The one time it is useful: when an SPDK target is hanging and you suspect it is blocked on a read from a socket. strace -p $PID -e trace=read,write,recv,send will show the live socket I/O.

A second useful case: strace -f -e trace=open,openat -p $PID to find which file the target is opening right now. This is the fastest way to answer “why is my config not being picked up?”.

/proc/PID/maps — hugepage setup verification

/proc/PID/maps shows the process's virtual memory layout. For SPDK the interesting thing is the hugepage regions. A correct setup looks like:

7f4d00000000-7f5d00000000 rw-p 00000000 00:0d 3145740  /dev/hugepages/...
...

Multiple hugepage regions, one per allocated hugepage. If the target cannot allocate enough hugepages, you see only the regular anonymous mappings and no /dev/hugepages regions. From app/spdk_top/spdk_top.c:1107 , the framework assumes the hugepages are already mounted.

The mapping also reveals the libraries that were loaded. A target that was supposed to use the SPDK asan build but accidentally linked against the release build will have different shared objects in the map. Comparing /proc/PID/maps against the expected list of .so files is a quick way to find that mistake.

The decision tree: symptom → tool

The hardest part of being an SPDK operator is not knowing any one of these tools. It is knowing which one to reach for when the alert fires at 2 am. Here is a flowchart that maps symptoms to tools.

flowchart TD
S0["SPDK target is broken.
Pick a symptom."] --> S1["Q1: Is the process
still running?"] S1 -- "no, it crashed" --> D1["core file
gdb plus spdk_print_bdevs"] S1 -- "yes" --> S2["Q2: Is the JSON-RPC
socket responding?"] S2 -- "no" --> D2["strace -p PID
to see what syscall
is hung"] S2 -- "yes" --> S3["Q3: Is one reactor
pinned at 100%
in spdk_top?"] S3 -- "yes" --> D3["spdk_top POLLERS tab
sort by run count desc
find the runaway poller"] S3 -- "no" --> S4["Q4: Is a specific RPC
hung but others work?"] S4 -- "yes" --> D4["spdk_trace -s app -p pid
check the tracepoint
fires just before the hang"] S4 -- "no" --> S5["Q5: Do you need to know
the I/O rate in production?"] S5 -- "yes" --> D5["bpftrace on the bdev_io
submit USDT probe"] S5 -- "no" --> S6["Q6: Is something crashing
on load (shared lib
mismatch)?"] S6 -- "yes" --> D6["cat /proc/PID/maps
compare to expected
list of .so files"] S6 -- "no" --> S7["Q7: Is something freeing
memory it is still using?"] S7 -- "yes" --> D7["valgrind --tool=memcheck
on a minimal repro"] classDef crash fill:#f5d6e0,stroke:#8a1c4f; classDef tui fill:#d6f5d6,stroke:#2a6f2a; classDef trace fill:#cfe1ff,stroke:#1c4f8a; classDef os fill:#fdf2cf,stroke:#8a6f1a; class D1,D2 crash class D3,D4 tui class D5,D6,D7 trace class S0 os
fig. 2 — symptom → tool decision tree · tap or scroll to zoom · ↗ for fullscreen

fig. 2   The first three questions are the only ones that matter: is the process alive, is the RPC socket alive, is one reactor pinned. After that, each branch is a single tool. The exceptions are “no to all three”, which means the target is in a state that needs deeper inspection — see 9.3.

Edge cases: production overhead, containers, stripped binaries

Tracing overhead in production

The trace framework is not free. Each fire is a TSC read, an lcore lookup, an argument count assert, a memcpy into the circular buffer, and a smp_wmb. On a hot bdev_io_submit path that fires millions of times per second, the cost is measurable. The recommendation: enable tracepoints only when you are actively chasing a bug. The target app.c parses -e all at startup and does not disable it later.

A common production pattern is to leave the framework compiled in but with no groups enabled (the default). The cost is the per-lcore TSC-read and the empty branch. You can then enable a specific group at runtime via the JSON-RPC trace_enable_tpoint_group RPC, fire I/O, dump the trace, and disable the group. This is the “on-demand” workflow.

USDT in containers

USDT probes require the kernel to allow attaching to userspace probes. In a default Docker container, the perf_event_open syscall (which bpftrace uses internally) returns EPERM. The fix is to grant the container the right capabilities:

docker run --cap-add=SYS_ADMIN --cap-add=SYS_PTRACE ...

Or, in a Kubernetes Pod, the pod securityContext needs capabilities.add: ["SYS_ADMIN", "SYS_PTRACE"]. The second caveat: the SPDK binary inside the container must be built with --with-usdt. If the binary was copied in from an external build, the probes may not be present in the target.

gdb macros for stripped binaries

The gdb macros walk global symbol tables. Stripped binaries have no symbol table, and the macros will fail to load with the error in doc/gdb_macros.md:230 : “RuntimeError: No symbol table is loaded.”

If you need to debug a production binary that was stripped at install time, rebuild locally with ./configure --enable-debug --disable-strip and use the unstripped version. Or use the file command on the binary: a non-stripped SPDK binary reads “not stripped”; a stripped one reads “stripped”.

What trips people up

  • “I ran spdk_trace and it returned no events.” The SHM file is at /dev/shm/<app>_trace.pid<pid> and the PID is the SPDK target's PID, not the trace tool's PID. The app.c: ... Use 'spdk_trace -s nvmf -p 24147' line at target startup gives you both.

  • “I attached bpftrace and the probe never fired.” The binary was not built with --with-usdt. Run readelf -n ./build/bin/nvmf_tgt and check for the SPDK_PROBE section. If it is missing, rebuild.

  • spdk_load_macros gives a RuntimeError.” You attached gdb to a process that does not have SPDK's globals loaded, or the binary is stripped. Use a non-stripped build, or load the binary into gdb with file ./build/bin/nvmf_tgt first.

  • perf shows 99% in __lll_lock_wait.” The target is blocking on a pthread mutex. This is a bug — SPDK threads are not supposed to take pthread mutexes on the hot path. The trace dump will tell you which RPC started the lock. (The likely culprit is the JSON-RPC server.)

  • valgrind reports ‘still reachable’ for huge amounts of memory.” This is normal for SPDK — mempools and the JSON-RPC server hold onto memory intentionally. The leaks that matter are the “definitely lost” and “indirectly lost” categories.

Why it matters

spdk_top is the day-to-day tool. The three tools on this page are the incident-response tools. The decision tree is short because the questions are short: is the process alive, is the RPC socket alive, is a reactor pinned. Past that, you are doing forensics — and forensics is tracing + gdb + USDT in some combination.

The next page, 9.3 — failure mode retrospectives, is the capstone. It walks through four real production incidents on baremetal ex9, what tool was used to diagnose each, and the rules to follow so the same bug class doesn't happen again.