The right tool for the right kind of broken.
spdk_top is great for the steady-state “is
anything on fire?” view. It is bad at three things:
sequence (which event came before which),
post-mortem (the process already crashed and
there's only a core file), and production overhead
(you can't run ncurses on a 16-core target under load). This page
covers the three tools that fill those gaps: tracepoints, USDT, and
gdb macros. Plus perf, valgrind,
strace, and /proc/PID/maps. Then a
decision tree that maps a symptom to a tool.
- The four kinds of "I need to know more"
- Tracing:
lib/trace/and the spdk_trace_* API - Enabling tracepoints:
-eflag, masks, post-mortem - Concrete example: trace a bdev_io end-to-end
- USDT: static probes, bpftrace, zero overhead
- Concrete example: count bdev_io submits per second
- gdb macros: pretty-printers,
spdk_print_bdevs - Concrete example: inspect a bdev_io from a core
perf,valgrind,strace,/proc- The decision tree: symptom → tool
- Edge cases: production overhead, containers, stripped binaries
The four kinds of "I need to know more"
When a target misbehaves, you almost always want to answer one of four questions. Each maps to a different tool:
| Question | Tool |
|---|---|
| “What sequence of events led to this single hung RPC?” | Tracepoints (lib/trace/) |
| “How often does this code path fire, in production, on a live target?” | USDT probes (bpftrace) |
| “What is this core file pointing at, and what are the linked structures doing?” | gdb macros (doc/gdb_macros.md) |
| “Where is the process spending its CPU time?” | perf |
The first three are SPDK-specific. perf works on
anything compiled with frame pointers (which SPDK is). The tools
below this line — valgrind, strace,
/proc — are general-purpose and the section at the
end explains why most of them are not useful for SPDK and when
they are.
Tracing: lib/trace/ and the spdk_trace_* API
The SPDK trace framework is a circular buffer in shared memory. The
target writes into it from hot code paths. Other processes (or the
target itself, post-mortem) read it back and produce a
timestamped log of events. The buffer is shared memory so that
spdk_trace can attach to a running target without
forcing the target to dump.
The three pieces are:
Tracepoints are calls in C code at well-known places. They look like
spdk_trace_record(TRACE_FOO, tsc, 0, arg1, arg2). Each tracepoint has a registered name and argument schema. The compiled-in tracepoints are listed by the application's-eflag.Trace history is the per-lcore circular buffer. One buffer per reactor lcore, plus extra slots for user threads. The size is set at startup (
-e alluses the default,-soverrides).The shared memory file is exposed at
/dev/shm/<app>_trace.pid<N>and is readable by any process with file permissions. This is the on-disk artefact you copy to a debug host for offline analysis.
The whole framework is in one file:
lib/trace/trace.c and its sibling header
lib/trace/trace_internal.h. The interesting entry
point is _spdk_trace_record:
Three things to notice. (1) The first thing it does is look up
the current lcore. There is one buffer per lcore, so the lookup
is just an index. (2) If the current lcore is
SPDK_ENV_LCORE_ID_ANY and the thread is not a
registered user thread, the function returns without recording
anything. This means a thread that has not been initialised
properly into the SPDK threading model is silently
invisible to tracing. (3) The argument count is asserted
— if the tracepoint definition in trace_defs.h says
three arguments and the call site passes two, the process asserts
in a debug build. This is intentional: it is a build-time check
on the tracepoint contracts.
Enabling tracepoints: the -e flag
The application is responsible for parsing the
-e flag and registering the tracepoints. A target
that wants “all tracepoints on” starts with:
/var/diskengine/spdk/build/bin/nvmf_tgt -e allAt startup the framework logs:
app.c: 527:spdk_app_setup_trace: *NOTICE*: Tracepoint Group Mask 0xFFFF specified.
app.c: 531:spdk_app_setup_trace: *NOTICE*: Use 'spdk_trace -s nvmf -p 24147' to capture a snapshot of events at runtime.
app.c: 533:spdk_app_setup_trace: *NOTICE*: Or copy /dev/shm/nvmf_trace.pid24147 for offline analysis/debug.That second line is the key — it tells you the SHM name and the
PID. The SHM file is at
/dev/shm/nvmf_trace.pid24147. The flag is
-e nvmf_rdma to enable only the RDMA transport
group, -e nvmf_rdma:B to enable specific
tracepoints inside the group, and -e nvmf_rdma:2,thread
to combine.
The group mask is a 16-bit value. A common pattern is
-e nvmf_rdma:FF to enable all RDMA tracepoints but
skip others. The available groups are listed by
nvmf_tgt -e help.
How to dump traces
The dump tool is app/trace/spdk_trace.c. Run it
against the SHM file:
/var/diskengine/spdk/build/bin/spdk_trace -s nvmf -p 24147If you forgot the PID, run it with no arguments on the same host
and it will use the most recent SHM file in
/dev/shm. The output is a table of trace events,
one per line, with the tracepoint name, timestamp in TSC ticks
converted to microseconds, and the argument values. Sample
output from the SPDK docs (RDMA I/O state machine):
28: 6026.658 ( 12656064) RDMA_REQ_NEED_BUFFER id: r3622 time: 0.019
28: 6026.694 ( 12656140) RDMA_REQ_RDY_TO_EXECUTE id: r3622 time: 0.055
28: 6026.820 ( 12656406) RDMA_REQ_EXECUTING id: r3622 time: 0.182
28: 6026.992 ( 12656766) RDMA_REQ_EXECUTED id: r3477 time: 228.510The id: column is the “object” the
tracepoint refers to. The same id across tracepoints is the
same request, so you can read down the lines and see one
request's full lifecycle. The time: column is the
delta from the previous tracepoint on the same id, which gives
you the dwell time at each state. The
(12656064) is the raw TSC value, useful for
cross-referencing with a different log source.
Post-mortem: the file is not deleted
This is the part that makes the trace framework useful for
crash analysis. From the source, in spdk_trace_cleanup:
The SHM file is unlinked only if the first TSC value in every
lcore's history is 0 — i.e. nothing was ever recorded. As soon
as one tracepoint fires, the file persists past process exit.
This is the post-mortem hook: a crashed target still has its
trace history in /dev/shm. Just copy it out and
analyse it on a debug host.
Concrete example: trace a bdev_io end-to-end
The bdev framework emits tracepoints at the major lifecycle
points (see 4.3 for the full
lifecycle). To trace one bdev_io through its entire life, the
minimal -e flag is:
nvmf_tgt -e bdevThen run a single I/O, dump the trace, and look for the bdev_io
pointer (it appears in the object_id column) across
events. The shape you'll see is:
BDEV_IO_SUBMIT id: <bdev_io_ptr> arg0: <desc_ptr> arg1: <ch_ptr>
BDEV_IO_DISPATCH id: <bdev_io_ptr> arg0: <ch_ptr> arg1: 0
BDEV_IO_DONE id: <bdev_io_ptr> arg0: <ch_ptr> arg1: <queue_depth>The times between these three lines are the I/O latency at the
bdev layer. If the bdev is on a fast NVMe drive, the
DISPATCH-to-DONE delta is on the order
of 10 µs. If it is on a slow backend, it is 1–10 ms. If the I/O
never produces a DONE line, it is stuck
somewhere — that is the bug, and you go look at the poller that
is responsible.
USDT: static probes, bpftrace, zero overhead
USDT is “userspace DTrace probes.” They are
compile-time markers that, when the binary is built with
--with-usdt, end up as ENABLED
instructions in the binary that are no-ops until something
attaches to them. DTrace, bpftrace, and
SystemTap can attach to a probe and execute a
script on every fire. The cost of an attached probe is on the
order of a function call; the cost of an unattached probe is
effectively zero.
The SPDK USDT probes are at include/spdk/usdt.h and
are emitted at strategic hot-path locations: the bdev submit
path, the nvmf state transitions, the spdk_thread_send_msg
site. The companion script is scripts/bpftrace.sh,
which handles the path substitution that USDT requires.
To use USDT, the target must be built with
--with-usdt:
./configure --with-usdtThe bpftrace installation needs care. From doc/usdt.md:1 :
We have found issues with the packaged bpftrace on both Ubuntu 20.04 and Fedora 33. So bpftrace should be built and installed from source.
After that, run a probe script. The convention is to write a
.bt file and pass it to bpftrace.sh:
scripts/bpftrace.sh `pidof nvmf_tgt` scripts/bpf/nvmf.btThe script will print probe fires as they happen. For example,
the shipped scripts/bpf/nvmf.bt prints
state transitions on subsystems:
2110.935735: nvmf_tgt reached state NONE
2110.954316: nvmf_tgt reached state CREATE_TARGET
2110.967905: nvmf_tgt reached state CREATE_POLL_GROUPS
2111.235982: nvmf_tgt reached state START_SUBSYSTEMS
2111.253560: nqn.2014-08.org.nvmexpress.discovery change state from INACTIVE to ACTIVE start
...This is zero-overhead monitoring in production. The probes fire as the I/O is submitted; bpftrace aggregates and prints. You can leave it running for hours.
Concrete example: count bdev_io submits per second
The bdev submit path has a USDT probe at
spdk_bdev_io_submit. The probe fires once per
bdev_io, so counting the fires is counting the IOPS. A
one-liner bpftrace script:
# count_submits.bt
usdt:./build/bin/nvmf_tgt:spdk:spdk_bdev_io_submit {
@count++;
}
interval:s:1 {
printf("submits/sec: %d\n", @count);
delete(@count);
}Run it:
scripts/bpftrace.sh `pidof nvmf_tgt` count_submits.btThe output prints the number of bdev_io submits in the last
second, every second. spdk_bdev_io_submit is the
one place that catches every I/O regardless of the module, so
the number is the global submit rate of the target.
gdb macros: pretty-printers for live debugging
The gdb macros are Python helpers that walk SPDK's linked-list
structures (which are BSD-style TAILQ lists) and
present them as if they were arrays. The reason they exist: a
bdev list is a TAILQ, which means each element has
a link field that points to the next element. To
walk a list of 200 bdevs by hand in gdb, you type
p bdev->internal->link->tqe_next->name 200
times. The macros collapse that into a single command.
The macros are documented in doc/gdb_macros.md:1 . The user-facing commands are:
| Command | What it prints |
|---|---|
spdk_load_macros | Load the helpers (call once after gdb starts) |
spdk_print_bdevs | One line per bdev: address, name |
spdk_find_bdev NAME | Substring search across bdev names |
spdk_print_io_devices | All io_devices with ref counts |
spdk_print_nvmf_subsystems | One line per nvmf subsystem: NQN, ID |
spdk_print_threads | One line per spdk_thread: name, IO channels with their devices and ref counts |
Loading the macros
The macros ship in the SPDK source tree as a Python module. Load them from gdb:
(gdb) python
>import sys
>sys.path.append('/var/diskengine/spdk')
>import gdb_macros
>end
(gdb) spdk_load_macrosOr add a .gdbinit so this happens automatically on
gdb start. A typical one:
source /var/diskengine/spdk/scripts/gdb_macros.pyBut you still have to call spdk_load_macros
manually after gdb has loaded the binary, because the macros
need the SPDK globals (g_bdev_mgr,
g_threads, etc.) to be available. From
doc/gdb_macros.md:219 : “The
reason is that the macros need to use globals provided by spdk
in order to iterate the spdk lists and build iterable
representations of the list objects.” If you call them
before the binary's symbol table is loaded, gdb prints
RuntimeError: No symbol table is loaded and the
macros refuse to load.
Concrete example: inspect a bdev_io from a core
The macros are not very useful for examining a single
spdk_bdev_io because that is a structure, not a
list. For a single bdev_io you use plain gdb. The flow is:
(gdb) frame 0
#0 bdev_io_complete (bdev_io=0x7f7dcc2c7c08, ...) at bdev.c:7960
(gdb) p *bdev_io
$1 = struct spdk_bdev_io {
bdev = 0x7f7dcc2c7000,
internal = {
ch = 0x7f7dcc2c7c20,
...
},
type = SPDK_BDEV_IO_TYPE_READ,
u = { ... }
}
(gdb) p bdev_io->bdev->name
$2 = 0x7f7dcc0b21b0 "raid_2591"This is a typical post-mortem scenario: you have a core from a
crashed target, you find yourself in bdev_io_complete,
and you want to know which bdev_io you are. The
answer is the bdev pointer's name field.
The macros come in for “list” questions. After the crash above, the natural follow-up is “what other bdevs does this target have?”. The macro:
(gdb) spdk_print_bdevs
SPDK object of type struct spdk_bdev at 0x7f7dcc1642a8
((struct spdk_bdev*) 0x7f7dcc1642a8)
name 0x7f7dcc0b21b0 "raid_2591"
---------------
SPDK object of type struct spdk_bdev at 0x7f7dcc216008
((struct spdk_bdev*) 0x7f7dcc216008)
name 0x7f7dcc0b21a70 "raid_2592"
...Now you can correlate the crashed bdev_io with the list of all bdevs. The first one in the list is the one that was active at the time of the crash.
perf, valgrind, strace, /proc
These are general-purpose Linux tools. Most are not useful for SPDK. A few are. Here is the honest accounting.
perf record / perf report — useful for hot-path profiling
perf is the right tool when you have a
“the target is slow but I don't know why” question
that spdk_top cannot answer. Record with
-g to capture call graphs, sample at 99 Hz to
avoid perturbing the target. The output is a flame graph you can
read with perf report.
perf record -F 99 -p $SPDK_PID -g -- sleep 30
perf reportThe trap: SPDK runs on dedicated lcores, so the call graph
tends to be flat and dominated by your poller callbacks. If
perf shows the same 3 functions for 95% of
samples, the target is healthy and the call chain is just
“reactor -> poller -> nvme_qpair_process_completions ->
bdev_io_complete” repeated millions of times. That's not
a bug; that's how it's supposed to look.
perf is not useful for “why is
this RPC hung” — for that you need a trace dump
(sequence of events) or a core file (current state).
valgrind — useful, but at 10-50x slowdown
valgrind --tool=memcheck catches use-after-free,
uninitialized reads, and double-frees. It is the right tool for
“this bdev_io was used after free” bugs. The
problem: it slows the target down by 10-50x. You cannot run it
in production. You can run it in CI, or in a staging test that
exercises the buggy path.
A common pattern: write a small C program that exercises the
bdev module's submit path 10,000 times, link against the SPDK
libraries, run under valgrind. If there is a use
after free, valgrind prints the exact line where the freed
memory was last touched and the exact line of the free.
strace — mostly useless for SPDK
strace traces syscalls. The reactor model
deliberately avoids syscalls on the hot path, so
strace will show you a lot of
epoll_wait calls and almost nothing else. The one
time it is useful: when an SPDK target is hanging and you
suspect it is blocked on a read from a socket.
strace -p $PID -e trace=read,write,recv,send will
show the live socket I/O.
A second useful case: strace -f -e trace=open,openat -p $PID
to find which file the target is opening right now. This is the
fastest way to answer “why is my config not being
picked up?”.
/proc/PID/maps — hugepage setup verification
/proc/PID/maps shows the process's virtual memory
layout. For SPDK the interesting thing is the hugepage regions.
A correct setup looks like:
7f4d00000000-7f5d00000000 rw-p 00000000 00:0d 3145740 /dev/hugepages/...
...Multiple hugepage regions, one per allocated hugepage. If the
target cannot allocate enough hugepages, you see only the
regular anonymous mappings and no /dev/hugepages
regions. From
app/spdk_top/spdk_top.c:1107 , the
framework assumes the hugepages are already mounted.
The mapping also reveals the libraries that were loaded. A
target that was supposed to use the SPDK asan build but
accidentally linked against the release build will have
different shared objects in the map. Comparing
/proc/PID/maps against the expected list of
.so files is a quick way to find that mistake.
The decision tree: symptom → tool
The hardest part of being an SPDK operator is not knowing any one of these tools. It is knowing which one to reach for when the alert fires at 2 am. Here is a flowchart that maps symptoms to tools.
flowchart TD S0["SPDK target is broken.
Pick a symptom."] --> S1["Q1: Is the process
still running?"] S1 -- "no, it crashed" --> D1["core file
gdb plus spdk_print_bdevs"] S1 -- "yes" --> S2["Q2: Is the JSON-RPC
socket responding?"] S2 -- "no" --> D2["strace -p PID
to see what syscall
is hung"] S2 -- "yes" --> S3["Q3: Is one reactor
pinned at 100%
in spdk_top?"] S3 -- "yes" --> D3["spdk_top POLLERS tab
sort by run count desc
find the runaway poller"] S3 -- "no" --> S4["Q4: Is a specific RPC
hung but others work?"] S4 -- "yes" --> D4["spdk_trace -s app -p pid
check the tracepoint
fires just before the hang"] S4 -- "no" --> S5["Q5: Do you need to know
the I/O rate in production?"] S5 -- "yes" --> D5["bpftrace on the bdev_io
submit USDT probe"] S5 -- "no" --> S6["Q6: Is something crashing
on load (shared lib
mismatch)?"] S6 -- "yes" --> D6["cat /proc/PID/maps
compare to expected
list of .so files"] S6 -- "no" --> S7["Q7: Is something freeing
memory it is still using?"] S7 -- "yes" --> D7["valgrind --tool=memcheck
on a minimal repro"] classDef crash fill:#f5d6e0,stroke:#8a1c4f; classDef tui fill:#d6f5d6,stroke:#2a6f2a; classDef trace fill:#cfe1ff,stroke:#1c4f8a; classDef os fill:#fdf2cf,stroke:#8a6f1a; class D1,D2 crash class D3,D4 tui class D5,D6,D7 trace class S0 os
fig. 2 The first three questions are the only ones that matter: is the process alive, is the RPC socket alive, is one reactor pinned. After that, each branch is a single tool. The exceptions are “no to all three”, which means the target is in a state that needs deeper inspection — see 9.3.
Edge cases: production overhead, containers, stripped binaries
Tracing overhead in production
The trace framework is not free. Each fire is a TSC read, an
lcore lookup, an argument count assert, a memcpy into the
circular buffer, and a smp_wmb. On a hot bdev_io_submit path
that fires millions of times per second, the cost is
measurable. The recommendation: enable tracepoints only when
you are actively chasing a bug. The target
app.c parses -e all at startup and
does not disable it later.
A common production pattern is to leave the framework
compiled in but with no groups enabled (the default). The
cost is the per-lcore TSC-read and the empty branch. You can
then enable a specific group at runtime via the JSON-RPC
trace_enable_tpoint_group RPC, fire I/O, dump
the trace, and disable the group. This is the
“on-demand” workflow.
USDT in containers
USDT probes require the kernel to allow attaching to userspace
probes. In a default Docker container, the
perf_event_open syscall (which bpftrace uses
internally) returns EPERM. The fix is to grant the container
the right capabilities:
docker run --cap-add=SYS_ADMIN --cap-add=SYS_PTRACE ...Or, in a Kubernetes Pod, the pod securityContext needs
capabilities.add: ["SYS_ADMIN", "SYS_PTRACE"].
The second caveat: the SPDK binary inside the container must
be built with --with-usdt. If the binary was
copied in from an external build, the probes may not be
present in the target.
gdb macros for stripped binaries
The gdb macros walk global symbol tables. Stripped binaries have no symbol table, and the macros will fail to load with the error in doc/gdb_macros.md:230 : “RuntimeError: No symbol table is loaded.”
If you need to debug a production binary that was stripped at
install time, rebuild locally with
./configure --enable-debug --disable-strip and
use the unstripped version. Or use the
file command on the binary: a non-stripped SPDK
binary reads “not stripped”; a stripped one
reads “stripped”.
What trips people up
“I ran
spdk_traceand it returned no events.” The SHM file is at/dev/shm/<app>_trace.pid<pid>and the PID is the SPDK target's PID, not the trace tool's PID. Theapp.c: ... Use 'spdk_trace -s nvmf -p 24147'line at target startup gives you both.“I attached
bpftraceand the probe never fired.” The binary was not built with--with-usdt. Runreadelf -n ./build/bin/nvmf_tgtand check for theSPDK_PROBEsection. If it is missing, rebuild.“
spdk_load_macrosgives a RuntimeError.” You attached gdb to a process that does not have SPDK's globals loaded, or the binary is stripped. Use a non-stripped build, or load the binary into gdb withfile ./build/bin/nvmf_tgtfirst.“
perfshows 99% in__lll_lock_wait.” The target is blocking on a pthread mutex. This is a bug — SPDK threads are not supposed to take pthread mutexes on the hot path. The trace dump will tell you which RPC started the lock. (The likely culprit is the JSON-RPC server.)“
valgrindreports ‘still reachable’ for huge amounts of memory.” This is normal for SPDK — mempools and the JSON-RPC server hold onto memory intentionally. The leaks that matter are the “definitely lost” and “indirectly lost” categories.
Why it matters
spdk_top is the day-to-day tool. The three tools
on this page are the incident-response tools. The decision
tree is short because the questions are short: is the
process alive, is the RPC socket alive, is a reactor pinned.
Past that, you are doing forensics — and forensics is
tracing + gdb + USDT in some combination.
The next page, 9.3 —
failure mode retrospectives, is the capstone. It walks
through four real production incidents on baremetal
ex9, what tool was used to diagnose each, and
the rules to follow so the same bug class doesn't happen
again.