Reading spdk_top the way the source wants you to.
spdk_top is top for an SPDK process. It runs as a
TUI on top of ncurses, connects to the same JSON-RPC socket
that scripts/rpc.py uses, and re-polls three RPCs on a timer:
thread_get_stats, thread_get_pollers, and
framework_get_reactors. Every number on the screen comes
from one of those three. The hard part is not starting it — the hard
part is reading what the numbers mean while a target is misbehaving.
This page walks you through every column, every key, and the three
patterns that account for most of the bugs you will see.
- What spdk_top is and how to start it
- The three tabs, the three RPCs
- THREADS tab: every column, decoded
- POLLERS tab: per-poller run counts and busy/idle
- CORES tab: what the OS sees vs. what the reactor sees
- Keys: sort, refresh, columns, help, total vs. interval
- The fields that matter during a real incident
- Three patterns you will see again and again
- Edge cases: shutdown, missing fields, negative numbers
- What trips people up
What spdk_top is and how to start it
spdk_top is a self-contained ncurses binary that ships in
app/spdk_top/. There is nothing magical about it. It opens
the SPDK JSON-RPC socket (the same one you talk to with
scripts/rpc.py), issues three read-only RPCs on a configurable
interval, and renders the response. No part of it runs inside the
target. It is purely a viewer of state that the target has already
published.
The source is one file:
The RPC_MAX_* defines are hard upper bounds on the number of
objects spdk_top will display. RPC_MAX_THREADS=1024
is the cap on the THREADS tab, not a cap on what the target supports.
On a busy nvmf_tgt with 127 poll groups, you may approach
the corresponding RPC_MAX_* cap on the POLLERS tab, and that is itself a useful signal.
To start it, point it at a target's JSON-RPC socket:
    /var/diskengine/spdk/build/bin/spdk_top \
        -r /var/tmp/spdk.sock

The default refresh is 1 second. Press r to change it; the
valid range is 0 to 255 seconds. 0 means "as fast as
possible" — every 10 ms — and is only useful when you are trying to
catch a race.
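The mapping from the refresh setting to the actual sleep between polls can be sketched as follows. This is a minimal illustration that assumes only the 0 to 255 second range and the 10 ms floor stated above; the function and macro names are hypothetical and are not taken from spdk_top.c.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the names are not taken from spdk_top.c. */
#define USEC_PER_SEC 1000000ULL
#define MAX_REFRESH_S 255u

/*
 * Convert the refresh setting (whole seconds, 0..255) into the number of
 * microseconds to sleep between polls. A setting of 0 is treated as
 * "as fast as possible", which this page describes as every 10 ms.
 */
uint64_t refresh_interval_us(unsigned refresh_s)
{
    assert(refresh_s <= MAX_REFRESH_S);
    if (refresh_s == 0) {
        return USEC_PER_SEC / 100; /* 10 ms */
    }
    return (uint64_t)refresh_s * USEC_PER_SEC;
}
```

Calling refresh_interval_us(1) gives the default one-second window in microseconds, which is the number the Busy [us] and Idle [us] columns should roughly add up to.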
The three tabs, the three RPCs
Each tab is fed by exactly one RPC. The mapping is fixed in
data_thread_routine in
app/spdk_top/spdk_top.c:3040 :
| Tab | RPC | What it returns |
|---|---|---|
| THREADS (key 1) | thread_get_stats | One row per spdk_thread: name, core, three poller counts, busy/idle ticks |
| POLLERS (key 2) | thread_get_pollers | One row per registered poller, grouped by thread, classified by type (active / timed / paused) |
| CORES (key 3) | framework_get_reactors | One row per lcore, with kernel-side busy/sys/irq/us time and the lightweight threads scheduled on it |
The data thread runs every refresh_rate microseconds, fetches
all three RPCs in order, and stores the latest copy in three global
arrays. The UI thread renders those arrays on a separate timer. The
two threads coordinate on pthread_mutex_t g_thread_lock.
Three things to notice in this loop. (1) Cores are fetched first because
their count is stable; if you fetch threads first and the reactor set
changed in flight, the cross-reference thread -> core
would lag by one tick. (2) The lock is dropped during
usleep, so a slow RPC does not freeze the UI. (3) When
g_sleep_time is 0 the refresh is 10 ms; the comment in the
source explicitly says this is for "as fast as possible" redraws.
    flowchart LR
        A[spdk_top data thread] --> B["thread_get_stats<br/>(THREADS)"]
        A --> C["thread_get_pollers<br/>(POLLERS)"]
        A --> D["framework_get_reactors<br/>(CORES)"]
        A --> E["framework_get_scheduler<br/>(scheduler pop-up)"]
        B --> F[g_threads_info]
        C --> G[g_pollers_info]
        D --> H[g_cores_info]
        F --> I[UI thread renders]
        G --> I
        H --> I
        J[Key 'h' or 'g'] --> K[help / scheduler pop-up]
        classDef rpc fill:#cfe1ff,stroke:#1c4f8a;
        classDef store fill:#d6f5d6,stroke:#2a6f2a;
        classDef ui fill:#fdf2cf,stroke:#8a6f1a;
        class B,C,D,E rpc
        class F,G,H store
        class I,K ui
fig. 1 The three per-tab RPCs spdk_top issues on each refresh,
the three globals it stores the responses in, and the UI thread that
renders them. The fourth RPC, framework_get_scheduler, feeds the
scheduler pop-up (key g); it is a separate read and does not have its own tab.
THREADS tab: every column, decoded
The THREADS tab is the one you'll spend the most time on. Each row is
one spdk_thread, identified by name and by the lcore its
reactor is pinned to. The columns are declared in
app/spdk_top/spdk_top.c:92 as
enum column_threads_type and rendered with
draw_thread_tab_row at line 1335.
| Column | Source field | What it actually means |
|---|---|---|
| Thread name | thread.name | reactor_N for the per-reactor spdk_thread, or a named user thread (e.g. app_thread, vbdev_passthru_0). Names longer than 26 chars are truncated with a trailing ellipsis (...). |
| Core | core_num | The lcore index the reactor that owns this thread is currently running on. A -1 means the thread is not currently scheduled on a reactor. This happens during thread migration under a non-static scheduler. |
| Active pollers | active_pollers_count | Number of registered active (busy) pollers; see 2.3. Each one runs on every reactor iteration until it is paused or unregistered. |
| Timed pollers | timed_pollers_count | Number of pollers registered with spdk_poller_register_named(... period). They fire on a fixed wall-clock period. |
| Paused pollers | paused_pollers_count | Number of pollers that exist but are currently paused. They take up a slot in the list but do not fire. |
| Idle [us] | idle - last_idle (in interval mode) or idle (in total mode) | How many microseconds this thread spent idle in the last refresh window. A thread that is “idle” is one whose spdk_thread_poll() call returned 0 — there was nothing to do. |
| Busy [us] | busy - last_busy | How many microseconds the thread spent doing real work in the last refresh window. busy + idle is the wall-clock time the thread was on-core. |
| CPU % | derived: busy * 10000 / (busy + idle), displayed as 0.00–100.00 | Percent of the wall-clock window the thread spent doing work. On a quiet target this is < 5%. On a saturated one it pegs at 99.99%. |
| Status | free / running / sleeping / idle / unmatched | Inline indicator flag (a coloured marker in the source) showing the thread's current state. unmatched means the thread exists but the reactor it is registered to is gone — usually a shutdown remnant. |
The single most important pair of columns is Busy [us] and
Idle [us]. They are absolute times, not percentages, and the
sum is the time the thread was scheduled on a reactor. If the sum does
not match the refresh interval (e.g. the tab says 1 s
refresh but a row reads Busy 1.2 s, Idle 0), one of three
things is true: (1) the data is from a previous refresh and the UI has
not yet redrawn; (2) the thread is migrating between reactors (the
clock is shared); or (3) the thread is overrunning its window, which
is what a runaway active poller looks like.
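The CPU % derivation and the "does busy + idle match the refresh interval?" check can be sketched in a few lines. This is an illustration under stated assumptions: the percentage follows the busy * 10000 / (busy + idle) formula from the table, while the function names and the tolerance parameter are hypothetical and do not come from spdk_top.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPU % in hundredths of a percent (2500 means 25.00%), following the
 * busy * 10000 / (busy + idle) formula. Returns 0 for an empty window
 * instead of dividing by zero.
 */
uint64_t cpu_percent_x100(uint64_t busy_us, uint64_t idle_us)
{
    uint64_t total = busy_us + idle_us;

    if (total == 0) {
        return 0;
    }
    return busy_us * 10000 / total;
}

/*
 * Return 1 when busy + idle differs from the refresh interval by more
 * than tolerance_pct percent of that interval, 0 otherwise. The
 * tolerance is an assumption of this sketch, not an spdk_top setting.
 */
int window_mismatch(uint64_t busy_us, uint64_t idle_us,
                    uint64_t interval_us, unsigned tolerance_pct)
{
    uint64_t total = busy_us + idle_us;
    uint64_t diff = total > interval_us ? total - interval_us
                                        : interval_us - total;

    assert(tolerance_pct <= 100);
    return diff * 100 > interval_us * tolerance_pct;
}
```

With a 1 s refresh, a row reading Busy 1.2 s, Idle 0 is flagged by window_mismatch at a 10% tolerance, while a row reading 0.4 s busy and 0.6 s idle is not.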
POLLERS tab: per-poller run counts and busy/idle
Each row is a single registered poller. The columns are declared at
app/spdk_top/spdk_top.c:105 as
enum column_pollers_type:
| Column | What it actually means |
|---|---|
| Poller name | The string passed to spdk_poller_register_named(... name). Anonymous pollers show as their function pointer or a generic name. |
| Type | Active (busy poller), Timed (fixed period), or Paused. The classification comes from the JSON field state in thread_get_pollers. |
| On thread | The owning spdk_thread name. Useful to confirm pollers are on the right thread: a poller such as vtophys_poll on reactor_3 is fine on its own, but if you see it on the same thread as a vhost-user controller, you have a threading violation. |
| Run count | Cumulative number of times the poller has been invoked. In interval mode (default) it is the delta from the previous refresh. This is the “is the poller firing?” number. |
| Period [us] | The configured period in microseconds for a timed poller, or 0 for an active poller. 0us does not mean the poller is broken — it means the poller runs every iteration of the reactor. |
| Status (busy count) | The number of times the poller returned SPDK_POLLER_BUSY (i.e. did real work and wants to be re-polled immediately). In interval mode it is a delta. A poller with a high busy count is either a hot poller by design (NVMe completion scanning) or a runaway poller that never returns idle — see the runaway pattern below. |
The Run count and Status (busy count)
columns tell a story when you sort by them. Sort by busy count,
descending, and the first row is the poller that is doing the most
work right now. If that poller is, say, the nvmf poll group's
poller, the target is doing useful I/O. If it is a
vtophys_poll running thousands of times per second, you
have a DMA mapping leak.
The poller “period = 0” question
You will see pollers with Period [us] = 0. This is normal
for active pollers. The convention is: an active poller is registered
without a period and runs as often as the reactor iterates. A timed
poller has a non-zero period and is bucketed by wall-clock deadline.
Paused pollers are still in the list but their run count stops
incrementing.
If you sort by Run count and the top row reads
Period 0 with a delta of millions of calls per second,
you may have a runaway. Compare the run count delta to
reactor_iterations: in a healthy target, an active
poller's run count cannot exceed the reactor's iteration count, and
its busy count is usually much lower because most iterations find no
work. A poller that reports busy on every single iteration, at tens of
millions of calls per second on a single core, is the one to inspect.
CORES tab: what the OS sees vs. what the reactor sees
The CORES tab is the only one that talks to the kernel — the data
comes from framework_get_reactors which reads
/proc/self/stat for the SPDK process. The column
declaration is in
app/spdk_top/spdk_top.c:115 . The columns
you actually care about:
| Column | What it means |
|---|---|
| Core | The lcore index (0..N-1). |
| Threads | Number of spdk_threads currently scheduled on this core. |
| Pollers | Total pollers (active + timed + paused) registered across all threads on this core. |
| Busy % | Thread-side busy time, same source as THREADS tab. |
| Status | Reactor state. idle means the reactor is sleeping because there are no threads scheduled on it. |
| Intr | Whether the core is currently inside an interrupt handler (Y / N). Useful for confirming that a stalled reactor is not blocked on a kernel interrupt. |
| Sys % / Irq % | Kernel-side time spent in system and IRQ contexts. High Sys % with high Busy % means the reactor is in heavy I/O submission. High Irq % with high Busy % is rare for SPDK (no interrupts) but indicates the OS is being asked to do something. |
| Freq [MHz] | Current core frequency. Modern CPUs throttle aggressively: a reactor at 2.4 GHz on a part that is rated for 3.6 GHz is probably being throttled, thermally or by a power-saving governor. |
A core that shows Threads = 0, Status = idle, Busy % = 0 is a wasted core. The scheduler can hand it to a thread that is over-subscribed elsewhere, but if it stays that way for minutes, the target is misconfigured.
Keys: sort, refresh, columns, help, total vs. interval
The full key list is rendered in the help window opened by h:
The two keys that change the meaning of the numbers are t
(total vs. interval) and c (column toggle). In
interval mode (the default), the busy and idle
columns show deltas from the previous refresh. In
total mode, they show cumulative values since the
process started. Most debugging wants interval mode; the cumulative
numbers are useful for understanding the long-run shape of the
workload.
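The difference between the two modes comes down to which value is shown for each counter. A minimal sketch, assuming a counter is kept as a current sample plus the sample from the previous refresh; the struct and function names are hypothetical, and the clamp to 0 when a counter appears to go backwards is a choice made for this sketch, not confirmed spdk_top behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One cumulative counter sampled on two consecutive refreshes. */
struct counter_sample {
    uint64_t now;  /* value from the latest refresh */
    uint64_t last; /* value from the previous refresh */
};

/*
 * Value to display for a counter. Total mode shows the cumulative
 * value; interval mode shows the change since the previous refresh.
 * Unsigned subtraction would wrap to a huge number if the counter went
 * backwards, so this sketch clamps that case to 0.
 */
uint64_t displayed_value(const struct counter_sample *c, bool interval_mode)
{
    assert(c != NULL);
    if (!interval_mode) {
        return c->now;
    }
    return c->now >= c->last ? c->now - c->last : 0;
}
```

The same cumulative sample therefore reads very differently depending on the mode, which is why it is worth checking which mode is active before comparing numbers across screenshots.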
The [g] scheduler pop-up shows the active reactor
scheduler name (e.g. static or dynamic), its period, and the
active governor, if any.
Press g to bring it up, Esc to close. This
is the only place to see the scheduler's period without issuing an
RPC by hand.
The fields that matter during a real incident
Most of the time spdk_top is just an "is the target
running?" check. The screen has 50+ fields, but during an incident
you only care about three. If you remember nothing else from this
page, remember these.
Field 1: Reactor's "iter" rate (THREADS tab, busy + idle)
How fast the poll loop is running. The Busy [us] and
Idle [us] pair on the THREADS tab, when summed, give the
wall-clock time the thread was on-core. A thread that is “at
100% but doing nothing” shows up as Busy 1000000, Idle
0 on a 1-second refresh — and that is the diagnostic for a
runaway poller. Compare across cores: a healthy target with N cores
has roughly equal Busy across all reactor threads. A target where
one core is at 99% and the others are at 5% has a problem pinned to
that core.
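The "one core hot, the rest cold" comparison can be expressed as a small check over the per-reactor CPU percentages. This is a sketch under assumptions: the function name and both thresholds are hypothetical, and the caller is expected to pass whole-percent values read off the THREADS tab.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the single core whose busy percentage is at or
 * above hot_pct while every other core is at or below cold_pct.
 * Return -1 when the load does not have that shape: no hot core, more
 * than one hot core, or at least one core between the two thresholds.
 */
int find_pinned_core(const unsigned *busy_pct, size_t n,
                     unsigned hot_pct, unsigned cold_pct)
{
    int hot = -1;

    assert(busy_pct != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (busy_pct[i] >= hot_pct) {
            if (hot != -1) {
                return -1; /* second hot core: load is not pinned to one */
            }
            hot = (int)i;
        } else if (busy_pct[i] > cold_pct) {
            return -1; /* mid-range core: load is spread out */
        }
    }
    return hot;
}
```

A result other than -1 names the core to look at first on the POLLERS tab.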
Field 2: Bdev queue depth (POLLERS tab, queue depth not directly visible)
spdk_top does not show per-bdev queue depth by default.
The way you read it is through the poller. Sort POLLERS by busy count
descending — the top row is the bdev module that has the most
outstanding I/O. A bdev that is saturated has a steady, high busy
count; a bdev that is idle has a busy count of 0.
For raw queue depth you need the RPC bdev_get_iostat
(a separate command — see 9.2).
spdk_top deliberately keeps the bdev view out of its
default tabs because it is a TUI, and per-bdev tables change width
with the number of bdevs, which makes the column layout unstable.
Field 3: Pollers' run count (POLLERS tab, run count column)
The most reliable signal of a misbehaving poller. Sort by Run count descending and look at the top five. A poller with a million-delta per second is one of three things: a hot poller by design (NVMe completion scanner), a runaway tight loop (the bug you're chasing), or a poller stuck on a slow resource (an IO channel that is not freeing).
Cross-check by sorting by Status (busy count). A
hot poller that returns SPDK_POLLER_BUSY every
iteration is one that is doing real work. A poller that runs a
million times per second but whose busy count is 0 is a poller
whose callback is just returning — which is, in some cases, a
different problem (e.g. a poll loop on a closed file descriptor
that returns “no events” every iteration).
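The cross-check between run count and busy count can be written down as a small classifier over the two interval deltas. This is a sketch only: the enum, the function name, and the hot_threshold parameter are hypothetical, and the verdict names describe the counters, not the root cause.

```c
#include <assert.h>
#include <stdint.h>

enum poller_verdict {
    POLLER_QUIET,          /* run delta below the "hot" threshold */
    POLLER_NEVER_BUSY,     /* hot, but the callback never reported work */
    POLLER_ALWAYS_BUSY,    /* hot, and reported work on every call */
    POLLER_SOMETIMES_BUSY  /* hot, with a mix of busy and idle returns */
};

/*
 * Classify one poller from its per-interval run and busy deltas.
 * hot_threshold is the run delta above which a poller is worth a look;
 * this page uses "a million per second" as the rule of thumb.
 */
enum poller_verdict classify_poller(uint64_t run_delta, uint64_t busy_delta,
                                    uint64_t hot_threshold)
{
    /* A poller cannot report busy more often than it ran. */
    assert(busy_delta <= run_delta);

    if (run_delta < hot_threshold) {
        return POLLER_QUIET;
    }
    if (busy_delta == 0) {
        return POLLER_NEVER_BUSY;
    }
    if (busy_delta == run_delta) {
        return POLLER_ALWAYS_BUSY;
    }
    return POLLER_SOMETIMES_BUSY;
}
```

POLLER_ALWAYS_BUSY covers both the hot-by-design poller and the runaway, so the verdict narrows the search but the callback still has to be read. POLLER_NEVER_BUSY corresponds to the "callback is just returning" case.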
Three patterns you will see again and again
Pattern 1: “Reactor is at 100% but doing nothing”
The symptom: a single reactor thread is pegged at 99.99% busy, the
others are at 1–5%. POLLERS sorted by run count shows one poller
with a delta in the millions. The diagnostic is straightforward —
that poller is in a tight loop. The most common cause in production
code is a missing SPDK_POLLER_IDLE return.
The fix: read the poller's callback. A correct active poller looks like:

    /* work_available() and do_work() stand in for the module's own logic. */
    int my_poller(void *arg) {
        if (work_available()) {
            do_work();
            return SPDK_POLLER_BUSY;   /* did work this iteration */
        }
        return SPDK_POLLER_IDLE;       /* nothing to do this iteration */
    }

A bug shows up as a poller that does “work” but
unconditionally returns SPDK_POLLER_BUSY: the
reactor keeps calling it, it keeps saying “busy,” the
reactor keeps calling it, forever. The spdk_top view
is the first place this shows up.
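How the two variants differ in the counters can be shown with a toy loop that stands in for the reactor. This is a self-contained simulation, not SPDK code: POLLER_IDLE and POLLER_BUSY are local stand-ins for the SPDK return values, and work_queue, run_reactor, and both pollers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for SPDK_POLLER_IDLE and SPDK_POLLER_BUSY. */
#define POLLER_IDLE 0
#define POLLER_BUSY 1

struct work_queue { unsigned pending; };
struct poller_stats { uint64_t run_count; uint64_t busy_count; };

/* Correct: reports busy only when it actually consumed an item. */
int good_poller(void *arg)
{
    struct work_queue *q = arg;

    if (q->pending > 0) {
        q->pending--;
        return POLLER_BUSY;
    }
    return POLLER_IDLE;
}

/* Buggy: consumes items when present but always reports busy. */
int buggy_poller(void *arg)
{
    struct work_queue *q = arg;

    if (q->pending > 0) {
        q->pending--;
    }
    return POLLER_BUSY;
}

/* Toy reactor: call the poller once per iteration and count returns. */
struct poller_stats run_reactor(int (*fn)(void *), void *arg,
                                uint64_t iterations)
{
    struct poller_stats s = { 0, 0 };

    assert(fn != NULL);
    for (uint64_t i = 0; i < iterations; i++) {
        int rc = fn(arg);

        s.run_count++;
        if (rc > 0) {
            s.busy_count++;
        }
    }
    return s;
}
```

With three queued items and 1000 iterations, the correct poller ends with a busy count of 3 and the buggy one with a busy count equal to its run count, which is the shape to look for on the POLLERS tab.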
Pattern 2: “Bdev queue depth is high but IOPS is low”
The symptom: a bdev module shows a high poller busy count (many submissions happening) but the THREADS tab shows the underlying reactor is at low busy time. The likely cause is that the backend is slower than the front: the module is submitting I/O to the device, the device is queuing I/O, and completions are coming back slower than the rate of submission.
This is healthy behaviour for a target under saturation, but if the busy count on the bdev poller grows linearly over many refreshes, the poller is starving other pollers on the same reactor. The fix is rarely in the poller — it is in the application’s IO depth limit. The poller is the symptom; the queue depth is the problem.
Pattern 3: “Poller period is 0us”
The symptom: a poller on the POLLERS tab reads
Period [us] = 0. The diagnostic is
context-dependent. For an active poller this is correct: by
design, active pollers run every reactor iteration. For a
poller that was meant to be timed it is a bug: the poller
was registered with period_us = 0, which the
runtime accepts, and it is then classified as active.
The classifier (see enum spdk_poller_type in
app/spdk_top/spdk_top.c:131 ) follows the registration
path, and the period passed at registration is part of that path. A
spdk_poller_register_named(..., period_us=0, ...) call therefore
shows up as Active in the Type column, not as Timed. If a poller
you expect to fire on a fixed period shows
Type=Active and Period=0, the application
is calling register with the wrong argument.
Edge cases: shutdown, missing fields, negative numbers
What you see during shutdown
During a clean SPDK shutdown the data thread will continue to fetch
thread_get_stats while threads are being torn down.
Threads transition to Status = unmatched (the source
sets this when a thread's owning reactor is gone), and the count
of pollers on a thread that is in the middle of destruction
flickers. This is normal — it is the target tearing itself down —
but it looks alarming if you don't know.
If spdk_top shows a frozen frame with
ERROR occurred while getting threads data at the
bottom, the JSON-RPC server itself has been torn down before the
TUI. spdk_top cannot tell you this on its own — the
bottom message is the only signal. If the underlying target is
gone, exit and restart spdk_top against the next
target.
Missing fields in some configs
The Freq [MHz] column on the CORES tab reads
0 on kernels where reading the per-core frequency is
not permitted (some hardened profiles, some container runtimes).
The Sys % and Irq % columns can read
0 for the same reason. These are not bugs in the
target; they are the OS refusing to give the process the
information.
The bdev view is not on the default tabs. To see per-bdev queue
depth and IOPS you need the JSON-RPC bdev_get_iostat
and a separate tool. spdk_top is a thread/poller
monitor, not a bdev monitor.
Negative numbers
You will not see truly negative numbers on
spdk_top’s screen — the values are stored as
uint64_t and rendered as
%PRIu64. But on a thread that has been migrated
between reactors, the per-thread deltas can be inconsistent
(the thread was on reactor A for 0.4 s and then moved to reactor
B for 0.6 s, and the per-reactor counters do not sum to the
per-thread counters). This shows up as a sum mismatch between
THREADS and CORES: the sum of “thread busy” on a
single thread is greater than the “reactor busy”
on the core the thread ended up on. The interpretation is
“the thread moved”, not “the counter is
wrong”.
What trips people up
- “spdk_top says 0% CPU but the process is pinned.” The percent column is the reactor's view. A reactor that has called spdk_thread_poll() and is now waiting for the next event timer shows 0%, but the process is using 100% of one core. Always cross-check top -H -p $SPDK_PID on the host.
- “The THREADS tab has more rows than lcores.” That is correct. One reactor per lcore, but many spdk_threads per reactor. An nvmf_tgt with 16 poll groups can have 16+ threads on each reactor, all scheduled round-robin.
- “I sorted by busy count and the top poller is named <anonymous>.” That poller was registered without a name. To find it, look at the On thread column, then run thread_get_pollers with that thread's id and see the raw output.
- “The refresh rate is 0 and the screen flickers.” That is the intended g_sleep_time = 0 mode. The screen redraws every 10 ms. Use it only when chasing a race.
- “spdk_top opens, shows a frame, then exits.” The JSON-RPC socket path is wrong, or the target is not running, or the socket is owned by a different user. ls -l /var/tmp/spdk.sock first.
Why it matters
spdk_top is the only first-line inspection tool
that gives you a continuous, polled view of the reactor and
poller state. It is safe to run in production — it issues
read-only RPCs and never modifies state. The three patterns
above (runaway poller, bdev saturation, period = 0 by mistake)
account for most of the “target is slow” incidents
you will see.
The next page, 9.2 —
tracing, USDT, gdb macros, is what you reach for when
spdk_top is not enough — when you need to see the
sequence of events that led to a single hung RPC, or the
per-bdev IOPS that the TUI deliberately leaves off the
screen.