Layer 9 · Operations & debugging

Reading spdk_top the way the source wants you to.

spdk_top is top for an SPDK process. It runs as a TUI on top of ncurses, connects to the same JSON-RPC socket that scripts/rpc.py uses, and re-polls a small set of read-only RPCs on a timer: thread_get_stats, thread_get_pollers, and framework_get_reactors for the three tabs, plus framework_get_scheduler for the scheduler pop-up. Every number on the three tabs comes from one of the first three. Starting it is the easy part; the hard part is reading what the numbers mean while a target is misbehaving. This page walks through every column, every key, and the three patterns behind most of the "target is slow" incidents you will see.

~12 min read · 2 diagrams · prerequisites: 2.1 · 2.2 · 3.1

On this page

What spdk_top is and how to start it
The three tabs, the three RPCs
THREADS tab: every column, decoded
POLLERS tab: per-poller run counts and busy/idle
CORES tab: what the OS sees vs. what the reactor sees
Keys: sort, refresh, columns, help, total vs. interval
The fields that matter during a real incident
Three patterns you will see again and again
Edge cases: shutdown, missing fields, negative numbers
What trips people up

What `spdk_top` is and how to start it

spdk_top is a self-contained ncurses binary that ships in app/spdk_top/. There is nothing magical about it. It opens the SPDK JSON-RPC socket (the same one you talk to with scripts/rpc.py), issues read-only RPCs on a configurable interval (one per tab, plus one for the scheduler pop-up), and renders the responses. No part of it runs inside the target. It is purely a viewer of state that the target has already published.

The source is one file:

spdk_v26_01_migration/app/spdk_top/spdk_top.c · lines 1-90 spdk_top.c — the entry points, the ncurses setup, the tab enum

#include "spdk/stdinc.h"
#include "spdk/jsonrpc.h"
#include "spdk/rpc.h"
#include "spdk/event.h"
#include "spdk/util.h"
#include "spdk/env.h"

#if defined __has_include
#if __has_include(<ncurses/panel.h>)
#include <ncurses/ncurses.h>
#include <ncurses/panel.h>
#include <ncurses/menu.h>
#else
#include <ncurses.h>
#include <panel.h>
#include <menu.h>
#endif
#endif

#define RPC_MAX_THREADS 1024
#define RPC_MAX_POLLERS 1024
#define RPC_MAX_CORES 1024
#define MAX_THREADS 4096

enum tabs {
    THREADS_TAB,
    POLLERS_TAB,
    CORES_TAB,
    NUMBER_OF_TABS,
};

The RPC_MAX_* defines are hard upper bounds on the number of objects spdk_top will display. RPC_MAX_THREADS=1024 is the cap on the THREADS tab, not a cap on what the target supports. The POLLERS tab has its own cap, RPC_MAX_POLLERS=1024; on a busy nvmf_tgt with 127 poll groups you may approach it, and that is itself a useful signal.

To start it, point it at a target's JSON-RPC socket:

/var/diskengine/spdk/build/bin/spdk_top \
    -s /var/tmp/spdk.sock

The default refresh is 1 second. Press r to change it; the valid range is 0 to 255 seconds. 0 means "as fast as possible" — every 10 ms — and is only useful when you are trying to catch a race.

The three tabs, the three RPCs

Each tab is fed by exactly one RPC. The mapping is fixed in data_thread_routine in app/spdk_top/spdk_top.c:3040 :

Tab	RPC	What it returns
THREADS (key `1`)	`thread_get_stats`	One row per `spdk_thread` — name, core, three poller counts, busy/idle ticks
POLLERS (key `2`)	`thread_get_pollers`	One row per registered poller, grouped by thread, classified by type (active / timed / paused)
CORES (key `3`)	`framework_get_reactors`	One row per lcore, with kernel-side busy/sys/irq/us time and the lightweight threads scheduled on it

The data thread wakes every refresh_rate microseconds, issues the RPCs in a fixed order (cores, threads, pollers, then the scheduler read that feeds the pop-up), and stores the latest copy of each response in global arrays. The UI thread renders those arrays on a separate timer. The two threads coordinate through the mutex g_thread_lock.

spdk_v26_01_migration/app/spdk_top/spdk_top.c · lines 3040-3085 data_thread_routine — the polling loop that drives spdk_top

static void *
data_thread_routine(void *arg)
{
    int rc;
    uint64_t refresh_rate;

    while (1) {
        pthread_mutex_lock(&g_thread_lock);
        if (g_quit_app) {
            pthread_mutex_unlock(&g_thread_lock);
            break;
        }

        if (g_sleep_time == 0) {
            /* Give display thread time to redraw all windows */
            refresh_rate = SPDK_SEC_TO_USEC / 100;
        } else {
            refresh_rate = g_sleep_time * SPDK_SEC_TO_USEC;
        }
        pthread_mutex_unlock(&g_thread_lock);

        /* Get data from RPC for each object type.
         * Start with cores since their number should not change. */
        rc = get_cores_data();
        ...
        rc = get_thread_data();
        ...
        rc = get_pollers_data();
        ...
        rc = get_scheduler_data();
        ...

        usleep(refresh_rate);
    }
    return NULL;
}

Three things to notice in this loop. (1) Cores are fetched first because their count is stable; if threads were fetched first and the reactor set changed in flight, the thread -> core cross-reference would lag by one tick. (2) In this function the lock is held only long enough to read g_quit_app and g_sleep_time; it is released before the RPC calls and the usleep, so a slow RPC does not freeze the UI. (3) When g_sleep_time is 0 the loop still sleeps 10 ms per pass (SPDK_SEC_TO_USEC / 100); the comment in the source says this floor exists to give the display thread time to redraw all windows.

flowchart LR
A[spdk_top data thread] --> B["thread_get_stats
(THREADS)"]
A --> C["thread_get_pollers
(POLLERS)"]
A --> D["framework_get_reactors
(CORES)"]
A --> E["framework_get_scheduler
(scheduler pop-up)"]
B --> F[g_threads_info]
C --> G[g_pollers_info]
D --> H[g_cores_info]
F --> I[UI thread renders]
G --> I
H --> I
J[Key 'h' or 'g'] --> K[help / scheduler pop-up]

classDef rpc fill:#cfe1ff,stroke:#1c4f8a;
classDef store fill:#d6f5d6,stroke:#2a6f2a;
classDef ui fill:#fdf2cf,stroke:#8a6f1a;
class B,C,D,E rpc
class F,G,H store
class I,K ui

fig. 1 The four RPCs spdk_top issues on each refresh, the three globals it stores the responses in, and the UI thread that renders them. The scheduler pop-up (key g) is a separate read; it does not have its own tab.

THREADS tab: every column, decoded

The THREADS tab is the one you'll spend the most time on. Each row is one spdk_thread, identified by name and by the lcore of the reactor it is scheduled on. The columns are declared in app/spdk_top/spdk_top.c:92 as enum column_threads_type and rendered with draw_thread_tab_row at line 1335.

Column	Source field	What it actually means
Thread name	`thread.name`	`reactor_N` for the per-reactor spdk_thread, or a named user thread (e.g. `app_thread`, `vbdev_passthru_0`). Names longer than 26 chars are cut short and end in `...`.
Core	`core_num`	The lcore of the reactor this thread is currently scheduled on. A `-1` means the thread is not currently scheduled on any reactor. This happens during thread migration under a non-static scheduler.
Active pollers	`active_pollers_count`	Number of registered active (busy) pollers; see 2.3. Each one runs on every iteration of its thread's poll loop until it is paused or unregistered.
Timed pollers	`timed_pollers_count`	Number of pollers registered with `spdk_poller_register_named(... period)`. They fire on a fixed wall-clock period.
Paused pollers	`paused_pollers_count`	Number of pollers that exist but are currently paused. They take up a slot in the list but do not fire.
Idle [us]	`idle - last_idle` (in interval mode) or `idle` (in total mode)	How many microseconds this thread spent idle in the last refresh window. A thread that is “idle” is one whose `spdk_thread_poll()` call returned 0 — there was nothing to do.
Busy [us]	`busy - last_busy`	How many microseconds the thread spent doing real work in the last refresh window. `busy + idle` is the wall-clock time the thread was on-core.
CPU %	derived: `busy * 10000 / (busy + idle)`, displayed as 0.00–100.00	Percent of the wall-clock window the thread spent doing work. On a quiet target this is < 5%. On a saturated one it pegs at 99.99%.
Status	free / running / sleeping / idle / unmatched	Inline indicator flag (a coloured marker in the source) showing the thread's current state. `unmatched` means the thread exists but the reactor it is registered to is gone — usually a shutdown remnant.

The single most important pair of columns is Busy [us] and Idle [us]. They are absolute times, not percentages, and the sum is the time the thread was scheduled on a reactor. If the sum does not match the refresh interval (e.g. the tab says 1 s refresh but a row reads Busy 1.2 s, Idle 0), one of three things is true: (1) the data is from a previous refresh and the UI has not yet redrawn; (2) the thread is migrating between reactors (the clock is shared); or (3) the thread is overrunning its window, which is what a runaway active poller looks like.
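The arithmetic behind the Busy [us], Idle [us] and CPU % columns can be sketched in a few lines of C. This is an illustration under stated assumptions, not spdk_top's own code: the helper names ticks_to_us and cpu_percent_x100 are hypothetical, and the sketch assumes the busy and idle counters arrive as tick counts alongside a ticks-per-second rate.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to microseconds. tick_rate is ticks per second
 * (assumed to be reported by the target next to the counters). */
static inline uint64_t
ticks_to_us(uint64_t ticks, uint64_t tick_rate)
{
    assert(tick_rate != 0);
    /* Split whole seconds from the remainder so that the multiplication
     * by 1000000 stays in range for realistic tick rates and uptimes. */
    return (ticks / tick_rate) * 1000000ULL +
           (ticks % tick_rate) * 1000000ULL / tick_rate;
}

/* Busy share of a window in hundredths of a percent (0..10000).
 * Returns 0 when the thread accumulated no time at all in the window,
 * which avoids a division by zero. Floating point is used because
 * busy * 10000 can overflow 64 bits for large cumulative counters. */
static inline uint32_t
cpu_percent_x100(uint64_t busy, uint64_t idle)
{
    uint64_t total = busy + idle;

    if (total == 0) {
        return 0;
    }
    return (uint32_t)((double)busy * 10000.0 / (double)total);
}
```

With these helpers, a row whose converted busy and idle values sum to far more or far less than the refresh interval is the mismatch case described above, and a row with busy = 1000000, idle = 0 on a 1-second refresh evaluates to 10000, i.e. 100.00%.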

POLLERS tab: per-poller run counts and busy/idle

Each row is a single registered poller. The columns are declared at app/spdk_top/spdk_top.c:105 as enum column_pollers_type:

Column	What it actually means
Poller name	The string passed to `spdk_poller_register_named(... name)`. Anonymous pollers show as their function pointer or a generic name.
Type	`Active` (busy poller), `Timed` (fixed period), or `Paused`. The classification comes from which array of the `thread_get_pollers` response the poller appears in (`active_pollers`, `timed_pollers`, or `paused_pollers`).
On thread	The owning `spdk_thread` name. Useful to confirm pollers are on the right thread — a poller on `reactor_3` is fine; a poller named `vtophys_poll` on `reactor_3` is fine too, but if you see it on the same thread as a vhost-user controller, you have a threading violation.
Run count	Cumulative number of times the poller has been invoked. In interval mode (default) it is the delta from the previous refresh. This is the “is the poller firing?” number.
Period [us]	The configured period in microseconds for a timed poller, or `0` for an active poller. `0us` does not mean the poller is broken — it means the poller runs every iteration of the reactor.
Status (busy count)	The number of times the poller returned `SPDK_POLLER_BUSY` (i.e. reported that it did real work on that call). In interval mode it is a delta. A poller with a high busy count is either a hot poller by design (NVMe completion scanning) or a runaway poller that never returns idle; see the runaway pattern below.

The Run count and Status (busy count) columns tell a story when you sort by them. Sort by busy count, descending, and the first row is the poller that is doing the most work right now. If that poller is, say, the nvmf poll group's poller, the target is doing useful I/O. If it is a vtophys_poll running thousands of times per second, you have a DMA mapping leak.

The poller “period = 0” question

You will see pollers with Period [us] = 0. This is normal for active pollers. The convention is: an active poller is registered without a period and runs as often as the reactor iterates. A timed poller has a non-zero period and is bucketed by wall-clock deadline. Paused pollers are still in the list but their run count stops incrementing.

A Period 0 row at the top of a Run count sort with a delta in the millions per second is a runaway candidate, not proof of one. An active poller is invoked once per iteration of its thread's poll loop, so a large run-count delta is expected on a lightly loaded core, and the delta can never exceed reactor_iterations for the same window. What should be much lower in a healthy target is the busy count, because most iterations find no work. A poller whose busy count matches its run count refresh after refresh is claiming work on every call, and that is the one whose callback you should read.
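The comparison of run count against busy count can be written down as a small classifier. This is a sketch only: the enum, the function name classify_poller and the hot_pct threshold are all hypothetical, and the inputs are assumed to be the interval-mode deltas from one POLLERS row.

```c
#include <assert.h>
#include <stdint.h>

enum poller_verdict {
    POLLER_NOT_FIRING,   /* run count did not move in this window */
    POLLER_NEVER_BUSY,   /* runs, but never reported work */
    POLLER_ALWAYS_BUSY,  /* reported work on (almost) every call */
    POLLER_MIXED,        /* some calls found work, some did not */
};

/* Classify one poller from its run-count and busy-count deltas.
 * hot_pct is the busy/run percentage at or above which the poller is
 * treated as "always busy"; 95 is a plausible starting point. */
static inline enum poller_verdict
classify_poller(uint64_t run_delta, uint64_t busy_delta, unsigned hot_pct)
{
    assert(hot_pct > 0 && hot_pct <= 100);

    if (run_delta == 0) {
        return POLLER_NOT_FIRING;
    }
    /* The two counters are not sampled atomically, so clamp. */
    if (busy_delta > run_delta) {
        busy_delta = run_delta;
    }
    if (busy_delta == 0) {
        return POLLER_NEVER_BUSY;
    }
    /* Integer math is fine for per-interval deltas of this size. */
    if (busy_delta * 100 / run_delta >= hot_pct) {
        return POLLER_ALWAYS_BUSY;
    }
    return POLLER_MIXED;
}
```

POLLER_ALWAYS_BUSY does not distinguish a hot-by-design poller from a runaway one; it only narrows the list of callbacks worth reading.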

CORES tab: what the OS sees vs. what the reactor sees

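The CORES tab is the only one that includes kernel-side numbers. spdk_top itself never talks to the kernel: the data comes from framework_get_reactors, and it is the target that reads the per-CPU counters (on Linux, the cpuN lines of /proc/stat) when it builds the response. The column declaration is in app/spdk_top/spdk_top.c:115 . The columns you actually care about: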

Column	What it means
Core	The lcore index (0..N-1).
Threads	Number of `spdk_thread`s currently scheduled on this core.
Pollers	Total pollers (active + timed + paused) registered across all threads on this core.
Busy %	Thread-side busy time, same source as THREADS tab.
Status	Reactor state. `idle` means the reactor is sleeping because there are no threads scheduled on it.
Intr	Whether the reactor is running in interrupt mode (`Y`) or poll mode (`N`). In interrupt mode the reactor sleeps until a file descriptor fires instead of spinning, so a low Busy % on a `Y` core is expected rather than suspicious.
Sys % / Irq %	Kernel-side time spent in system and IRQ contexts. High Sys % with high Busy % means the reactor is in heavy I/O submission. High Irq % with high Busy % is rare for SPDK (no interrupts) but indicates the OS is being asked to do something.
Freq [MHz]	Current core frequency. Modern CPUs scale frequency aggressively; a reactor at 2.4 GHz on a part rated for 3.6 GHz is being held down by the frequency governor or by thermal throttling.

A core that shows Threads = 0, Status = idle, Busy % = 0 is a wasted core. The scheduler can hand it to a thread that is over-subscribed elsewhere, but if it stays that way for minutes, the target is misconfigured.

Keys: sort, refresh, columns, help, total vs. interval

The full key list is rendered in the help window opened by h:

spdk_v26_01_migration/app/spdk_top/spdk_top.c · lines 3113-3150 help_window_display — every key, with its description

print_left(help_win, ++row, col, HELP_WIN_WIDTH, "MENU options", COLOR_PAIR(5));
print_left(help_win, ++row, col, HELP_WIN_WIDTH, "[q] Quit\t\t- quit this application", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Tab] Next tab\t- switch to next tab", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[1-3] Select tab\t- switch to THREADS, POLLERS or CORES tab", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[PgUp] Previous page\t- scroll up to previous page", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[PgDown] Next page\t- scroll down to next page", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Up] Arrow key\t- go to previous data row", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Down] Arrow key\t- go to next data row", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Right] Arrow key\t- go to second sorting window", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Left] Arrow key\t- close second sorting window", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[c] Columns\t\t- choose data columns to display", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[s] Sorting\t\t- change sorting by column", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[r] Refresh rate\t- set refresh rate <0, 255> in seconds", ...);
print_left(help_win, ++row, desc_second_row_col,  HELP_WIN_WIDTH, "that value in seconds", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[Enter] Item details\t- show current data row details (Enter to open, Esc to close)", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[t] Total/Interval\t- switch to display data measured from the start of SPDK", ...);
print_left(help_win, ++row, desc_second_row_col,  HELP_WIN_WIDTH,
           "application or last refresh", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH,
           "[g] Scheduler pop-up - display current scheduler information", ...);
print_left(help_win, ++row, col,  HELP_WIN_WIDTH, "[h] Help\t\t- show this help window", ...);

The two keys that change the meaning of the numbers are t (total vs. interval) and c (column toggle). In interval mode (the default), the busy and idle columns show deltas from the previous refresh. In total mode, they show cumulative values since the process started. Most debugging wants interval mode; the cumulative numbers are useful for understanding the long-run shape of the workload.

The [g] scheduler pop-up shows the active scheduler name (e.g. static or dynamic), its period, and the active governor, if one is loaded. Press g to bring it up, Esc to close. This is the only place in spdk_top to see the scheduler's period; the alternative is calling the framework_get_scheduler RPC yourself.

The fields that matter during a real incident

Most of the time spdk_top is just a "is the target running?" check. The screen has 50+ fields, but during an incident you only care about three. If you remember nothing else from this page, remember these.

STEP 01 · Reactor iter rate: sort THREADS by Busy [us] desc; a stuck reactor pegs at 99.99%
STEP 02 · Bdev queue depth: not on the default tab (see 'field not visible' below)
STEP 03 · Poller run counts: sort POLLERS by Run count desc; a runaway poller has a million-delta per second

Field 1: Reactor's "iter" rate (THREADS tab, busy + idle)

How fast the poll loop is running. The Busy [us] and Idle [us] pair on the THREADS tab, when summed, give the wall-clock time the thread was on-core. A thread that is “at 100% but doing nothing” shows up as Busy 1000000, Idle 0 on a 1-second refresh — and that is the diagnostic for a runaway poller. Compare across cores: a healthy target with N cores has roughly equal Busy across all reactor threads. A target where one core is at 99% and the others are at 5% has a problem pinned to that core.
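The "one core at 99%, the rest at 5%" comparison can be automated over the per-thread busy percentages. This is a sketch with hypothetical names and thresholds; the input is assumed to be the CPU % of each reactor thread in hundredths of a percent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the single hot row, or -1 if the load does not
 * look pinned to one core. A row is "hot" at or above hot_x100 and every
 * other row must be at or below cold_x100 (both in hundredths of a percent). */
static inline int
find_pinned_hotspot(const uint32_t *busy_x100, size_t n,
                    uint32_t hot_x100, uint32_t cold_x100)
{
    size_t hottest = 0;

    assert(busy_x100 != NULL || n == 0);
    if (n < 2) {
        return -1;  /* nothing to compare against */
    }
    for (size_t i = 1; i < n; i++) {
        if (busy_x100[i] > busy_x100[hottest]) {
            hottest = i;
        }
    }
    if (busy_x100[hottest] < hot_x100) {
        return -1;  /* nobody is hot */
    }
    for (size_t i = 0; i < n; i++) {
        if (i != hottest && busy_x100[i] > cold_x100) {
            return -1;  /* load is spread out, not pinned */
        }
    }
    return (int)hottest;
}
```

A result of -1 with every row near 100% is the saturated-target case, which is a capacity question rather than a single misbehaving core.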

Field 2: Bdev queue depth (POLLERS tab; queue depth not directly visible)

spdk_top does not show per-bdev queue depth by default. The way you read it is through the poller. Sort POLLERS by busy count descending — the top row is the bdev module that has the most outstanding I/O. A bdev that is saturated has a steady, high busy count; a bdev that is idle has a busy count of 0.

For raw queue depth you need the RPC bdev_get_iostat (a separate command — see 9.2). spdk_top deliberately keeps the bdev view out of its default tabs because it is a TUI, and per-bdev tables change width with the number of bdevs, which makes the column layout unstable.

Field 3: Pollers' run count (POLLERS tab, run count column)

The most reliable signal of a misbehaving poller. Sort by Run count descending and look at the top five. A poller with a million-delta per second is one of three things: a hot poller by design (NVMe completion scanner), a runaway tight loop (the bug you're chasing), or a poller stuck on a slow resource (an IO channel that is not freeing).

Cross-check by sorting by Status (busy count). A hot poller that returns SPDK_POLLER_BUSY every iteration is one that is doing real work. A poller that runs a million times per second but whose busy count is 0 is a poller whose callback is just returning — which is, in some cases, a different problem (e.g. a poll loop on a closed file descriptor that returns “no events” every iteration).

Three patterns you will see again and again

Pattern 1: “Reactor is at 100% but doing nothing”

The symptom: a single reactor thread is pegged at 99.99% busy, the others are at 1–5%. POLLERS sorted by run count shows one poller with a delta in the millions. The diagnostic is straightforward — that poller is in a tight loop. The most common cause in production code is a missing SPDK_POLLER_IDLE return.

The fix: read the poller's callback. A correct active poller looks like:

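  #include "spdk/thread.h"   /* SPDK_POLLER_BUSY, SPDK_POLLER_IDLE */

  /* work_available() and do_work() stand in for the module's own logic. */
  static int my_poller(void *arg) {
      if (work_available()) {
          do_work();
          return SPDK_POLLER_BUSY;   /* did real work on this call */
      }
      return SPDK_POLLER_IDLE;       /* nothing to do on this call */
  }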

A bug shows up as a poller that does “work” but unconditionally returns SPDK_POLLER_BUSY. The reactor calls an active poller on every iteration either way, but because this one always says “busy,” every iteration is accounted as busy time and the thread never registers as idle. The spdk_top view is the first place this shows up.

Pattern 2: “Bdev queue depth is high but IOPS is low”

The symptom: a bdev module shows a high poller busy count (many submissions happening) but the THREADS tab shows the underlying reactor is at low busy time. The likely cause is that the backend is slower than the front: the module is submitting I/O to the device, the device is queuing I/O, and completions are coming back slower than the rate of submission.

This is healthy behaviour for a target under saturation, but if the busy count on the bdev poller grows linearly over many refreshes, the poller is starving other pollers on the same reactor. The fix is rarely in the poller — it is in the application’s IO depth limit. The poller is the symptom; the queue depth is the problem.

Pattern 3: “Poller period is 0us”

The symptom: a poller on the POLLERS tab reads Period [us] = 0. The diagnosis is context-dependent. For an active poller this is correct: by design, active pollers run every reactor iteration. For a poller that was meant to be timed it is a bug: it was registered with period_us = 0, which the runtime accepts, but the poller is then treated as active.

The type (see enum spdk_poller_type in app/spdk_top/spdk_top.c:131 ) follows the list the runtime placed the poller on, not the author's intent. A spdk_poller_register_named(..., period_us=0, ...) call lands on the active list, so it shows as Type=Active with Period=0 and looks exactly like a deliberately active poller. If a poller you expected under Type=Timed appears as Type=Active with Period=0, the application is calling register with the wrong argument.

Edge cases: shutdown, missing fields, negative numbers

What you see during shutdown

During a clean SPDK shutdown the data thread will continue to fetch thread_get_stats while threads are being torn down. Threads transition to Status = unmatched (the source sets this when a thread's owning reactor is gone), and the count of pollers on a thread that is in the middle of destruction flickers. This is normal — it is the target tearing itself down — but it looks alarming if you don't know.

If spdk_top shows a frozen frame with ERROR occurred while getting threads data at the bottom, the JSON-RPC server itself has been torn down before the TUI. spdk_top cannot tell you this on its own — the bottom message is the only signal. If the underlying target is gone, exit and restart spdk_top against the next target.

Missing fields in some configs

The Freq [MHz] column on the CORES tab reads 0 on kernels where reading the per-core frequency is not permitted (some hardened profiles, some container runtimes). The Sys % and Irq % columns can read 0 for the same reason. These are not bugs in the target; they are the OS refusing to give the process the information.

The bdev view is not on the default tabs. To see per-bdev queue depth and IOPS you need the JSON-RPC bdev_get_iostat and a separate tool. spdk_top is a thread/poller monitor, not a bdev monitor.

Negative numbers

You will not see truly negative numbers on spdk_top’s screen — the values are stored as uint64_t and rendered as %PRIu64. But on a thread that has been migrated between reactors, the per-thread deltas can be inconsistent (the thread was on reactor A for 0.4 s and then moved to reactor B for 0.6 s, and the per-reactor counters do not sum to the per-thread counters). This shows up as a sum mismatch between THREADS and CORES: the sum of “thread busy” on a single thread is greater than the “reactor busy” on the core the thread ended up on. The interpretation is “the thread moved”, not “the counter is wrong”.

What trips people up

“spdk_top says 0% CPU but the process is pinned.” The percent column is the reactor's view. A poll-mode reactor keeps spinning even when every spdk_thread_poll() call finds nothing to do, so spdk_top shows 0% while the process is using 100% of one core. Always cross-check top -H -p $SPDK_PID on the host.
“The THREADS tab has more rows than lcores.” That is correct. One reactor per lcore, but many spdk_threads per reactor. An nvmf_tgt with 16 poll groups can have 16+ threads on each reactor, all scheduled round-robin.
“I sorted by busy count and the top poller is named <anonymous>.” That poller was registered without a name. To find it, note the On thread column, then run thread_get_pollers through scripts/rpc.py and look at that thread's entry in the raw output.
“The refresh rate is 0 and the screen flickers.” That is the intended g_sleep_time = 0 mode. The screen redraws every 10 ms. Use it only when chasing a race.
“spdk_top opens, shows a frame, then exits.” The JSON-RPC socket path is wrong, or the target is not running, or the socket is owned by a different user. ls -l /var/tmp/spdk.sock first.
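The three causes in the last item (wrong path, target not running, wrong owner) map onto distinct connect() errors, so a small probe can tell them apart before you start the TUI. This is a sketch using only POSIX socket calls; the function name rpc_socket_probe is hypothetical, and it assumes the RPC endpoint is a Unix-domain stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Try to connect to the Unix-domain stream socket at `path`.
 * Returns 0 on success or a positive errno value on failure:
 *   ENOENT        no such file (wrong path, or target never started)
 *   ECONNREFUSED  file exists but nobody is listening (stale socket)
 *   EACCES        permission denied (socket owned by another user) */
static inline int
rpc_socket_probe(const char *path)
{
    struct sockaddr_un addr;
    size_t len;
    int fd;
    int err = 0;

    assert(path != NULL);
    len = strlen(path);
    if (len >= sizeof(addr.sun_path)) {
        return ENAMETOOLONG;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return errno;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, len + 1);  /* length checked above */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        err = errno;  /* save before close() can overwrite it */
    }
    close(fd);
    return err;
}
```

A successful probe only shows that something is listening on the path; it does not send a JSON-RPC request, so it cannot confirm that the listener is an SPDK target.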

Why it matters

spdk_top is the only first-line inspection tool that gives you a continuous, polled view of the reactor and poller state. It is safe to run in production — it issues read-only RPCs and never modifies state. The three patterns above (runaway poller, bdev saturation, period = 0 by mistake) account for most of the “target is slow” incidents you will see.

The next page, 9.2 — tracing, USDT, gdb macros, is what you reach for when spdk_top is not enough — when you need to see the sequence of events that led to a single hung RPC, or the per-bdev IOPS that the TUI deliberately leaves off the screen.