Layer 2 · Threading & reactor

spdk_thread — the logical unit of work.

A spdk_thread is a logical thread of execution that the framework multiplexes onto a reactor. It's not a pthread. It's a struct with a mailbox. When you want to do work, you send a message to the thread; the thread's reactor will run it. When the thread has nothing to do, it sits idle and other threads on the same core get all the CPU.

~15 min read · 1 diagram · prerequisite: Layer 2.1

On this page

Why a thread-on-top-of-a-reactor
The struct, end to end
Creating a thread — and the lifetime rules
The mailbox: spdk_thread_send_msg
How a Go JSON-RPC call ends up on an spdk_thread
Pollers vs. messages vs. threads
Migration: when a thread hops reactors
Edge cases & what trips people up

Why a thread-on-top-of-a-reactor

In 2.1 you saw that a reactor is one pthread per core, running a tight poll loop. That's a powerful primitive, but it's also rigid: a single reactor is one execution context. It has one spdk_get_thread(), one TLS variable, one "current io channel per subsystem" cache.

Real applications need many execution contexts. An SPDK-based NVMe-oF target has:

an RPC handler thread (for JSON-RPC requests)
one nvmf target thread per core (for I/O submission)
a poller thread per subsystem (bdev, copy, etc.)
one thread per active TCP connection, in some transports

Each of these needs its own state. The bdev subsystem, for example, caches an spdk_io_channel per execution context — and an "execution context" here means "the thing that submitted the I/O." If two threads on the same reactor shared a channel, they'd contend on the same submission queue and the bdev's poller would have no idea who was waiting for what completion.

The metaphor that helps: think of an spdk_thread as a goroutine, and a reactor as the OS thread that Go's M:N scheduler multiplexes goroutines onto. The mapping is the same idea, and the motivation is the same too: with one pthread per core, you can have many "threads" of execution without paying the cost of a pthread for each.
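
As a rough illustration of that multiplexing, here is a toy model in plain C. It is not SPDK code: toy_thread, toy_reactor, and toy_drain_one are hypothetical names invented for this sketch, and the real reactor loop does considerably more. The point is only that one loop on one pthread gives each lightweight thread a turn, and an idle thread costs a function call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are SPDK APIs. */
#define TOY_MAX_THREADS 8

typedef int (*toy_poll_fn)(void *ctx);   /* returns >0 if it did work */

struct toy_thread {
    const char *name;
    toy_poll_fn poll;
    void *ctx;
};

struct toy_reactor {
    struct toy_thread *threads[TOY_MAX_THREADS];
    size_t count;
};

static int toy_reactor_add(struct toy_reactor *r, struct toy_thread *t)
{
    if (r->count == TOY_MAX_THREADS) {
        return -1;
    }
    r->threads[r->count++] = t;
    return 0;
}

/* One reactor iteration: give every lightweight thread one turn.
 * Returns how many of them reported doing work. */
static int toy_reactor_run_once(struct toy_reactor *r)
{
    int busy = 0;

    for (size_t i = 0; i < r->count; i++) {
        struct toy_thread *t = r->threads[i];

        if (t->poll(t->ctx) > 0) {
            busy++;
        }
    }
    return busy;
}

/* Example poll function: consumes one unit of pending work, if any. */
static int toy_drain_one(void *ctx)
{
    int *pending = ctx;

    if (*pending == 0) {
        return 0;   /* idle: costs one function call, nothing more */
    }
    (*pending)--;
    return 1;
}
```

Two toy threads added to the same reactor share its loop: the one with pending work makes progress on each call to toy_reactor_run_once(), while the idle one simply returns 0.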

The struct, end to end

spdk_v26_01_migration/lib/thread/thread.c · lines 114-171 struct spdk_thread — the heart of the framework

struct spdk_thread {
    uint64_t                  tsc_last;
    struct spdk_thread_stats  stats;
    TAILQ_HEAD(active_pollers_head, spdk_poller)  active_pollers;
    RB_HEAD(timed_pollers_tree, spdk_poller)      timed_pollers;
    struct spdk_poller                            *first_timed_poller;
    TAILQ_HEAD(paused_pollers_head, spdk_poller)  paused_pollers;
    struct spdk_thread_post_poller_handler        pp_handlers[SPDK_THREAD_MAX_POST_POLLER_HANDLERS];
    struct spdk_ring        *messages;
    uint8_t                  num_pp_handlers;
    int                      msg_fd;
    SLIST_HEAD(, spdk_msg)   msg_cache;
    size_t                   msg_cache_count;
    spdk_msg_fn              critical_msg;
    uint64_t                 id;
    uint64_t                 next_poller_id;
    enum spdk_thread_state   state;
    int                      pending_unregister_count;
    uint32_t                 for_each_count;
    RB_HEAD(io_channel_tree, spdk_io_channel)    io_channels;
    TAILQ_ENTRY(spdk_thread)                     tailq;
    char                     name[SPDK_MAX_THREAD_NAME_LEN + 1];
    struct spdk_cpuset       cpumask;
    uint64_t                 exit_timeout_tsc;
    int32_t                  lock_count;
    bool                     is_bound;
    bool                     in_interrupt;
    bool                     poller_unregistered;
    struct spdk_fd_group     *fgrp;
    uint16_t                 trace_id;
    uint8_t                  reserved[6];
    uint8_t                  ctx[0];
};

The fields, grouped by what they're for.

Pollers

active_pollers — the TAILQ of busy (period = 0) pollers. thread_poll() walks this on every reactor tick.
timed_pollers — a red-black tree keyed on next_run_tick. The cached minimum is first_timed_poller, so the reactor can check "do I have any timer work to do right now?" in O(1).
paused_pollers — pollers that have been paused via spdk_poller_pause(). They don't fire until resumed.
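
The cached-minimum idea behind first_timed_poller can be sketched in isolation. This is a simplified model, not the SPDK implementation: a flat array stands in for the red-black tree (so refreshing the minimum is O(n) here rather than O(log n)), and every toy_ name is hypothetical. What it preserves is the fast path: deciding whether any timer is due takes a single comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: a flat array stands in for the red-black tree. */
#define TOY_MAX_POLLERS 16

struct toy_poller {
    uint64_t period_ticks;    /* must be > 0 for a timed poller */
    uint64_t next_run_tick;
    int run_count;
};

struct toy_timers {
    struct toy_poller *pollers[TOY_MAX_POLLERS];
    size_t count;
    struct toy_poller *first;   /* cached earliest deadline */
};

/* Recompute the cached minimum by scanning every poller. */
static void toy_refresh_first(struct toy_timers *t)
{
    t->first = NULL;
    for (size_t i = 0; i < t->count; i++) {
        if (t->first == NULL ||
            t->pollers[i]->next_run_tick < t->first->next_run_tick) {
            t->first = t->pollers[i];
        }
    }
}

static int toy_timers_add(struct toy_timers *t, struct toy_poller *p, uint64_t now)
{
    if (t->count == TOY_MAX_POLLERS || p->period_ticks == 0) {
        return -1;
    }
    p->next_run_tick = now + p->period_ticks;
    t->pollers[t->count++] = p;
    if (t->first == NULL || p->next_run_tick < t->first->next_run_tick) {
        t->first = p;
    }
    return 0;
}

/* Returns how many pollers ran at time `now`. */
static int toy_timers_poll(struct toy_timers *t, uint64_t now)
{
    int ran = 0;

    /* Fast path: one comparison against the cached minimum. */
    while (t->first != NULL && now >= t->first->next_run_tick) {
        struct toy_poller *p = t->first;

        p->run_count++;
        p->next_run_tick = now + p->period_ticks;
        ran++;
        toy_refresh_first(t);
    }
    return ran;
}
```

When nothing is due, toy_timers_poll() returns after one comparison, which is the property the cached pointer exists to provide.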

Messaging

messages — the MPSC ring buffer of struct spdk_msg entries waiting to be run on this thread. The single most important field in the struct.
msg_cache — a per-thread cache of pre-allocated spdk_msg entries, taken from the global spdk_msg_mempool. The thread tries to satisfy send_msg calls from its cache first, to avoid the mempool lock.
critical_msg: a one-shot slot for spdk_thread_send_critical_msg(). The "critical" variant is what you call from a signal handler; it does not preempt a running poller, but it runs ahead of everything else at the next thread_poll() call.

Lifecycle

state — one of SPDK_THREAD_STATE_RUNNING, EXITING, or EXITED. The reactor only polls RUNNING threads.
id — monotonic 64-bit ID assigned at creation. The scheduler uses this as the key for thread lookup.
name — human-readable name, e.g. "nvmf_tgt_poller". Used in logs, in tracepoints, in spdk_top.
cpumask — the suggested cores the thread would like to run on. The scheduler may override.

Thread-local state

io_channels — the red-black tree of spdk_io_channels this thread has acquired. This is the per-thread state that makes lockless I/O possible.
ctx[] — a flexible array member at the end. The framework reserves sizeof(struct spdk_lw_thread) bytes past the spdk_thread for the scheduler's per-thread state. The spdk_thread_get_ctx() API hands you a pointer to it.
lock_count — count of spdk_spinlock_ts held by this thread. SPDK refuses to migrate a thread that holds a lock (see 2.4).
is_bound — if true, the scheduler won't migrate this thread to a different core. Set via spdk_thread_bind().
in_interrupt — true if this thread is currently in interrupt-driven mode (the optional --interrupt-mode).
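
The trailing ctx[] region is an ordinary C idiom and can be shown on its own. The sketch below is a toy: toy_thread, toy_sched_ctx, toy_lib_init, and the other toy_ names are hypothetical, and the layout is much smaller than the real struct. It assumes the context type needs no stricter alignment than the offset of ctx provides (40 bytes in this layout).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: mirrors the layout idea, not the real struct spdk_thread. */
struct toy_thread {
    uint64_t id;
    char name[32];
    uint8_t ctx[];            /* flexible array member: caller-sized tail */
};

/* Hypothetical per-thread scheduler state stored in the tail. */
struct toy_sched_ctx {
    uint32_t lcore;
    int resched;
};

static size_t g_toy_ctx_sz;   /* fixed once, before any thread is created */

static void toy_lib_init(size_t ctx_sz)
{
    g_toy_ctx_sz = ctx_sz;
}

static struct toy_thread *toy_thread_create(const char *name, uint64_t id)
{
    /* One allocation covers the struct and the trailing context. */
    struct toy_thread *t = calloc(1, sizeof(*t) + g_toy_ctx_sz);

    if (t == NULL) {
        return NULL;
    }
    t->id = id;
    snprintf(t->name, sizeof(t->name), "%s", name);
    return t;
}

/* Hand the caller a pointer to the region just past the struct. */
static void *toy_thread_get_ctx(struct toy_thread *t)
{
    return t->ctx;
}

/* Inverse mapping: recover the owning thread from its context pointer. */
static struct toy_thread *toy_thread_from_ctx(void *ctx)
{
    return (struct toy_thread *)((uint8_t *)ctx - offsetof(struct toy_thread, ctx));
}
```

Because the context lives in the same allocation as the thread, looking it up needs no table and no lock: it is pointer arithmetic in both directions.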

Creating a thread — and the lifetime rules

A thread is created with spdk_thread_create():

spdk_v26_01_migration/lib/thread/thread.c · lines 527-636 spdk_thread_create() — the entry point

struct spdk_thread *
spdk_thread_create(const char *name, const struct spdk_cpuset *cpumask)
{
    struct spdk_thread *thread, *null_thread;
    size_t size = SPDK_ALIGN_CEIL(sizeof(*thread) + g_ctx_sz, SPDK_CACHE_LINE_SIZE);
    struct spdk_msg *msgs[SPDK_MSG_MEMPOOL_CACHE_SIZE];
    int rc = 0, i;

    rc = posix_memalign((void **)&thread, SPDK_CACHE_LINE_SIZE, size);
    if (rc != 0) {
        SPDK_ERRLOG("Unable to allocate memory for thread\n");
        return NULL;
    }
    memset(thread, 0, size);

    if (cpumask) {
        spdk_cpuset_copy(&thread->cpumask, cpumask);
    } else {
        spdk_cpuset_negate(&thread->cpumask);
    }

    RB_INIT(&thread->io_channels);
    TAILQ_INIT(&thread->active_pollers);
    RB_INIT(&thread->timed_pollers);
    TAILQ_INIT(&thread->paused_pollers);
    SLIST_INIT(&thread->msg_cache);
    thread->msg_cache_count = 0;

    thread->tsc_last = spdk_get_ticks();
    thread->next_poller_id = 1;
    thread->messages = spdk_ring_create(SPDK_RING_TYPE_MP_SC, 65536, SPDK_ENV_NUMA_ID_ANY);
    if (!thread->messages) {
        SPDK_ERRLOG("Unable to allocate memory for message ring\n");
        free(thread);
        return NULL;
    }

    rc = spdk_mempool_get_bulk(g_spdk_msg_mempool, (void **)msgs, SPDK_MSG_MEMPOOL_CACHE_SIZE);
    if (rc == 0) {
        for (i = 0; i < SPDK_MSG_MEMPOOL_CACHE_SIZE; i++) {
            SLIST_INSERT_HEAD(&thread->msg_cache, msgs[i], link);
            thread->msg_cache_count++;
        }
    }

    if (name) {
        snprintf(thread->name, sizeof(thread->name), "%s", name);
    } else {
        snprintf(thread->name, sizeof(thread->name), "%p", thread);
    }

    thread->trace_id = spdk_trace_register_owner(OWNER_TYPE_THREAD, thread->name);

    pthread_mutex_lock(&g_devlist_mutex);
    if (g_thread_id == 0) {
        SPDK_ERRLOG("Thread ID rolled over. Further thread creation is not allowed.\n");
        pthread_mutex_unlock(&g_devlist_mutex);
        _free_thread(thread);
        return NULL;
    }
    thread->id = g_thread_id++;
    TAILQ_INSERT_TAIL(&g_threads, thread, tailq);
    g_thread_count++;
    pthread_mutex_unlock(&g_devlist_mutex);

    if (spdk_interrupt_mode_is_enabled()) {
        thread->in_interrupt = true;
        rc = thread_interrupt_create(thread);
        if (rc != 0) {
            _free_thread(thread);
            return NULL;
        }
    }

    if (g_new_thread_fn) {
        rc = g_new_thread_fn(thread);
    } else if (g_thread_op_supported_fn && g_thread_op_supported_fn(SPDK_THREAD_OP_NEW)) {
        rc = g_thread_op_fn(thread, SPDK_THREAD_OP_NEW);
    }

    if (rc != 0) {
        _free_thread(thread);
        return NULL;
    }

    thread->state = SPDK_THREAD_STATE_RUNNING;

    null_thread = NULL;
    __atomic_compare_exchange_n(&g_app_thread, &null_thread, thread, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

    return thread;
}

Key invariants being established here:

Cache-line aligned. posix_memalign with SPDK_CACHE_LINE_SIZE. Because the thread struct is going to be touched by another core (the reactor that ends up running it), it must not share a cache line with anything else.
Default cpumask is "all cores". The struct has just been zeroed by memset, so spdk_cpuset_negate flips an empty mask into one with every bit set, meaning the thread is willing to run on any core. The scheduler will choose one.
65536-slot message ring. Each thread allocates its own MPSC ring buffer for incoming messages. 65536 is enormous. If you fill it, something has gone very wrong: when spdk_ring_enqueue fails inside spdk_thread_send_msg (lib/thread/thread.c:1452), the process aborts.
Pre-fill the message cache. The thread tries to grab SPDK_MSG_MEMPOOL_CACHE_SIZE (1024) message objects at creation time, so subsequent spdk_thread_send_msg() calls can be lock-free from the local cache.
First thread becomes the app thread. The __atomic_compare_exchange_n at the bottom atomically installs the first thread ever created as the "app thread."
The framework gets a hook. The g_thread_op_fn callback at lib/thread/thread.c:1353 is the reactor subsystem's chance to actually place the new thread on a reactor's threads list. With the static scheduler, this calls _reactor_schedule_thread().
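
The first invariant, rounding the size up and aligning the allocation to a cache line, can be tried in isolation. In this sketch the 64-byte line size is an assumption (the listing uses SPDK_CACHE_LINE_SIZE), C11 aligned_alloc stands in for posix_memalign, and the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CACHE_LINE 64u   /* assumed line size for this sketch */

/* Round `sz` up to a multiple of `align` (align must be a power of two). */
static size_t toy_align_ceil(size_t sz, size_t align)
{
    return (sz + align - 1) & ~(align - 1);
}

/* Allocate a zeroed block that starts on a cache-line boundary and
 * occupies whole cache lines, so nothing else can share its lines. */
static void *toy_alloc_line_aligned(size_t payload)
{
    size_t size = toy_align_ceil(payload, TOY_CACHE_LINE);
    void *p = aligned_alloc(TOY_CACHE_LINE, size);

    if (p == NULL) {
        return NULL;
    }
    memset(p, 0, size);
    return p;
}
```

Aligning the start address is only half of the guarantee. Rounding the size up matters as well, otherwise the allocator could place an unrelated object in the unused tail of the last cache line.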

Once created, a thread is "live" — it is on some reactor's threads list, the reactor is calling spdk_thread_poll() on it, and any spdk_thread_send_msg() targeted at it will be delivered.

When you're done with a thread, you call spdk_thread_exit(), but only from the thread itself, and only after all of its I/O channels have been released with spdk_put_io_channel() and all of its pollers have been unregistered with spdk_poller_unregister(). The exit sequence is asynchronous: spdk_thread_exit() just flips the state to EXITING and starts a 5-second timeout. Subsequent reactor iterations call thread_exit() (lib/thread/thread.c:672) to see whether all the cleanup has actually happened.
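
The asynchronous exit can be pictured as a small state machine. The model below is a sketch under stated assumptions, not the framework's code: the toy_ names are hypothetical, time is an abstract tick counter rather than TSC, and forcing the thread to EXITED once the deadline passes is this toy's choice for what the timeout does.

```c
#include <assert.h>
#include <stdint.h>

enum toy_state { TOY_RUNNING, TOY_EXITING, TOY_EXITED };

struct toy_thread {
    enum toy_state state;
    int open_channels;        /* I/O channels not yet released */
    int live_pollers;         /* pollers not yet unregistered */
    uint64_t exit_deadline;
};

#define TOY_EXIT_TIMEOUT_TICKS 5u   /* stands in for the 5-second timeout */

/* Request exit: only flips the state and arms the deadline. */
static void toy_thread_exit(struct toy_thread *t, uint64_t now)
{
    if (t->state != TOY_RUNNING) {
        return;   /* already exiting or exited */
    }
    t->state = TOY_EXITING;
    t->exit_deadline = now + TOY_EXIT_TIMEOUT_TICKS;
}

/* Called once per reactor iteration: finish the exit when cleanup is done. */
static void toy_check_exit(struct toy_thread *t, uint64_t now)
{
    if (t->state != TOY_EXITING) {
        return;
    }
    if (t->open_channels == 0 && t->live_pollers == 0) {
        t->state = TOY_EXITED;
        return;
    }
    if (now >= t->exit_deadline) {
        /* Assumption of this toy: give up waiting after the deadline. */
        t->state = TOY_EXITED;
    }
}
```

The thread stays in EXITING across as many iterations as it takes for the channel and poller counts to reach zero, which is why calling exit and observing EXITED are two separate moments.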

The mailbox: `spdk_thread_send_msg`

This is the workhorse of the framework. Every cross-thread handoff that isn't a poller goes through spdk_thread_send_msg():

spdk_v26_01_migration/lib/thread/thread.c · lines 1415-1461 spdk_thread_send_msg() — the universal handoff

int
spdk_thread_send_msg(const struct spdk_thread *thread, spdk_msg_fn fn, void *ctx)
{
    struct spdk_thread *local_thread;
    struct spdk_msg *msg;
    int rc;

    assert(thread != NULL);

    if (spdk_unlikely(thread->state == SPDK_THREAD_STATE_EXITED)) {
        SPDK_ERRLOG("Thread %s is marked as exited.\n", thread->name);
        abort();
    }

    local_thread = _get_thread();

    msg = NULL;
    if (local_thread != NULL) {
        if (local_thread->msg_cache_count > 0) {
            msg = SLIST_FIRST(&local_thread->msg_cache);
            assert(msg != NULL);
            SLIST_REMOVE_HEAD(&local_thread->msg_cache, link);
            local_thread->msg_cache_count--;
        }
    }

    if (msg == NULL) {
        msg = spdk_mempool_get(g_spdk_msg_mempool);
        if (!msg) {
            SPDK_ERRLOG("msg could not be allocated\n");
            abort();
        }
    }

    msg->fn = fn;
    msg->arg = ctx;

    rc = spdk_ring_enqueue(thread->messages, (void **)&msg, 1, NULL);
    if (rc != 1) {
        SPDK_ERRLOG("msg could not be enqueued\n");
        abort();
    }

    thread_send_msg_notification(thread);

    return 0;
}

This is the entire mechanism. Step by step.

Sanity check the target. If the target is already EXITED, that's a use-after-free bug; abort.
Try the local cache. The caller might itself be an spdk_thread, in which case it has its own msg_cache (the SLIST). Pop one off; don't take a global lock.
Fall back to the mempool. If the caller isn't on an SPDK thread (e.g. a Go runtime goroutine calling into CGo), or the local cache is empty, take from the global mempool.
Fill the message. fn and arg are the function and its context. fn will be called as fn(ctx).
Enqueue. spdk_ring_enqueue() is a lock-free MPSC ring buffer push. It returns the number of entries enqueued; anything other than 1 aborts the process.
Notify the target. In poll mode (the default), this is a no-op — the target's reactor will pick up the message on its next spdk_thread_poll() anyway. In interrupt mode, it writes to the target's msg_fd.
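
The steps above can be sketched as a single-threaded toy mailbox. None of this is SPDK code: the toy_ names are hypothetical, malloc stands in for the global mempool, the ring has 8 slots and no atomics, a full ring returns -1 where the listing aborts, and recycling a finished message into the receiver's cache is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*toy_msg_fn)(void *ctx);

struct toy_msg {
    toy_msg_fn fn;
    void *arg;
    struct toy_msg *next;       /* free-list link while cached */
};

#define TOY_RING_SLOTS 8u        /* power of two; tiny on purpose */

struct toy_thread {
    struct toy_msg *ring[TOY_RING_SLOTS];
    unsigned head, tail;        /* head: next to run; tail: next free slot */
    struct toy_msg *msg_cache;  /* per-thread free list */
};

/* Steps 2-3: prefer the sender's cache, fall back to the allocator. */
static struct toy_msg *toy_msg_get(struct toy_thread *sender)
{
    if (sender != NULL && sender->msg_cache != NULL) {
        struct toy_msg *m = sender->msg_cache;

        sender->msg_cache = m->next;
        return m;
    }
    return malloc(sizeof(struct toy_msg));   /* stands in for the mempool */
}

/* Steps 4-5: fill the message and push it onto the target's ring. */
static int toy_send_msg(struct toy_thread *sender, struct toy_thread *target,
                        toy_msg_fn fn, void *ctx)
{
    struct toy_msg *m;

    if (target->tail - target->head == TOY_RING_SLOTS) {
        return -1;              /* ring full */
    }
    m = toy_msg_get(sender);
    if (m == NULL) {
        return -1;
    }
    m->fn = fn;
    m->arg = ctx;
    target->ring[target->tail++ % TOY_RING_SLOTS] = m;
    return 0;
}

/* Receiver side: run what was queued before this call, then recycle. */
static int toy_poll_msgs(struct toy_thread *self)
{
    unsigned end = self->tail;  /* messages sent during the drain wait a turn */
    int n = 0;

    while (self->head != end) {
        struct toy_msg *m = self->ring[self->head++ % TOY_RING_SLOTS];

        m->fn(m->arg);
        m->next = self->msg_cache;
        self->msg_cache = m;
        n++;
    }
    return n;
}

/* Sample handler for trying the mailbox out. */
static void toy_add_one(void *ctx)
{
    (*(int *)ctx)++;
}
```

Passing NULL as the sender models a caller that is not on any toy thread: it has no cache, so the message comes straight from the allocator.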

Critical detail: the message is delivered asynchronously. The caller continues immediately. The fn runs on the target thread, on its reactor, at some point in the future. The caller cannot assume the message has been delivered by the time spdk_thread_send_msg() returns. If the caller needs a "wait for completion" pattern, the convention is to send a follow-up message back to the caller.
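
The "send a follow-up message back" convention looks roughly like this in a toy model. The mailbox here is a deliberately tiny array and every toy_ name is hypothetical; only the shape of the round trip is the point. The request carries a pointer to its origin thread, the owner does the work, and completion is reported by a second message rather than by a return value.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_fn)(void *ctx);

#define TOY_Q 4

struct toy_thread {
    toy_fn fns[TOY_Q];
    void *args[TOY_Q];
    int count;
};

static int toy_send(struct toy_thread *t, toy_fn fn, void *arg)
{
    if (t->count == TOY_Q) {
        return -1;
    }
    t->fns[t->count] = fn;
    t->args[t->count] = arg;
    t->count++;
    return 0;
}

/* Run everything queued before this call; later sends wait for the next poll. */
static int toy_poll(struct toy_thread *t)
{
    toy_fn fns[TOY_Q];
    void *args[TOY_Q];
    int n = t->count;

    for (int i = 0; i < n; i++) {
        fns[i] = t->fns[i];
        args[i] = t->args[i];
    }
    t->count = 0;
    for (int i = 0; i < n; i++) {
        fns[i](args[i]);
    }
    return n;
}

struct toy_request {
    struct toy_thread *origin;   /* where the reply must be delivered */
    int input;
    int result;
    int done;
};

/* Runs on the origin thread once the owner has finished. */
static void toy_on_reply(void *ctx)
{
    struct toy_request *req = ctx;

    req->done = 1;
}

/* Runs on the owning thread: do the work, then message the origin back. */
static void toy_do_work(void *ctx)
{
    struct toy_request *req = ctx;

    req->result = req->input * 2;
    toy_send(req->origin, toy_on_reply, req);
}
```

Nothing in the round trip blocks: the origin keeps polling as usual and learns about completion when toy_on_reply() runs on one of its own iterations.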

There is one more variant: spdk_thread_send_critical_msg():

spdk_v26_01_migration/lib/thread/thread.c · lines 1463-1476 spdk_thread_send_critical_msg() — the signal-handler-safe variant

int
spdk_thread_send_critical_msg(struct spdk_thread *thread, spdk_msg_fn fn)
{
    spdk_msg_fn expected = NULL;

    if (!__atomic_compare_exchange_n(&thread->critical_msg, &expected, fn, false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        abort();
    }

    thread_send_msg_notification(thread);

    return 0;
}

Two differences from the normal variant:

Lock-free atomic CAS. There is no ring buffer. The critical message is stored in a single spdk_msg_fn slot (thread->critical_msg), and the CAS ensures only one critical message is outstanding at a time. If you try to send a second one before the first has been run, you abort.
Runs first. The very first thing thread_poll() does, on every reactor tick, is check thread->critical_msg and run it before draining normal messages. It does not interrupt a poller that is already running; it jumps the queue at the next tick. SIGINT handlers use it to kick off shutdown, for example.
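
A single-slot mailbox guarded by compare-and-swap is small enough to model directly. The sketch uses the same GCC/Clang __atomic builtin as the listing, but everything else is hypothetical: the toy_ names, the global slot, returning -1 where the listing aborts, and clearing the slot with an atomic exchange on the consumer side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*toy_msg_fn)(void *ctx);

static toy_msg_fn g_toy_slot;     /* NULL means the slot is free */
static int g_toy_shutdown;

/* Producer: claim the slot only if it is empty. */
static int toy_send_critical(toy_msg_fn fn)
{
    toy_msg_fn expected = NULL;

    if (!__atomic_compare_exchange_n(&g_toy_slot, &expected, fn, false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        return -1;                /* one already outstanding */
    }
    return 0;
}

/* Consumer: called at the top of each poll, before normal messages. */
static int toy_poll_critical(void)
{
    toy_msg_fn fn = __atomic_exchange_n(&g_toy_slot, (toy_msg_fn)0,
                                        __ATOMIC_SEQ_CST);

    if (fn == NULL) {
        return 0;
    }
    fn(NULL);
    return 1;
}

/* Sample critical handler: flag that shutdown was requested. */
static void toy_request_shutdown(void *ctx)
{
    (void)ctx;
    g_toy_shutdown = 1;
}
```

The producer side performs no allocation and takes no lock, which is what makes this shape suitable for a signal handler, where a mempool or a ring enqueue would not be safe to call.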

How a Go JSON-RPC call ends up on an `spdk_thread`

The connection between the two layers is worth tracing once, end to end, because it's the abstraction boundary your diskengine code crosses constantly.

Step 01 · Go code: spdkClient.BdevLvolCreate(...) in diskengine
Step 02 · JSON-RPC encode: Client.Call serializes to JSON, writes to Unix socket
Step 03 · SPDK RPC server: spdk_jsonrpc_server thread reads the socket
Step 04 · Dispatch: RPC handler runs on a poller thread (the framework routes it)
Step 05 · bdev_lvol_create: Handler submits bdev I/O via the bdev module's submit callback
Step 06 · Completion: bdev poller fires, completes the bdev_io
Step 07 · RPC response: Handler sends the response, the RPC server writes to the socket
Step 08 · Go code resumes: Client.Call unblocks with the result

The detail that matters is step 4. The RPC framework doesn't run the handler on a thread you chose — it runs the handler on whatever thread the framework dispatches JSON-RPC work to (typically the "app thread" or a dedicated RPC thread). The handler then may decide to send the work to yet another thread (e.g. the nvmf target's submit thread) via spdk_thread_send_msg(). This is what the "every I/O channel is bound to one thread" rule looks like in practice: the bdev module's submit callback has to be called on the thread that owns the channel.

The diskengine side is intentionally simple: the Go code in Client.Call:43 does a synchronous request/response and waits. It has no concept of "which reactor am I on" because it's not on one. The threading model is entirely internal to the SPDK process. From Go's perspective, SPDK is a service that responds to JSON-RPC requests.

Pollers vs. messages vs. threads

Three abstractions, three use cases. Mixing them up is the most common architectural mistake.

Abstraction	What it does	When to use it
Poller	A function that runs repeatedly on the thread, on a period (or as fast as possible).	When you need to poll a state, complete I/O, recheck a queue, etc. Anything that needs to run on every reactor iteration or on a timer.
Message	A one-shot function delivered to the thread's mailbox, run on the next reactor iteration.	When you want a callback to run "soon, on this thread" without registering a recurring poller. RPC handlers, I/O completions, deferred cleanup, state transitions.
Thread	A logical unit of work that has its own state (pollers, channels, message ring).	When you have a subsystem with long-lived state. The bdev module's submit callback, the nvmf target's poller, the RPC server's request thread — each is its own `spdk_thread`.

Rule of thumb: if you're tempted to "just register a 1 ms poller to do this one thing," you're almost always better off sending a message instead. Pollers are for recurring work; messages are for one-shots.

Migration: when a thread hops reactors

With the static scheduler, threads never migrate. With the dynamic scheduler (gpm), they can. Here's the mechanism:

spdk_v26_01_migration/lib/event/reactor.c · lines 1324-1351 _reactor_request_thread_reschedule() — the migration request

static void
_reactor_request_thread_reschedule(struct spdk_thread *thread)
{
    struct spdk_lw_thread *lw_thread;
    struct spdk_reactor *reactor;
    uint32_t current_core;

    assert(thread == spdk_get_thread());

    lw_thread = spdk_thread_get_ctx(thread);

    assert(lw_thread != NULL);
    lw_thread->resched = true;
    lw_thread->lcore = SPDK_ENV_LCORE_ID_ANY;

    current_core = spdk_env_get_current_core();
    reactor = spdk_reactor_get(current_core);
    assert(reactor != NULL);

    if (spdk_unlikely(spdk_cpuset_get_cpu(&reactor->notify_cpuset, reactor->lcore))) {
        uint64_t notify = 1;

        if (write(reactor->resched_fd, &notify, sizeof(notify)) < 0) {
            SPDK_ERRLOG("failed to notify reschedule: %s.\n", spdk_strerror(errno));
        }
    }
}

Migration is initiated by spdk_thread_set_cpumask(): the thread's cpumask changes, the framework sets resched = true and lcore = ANY, and the next reactor iteration removes the thread from the current reactor's list and calls _reactor_schedule_thread() to place it on a new one.

What this means for you: under the dynamic scheduler in particular, you should not hold a per-thread resource (an spdk_io_channel from a different thread, a pointer into another thread's state) for longer than a single reactor iteration. The pointer you stashed at iteration N might be invalid by iteration N+1. If you need cross-thread state, the safe pattern is to send a message to the owning thread to operate on the state for you.

Edge cases & what trips people up

1. `spdk_thread_send_msg()` from the target thread itself

The function does not check whether the target is the calling thread; it cheerfully enqueues a message on the very thread that just called it. The message will sit in the ring until the next reactor iteration, and then run. This is a recipe for deadlock if your fn is waiting for the message to be delivered. The pattern "send a message to self, then wait for it to be processed" is broken: the thread cannot drain its own mailbox while it is blocked waiting. Use a regular function call (or spdk_thread_exec_msg() at include/spdk/thread.h:547, which detects the local case and runs the function immediately).
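
The "detect the local case" behaviour can be modelled in a few lines. This is a toy, not the header's implementation: the toy_ names, the one-slot mailbox, the global that stands in for the thread-local current-thread pointer, and the 1/0/-1 return convention are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_fn)(void *ctx);

struct toy_thread {
    toy_fn pending_fn;     /* one-slot mailbox, enough for the sketch */
    void *pending_arg;
};

/* Stands in for the thread-local "current thread" pointer. */
static struct toy_thread *g_toy_current;

static int toy_send_msg(struct toy_thread *t, toy_fn fn, void *arg)
{
    if (t->pending_fn != NULL) {
        return -1;         /* mailbox full */
    }
    t->pending_fn = fn;
    t->pending_arg = arg;
    return 0;
}

/* Returns 1 if fn ran inline, 0 if it was queued, -1 if the mailbox was full. */
static int toy_exec_msg(struct toy_thread *target, toy_fn fn, void *arg)
{
    if (g_toy_current == target) {
        fn(arg);           /* already on the target: plain function call */
        return 1;
    }
    return toy_send_msg(target, fn, arg);
}

/* Sample handler for trying it out. */
static void toy_set_flag(void *arg)
{
    *(int *)arg = 1;
}
```

With this shape the caller never waits on its own mailbox: on the target thread the work has already happened by the time the call returns, and on any other thread it is queued as usual.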

2. Calling `spdk_get_io_channel()` on a thread that doesn't exist

The function at lib/thread/thread.c:2376 does thread = _get_thread(); if (!thread) ... abort(). If you're in a pthread that the framework didn't set up — for example, a Go goroutine that crossed the CGo boundary — tls_thread is NULL and you abort. There is no implicit "current thread" for non-SPDK threads. Everything inside SPDK requires you to be on a known spdk_thread.

3. The first `spdk_thread_create()` sets the app thread

The atomic compare-and-exchange at lib/thread/thread.c:632 means "the first thread wins." If you create thread A, then create thread B, then create thread C, all of A, B, and C are normal threads, but A is the "app thread" because it was first. Framework init and fini must happen from the app thread.

4. What happens when the target's reactor is busy

Your spdk_thread_send_msg() succeeds — the message is in the ring. The target thread is still mid-poller on something slow. The message waits. Send-and-forget has unbounded latency. The framework gives you spdk_thread_send_critical_msg() for "I really need this to run now" but that still waits for the current poller to return. There is no preemption. If you need a back-pressure mechanism, the framework gives you the ring's fill level — check spdk_ring_count() before sending, or design your message handlers to be fast.

5. Migration while a poller is running

The migration check is at lib/event/reactor.c:922, in reactor_post_process_lw_thread(). It runs after the thread's pollers, not during. So a poller is guaranteed to run to completion on the current reactor. After it returns, the thread might get moved. If your poller stashes a pointer to reactor-local data and assumes the data is still valid in the next iteration, you're wrong. Each iteration is "fresh." Persist data on the spdk_thread struct, not on the reactor.

6. Foreign threads, foreign locks

If a Go goroutine calls into the C side via CGo and that path tries to take an spdk_spinlock, it will trip the SPIN_ERR_NOT_SPDK_THREAD assertion at lib/thread/thread.c:3273. The lock expects to be held by an spdk_thread. If you need a lock that a Go goroutine can take, take a pthread_mutex on the Go side and design the C side to never block waiting for it. The same is true for spdk_io_channel: the channel is bound to a thread, and "the thread" is the spdk_thread that acquired it, not whatever pthread happens to be running.

7. Holding an `spdk_thread *` across reactor iterations

The spdk_thread pointer is stable for the lifetime of the thread, and no longer. The thread can be destroyed (via spdk_thread_exit + spdk_thread_destroy), and once it's destroyed the pointer is dangling. If you're tempted to "just keep the pointer in a global and send a message to it later," ask yourself: who guarantees it's still alive? The answer in practice is the framework's for_each_count / pending_unregister_count machinery, which is why spdk_for_each_thread() bumps those counts and refuses to unregister a thread that's the target of an in-flight iteration. Read lib/thread/thread.c:2049 if you ever write a spdk_for_each_thread of your own.

8. The diskengine never knows which thread it talked to

Look at BdevLvolCreate:97. The Go code just gets back a UUID string. It has no idea which spdk_thread the bdev module ran on, which reactor processed the request, or how many polls it took. This is the abstraction working as designed. If you ever find yourself wanting to "pass an spdk_thread pointer back to Go and use it later," stop. The pointer is meaningless outside the SPDK process.

What to take away

An spdk_thread is the unit of "where does this I/O submission come from." It's a struct, a name, a list of pollers, an io_channel tree, and a message ring. The reactor loop walks the threads. The thread's mailbox delivers cross-thread work. Pollers run on schedule; messages run on demand. The combination gives you a goroutine-like model on top of a pthread-per-core runtime, with the property that no syscall can yield your CPU to someone else.

The next page — 2.3 — spdk_io_channel + pollers — looks at the per-thread state that actually caches the I/O submission path. The spdk_thread is the thing; the spdk_io_channel is what the thing owns.