Layer 2 · Threading & reactor

spdk_io_channel & pollers.

An spdk_io_channel is per-thread state for a subsystem. Pollers are the recurring functions that get work done. Together, they are how an spdk_thread actually talks to an NVMe controller, an lvol, a TCP socket, or any other subsystem — and they are why the hot path doesn't need a single mutex.

~15 min read · 2 diagrams · prerequisite: Layer 2.2

On this page

Why channels exist (the lockless hot path)
What an spdk_io_channel actually is
get_io_channel / put_io_channel — the lifecycle
Pollers, in depth
How a bdev_io "becomes" a poller event
The thread-local cache pattern
Edge cases & what trips people up

Why channels exist (the lockless hot path)

Imagine you have an NVMe controller. The controller has a submission queue. To submit an I/O, you need to:

Write a command to the next slot in the submission queue.
Bump the doorbell register so the controller sees it.
Wait for the completion queue to report a result.

The submission queue is a single-producer, single-consumer ring. If you have one thread submitting, the implementation is straightforward: a head pointer, a tail pointer, and plain loads and stores with no atomic read-modify-write. The problem starts when you have many threads submitting. If two threads race on step 1, they'll both write to the same slot and overwrite each other.

There are three standard solutions:

Solution	How	Cost at high IOPS
Global mutex	One big lock around the whole submission path.	Unworkable. All threads serialize on the lock. IOPS collapses to one thread's throughput.
Lockless MPSC ring	Use a multi-producer ring buffer; the consumer (the controller's interrupt / poller) only sees a contiguous tail.	Works, but you still need per-thread tail pointers to avoid CAS contention on the head.
Per-thread submission queue	Each thread gets its own SQ. The controller sees many SQs and round-robins between them.	Best. Zero contention in the steady state.

NVMe supports option 3 natively: its "number of queues" feature lets you create up to 64K submission/completion queue pairs, although a given controller may offer far fewer. But you have to actually create and manage them. The spdk_io_channel is SPDK's abstraction over "the per-thread submission state for this subsystem." The first time a thread calls spdk_get_io_channel() for an NVMe-backed device, the bdev module allocates a new NVMe queue pair and hands the thread a pointer to its context; later calls on the same thread reuse it.
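A minimal sketch of option 3, assuming a toy ring rather than a real NVMe queue. The names `toy_sq`, `toy_sq_submit`, `toy_sq_consume` and the depth are hypothetical and are not part of SPDK. Because each thread owns its own ring, the submit path is a plain store plus an index bump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SQ_DEPTH 8u  /* hypothetical depth; must be a power of two */

/* One ring per thread: only the owning thread ever writes tail. */
struct toy_sq {
    uint64_t slots[TOY_SQ_DEPTH];
    uint32_t head;  /* advanced by the consumer (the "controller") */
    uint32_t tail;  /* advanced by the single producer thread */
};

/* Queue a command. No lock and no CAS: there is exactly one producer. */
static bool
toy_sq_submit(struct toy_sq *sq, uint64_t cmd)
{
    if (sq->tail - sq->head == TOY_SQ_DEPTH) {
        return false;  /* ring full, caller must retry later */
    }
    sq->slots[sq->tail % TOY_SQ_DEPTH] = cmd;
    sq->tail++;  /* a real driver would write the doorbell here */
    return true;
}

/* Consumer side: pop the oldest command, if any. */
static bool
toy_sq_consume(struct toy_sq *sq, uint64_t *cmd)
{
    if (sq->head == sq->tail) {
        return false;
    }
    *cmd = sq->slots[sq->head % TOY_SQ_DEPTH];
    sq->head++;
    return true;
}
```

Two threads would each hold their own `struct toy_sq`, so filling one ring never affects the other. That independence is what a per-thread channel buys you.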

What an `spdk_io_channel` actually is

Look at the size constant:

spdk_v26_01_migration/include/spdk/thread.h · line 173 SPDK_IO_CHANNEL_STRUCT_SIZE — the framework's view of a channel

#define SPDK_IO_CHANNEL_STRUCT_SIZE  96

The framework reserves 96 bytes for the channel struct. The subsystem is free to put whatever it wants in the "context" bytes that follow.

And the public API for getting context bytes:

spdk_v26_01_migration/include/spdk/thread.h · lines 752-762 spdk_io_channel_get_ctx() — your per-thread state

static inline void *
spdk_io_channel_get_ctx(struct spdk_io_channel *ch)
{
    if (spdk_unlikely(!ch)) {
        assert(false);
        return NULL;
    }

    return (uint8_t *)ch + SPDK_IO_CHANNEL_STRUCT_SIZE;
}

spdk_io_channel is an opaque struct; the subsystem-specific state lives in the bytes after it. You get a pointer to those bytes via spdk_io_channel_get_ctx() and cast to your own struct. This is the classic "trailing flexible array" trick — the framework owns the head, the subsystem owns the tail.

The framework's view of a channel (the head part) tracks three things:

struct io_device *dev: back-pointer to the registered io_device.
struct spdk_thread *thread: back-pointer to the thread that owns this channel. This is what makes the "channels are bound to one thread" rule enforceable.
ref: how many spdk_get_io_channel() calls on this thread have not yet been matched by an spdk_put_io_channel(). The first spdk_get_io_channel() returns the channel with ref = 1; further calls return the same channel with ref incremented. The destroy is only scheduled when ref reaches 0.

The subsystem's view (the tail part) is whatever the subsystem needs to operate. For the NVMe bdev, it's the admin and I/O submission queue pairs, the per-thread data buffer pool, and so on. For a TCP socket-based subsystem, it might just be the file descriptor and a small receive buffer.
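The head-plus-tail layout can be sketched without SPDK at all. The following is a toy model under stated assumptions: `toy_channel`, `toy_ctx` and the helper names are hypothetical, and the head size is `sizeof(struct toy_channel)` rather than SPDK's fixed `SPDK_IO_CHANNEL_STRUCT_SIZE`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the framework-owned head of a channel. */
struct toy_channel {
    void    *dev;     /* which device this channel belongs to */
    void    *thread;  /* owning thread */
    uint64_t ref;     /* outstanding references */
};

/* Hypothetical subsystem state stored in the bytes after the head. */
struct toy_ctx {
    uint32_t queue_id;
    uint64_t submitted;
};

/* Allocate head and context as one zeroed block, mirroring the
 * calloc(1, sizeof(*ch) + dev->ctx_size) call in spdk_get_io_channel(). */
static struct toy_channel *
toy_channel_alloc(size_t ctx_size)
{
    return calloc(1, sizeof(struct toy_channel) + ctx_size);
}

/* Same pointer arithmetic as spdk_io_channel_get_ctx(), using the toy
 * head size instead of SPDK_IO_CHANNEL_STRUCT_SIZE. */
static void *
toy_channel_get_ctx(struct toy_channel *ch)
{
    return (uint8_t *)ch + sizeof(struct toy_channel);
}

/* Inverse mapping: step back from the context to the head. */
static struct toy_channel *
toy_channel_from_ctx(void *ctx)
{
    return (struct toy_channel *)((uint8_t *)ctx - sizeof(struct toy_channel));
}
```

A subsystem would call `toy_channel_get_ctx()` once per operation and cast the result to its own struct. One `free()` of the head pointer releases both parts.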

`get_io_channel` / `put_io_channel` — the lifecycle

The "get" and "put" pattern comes straight from reference-counted resource management. The implementation is at lib/thread/thread.c:2358:

spdk_v26_01_migration/lib/thread/thread.c · lines 2358-2464 spdk_get_io_channel() — acquire (or refcount) a per-thread channel

struct spdk_io_channel *
spdk_get_io_channel(void *io_device)
{
    struct spdk_io_channel *ch;
    struct thread_link *thr_link;
    struct spdk_thread *thread;
    struct io_device *dev;
    int rc;
    bool do_remove_dev = false;

    pthread_mutex_lock(&g_devlist_mutex);
    dev = io_device_get(io_device);
    if (dev == NULL) {
        SPDK_ERRLOG("could not find io_device %p\n", io_device);
        pthread_mutex_unlock(&g_devlist_mutex);
        return NULL;
    }

    thread = _get_thread();
    if (!thread) {
        SPDK_ERRLOG("No thread allocated\n");
        pthread_mutex_unlock(&g_devlist_mutex);
        return NULL;
    }

    if (spdk_unlikely(thread->state == SPDK_THREAD_STATE_EXITED)) {
        SPDK_ERRLOG("Thread %s is marked as exited\n", thread->name);
        pthread_mutex_unlock(&g_devlist_mutex);
        return NULL;
    }

    ch = thread_get_io_channel(thread, dev);
    if (ch != NULL) {
        ch->ref++;

        pthread_mutex_unlock(&g_devlist_mutex);
        spdk_trace_record(TRACE_THREAD_IOCH_GET, 0, 0,
                          (uint64_t)spdk_io_channel_get_ctx(ch), ch->ref);
        return ch;
    }

    ch = calloc(1, sizeof(*ch) + dev->ctx_size);
    /* ... allocate, link into thread's io_channel_tree,
     *     call dev->create_cb ... */
}

The two paths through this function:

Path 1: channel already exists on this thread for this device

Look up the io_device in the global device tree (under g_devlist_mutex).
Get the current spdk_thread from TLS.
Look up an existing channel in thread->io_channels (the red-black tree, keyed on the device pointer).
If found, increment ref and return. No allocation, no callback, no work.

Path 2: first channel for this device on this thread

Same lookup, but no existing channel found.
calloc(1, sizeof(*ch) + dev->ctx_size) — allocate the channel head + the subsystem's per-thread state in one contiguous block.
Insert into the thread's io_channel_tree.
Increment the device's refcnt (so the device knows there's at least one thread using it).
Call dev->create_cb — the subsystem's allocator for per-thread state. For the NVMe bdev, this creates the per-thread SQ/CQ pairs.

Either way, you get back a struct spdk_io_channel * that's valid for as long as you hold a reference.

The "put" path is the symmetric operation:

spdk_v26_01_migration/lib/thread/thread.c · lines 2531-2561 spdk_put_io_channel() — release a reference

void
spdk_put_io_channel(struct spdk_io_channel *ch)
{
    struct spdk_thread *thread;

    thread = spdk_get_thread();
    if (!thread) {
        SPDK_ERRLOG("called from non-SPDK thread\n");
        assert(false);
        return;
    }

    if (ch->thread != thread) {
        wrong_thread(__func__, "ch", ch->thread, thread);
        return;
    }

    ch->ref--;

    if (ch->ref == 0) {
        ch->destroy_ref++;
        spdk_thread_send_msg(thread, put_io_channel, ch);
    }
}

Three things to notice.

Same-thread check. If you try to put a channel that belongs to a different thread, the function reports the mismatch through wrong_thread() and returns without touching the refcount. Channels are not portable across threads. Period.
Refcount, not immediate destroy. The refcount is decremented synchronously, but the actual destroy is deferred via an spdk_thread_send_msg() call. This is because the destroy involves calling dev->destroy_cb, which might do expensive resource cleanup (flushing NVMe queues, closing FDs), and we don't want to do that synchronously on the caller's hot path.
The destroy happens via a message. The put_io_channel function at lib/thread/thread.c:2466 runs later on the same spdk_thread, when the reactor drains that thread's message queue. The framework guarantees this is the next safe moment to touch the channel's resources.

The asymmetry is worth pausing on. The acquire path is synchronous and runs on the calling reactor. The release path is asynchronous and defers the heavy lifting to the next reactor iteration. This means:

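If you get and put a channel repeatedly while another reference keeps ref above zero, the put is a refcount decrement, not a free. The get still has a cost, though: as the listing shows, it takes g_devlist_mutex, so in a hot path hold on to the channel instead of re-acquiring it per I/O.
If you get and never put, the channel leaks. At thread exit, you'll see the "thread %s still has channel for io_device %s" error at lib/thread/thread.c:408.
If you drop the last reference (ref reaches 0) and then continue to use the channel pointer, you'll touch memory that is about to be freed or already has been. Don't do that.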
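The get/put asymmetry can be modeled in a few lines. This is a sketch under stated assumptions, not SPDK's code: it has one thread, one device, and a counter standing in for the message ring. All `toy_` names are hypothetical. The re-check of `ref` inside the deferred handler is an assumption about how a re-acquired channel survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: one thread, one device, a counter as the message ring. */
struct toy_channel {
    uint32_t ref;          /* outstanding gets */
    uint32_t destroy_ref;  /* destroy messages queued but not yet run */
    bool     live;         /* true between create_cb and destroy_cb */
};

struct toy_thread {
    struct toy_channel ch;      /* storage for the (thread, device) channel */
    uint32_t pending_destroys;  /* stand-in for the thread's message ring */
    uint32_t creates;           /* how many times create_cb ran */
    uint32_t destroys;          /* how many times destroy_cb ran */
};

/* Acquire: reuse the live channel or create it on first use. */
static struct toy_channel *
toy_get_channel(struct toy_thread *t)
{
    if (t->ch.live) {
        t->ch.ref++;  /* path 1: refcount bump only */
        return &t->ch;
    }
    t->ch.live = true;  /* path 2: the first get runs create_cb */
    t->ch.ref = 1;
    t->creates++;
    return &t->ch;
}

/* Release: decrement now, defer the destroy to a message. */
static void
toy_put_channel(struct toy_thread *t, struct toy_channel *ch)
{
    assert(ch->ref > 0);
    ch->ref--;
    if (ch->ref == 0) {
        ch->destroy_ref++;
        t->pending_destroys++;  /* models spdk_thread_send_msg() */
    }
}

/* One reactor iteration: run the queued destroy messages. */
static void
toy_run_messages(struct toy_thread *t)
{
    while (t->pending_destroys > 0) {
        t->pending_destroys--;
        t->ch.destroy_ref--;
        if (t->ch.ref > 0 || t->ch.destroy_ref > 0) {
            continue;  /* re-acquired in the meantime: keep it */
        }
        t->ch.live = false;  /* destroy_cb would run here */
        t->destroys++;
    }
}
```

In this model, two gets followed by two puts run the create step once and the destroy step once, and the destroy only happens inside `toy_run_messages()`. A get that lands between the last put and the next `toy_run_messages()` keeps the channel alive.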

Pollers, in depth

A poller is just a function pointer. But the framework's poller machinery is more sophisticated than you might expect. There are five poller states, and the transitions matter:

spdk_v26_01_migration/lib/thread/thread.c · lines 51-92 spdk_poller — the full state machine

enum spdk_poller_state {
    /* The poller is registered with a thread but not currently
     * executing its fn. */
    SPDK_POLLER_STATE_WAITING,

    /* The poller is currently running its fn. */
    SPDK_POLLER_STATE_RUNNING,

    /* The poller was unregistered during the execution of its fn. */
    SPDK_POLLER_STATE_UNREGISTERED,

    /* The poller is in the process of being paused.  It will be
     * paused during the next time it's supposed to be executed. */
    SPDK_POLLER_STATE_PAUSING,

    /* The poller is registered but currently paused.  It's on the
     * paused_pollers list. */
    SPDK_POLLER_STATE_PAUSED,
};

struct spdk_poller {
    TAILQ_ENTRY(spdk_poller)  tailq;
    RB_ENTRY(spdk_poller)     node;

    enum spdk_poller_state    state;

    uint64_t   period_ticks;
    uint64_t   next_run_tick;
    uint64_t   run_count;
    uint64_t   busy_count;
    uint64_t   id;
    spdk_poller_fn            fn;
    void                     *arg;
    struct spdk_thread       *thread;
    struct spdk_interrupt    *intr;
    spdk_poller_set_interrupt_mode_cb set_intr_cb_fn;
    void                     *set_intr_cb_arg;

    char  name[SPDK_MAX_POLLER_NAME_LEN + 1];
};

The states:

WAITING: registered, will fire on the next reactor tick (if a busy poller) or when the timer expires.
RUNNING: the reactor is currently inside poller->fn(). Transitions to UNREGISTERED, PAUSING, or back to WAITING when the function returns.
UNREGISTERED: someone called spdk_poller_unregister() on the poller. If that happened inside the poller's own fn, the poller is freed as soon as fn returns; otherwise it is freed the next time the reactor reaches it. Either way the free is deferred: you cannot free the poller synchronously because the reactor may be walking a TAILQ that contains it.
PAUSING: someone called spdk_poller_pause() while the poller was RUNNING. The current iteration finishes, and the next time the poller would be scheduled, it gets moved to the paused_pollers list with state = PAUSED.
PAUSED: the poller is not running. It will be moved back to the active or timed queue when someone calls spdk_poller_resume().

The execution of a poller is at lib/thread/thread.c:980, and it's worth reading for the state transitions:

spdk_v26_01_migration/lib/thread/thread.c · lines 980-1040 thread_execute_poller() — single poller execution

static inline int
thread_execute_poller(struct spdk_thread *thread, struct spdk_poller *poller)
{
    int rc;

    switch (poller->state) {
    case SPDK_POLLER_STATE_UNREGISTERED:
        TAILQ_REMOVE(&thread->active_pollers, poller, tailq);
        free(poller);
        return 0;
    case SPDK_POLLER_STATE_PAUSING:
        TAILQ_REMOVE(&thread->active_pollers, poller, tailq);
        TAILQ_INSERT_TAIL(&thread->paused_pollers, poller, tailq);
        poller->state = SPDK_POLLER_STATE_PAUSED;
        return 0;
    case SPDK_POLLER_STATE_WAITING:
        break;
    default:
        assert(false);
        break;
    }

    poller->state = SPDK_POLLER_STATE_RUNNING;
    rc = poller->fn(poller->arg);

    SPIN_ASSERT(thread->lock_count == 0, SPIN_ERR_HOLD_DURING_SWITCH);

    poller->run_count++;
    if (rc > 0) {
        poller->busy_count++;
    }

    switch (poller->state) {
    case SPDK_POLLER_STATE_UNREGISTERED:
        TAILQ_REMOVE(&thread->active_pollers, poller, tailq);
        free(poller);
        break;
    case SPDK_POLLER_STATE_PAUSING:
        TAILQ_REMOVE(&thread->active_pollers, poller, tailq);
        TAILQ_INSERT_TAIL(&thread->paused_pollers, poller, tailq);
        poller->state = SPDK_POLLER_STATE_PAUSED;
        break;
    case SPDK_POLLER_STATE_PAUSED:
    case SPDK_POLLER_STATE_WAITING:
        break;
    case SPDK_POLLER_STATE_RUNNING:
        poller->state = SPDK_POLLER_STATE_WAITING;
        break;
    default:
        assert(false);
        break;
    }

    return rc;
}

Two switch statements. The first is before poller->fn(); the second is after. The state machine is re-entered on every iteration, and the framework uses the state to decide whether to actually call the poller's function or just shuffle it between lists.
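To see why the check happens on both sides of the call, here is a reduced model of the same pattern. It is a sketch with hypothetical names (`toy_poller`, `toy_execute`, `toy_one_shot`), and a `freed` flag stands in for the real `free(poller)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_WAITING, TOY_RUNNING, TOY_UNREGISTERED };

struct toy_poller {
    enum toy_state state;
    int (*fn)(struct toy_poller *self);
    int run_count;
    bool freed;  /* stands in for free(poller) */
};

/* Mark only; the executor decides when it is safe to free. */
static void
toy_unregister(struct toy_poller *p)
{
    p->state = TOY_UNREGISTERED;
}

/* Reduced version of the two-switch pattern: check before, check after. */
static int
toy_execute(struct toy_poller *p)
{
    int rc;

    if (p->state == TOY_UNREGISTERED) {
        p->freed = true;  /* unregistered while idle: never call fn */
        return 0;
    }
    p->state = TOY_RUNNING;
    rc = p->fn(p);
    p->run_count++;
    if (p->state == TOY_UNREGISTERED) {
        p->freed = true;  /* unregistered from inside fn */
    } else {
        p->state = TOY_WAITING;
    }
    return rc;
}

/* A one-shot poller: does its work once, then unregisters itself. */
static int
toy_one_shot(struct toy_poller *self)
{
    toy_unregister(self);
    return 1;
}
```

A poller that unregisters itself is released right after its function returns. One that was unregistered while idle is released without its function being called again.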

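The SPIN_ASSERT line is interesting: it's the framework checking that the poller didn't return with an spdk_spinlock held. Returning from a poller with a lock held is a deadlock waiting to happen, because other code may try to take that lock before the holder runs again. The assertion catches it in debug builds; in release builds nothing flags it, and the bug surfaces later as a hang or corrupted state.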

The two flavors of poller, plus the paused state:

Type	Period argument	Where it lives	When it fires
Busy / active	`0`	`active_pollers` TAILQ	Every reactor iteration. As fast as the reactor can fire it.
Timed / periodic	non-zero microseconds	`timed_pollers` red-black tree, keyed on `next_run_tick`	When `now >= next_run_tick`. The red-black tree keeps the next-to-fire poller at `first_timed_poller` for O(1) peek.
Paused	(any)	`paused_pollers` TAILQ	Never. Until you call `spdk_poller_resume()`.

The periodic case is interesting because of the period conversion. You pass microseconds, but the reactor compares against TSC ticks. The conversion is at lib/thread/thread.c:1691:

static uint64_t
convert_us_to_ticks(uint64_t us)
{
    uint64_t quotient, remainder, ticks;

    if (us) {
        quotient = us / SPDK_SEC_TO_USEC;
        remainder = us % SPDK_SEC_TO_USEC;
        ticks = spdk_get_ticks_hz();

        return ticks * quotient + (ticks * remainder) / SPDK_SEC_TO_USEC;
    } else {
        return 0;
    }
}

The math is "split the period into whole seconds and leftover microseconds, multiply each by ticks-per-second, and divide only the leftover product by 1,000,000." Splitting keeps the intermediate product from overflowing 64 bits for long periods. Period = 0 returns 0 ticks, which the framework uses to mean "this is a busy poller, not a timed one." The two paths diverge in thread_insert_poller() at lib/thread/thread.c:955.
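A worked example makes the overflow argument concrete. The sketch below repeats the same arithmetic with the tick rate passed as a parameter. The 2.5 GHz rate used in the examples is an assumption chosen for round numbers, and `us_to_ticks_naive` is a hypothetical comparison function, not something in SPDK.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Same arithmetic as convert_us_to_ticks(), with the tick rate passed
 * in as a parameter instead of read from spdk_get_ticks_hz(). */
static uint64_t
us_to_ticks(uint64_t us, uint64_t ticks_hz)
{
    uint64_t quotient = us / USEC_PER_SEC;   /* whole seconds */
    uint64_t remainder = us % USEC_PER_SEC;  /* leftover microseconds */

    return ticks_hz * quotient + (ticks_hz * remainder) / USEC_PER_SEC;
}

/* The obvious one-liner, kept for comparison: the intermediate product
 * ticks_hz * us wraps around 64 bits for long periods. */
static uint64_t
us_to_ticks_naive(uint64_t us, uint64_t ticks_hz)
{
    return (ticks_hz * us) / USEC_PER_SEC;
}
```

At 2.5 GHz, a 1000 µs period is 2,500,000 ticks and a 1.5 s period is 3,750,000,000 ticks. For a 10,000 s period the naive product 2.5e9 × 1e10 exceeds 2^64 and wraps, while the split form still yields 2.5e13 ticks.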

How a bdev_io "becomes" a poller event

This is the part that ties everything together. Imagine your Go code calls BdevLvolCreate:97, which translates to the SPDK bdev_lvol_create RPC, which the framework dispatches to a poller, which eventually returns a UUID. What happened?

sequenceDiagram
participant Go as Go (diskengine)
participant Rpc as RPC handler thread
participant Bp as bdev poller
participant Sub as bdev submit
participant Ctl as NVMe controller

Go->>Rpc: JSON-RPC bdev_lvol_create
Rpc->>Rpc: dispatch on app thread
Rpc->>Sub: spdk_bdev_write (via channel)
Sub->>Ctl: ring doorbell, write to SQ
Ctl-->>Bp: interrupt / completion
Bp->>Bp: drain CQ, find cpl
Bp->>Rpc: spdk_bdev_io_complete (via msg)
Rpc-->>Go: JSON-RPC response

fig. 1: bdev_io submission, completion, and the poller that ties them together

fig. 1 A single bdev_io flows through the framework. Submission is on the caller's thread (the channel makes that safe). Completion is on the bdev poller (the only thread polling the NVMe completion queue). The handler bridges the two via spdk_thread_send_msg().

The key insight: the bdev_io moves between threads. It's submitted on the RPC handler's thread (which has the bdev's io_channel for that subsystem), and it completes on the bdev's poller thread (which is the only thread that should touch the NVMe controller's completion queue). The spdk_bdev_io struct carries enough state to be valid on either thread, and the framework's "channels are bound to one thread" rule prevents the two threads from accidentally sharing a single channel.

The code path is at the heart of Layer 4 (bdev framework), but the threading side of it is this:

Submit side: the handler is on the RPC thread, which has an spdk_io_channel for the bdev subsystem. It calls the bdev's submit callback with the channel. The bdev writes to the channel's per-thread SQ. The call returns immediately. The bdev_io is now in flight.
Completion side: the bdev's poller fires (every reactor tick, because it's a busy poller). It drains the NVMe CQ. For each completed entry, it calls spdk_bdev_io_complete(), which in turn sends a message to the original submitter's thread via spdk_thread_send_msg().
Resume side: the submitter's reactor runs the completion message. The bdev_io is now in a "complete" state. The RPC handler's continuation runs, formats the JSON response, and sends it back to the Go side.

Two of those three steps (submit and resume) happen on the submitter's thread. The third (completion) happens on the bdev poller's thread. Neither thread ever blocks on the other. That's the whole point.

The thread-local cache pattern

"Channel" is the per-thread state for a subsystem. "I/O channel as thread-local cache" is the design pattern. Anywhere you see a hot path that needs per-thread state, the framework expects you to put it in a channel's context.

Two real examples:

NVMe bdev channel context

The NVMe bdev allocates a fresh struct spdk_nvme_qpair per channel. Each qpair is an independent SQ/CQ pair, polled by the owning thread's bdev poller. The submission side is just "write to qpair->sq_tdbl." No locks.

iobuf channel cache

The iobuf module (see struct spdk_iobuf_pool_cache at include/spdk/thread.h:1143) is a mempool with a per-thread cache. When a thread does spdk_iobuf_get(), it first checks cache->cache (a per-thread SLIST of free buffers). If the cache has one, the call returns it in O(1) without touching the global pool. The iobuf channel is the cache.
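The cache-in-front-of-a-pool idea can be sketched independently of iobuf. Everything here is hypothetical (`toy_pool`, `toy_cache`, the capacity of 4) and only illustrates the lookup order: local cache first, shared pool on a miss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_MAX 4u  /* hypothetical per-thread cache capacity */

struct toy_buf {
    struct toy_buf *next;
};

/* Shared pool: in a real system, touching this needs synchronization. */
struct toy_pool {
    struct toy_buf *free_list;
    uint32_t shared_hits;  /* how often a thread had to come here */
};

/* Per-thread cache: this is what would live in a channel's context. */
struct toy_cache {
    struct toy_buf *free_list;
    uint32_t count;
};

static void
toy_pool_add(struct toy_pool *pool, struct toy_buf *buf)
{
    buf->next = pool->free_list;
    pool->free_list = buf;
}

/* Get a buffer: local cache first, shared pool only on a miss. */
static struct toy_buf *
toy_buf_get(struct toy_cache *cache, struct toy_pool *pool)
{
    struct toy_buf *buf = cache->free_list;

    if (buf != NULL) {
        cache->free_list = buf->next;
        cache->count--;
        return buf;
    }
    buf = pool->free_list;
    if (buf != NULL) {
        pool->free_list = buf->next;
        pool->shared_hits++;
    }
    return buf;
}

/* Return a buffer: keep it local while the cache has room. */
static void
toy_buf_put(struct toy_cache *cache, struct toy_pool *pool, struct toy_buf *buf)
{
    if (cache->count < TOY_CACHE_MAX) {
        buf->next = cache->free_list;
        cache->free_list = buf;
        cache->count++;
    } else {
        toy_pool_add(pool, buf);
    }
}
```

In steady state a thread that gets and puts at a similar rate keeps hitting its own cache, so `shared_hits` stops growing. That is the property the channel context is meant to provide.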

Without channels

Every thread would need to either:

Take a global mutex on every I/O submission (kills throughput),
Maintain its own private state in thread_local and reinitialize it on thread creation (ugly, no lifecycle management), or
Pass the state through every function call as an argument (verbose, error-prone).

With channels

struct spdk_io_channel does it all:

Created lazily on first get;
Freed automatically on last put;
Bound to exactly one thread (can't be used wrong);
Refcounted (multiple owners OK);
Scoped by the underlying io_device's registration.

Edge cases & what trips people up

1. Channels are per-(thread, device), not per-thread

A common mental model: "I have one bdev, so each thread has one channel." Wrong. You have one channel per (thread, io_device) pair. If your thread acquires channels for two different bdevs, you have two channels. If you then acquire the same bdev again, you get the same channel (refcount up). The lookup uses the io_device pointer as the key in the thread's red-black tree, at lib/thread/thread.c:2354.

2. `spdk_get_io_channel` calls the create_cb synchronously

The very first get on a (thread, device) pair invokes dev->create_cb() on the calling thread, in the calling reactor iteration. If your create_cb is slow (e.g. NVMe admin command to create a new queue pair), your hot path stalls. Plan for it: pre-acquire channels during init, not during the first I/O.

3. The channel is bound to the `spdk_thread`, not the pthread

The dynamic scheduler can move an spdk_thread from one pthread (reactor) to another. The channel moves with it, because the channel is owned by the spdk_thread, not by the pthread. From the channel's perspective, nothing changed. From an outside observer's perspective, the "core" that "owns" the channel changed. This is invisible to channel users. You don't need to do anything. But you should not cache "core number" in your application code.

4. Pollers run on the same thread as the channel they need

The bdev's poller is what drains the NVMe CQ. The bdev's poller runs on its own spdk_thread. That thread has its own spdk_io_channel for the bdev subsystem. The poller uses that channel to submit I/O. You must register pollers on the same thread whose channel you want them to use. If you register a "completion poller" on a thread that doesn't own the channel, the poller has no business touching it: the channel's state is only safe to use from its owning thread, and a put from anywhere else ends in wrong_thread().

5. `put_io_channel` can run after your function returns

The actual destroy is deferred via spdk_thread_send_msg() at lib/thread/thread.c:2559. The destroy runs on the next reactor iteration after your put returns. If you put and then immediately try to do another operation that assumes the channel still exists, you'll race the destroy. Don't. Acquire, use, release, and don't touch the channel after release.

6. Pollers and messages: a poller can't wait on a message it sent to itself

If a poller is currently running on thread T, and the poller calls spdk_thread_send_msg(T, ...), the message is enqueued on T's message ring. It will be picked up on a later reactor iteration, not during the current one. The poller is still running. If the poller spins waiting for the message's function to run, you deadlock, because that function cannot run until the poller returns. If you need the function to run right away, use spdk_thread_exec_msg() at include/spdk/thread.h:547, which detects the local case and runs the function immediately.
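The difference between "always enqueue" and "run now if local" can be shown with a toy message ring. This sketch uses hypothetical names throughout: `toy_send_msg` and `toy_exec_msg` only mimic the behavior described above, and `g_current` stands in for whatever the framework uses to identify the calling thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u

typedef void (*toy_msg_fn)(void *ctx);

struct toy_thread {
    toy_msg_fn fns[TOY_RING_SIZE];
    void      *ctxs[TOY_RING_SIZE];
    unsigned   count;
};

/* Which toy thread the caller is "on". */
static struct toy_thread *g_current;

/* Always enqueue: fn runs on a later iteration of the target thread. */
static int
toy_send_msg(struct toy_thread *t, toy_msg_fn fn, void *ctx)
{
    if (t->count == TOY_RING_SIZE) {
        return -1;  /* ring full */
    }
    t->fns[t->count] = fn;
    t->ctxs[t->count] = ctx;
    t->count++;
    return 0;
}

/* Run fn right away when already on the target thread, else enqueue. */
static int
toy_exec_msg(struct toy_thread *t, toy_msg_fn fn, void *ctx)
{
    if (g_current == t) {
        fn(ctx);
        return 0;
    }
    return toy_send_msg(t, fn, ctx);
}

/* One iteration of the target thread: run what was queued before the
 * pass started and keep anything queued during it for the next pass. */
static void
toy_drain(struct toy_thread *t)
{
    unsigned i, n = t->count;

    for (i = 0; i < n; i++) {
        t->fns[i](t->ctxs[i]);
    }
    for (i = n; i < t->count; i++) {
        t->fns[i - n] = t->fns[i];
        t->ctxs[i - n] = t->ctxs[i];
    }
    t->count -= n;
}

/* Example message body: set an int flag. */
static void
set_flag(void *ctx)
{
    *(int *)ctx = 1;
}
```

With `g_current` pointing at the target, `toy_send_msg()` leaves the flag untouched until `toy_drain()` runs, whereas `toy_exec_msg()` sets it before returning. A poller that spins on the flag after `toy_send_msg()` would never see it change.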

7. Refcounted channels survive "weird" puts

If you call spdk_get_io_channel() five times on the same (thread, device) pair, you get back the same pointer with ref == 5. You need five spdk_put_io_channel() calls before the channel is destroyed. The "extra" gets are useful when you want to hand a channel to a poller (which won't release it for a long time) while keeping your own reference for the duration of a function call. The refcount is the contract. Mismatched gets/puts leak.

8. Pollers fire on the reactor's clock, not yours

A periodic poller with a 1000 µs period fires approximately every 1000 µs, measured in TSC ticks. The reactor's clock is spdk_get_ticks(), which is monotonic. It is not wall-clock time. Under heavy load, your poller might fire at 1100 µs or 1200 µs intervals because the reactor is busy with other work. If you need wall-clock precision, use a busy poller and check the time yourself.
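A busy poller that keeps its own schedule might look like the sketch below. It is a toy under stated assumptions: the current tick count is passed in as a parameter instead of being read from the reactor's clock, and `toy_deadline` and its fields are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for a busy poller that keeps its own schedule. */
struct toy_deadline {
    uint64_t period_ticks;
    uint64_t next_tick;   /* next scheduled firing, in ticks */
    uint64_t fired;       /* how many times the work ran */
    uint64_t worst_late;  /* largest observed lateness, in ticks */
};

/* Called on every reactor iteration with the current tick count.
 * Returns 1 when the work ran (busy) and 0 otherwise (idle). */
static int
toy_deadline_poll(struct toy_deadline *d, uint64_t now)
{
    uint64_t late;

    if (now < d->next_tick) {
        return 0;
    }
    late = now - d->next_tick;
    if (late > d->worst_late) {
        d->worst_late = late;
    }
    d->fired++;
    /* Schedule from the previous deadline, not from now, so one late
     * iteration does not push every later firing back as well. */
    d->next_tick += d->period_ticks;
    return 1;
}
```

Tracking `worst_late` is a cheap way to measure how much jitter the reactor adds under load before deciding whether the extra CPU cost of a busy poller is justified.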

9. `spdk_poller_unregister` doesn't free synchronously

spdk_poller_unregister(&p) at lib/thread/thread.c:1823 sets the poller's state to UNREGISTERED and writes NULL back into the caller's pointer. The poller is not freed yet. The framework will free it on the next reactor iteration, when it walks the list and sees the state. The pattern is:

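#include "spdk/thread.h"

struct spdk_poller *my_poller;

/* Poller body: report idle when there was nothing to do. */
static int
my_fn(void *arg)
{
    return SPDK_POLLER_IDLE;
}

void
init_thread(void)
{
    /* 1000 us period, so this is a timed poller. */
    my_poller = SPDK_POLLER_REGISTER(my_fn, NULL, 1000);
}

void
cleanup_thread(void)
{
    /* Call this on the thread that registered the poller. */
    spdk_poller_unregister(&my_poller);  /* sets my_poller = NULL */
    /* Don't touch my_poller after this point. */
}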

If you keep your own copy of the pointer and free() it manually, you'll double-free. The framework owns the lifetime after unregister returns.

10. The diskengine client doesn't see channels

The Go code at Client.Call:43 doesn't have a concept of "my bdev channel." It just issues RPCs and gets responses. The channel lifecycle is internal to the C side. If you ever try to maintain channel state across RPCs in Go, you're going to have a bad time — see 2.2 on why. Every RPC handler must acquire and release its own channels.

What to take away

Channels are per-thread state. Pollers are recurring work. Together, they let SPDK do millions of I/Os per second without taking a lock on the hot path. The "channel as thread-local cache" pattern is the single most important design idiom in the framework: any subsystem that wants to operate lockless puts its per-thread state in a channel context, and the framework handles the lifecycle.

The next page — 2.4 — The threading rules — is where all of this gets codified into the "you must be on the right thread" rules. Every bug you've ever had with channels or pollers is going to be diagnosable in terms of those rules.