Layer 2 · Threading & reactor

spdk_io_channel & pollers.

An spdk_io_channel is per-thread state for a subsystem. Pollers are the recurring functions that get work done. Together, they are how an spdk_thread actually talks to an NVMe controller, an lvol, a TCP socket, or any other subsystem — and they are why the hot path doesn't need a single mutex.

~15 min read · 2 diagrams · prerequisite: Layer 2.2
On this page
  1. Why channels exist (the lockless hot path)
  2. What an spdk_io_channel actually is
  3. get_io_channel / put_io_channel — the lifecycle
  4. Pollers, in depth
  5. How a bdev_io "becomes" a poller event
  6. The thread-local cache pattern
  7. Edge cases & what trips people up

Why channels exist (the lockless hot path)

Imagine you have an NVMe controller. The controller has a submission queue. To submit an I/O, you need to:

  1. Write a command to the next slot in the submission queue.
  2. Bump the doorbell register so the controller sees it.
  3. Wait for the completion queue to report a result.

The submission queue is a single-producer, single-consumer ring. If you have one thread submitting, the implementation is straightforward: a head pointer, a tail pointer, and plain loads and stores, with no atomic compare-and-swap needed. The problem starts when you have many threads submitting. If two threads race on step 1, they'll both write to the same slot and overwrite each other.

There are three standard solutions:

Solution | How | Cost at high IOPS
Global mutex | One big lock around the whole submission path. | Unworkable. All threads serialize on the lock. IOPS collapses to one thread's throughput.
Lockless MPSC ring | Use a multi-producer ring buffer; the consumer (the controller's interrupt / poller) only sees a contiguous tail. | Works, but you still need per-thread tail pointers to avoid CAS contention on the head.
Per-thread submission queue | Each thread gets its own SQ. The controller sees many SQs and round-robins between them. | Best. Zero contention in the steady state.

NVMe supports option 3 natively: it has a "number of queues" feature that lets you create up to 64K submission/completion queue pairs. But you have to actually create and manage them. The spdk_io_channel is SPDK's abstraction over "the per-thread submission state for this subsystem." When a thread gets a channel for an NVMe bdev (spdk_bdev_get_io_channel() for bdev users, which calls spdk_get_io_channel() on the underlying io_device), the bdev module allocates a new NVMe submission/completion queue pair and hands the thread a pointer to its context.

What an spdk_io_channel actually is

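The head has a fixed size, given by the SPDK_IO_CHANNEL_STRUCT_SIZE constant in include/spdk/thread.h, and the public API for getting the context bytes that follow it is spdk_io_channel_get_ctx().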

The framework's view of a channel (the head part) tracks three things:

  • struct io_device *dev: back-pointer to the registered io_device.
  • struct spdk_thread *thread: back-pointer to the thread that owns this channel. This is what makes the "channels are bound to one thread" rule enforceable.
  • refcount: how many spdk_get_io_channel() calls have not yet been matched by spdk_put_io_channel(). The first spdk_get_io_channel() returns a channel with refcount = 1; further calls on the same thread and device return the same channel with refcount incremented. The channel is only destroyed when refcount reaches 0.

The subsystem's view (the tail part) is whatever the subsystem needs to operate. For the NVMe bdev, it's the I/O submission/completion queue pair, the per-thread data buffer pool, and so on (the admin queue belongs to the controller, not to a channel). For a TCP socket-based subsystem, it might just be the file descriptor and a small receive buffer.
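The head/tail split can be sketched as a single allocation: a framework-owned header followed by the subsystem's context bytes. This is a minimal model, not SPDK's code; mock_channel, mock_channel_alloc, mock_channel_get_ctx, and my_ctx are hypothetical names, and the real accessor is spdk_io_channel_get_ctx().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the framework-owned "head" of a channel. */
struct mock_channel {
    void    *dev;      /* back-pointer to the registered io_device */
    void    *thread;   /* back-pointer to the owning thread */
    uint32_t ref;      /* outstanding gets */
    /* The subsystem's context bytes follow this struct in memory. */
};

/* Example subsystem context (hypothetical). */
struct my_ctx {
    uint32_t queue_depth;
    uint32_t inflight;
};

/* Allocate head and context in one zeroed block; ctx_size comes from the device. */
static struct mock_channel *
mock_channel_alloc(void *dev, void *thread, size_t ctx_size)
{
    struct mock_channel *ch = calloc(1, sizeof(*ch) + ctx_size);

    if (ch != NULL) {
        ch->dev = dev;
        ch->thread = thread;
        ch->ref = 1;
    }
    return ch;
}

/* The context is the tail: it starts right after the head. */
static void *
mock_channel_get_ctx(struct mock_channel *ch)
{
    assert(ch != NULL);
    return (uint8_t *)ch + sizeof(*ch);
}
```

A subsystem would cast the returned pointer to its own context type, for example struct my_ctx *ctx = mock_channel_get_ctx(ch);, and never look at the head directly.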

get_io_channel / put_io_channel — the lifecycle

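The "get" and "put" pattern comes straight from reference-counted resource management. The implementation is at lib/thread/thread.c:2358. In outline, the "get" path looks up an existing channel for the calling thread and the io_device; if one exists, it bumps the refcount and returns it, and otherwise it allocates a new channel and calls the device's create_cb.

The "put" path is the counterpart: it drops one reference and, when the count reaches zero, schedules the destroy as a message to the owning thread.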

The asymmetry is worth pausing on. The acquire path is synchronous and runs on the calling reactor. The release path is asynchronous and defers the heavy lifting to the next reactor iteration. This means:

  • If you get and put a channel repeatedly in a hot path, you can afford it as long as another reference keeps the channel alive: the put is then a refcount decrement, not a free.
  • If you get and never put, the channel leaks. At thread exit, you'll see the "thread %s still has channel for io_device %s" error at lib/thread/thread.c:408.
  • If your put drops the refcount to 0 and you then continue to use the channel pointer, you'll touch memory that's been freed. Don't do that.
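The get/put asymmetry can be modeled in a few lines. This is a simplified sketch for one thread and one device, with hypothetical names throughout (mock_thread, mock_get_channel, mock_put_channel, mock_reactor_iteration); a flag stands in for the deferred destroy message, and counters stand in for the device's create and destroy callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct mock_channel {
    uint32_t ref;
    bool     destroy_pending;   /* stands in for the queued destroy message */
};

struct mock_thread {
    struct mock_channel *ch;    /* the (thread, device) channel, or NULL */
    int create_calls;           /* how many times "create_cb" ran */
    int destroy_calls;          /* how many times "destroy_cb" ran */
};

/* Acquire: synchronous. Reuse the existing channel or create one right now. */
static struct mock_channel *
mock_get_channel(struct mock_thread *t)
{
    if (t->ch != NULL) {
        t->ch->ref++;
        return t->ch;
    }
    t->ch = calloc(1, sizeof(*t->ch));
    if (t->ch == NULL) {
        return NULL;
    }
    t->ch->ref = 1;
    t->create_calls++;
    return t->ch;
}

/* Release: only drops a reference; the destroy is left for a later iteration. */
static void
mock_put_channel(struct mock_thread *t, struct mock_channel *ch)
{
    assert(ch == t->ch && ch->ref > 0);
    if (--ch->ref == 0) {
        ch->destroy_pending = true;
    }
}

/* One "reactor iteration": run the deferred destroy, if one is pending. */
static void
mock_reactor_iteration(struct mock_thread *t)
{
    if (t->ch == NULL || !t->ch->destroy_pending) {
        return;
    }
    if (t->ch->ref > 0) {
        /* Someone re-acquired the channel before the destroy ran. */
        t->ch->destroy_pending = false;
        return;
    }
    t->destroy_calls++;
    free(t->ch);
    t->ch = NULL;
}
```

In this model the channel pointer stays valid between the last put and the next iteration, but nothing outside the framework should rely on that window.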

Pollers, in depth

A poller is just a function pointer. But the framework's poller machinery is more sophisticated than you might expect. There are five poller states (waiting, running, unregistered, pausing, and paused), and the transitions matter.

The execution of a poller is at lib/thread/thread.c:980, and it's worth reading for the state transitions.
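As a rough illustration of those transitions, here is a simplified model of running one poller once. It is a sketch under assumptions, not the code in thread.c: the names (mock_poller, mock_execute_poller, the MOCK_ states) are hypothetical, list handling is omitted, and a flag replaces the actual free.

```c
#include <assert.h>
#include <stddef.h>

enum mock_state {
    MOCK_WAITING,        /* registered, not currently executing */
    MOCK_RUNNING,        /* inside its callback */
    MOCK_UNREGISTERED,   /* unregister requested; free when safe */
    MOCK_PAUSING,        /* pause requested; takes effect at next dispatch */
    MOCK_PAUSED          /* parked until resumed */
};

struct mock_poller {
    enum mock_state state;
    int (*fn)(struct mock_poller *self);
    int run_count;
    int freed;           /* set instead of calling free(), so it can be observed */
};

/* Run one poller once, applying state changes requested before or during fn. */
static void
mock_execute_poller(struct mock_poller *p)
{
    assert(p != NULL && p->fn != NULL);

    /* Requests made since the last run are honoured before fn is called. */
    if (p->state == MOCK_UNREGISTERED) {
        p->freed = 1;
        return;
    }
    if (p->state == MOCK_PAUSING || p->state == MOCK_PAUSED) {
        p->state = MOCK_PAUSED;
        return;
    }

    p->state = MOCK_RUNNING;
    p->fn(p);
    p->run_count++;

    /* fn may have unregistered or paused its own poller while it ran. */
    switch (p->state) {
    case MOCK_UNREGISTERED:
        p->freed = 1;
        break;
    case MOCK_PAUSING:
        p->state = MOCK_PAUSED;
        break;
    case MOCK_RUNNING:
        p->state = MOCK_WAITING;
        break;
    default:
        break;
    }
}

/* Example callback that unregisters its own poller from inside the callback. */
static int
unregister_self(struct mock_poller *self)
{
    self->state = MOCK_UNREGISTERED;
    return 1;
}

/* Example callback that does no work. */
static int
do_nothing(struct mock_poller *self)
{
    (void)self;
    return 0;
}
```

The point of the intermediate states is that a poller can be unregistered or paused while its own callback is on the stack, and the cleanup happens only after the callback returns.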

The two flavors of poller, plus the paused state:

Type | Period argument | Where it lives | When it fires
Busy / active | 0 (or omitted) | active_pollers TAILQ | Every reactor iteration. As fast as the reactor can fire it.
Timed / periodic | non-zero microseconds | timed_pollers red-black tree, keyed on next_run_tick | When now >= next_run_tick. The red-black tree keeps the next-to-fire poller at first_timed_poller for O(1) peek.
Paused | (any) | paused_pollers TAILQ | Never, until you call spdk_poller_resume().

The periodic case is interesting because of the period conversion. You pass microseconds, but the reactor compares against TSC ticks. The conversion is at lib/thread/thread.c:1691:

static uint64_t
convert_us_to_ticks(uint64_t us)
{
    uint64_t quotient, remainder, ticks;

    if (us) {
        quotient = us / SPDK_SEC_TO_USEC;
        remainder = us % SPDK_SEC_TO_USEC;
        ticks = spdk_get_ticks_hz();

        return ticks * quotient + (ticks * remainder) / SPDK_SEC_TO_USEC;
    } else {
        return 0;
    }
}

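The math is "split the period into whole seconds and a sub-second remainder, then scale each part by ticks-per-second." Splitting keeps the intermediate product ticks * remainder small, so a very long period can't overflow 64 bits the way a naive ticks * us / 1000000 could. Period = 0 returns 0 ticks, which the framework uses to mean "this is a busy poller, not a timed one." The two paths diverge in thread_insert_poller() at lib/thread/thread.c:955.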
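To make the arithmetic concrete, here is the same calculation with the tick rate passed in as a parameter instead of read from spdk_get_ticks_hz(). The function name us_to_ticks and the 2.5 GHz tick rate used in the note below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Same arithmetic as convert_us_to_ticks(), with the tick rate as an argument. */
static uint64_t
us_to_ticks(uint64_t us, uint64_t ticks_hz)
{
    uint64_t quotient = us / USEC_PER_SEC;    /* whole seconds */
    uint64_t remainder = us % USEC_PER_SEC;   /* leftover microseconds, < 1e6 */

    assert(ticks_hz > 0);
    /* ticks_hz * remainder stays far below 2^64 for any realistic tick rate. */
    return ticks_hz * quotient + (ticks_hz * remainder) / USEC_PER_SEC;
}
```

With an assumed 2.5 GHz tick rate, a 1000 µs period is 2,500,000 ticks and a 1,500,000 µs period is 3,750,000,000 ticks. A period of 10,000,000,000 µs gives 25,000,000,000,000 ticks, whereas the naive product 2,500,000,000 × 10,000,000,000 would not fit in a uint64_t.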

How a bdev_io "becomes" a poller event

This is the part that ties everything together. Imagine your Go code calls BdevLvolCreate:97, which translates to the SPDK bdev_lvol_create RPC, which the framework dispatches to a poller, which eventually returns a UUID. What happened?

sequenceDiagram
participant Go as Go (diskengine)
participant Rpc as RPC handler thread
participant Bp as bdev poller
participant Sub as bdev submit
participant Ctl as NVMe controller

Go->>Rpc: JSON-RPC bdev_lvol_create
Rpc->>Rpc: dispatch on app thread
Rpc->>Sub: spdk_bdev_write (via channel)
Sub->>Ctl: ring doorbell, write to SQ
Ctl-->>Bp: interrupt / completion
Bp->>Bp: drain CQ, find cpl
Bp->>Rpc: spdk_bdev_io_complete (via msg)
Rpc-->>Go: JSON-RPC response
fig. 1: bdev_io submission, completion, and the poller that ties them together

fig. 1   A single bdev_io flows through the framework. Submission is on the caller's thread (the channel makes that safe). Completion is on the bdev poller (the only thread polling the NVMe completion queue). The handler bridges the two via spdk_thread_send_msg().

The key insight: the bdev_io moves between threads. It's submitted on the RPC handler's thread (which has the bdev's io_channel for that subsystem), and it completes on the bdev's poller thread (which is the only thread that should touch the NVMe controller's completion queue). The spdk_bdev_io struct carries enough state to be valid on either thread, and the framework's "channels are bound to one thread" rule prevents the two threads from accidentally sharing a single channel.

The code path is at the heart of Layer 4 (bdev framework), but the threading side of it is this:

  1. Submit side: the handler is on the RPC thread, which has an spdk_io_channel for the bdev subsystem. It calls the bdev's submit callback with the channel. The bdev writes to the channel's per-thread SQ. The call returns immediately. The bdev_io is now in flight.
  2. Completion side: the bdev's poller fires (every reactor tick, because it's a busy poller). It drains the NVMe CQ. For each completed entry, it calls spdk_bdev_io_complete(), which in turn sends a message to the original submitter's thread via spdk_thread_send_msg().
  3. Resume side: the submitter's reactor runs the completion message. The bdev_io is now in a "complete" state. The RPC handler's continuation runs, formats the JSON response, and sends it back to the Go side.

Two of those three steps (submit and resume) happen on the submitter's thread. The other (draining the completion queue) happens on the bdev poller's thread. Neither thread ever blocks on the other. That's the whole point.

The thread-local cache pattern

"Channel" is the per-thread state for a subsystem. "I/O channel as thread-local cache" is the design pattern. Anywhere you see a hot path that needs per-thread state, the framework expects you to put it in a channel's context.

Two real examples:

NVMe bdev channel context

The NVMe bdev allocates a fresh struct spdk_nvme_qpair per channel. Each qpair is an independent SQ/CQ pair, polled by the owning thread's bdev poller. The submission side is just "write to qpair->sq_tdbl." No locks.

iobuf channel cache

The iobuf module (see struct spdk_iobuf_pool_cache at include/spdk/thread.h:1143) is a mempool with a per-thread cache. When a thread does spdk_iobuf_get(), it first checks cache->cache (a per-thread SLIST of free buffers). If the cache has one, the call returns it in O(1) without touching the global pool. The iobuf channel is the cache.
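The cache-in-front-of-a-pool idea can be sketched independently of iobuf. Everything below is hypothetical (mock_pool, mock_cache, mock_buf_get, mock_buf_put); a counter records how often the shared pool is touched, which is the cost the per-thread cache is meant to avoid.

```c
#include <assert.h>
#include <stddef.h>

struct mock_buf {
    struct mock_buf *next;
};

/* Shared pool: in real code this would need a lock or a lock-free ring. */
struct mock_pool {
    struct mock_buf *free_list;
    int accesses;               /* how often the shared pool was touched */
};

/* Per-thread cache: would live in the channel context, owned by one thread. */
struct mock_cache {
    struct mock_buf *free_list;
    struct mock_pool *pool;
};

static struct mock_buf *
mock_buf_get(struct mock_cache *c)
{
    struct mock_buf *b = c->free_list;

    if (b != NULL) {
        /* Fast path: thread-local list, nothing shared is touched. */
        c->free_list = b->next;
        return b;
    }

    /* Slow path: fall back to the shared pool. */
    assert(c->pool != NULL);
    c->pool->accesses++;
    b = c->pool->free_list;
    if (b != NULL) {
        c->pool->free_list = b->next;
    }
    return b;
}

static void
mock_buf_put(struct mock_cache *c, struct mock_buf *b)
{
    /* This sketch always returns buffers to the local cache (no size limit). */
    b->next = c->free_list;
    c->free_list = b;
}
```

A real cache would cap its size and spill back to the shared pool; that detail is left out to keep the fast path visible.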

Without channels

Every thread would need to either:

  • Take a global mutex on every I/O submission (kills throughput),
  • Maintain its own private state in thread_local and reinitialize it on thread creation (ugly, no lifecycle management), or
  • Pass the state through every function call as an argument (verbose, error-prone).

With channels

struct spdk_io_channel does it all:

  • Created lazily on first get;
  • Freed automatically on last put;
  • Bound to exactly one thread (can't be used wrong);
  • Refcounted (multiple owners OK);
  • Scoped by the underlying io_device's registration.

Edge cases & what trips people up

1. Channels are per-(thread, device), not per-thread

A common mental model: "I have one bdev, so each thread has one channel." Wrong. You have one channel per (thread, io_device) pair. If your thread acquires channels for two different bdevs, you have two channels. If you then acquire the same bdev again, you get the same channel (refcount up). The lookup uses the io_device pointer as the key in the thread's red-black tree, at lib/thread/thread.c:2354.

2. spdk_get_io_channel calls the create_cb synchronously

The very first get on a (thread, device) pair invokes dev->create_cb() on the calling thread, in the calling reactor iteration. If your create_cb is slow (e.g. NVMe admin command to create a new queue pair), your hot path stalls. Plan for it: pre-acquire channels during init, not during the first I/O.

3. The channel is bound to the spdk_thread, not the pthread

The dynamic scheduler can move an spdk_thread from one pthread (reactor) to another. The channel moves with it, because the channel is owned by the spdk_thread, not by the pthread. From the channel's perspective, nothing changed. From an outside observer's perspective, the "core" that "owns" the channel changed. This is invisible to channel users. You don't need to do anything. But you should not cache "core number" in your application code.

4. Pollers run on the same thread as the channel they need

The bdev's poller is what drains the NVMe CQ. The bdev's poller runs on its own spdk_thread. That thread has its own spdk_io_channel for the bdev subsystem. The poller uses that channel to submit I/O. You must register pollers on the same thread whose channel you want them to use. If you register a "completion poller" on a thread that doesn't have the right channel, you can't use the channel — wrong_thread abort.

5. put_io_channel can run after your function returns

The actual destroy is deferred via spdk_thread_send_msg() at lib/thread/thread.c:2559. The destroy runs on the next reactor iteration after your put returns. If you put and then immediately try to do another operation that assumes the channel still exists, you'll race the destroy. Don't. Acquire, use, release, and don't touch the channel after release.

6. Pollers and messages: a poller can't safely send itself a message

If a poller is currently running on thread T, and the poller calls spdk_thread_send_msg(T, ...), the message is enqueued on T's message ring. It will be picked up on a later reactor iteration, not during the current one, because the poller is still running. If the poller spins waiting for the message's function to run, you deadlock: the function can't run until the poller returns. When you need the function to run right away, use spdk_thread_exec_msg() at include/spdk/thread.h:547, which detects the local case and runs the function immediately.
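The difference between the two calls can be shown with a toy single-thread model. All names are hypothetical (mock_thread, mock_send_msg, mock_exec_msg, mock_thread_poll); a small FIFO stands in for the message ring and a global pointer stands in for "the thread the caller is running on".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MSGS 8

typedef void (*mock_msg_fn)(void *ctx);

struct mock_thread {
    mock_msg_fn fns[MAX_MSGS];
    void *ctxs[MAX_MSGS];
    size_t head, tail;
};

static struct mock_thread *g_current;   /* thread the caller is "running on" */

/* Always enqueue; the function runs on a later mock_thread_poll(). */
static int
mock_send_msg(struct mock_thread *t, mock_msg_fn fn, void *ctx)
{
    if (t->tail - t->head == MAX_MSGS) {
        return -1;   /* ring full */
    }
    t->fns[t->tail % MAX_MSGS] = fn;
    t->ctxs[t->tail % MAX_MSGS] = ctx;
    t->tail++;
    return 0;
}

/* Run immediately when already on the target thread, otherwise enqueue. */
static int
mock_exec_msg(struct mock_thread *t, mock_msg_fn fn, void *ctx)
{
    assert(t != NULL);
    if (g_current == t) {
        fn(ctx);
        return 0;
    }
    return mock_send_msg(t, fn, ctx);
}

/* One iteration of the thread: drain whatever is queued. */
static void
mock_thread_poll(struct mock_thread *t)
{
    struct mock_thread *prev = g_current;

    g_current = t;
    while (t->head != t->tail) {
        size_t i = t->head++ % MAX_MSGS;
        t->fns[i](t->ctxs[i]);
    }
    g_current = prev;
}

/* Example message function: sets the int that ctx points to. */
static void
set_flag(void *ctx)
{
    *(int *)ctx = 1;
}
```

In this model, code running on the thread that calls mock_send_msg() to itself sees no effect until the next mock_thread_poll(), while mock_exec_msg() takes effect before it returns.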

7. Refcounted channels survive "weird" puts

If you call spdk_get_io_channel() five times on the same (thread, device) pair, you get back the same pointer with ref == 5. You need five spdk_put_io_channel() calls before the channel is destroyed. The "extra" gets are useful when you want to hand a channel to a poller (which won't release it for a long time) while keeping your own reference for the duration of a function call. The refcount is the contract. Mismatched gets/puts leak.

8. Pollers fire on the reactor's clock, not yours

A periodic poller with a 1000 µs period fires approximately every 1000 µs, measured in TSC ticks. The reactor's clock is spdk_get_ticks(), which is monotonic. It is not wall-clock time. Under heavy load, your poller might fire at 1100 µs or 1200 µs intervals because the reactor is busy with other work. If you need wall-clock precision, use a busy poller and check the time yourself.

9. spdk_poller_unregister doesn't free synchronously

spdk_poller_unregister(&p) at lib/thread/thread.c:1823 sets the poller's state to UNREGISTERED and writes NULL back into the caller's pointer. The poller is not freed yet. The framework will free it on the next reactor iteration, when it walks the list and sees the state. The pattern is:

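#include "spdk/thread.h"

static struct spdk_poller *my_poller;

/* Poller callback: report whether this invocation did any work. */
static int
my_fn(void *arg)
{
    (void)arg;
    return SPDK_POLLER_IDLE;
}

void
init_thread(void)
{
    /* Period is in microseconds; 0 would register a busy poller. */
    my_poller = SPDK_POLLER_REGISTER(my_fn, NULL, 1000);
}

void
cleanup_thread(void)
{
    /* Call this on the same spdk_thread that registered the poller. */
    spdk_poller_unregister(&my_poller);  /* sets my_poller = NULL */
    /* Don't touch my_poller after this point. */
}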

If you keep a copy of the pointer and try to free() it manually, you'll double-free. The framework owns the lifetime after unregister returns.

10. The diskengine client doesn't see channels

The Go code at Client.Call:43 doesn't have a concept of "my bdev channel." It just issues RPCs and gets responses. The channel lifecycle is internal to the C side. If you ever try to maintain channel state across RPCs in Go, you're going to have a bad time — see 2.2 on why. Every RPC handler must acquire and release its own channels.

What to take away

Channels are per-thread state. Pollers are recurring work. Together, they let SPDK do millions of I/Os per second without ever taking a lock. The "channel as thread-local cache" pattern is the single most important design idiom in the framework: any subsystem that wants to operate lockless puts its per-thread state in a channel context, and the framework handles the lifecycle.

The next page — 2.4 — The threading rules — is where all of this gets codified into the "you must be on the right thread" rules. Every bug you've ever had with channels or pollers is going to be diagnosable in terms of those rules.