spdk_io_channel & pollers.
An spdk_io_channel is per-thread state for a
subsystem. Pollers are the recurring functions that get work
done. Together, they are how an spdk_thread actually
talks to an NVMe controller, an lvol, a TCP socket, or any other
subsystem — and they are why the hot path doesn't need a single
mutex.
- Why channels exist (the lockless hot path)
- What an spdk_io_channel actually is
- get_io_channel / put_io_channel: the lifecycle
- Pollers, in depth
- How a bdev_io "becomes" a poller event
- The thread-local cache pattern
- Edge cases & what trips people up
Why channels exist (the lockless hot path)
Imagine you have an NVMe controller. The controller has a submission queue. To submit an I/O, you need to:
1. Write a command to the next slot in the submission queue.
2. Bump the doorbell register so the controller sees it.
3. Wait for the completion queue to report a result.
The submission queue is a single-producer, single-consumer ring. If you have one thread submitting, the implementation is straightforward: a head pointer, a tail pointer, and plain stores with no atomic compare-and-swap. The problem starts when you have many threads submitting. If two threads race on step 1, they'll both write to the same slot and overwrite each other.
There are three standard solutions:
| Solution | How | Cost at high IOPS |
|---|---|---|
| Global mutex | One big lock around the whole submission path. | Unworkable. All threads serialize on the lock. IOPS collapses to one thread's throughput. |
| Lockless MPSC ring | Use a multi-producer ring buffer; the consumer (the controller's interrupt / poller) only sees a contiguous tail. | Works, but you still need per-thread tail pointers to avoid CAS contention on the head. |
| Per-thread submission queue | Each thread gets its own SQ. The controller sees many SQs and round-robins between them. | Best. Zero contention in the steady state. |
NVMe supports option 3 natively — it has a "number of
queues" feature that lets you create up to 64K
submission/completion queue pairs. But you have to actually
create and manage them. The spdk_io_channel
is SPDK's abstraction over "the per-thread submission
state for this subsystem." When a thread does
spdk_get_io_channel(bdev), the bdev module
allocates a new NVMe submission queue and hands the
thread a pointer to its context.
What an spdk_io_channel actually is
Look at the size constant:
And the public API for getting context bytes:
The framework's view of a channel (the head part) tracks three things:
- struct io_device *dev: back-pointer to the registered io_device.
- struct spdk_thread *thread: back-pointer to the thread that owns this channel. This is what makes the "channels are bound to one thread" rule enforceable.
- refcount: how many spdk_io_channel_ref() calls are outstanding. spdk_get_io_channel() returns refcount = 1; further calls return the same channel with refcount incremented. The channel is only destroyed when refcount reaches 0.
The subsystem's view (the tail part) is whatever the subsystem needs to operate. For the NVMe bdev, it's the admin and I/O submission queue pairs, the per-thread data buffer pool, and so on. For a TCP socket-based subsystem, it might just be the file descriptor and a small receive buffer.
get_io_channel / put_io_channel — the lifecycle
The "get" and "put" pattern comes straight from reference-counted resource management. The implementation is at lib/thread/thread.c:2358:
The "put" path is the counterpart operation:
The asymmetry is worth pausing on. The acquire path is synchronous and runs on the calling reactor. The release path is asynchronous and defers the heavy lifting to the next reactor iteration. This means:
- If you get and put a channel repeatedly in a hot path, you can afford it: the put is a refcount decrement, not a free.
- If you get and never put, the channel leaks. At thread exit, you'll see the "thread %s still has channel for io_device %s" error at lib/thread/thread.c:408.
- If your put drops the refcount to 0 and you then continue to use the channel pointer, you'll touch memory that's been freed. Don't do that.
Pollers, in depth
A poller is just a function pointer. But the framework's poller machinery is more sophisticated than you might expect. There are five poller states, and the transitions matter:
The execution of a poller is at lib/thread/thread.c:980 , and it's worth reading for the state transitions:
The two flavors of poller, plus the paused state:
| Type | Period argument | Where it lives | When it fires |
|---|---|---|---|
| Busy / active | 0 (or omitted) | active_pollers TAILQ | Every reactor iteration. As fast as the reactor can fire it. |
| Timed / periodic | non-zero microseconds | timed_pollers red-black tree, keyed on next_run_tick | When now >= next_run_tick. The red-black tree keeps the next-to-fire poller at first_timed_poller for O(1) peek. |
| Paused | (any) | paused_pollers TAILQ | Never, until you call spdk_poller_resume(). |
The periodic case is interesting because of the period conversion. You pass microseconds, but the reactor compares against TSC ticks. The conversion is at lib/thread/thread.c:1691 :
    static uint64_t
    convert_us_to_ticks(uint64_t us)
    {
        uint64_t quotient, remainder, ticks;

        if (us) {
            quotient = us / SPDK_SEC_TO_USEC;
            remainder = us % SPDK_SEC_TO_USEC;
            ticks = spdk_get_ticks_hz();
            return ticks * quotient + (ticks * remainder) / SPDK_SEC_TO_USEC;
        } else {
            return 0;
        }
    }

The math splits the period into whole seconds and a sub-second remainder, converts each part to ticks, and adds them. Because the remainder is always below SPDK_SEC_TO_USEC, the product ticks * remainder cannot overflow 64 bits at any realistic TSC frequency, whereas a single ticks * us multiplication could for very long periods. Period = 0 returns 0 ticks, which the framework uses to mean "this is a busy poller, not a timed one." The two paths diverge in thread_insert_poller() at lib/thread/thread.c:955.
How a bdev_io "becomes" a poller event
This is the part that ties everything together. Imagine
your Go code calls
BdevLvolCreate:97 , which
translates to the SPDK bdev_lvol_create
RPC, which the framework dispatches to a poller, which
eventually returns a UUID. What happened?
sequenceDiagram
    participant Go as Go (diskengine)
    participant Rpc as RPC handler thread
    participant Bp as bdev poller
    participant Sub as bdev submit
    participant Ctl as NVMe controller
    Go->>Rpc: JSON-RPC bdev_lvol_create
    Rpc->>Rpc: dispatch on app thread
    Rpc->>Sub: spdk_bdev_write (via channel)
    Sub->>Ctl: ring doorbell, write to SQ
    Ctl-->>Bp: interrupt / completion
    Bp->>Bp: drain CQ, find cpl
    Bp->>Rpc: spdk_bdev_io_complete (via msg)
    Rpc-->>Go: JSON-RPC response
fig. 1 A single bdev_io flows through the framework.
Submission is on the caller's thread (the channel makes that
safe). Completion is on the bdev poller (the only thread
polling the NVMe completion queue). The handler bridges the
two via spdk_thread_send_msg().
The key insight: the bdev_io moves between
threads. It's submitted on the RPC handler's
thread (which has the bdev's io_channel for that
subsystem), and it completes on the bdev's poller
thread (which is the only thread that should touch
the NVMe controller's completion queue). The
spdk_bdev_io struct carries
enough state to be valid on either thread,
and the framework's "channels are bound to one
thread" rule prevents the two threads from
accidentally sharing a single channel.
The code path is at the heart of Layer 4 (bdev framework), but the threading side of it is this:
- Submit side: the handler is on the RPC thread, which has an spdk_io_channel for the bdev subsystem. It calls the bdev's submit callback with the channel. The bdev writes to the channel's per-thread SQ. The call returns immediately. The bdev_io is now in flight.
- Completion side: the bdev's poller fires (every reactor tick, because it's a busy poller). It drains the NVMe CQ. For each completed entry, it calls spdk_bdev_io_complete(), which in turn sends a message to the original submitter's thread via spdk_thread_send_msg().
- Resume side: the submitter's reactor runs the completion message. The bdev_io is now in a "complete" state. The RPC handler's continuation runs, formats the JSON response, and sends it back to the Go side.
Two of those three steps (submit and resume) happen on the submitter's thread. The third (completion) happens on the bdev poller's thread. Neither thread ever blocks on the other. That's the whole point.
The thread-local cache pattern
"Channel" is the per-thread state for a subsystem. "I/O channel as thread-local cache" is the design pattern. Anywhere you see a hot path that needs per-thread state, the framework expects you to put it in a channel's context.
Two real examples:
NVMe bdev channel context
The NVMe bdev allocates a fresh
struct spdk_nvme_qpair per channel.
Each qpair is an independent SQ/CQ pair, polled
by the owning thread's bdev poller. The
submission side is just "write to qpair->sq_tdbl."
No locks.
iobuf channel cache
The iobuf module (see
struct spdk_iobuf_pool_cache at
include/spdk/thread.h:1143 )
is a mempool with a per-thread cache. When a
thread does spdk_iobuf_get(), it
first checks cache->cache (a
per-thread SLIST of free buffers). If the cache
has one, the call returns it in O(1) without
touching the global pool. The iobuf channel
is the cache.
Without channels, every thread would need to either:
- Take a global mutex on every I/O submission (kills throughput),
- Maintain its own private state in thread_local and reinitialize it on thread creation (ugly, no lifecycle management), or
- Pass the state through every function call as an argument (verbose, error-prone).
struct spdk_io_channel does it all:
- Created lazily on first get;
- Freed automatically on last put;
- Bound to exactly one thread (can't be used wrong);
- Refcounted (multiple owners OK);
- Scoped by the underlying io_device's registration.
Edge cases & what trips people up
1. Channels are per-(thread, device), not per-thread
A common mental model: "I have one bdev, so each
thread has one channel." Wrong. You have
one channel per (thread, io_device) pair.
If your thread acquires channels for two
different bdevs, you have two channels. If
you then acquire the same bdev again, you get
the same channel (refcount up). The lookup
uses the io_device pointer as the
key in the thread's red-black tree, at
lib/thread/thread.c:2354 .
2. spdk_get_io_channel calls the create_cb synchronously
The very first get on a (thread,
device) pair invokes dev->create_cb()
on the calling thread, in the calling reactor
iteration. If your create_cb is slow
(e.g. NVMe admin command to create a new
queue pair), your hot path stalls. Plan for
it: pre-acquire channels during init, not
during the first I/O.
3. The channel is bound to the spdk_thread, not the pthread
The dynamic scheduler can move an
spdk_thread from one pthread
(reactor) to another. The channel moves with
it, because the channel is owned by the
spdk_thread, not by the pthread.
From the channel's perspective, nothing
changed. From an outside observer's
perspective, the "core" that "owns" the
channel changed. This is invisible
to channel users. You don't need to
do anything. But you should not cache
"core number" in your application code.
4. Pollers run on the same thread as the channel they need
The bdev's poller is what drains the NVMe CQ.
The bdev's poller runs on its own
spdk_thread. That thread has its
own spdk_io_channel for the bdev
subsystem. The poller uses that channel to
submit I/O. You must register pollers
on the same thread whose channel you want
them to use. If you register a
"completion poller" on a thread that doesn't
have the right channel, you can't use the
channel — wrong_thread abort.
5. put_io_channel can run after your function returns
The actual destroy is deferred via
spdk_thread_send_msg() at
lib/thread/thread.c:2559 .
The destroy runs on the next reactor iteration
after your put returns.
If you put and then immediately
try to do another operation that assumes the
channel still exists, you'll race the
destroy. Don't. Acquire,
use, release — and don't touch the channel
after release.
6. Pollers and messages: a poller can't safely send itself a message
If a poller is currently running on thread T,
and the poller calls
spdk_thread_send_msg(T, ...), the
message is enqueued on T's message ring. It
will be picked up on the next reactor
iteration, not during the current one. The
poller is still running. If the message's
function is waiting for the poller to finish,
you deadlock. Use
spdk_thread_exec_msg() at
include/spdk/thread.h:547 ,
which detects the local case and runs the
function immediately.
7. Refcounted channels survive "weird" puts
If you call spdk_get_io_channel()
five times on the same (thread, device) pair,
you get back the same pointer with
ref == 5. You need five
spdk_put_io_channel() calls
before the channel is destroyed. The
"extra" gets are useful when you want to
hand a channel to a poller (which won't
release it for a long time) while keeping
your own reference for the duration of a
function call. The refcount is the
contract. Mismatched gets/puts
leak.
8. Pollers fire on the reactor's clock, not yours
A periodic poller with a 1000 µs period
fires approximately every 1000 µs, measured
in TSC ticks. The reactor's clock is
spdk_get_ticks(), which is
monotonic. It is not wall-clock time.
Under heavy load, your poller might fire at
1100 µs or 1200 µs intervals because the
reactor is busy with other work. If you
need wall-clock precision, use a busy poller
and check the time yourself.
9. spdk_poller_unregister doesn't free synchronously
spdk_poller_unregister(&p)
sets the poller's state to
UNREGISTERED and writes NULL back
into the caller's pointer. The poller is
not freed yet. The framework will
free it on the next reactor iteration, when
it walks the list and sees the state. The
pattern is:
    #include "spdk/thread.h"

    static struct spdk_poller *my_poller;

    /* Poller callback: report whether any work was done this iteration. */
    static int
    my_fn(void *ctx)
    {
        return SPDK_POLLER_IDLE;
    }

    void
    init_thread(void)
    {
        my_poller = SPDK_POLLER_REGISTER(my_fn, NULL, 1000);
    }

    void
    cleanup_thread(void)
    {
        spdk_poller_unregister(&my_poller); /* sets my_poller = NULL */
        /* Don't touch my_poller after this point. */
    }

If you keep a copy of the pointer and free() it yourself, you'll double-free. The framework owns the lifetime after unregister returns.
10. The diskengine client doesn't see channels
The Go code at Client.Call:43 doesn't have a concept of "my bdev channel." It just issues RPCs and gets responses. The channel lifecycle is internal to the C side. If you ever try to maintain channel state across RPCs in Go, you're going to have a bad time — see 2.2 on why. Every RPC handler must acquire and release its own channels.
What to take away
Channels are per-thread state. Pollers are recurring work. Together, they let SPDK do millions of I/Os per second without ever taking a lock. The "channel as thread-local cache" pattern is the single most important design idiom in the framework: any subsystem that wants to operate lockless puts its per-thread state in a channel context, and the framework handles the lifecycle.
The next page — 2.4 — The threading rules — is where all of this gets codified into the "you must be on the right thread" rules. Every bug you've ever had with channels or pollers is going to be diagnosable in terms of those rules.