Poll-mode vs interrupt-driven I/O.
The first two pages explained why SPDK runs in userspace and what an NVMe device looks like. This page is the operational one: how SPDK actually knows when an I/O is done. The answer is that it asks, in a tight loop, forever, on a thread pinned to one CPU. Why? Because asking is faster than being told, when "being told" costs a context switch.
- How interrupts actually work on a modern system
- The cost of an interrupt, line by line
- How polling works — and what "polling" means here
- The latency-vs-CPU trade-off
- Tail latency: why p99.9 is the only number that matters
- When polling breaks down
- SPDK pollers — a preview of Layer 2
- The "always busy" assumption
- The kernel scheduler still bites
- Source walkthrough: process_completions
- Edge cases: what trips people up
How interrupts actually work on a modern system
"Interrupts" is one of those words that everyone uses and almost no one defines. Let's pin it down.
An interrupt is a signal from a device to the CPU saying "I need attention." The CPU stops what it's doing, saves enough state to come back, runs a handler, then returns. For a PCIe device like an NVMe controller, the modern form is MSI-X: the device writes a small message to a specific address, the interrupt controller (typically the CPU's on-chip local APIC, optionally behind IOMMU interrupt remapping) routes the message to a specific CPU core, and the core's interrupt handler runs.
The operating system's role is to:
- Map the device's MSI-X vectors to handlers at device-init time.
- Provide a top-half (the interrupt handler itself, which runs with interrupts disabled and must be quick) and a bottom-half (tasklet, softirq, or work item, which does the real work).
- Schedule the bottom-half on a CPU, respecting affinity and preemption rules.
That's a lot of moving parts for "the device is done." And every step costs time.
The cost of an interrupt, line by line
Let's time the path from "device posts a completion" to "host code runs the completion handler." This is what an interrupt-driven kernel driver does, on every single I/O.
On a quiet system, steps 1–8 take ~5–10 µs end-to-end. On a loaded system with many devices sharing interrupt lines, softirq congestion, and CPU contention, that number can balloon to 50–200 µs. And that's the average. The tail is worse.
How polling works — and what "polling" means here
"Polling" sounds lazy, but in this context it is the opposite. The host runs a tight loop on a dedicated CPU core, and the loop has exactly one job: look at the CQ, see if there are new completions, process them, repeat.
That's it. No interrupts. No context switches. No kernel mediation. No
scheduler. No preemption. The thread runs until it has nothing to do,
then runs spdk_thread_poll() (which you'll meet in
2.1) to see if anything else
arrived, and immediately starts looking again.
Four properties of the completion-processing call (spdk_nvme_qpair_process_completions) keep this loop cheap:
- It does not sleep. The caller is expected to call it again immediately if it returned a non-zero count.
- It does not take a lock. The queue pair is owned by exactly one thread (enforced by convention; SPDK's threading model is the topic of 2.4).
- It does not allocate. Trackers are pre-allocated at queue-pair construction and reused.
- It does not cross an address-space boundary. Everything is in the same process, on the same core, touching the same cache lines.
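The loop described above can be sketched without any SPDK dependency. This is a minimal simulation, assuming an NVMe-style completion queue where a phase bit marks fresh entries; `struct cq`, `struct device`, `device_post`, and `cq_poll` are hypothetical names for illustration, not SPDK's API.

```c
#include <stdint.h>

#define CQ_DEPTH 8u

/* Hypothetical completion entry; the device flips `phase` on each lap of the ring. */
struct cq_entry {
    uint16_t cid;    /* command identifier echoed back by the device */
    uint8_t  phase;  /* phase tag: distinguishes fresh entries from stale ones */
};

/* Host-side view of the completion queue. */
struct cq {
    struct cq_entry ring[CQ_DEPTH];
    uint32_t head;      /* next slot the host will inspect */
    uint8_t  expected;  /* phase value that marks a fresh entry */
};

/* Stand-in for the device side: where it writes next, and with which phase. */
struct device {
    uint32_t tail;
    uint8_t  phase;
};

static void cq_init(struct cq *q, struct device *d)
{
    for (uint32_t i = 0; i < CQ_DEPTH; i++) {
        q->ring[i].cid = 0;
        q->ring[i].phase = 0;
    }
    q->head = 0;
    q->expected = 1;
    d->tail = 0;
    d->phase = 1;
}

/* The "device" posts one completion and advances its tail. */
static void device_post(struct cq *q, struct device *d, uint16_t cid)
{
    q->ring[d->tail].cid = cid;
    q->ring[d->tail].phase = d->phase;
    if (++d->tail == CQ_DEPTH) {
        d->tail = 0;
        d->phase ^= 1;
    }
}

/* One poll pass: no sleep, no lock, no allocation. Returns the number reaped. */
static uint32_t cq_poll(struct cq *q, uint64_t *cid_sum)
{
    uint32_t reaped = 0;

    while (q->ring[q->head].phase == q->expected) {
        *cid_sum += q->ring[q->head].cid;  /* stand-in for the completion callback */
        reaped++;
        if (++q->head == CQ_DEPTH) {
            q->head = 0;
            q->expected ^= 1;
        }
    }
    return reaped;
}
```

A caller would invoke `cq_poll` in a loop on its pinned thread; an empty pass costs one memory read and one comparison.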
The latency-vs-CPU trade-off
The deal SPDK makes is straightforward:
Interrupt-driven I/O:
- CPU: low. The thread sleeps when there is no work.
- Latency: variable. Wake-up cost dominates; p99.9 is bad.
- Throughput: bounded by IRQ rate. Each I/O consumes kernel and scheduler time.
Poll-driven I/O (SPDK):
- CPU: 100% of one core per qpair. The thread is always running.
- Latency: tight, predictable. p50 and p99.9 differ by single-digit µs.
- Throughput: bounded by the device. The bottleneck is the SSD, not the kernel.
The trade is honest: give up an entire CPU core (or more, if you have many qpairs) in exchange for latency you can reason about. The core isn't wasted — it is spent. Every microsecond it runs is a microsecond the application is moving data, not waiting for an interrupt.
How much CPU is "100% of one core" in practice? The core always spins, so the real question is how much of it does useful work. On a modern Xeon running 4 KB random reads at 1 M IOPS, a single poller might spend 60–90% of one core on useful work. Driving a second hot qpair from the same poller gets you closer to saturation. The math is throughput × per-I/O cost: if each completion takes 200 ns of CPU, 1 M IOPS costs 0.2 CPU-seconds per second, which is 20% of a core. Add admin work, poller overhead, and the application itself, and the total climbs to 50–80% or more, in line with the figure above.
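The throughput × per-I/O cost arithmetic can be written down as a one-line helper. This is a sketch; `core_fraction` is a hypothetical name, and the 200 ns figure is the example value from the text, not a measurement.

```c
#include <stdint.h>

/* Fraction of one core spent on completion processing alone:
 * (I/Os per second) x (CPU nanoseconds per I/O) / (nanoseconds per second). */
static double core_fraction(uint64_t iops, uint64_t ns_per_io)
{
    return (double)iops * (double)ns_per_io / 1e9;
}
```

With the numbers above, `core_fraction(1000000, 200)` comes out at about 0.2, i.e. 20% of a core before admin work, poller overhead, and application code are added.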
Tail latency: why p99.9 is the only number that matters
Database people, storage people, and SREs have a saying: amateurs optimize average latency, pros optimize tail latency. The reason is that a single 100 ms outlier in a million I/Os doesn't change the average much (mean shifts by 0.1 µs) but it can stall every query that's waiting on that I/O. The query's own tail is now 100 ms.
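The effect of one outlier on the mean versus the maximum can be checked with a few lines of C. This is a sketch under stated assumptions: `summarize` is a hypothetical helper, and the 100 µs baseline latency is an illustrative value not taken from the text.

```c
#include <stddef.h>

struct lat_summary {
    double mean_us;
    double max_us;
};

/* Summarize n latency samples where sample 0 is an outlier
 * and every other sample equals base_us. */
static struct lat_summary summarize(size_t n, double base_us, double outlier_us)
{
    struct lat_summary s = { 0.0, 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        double sample = (i == 0) ? outlier_us : base_us;

        sum += sample;
        if (sample > s.max_us) {
            s.max_us = sample;
        }
    }
    s.mean_us = (n > 0) ? sum / (double)n : 0.0;
    return s;
}
```

For one million samples with a single 100 ms (100,000 µs) outlier, the mean moves by roughly 0.1 µs while the maximum is the full 100 ms, which is why the average hides the stall.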
Interrupt-driven I/O produces ugly tails because:
1. Interrupt coalescing: the device may wait up to a configurable interval before firing an interrupt, hoping to batch multiple completions. At low load, your I/O waits the whole interval for no reason.
2. IRQ steering: the interrupt may be delivered to a CPU that's already busy. It queues; you wait.
3. Softirq scheduling: the bottom-half runs at a priority below real-time processes, and may be preempted by them.
4. Cross-CPU cache misses: the IRQ runs on one CPU, the thread that wants the completion may be on another. The data has to be fetched across cores.
Polling eliminates the first three by construction: there is no interrupt, no coalescing, no softirq. The fourth is solved by pinning the polling thread to a specific core: the core that rings the doorbell is the same core that reads the CQ, so the cache line is hot.
flowchart LR
subgraph INTR["Interrupt-driven I/O"]
direction TB
I1["Device posts cpl"] --> I2["Fire MSI-X"]
I2 --> I3["IRQ routed to a CPU"]
I3 --> I4["Top-half: ack"]
I4 --> I5["Schedule bottom-half"]
I5 --> I6["Bottom-half runs"]
I6 --> I7["Process cpl"]
end
subgraph POLL["Poll-driven I/O (SPDK)"]
direction TB
P1["Device posts cpl"] --> P2["Poller loop: read CQ"]
P2 --> P3{"new cpl?"}
P3 -- "yes" --> P4["Process cpl"]
P4 --> P2
P3 -- "no" --> P5["Other reactor work"]
P5 --> P2
end
INTR -.->|8 steps, µs tail| POLL
fig. 1 The interrupt path has more steps and more variability. The poll path has exactly one step that matters: "is there a cpl?" If yes, process. If no, do other work and check again. The latency distribution flattens because every I/O waits the same minimum amount of time (one CQ read, one branch).
When polling breaks down
Polling is not a free lunch. There are workloads where it is the wrong choice:
Low IOPS
If your workload issues 1 K IOPS, polling wastes 99.9% of the core's time. Each CQ read that returns "no completions" still costs a function call, a branch, and a cache miss. A low-IOPS SPDK deployment pays the full polling cost while the loop almost always finds nothing to do. You'd be better off with a smaller deployment and interrupt mode.
Power-constrained environments
A pinned core running at 100% burns power and refuses to enter deep C-states. On a laptop, this is a battery killer. On a server, the data center pays for it in cooling and electricity. Mobile-class and edge deployments almost always want interrupts.
Mixed-workload hosts
Polling assumes one exclusive use of the core. If the host is also running a database, a control plane, monitoring, etc., pinning SPDK threads to specific cores works only if you have spare cores. "Spare" in a cloud VM rarely means "actually idle."
NVMe over Fabrics (sometimes)
NVMe-oF targets can use polling too, but the round-trip to a remote device is already in the 10–100 µs range. Saving 5 µs of interrupt latency doesn't matter as much, and the polling CPU cost is the same. See 6.1 for when to poll and when not to.
SPDK pollers — a preview of Layer 2
"Polling" in SPDK isn't literally a while(1) { check_cq(); }
busy loop. It's a function that a reactor calls repeatedly, either on every pass or on a timer. The reactor (covered in 2.1) loops over all registered pollers in round-robin order and calls each one.
A "poller" is a function pointer registered with the reactor:
- NVMe qpair poller: calls spdk_nvme_qpair_process_completions for the qpair pinned to this reactor.
- RDMA CQ poller: calls the RDMA completion-processing path for an NVMe-oF target.
- App poller: anything the application wants to run periodically. Periodic timers, retry logic, stats reporting.
The reactor doesn't know what pollers do. It just calls them. The poller decides whether to do work or return quickly. This is what makes SPDK's threading model composable: a reactor runs all the pollers registered to it, on one core, in order, and then starts the next pass. In the default poll mode it does not sleep between passes.
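The round-robin idea can be sketched in a few lines. This is a simplified model, not SPDK's reactor: `poller_fn`, `struct poller`, `reactor_run_once`, and `drain_poller` are hypothetical names, and real pollers are registered through SPDK's own API rather than placed in an array by hand.

```c
#include <stddef.h>

/* A poller reports how much work it did; 0 means "nothing this pass". */
typedef int (*poller_fn)(void *ctx);

struct poller {
    poller_fn fn;
    void *ctx;
};

/* One reactor pass: call every registered poller once, in order.
 * The reactor does not know what each poller does. */
static int reactor_run_once(struct poller *pollers, size_t count)
{
    int work = 0;

    for (size_t i = 0; i < count; i++) {
        work += pollers[i].fn(pollers[i].ctx);
    }
    return work;
}

/* Example poller: drains a pending-work counter, at most 4 items per call,
 * so that it returns to the reactor quickly. */
static int drain_poller(void *ctx)
{
    int *pending = ctx;
    int batch = (*pending < 4) ? *pending : 4;

    *pending -= batch;
    return batch;
}
```

A real reactor would call the equivalent of `reactor_run_once` in an endless loop on its pinned core; the return value is only useful for statistics or idle detection.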
The "always busy" assumption
SPDK is designed for a specific deployment model, and the assumption is baked in everywhere:
- You control which cores SPDK runs on. You set -m 0xFF on the command line (or spdk_app_opts->reactor_mask programmatically) to tell the reactor which cores to claim. SPDK will not negotiate with you about this.
- Those cores are dedicated to SPDK. They should not run the kernel's kthreads, other userspace processes, or anything else. The "always busy" assumption is that the reactor keeps spinning even when SPDK has nothing to do, so new work is noticed without any wake-up cost.
- Your application threads also live on SPDK cores. They are scheduled by the reactor, not the kernel. They are not preempted by the kernel scheduler mid-operation.
- No syscalls in the hot path. The I/O code path is straight C: no read(), no ioctl(), no mmap(), no mutex. (Locks do exist; they are SPDK's, not the kernel's, and they are designed for cooperative threads.)
If any of these assumptions break — if a kernel thread steals a core, if a syscall gets called, if another process is busy-looping on the same CPU — polling latency goes out the window. The design is correct, but only in the world it assumes.
The kernel scheduler still bites
Even with all the userspace magic, you haven't escaped the kernel. You've just made the boundaries predictable. Three things still touch the kernel:
- Setup and teardown. Mapping the PCI BAR, allocating hugepages, registering MSI-X vectors, and binding the device to a kernel stub driver (DPDK's igb_uio, or vfio-pci) so that userspace can own it all happen at startup, with syscalls. After that, the hot path stays out.
- Thread placement. The SPDK reactor creates a pthread per core, then calls pthread_setaffinity_np() to pin it. The kernel still has the final word on which logical CPUs exist and which runnable threads are scheduled where. On a busy host, even a pinned thread can be descheduled if the kernel decides to.
- Interrupts from the device. Even in poll mode, the device can still fire interrupts (e.g. on error, on async event). SPDK installs handlers that drain them but don't do real work. A storm of interrupts can still cause the OS to spend time in the kernel, which can perturb the polling core.
The result is that a "100% polling" deployment is more like "100%
polling on a well-isolated host." The kernel is still there, it's
just out of the way. This is why production SPDK deployments are
often on dedicated hosts, dedicated cores, with kernel isolation
features (isolcpus, nohz_full,
rcu_nocbs) tuned to keep kernel work off the SPDK cores.
Source walkthrough: how a poll group processes completions
Here is the part of the poll path that ties a qpair to a reactor. The
actual work is in spdk_nvme_qpair_process_completions
shown above; this is the layer that calls it in a loop, across
multiple qpairs on the same thread.
Edge cases & what trips people up
1. A poll loop spike can starve other work on the same core
If a qpair has 1000 outstanding I/Os and the device just completed them all, the poller will spend many microseconds processing them. Anything else registered on the same reactor (a periodic stats poll, an RPC handler, a timer) waits. The reactor is cooperative: it doesn't preempt, it just runs pollers in order until they return 0 work.
The fix is to bound the work per pass. SPDK does this with
max_completions_cap in
lib/nvme/nvme_pcie_common.c:154 — the poller
reaps at most 1/4 of the queue depth (or 1, whichever is larger) per
call, then returns to the reactor for other work.
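The bound described above can be sketched as follows. This models only the rule as stated in the text (a quarter of the queue depth, with a floor of 1); `completions_cap` and `reap_bounded` are hypothetical names, and any further limits the real driver applies are not modeled.

```c
#include <stdint.h>

/* Per-call cap as described in the text: queue_depth / 4, but at least 1. */
static uint32_t completions_cap(uint32_t queue_depth)
{
    uint32_t cap = queue_depth / 4;

    return (cap > 0) ? cap : 1;
}

/* Reap at most the cap from `pending`; the remainder waits for the
 * next reactor pass so other pollers get a turn. */
static uint32_t reap_bounded(uint32_t *pending, uint32_t queue_depth)
{
    uint32_t cap = completions_cap(queue_depth);
    uint32_t n = (*pending < cap) ? *pending : cap;

    *pending -= n;
    return n;
}
```

With a queue depth of 1024 and 1000 completions waiting, one call reaps 256 and leaves 744 for later passes, which caps how long any single poller can hold the core.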
2. What happens when nothing is happening
In the default poll mode, the reactor keeps spinning even when every poller returns 0, so an idle SPDK core still shows 100% utilization. Only when the reactor runs in interrupt mode does it block on its registered fd's (an eventfd on Linux, among others) until one becomes ready; in that mode there is no spin cost when idle. Even then, "idle" means "no NVMe completions, no timers, no incoming RPCs." If you have periodic background work (a stats reporter every 100 ms, say), that work wakes the reactor every time it fires, so the core never stays asleep for long.
3. The kernel scheduler can still steal your core
If the host is oversubscribed (more runnable threads than cores), the
kernel scheduler can preempt your pinned SPDK thread. The
isolcpus kernel command-line argument isolates cores
from the general scheduler; nohz_full disables timer
ticks on a core; rcu_nocbs keeps RCU callbacks off
specific cores. Without these, your "100% polling" deployment can
drop to 80% effective CPU without any obvious cause.
4. Page faults can blow your latency
SPDK doesn't normally page fault — its memory is pre-allocated,
hugepage-backed, and pinned. But if you malloc()
something on the hot path (don't), or if your callback calls into
a library that does, you'll see a 10–100 µs outlier that has
nothing to do with NVMe. Profilers will attribute it to "spdk"
but the cause is your code. Use hugepage pools (spdk_malloc)
for any data structure that lives on the I/O path.
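The "allocate at setup, reuse on the hot path" pattern can be sketched with a fixed-size buffer pool. This is an illustration only: plain malloc() stands in for the hugepage-backed allocation that spdk_malloc would provide, and `struct buf_pool`, `pool_init`, `pool_get`, `pool_put`, and `pool_destroy` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct buf_pool {
    char  *slab;        /* one contiguous allocation made at setup */
    void **free_list;   /* stack of currently free buffers */
    size_t free_count;
};

/* Setup path: all allocation and first-touch page faults happen here. */
static int pool_init(struct buf_pool *p, size_t count, size_t buf_size)
{
    p->slab = malloc(count * buf_size);   /* assumption: stand-in for pinned memory */
    p->free_list = malloc(count * sizeof(void *));
    if (p->slab == NULL || p->free_list == NULL) {
        free(p->slab);
        free(p->free_list);
        return -1;
    }
    memset(p->slab, 0, count * buf_size); /* touch every page before the hot path */
    for (size_t i = 0; i < count; i++) {
        p->free_list[i] = p->slab + i * buf_size;
    }
    p->free_count = count;
    return 0;
}

/* Hot path: pop a buffer; no allocator call, no syscall. NULL when empty. */
static void *pool_get(struct buf_pool *p)
{
    return (p->free_count > 0) ? p->free_list[--p->free_count] : NULL;
}

/* Hot path: push a buffer back for reuse. */
static void pool_put(struct buf_pool *p, void *buf)
{
    p->free_list[p->free_count++] = buf;
}

static void pool_destroy(struct buf_pool *p)
{
    free(p->free_list);
    free(p->slab);
}
```

An exhausted pool returns NULL instead of growing, which keeps the hot path free of allocator calls at the cost of having to size the pool up front.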
5. Interrupt mode is still available — but it's not the default
SPDK has an enable_interrupts option (see lib/nvme/nvme_pcie.c:1049 and lib/nvme/nvme_poll_group.c:87). The reasons you'd turn it on:
- You're doing low-rate I/O (e.g. management operations, telemetry) and don't want to pin a core.
- You're integrating with a host that can't isolate cores (e.g. a managed Kubernetes pod).
- You're debugging and want interrupts as a sanity check against silent polling bugs.
SPDK's interrupt mode is less optimized than the kernel's, by design — it exists for compatibility, not for performance. The default is poll mode for a reason.
6. Two pollers on the same qpair can double-count
The poll group has a re-entrancy guard (group->in_process_completions) precisely to prevent this. If you call spdk_nvme_poll_group_process_completions recursively, e.g. from a completion handler that submits a new request and then polls for its completion, the inner call returns 0 instead of processing anything. The outer call's state stays consistent. The tradeoff is that nested completions are processed on the next reactor pass, not immediately. For most workloads, this delay is negligible. For ultra-low-latency workloads, you need to think about it.
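The guard's behavior can be modeled in a few lines. This is a simplified sketch of the idea, not SPDK's implementation: only the flag name `in_process_completions` comes from the text; `struct poll_group`, `group_process_completions`, `on_completion`, and the counter fields are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct poll_group {
    bool     in_process_completions; /* set while a pass is running */
    uint32_t pending;                /* completions waiting to be reaped */
    uint32_t resubmits_left;         /* callbacks that will submit new I/O */
    uint32_t nested_refused;         /* nested calls turned away by the guard */
};

static int64_t group_process_completions(struct poll_group *g);

/* Completion callback that submits a new request and then (wrongly)
 * polls again from inside the outer pass. */
static void on_completion(struct poll_group *g)
{
    if (g->resubmits_left > 0) {
        g->resubmits_left--;
        g->pending++;               /* model: the new request completes at once */
    }
    if (group_process_completions(g) == 0) {
        g->nested_refused++;        /* the guard returned 0 to the inner call */
    }
}

static int64_t group_process_completions(struct poll_group *g)
{
    if (g->in_process_completions) {
        return 0;                   /* re-entrant call: do nothing */
    }
    g->in_process_completions = true;

    uint32_t batch = g->pending;    /* work added during this pass waits */
    g->pending -= batch;
    for (uint32_t i = 0; i < batch; i++) {
        on_completion(g);
    }

    g->in_process_completions = false;
    return batch;
}
```

In this model, a completion produced while a pass is running stays in `pending` and is reaped by the next top-level call, matching the "next reactor pass" delay described above.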
Why it matters
Once you internalize "polling wins at high IOPS, interrupts win at low IOPS, the cutoff is somewhere around 50–100 K IOPS per core," a lot of SPDK's design choices stop looking strange:
- One thread per core, no preemption — because preemption is interrupt-driven by another name.
- SPDK allocates its own memory — because page faults are interrupts, and we promised no interrupts.
- SPDK implements its own locks (or avoids them) — because the kernel's locks are designed for preemptive scheduling, and we don't have that.
- The reactor is a single-threaded loop — because cooperative scheduling and polling compose naturally; preemptive scheduling and polling fight each other.
When you see a "low IOPS" deployment of SPDK doing badly, you'll know to look at the polling cost. When you see a "high IOPS" deployment doing badly, you'll know to look at the non-polling parts of the system (syscalls, locks, callbacks, logging). The mental model is the same in both cases: polling is free at high rate, expensive at low rate; interrupts are the opposite.
The next layer (1.1) starts putting these pieces together: what SPDK is, how it's organized, and where the bdev, lvol, nvmf, and vhost frameworks fit. The vocabulary from this primer is the vocabulary the rest of the curriculum assumes you have.