Layer 0 · Prerequisite primer

Poll-mode vs interrupt-driven I/O.

The first two pages explained why SPDK runs in userspace and what an NVMe device looks like. This page is the operational one: how SPDK actually knows when an I/O is done. The answer is that it asks, in a tight loop, forever, on a thread pinned to one CPU. Why? Because asking is faster than being told, when "being told" costs a context switch.

~10 min read · 1 diagram · prerequisite: 0.2, NVMe at the hardware level

How interrupts actually work on a modern system

"Interrupts" is one of those words that everyone uses and almost no one defines. Let's pin it down.

An interrupt is a signal from a device to the CPU saying "I need attention." The CPU stops what it's doing, saves enough state to come back, runs a handler, then returns. For a PCIe device like an NVMe controller, the modern form is MSI-X: the device writes a small message to a specific address, the interrupt controller (on x86, the target core's local APIC, with the IOMMU remapping the message when interrupt remapping is enabled) routes it to a specific CPU core, and that core's interrupt handler runs.

The operating system's role is to:

Map the device's MSI-X vectors to handlers at device-init time.
Provide a top-half (the interrupt handler itself, which runs with interrupts disabled and must be quick) and a bottom-half (tasklet, softirq, or work item, which does the real work).
Schedule the bottom-half on a CPU, respecting affinity and preemption rules.

That's a lot of moving parts for "the device is done." And every step costs time.

The cost of an interrupt, line by line

Let's time the path from "device posts a completion" to "host code runs the completion handler." This is what an interrupt-driven kernel driver does, on every single I/O.

1. cpl arrives: device writes 16 B to CQ in RAM
2. MSI-X fired: device writes interrupt message
3. APIC delivers: interrupt controller routes to a CPU
4. CPU takes IRQ: saves registers, switches to kernel stack
5. Top half runs: ack IRQ, schedule bottom-half
6. iret: return from interrupt
7. Bottom-half: softirq/tasklet, may be preempted
8. Handler runs: the actual cpl-processing code

On a quiet system, steps 1–8 take ~5–10 µs end-to-end. On a loaded system with many devices sharing interrupt lines, softirq congestion, and CPU contention, that number can balloon to 50–200 µs. And that's the average. The tail is worse.

How polling works — and what "polling" means here

"Polling" sounds lazy, but in this context it is the opposite. The host runs a tight loop on a dedicated CPU core, and the loop has exactly one job: look at the CQ, see if there are new completions, process them, repeat.

That's it. No interrupts. No context switches. No kernel mediation. No scheduler. No preemption. The thread runs until it has nothing to do, then runs spdk_thread_poll() (which you'll meet in 2.1) to see if anything else arrived, and immediately starts looking again.
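
To make "look at the CQ, see if there are new completions" concrete, here is a minimal toy sketch of one polling pass. It is not SPDK code: every `toy_` name is hypothetical, the entry layout is simplified, and it assumes the phase-tag scheme NVMe uses to mark new entries (the device flips the tag each time it wraps around the ring).

```c
#include <assert.h>
#include <stdint.h>

#define CQ_DEPTH 4u

/* Hypothetical, simplified completion entry (a real NVMe entry is 16 bytes). */
struct toy_cpl {
    uint16_t cid;    /* command identifier being completed */
    uint8_t  phase;  /* flipped by the device on each pass over the ring */
};

struct toy_cq {
    struct toy_cpl ring[CQ_DEPTH];
    uint32_t head;   /* next slot the host will inspect */
    uint8_t  phase;  /* phase value that marks a slot as new */
};

static void toy_cq_init(struct toy_cq *cq)
{
    for (uint32_t i = 0; i < CQ_DEPTH; i++) {
        cq->ring[i].cid = 0;
        cq->ring[i].phase = 0;
    }
    cq->head = 0;
    cq->phase = 1;   /* on the first pass, entries tagged 1 are new */
}

/* Stand-in for the device writing a completion into a slot. */
static void toy_device_post(struct toy_cq *cq, uint32_t slot, uint16_t cid, uint8_t phase)
{
    cq->ring[slot].cid = cid;
    cq->ring[slot].phase = phase;
}

/* One polling pass: reap up to max new entries and never block. */
static uint32_t toy_cq_poll(struct toy_cq *cq, uint32_t max, uint16_t *out_cids)
{
    uint32_t reaped = 0;

    assert(max > 0);
    while (reaped < max) {
        const struct toy_cpl *cpl = &cq->ring[cq->head];

        if (cpl->phase != cq->phase) {
            break;   /* nothing new: return immediately */
        }
        out_cids[reaped++] = cpl->cid;
        if (++cq->head == CQ_DEPTH) {
            cq->head = 0;
            cq->phase ^= 1;   /* wrapped: expect the opposite tag next time */
        }
    }
    return reaped;
}
```

The empty-queue case costs one load and one compare, which is the whole point: the caller can afford to ask again immediately.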

spdk_v26_01_migration/lib/nvme/nvme_qpair.c · lines 833-909 spdk_nvme_qpair_process_completions — what a poller actually does

This is the function a reactor thread calls for each qpair it owns. It is the entire "process completions" loop. Notice that the function returns a count — the caller is expected to call it again, immediately, in a loop. That's polling.

int32_t
spdk_nvme_qpair_process_completions(struct spdk_nvme_qpair *qpair, uint32_t max_completions)
{
    int32_t ret;
    struct nvme_request *req, *tmp;

    if (nvme_qpair_is_admin_queue(qpair)) {
        /* Complete any pending register operations */
        nvme_complete_register_operations(qpair);
        /* Process transport-specific events */
        nvme_transport_ctrlr_process_transport_events(qpair->ctrlr);
    }

    if (spdk_unlikely(qpair->ctrlr->is_failed && ...)) {
        // controller in failure state: abort all in-flight I/O
        return -ENXIO;
    }

    // ... error injection handling ...

    qpair->in_completion_context = 1;
    ret = nvme_transport_qpair_process_completions(qpair, max_completions);
    // ret = number of completions reaped
    qpair->in_completion_context = 0;

    if (ret > 0) {
        // Resubmit anything that was waiting for a free tracker
        nvme_qpair_resubmit_requests(qpair, ret);
    }
    return ret;
}

What the function does not do is just as important:

It does not sleep. The caller is expected to call it again immediately if it returned a non-zero count.
It does not take a lock. The queue pair is owned by exactly one thread (enforced by convention; SPDK's threading model is the topic of 2.4).
It does not allocate. Trackers are pre-allocated at queue-pair construction and reused.
It does not cross an address-space boundary. Everything is in the same process, on the same core, touching the same cache lines.

The latency-vs-CPU trade-off

The deal SPDK makes is straightforward:

Interrupt mode

CPU: low. The thread sleeps when there is no work.

Latency: variable. Wake-up cost dominates; p99.9 is bad.

Throughput: bounded by IRQ rate. Each I/O consumes kernel and scheduler time.

Poll mode (SPDK default)

CPU: 100% of one core per polling thread. The thread is always running.

Latency: tight, predictable. p50 and p99.9 differ by single-digit µs.

Throughput: bounded by the device. The bottleneck is the SSD, not the kernel.

The trade is honest: give up an entire CPU core (or more, if you run several polling threads) in exchange for latency you can reason about. The core isn't wasted; it is spent. Every microsecond it runs is a microsecond the application is moving data, not waiting for an interrupt.

How much of that core is useful work in practice? The OS reports 100% because the loop never sleeps, but on a modern Xeon running 4 KB random reads at 1 M IOPS, a single poller might spend 60–90% of its core on real work. Servicing more qpairs from the same poller gets you closer to saturation. The math is throughput × per-I/O cost: if each completion takes 200 ns of CPU, 1 M IOPS costs 0.2 CPU-seconds per second, which is 20% of a core. Add admin work, poller overhead, and the application itself, and you arrive in that 60–90% range.
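
The throughput × per-I/O cost arithmetic can be written down as a small sketch. The function names are hypothetical; the 200 ns and 1 M IOPS inputs in the usage note are the figures from the paragraph above.

```c
#include <assert.h>

/* Fraction of one core consumed by completions: IOPS x CPU time per I/O. */
static double core_fraction(double iops, double ns_per_io)
{
    assert(iops >= 0.0 && ns_per_io >= 0.0);
    return iops * ns_per_io / 1e9;   /* CPU-nanoseconds per second / 1e9 */
}

/* Inverse: the IOPS one poller can sustain within a given share of a core. */
static double max_iops(double core_share, double ns_per_io)
{
    assert(core_share >= 0.0 && ns_per_io > 0.0);
    return core_share * 1e9 / ns_per_io;
}
```

With these inputs, `core_fraction(1e6, 200.0)` evaluates to 0.2, the 20% from the text, and `max_iops(1.0, 200.0)` gives 5 M IOPS as the ceiling if completion processing were the only cost.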

Tail latency: why p99.9 is the only number that matters

Database people, storage people, and SREs have a saying: amateurs optimize average latency, pros optimize tail latency. The reason is that a single 100 ms outlier in a million I/Os doesn't change the average much (mean shifts by 0.1 µs) but it can stall every query that's waiting on that I/O. The query's own tail is now 100 ms.
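
The arithmetic behind that claim, as a small sketch. The 10 µs baseline latency is an assumption picked for illustration; the single 100 ms outlier and the one-million sample count come from the text. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct lat_summary {
    double mean_us;   /* arithmetic mean of all samples */
    double max_us;    /* worst sample seen */
};

/* Summarize n latency samples, of which `outliers` take outlier_us
 * and the rest take base_us. */
static struct lat_summary summarize(uint32_t n, double base_us,
                                    uint32_t outliers, double outlier_us)
{
    double sum = 0.0, max = 0.0;

    assert(n > 0 && outliers <= n);
    for (uint32_t i = 0; i < n; i++) {
        double v = (i < outliers) ? outlier_us : base_us;

        sum += v;
        if (v > max) {
            max = v;
        }
    }
    return (struct lat_summary){ sum / n, max };
}
```

Calling `summarize(1000000, 10.0, 1, 100000.0)` moves the mean from 10 µs to roughly 10.1 µs, while the worst sample, the one a waiting query actually experiences, is 100 000 µs.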

Interrupt-driven I/O produces ugly tails because:

Interrupt coalescing — the device may wait up to a configurable interval before firing an interrupt, hoping to batch multiple completions. At low load, your I/O waits the whole interval for no reason.
IRQ steering — the interrupt may be delivered to a CPU that's already busy. It queues; you wait.
Softirq scheduling — the bottom-half runs at a priority below real-time processes, and may be preempted by them.
Cross-CPU cache misses — the IRQ runs on one CPU, the thread that wants the completion may be on another. The data has to be fetched across cores.

Polling eliminates the first three by construction — there is no interrupt, no coalescing, no softirq. The fourth is solved by pinning the polling thread to a specific core: the same core that rings the doorbell is the same core that reads the CQ. The cache line is hot.

flowchart LR
subgraph INTR["Interrupt-driven I/O"]
  direction TB
  I1["Device posts cpl"] --> I2["Fire MSI-X"]
  I2 --> I3["IRQ routed to a CPU"]
  I3 --> I4["Top-half: ack"]
  I4 --> I5["Schedule bottom-half"]
  I5 --> I6["Bottom-half runs"]
  I6 --> I7["Process cpl"]
end

subgraph POLL["Poll-driven I/O (SPDK)"]
  direction TB
  P1["Device posts cpl"] --> P2["Poller loop: read CQ"]
  P2 --> P3{"new cpl?"}
  P3 -- "yes" --> P4["Process cpl"]
  P4 --> P2
  P3 -- "no" --> P5["Other reactor work"]
  P5 --> P2
end

INTR -.->|8 steps, µs tail| POLL

fig. 1 · Interrupt vs poll, side by side

fig. 1 The interrupt path has more steps and more variability. The poll path has exactly one step that matters: "is there a cpl?" If yes, process. If no, do other work and check again. The latency distribution flattens because every I/O waits the same minimum amount of time (one CQ read, one branch).

When polling breaks down

Polling is not a free lunch. There are workloads where it is the wrong choice:

Low IOPS

If your workload issues 1 K IOPS, polling wastes roughly 99.9% of the core's time. Each CQ read that returns "no completions" still costs a function call, a load, and a branch. A low-IOPS SPDK deployment pays the full polling cost while the poller spends nearly all of its time spinning on an empty queue. You'd be better off with a smaller deployment and interrupt mode.

Power-constrained environments

A pinned core running at 100% burns power and refuses to enter deep C-states. On a laptop, this is a battery killer. On a server, the data center pays for it in cooling and electricity. Mobile-class and edge deployments almost always want interrupts.

Mixed-workload hosts

Polling assumes one exclusive use of the core. If the host is also running a database, a control plane, monitoring, etc., pinning SPDK threads to specific cores works only if you have spare cores. "Spare" in a cloud VM rarely means "actually idle."

NVMe over Fabrics (sometimes)

NVMe-oF targets can use polling too, but the round-trip to a remote device is already in the 10–100 µs range. Saving 5 µs of interrupt latency doesn't matter as much, and the polling CPU cost is the same. See 6.1 for when to poll and when not to.

SPDK pollers — a preview of Layer 2

"Polling" in SPDK isn't literally a dedicated while(1) { check_cq(); } busy loop. It's a function that a reactor calls repeatedly, either on every iteration or on a timer period. The reactor (covered in 2.1) loops over all registered pollers in round-robin order and calls each one.

A "poller" is a function pointer registered with the reactor:

NVMe qpair poller — calls spdk_nvme_qpair_process_completions for the qpair pinned to this reactor.
RDMA CQ poller — calls the RDMA completion-processing path for an NVMe-oF target.
App poller — anything the application wants to run periodically. Periodic timers, retry logic, stats reporting.

The reactor doesn't know what pollers do. It just calls them. The poller decides whether to do work or return quickly. This is what makes SPDK's threading model composable: a reactor runs all the pollers registered to it, on one core, in order, and then starts over. In the default poll mode it never blocks; only in interrupt mode does it sleep until the next event.
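
A minimal sketch of that idea: a reactor is a table of function pointers called in order. This is a toy, not SPDK's reactor; all `toy_` names, the fixed-size table, and the "return the amount of work done" convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8

/* A poller reports how much work it did on this call (0 = nothing to do). */
typedef int (*toy_poller_fn)(void *ctx);

struct toy_reactor {
    toy_poller_fn fn[MAX_POLLERS];
    void *ctx[MAX_POLLERS];
    size_t count;
};

/* Register a poller; returns 0 on success, -1 if the table is full. */
static int toy_reactor_register(struct toy_reactor *r, toy_poller_fn fn, void *ctx)
{
    assert(fn != NULL);
    if (r->count == MAX_POLLERS) {
        return -1;
    }
    r->fn[r->count] = fn;
    r->ctx[r->count] = ctx;
    r->count++;
    return 0;
}

/* One reactor iteration: call every poller once, in registration order. */
static int toy_reactor_run_once(struct toy_reactor *r)
{
    int total = 0;

    for (size_t i = 0; i < r->count; i++) {
        total += r->fn[i](r->ctx[i]);
    }
    return total;
}

/* Example poller: drains a counter of pending items, at most 2 per call. */
static int drain_poller(void *ctx)
{
    int *pending = ctx;
    int done = (*pending < 2) ? *pending : 2;

    *pending -= done;
    return done;
}
```

The reactor never inspects what a poller does. An NVMe poller, an RDMA poller, and an application timer all look the same to it, which is why they can share one core.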

The "always busy" assumption

SPDK is designed for a specific deployment model, and the assumption is baked in everywhere:

You control which cores SPDK runs on. You set -m 0xFF on the command line (or the reactor_mask field of struct spdk_app_opts programmatically) to tell the reactor which cores to claim. SPDK will not negotiate with you about this.
Those cores are dedicated to SPDK. They should not run the kernel's kthreads, other userspace processes, or anything else. The "always busy" assumption is that the reactor keeps spinning even when SPDK has nothing to do, so that when work arrives it is noticed on the very next pass.
Your application threads also live on SPDK cores. They are scheduled by the reactor, not the kernel, and the reactor never preempts them mid-operation.
No syscalls in the hot path. The I/O code path is straight C — no read(), no ioctl(), no mmap(), no mutex. (Locks do exist; they are SPDK's, not the kernel's, and they are designed for cooperative threads.)

If any of these assumptions break — if a kernel thread steals a core, if a syscall gets called, if another process is busy-looping on the same CPU — polling latency goes out the window. The design is correct, but only in the world it assumes.
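
To see what a mask like 0xFF selects, here is a small decoder sketch. It is not how SPDK parses its arguments; `cores_from_mask` is a hypothetical helper that handles only plain hexadecimal masks of up to 64 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode a hex core mask such as "0xFF" into a list of core IDs.
 * Returns the number of IDs written to cores[]. */
static int cores_from_mask(const char *mask_str, int *cores, int max_cores)
{
    /* Base 16 accepts an optional "0x" prefix. */
    uint64_t mask = strtoull(mask_str, NULL, 16);
    int n = 0;

    assert(cores != NULL && max_cores > 0);
    for (int core = 0; core < 64 && n < max_cores; core++) {
        if (mask & (UINT64_C(1) << core)) {
            cores[n++] = core;   /* bit N set means core N is claimed */
        }
    }
    return n;
}
```

Under this reading, 0xFF claims cores 0 through 7, and a mask of 0x5 claims cores 0 and 2.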

The kernel scheduler still bites

Even with all the userspace magic, you haven't escaped the kernel. You've just made the boundaries predictable. Three things still touch the kernel:

Setup and teardown. Mapping the PCI BAR, allocating hugepages, registering MSI-X vectors, binding the device to a kernel stub driver that exposes it to userspace (DPDK's igb_uio, or vfio-pci): all of this happens at startup, with syscalls. After that, the hot path stays out.
Thread placement. The SPDK reactor creates a pthread per core, then calls pthread_setaffinity_np() to pin it. The kernel still has the final word on which logical CPUs exist and which runnable threads are scheduled where. On a busy host, even a pinned thread can be descheduled if the kernel decides to.
Interrupts from the device. Even in poll mode, the device can still fire interrupts (e.g. on error, on async event). SPDK installs handlers that drain them but don't do real work. A storm of interrupts can still cause the OS to spend time in the kernel, which can perturb the polling core.

The result is that a "100% polling" deployment is more like "100% polling on a well-isolated host." The kernel is still there, it's just out of the way. This is why production SPDK deployments are often on dedicated hosts, dedicated cores, with kernel isolation features (isolcpus, nohz_full, rcu_nocbs) tuned to keep kernel work off the SPDK cores.
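
Pinning itself is a few lines. The sketch below is Linux-only and uses sched_setaffinity() with a pid of 0, which applies to the calling thread and is the call pthread_setaffinity_np() relies on there; the helper names are hypothetical. Note what it does not do: it restricts where the thread may run, but it does not stop the kernel from running something else on that core.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for sched_setaffinity and the CPU_* macros */
#endif
#include <sched.h>

/* Lowest CPU the calling thread is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

/* Restrict the calling thread to one logical CPU. Returns 0 on success. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling thread may run on, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    return CPU_COUNT(&set);
}
```

After a successful `pin_self_to_cpu()`, `allowed_cpu_count()` reports 1. Keeping other work off that CPU is the job of the isolation settings mentioned above, not of the affinity call.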

Source walkthrough: how a poll group processes completions

Here is the part of the poll path that ties a qpair to a reactor. The actual work is in spdk_nvme_qpair_process_completions shown above; this is the layer that calls it in a loop, across multiple qpairs on the same thread.

spdk_v26_01_migration/lib/nvme/nvme_poll_group.c · lines 339-373 spdk_nvme_poll_group_process_completions — drain all qpairs in this group

A poll group is the structure that ties a set of qpairs to a single polling thread. The reactor calls this function on every iteration. The function iterates each transport's contribution and reaps completions.

int64_t
spdk_nvme_poll_group_process_completions(struct spdk_nvme_poll_group *group,
        uint32_t completions_per_qpair, spdk_nvme_disconnected_qpair_cb disconnected_qpair_cb)
{
    struct spdk_nvme_transport_poll_group *tgroup;
    int64_t error_reason = 0, num_completions = 0;

    if (spdk_unlikely(disconnected_qpair_cb == NULL)) {
        return -EINVAL;
    }

    if (spdk_unlikely(group->in_process_completions)) {
        return 0;     // re-entrancy guard
    }
    group->in_process_completions = true;

    STAILQ_FOREACH(tgroup, &group->tgroups, link) {
        int64_t local_completions;

        local_completions = nvme_transport_poll_group_process_completions(
                                tgroup, completions_per_qpair,
                                disconnected_qpair_cb);
        if (spdk_unlikely(local_completions < 0)) {
            if (!error_reason) {
                error_reason = local_completions;
            }
        } else {
            num_completions += local_completions;
            assert(num_completions >= 0);
        }
    }
    group->in_process_completions = false;

    return error_reason ? error_reason : num_completions;
}

The re-entrancy guard at the top is there because completion handlers can themselves call into the poll group (e.g. to submit a follow-up read). Without the guard, the inner call would corrupt the outer call's iteration state.

The function returns the total number of completions reaped. The caller uses this number to decide whether to keep polling or yield to other reactor work. The pattern is:

while (spdk_nvme_poll_group_process_completions(group, ...) > 0) {
    // keep going while we have work
}
// now do other reactor pollers

Edge cases & what trips people up

1. A poll loop spike can starve other work on the same core

If a qpair has 1000 outstanding I/Os and the device just completed them all, the poller will spend many microseconds processing them. Anything else registered on the same reactor (a periodic stats poll, an RPC handler, a timer) waits. The reactor is cooperative: it doesn't preempt, it just runs each poller in order and waits for it to return.

The fix is to bound the work per pass. SPDK does this with max_completions_cap in lib/nvme/nvme_pcie_common.c:154 — the poller reaps at most 1/4 of the queue depth (or 1, whichever is larger) per call, then returns to the reactor for other work.
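
The bounding rule as described, in sketch form. This follows the prose above and is not a copy of the SPDK source; the function names are hypothetical, and treating a request of 0 as "use the cap" is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Per-call completion budget: a quarter of the queue depth, never below 1. */
static uint32_t completions_cap(uint32_t queue_depth)
{
    uint32_t cap = queue_depth / 4;

    return cap > 1 ? cap : 1;
}

/* Budget used for one call: the caller's request clamped to the cap.
 * Assumption: a request of 0 means "no preference, use the cap". */
static uint32_t effective_budget(uint32_t requested, uint32_t queue_depth)
{
    uint32_t cap = completions_cap(queue_depth);

    assert(queue_depth > 0);
    return (requested == 0 || requested > cap) ? cap : requested;
}
```

For a 1024-entry queue this yields at most 256 completions per call, so a burst of 1000 completions is spread over several reactor passes and other pollers get to run in between.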

2. What happens when nothing is happening

In the default poll mode, the reactor does not sleep when every poller returns 0: it simply starts the next pass, so an idle core still shows 100% utilization. Only in interrupt mode does the reactor block (on Linux, waiting on file descriptors such as eventfds) until one of its registered fds becomes ready, and only then is there no spin cost when idle. Even there, "idle" means "no NVMe completions, no timers, no incoming RPCs." If you have periodic background work (a stats reporter every 100 ms, say), that work keeps waking the reactor, so it never stays asleep for long.

3. The kernel scheduler can still steal your core

If the host is oversubscribed (more runnable threads than cores), the kernel scheduler can preempt your pinned SPDK thread. The isolcpus kernel command-line argument isolates cores from the general scheduler; nohz_full disables timer ticks on a core; rcu_nocbs keeps RCU callbacks off specific cores. Without these, your "100% polling" deployment can drop to 80% effective CPU without any obvious cause.

4. Page faults can blow your latency

SPDK doesn't normally page fault — its memory is pre-allocated, hugepage-backed, and pinned. But if you malloc() something on the hot path (don't), or if your callback calls into a library that does, you'll see a 10–100 µs outlier that has nothing to do with NVMe. Profilers will attribute it to "spdk" but the cause is your code. Use hugepage pools (spdk_malloc) for any data structure that lives on the I/O path.
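
One common way to keep allocation off the hot path is a fixed pool carved out once at setup and recycled through a free list. The sketch below shows the pattern with hypothetical `toy_` names; it uses ordinary memory for simplicity, whereas a real SPDK data structure on the I/O path would sit in hugepage-backed memory as described above.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

struct toy_tracker {
    struct toy_tracker *next_free;   /* free-list link */
    int id;
};

struct toy_pool {
    struct toy_tracker slots[POOL_SIZE];   /* all storage allocated up front */
    struct toy_tracker *free_list;
};

/* Setup path: thread every slot onto the free list. */
static void toy_pool_init(struct toy_pool *p)
{
    p->free_list = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        p->slots[i].id = i;
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Hot path: pop a tracker. No malloc, no syscall, no page fault.
 * NULL means the pool is exhausted and the request must wait. */
static struct toy_tracker *toy_pool_get(struct toy_pool *p)
{
    struct toy_tracker *t = p->free_list;

    if (t != NULL) {
        p->free_list = t->next_free;
    }
    return t;
}

/* Hot path: push a tracker back once its I/O has completed. */
static void toy_pool_put(struct toy_pool *p, struct toy_tracker *t)
{
    assert(t >= p->slots && t < p->slots + POOL_SIZE);
    t->next_free = p->free_list;
    p->free_list = t;
}
```

Exhaustion is handled by queueing the request rather than growing the pool, which keeps the hot path free of allocator calls even under overload.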

5. Interrupt mode is still available — but it's not the default

SPDK has an enable_interrupts option (see lib/nvme/nvme_pcie.c:1049 and lib/nvme/nvme_poll_group.c:87). The reasons you'd turn it on:

You're doing low-rate I/O (e.g. management operations, telemetry) and don't want to pin a core.
You're integrating with a host that can't isolate cores (e.g. a managed Kubernetes pod).
You're debugging and want interrupts as a sanity check against silent polling bugs.

SPDK's interrupt mode is less optimized than the kernel's, by design — it exists for compatibility, not for performance. The default is poll mode for a reason.

6. Re-entrant calls into the poll group return 0

The poll group has a re-entrancy guard (group->in_process_completions) for exactly this case. If you call spdk_nvme_poll_group_process_completions recursively (e.g. from a completion handler that submits a new request and then tries to reap its completion right away), the inner call returns 0, not the actual completions. The outer call's state stays consistent. The tradeoff is that nested completions are processed on the next reactor pass, not immediately. For most workloads, this delay is negligible. For ultra-low-latency workloads, you need to think about it.
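
The guard pattern in isolation, as a toy sketch. The `toy_` names and the handler's behavior are hypothetical; only the shape of the guard (a flag set on entry, cleared on exit, with an early return of 0 when already set) mirrors the function shown in the walkthrough.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_group {
    bool in_process;     /* true while a processing pass is active */
    int  pending;        /* completions waiting to be reaped */
    int  followups;      /* handlers that should still submit a follow-up */
    int  nested_result;  /* what the nested call returned */
};

static int toy_group_process(struct toy_group *g);

/* Completion handler that misbehaves on purpose: it submits a follow-up
 * and immediately calls back into the processing function. */
static void toy_handler(struct toy_group *g)
{
    if (g->followups > 0) {
        g->followups--;
        g->pending++;    /* the follow-up "completes" right away in this toy */
    }
    g->nested_result = toy_group_process(g);
}

static int toy_group_process(struct toy_group *g)
{
    int batch;

    if (g->in_process) {
        return 0;        /* nested call: report no work, touch nothing */
    }
    g->in_process = true;

    batch = g->pending;  /* reap only what was pending at entry */
    g->pending = 0;
    for (int i = 0; i < batch; i++) {
        toy_handler(g);
    }

    g->in_process = false;
    return batch;
}
```

Starting with two pending completions and one follow-up, the first outer call returns 2, the nested calls return 0, and the follow-up completion is reaped by the second outer call.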

Why it matters

Once you internalize "polling wins at high IOPS, interrupts win at low IOPS, the cutoff is somewhere around 50–100 K IOPS per core," a lot of SPDK's design choices stop looking strange:

One thread per core, no preemption — because preemption is interrupt-driven by another name.
SPDK allocates its own memory, because a page fault traps into the kernel just as an interrupt does, and we promised no interrupts.
SPDK implements its own locks (or avoids them) — because the kernel's locks are designed for preemptive scheduling, and we don't have that.
The reactor is a single-threaded loop — because cooperative scheduling and polling compose naturally; preemptive scheduling and polling fight each other.

When you see a "low IOPS" deployment of SPDK doing badly, you'll know to look at the polling cost. When you see a "high IOPS" deployment doing badly, you'll know to look at the non-polling parts of the system (syscalls, locks, callbacks, logging). The mental model is the same in both cases: polling is free at high rate, expensive at low rate; interrupts are the opposite.

The next layer (1.1) starts putting these pieces together: what SPDK is, how it's organized, and where the bdev, lvol, nvmf, and vhost frameworks fit. The vocabulary from this primer is the vocabulary the rest of the curriculum assumes you have.