Extra: Linux Sys Internals Flows Flashcards

(15 cards)

1
Q

What are the 12 Linux/Systems Internals Flows?

A

The mental model:
- TLB lookups/misses, syscalls, interrupts, and context switches are the heartbeat (always happening).
- Page faults and network/block I/O are the breathing (workload-driven).
- Everything else is “occasional” to “rare”.

From most to least frequent on a busy production server:
- TLB misses — millions/sec to 10s of millions.
- Syscall flow — millions/sec. Every read, write, open, close, epoll_wait. The most frequent transition in the kernel by far.
- Interrupt flow — hundreds of thousands/sec. Timer ticks (250-1000 Hz × cores), NIC interrupts, disk completions, IPIs (inter-processor interrupts). On a busy network box NAPI (one interrupt services a batch of packets) reduces the count, but it stays massive.
- Context switch flow — tens of thousands to hundreds of thousands/sec. Every time a process blocks or its timeslice expires. A busy web server: 50K-200K/sec easily.
- Page fault flow — thousands to tens of thousands/sec. Mostly minor faults (COW, lazy allocation, mmap first-touch). Major faults (actual disk I/O) should be rare on a healthy system — if they’re not, something’s wrong.
- Network packet flow — thousands to millions/sec depending on workload. A 10Gbps NIC doing small packets can hit 14M pps. Each packet traverses the full protocol stack.
- Block I/O flow — hundreds to tens of thousands/sec. SSDs handle ~100K-500K IOPS. Mostly hidden by page cache — on a well-cached system, very few reads actually hit disk.
- VFS path walk + page cache flow — thousands/sec. Every open() walks the path. But dcache makes repeated lookups nearly free (just pointer chasing in memory), so the expensive walks (actual directory reads) are much rarer.
- Signal flow — tens to hundreds/sec normally. SIGCHLD from child exits, SIGALRM from timers, the occasional SIGHUP reload. Spikes during process storms. Low frequency in steady state.
- Process lifecycle flow — tens to hundreds/sec. Each HTTP request in Apache prefork = fork+exit. But most modern servers use persistent processes/threads, so this drops. Container orchestration (K8s) adds some.
- Memory reclaim flow — ideally near zero. kswapd wakes occasionally to maintain free page watermarks. If direct reclaim is firing frequently, you’re in trouble. OOM killer is a once-in-a-crisis event.
- Boot flow — once. Ever. Until the next kernel panic or planned reboot. Months apart in well-run production.

+1 Informational:
- Device hotplug flow — rare in production. Maybe network link flap, USB token, cloud disk attach. Datacenter servers: a few events per day or less. Exception: SR-IOV VF creation in cloud environments.
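Two of these rates (context switches and interrupts) are exposed as monotonic counters in /proc/stat, so you can check them against the ranges above. A minimal sketch (assumes Linux; the function name is mine):

```python
import time

def kernel_event_rates(interval=0.5):
    """Sample /proc/stat twice and return per-second rates for
    context switches ('ctxt') and interrupts ('intr')."""
    def snapshot():
        vals = {}
        with open("/proc/stat") as f:
            for line in f:
                fields = line.split()
                if fields[0] in ("ctxt", "intr"):
                    vals[fields[0]] = int(fields[1])
        return vals
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {k: (after[k] - before[k]) / interval for k in before}
```

On a busy web server you would expect ctxt in the tens of thousands per second; on an idle laptop, hundreds.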

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Costs of kernel transitions

Shortcut to remember: ns → μs → ms. Three 100x cliffs.

A

Orders of magnitude to remember:

TLB miss:        ~10ns
Syscall:         ~100ns      (10x TLB miss)
Interrupt:       ~1-5μs      (10-50x syscall: hardirq + softirq)
Ctx switch:      ~1-5μs      (threads, 10-50x syscall)
                 ~3-8μs      (processes: CR3 load + TLB flush)
Minor fault:     ~1-5μs      (same ballpark as context switch)
Major fault SSD: ~0.1ms      (100x context switch)
Major fault HDD: ~10ms       (100x SSD)

Interview-friendly version:
- TLB misses and syscalls are nanoseconds
- Interrupts, context switches, and minor faults are microseconds
- Major faults are milliseconds.

Six orders of magnitude (10ns → 10ms) separate the cheapest from the most expensive.

The KPTI tax:
- Before Meltdown (2018), a trivial syscall was ~50ns. KPTI added a page table switch on every kernel entry/exit, roughly doubling syscall cost.
- Spectre mitigations (retpolines, IBRS) added more.
- Modern syscalls are ~2-3x more expensive than pre-2018. This is why vDSO matters — gettimeofday() and clock_gettime() never enter the kernel at all, they read a shared page mapped into userspace.
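You can feel the difference from userspace: os.getpid() enters the kernel on every call (glibc ≥ 2.25 no longer caches it), while time.monotonic() is served by the vDSO and never traps to ring 0. A rough microbenchmark sketch (interpreter overhead dominates, so treat the numbers as upper bounds; per_call_ns is my name):

```python
import os
import time

def per_call_ns(fn, n=100_000):
    """Average wall-clock cost of fn() in nanoseconds.
    Includes Python call overhead, so it's an upper bound."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - t0) / n

# os.getpid: a real syscall on each call.
# time.monotonic: vDSO-backed clock_gettime, stays in ring 3.
```

Run per_call_ns(os.getpid) and per_call_ns(time.monotonic) side by side to see the mode-switch tax.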

3
Q

TLB miss + Page table walk flow

Freq: 0 - Cost: ~10ns (10 cycles @ 1GHz) + minor/major fault cost on #PF

A

TLB misses: millions to tens of millions/sec (10-100x more frequent than syscalls)

The page table is a multi-level lookup tree, much like a filesystem walking indirect blocks. Before the full walk, the hardware checks the Page Walk Cache (PWC), which caches upper-level entries (consulted before L1/L2/L3/RAM).

CPU executes instruction that accesses virtual address:

  1. check TLB (associative hardware lookup, ~1ns) → HIT: physical address returned, done
  2. MISS: hardware page table walker activates
    → read CR3 (PGD physical address)
    → fetch PGD entry (memory access ~3-5ns, likely in L1/L2 cache)
    → fetch PUD entry (memory access)
    → fetch PMD entry (memory access)
    → fetch PTE entry (memory access)
    → PTE valid + permissions OK:
    populate TLB with new entry
    retry instruction (now hits)
    total: ~10-50ns
  3. IF PTE not present / perms violation:
    → PTE not present:
    CPU raises Exception #PF → kernel → page fault flow
    → PTE permission violation:
    CPU raises Exception #PF → kernel → SIGSEGV or CoW

Huge pages shortcuts the walk:
- 4KB pages: CR3 → PGD → PUD → PMD → PTE (4 levels)
- 2MB pages: CR3 → PGD → PUD → PMD (3 levels, PMD → 2MB frame)
- 1GB pages: CR3 → PGD → PUD (2 levels)

Each individual walk is fast, but millions of them per second add up. That’s why hugepages (2MB) matter for DBs: fewer walk levels and far better TLB coverage.
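The 4-level walk is just bit-slicing the virtual address: 9 bits per level plus a 12-bit page offset. A toy sketch of how the hardware walker indexes each table (x86-64, 48-bit addresses; the function name is mine):

```python
def pt_indices(vaddr):
    """Split a 48-bit x86-64 virtual address into its four page
    table indices (9 bits each) and the 12-bit page offset."""
    return {
        "pgd": (vaddr >> 39) & 0x1FF,  # bits 47-39
        "pud": (vaddr >> 30) & 0x1FF,  # bits 38-30
        "pmd": (vaddr >> 21) & 0x1FF,  # bits 29-21
        "pte": (vaddr >> 12) & 0x1FF,  # bits 20-12
        "offset": vaddr & 0xFFF,       # bits 11-0
    }
```

A 2MB huge page simply stops the walk after the PMD index and uses bits 20-0 as the in-page offset.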

4
Q

Syscall (mode switch) flow

Frequency: 1 - Logical steps: 4, Cost: ~100ns (10x TLB miss)

A

Syscall (mode switch) flow (millions/s):

  1. userspace calls glibc wrapper → SYSCALL instruction
  2. CPU switches to ring 0, saves user registers to pt_regs, kernel entry trampoline (KPTI)
  3. syscall table lookup → handler executes → return value in RAX
  4. SYSRET instruction → back to ring 3 userspace.
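The glibc wrapper is just convenience; you can take the same SYSCALL path through the generic syscall(2) entry point. A sketch via ctypes (SYS_write = 1 is an x86-64 assumption, other architectures differ; raw_write is my name):

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
SYS_write = 1  # x86-64 syscall number (arch-specific assumption)

def raw_write(fd, data):
    """write(2) via syscall(2), bypassing the glibc write() wrapper.
    Same ring 0 transition, same handler, same return value in RAX."""
    n = libc.syscall(SYS_write, fd, data, len(data))
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return n
```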
5
Q

Interrupt flow

Frequency: 2 - Logical steps: 3, Cost: ~1-5μs (10-50x syscall)

A

Interrupt flow (hundreds of thousands/s)

  1. device asserts IRQ → CPU pauses current execution, saves state
  2. IDT lookup → hardirq handler runs with interrupts disabled → ACK to interrupt controller (APIC) → raise softirq if deferred work needed → hardirq returns
  3. softirq runs (ksoftirqd if too much) → deferred work completes → back to interrupted process unless resched.
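The softirq half of this flow is visible in /proc/softirqs (one column per CPU). A small parser sketch (assumes Linux; softirq_totals is my name):

```python
def softirq_totals():
    """Sum the per-CPU counts in /proc/softirqs into one total
    per softirq type (TIMER, NET_RX, BLOCK, RCU, ...)."""
    totals = {}
    with open("/proc/softirqs") as f:
        next(f)  # skip the CPU header row
        for line in f:
            name, *counts = line.split()
            totals[name.rstrip(":")] = sum(int(c) for c in counts)
    return totals
```

NET_RX dominating means the box is doing packet work; TIMER and RCU are the background heartbeat.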
6
Q

Context switch flow

Frequency: 3 - Logical steps: 3, Cost: ~1-5μs (10-50x syscall)

A

Context switch flow (tens of thousands to hundreds of thousands/sec)

  1. timer tick or voluntary sleep → CFS/EEVDF scheduler picks next task
  2. context_switch() → save kernel registers of current → if threads, no address space switch → else switch address spaces: load new PGD into CR3 (TLB flush, unless PCID tags let the CPU skip it) → restore kernel registers of next
  3. new task resumes in kernel (all switches happen during kernel mode: syscall, tick preemption, etc.) → eventually returns to resumed task userspace.
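getrusage() splits the two triggers apart: ru_nvcsw counts voluntary switches (the task blocked) and ru_nivcsw counts involuntary ones (timeslice expired, preempted). A sketch (Linux; names other than the rusage fields are mine):

```python
import resource
import time

def ctx_switch_counts():
    """(voluntary, involuntary) context switches for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw

# Blocking in sleep() forces a voluntary switch: the task leaves
# the run queue and the scheduler picks someone else.
v0, _ = ctx_switch_counts()
time.sleep(0.01)
v1, _ = ctx_switch_counts()
```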
7
Q

Page fault flow

Freq: 4 - Logical steps: 3, Cost: ~1-5μs (minor); ~100μs to 10ms (major)

A

Page fault flow (thousands to tens of thousands/sec)

  1. access to unmapped/protected virtual address → CPU raises #PF exception → kernel reads CR2 (faulting address) → handle_mm_fault()
  2. walk page tables → determine type
    2.1 minor (allocate frame, zero-fill or CoW copy, update PTE) or
    2.2 major (page not in RAM → block I/O to read from disk/swap)
  3. iret back to faulting instruction which now succeeds.
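Minor faults are easy to provoke: anonymous mmap memory has no frames behind it until first touch. A sketch counting them via ru_minflt (touch_pages is my name; the count also includes whatever else the interpreter faults in, hence the loose lower bound):

```python
import mmap
import resource

PAGE = 4096

def touch_pages(num_pages=256):
    """mmap anonymous memory and write one byte per page; each first
    touch is a minor fault (zero-fill on demand, no disk I/O)."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf = mmap.mmap(-1, num_pages * PAGE)
    for i in range(num_pages):
        buf[i * PAGE] = 1
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf.close()
    return after - before
```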
8
Q

Network packet flow

Freq: 5 - Logical steps: 4, Cost: ~5-15μs (NIC to buffer)

A

Network packet receive flow (thousands to millions/sec)

  1. packet arrives at NIC → NIC DMAs into ring buffer in RAM → raises hardirq →
  2. driver ACKs and schedules NAPI → softirq calls NAPI poll → driver pulls packets from ring buffer in batches (interrupts disabled, polling mode) → GRO aggregation (merges small pkts)
  3. netfilter/iptables rules → IP routing decision → TCP state machine: sequence validation, ACK processing, congestion window update → data appended to socket receive buffer
  4. wake process blocked in recv()/epoll.
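A loopback datagram exercises steps 3-4 of this flow: no physical NIC, DMA, or hardirq, but the packet still traverses softirq, IP, UDP, and the socket receive buffer before waking the receiver. A sketch (loopback_roundtrip is my name):

```python
import socket

def loopback_roundtrip(payload=b"ping"):
    """Send a UDP datagram to ourselves over loopback; the kernel
    runs the stack (softirq → IP → UDP → socket buffer) and wakes
    the blocked recvfrom()."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # kernel picks a free port
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(payload, rx.getsockname())
    data, _ = rx.recvfrom(1024)
    tx.close()
    rx.close()
    return data
```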
9
Q

Block I/O flow

Freq: 6 - L.Steps: 4, Cost: ~3-10μs (doorbell) + 50μs-15ms (device)

A

Block I/O flow (hundreds to tens of thousands/sec)

  1. filesystem or page cache calls submit_bio() → bio enters block layer
  2. merge with adjacent requests in scheduler queue (mq-deadline/BFQ/kyber) → cgroup blkio throttling applied → dispatch to device driver
  3. build DMA scatter-gather list → ring device doorbell → device performs transfer → completion interrupt fires
  4. block layer callback → wake waiting process or update page cache.
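From userspace, write() usually just dirties the page cache; it is fsync() that forces the submit_bio() path and blocks until the device completion comes back. A sketch (durable_write is my name):

```python
import os
import tempfile

def durable_write(data=b"x" * 4096):
    """write() lands in the page cache; fsync() pushes the dirty
    pages through the block layer and waits for completion."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.fsync(fd)  # triggers the block I/O flow above
    finally:
        os.close(fd)
        os.unlink(path)
    return len(data)
```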
10
Q

VFS path walk flow

Freq: 7 - L.Steps: 3, Cost: ~1-3μs (cached) + bio if not

A

VFS path walk flow (thousands/sec)

  1. open("/a/b/c") → start at root dentry (or cwd)
  2. For each level: dcache lookup for "a" → if cache miss: call filesystem’s ->lookup() which reads directory inode from disk → if symlink: follow up to 40 levels → Repeat
    2.1 at each level: check permissions (DAC + LSM/SELinux)
  3. final dentry → allocate struct file → assign fd → return fd to userspace.
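You can mimic the per-component walk from userspace with openat(2) and O_PATH: one dcache lookup plus permission check per level. A sketch (walk_open is my name; assumes an absolute path on Linux):

```python
import os

def walk_open(path):
    """Resolve an absolute path one component at a time with
    openat(2), mirroring the kernel's per-level VFS walk."""
    fd = os.open("/", os.O_PATH)  # start at the root dentry
    try:
        for comp in filter(None, path.split("/")):
            nxt = os.open(comp, os.O_PATH, dir_fd=fd)  # one walk step
            os.close(fd)
            fd = nxt
        return os.fstat(fd).st_ino  # prove we reached the target
    finally:
        os.close(fd)
```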
11
Q

Signal flow

Freq: 8 - Steps: 3, Cost: ~1-2μs (similar to ints; excl. handler)

A

Signal flow (tens to hundreds/sec)

  1. signal generated (kill(), kernel event, hardware exception) → add to target’s pending signal set
  2. if target is blocked/sleeping, wake it → on return to userspace, kernel checks pending signals → save current user context to signal frame on user stack → redirect execution to signal handler in userspace → handler runs
  3. sigreturn() syscall restores original context → process resumes where it was interrupted.
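The whole round trip (pending set → signal frame → handler → sigreturn) is invisible to the caller: by the time kill() returns to userspace the handler has already run. A sketch (names besides the signal API are mine):

```python
import os
import signal

events = []

def handler(signum, frame):
    # Runs on the user stack, on a signal frame the kernel built.
    events.append(signum)

signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)  # step 1: added to pending set
# Delivery happens on the return to userspace; sigreturn() then
# restores the original context and execution continues here.
```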
12
Q

Process lifecycle flow

How/when a new process starts to be executed?

Freq: 9 - Steps: 4, Ctx switch Cost vs thread: ~3-8μs / ~1-3μs

A

Process lifecycle flow (tens to hundreds/sec)

  1. fork() → copy_process(): allocate new task_struct, copy/share file table, signal handlers, namespace refs → duplicate page tables with COW flags (no page copying yet) → assign PID → wake_up_new_task() adds to scheduler run queue
  2. child runs → execve(): load ELF binary, flush old address space, map .text/.data/.bss, set up stack with argv/envp, set entry point
  3. process runs → exit() → do_exit(): release resources (fds, mm, signals) → become zombie (EXIT_ZOMBIE)
  4. parent calls wait() → kernel reaps task_struct → gone.
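All four steps fit in one sketch: fork, execve in the child, zombie reaped by wait in the parent (run_child is my name; exit code 7 is arbitrary):

```python
import os
import sys

def run_child():
    """fork → (child) execve a fresh program image → (parent) wait
    reaps the zombie and recovers the exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: flush the old address space, load a new one.
        os.execv(sys.executable,
                 [sys.executable, "-c", "raise SystemExit(7)"])
    _, status = os.waitpid(pid, 0)  # parent: reap the zombie
    return os.waitstatus_to_exitcode(status)
```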
13
Q

Memory reclaim flow

kswapd reclaims in the background; direct reclaim blocks, so the calling process stalls

Freq: 10 - Steps: 3, Cost: ~10-100μs to ~1-10ms

A

Memory reclaim flow (ideally near zero)

  1. free pages drop below low watermark (vm.min_free_kbytes and vm.watermark_scale_factor) → kswapd wakes and reclaims in background until/unless pages reach high watermark
  2. if kswapd can’t keep up and free pages hit min watermark → direct reclaim starts (blocking, allocating process stalls) → scan LRU / MGLRU (6.1+) lists (active/inactive, file/anon) → balance based on vm.swappiness → evict clean file pages instantly → writeback dirty pages → compress cold pages (zswap) → swap out anonymous pages
  3. if still failing: OOM killer selects and kills largest process → free its pages.

Direct reclaim (clean): ~10-100μs ← process stalls here
Direct reclaim (dirty): ~1-10ms ← disaster territory
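The kswapd-vs-direct split shows up directly in /proc/vmstat: pgscan_kswapd is background work, pgscan_direct is processes stalling. A sketch (assumes a kernel recent enough, roughly 4.8+, to expose these consolidated counters; reclaim_counters is my name):

```python
def reclaim_counters():
    """Reclaim activity since boot, from /proc/vmstat. A rising
    pgscan_direct means allocating processes are stalling."""
    want = ("pgscan_kswapd", "pgscan_direct", "pgsteal_kswapd")
    out = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, val = line.split()
            if key in want:
                out[key] = int(val)
    return out
```

On a healthy box pgscan_direct barely moves between samples; watch its delta, not its absolute value.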

14
Q

Boot flow

Freq: 11 - Steps: 4

A

Boot flow (once)

  1. firmware (UEFI/BIOS) POST → find boot device → load bootloader (GRUB)
  2. GRUB starts default entry → loads kernel image + initramfs into RAM → jump to kernel entry
  3. start_kernel(): initialize memory allocator, page tables, IDT, scheduler, timekeeping → decompress initramfs → mount it as temporary root → init (PID 1) in initramfs loads storage drivers, assembles root filesystem → pivot_root() to real root
  4. exec real init (systemd) → systemd starts targets/services → reach login prompt.
15
Q

Device hotplug flow

Just informational

A

Device hotplug flow (most likely zero)

  1. physical event (USB plug, NVMe insert, network cable) or virtual event (PCI rescan, virtio attach) → hardware signals presence (USB hub interrupt, PCIe hot-plug interrupt, ACPI notification)
  2. kernel bus driver detects new device → enumerate: read device/vendor IDs, capabilities, resource requirements → create struct device, populate sysfs entries under /sys/devices/ → call device model’s bus->match() to find matching driver → call driver’s ->probe(): allocate resources, request IRQs, initialize hardware, register subsystem-specific interfaces (block device, net device, input device) → kernel sends uevent via netlink → udevd receives uevent → evaluates udev rules → creates /dev/ node with correct permissions/ownership/symlinks
  3. triggers any configured actions (mount, network-up, module load).

Removal is the reverse: hardware signals detach → driver’s ->remove() called → tear down interfaces → release resources → remove sysfs entries → udevd cleans up /dev/ nodes.
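The KEY=VALUE pairs udevd matches its rules against are also readable from sysfs: every device directory has a uevent file with the same content the kernel broadcasts over netlink. A sketch reading the one for /dev/null (the path assumes a standard Linux sysfs; parse_uevent is my name):

```python
def parse_uevent(path="/sys/class/mem/null/uevent"):
    """Parse a sysfs uevent file into the KEY=VALUE environment
    udevd sees (MAJOR/MINOR/DEVNAME for a char device like null)."""
    env = {}
    with open(path) as f:
        for line in f:
            key, _, val = line.strip().partition("=")
            env[key] = val
    return env
```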
