Extra: Linux Sys Internals Flows Flashcards

(15 cards)

1
Q

What are the 12 Linux/Systems Internals Flows?

A

The mental model:
- TLB lookups/misses, syscalls, interrupts, and context switches are the heartbeat (always happening).
- Page faults and network/block I/O are the breathing (workload-driven).
- Everything else is “occasional” to “rare”.

From most to least frequent on a busy production server:
- TLB misses — millions/sec to 10s of millions.
- Syscall flow — millions/sec. Every read, write, open, close, epoll_wait. The most frequent transition in the kernel by far.
- Interrupt flow — hundreds of thousands/sec. Timer ticks (250-1000 Hz × cores), NIC interrupts, disk completions, IPIs (inter-processor interrupts). On a busy network box NAPI (one interrupt services a batch of packets) reduces the count, but it stays massive.
- Context switch flow — tens of thousands to hundreds of thousands/sec. Every time a process blocks or its timeslice expires. A busy web server: 50K-200K/sec easily.
- Page fault flow — thousands to tens of thousands/sec. Mostly minor faults (COW, lazy allocation, mmap first-touch). Major faults (actual disk I/O) should be rare on a healthy system — if they’re not, something’s wrong.
- Network packet flow — thousands to millions/sec depending on workload. A 10Gbps NIC doing small packets can hit 14M pps. Each packet traverses the full protocol stack.
- Block I/O flow — hundreds to tens of thousands/sec. SSDs handle ~100K-500K IOPS. Mostly hidden by page cache — on a well-cached system, very few reads actually hit disk.
- VFS path walk + page cache flow — thousands/sec. Every open() walks the path. But dcache makes repeated lookups nearly free (just pointer chasing in memory), so the expensive walks (actual directory reads) are much rarer.
- Signal flow — tens to hundreds/sec normally. SIGCHLD from child exits, SIGALRM from timers, the occasional SIGHUP reload. Spikes during process storms. Low frequency in steady state.
- Process lifecycle flow — tens to hundreds/sec. Each HTTP request in Apache prefork = fork+exit. But most modern servers use persistent processes/threads, so this drops. Container orchestration (K8s) adds some.
- Memory reclaim flow — ideally near zero. kswapd wakes occasionally to maintain free page watermarks. If direct reclaim is firing frequently, you’re in trouble. OOM killer is a once-in-a-crisis event.
- Boot flow — once. Ever. Until the next kernel panic or planned reboot. Months apart in well-run production.

+1 Informational:
- Device hotplug flow — rare in production. Maybe network link flap, USB token, cloud disk attach. Datacenter servers: a few events per day or less. Exception: SR-IOV VF creation in cloud environments.
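Two of these rates (context switches and interrupts) are exposed as monotonic counters in /proc/stat, so you can check them against the ranges above. A minimal sketch (assumes Linux; the function name is mine):

```python
import time

def kernel_event_rates(interval=0.5):
    """Sample /proc/stat twice and return per-second rates for
    context switches ('ctxt') and interrupts ('intr')."""
    def snapshot():
        vals = {}
        with open("/proc/stat") as f:
            for line in f:
                fields = line.split()
                if fields[0] in ("ctxt", "intr"):
                    vals[fields[0]] = int(fields[1])
        return vals
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {k: (after[k] - before[k]) / interval for k in before}
```

On a busy web server you would expect ctxt in the tens of thousands per second; on an idle laptop, hundreds.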

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Costs of kernel transitions

Shortcut to remember: ns → μs → ms. Three 100x cliffs.

A

Orders of magnitude to remember:

TLB miss:        ~10ns
Syscall:         ~100ns      (10x TLB miss)
Interrupt:       ~1-5μs      (10-50x syscall: hardirq + softirq)
Ctx switch:      ~1-5μs      (threads, 10-50x syscall)
                 ~3-8μs      (processes: CR3 load + TLB flush)
Minor fault:     ~1-5μs      (same ballpark as context switch)
Major fault SSD: ~0.1ms      (100x context switch)
Major fault HDD: ~10ms       (100x SSD)

Interview-friendly version:
- TLB misses and syscalls are nanoseconds
- Interrupts, context switches, and minor faults are microseconds
- Major faults are milliseconds.

Six orders of magnitude (10ns → 10ms) separate the cheapest from the most expensive.

The KPTI tax:
- Before Meltdown (2018), a trivial syscall was ~50ns. KPTI added a page table switch on every kernel entry/exit, roughly doubling syscall cost.
- Spectre mitigations (retpolines, IBRS) added more.
- Modern syscalls are ~2-3x more expensive than pre-2018. This is why vDSO matters — gettimeofday() and clock_gettime() never enter the kernel at all, they read a shared page mapped into userspace.
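You can feel the difference from userspace: os.getpid() enters the kernel on every call (glibc ≥ 2.25 no longer caches it), while time.monotonic() is served by the vDSO and never traps to ring 0. A rough microbenchmark sketch (interpreter overhead dominates, so treat the numbers as upper bounds; per_call_ns is my name):

```python
import os
import time

def per_call_ns(fn, n=100_000):
    """Average wall-clock cost of fn() in nanoseconds.
    Includes Python call overhead, so it's an upper bound."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - t0) / n

# os.getpid: a real syscall on each call.
# time.monotonic: vDSO-backed clock_gettime, stays in ring 3.
```

Run per_call_ns(os.getpid) and per_call_ns(time.monotonic) side by side to see the mode-switch tax.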

3
Q

TLB miss + Page table walk flow

Freq: 0 - Cost: ~10ns (10 cycles @ 1GHz) + minor/major fault cost on #PF

A

TLB misses: millions to tens of millions/sec (10-100x more frequent than syscalls)

The page table is a multi-level lookup tree, much like a filesystem walking indirect blocks. Before the full walk, the hardware checks the Page Walk Cache (PWC), which caches upper-level entries (consulted before L1/L2/L3/RAM).

CPU executes instruction that accesses virtual address:

  1. check TLB (associative hardware lookup, ~1ns) → HIT: physical address returned, done
  2. MISS: hardware page table walker activates
    → read CR3 (PGD physical address)
    → fetch PGD entry (memory access ~3-5ns, likely in L1/L2 cache)
    → fetch PUD entry (memory access)
    → fetch PMD entry (memory access)
    → fetch PTE entry (memory access)
    → PTE valid + permissions OK:
    populate TLB with new entry
    retry instruction (now hits)
    total: ~10-50ns
  3. IF PTE not present / perms violation:
    → PTE not present:
    CPU raises Exception #PF → kernel → page fault flow
    → PTE permission violation:
    CPU raises Exception #PF → kernel → SIGSEGV or CoW

Huge pages shortcuts the walk:
- 4KB pages: CR3 → PGD → PUD → PMD → PTE (4 levels)
- 2MB pages: CR3 → PGD → PUD → PMD (3 levels, PMD → 2MB frame)
- 1GB pages: CR3 → PGD → PUD (2 levels)

Each individual walk is fast, but millions of them per second add up. That’s why hugepages (2MB) matter for DBs: fewer walk levels and far better TLB coverage.
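The 4-level walk is just bit-slicing the virtual address: 9 bits per level plus a 12-bit page offset. A toy sketch of how the hardware walker indexes each table (x86-64, 48-bit addresses; the function name is mine):

```python
def pt_indices(vaddr):
    """Split a 48-bit x86-64 virtual address into its four page
    table indices (9 bits each) and the 12-bit page offset."""
    return {
        "pgd": (vaddr >> 39) & 0x1FF,  # bits 47-39
        "pud": (vaddr >> 30) & 0x1FF,  # bits 38-30
        "pmd": (vaddr >> 21) & 0x1FF,  # bits 29-21
        "pte": (vaddr >> 12) & 0x1FF,  # bits 20-12
        "offset": vaddr & 0xFFF,       # bits 11-0
    }
```

A 2MB huge page simply stops the walk after the PMD index and uses bits 20-0 as the in-page offset.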

4
Q

Syscall (mode switch) flow

Frequency: 1 - Logical steps: 4, Cost: ~100ns (10x TLB miss)

A

Syscall (mode switch) flow (millions/s):

  1. userspace calls glibc wrapper → SYSCALL instruction
  2. CPU switches to ring 0, saves user registers to pt_regs, kernel entry trampoline (KPTI)
  3. syscall table lookup → handler executes → return value in RAX
  4. SYSRET instruction → back to ring 3 userspace.
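The glibc wrapper is just convenience; you can take the same SYSCALL path through the generic syscall(2) entry point. A sketch via ctypes (SYS_write = 1 is an x86-64 assumption, other architectures differ; raw_write is my name):

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
SYS_write = 1  # x86-64 syscall number (arch-specific assumption)

def raw_write(fd, data):
    """write(2) via syscall(2), bypassing the glibc write() wrapper.
    Same ring 0 transition, same handler, same return value in RAX."""
    n = libc.syscall(SYS_write, fd, data, len(data))
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return n
```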
5
Q

Interrupt flow

Frequency: 2 - Logical steps: 3, Cost: ~1-5μs (10-50x syscall)

A

Interrupt flow (hundreds of thousands/s)

  1. device asserts IRQ → CPU pauses current execution, saves state
  2. IDT lookup → hardirq handler runs with interrupts disabled → ACK to interrupt controller (APIC) → raise softirq if deferred work needed → hardirq returns
  3. softirq runs (ksoftirqd if too much) → deferred work completes → back to interrupted process unless resched.
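The softirq half of this flow is visible in /proc/softirqs (one column per CPU). A small parser sketch (assumes Linux; softirq_totals is my name):

```python
def softirq_totals():
    """Sum the per-CPU counts in /proc/softirqs into one total
    per softirq type (TIMER, NET_RX, BLOCK, RCU, ...)."""
    totals = {}
    with open("/proc/softirqs") as f:
        next(f)  # skip the CPU header row
        for line in f:
            name, *counts = line.split()
            totals[name.rstrip(":")] = sum(int(c) for c in counts)
    return totals
```

NET_RX dominating means the box is doing packet work; TIMER and RCU are the background heartbeat.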
6
Q

Context switch flow

Frequency: 3 - Logical steps: 3, Cost: ~1-5μs (10-50x syscall)

A

Context switch flow (tens of thousands to hundreds of thousands/sec)

  1. timer tick or voluntary sleep → CFS/EEVDF scheduler picks next task
  2. context_switch() → save kernel registers of current → if threads, no address space switch → else switch address spaces: load new PGD into CR3 (TLB flush, unless PCID tags let the CPU skip it) → restore kernel registers of next
  3. new task resumes in kernel (all switches happen during kernel mode: syscall, tick preemption, etc.) → eventually returns to resumed task userspace.
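getrusage() splits the two triggers apart: ru_nvcsw counts voluntary switches (the task blocked) and ru_nivcsw counts involuntary ones (timeslice expired, preempted). A sketch (Linux; names other than the rusage fields are mine):

```python
import resource
import time

def ctx_switch_counts():
    """(voluntary, involuntary) context switches for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw

# Blocking in sleep() forces a voluntary switch: the task leaves
# the run queue and the scheduler picks someone else.
v0, _ = ctx_switch_counts()
time.sleep(0.01)
v1, _ = ctx_switch_counts()
```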
7
Q

Page fault flow

Freq: 4 - Logical steps: 3, Cost: ~1-5μs (minor); ~100μs to 10ms (major)

A

Page fault flow (thousands to tens of thousands/sec)

  1. access to unmapped/protected virtual address → CPU raises #PF exception → kernel reads CR2 (faulting address) → handle_mm_fault()
  2. walk page tables → determine type
    2.1 minor (allocate frame, zero-fill or CoW copy, update PTE) or
    2.2 major (page not in RAM → block I/O to read from disk/swap)
  3. iret back to faulting instruction which now succeeds.
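Minor faults are easy to provoke: anonymous mmap memory has no frames behind it until first touch. A sketch counting them via ru_minflt (touch_pages is my name; the count also includes whatever else the interpreter faults in, hence the loose lower bound):

```python
import mmap
import resource

PAGE = 4096

def touch_pages(num_pages=256):
    """mmap anonymous memory and write one byte per page; each first
    touch is a minor fault (zero-fill on demand, no disk I/O)."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf = mmap.mmap(-1, num_pages * PAGE)
    for i in range(num_pages):
        buf[i * PAGE] = 1
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf.close()
    return after - before
```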
8
Q

Network packet flow

Freq: 5 - Logical steps: 4, Cost: ~5-15μs (NIC to buffer)

A

Network packet receive flow (thousands to millions/sec)

  1. packet arrives at NIC → NIC DMAs into ring buffer in RAM → raises hardirq →
  2. driver ACKs and schedules NAPI → softirq calls NAPI poll → driver pulls packets from ring buffer in batches (interrupts disabled, polling mode) → GRO aggregation (merges small pkts)
  3. netfilter/iptables rules → IP routing decision → TCP state machine: sequence validation, ACK processing, congestion window update → data appended to socket receive buffer
  4. wake process blocked in recv()/epoll.
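A loopback datagram exercises steps 3-4 of this flow: no physical NIC, DMA, or hardirq, but the packet still traverses softirq, IP, UDP, and the socket receive buffer before waking the receiver. A sketch (loopback_roundtrip is my name):

```python
import socket

def loopback_roundtrip(payload=b"ping"):
    """Send a UDP datagram to ourselves over loopback; the kernel
    runs the stack (softirq → IP → UDP → socket buffer) and wakes
    the blocked recvfrom()."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # kernel picks a free port
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(payload, rx.getsockname())
    data, _ = rx.recvfrom(1024)
    tx.close()
    rx.close()
    return data
```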
9
Q

Block I/O flow

Freq: 6 - L.Steps: 4, Cost: ~3-10μs (doorbell) + 50μs-15ms (device)

A

Block I/O flow (hundreds to tens of thousands/sec)

  1. filesystem or page cache calls submit_bio() → bio enters block layer
  2. merge with adjacent requests in scheduler queue (mq-deadline/BFQ/kyber) → cgroup blkio throttling applied → dispatch to device driver
  3. build DMA scatter-gather list → ring device doorbell → device performs transfer → completion interrupt fires
  4. block layer callback → wake waiting process or update page cache.
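From userspace, write() usually just dirties the page cache; it is fsync() that forces the submit_bio() path and blocks until the device completion comes back. A sketch (durable_write is my name):

```python
import os
import tempfile

def durable_write(data=b"x" * 4096):
    """write() lands in the page cache; fsync() pushes the dirty
    pages through the block layer and waits for completion."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.fsync(fd)  # triggers the block I/O flow above
    finally:
        os.close(fd)
        os.unlink(path)
    return len(data)
```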
10
Q

VFS path walk flow

Freq: 7 - L.Steps: 3, Cost: ~1-3μs (cached) + bio if not

A

VFS path walk flow (thousands/sec)

  1. open("/a/b/c") → start at root dentry (or cwd)
  2. For each level: dcache lookup for "a" → if cache miss: call filesystem’s ->lookup() which reads directory inode from disk → if symlink: follow up to 40 levels → Repeat
    2.1 at each level: check permissions (DAC + LSM/SELinux)
  3. final dentry → allocate struct file → assign fd → return fd to userspace.
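You can mimic the per-component walk from userspace with openat(2) and O_PATH: one dcache lookup plus permission check per level. A sketch (walk_open is my name; assumes an absolute path on Linux):

```python
import os

def walk_open(path):
    """Resolve an absolute path one component at a time with
    openat(2), mirroring the kernel's per-level VFS walk."""
    fd = os.open("/", os.O_PATH)  # start at the root dentry
    try:
        for comp in filter(None, path.split("/")):
            nxt = os.open(comp, os.O_PATH, dir_fd=fd)  # one walk step
            os.close(fd)
            fd = nxt
        return os.fstat(fd).st_ino  # prove we reached the target
    finally:
        os.close(fd)
```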
11
Q

Signal flow

Freq: 8 - Steps: 3, Cost: ~1-2μs (similar to ints; excl. handler)

A

Signal flow (tens to hundreds/sec)

  1. signal generated (kill(), kernel event, hardware exception) → add to target’s pending signal set
  2. if target is blocked/sleeping, wake it → on return to userspace, kernel checks pending signals → save current user context to signal frame on user stack → redirect execution to signal handler in userspace → handler runs
  3. sigreturn() syscall restores original context → process resumes where it was interrupted.
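The whole round trip (pending set → signal frame → handler → sigreturn) is invisible to the caller: by the time kill() returns to userspace the handler has already run. A sketch (names besides the signal API are mine):

```python
import os
import signal

events = []

def handler(signum, frame):
    # Runs on the user stack, on a signal frame the kernel built.
    events.append(signum)

signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)  # step 1: added to pending set
# Delivery happens on the return to userspace; sigreturn() then
# restores the original context and execution continues here.
```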
12
Q

Process lifecycle flow

How/when a new process starts to be executed?

Freq: 9 - Steps: 4, Ctx switch Cost vs thread: ~3-8μs / ~1-3μs

A

Process lifecycle flow (tens to hundreds/sec)

  1. fork() → copy_process(): allocate new task_struct, copy/share file table, signal handlers, namespace refs → duplicate page tables with COW flags (no page copying yet) → assign PID → wake_up_new_task() adds to scheduler run queue
  2. child runs → execve(): load ELF binary, flush old address space, map .text/.data/.bss, set up stack with argv/envp, set entry point
  3. process runs → exit() → do_exit(): release resources (fds, mm, signals) → become zombie (EXIT_ZOMBIE)
  4. parent calls wait() → kernel reaps task_struct → gone.
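All four steps fit in one sketch: fork, execve in the child, zombie reaped by wait in the parent (run_child is my name; exit code 7 is arbitrary):

```python
import os
import sys

def run_child():
    """fork → (child) execve a fresh program image → (parent) wait
    reaps the zombie and recovers the exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: flush the old address space, load a new one.
        os.execv(sys.executable,
                 [sys.executable, "-c", "raise SystemExit(7)"])
    _, status = os.waitpid(pid, 0)  # parent: reap the zombie
    return os.waitstatus_to_exitcode(status)
```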
13
Q

Memory reclaim flow

kswapd reclaims in the background; direct reclaim blocks, so the calling process stalls

Freq: 10 - Steps: 3, Cost: ~10-100μs to ~1-10ms

A

Memory reclaim flow (ideally near zero)

  1. free pages drop below low watermark (vm.min_free_kbytes and vm.watermark_scale_factor) → kswapd wakes and reclaims in background until/unless pages reach high watermark
  2. if kswapd can’t keep up and free pages hit min watermark → direct reclaim starts (blocking, allocating process stalls) → scan LRU / MGLRU (6.1+) lists (active/inactive, file/anon) → balance based on vm.swappiness → evict clean file pages instantly → writeback dirty pages → compress cold pages (zswap) → swap out anonymous pages
  3. if still failing: OOM killer selects and kills largest process → free its pages.

Direct reclaim (clean): ~10-100μs ← process stalls here
Direct reclaim (dirty): ~1-10ms ← disaster territory
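The kswapd-vs-direct split shows up directly in /proc/vmstat: pgscan_kswapd is background work, pgscan_direct is processes stalling. A sketch (assumes a kernel recent enough, roughly 4.8+, to expose these consolidated counters; reclaim_counters is my name):

```python
def reclaim_counters():
    """Reclaim activity since boot, from /proc/vmstat. A rising
    pgscan_direct means allocating processes are stalling."""
    want = ("pgscan_kswapd", "pgscan_direct", "pgsteal_kswapd")
    out = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, val = line.split()
            if key in want:
                out[key] = int(val)
    return out
```

On a healthy box pgscan_direct barely moves between samples; watch its delta, not its absolute value.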

14
Q

Boot flow

Freq: 11 - Steps: 4

A

Boot flow (once)

  1. firmware (UEFI/BIOS) POST → find boot device → load bootloader (GRUB)
  2. GRUB starts default entry → loads kernel image + initramfs into RAM → jump to kernel entry
  3. start_kernel(): initialize memory allocator, page tables, IDT, scheduler, timekeeping → decompress initramfs → mount it as temporary root → init (PID 1) in initramfs loads storage drivers, assembles root filesystem → pivot_root() to real root
  4. exec real init (systemd) → systemd starts targets/services → reach login prompt.
15
Q

Device hotplug flow

Just informational

A

Device hotplug flow (most likely zero)

  1. physical event (USB plug, NVMe insert, network cable) or virtual event (PCI rescan, virtio attach) → hardware signals presence (USB hub interrupt, PCIe hot-plug interrupt, ACPI notification)
  2. kernel bus driver detects new device → enumerate: read device/vendor IDs, capabilities, resource requirements → create struct device, populate sysfs entries under /sys/devices/ → call device model’s bus->match() to find matching driver → call driver’s ->probe(): allocate resources, request IRQs, initialize hardware, register subsystem-specific interfaces (block device, net device, input device) → kernel sends uevent via netlink → udevd receives uevent → evaluates udev rules → creates /dev/ node with correct permissions/ownership/symlinks
  3. triggers any configured actions (mount, network-up, module load).

Removal is the reverse: hardware signals detach → driver’s ->remove() called → tear down interfaces → release resources → remove sysfs entries → udevd cleans up /dev/ nodes.
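The KEY=VALUE pairs udevd matches its rules against are also readable from sysfs: every device directory has a uevent file with the same content the kernel broadcasts over netlink. A sketch reading the one for /dev/null (the path assumes a standard Linux sysfs; parse_uevent is my name):

```python
def parse_uevent(path="/sys/class/mem/null/uevent"):
    """Parse a sysfs uevent file into the KEY=VALUE environment
    udevd sees (MAJOR/MINOR/DEVNAME for a char device like null)."""
    env = {}
    with open(path) as f:
        for line in f:
            key, _, val = line.strip().partition("=")
            env[key] = val
    return env
```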
