Processes_and_Threads_ch24-37 Flashcards

(33 cards)

1
Q

What’s the difference between a process and a thread in Linux? How are threads implemented under the hood?

clone() vs fork()

(Ch 29-33)

A

A process is an instance of an executing program with its own virtual address space, file descriptors, and kernel data structures.
A thread is a unit of execution within a process that shares the address space, file descriptors, and signal handlers with other threads.

What is NOT shared: Stacks, registers, TLS (Thread Local Storage).

clone() vs fork(): Same Address Space - Under the hood, Linux implements threads using the clone() syscall with specific flags (CLONE_VM, CLONE_FILES, CLONE_FS, CLONE_SIGHAND).

Both processes and threads are represented by task_struct in the kernel - threads are essentially processes that share resources. The key difference: fork() creates new address space; clone() with thread flags shares it.

2
Q

What’s the difference between fork(), vfork(), and clone()? How are they related?

A

In glibc, fork() and vfork() are thin wrappers around the same clone machinery; on modern kernels all three paths enter the kernel via the clone()/clone3() syscall.

clone() - Fine-grained control over what’s shared (memory, file descriptors, signal handlers, etc.). Use for: creating threads (share everything) or containers (new namespaces). flags parameter controls sharing: CLONE_VM (memory), CLONE_FILES (fd table), CLONE_SIGHAND (signal handlers).

fork() - Creates child process with copy of parent’s address space (copy-on-write). Child gets own memory, file descriptors (duplicated), etc. Use for: general process creation.

vfork() - Historical optimization for the fork+exec pattern, avoiding COW setup overhead. Creates a child that shares the parent’s address space until it calls exec() or _exit(); the parent blocks until then. Dangerous if the child modifies memory. Still relevant today? Mostly no — COW fork is fast enough; occasionally used on memory-constrained embedded systems.

3
Q

What are the key signals every systems programmer should know? Describe their default actions.

8, Catchable / Non-Catchable

(Ch 20, 22, 26)

A

Non-Catchable:
SIGKILL (9): Forceful termination.
SIGSTOP (19): Stop process.

Catchable:
SIGTERM (15): Termination request - allows cleanup.
SIGHUP (1): Hangup - daemon reload.
SIGTSTP (20): Terminal stop (Ctrl+Z).
SIGCONT (18): Resume stopped process.
SIGINT (2) - Ctrl+C
SIGCHLD (17): Child exited/stopped. Default: ignore.

SIGSEGV (11): Segmentation fault. Default: terminate + core.
SIGALRM (14): Timer expired.
SIGPIPE (13): Write to broken pipe. Default: terminate.
SIGQUIT (3): Quit with core dump (Ctrl+\).

4
Q

What is a zombie process? How does one get created, and how is it cleaned up?

“buggy” parent

(Ch 2, 6, 24)

A

A zombie is a process that has terminated but whose parent hasn’t yet called wait() to collect its exit status.

Created when: Child calls exit() or is killed, parent does not call wait(), kernel retains minimal info (PID, exit status, resource usage) in process table. The process releases memory, file descriptors, etc., but entry remains.

If parent ignores SIGCHLD or sets SA_NOCLDWAIT, zombies auto-reaped by kernel.

If the parent dies first, init (PID 1) adopts the orphans and reaps them. Otherwise zombies accumulate, consuming process-table slots (but minimal other resources).

5
Q

What is the run queue in the Linux scheduler? How does scheduler latency work?

cache locality?

(Ch 35)

A

Run queue: per-CPU data structure holding runnable tasks. Each CPU has its own — improves scalability, cache locality.

EEVDF (Linux 6.6+): picks earliest deadline among eligible tasks providing better latency with no tuning. CFS: red-black tree ordered by vruntime requiring tuning.

Run queue length (nr_running): number of runnable tasks on that CPU. Related to but not the same as load average — load average also includes D-state (uninterruptible sleep) tasks.

Scheduler latency: the period over which all runnable tasks should run at least once. EEVDF: latency emerges from deadline calculation — short-burst tasks get earlier deadlines, no explicit latency tunable needed.

Load balancing: kernel periodically migrates tasks between CPU run queues. Considers migration cost (cache warmth), NUMA topology, and cgroup affinity.

6
Q

What is a daemon process? How do you create one?

(Ch 37)

A

Daemon: background process with no controlling terminal, usually long-lived.

Creation steps:
(1) fork();
(2) parent exits — child is adopted by init (PID 1);
(3) setsid() — become session leader, detach from the controlling terminal;
(4) Close all open fds (or redirect stdio to /dev/null);
(5) Set the umask to 0 (or from config);
(6) chdir("/") — don’t keep a mounted directory busy;
(7) Set up signal handlers.

Examples: httpd, sshd, cron.

Today: systemd manages the daemon lifecycle with cgroups, signals, socket activation (fd held open for fast startup), and D-Bus.

7
Q

What is the default Linux scheduler? How does it work and which parameters does it take into account?

(Ch 35)

A

Linux CPU scheduler: EEVDF (6.6+, replaced CFS).

Goal: Given N processes, each gets 100%/N over time.
Key concepts:

(1) No fixed time slices: calculated dynamically based on number of runnable processes.
(2) Lag tracking: how far behind/ahead of fair share each task is (evolved from vruntime).
(3) Virtual deadline: each task gets a deadline for its current slice. Short-burst tasks get earlier deadlines → better latency without heuristics.
(4) Pick algorithm: among eligible tasks (lag ≥ 0), choose earliest deadline.
(5) Nice values: affect weight which affects lag accumulation and deadline calculation. Lower nice = more weight = longer slices.
(6) Sleeper fairness: emerges naturally from lag — sleeping task falls behind, becomes eligible immediately on wake.

Highest: SCHED_DEADLINE (dedicated runqueue)
Priority 0-99: RT runqueue (SCHED_FIFO, SCHED_RR)
Priority 100-139: EEVDF runqueue (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE)

CFS (2.6.23-6.5) used vruntime + red-black tree. EEVDF replaced it with lag + deadline, eliminating CFS’s latency heuristic hacks.

8
Q

What are other scheduler policies besides EEVDF (NORMAL/BATCH)? When would you use them?

highest, 0-99, 100-139

(Ch 33, 35)

A

Use chrt command or sched_setscheduler() syscall to set per-task scheduling policy.

Runqueues (kernel internal priority, lower = higher priority):
Highest prio: SCHED_DEADLINE (separate runqueue)
Priority 0-99: RT runqueue (SCHED_FIFO, SCHED_RR)
Priority 100-139: EEVDF runqueue (SCHED_OTHER/SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE)

Details in Priority order (highest first):
SCHED_DEADLINE: Earliest deadline first. Specify runtime, deadline, period. Kernel guarantees CPU time. Highest priority in the system — preempts even RT tasks. For: hard real-time with deadline guarantees.

SCHED_FIFO: Real-time, first-in-first-out. Process runs until it blocks or higher priority arrives. For: deterministic latency. Risk: can starve everything below it.

SCHED_RR: Real-time, round-robin. Like FIFO but time-sliced among same-priority tasks.

SCHED_NORMAL: Default. Managed by EEVDF (6.6+, was CFS). Fair share scheduling.

SCHED_BATCH: Same class as OTHER but no preemption bonus. Kernel treats as non-interactive. For: CPU-intensive batch jobs.

SCHED_IDLE: Lowest priority, only runs when nothing else needs CPU. For: background tasks that shouldn’t impact system.

SCHED_OTHER vs SCHED_NORMAL: They’re the same thing; OTHER is POSIX name, NORMAL is Linux kernel name.

9
Q

Why might your service stutter with 50% CPU free?

which type of quota

(Ch 5, 35)

A

CPU cgroups bandwidth throttling:
Cgroup has quota (allowed CPU time per period). If task uses quota before period ends, it’s throttled until next period — even if CPU is idle system-wide.

Why stutter with CPU free:
(1) Container has low quota — uses allowance in burst, then waits.
(2) System-wide CPU free but YOUR cgroup is throttled.
(3) Bursty workload exhausts quota early in period.
(4) Multi-threaded: quota shared across all threads in cgroup.

Diagnosis: cpu.stat → nr_throttled, throttled_time.

Solutions: increase quota, use proportional weight instead (v1: cpu.shares, v2: cpu.weight), spread work evenly.

Config: cgroup v1: cpu.cfs_quota_us + cpu.cfs_period_us. Cgroup v2: cpu.max (“quota period” in one file). “CFS bandwidth throttling” — the name stuck even with EEVDF. Kernel config is still CONFIG_CFS_BANDWIDTH.

10
Q

What is the page cache? Cover dirty pages, writeback, vm.dirty_* tunables, and how to manage it with drop_caches.

A

Page cache is kernel’s in-memory cache for file data.

Read: check cache first → cache hit returns immediately; cache miss reads from disk and caches. Write: data written to cache (page marked “dirty”), actual disk write deferred.

Dirty pages: modified cache pages not yet written to disk. Writeback: kernel writeback threads (kworker) write dirty pages to disk. sync() system-wide, fsync() per-file, fdatasync() data + necessary metadata only.

Tunables:
- vm.dirty_background_ratio (default 10%) — % of available memory; when dirty pages exceed this, background writeback starts. Non-blocking.
- vm.dirty_ratio (default 20%) — % of available memory; when dirty pages exceed this, writing processes block and do synchronous writeback. You feel this as latency.
- vm.dirty_expire_centisecs (default 3000 = 30s) — max age before page is eligible for writeback.
- vm.dirty_writeback_centisecs (default 500 = 5s) — how often writeback thread wakes to check.

drop_caches: echo 1 > /proc/sys/vm/drop_caches (page cache), echo 2 (dentries/inodes), echo 3 (both).

Non-destructive: only drops clean pages. Dirty pages must be flushed first (sync).

Useful for benchmarking (cold cache) and freeing memory in emergencies. Not for routine use — kernel manages cache efficiently.

11
Q

What happens at the kernel level when a process receives SIGTERM vs SIGKILL?

TERM: what if no handler?

(Ch 20)

A

SIGTERM (15): Kernel marks signal pending. When process returns to userspace, kernel checks for handlers. Process can catch, ignore, or block SIGTERM.
If handler installed → handler runs, process can clean up and exit gracefully (or ignore).
If default (no handler) → process terminated by kernel.

SIGKILL (9): Kernel marks signal pending. Kernel IMMEDIATELY starts termination - no handler check. Process state set to TASK_DEAD. All threads terminated. Memory freed, file descriptors closed. Exit status reflects killed-by-signal. Cannot be caught, ignored, or blocked - enforced unconditionally by kernel.

12
Q

Explain virtual memory in Linux: page tables, MMU, TLB – what exactly occurs on a page fault (minor vs. major)?

Huge pages?

(Ch 6, 48)

A

Page tables: hierarchical radix-tree (trie) structure mapping virtual→physical addresses.

Modern x86-64 uses 4-5 levels — conceptually similar to a filesystem’s indirect blocks. Each entry holds a physical frame number + flags (present, writable, user-accessible, dirty, accessed). 5-level paging (57-bit, 128 PB) is available since kernel 4.14.

MMU: per-core hardware that walks the page tables on every memory access. Includes the TLB: a per-core cache of recent virtual→physical translations. On a hit (1-2 cycles) no page table walk is needed.

A page fault occurs when: the page is not present (must be loaded), or on a permission violation (write to a read-only page, NX violation, userspace access to kernel ranges).

Minor fault (~1µs): page is in memory but not mapped (e.g., new stack page, COW copy) → (copy memory,) update the page table; no disk I/O.
Major fault (~10ms on HDD, far less on SSD): page must be read from disk (swap or file).

Minor fault examples:
- Lazy allocation (first touch): Allocate physical frame, zero it, update page table
- COW (after fork): Copy page, map private copy, update page table
- Page in page cache but not mapped: Just update page table entry (e.g., mmap’d file already cached)

The kernel updates the page table entry (PTE), then flushes the TLB entry for that address. The instruction is then re-executed.

Example: RSP (the stack pointer) moves into the next virtual stack page that has no physical page behind it → minor fault allocates a new page.

13
Q

How does memory reclaim work in Linux? What triggers reclamation?

(add)

A

Memory reclaim frees pages when memory is low.

Triggers:
(1) Direct reclaim: allocation fails, calling process must free pages.
(2) kswapd: background daemon wakes when free memory < low watermark and targets cache pages.
(3) OOM killer: last resort when reclaim fails.

Reclaim process:
(1) Scan LRU lists (active/inactive for anon and file pages).
(2) File-backed clean pages: drop immediately (can re-read).
(3) File-backed dirty pages: write back, then drop.
(4) Anonymous pages: write to swap, then free.
(5) Slab cache shrinking.

Tunables:
- vm.swappiness (0-200 since kernel 5.8, default 60) - bias toward swapping anonymous pages vs dropping file-backed cache; values above 100 actively prefer swap, useful with fast swap (zram/zswap).
- vm.vfs_cache_pressure - slab reclaim aggressiveness.

14
Q

Explain Linux job control: SIGTSTP/SIGSTOP/SIGCONT, process groups, sessions, controlling terminal, and orphaned process groups.

A

SIGTSTP (20): Terminal stop signal - sent by Ctrl+Z. Can be caught/ignored. Default: stop process. Allows cleanup before stopping.

SIGSTOP (19): Unconditional stop - cannot be caught/ignored/blocked. Used programmatically.

SIGCONT (18): Resume stopped process.

Job control:
(1) Shell puts background jobs in separate process groups.
(2) Ctrl+Z sends SIGTSTP to foreground process group.
(3) Process stops, shell regains control.
(4) ‘fg’ command: shell sends SIGCONT, moves job to foreground (tcsetpgrp).
(5) ‘bg’ command: shell sends SIGCONT, job runs in background. Session leader (shell) manages foreground process group.

Process groups: Set of related processes (e.g., pipeline). Each has a PGID. Sessions: Set of process groups. Created by setsid(). Session leader = controlling process. Controlling terminal: /dev/tty. Ctrl-C sends SIGINT to foreground process group. Terminal hangup sends SIGHUP to session leader, which conventionally forwards to all children.

Orphaned process groups: A process group becomes orphaned when its session leader (or the last process with a parent outside the group) exits. Kernel sends SIGHUP then SIGCONT to stopped members of orphaned groups — SIGHUP to notify, SIGCONT to wake them so they can handle it. Without this, stopped orphans would be stuck forever.

15
Q

What is eventfd? When would you use it?

How much data for the counter?

(Ch 24, 27, 33)

A

eventfd creates fd for event notification between threads/processes.

eventfd vs pipe/FIFO:
- 8-byte counter — that’s it (no data).
- Single fd (both read and write).
- Used for: lightweight notification/wakeup between threads.

Use cases:
- (1) Thread synchronization: one thread signals another via fd, integrates with epoll.
- (2) Parent-child notification: share fd across fork().
- (3) Event loop integration: wake up epoll from another thread.
- (4) User-space semaphore: with EFD_SEMAPHORE flag.

Flags: EFD_NONBLOCK, EFD_CLOEXEC, EFD_SEMAPHORE. Lighter weight than pipe() for simple notification. Combined with epoll, enables unified event-driven architecture.

16
Q

What is the shebang (#!) line and how does the kernel handle it?

(Ch 2, 27)

A

Shebang is the #! at the start of a script file, specifying the interpreter.

The script file itself must have execute permission.

When execve() encounters a file starting with #!, the kernel:
(1) Parses the interpreter path (e.g., #!/bin/bash or #!/usr/bin/env python).
(2) Executes the interpreter with the script path as argument.
(3) Optional arguments after interpreter path are passed too.

Example: #!/usr/bin/awk -f.

Maximum shebang line length is typically 127-255 bytes. Using /usr/bin/env allows finding interpreter via PATH.

17
Q

What are mutexes? How do pthread mutexes work?

mutual exclusion

(Ch 3, 30)

A

Mutex (mutual exclusion): lock ensuring only one thread accesses critical section at a time.

Implementation: typically uses futex syscall - fast path in userspace, syscall only on contention.

Rules: only owner should unlock, don’t lock twice (deadlock), always unlock.

Operations:
pthread_mutex_init() or PTHREAD_MUTEX_INITIALIZER. pthread_mutex_lock() - acquire (blocks if held). pthread_mutex_trylock() - non-blocking attempt. pthread_mutex_unlock() - release. pthread_mutex_destroy() - cleanup.

Mutex types:
PTHREAD_MUTEX_NORMAL (default),
PTHREAD_MUTEX_ERRORCHECK (detects double-lock),
PTHREAD_MUTEX_RECURSIVE (allows re-locking by owner).

18
Q

What is PSI (Pressure Stall Information)? How do you use it to detect resource pressure?

Linux 4.20+; “stall time”

(add)

A

PSI provides real-time metrics showing how long processes wait for resources.

PSI — measures stall time (how long tasks are waiting), not what caused the wait.

Categories: cpu, memory, io. Levels: ‘some’ = at least one task stalled; ‘full’ = all tasks stalled.

Files: /proc/pressure/cpu, /proc/pressure/memory, /proc/pressure/io.

Format: ‘avg10=X avg60=Y avg300=Z total=T’ - percentages over 10s, 60s, 5min windows.

Use cases:
(1) Autoscaling triggers.
(2) Identify I/O bottlenecks.
(3) systemd-oomd uses PSI to anticipate OOM, kicking-in on memory/swap pressure.

Example: memory “some avg10=25.00”
means tasks waited 25% of last 10 seconds.

Can set up poll() triggers for threshold alerts.

19
Q

What is a deadlock? How do you prevent it?

(Ch 2, 30)

A

Deadlock: two or more threads blocked forever, each waiting for resource held by another.

Four conditions (all required):
(1) Mutual exclusion - resource held exclusively.
(2) Hold and wait - hold one resource while waiting for another.
(3) No preemption - can’t force release.
(4) Circular wait - A waits for B, B waits for A.

Detection: tools like helgrind (Valgrind), lockdep (kernel). Best practice: minimize lock scope, avoid holding locks across calls to unknown code.

Prevention strategies:
(1) Lock ordering - always acquire locks in same global order.
(2) Lock timeout - use pthread_mutex_timedlock(), back off on failure.
(3) Try-lock - pthread_mutex_trylock(), release all and retry if fails.
(4) Avoid nested locks when possible.

20
Q

Walk through what happens when you type ‘ls’ in a shell – from Enter to output appearing.

opendir() -> getdents()

(Ch 2, 3, 4)

A

1) Shell reads input, parses ‘ls’, searches PATH for executable.
2) fork(): shell creates child process (copy-on-write address space).
3) In child: execve(‘/bin/ls’, [‘ls’], envp) - replaces process image with ls binary.
4) Kernel loads ELF binary, sets up memory segments, dynamic linker loads libc.
5) ls runs: opendir() → getdents() syscall reads directory entries.
6) ls sorts entries, formats output.
7) write() syscalls send output to stdout (fd 1).
8) Terminal driver receives data, displays on screen.
9) ls calls exit(0).
10) Parent shell’s wait() returns, shell displays next prompt.

21
Q

Compare SSD vs HDD from an operating system perspective. How does Linux I/O handle them differently?

(add)

A

Physical differences: HDD has spinning platters + seek time; SSD has flash memory, no seek penalty.

Random access: HDD slow (seek + rotational latency ~10ms); SSD fast (~0.1ms). Sequential: both fast, HDD competitive.

Linux differences:
(1) Read-ahead: less beneficial for SSD.
(2) Swappiness: SSDs handle random access well, can use swap more.
(3) I/O schedulers: HDD uses mq-deadline (batches seeks); SSD uses none/mq-deadline (no seek optimization needed).
(4) TRIM/discard: SSDs need garbage collection to reclaim partially deleted erase blocks (erase units are large, typically hundreds of KB to a few MB); TRIM marks blocks unused for GC and wear leveling. Mount option ‘discard’ or periodic fstrim.
(5) Wear leveling: SSD concern, but managed by firmware. Check scheduler: cat /sys/block/sda/queue/scheduler.

22
Q

What is syslog? How do daemons log messages? To where?

(Ch 37)

A

syslog is the standard logging system for Unix/Linux daemons (which have no terminal to write to).

Daemons write to the Unix domain socket (AF_UNIX) /dev/log; the syslogd/rsyslogd daemon receives the messages (a local, kernel-assisted mem-to-mem copy — no network stack) and routes them.

Configuration: /etc/syslog.conf or /etc/rsyslog.conf routes messages by facility.priority to files, remote servers, or console. Log files: /var/log/messages, /var/log/syslog, /var/log/auth.log. Modern alternative: systemd-journald with journalctl.

Daemons typically: open syslog at startup, log events during operation, avoid writing to stdout/stderr (no terminal).

API:
openlog(ident, option, facility) - initialize.
syslog(priority, format, …) - send message.
closelog() - cleanup.
Facilities: LOG_DAEMON, LOG_AUTH, LOG_KERN, LOG_USER, LOG_LOCAL0-7. Priorities: LOG_EMERG, LOG_ALERT, LOG_CRIT, LOG_ERR, LOG_WARNING, LOG_NOTICE, LOG_INFO, LOG_DEBUG.

23
Q

How does a shell implement a pipeline like ‘ls | wc’? Also compare popen() vs pipe()/fork()/exec().

A

Shell pipeline ‘ls | wc’ implementation:
(1) Shell calls pipe(pipefd) - creates pipe with pipefd[0] (read) and pipefd[1] (write).
(2) Shell forks first child (for ls): Child does dup2(pipefd[1], STDOUT_FILENO) to redirect stdout to pipe write end, closes both pipefd ends, execve(‘/bin/ls’).
(3) Shell forks second child (for wc): Child does dup2(pipefd[0], STDIN_FILENO) to redirect stdin to pipe read end, closes both pipefd ends, execve(‘/usr/bin/wc’).
(4) Parent shell closes both pipe ends (important! otherwise wc never sees EOF).
(5) Parent calls wait() for both children.

Key insight: dup2() duplicates fd, allowing execve’d program to use standard stdin/stdout unaware of redirection.

popen(): Convenience wrapper — opens a pipe to/from a shell command, returns FILE*. Internally does pipe()+fork()+exec(). Simpler but less control: can only read OR write (not both), runs via /bin/sh (extra process), no access to child PID until pclose(). Use pipe/fork/exec directly when you need bidirectional I/O, error handling, or control over the child.

24
Q

What is SIGCHLD? Why is it important and how is it typically handled?

(Ch 26)

A

SIGCHLD (17): Sent to parent when child terminates, stops, or resumes. Default: ignore.

Importance: Notifies parent to call wait() and collect exit status (prevents zombies).

Handling patterns:
(1) Install handler that calls waitpid(-1, &status, WNOHANG) in loop - reap all terminated children.
(2) Use SA_NOCLDWAIT flag - kernel auto-reaps, no zombies, but lose exit status.
(3) Ignore signal + wait() periodically.

25
Q

What is CPU affinity and what are its benefits? How do you set it?

(Ch 2, 6, 35)

A

CPU affinity restricts which CPUs a process/thread can run on.

Benefits: reproducible performance, cache warmth (data stays in one CPU’s cache), NUMA locality (memory close to the CPU), isolation (dedicate CPUs to critical tasks).

Command line: taskset -c 0,2 ./program or taskset -p 0x5 {pid}. Threads: each thread can have its own affinity via pthread_setaffinity_np(). Inheritance: a child inherits the parent’s affinity.

API: cpu_set_t mask; CPU_ZERO(&mask); CPU_SET(0, &mask); CPU_SET(2, &mask); sched_setaffinity(pid, sizeof(mask), &mask); restricts the process to CPUs 0 and 2. pid=0 means the current process. sched_getaffinity() retrieves the current mask.

Caveats: don’t over-constrain (fewer CPUs than threads = contention). The scheduler honors the mask; for full isolation from other tasks use the isolcpus boot parameter or cpusets.
26
Q

What is the per-process kernel stack? Why does each process/thread have one and why is it small? Does it also handle interrupts?

(Ch 2, 3, 6)

A

Every process/thread has its own kernel stack (typically 8KB or 16KB on x86-64, configured via THREAD_SIZE). Used when the process enters kernel mode via syscall or interrupt — kernel code executes on this stack, not the userspace stack.

Why separate: (1) The userspace stack is untrusted — the process could set RSP to an invalid address. (2) The kernel needs a guaranteed valid stack immediately on mode switch. (3) Isolation — kernel data on the stack is not visible to userspace.

Why per-process: kernel code running on behalf of process A (e.g., in a syscall) can sleep/block — when process B runs, it needs its own kernel stack to enter kernel mode independently.

Why small: the physical memory cost is multiplied by every task in the system. Constraints: no large local variables in kernel code, no deep recursion, no VLAs. Stack overflow in the kernel → panic or silent memory corruption (guard pages added in Linux 4.9 via VMAP_STACK to detect this).

The interrupt stack is separate (per-CPU, not per-process) to avoid deepening a process’s kernel stack during nested interrupts. Viewable: /proc/PID/stack shows the current kernel stack trace.
27
Q

What is copy-on-write (COW)? How does fork() use it?

(Ch 24)

A

Copy-on-write is a lazy optimization that makes fork() effectively O(1): instead of copying memory immediately, share it read-only until modification.

fork() with COW:
(1) Child created with the same page table entries as the parent.
(2) All writable pages marked read-only in both parent and child.
(3) Either process tries to write → page fault.
(4) Kernel allocates a new page, copies the content, updates the page table.
(5) Only modified pages are ever copied.

Benefits: the fork()+exec() pattern is very fast (the child replaces its memory anyway), and memory-efficient for processes sharing code/read-only data. Without COW, fork() would be O(memory size).

Note: the COW fault updates PTEs (page table entries), not the VMA — allocate a page frame, copy, update the PTE.
28
Q

What happens during an execve() call? How are open file descriptors affected?

(Ch 24, 27)

A

execve() replaces the current process image with a new program:
(1) Old memory segments (text, data, heap, stack) destroyed.
(2) New program loaded from the executable file.
(3) New text, data, bss segments created.
(4) Stack initialized with argc, argv, envp.
(5) Execution starts at the program entry point.

File descriptors: by default, open fds are inherited across exec. Fds with the close-on-exec flag (FD_CLOEXEC/O_CLOEXEC) are automatically closed.

Also: memory mappings are unmapped, signal handlers reset to defaults (SIG_DFL), pending signals preserved (the new program receives them), PID unchanged.
29
Q

What are race conditions after fork()? Can you assume parent or child runs first? Which implications does this have?

(Ch 24)

A

After fork(), execution order between parent and child is undefined. Either may run first, or both simultaneously on a multiprocessor. Implication: shared resources need synchronization.

Historical variation: Linux 2.4 ran the parent first; 2.6-2.6.31 ran the child first; 2.6.32+ runs the parent first (tunable via /proc/sys/kernel/sched_child_runs_first). Other Unixes vary.

Example bug: parent sets up data and expects the child to see it, but the child runs first. Classic example: parent installs a signal handler, forks, sends a signal to the child — but the child might exec() before the handler is installed.

Solutions: use a pipe/socket for synchronization (child writes, parent reads after fork); semaphores or mutexes if using shared memory; signals (parent waits for the child’s signal). Never assume ordering — always synchronize.

Fun note: two generals on the same machine. Network partitions are not the cause of distributed-systems complexity, wrong synchronisation logic is ;-)
30
Q

What are resource limits (rlimits)? List the important ones.

_AS is?

(Ch 36)

A

Resource limits cap what processes can consume. Get/set via ulimit or getrlimit()/setrlimit(). Each has a soft (current) and hard (maximum) limit. Soft can be raised up to hard; hard can only be lowered (non-root). Inheritance: fork preserves limits; exec preserves unless setuid.

Important limits:
- RLIMIT_CPU (max CPU seconds — sends SIGXCPU, then SIGKILL)
- RLIMIT_AS (max address space / virtual memory)
- RLIMIT_NOFILE (max open fds)
- RLIMIT_NPROC (max processes)
- RLIMIT_CORE (max core dump size)

getrusage() — measures what a process has consumed. Read-only reporting:
- CPU time (user + system)
- Max RSS (peak memory)
- Page faults (minor/major)
- Context switches (voluntary/involuntary)
- Block I/O operations
31
Q

How does pthread_mutex handle contention on Linux? Does it spin or sleep? What role does futex play?

(Ch 30)

A

pthread_mutex (default type) is implemented on top of futex (fast userspace mutex).

Uncontended path: atomic compare-and-swap (CAS) in userspace only — no kernel involvement, very fast.

Contended path: calls futex(FUTEX_WAIT) → thread sleeps in the kernel (does NOT spin by default). When the lock holder releases, it calls futex(FUTEX_WAKE) to wake one sleeping waiter.

Key distinction:
- Spinlock: thread loops (burns CPU) waiting for the lock. Useful when hold times are very short and context-switch cost would exceed spin time.
- Futex-based mutex: thread sleeps on contention, avoiding CPU waste but paying the cost of context switches (register save/restore, cache/TLB pollution) on wake.

Performance problem under high contention: many threads cycle through wake → schedule → try lock → fail → sleep, creating a storm of context switches. CPU goes up, throughput goes down — the CPU is doing scheduling work, not useful application work. Additionally, the cache line holding the futex word bounces between cores via the MESI protocol (cache line ping-pong), adding 40-100+ ns per transfer.
32
Q

What should an SRE know about CPU registers and caches?

Which perf tool can diagnose?

(add)

A

SRE relevance: context switches save/restore registers — cost grows with AVX state size; cache thrashing and NUMA locality can spike latency.

Memory/cache latencies (at ~4GHz):
- Registers: 1 cycle, per-core
- L1: 4 cycles, ~1ns, 32-64KB per-core
- L2: 12 cycles, ~3-5ns, 256KB-1MB per-core
- L3: 40 cycles, ~10-15ns, 10-50MB shared
- RAM: 100-200 cycles

Cache line = 64 bytes, the smallest fetch unit — data locality matters.

Key production problems (in order of probability):
(1) TLB misses — large working sets with random access exhaust TLB entries; huge pages (2MB/1GB) reduce pressure.
(2) Cache thrashing — working set exceeds cache size, nonlinear performance cliff.
(3) NUMA — remote memory access ~2x latency; check numastat/numactl, pin processes to the local node.
(4) False sharing — two threads modify different variables on the same cache line, the MESI protocol bounces it between cores, more threads = slower. Fix: cacheline-aligned padding.

Diagnose: perf stat -e cache-misses,LLC-load-misses,dTLB-load-misses,node-load-misses.
33
Q

fork(): what is NOT inherited?

(Ch 28.5, 35)

A

- Pending signals (think of SIGTERM)
- File locks - fcntl()
- Memory locks - mlock()
- Timers - alarm() and setitimer()
- Threads: only the calling thread is duplicated. Mutex state IS copied — a mutex locked by another thread stays locked forever in the child.
- mmaps marked with madvise(MADV_DONTFORK)
- Pending AIO contexts
- Privileged sched policies (when SCHED_RESET_ON_FORK is set)