What’s the difference between a process and a thread in Linux? How are threads implemented under the hood?
clone() vs fork()
(Ch 29-33)
A process is an instance of an executing program with its own virtual address space, file descriptors, and kernel data structures.
A thread is a unit of execution within a process that shares the address space, file descriptors, and signal handlers with other threads.
What is NOT shared: Stacks, registers, TLS (Thread Local Storage).
clone() vs fork(): Same Address Space - Under the hood, Linux implements threads using the clone() syscall with specific flags (CLONE_VM, CLONE_FILES, CLONE_FS, CLONE_SIGHAND).
Both processes and threads are represented by task_struct in the kernel - threads are essentially processes that share resources. The key difference: fork() creates new address space; clone() with thread flags shares it.
What’s the difference between fork(), vfork(), and clone()? How are they related?
fork() and vfork() are glibc wrappers. On Linux, fork() is implemented via the clone() syscall (newer glibc/kernels can use clone3(), the extensible successor); vfork() is clone() with CLONE_VM|CLONE_VFORK on most architectures. In the kernel, fork, vfork, clone, and clone3 all converge on the same code path (kernel_clone()).
clone() - Fine-grained control over what’s shared (memory, file descriptors, signal handlers, etc.). Use for: creating threads (share everything) or containers (new namespaces). flags parameter controls sharing: CLONE_VM (memory), CLONE_FILES (fd table), CLONE_SIGHAND (signal handlers).
fork() - Creates child process with copy of parent’s address space (copy-on-write). Child gets own memory, file descriptors (duplicated), etc. Use for: general process creation.
vfork() - Historical optimization for the fork()+exec() pattern, avoiding COW setup overhead. Creates a child that shares the parent’s address space until exec() or _exit(); the parent blocks until then. Dangerous if the child modifies memory or returns from the calling function. Is it still relevant? Mostly no — COW fork() is fast enough; occasionally seen on memory-constrained embedded systems.
What are the key signals every systems programmer should know? Describe their default actions.
8, Catchable / Non-Catchable
(Ch 20, 22, 26)
Non-Catchable:
SIGKILL (9): Forceful termination.
SIGSTOP (19): Stop process.
Catchable:
SIGTERM (15): Termination request - allows cleanup.
SIGHUP (1): Hangup - daemon reload.
SIGTSTP (20): Terminal stop (Ctrl+Z).
SIGCONT (18): Resume stopped process.
SIGINT (2): Keyboard interrupt (Ctrl+C). Default: terminate.
SIGCHLD (17): Child exited/stopped. Default: ignore.
SIGSEGV (11): Segmentation fault. Default: terminate + core.
SIGALRM (14): Timer expired.
SIGPIPE (13): Write to broken pipe. Default: terminate.
SIGQUIT (3): Quit with core dump (Ctrl+\).
What is a zombie process? How does one get created, and how is it cleaned up?
“buggy” parent
(Ch 2, 6, 24)
A zombie is a process that has terminated but whose parent hasn’t yet called wait() to collect its exit status.
Created when: Child calls exit() or is killed, parent does not call wait(), kernel retains minimal info (PID, exit status, resource usage) in process table. The process releases memory, file descriptors, etc., but entry remains.
If parent ignores SIGCHLD or sets SA_NOCLDWAIT, zombies auto-reaped by kernel.
If the parent dies first, init (PID 1, or a subreaper) adopts the orphans and reaps them. Otherwise zombies accumulate, consuming process table slots (PIDs) - but minimal other resources.
What is the run queue in the Linux scheduler? How does scheduler latency work?
cache locality?
(Ch 35)
Run queue: per-CPU data structure holding runnable tasks. Each CPU has its own — improves scalability, cache locality.
EEVDF (Linux 6.6+): picks earliest deadline among eligible tasks providing better latency with no tuning. CFS: red-black tree ordered by vruntime requiring tuning.
Run queue length (nr_running): number of runnable tasks on that CPU. Related to but not the same as load average — load average also includes D-state (uninterruptible sleep) tasks.
Scheduler latency: the period over which all runnable tasks should run at least once. EEVDF: latency emerges from deadline calculation — short-burst tasks get earlier deadlines, no explicit latency tunable needed.
Load balancing: kernel periodically migrates tasks between CPU run queues. Considers migration cost (cache warmth), NUMA topology, and cgroup affinity.
What is a daemon process? How do you create one?
(Ch 37)
Daemon: background process with no controlling terminal, usually long-lived.
Creation steps:
(1) fork();
(2) parent exits - child is adopted by init (PID 1);
(3) Close all open fds;
(4) Become the session/process group leader - setsid() (no controlling terminal);
(5) Set the umask to 0 (or from config);
(6) chdir("/") - don’t hold a directory/mount point busy.
(7) Setup signal handlers.
Examples: httpd, sshd, cron.
Modern days: systemd manages daemon lifecycle with cgroups, signals, socket activation (listening fds held open for fast startup), and D-Bus.
What is the default Linux scheduler? How does it work and which parameters it takes into account?
(Ch 35)
Linux CPU scheduler: EEVDF (6.6+, replaced CFS).
Goal: Given N processes, each gets 100%/N over time.
Key concepts:
(1) No fixed time slices: calculated dynamically based on number of runnable processes.
(2) Lag tracking: how far behind/ahead of fair share each task is (evolved from vruntime).
(3) Virtual deadline: each task gets a deadline for its current slice. Short-burst tasks get earlier deadlines → better latency without heuristics.
(4) Pick algorithm: among eligible tasks (lag ≥ 0), choose earliest deadline.
(5) Nice values: affect weight which affects lag accumulation and deadline calculation. Lower nice = more weight = longer slices.
(6) Sleeper fairness: emerges naturally from lag — sleeping task falls behind, becomes eligible immediately on wake.
Where EEVDF sits in the scheduling hierarchy (highest first):
SCHED_DEADLINE (dedicated runqueue)
Priority 0-99: RT runqueue (SCHED_FIFO, SCHED_RR)
Priority 100-139: EEVDF runqueue (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE)
CFS (2.6.23-6.5) used vruntime + red-black tree. EEVDF replaced it with lag + deadline, eliminating CFS’s latency heuristic hacks.
What are other scheduler policies besides EEVDF (NORMAL/BATCH)? When would you use them?
highest, 0-99, 100-139
(Ch 33, 35)
Use chrt command or sched_setscheduler() syscall to set per-task scheduling policy.
Runqueues (kernel internal priority, lower = higher priority):
Highest prio: SCHED_DEADLINE (separate runqueue)
Priority 0-99: RT runqueue (SCHED_FIFO, SCHED_RR)
Priority 100-139: EEVDF runqueue (SCHED_OTHER/SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE)
Details in Priority order (highest first):
SCHED_DEADLINE: Earliest deadline first. Specify runtime, deadline, period. Kernel guarantees CPU time. Highest priority in the system — preempts even RT tasks. For: hard real-time with deadline guarantees.
SCHED_FIFO: Real-time, first-in-first-out. Process runs until it blocks or higher priority arrives. For: deterministic latency. Risk: can starve everything below it.
SCHED_RR: Real-time, round-robin. Like FIFO but time-sliced among same-priority tasks.
SCHED_NORMAL: Default. Managed by EEVDF (6.6+, was CFS). Fair share scheduling.
SCHED_BATCH: Same class as OTHER but no preemption bonus. Kernel treats as non-interactive. For: CPU-intensive batch jobs.
SCHED_IDLE: Lowest priority, only runs when nothing else needs CPU. For: background tasks that shouldn’t impact system.
SCHED_OTHER vs SCHED_NORMAL: They’re the same thing; OTHER is POSIX name, NORMAL is Linux kernel name.
Why might your service stutter with 50% CPU free?
which type of quota
(Ch 5, 35)
CPU cgroups bandwidth throttling:
Cgroup has quota (allowed CPU time per period). If task uses quota before period ends, it’s throttled until next period — even if CPU is idle system-wide.
Why stutter with CPU free:
(1) Container has low quota — uses allowance in burst, then waits.
(2) System-wide CPU free but YOUR cgroup is throttled.
(3) Bursty workload exhausts quota early in period.
(4) Multi-threaded: quota shared across all threads in cgroup.
Diagnosis: cpu.stat → nr_throttled, throttled_time (v1) / throttled_usec (v2).
Solutions: increase quota, use proportional weight instead (v1: cpu.shares, v2: cpu.weight), spread work evenly.
Config: cgroup v1: cpu.cfs_quota_us + cpu.cfs_period_us. Cgroup v2: cpu.max (“quota period” in one file). “CFS bandwidth throttling” — the name stuck even with EEVDF. Kernel config is still CONFIG_CFS_BANDWIDTH.
What is the page cache? Cover dirty pages, writeback, vm.dirty_* tunables, and how to manage it with drop_caches.
Page cache is kernel’s in-memory cache for file data.
Read: check cache first → cache hit returns immediately; cache miss reads from disk and caches. Write: data written to cache (page marked “dirty”), actual disk write deferred.
Dirty pages: modified cache pages not yet written to disk. Writeback: kernel writeback threads (kworker) write dirty pages to disk. sync() system-wide, fsync() per-file, fdatasync() data + necessary metadata only.
Tunables:
- vm.dirty_background_ratio (default 10%) — % of available memory; when dirty pages exceed this, background writeback starts. Non-blocking.
- vm.dirty_ratio (default 20%) — % of available memory; when dirty pages exceed this, writing processes block and do synchronous writeback. You feel this as latency.
- vm.dirty_expire_centisecs (default 3000 = 30s) — max age before page is eligible for writeback.
- vm.dirty_writeback_centisecs (default 500 = 5s) — how often writeback thread wakes to check.
drop_caches: echo 1 > /proc/sys/vm/drop_caches (page cache), echo 2 (dentries/inodes), echo 3 (both).
Non-destructive: only drops clean pages. Dirty pages must be flushed first (sync).
Useful for benchmarking (cold cache) and freeing memory in emergencies. Not for routine use — kernel manages cache efficiently.
What happens at the kernel level when a process receives SIGTERM vs SIGKILL?
TERM: what if no handler?
(Ch 20)
SIGTERM (15): Kernel marks signal pending. When process returns to userspace, kernel checks for handlers. Process can catch, ignore, or block SIGTERM.
If handler installed → handler runs, process can clean up and exit gracefully (or ignore).
If default (no handler) → process terminated by kernel.
SIGKILL (9): Kernel marks signal pending and wakes the task; termination starts with no handler check. All threads terminated, memory freed, file descriptors closed. Exit status reflects killed-by-signal. Cannot be caught, ignored, or blocked - enforced unconditionally by the kernel. Caveat: a task in uninterruptible (D) sleep won’t die until it leaves that state - even SIGKILL waits.
Explain virtual memory in Linux: page tables, MMU, TLB – what exactly occurs on a page fault (minor vs. major)?
Huge pages?
(Ch 6, 48)
Page tables: hierarchical radix-tree (trie) structure mapping virtual→physical addresses.
Modern x86-64 uses 4-level tables (48-bit, 256 TB), conceptually similar to indirect blocks in filesystems. Each entry holds a physical frame number + flags (present, writable, user-accessible, dirty, accessed). 5-level paging (57-bit, 128 PB) available since kernel 4.14.
MMU: per-core hardware that walks page tables on every memory access. Includes TLB: per-core cache of recent virtual→physical translations in MMU. On hit (1-2 cycles) no page table walk needed.
Page fault occurs when: page not present (to be loaded), permission violation (read-only, NX bit, userspace accessing not permitted ranges).
Minor fault (~1 µs): page is in memory but not mapped (e.g., new stack page, COW copy) → allocate/copy if needed, update page table. No disk I/O.
Major fault (~10 ms on HDD, much faster on SSD): page must be read from disk (swap or file).
Minor fault examples:
- Lazy allocation (first touch): Allocate physical frame, zero it, update page table
- COW (after fork): Copy page, map private copy, update page table
- Page in page cache but not mapped: Just update page table entry (e.g., mmap’d file already cached)
The kernel updates the page table entry (PTE), then flushes the TLB entry for that address. The instruction is then re-executed.
Stack growth is also a minor fault: when RSP (the stack pointer) moves into the next virtual stack page that has no physical page behind it, the fault handler allocates a new page.
How does memory reclaim work in Linux? What triggers reclamation?
(add)
Memory reclaim frees pages when memory is low.
Triggers:
(1) Direct reclaim: allocation fails, calling process must free pages.
(2) kswapd: background daemon wakes when free memory < low watermark and targets cache pages.
(3) OOM killer: last resort when reclaim fails.
Reclaim process:
(1) Scan LRU lists (active/inactive for anon and file pages).
(2) File-backed clean pages: drop immediately (can re-read).
(3) File-backed dirty pages: write back, then drop.
(4) Anonymous pages: write to swap, then free.
(5) Slab cache shrinking.
Tunables:
- vm.swappiness (0-200 since kernel 5.8; historically 0-100) - bias toward swapping anonymous pages vs dropping file cache. Values above 100 favor anonymous reclaim over file-backed pages - useful when swap lives on fast media (zram/zswap); cgroup v1 also exposes per-cgroup memory.swappiness.
- vm.vfs_cache_pressure - slab reclaim aggressiveness.
Explain Linux job control: SIGTSTP/SIGSTOP/SIGCONT, process groups, sessions, controlling terminal, and orphaned process groups.
SIGTSTP (20): Terminal stop signal - sent by Ctrl+Z. Can be caught/ignored. Default: stop process. Allows cleanup before stopping.
SIGSTOP (19): Unconditional stop - cannot be caught/ignored/blocked. Used programmatically.
SIGCONT (18): Resume stopped process.
Job control:
(1) Shell puts background jobs in separate process groups.
(2) Ctrl+Z sends SIGTSTP to foreground process group.
(3) Process stops, shell regains control.
(4) ‘fg’ command: shell sends SIGCONT, moves job to foreground (tcsetpgrp).
(5) ‘bg’ command: shell sends SIGCONT, job runs in background. Session leader (shell) manages foreground process group.
Process groups: Set of related processes (e.g., pipeline). Each has a PGID. Sessions: Set of process groups. Created by setsid(). Session leader = controlling process. Controlling terminal: /dev/tty. Ctrl-C sends SIGINT to foreground process group. Terminal hangup sends SIGHUP to session leader, which conventionally forwards to all children.
Orphaned process groups: A process group becomes orphaned when its session leader (or the last process with a parent outside the group) exits. Kernel sends SIGHUP then SIGCONT to stopped members of orphaned groups — SIGHUP to notify, SIGCONT to wake them so they can handle it. Without this, stopped orphans would be stuck forever.
What is eventfd? When would you use it?
How much data for the counter?
(Ch 24, 27, 33)
eventfd creates fd for event notification between threads/processes.
eventfd vs pipe/FIFO:
- 8-byte counter — that’s it (no data).
- Single fd (both read and write).
- Used for: lightweight notification/wakeup between threads.
Use cases:
- (1) Thread synchronization: one thread signals another via fd, integrates with epoll.
- (2) Parent-child notification: share fd across fork().
- (3) Event loop integration: wake up epoll from another thread.
- (4) User-space semaphore: with EFD_SEMAPHORE flag.
Flags: EFD_NONBLOCK, EFD_CLOEXEC, EFD_SEMAPHORE. Lighter weight than pipe() for simple notification. Combined with epoll, enables unified event-driven architecture.
What is the shebang (#!) line and how does the kernel handle it?
(Ch 2, 27)
Shebang is the #! at the start of a script file, specifying the interpreter.
The script file itself must have execute permission.
When execve() encounters a file starting with #!, the kernel:
(1) Parses the interpreter path (e.g., #!/bin/bash or #!/usr/bin/env python).
(2) Executes the interpreter with the script path as argument.
(3) Optional arguments after interpreter path are passed too.
Example: #!/usr/bin/awk -f.
Maximum shebang line length on Linux: 127 bytes (raised to 255 in kernel 5.1). Using /usr/bin/env allows finding the interpreter via PATH.
What are mutexes? How do pthread mutexes work?
mutual exclusion
(Ch 3, 30)
Mutex (mutual exclusion): lock ensuring only one thread accesses critical section at a time.
Implementation: typically uses futex syscall - fast path in userspace, syscall only on contention.
Rules: only owner should unlock, don’t lock twice (deadlock), always unlock.
Operations:
pthread_mutex_init() or PTHREAD_MUTEX_INITIALIZER.
pthread_mutex_lock() - acquire (blocks if held).
pthread_mutex_trylock() - non-blocking attempt.
pthread_mutex_unlock() - release.
pthread_mutex_destroy() - cleanup.
Mutex types:
PTHREAD_MUTEX_NORMAL (default),
PTHREAD_MUTEX_ERRORCHECK (detects double-lock),
PTHREAD_MUTEX_RECURSIVE (allows re-locking by owner).
What is PSI (Pressure Stall Information)? How do you use it to detect resource pressure?
Linux 4.20+; “stall time”
(add)
PSI provides real-time metrics showing how long processes wait for resources.
PSI — measures stall time (how long tasks are waiting), not what caused the wait.
Categories: cpu, memory, io. Levels: ‘some’ = at least one task stalled; ‘full’ = all tasks stalled.
Files: /proc/pressure/cpu, /proc/pressure/memory, /proc/pressure/io.
Format: ‘avg10=X avg60=Y avg300=Z total=T’ - percentages over 10s, 60s, 5min windows.
Use cases:
(1) Autoscaling triggers.
(2) Identify I/O bottlenecks.
(3) systemd-oomd uses PSI to anticipate OOM, kicking in on memory/swap pressure.
Example: memory “some avg10=25.00”
means tasks waited 25% of last 10 seconds.
Can set up poll() triggers for threshold alerts.
What is a deadlock? How do you prevent it?
(Ch 2, 30)
Deadlock: two or more threads blocked forever, each waiting for resource held by another.
Four conditions (all required):
(1) Mutual exclusion - resource held exclusively.
(2) Hold and wait - hold one resource while waiting for another.
(3) No preemption - can’t force release.
(4) Circular wait - A waits for B, B waits for A.
Detection: tools like helgrind (Valgrind), lockdep (kernel). Best practice: minimize lock scope, avoid holding locks across calls to unknown code.
Prevention strategies:
(1) Lock ordering - always acquire locks in same global order.
(2) Lock timeout - use pthread_mutex_timedlock(), back off on failure.
(3) Try-lock - pthread_mutex_trylock(), release all and retry if fails.
(4) Avoid nested locks when possible.
Walk through what happens when you type ‘ls’ in a shell – from Enter to output appearing.
opendir() -> getdents()
(Ch 2, 3, 4)
1) Shell reads input, parses ‘ls’, searches PATH for executable.
2) fork(): shell creates child process (copy-on-write address space).
3) In child: execve(‘/bin/ls’, [‘ls’], envp) - replaces process image with ls binary.
4) Kernel loads ELF binary, sets up memory segments, dynamic linker loads libc.
5) ls runs: opendir() → getdents() syscall reads directory entries.
6) ls sorts entries, formats output.
7) write() syscalls send output to stdout (fd 1).
8) Terminal driver receives data, displays on screen.
9) ls calls exit(0).
10) Parent shell’s wait() returns, shell displays next prompt.
Compare SSD vs HDD from an operating system perspective. How does Linux I/O handle them differently?
(add)
Physical differences: HDD has spinning platters + seek time; SSD has flash memory, no seek penalty.
Random access: HDD slow (seek + rotational latency ~10ms); SSD fast (~0.1ms). Sequential: both fast, HDD competitive.
Linux differences:
(1) Read-ahead: less beneficial for SSD.
(2) Swappiness: SSDs handle random access well, can use swap more.
(3) I/O schedulers: HDD uses mq-deadline (batches seeks); SSD uses none/mq-deadline (no seek optimization needed).
(4) TRIM/discard: SSDs need garbage collection to reclaim partially-invalidated erase blocks (erase units are large - often hundreds of KB to several MB). TRIM tells the firmware which blocks are unused, helping GC and wear leveling. Mount option ‘discard’ or periodic fstrim.
(5) Wear leveling: SSD concern, but managed by firmware. Check scheduler: cat /sys/block/sda/queue/scheduler.
What is syslog? How do daemons log messages? To where?
(Ch 37)
syslog is the standard logging system for Unix/Linux daemons (which have no terminal to write to).
Daemons write via the Unix domain socket (AF_UNIX) /dev/log; the syslogd/rsyslogd daemon receives the messages (a local memory-to-memory copy through the socket) and routes them.
Configuration: /etc/syslog.conf or /etc/rsyslog.conf routes messages by facility.priority to files, remote servers, or console. Log files: /var/log/messages, /var/log/syslog, /var/log/auth.log. Modern alternative: systemd-journald with journalctl.
Daemons typically: open syslog at startup, log events during operation, avoid writing to stdout/stderr (no terminal).
API:
openlog(ident, option, facility) - initialize.
syslog(priority, format, …) - send message.
closelog() - cleanup.
Facilities: LOG_DAEMON, LOG_AUTH, LOG_KERN, LOG_USER, LOG_LOCAL0-7. Priorities: LOG_EMERG, LOG_ALERT, LOG_CRIT, LOG_ERR, LOG_WARNING, LOG_NOTICE, LOG_INFO, LOG_DEBUG.
How does a shell implement a pipeline like ‘ls | wc’? Also compare popen() vs pipe()/fork()/exec().
Shell pipeline ‘ls | wc’ implementation:
(1) Shell calls pipe(pipefd) - creates pipe with pipefd[0] (read) and pipefd[1] (write).
(2) Shell forks first child (for ls): Child does dup2(pipefd[1], STDOUT_FILENO) to redirect stdout to pipe write end, closes both pipefd ends, execve(‘/bin/ls’).
(3) Shell forks second child (for wc): Child does dup2(pipefd[0], STDIN_FILENO) to redirect stdin to pipe read end, closes both pipefd ends, execve(‘/usr/bin/wc’).
(4) Parent shell closes both pipe ends (important! otherwise wc never sees EOF).
(5) Parent calls wait() for both children.
Key insight: dup2() duplicates fd, allowing execve’d program to use standard stdin/stdout unaware of redirection.
popen(): Convenience wrapper — opens a pipe to/from a shell command, returns FILE*. Internally does pipe()+fork()+exec(). Simpler but less control: can only read OR write (not both), runs via /bin/sh (extra process), no access to child PID until pclose(). Use pipe/fork/exec directly when you need bidirectional I/O, error handling, or control over the child.
What is SIGCHLD? Why is it important and how is it typically handled?
(Ch 26)
SIGCHLD (17): Sent to parent when child terminates, stops, or resumes. Default: ignore.
Importance: Notifies parent to call wait() and collect exit status (prevents zombies).
Handling patterns:
(1) Install handler that calls waitpid(-1, &status, WNOHANG) in loop - reap all terminated children.
(2) Use SA_NOCLDWAIT flag - kernel auto-reaps, no zombies, but lose exit status.
(3) Ignore signal + wait() periodically.