How do you frame SQL Server performance tuning in an interview — what planes do you triage, and where does infrastructure fit for a cloud SRE?
Three planes: instance (resources and engine defaults: max server memory, MAXDOP, tempdb, storage), schema (indexes, keys, types, partitioning, statistics strategy), query / application (T-SQL shape, sargability, ORM patterns, parameter sniffing, transaction length). A fourth layer — platform / infrastructure (VM size, RDS parameter groups, network RTT) — is fair to call out when waits point at IO or CPU limits outside the engine. Panel line: “I classify where the lever lives — config, structures, or workload — and I fold in capacity when metrics say it’s not SQL-only.”
What are wait stats, which DMVs do you use, and why don’t waits alone tell the whole story?
Waits show what sessions or the instance are waiting on (disk, scheduler, locks, latches, network). Instance-wide: sys.dm_os_wait_stats (reset on restart or DBCC SQLPERF); live sessions: sys.dm_exec_requests (wait_type, wait_time, last_wait_type). Dominant waits hint at IO (PAGEIOLATCH_*, WRITELOG), CPU/parallelism (CXPACKET, CXCONSUMER, SOS_SCHEDULER_YIELD), memory (RESOURCE_SEMAPHORE), locks (LCK_M_*). You still pair waits with queries, plans, and counters — waits are a compass, not a diagnosis by themselves.
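A minimal triage query along these lines, assuming a reasonably recent SQL Server; the benign-wait exclusion list is illustrative and should be extended for your environment:

```sql
-- Top waits since last restart (or last DBCC SQLPERF clear), skipping common
-- benign waits so real pressure floats to the top.
SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0        AS wait_time_s,
    signal_wait_time_ms / 1000.0 AS signal_wait_s,  -- time spent runnable (CPU pressure hint)
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE',
                        N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH',
                        N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```

A high signal-to-wait ratio points at scheduler pressure rather than the resource named by the wait type.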
How do you use sys.dm_db_missing_index_* without index sprawl?
Missing-index DMVs (and Query Store hints in tools) suggest candidates where seeks/scans could be cheaper — but every index has write, storage, and maintenance cost; duplicate or narrow indexes can hurt. Correlate suggestions with top resource queries, execution plans (missing-index XML), cardinality, and selectivity. Prefer fewer well-chosen indexes over accepting every DMV row.
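One common way to rank the DMV output before judging candidates by hand; the scoring formula is a widely used convention, not an engine metric:

```sql
-- Missing-index candidates ranked by a rough improvement score; always
-- cross-check against actual plans before creating anything.
SELECT TOP (20)
    gs.avg_user_impact * gs.avg_total_user_cost
        * (gs.user_seeks + gs.user_scans) AS score,
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    gs.user_seeks,
    gs.user_scans
FROM sys.dm_db_missing_index_group_stats AS gs
JOIN sys.dm_db_missing_index_groups      AS g ON g.index_group_handle = gs.group_handle
JOIN sys.dm_db_missing_index_details     AS d ON d.index_handle = g.index_handle
ORDER BY score DESC;
```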
What do MAXDOP and cost threshold for parallelism control, and what goes wrong when they’re mis-set?
MAXDOP caps how many processors parallel work may use (scope rules vary by version/edition context). cost threshold for parallelism is the estimated cost above which the optimizer may parallelize. Too high MAXDOP → overhead and waits like CXPACKET / CXCONSUMER; too low → underused CPU on large analytics-style work. Many orgs set MAXDOP from core/NUMA guidance; MAXDOP 1 appears for some OLTP-heavy small VMs or targeted hints — validate with Query Store and actual plans, not folklore alone.
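A hedged sp_configure sketch; the numbers are placeholders for an imagined 8-core box, not recommendations, and should be validated against your NUMA layout and Query Store evidence:

```sql
-- Inspect current values (both are advanced options).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';
EXEC sp_configure 'cost threshold for parallelism';

-- Example change; values are illustrative only.
EXEC sp_configure 'max degree of parallelism', 8;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```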
How do you reason about SQL Server memory on a busy host — buffer pool, max server memory, and signals like PLE?
The buffer pool caches data pages; pressure often shows as IO waits or lazy writer activity. max server memory should leave headroom for the OS, other instances, and non-buffer use (plan cache, threads, linked servers, etc.); min server memory limits sudden shrinkage. PLE (page life expectancy) is a rough signal — working set and storage speed matter; don’t treat a single PLE number as an SLA. On Windows, LPIM (Locked Pages In Memory) keeps SQL Server’s working set from being paged out; it requires the Lock Pages in Memory privilege, typically granted through local security policy.
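To watch PLE as a trend rather than a single number, the perf-counter DMV works; on NUMA hardware you get one row per buffer node plus the instance-wide value:

```sql
-- Page life expectancy from the perf-counter DMV; sample it over time and
-- watch the trend, not one reading.
SELECT object_name, counter_name, instance_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy';
```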
Why does tempdb matter for performance, and what operational choices reduce contention?
tempdb is shared: sorts, hashes, worktables, row versions for RCSI / snapshot isolation, and internal objects. Allocation and metadata contention still appears on busy systems despite version improvements. Use enough data files with the same initial size and growth settings on fast storage; follow current Books Online guidance for file count per version (historically “one per logical CPU up to 8” was a starting point — say you follow current Microsoft guidance). Avoid long-lived user objects in tempdb unless intentional.
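A sketch of checking the current tempdb file layout and adding a matching data file; the path, names, and sizes are hypothetical:

```sql
-- Current tempdb files: size (pages are 8 KB), growth, type.
SELECT name, size * 8 / 1024 AS size_mb, growth, type_desc
FROM tempdb.sys.database_files;

-- Add a data file sized and grown like the others (placeholder path/sizes).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\tempdb\tempdev2.ndf',
          SIZE = 8GB, FILEGROWTH = 1GB);
```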
Define blocking in SQL Server terms and when it becomes an incident-level problem.
Blocking is normal short-term: session B waits on a lock session A holds in a conflicting mode until A releases (commit/rollback) or compatibility allows progress. It’s a problem when waits are long or cascade (many sessions behind one head blocker). Common drivers: long transactions, missing/wrong indexes (scans → more/larger locks), lock escalation, over-strong isolation (REPEATABLE READ / SERIALIZABLE). Panel line: “I find the head blocker, what it’s running, and how long the transaction is open — then I reduce lock duration and footprint.”
Which DMVs and patterns do you use to diagnose blocking in production?
sys.dm_exec_requests: blocking_session_id, wait_type, wait_time. sys.dm_tran_locks for who holds what. sys.dm_exec_sql_text and sys.dm_exec_query_plan for both blocked and blocker sessions. Legacy sp_who2 'active' or Activity Monitor for a quick picture. Often the fix is app-side (transaction boundaries, batch design) as much as an index change.
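The chain query many of us keep handy, sketched minimally; it shows who is blocked, by whom, and what the blocked session is running:

```sql
-- Blocked sessions with their blocker and current statement text.
SELECT r.session_id,
       r.blocking_session_id,   -- follow this to find the head blocker
       r.wait_type,
       r.wait_time,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```

Walking blocking_session_id back to a session that is itself unblocked finds the head of the chain.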
What mitigations do you propose for chronic blocking — beyond “add an index”?
Root cause: shorten transactions; commit in smaller safe units; fix hotspot updates; avoid SELECT * inside long transactions. Indexes that turn scans into seeks narrow locks and duration. Isolation: where business rules allow, options like READ COMMITTED SNAPSHOT at the database level reduce reader/writer blocking — with awareness of tempdb version store cost. Retries with backoff for transient blocking, not as a band-aid for 30-minute blockers.
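Enabling RCSI at the database level might look like this; MyAppDb is a placeholder name, and the rollback clause kicks out in-flight transactions, so schedule it deliberately:

```sql
-- Switch readers to row versioning; mind tempdb version store growth after.
ALTER DATABASE MyAppDb
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;
```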
Blocking vs deadlock — shape, how SQL Server resolves each, and what the application sees.
Blocking: usually a chain (A blocks B, B blocks C…); the waiter waits until the holder finishes. Deadlock: a cycle of lock waits (A→B→A or longer); the engine detects it and picks a deadlock victim, rolls back that work (error 1205). Apps often retry 1205; blocking usually surfaces as slow queries or timeouts. Panel line: “Deadlock is circular; SQL kills one victim. Blocking is linear until someone finishes.”
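A hedged T-SQL sketch of a 1205 retry for an idempotent write; the table, column, and key are hypothetical:

```sql
-- Retry an idempotent update up to 3 times on deadlock (error 1205),
-- with a short fixed backoff; anything else is rethrown.
DECLARE @OrderId int = 42;   -- hypothetical key
DECLARE @retry int = 0;
WHILE @retry < 3
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderId = @OrderId;
        COMMIT;
        BREAK;  -- success, leave the loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK;
        IF ERROR_NUMBER() = 1205
        BEGIN
            SET @retry += 1;
            WAITFOR DELAY '00:00:00.200';  -- simple backoff
        END
        ELSE
            THROW;  -- not a deadlock: surface it
    END CATCH
END
```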
What causes deadlocks in real systems, and how do you inspect them?
Causes: inconsistent lock order (Proc1 updates A then B, Proc2 updates B then A), same rows touched in different order from different code paths, heavy HOLDLOCK / serializable range locks. Use Extended Events xml_deadlock_report (or SSMS deadlock graph) for resources, processes, and victim; error logs may summarize; Query Store / plan handles tie statements to recurrence.
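Recent deadlock graphs can usually be pulled from the default system_health Extended Events session without setting anything up, assuming its file target is in the default log directory:

```sql
-- Extract deadlock graphs from system_health; each row is one incident's XML.
SELECT CONVERT(xml, event_data).query('/event/data/value/deadlock') AS deadlock_graph
FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL)
WHERE object_name = N'xml_deadlock_report';
```

The XML names the resources, the participating processes, and the chosen victim; SSMS can render it as a graph.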
How do you mitigate deadlocks without only telling apps to retry forever?
Standardize lock order across procedures for the same object set (documented ordering: e.g. Orders then OrderLines). Smaller transactions; indexes so locks are narrow and short-lived. RCSI / snapshot can reduce some reader/writer deadlock classes — writer/writer cycles still happen. Retrying on 1205 for idempotent operations is normal — still fix a frequently recurring graph.
When do you escalate a performance issue to application / query owners vs keep it in the DBA/SRE lane?
Escalate when the fix is application behavior: N+1 queries, over-broad WHERE clauses, implicit conversions killing sargability, ORM cartesian-join patterns, transaction scope, retry storms, cache design — you prove it with plans and stats, but code wins. Stay platform-owned for instance config, storage, maintenance (stats, integrity), parameterization / plan stability (RECOMPILE vs forced plans), and Resource Governor if used.
Name isolation levels and lock escalation at a level you’d use in a senior panel — why they tie to performance.
Isolation: READ UNCOMMITTED, READ COMMITTED (default), REPEATABLE READ, SERIALIZABLE, plus snapshot models (ALLOW_SNAPSHOT_ISOLATION, READ_COMMITTED_SNAPSHOT). Stronger isolation → more / longer-held locks → more blocking risk. Lock escalation: SQL may promote many row locks to page or table, increasing blast radius; heavy partitioning or LOCK_ESCALATION options matter in edge cases. You choose isolation to match correctness needs, not defaults by accident.
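Lock escalation can be tuned per table when a hotspot justifies it; the table name is hypothetical and the default (TABLE) is right for most workloads:

```sql
-- AUTO allows partition-level escalation on partitioned tables instead of
-- jumping straight to a table lock; DISABLE is a last resort.
ALTER TABLE dbo.BigFacts SET (LOCK_ESCALATION = AUTO);
```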
Give a tight closing line for a DEEP&T-style reliability interview tying waits, planes, and partnership.
“I use waits and plans to classify CPU vs IO vs locks; I own instance and schema levers and propose concrete changes; I partner with engineering when the workload shape or transaction design has to change — with metrics and repro so the conversation stays evidence-based at scale.”
What is a DMV?
A DMV (dynamic management view) is a server-state catalog view in the sys schema (names like sys.dm_*) that exposes live or cumulative internal metrics: sessions, waits, locks, plans, IO, memory, and more. DMFs (dynamic management functions) are the same family but parameterized (e.g. pass a session or plan handle). They are read-only, often lighter than full traces for ad hoc diagnosis, and reset / scope rules differ by object (e.g. sys.dm_os_wait_stats resets on restart or DBCC SQLPERF). Panel line: “I use DMVs first to see what the engine is doing now and to tie waits to sessions and plans.”
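A quick illustration of the DMV vs DMF distinction: one you just select from, the other takes a handle:

```sql
-- DMV: no parameters, select directly.
SELECT session_id, login_name, status
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;

-- DMF: parameterized, here applied per request via its plan handle.
SELECT r.session_id, p.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS p;
```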