When a replica is “disconnected,” what three layers should you triage first?
Network (connectivity, DNS, firewall), WSFC membership (node down or evicted, quorum, witness), and AG / data health (synchronization, suspended, RESTORING / suspect secondary). The fix depends on which layer is broken.
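As a first pass at the AG layer, a query along these lines (run on the primary) shows which replicas the primary considers disconnected and the last connection error it recorded. This is a sketch, not a full health check; the DMVs are standard, but how you interpret the error text depends on your environment.

```sql
-- Run on the primary: one row per replica, with connection state and
-- the most recent connection error the primary saw for that replica.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = ars.replica_id
ORDER BY ar.replica_server_name;
```

A DISCONNECTED row with a recent connect error usually points to the network or WSFC layers; a CONNECTED replica with poor synchronization health points to the data layer.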
How is restoring a disconnected node related to adding a new node?
Often similar at the end, not always at the start. A short transient outage usually needs no reinstall — fix connectivity and let redo catch up or RESUME. A bad secondary copy on an otherwise healthy instance matches the re-seed path (like add-node §3 manual seeding). A rebuilt OS/VM, evicted node, or new SQL install repeats WSFC + SQL + endpoint, then ADD REPLICA and seed — same family as add node.
For network / security issues between replicas, what should you check first?
HADR endpoint port (often 5022), Windows Firewall, cloud NSGs / security groups, and DNS resolution for instance names — confirm replicas can reach each other on the expected ports and names.
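Before chasing firewalls, it can help to confirm what the instance itself thinks its endpoint is. The following sketch, run on each replica, lists the HADR (database mirroring) endpoint, its state and port, and the endpoint URL each replica is configured with; a mismatch between the two is a common cause of "disconnected".

```sql
-- Run on each replica: is the HADR endpoint STARTED, and on which port?
SELECT dme.name       AS endpoint_name,
       dme.state_desc,            -- expect STARTED
       dme.role_desc,
       te.port                    -- often 5022, but not guaranteed
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
  ON te.endpoint_id = dme.endpoint_id;

-- The URL (host name and port) the AG expects for each replica.
SELECT replica_server_name,
       endpoint_url
FROM sys.availability_replicas;
```

The host name in endpoint_url is what must resolve in DNS from the other replicas, and the port is what firewalls and NSGs must allow.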
Where do you look when SQL is online but the AG database is not synchronizing or is suspended?
sys.dm_hadr_database_replica_states (synchronization_state_desc, is_suspended, suspend_reason_desc, and the log send and redo queue sizes), plus the error logs on primary and secondary to see why log send or redo stalled.
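A minimal version of that DMV check might look like the query below. Run it on the primary to see every replica; run on a secondary, it returns only that replica's local rows.

```sql
-- Per-database synchronization state, suspend reason, and queue sizes.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)       AS database_name,
       drs.is_local,
       drs.synchronization_state_desc,   -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING ...
       drs.synchronization_health_desc,
       drs.is_suspended,
       drs.suspend_reason_desc,          -- e.g. SUSPEND_FROM_USER, SUSPEND_FROM_REDO
       drs.log_send_queue_size,          -- KB of log not yet sent to the secondary
       drs.redo_queue_size               -- KB of log not yet redone on the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```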
Describe Path A — transient disconnect: when does it apply and what is the usual fix?
When: Short network blip, controlled reboot, or data movement suspended for maintenance, with no corruption and no cluster eviction. Fix: Restore connectivity and services; if the database is SUSPENDED, run ALTER DATABASE db SET HADR RESUME on the replica where it is suspended (or resume per runbook); watch sys.dm_hadr_database_replica_states until the database is SYNCHRONIZED (or SYNCHRONIZING with acceptable lag under asynchronous commit). No full backup/restore is needed if redo catches up.
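Path A in T-SQL is short. In this sketch, [SalesDb] is a placeholder database name; run the statements on the replica where data movement is suspended.

```sql
-- Resume data movement for one AG database on this replica.
ALTER DATABASE [SalesDb] SET HADR RESUME;

-- Then watch the local copy until it catches up.
SELECT DB_NAME(database_id) AS database_name,
       synchronization_state_desc,
       is_suspended,
       redo_queue_size            -- KB still waiting for redo
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;
```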
Give a one-line interview answer for a healthy secondary that was only disconnected.
“If the secondary is still in the AG and data is consistent, I fix connectivity and RESUME if needed — redo does the rest.”
What is the critical difference between ALTER DATABASE db SET HADR OFF on a secondary and ALTER AVAILABILITY GROUP ag REMOVE DATABASE db?
SET HADR OFF, run on the secondary, removes only that replica’s local copy from the AG; use it to tear down a bad secondary before re-joining. REMOVE DATABASE at AG scope, run on the primary, removes the database from the availability group on every replica, which ends HA for that database; it is not the “fix one node” tool.
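Side by side, the two commands look like this. [SalesDb] and [AG1] are placeholder names, and the AG-wide statement is deliberately left commented out.

```sql
-- Scope: ONE replica. Run on the affected SECONDARY.
-- Removes only this replica's copy of the database from the AG;
-- the local copy is left in the RESTORING state.
ALTER DATABASE [SalesDb] SET HADR OFF;

-- Scope: the WHOLE availability group. Run on the PRIMARY.
-- Removes the database from the AG on every replica.
-- ALTER AVAILABILITY GROUP [AG1] REMOVE DATABASE [SalesDb];
```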
Path B — when do you re-seed only one replica, and what is the high-level flow?
When: That replica’s databases are stuck, won’t catch up, or you removed the DB on that node only — primary and other replicas are fine. Flow: On the affected secondary, SET HADR OFF (and DROP DATABASE or empty files per standard if required); keep the database in the AG on the primary; re-join via automatic seeding or manual full + log NORECOVERY + SET HADR AVAILABILITY GROUP; validate with DMVs and logs.
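The manual variant of the Path B flow could be sketched as follows. All names ([SalesDb], [AG1]) and the backup share path are placeholders; adapt the backup options to your own standards, and coordinate the log backup with your regular log backup job so the chain stays intact.

```sql
-- 1) On the affected SECONDARY: take the local copy out of the AG, then drop it.
ALTER DATABASE [SalesDb] SET HADR OFF;
DROP DATABASE [SalesDb];          -- only if your standard calls for a clean target

-- 2) On the PRIMARY: take a full backup and a log backup.
BACKUP DATABASE [SalesDb]
    TO DISK = N'\\backupshare\ag\SalesDb_full.bak'
    WITH COPY_ONLY, INIT;
BACKUP LOG [SalesDb]
    TO DISK = N'\\backupshare\ag\SalesDb_log.trn'
    WITH INIT;

-- 3) On the SECONDARY: restore both WITH NORECOVERY.
RESTORE DATABASE [SalesDb]
    FROM DISK = N'\\backupshare\ag\SalesDb_full.bak'
    WITH NORECOVERY;
RESTORE LOG [SalesDb]
    FROM DISK = N'\\backupshare\ag\SalesDb_log.trn'
    WITH NORECOVERY;

-- 4) On the SECONDARY: join the restored copy back to the AG.
ALTER DATABASE [SalesDb] SET HADR AVAILABILITY GROUP = [AG1];
```

With automatic seeding, steps 2 to 4 are replaced by the seeding process itself; that path assumes the replica's SEEDING_MODE is AUTOMATIC and that the secondary has been granted CREATE ANY DATABASE on the AG.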
Give a one-line interview answer for re-seeding a single bad secondary.
“I SET HADR OFF on the bad secondary to drop only that copy, then re-seed it: automatic seeding, or the same manual restore WITH NORECOVERY plus SET HADR AVAILABILITY GROUP as for a new replica.”
Path C — when do you treat the work like a full “bring back a server”, and what are the main steps?
When: OS rebuild, new VM, SQL reinstalled, node evicted, or instance not trustworthy. Steps: Re-add to WSFC (validation, quorum); install/repair SQL, enable Always On, create the HADR endpoint, and match the patch level; if stale replica metadata remains, REMOVE REPLICA for the old name, then ADD REPLICA on the primary and JOIN on the rebuilt secondary; seed (automatic or manual per add-node §3).
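Once WSFC membership and the SQL instance are back, the AG-level part of Path C might look like the sketch below. [AG1], SQLNODE2, and the endpoint URL are placeholders, and the availability, failover, and seeding modes shown are example choices rather than recommendations.

```sql
-- On the PRIMARY: remove stale metadata for the old replica, if it still exists.
ALTER AVAILABILITY GROUP [AG1] REMOVE REPLICA ON N'SQLNODE2';

-- On the PRIMARY: add the rebuilt instance back as a replica.
ALTER AVAILABILITY GROUP [AG1]
ADD REPLICA ON N'SQLNODE2'
WITH (
    ENDPOINT_URL      = N'TCP://sqlnode2.corp.example:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
    FAILOVER_MODE     = MANUAL,
    SEEDING_MODE      = AUTOMATIC
);

-- On the rebuilt SECONDARY: join the instance to the AG and
-- allow automatic seeding to create databases.
ALTER AVAILABILITY GROUP [AG1] JOIN;
ALTER AVAILABILITY GROUP [AG1] GRANT CREATE ANY DATABASE;
```

If you seed manually instead, use SEEDING_MODE = MANUAL, skip the GRANT, and follow the restore WITH NORECOVERY plus SET HADR AVAILABILITY GROUP sequence.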
Give a one-line interview answer for a rebuilt node.
“After a rebuild, I treat it like add node: WSFC membership, SQL + endpoint, remove stale replica if needed, ADD REPLICA, then seed.”
Why must you re-verify quorum and witness after a node was long missing?
Quorum and votes can change while a node is away (dynamic quorum / witness behavior). When the node returns, confirm votes, witness reachability, and that the cluster still meets your failover and split-brain assumptions.
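From inside SQL Server, the cluster DMVs give a quick view of quorum and votes after the node returns. This sketch covers only what SQL Server exposes; witness reachability and full cluster validation still belong to the WSFC tools.

```sql
-- Quorum model and current quorum state as seen by this instance.
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member (nodes and witness), its state, and its vote count.
SELECT member_name,
       member_type_desc,
       member_state_desc,          -- UP / DOWN
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```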
What happens if a synchronous partner is unavailable, and what should you know operationally?
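Commits on the primary wait for the synchronous partner only while it is connected. If the partner stops responding, the primary waits up to the session timeout (10 seconds by default), then temporarily treats that replica as asynchronous and commits without waiting for it, unless REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT (SQL Server 2017 and later) requires a synchronized secondary, in which case writes block. Until the partner resynchronizes, failing over to it risks data loss and automatic failover to it is unavailable. Know the session timeout, business RPO/RTO, and whether you will fail over, change to asynchronous, or accept the exposure (or write unavailability).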
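To know where you stand before an outage, the replica settings can be read from the catalog, and the mode change can be scripted in advance. In this sketch [AG1] and SQLNODE2 are placeholders; run the ALTER statements on the primary, and only as a deliberate decision.

```sql
-- Current availability mode, failover mode, and session timeout per replica.
SELECT ag.name                   AS ag_name,
       ar.replica_server_name,
       ar.availability_mode_desc,
       ar.failover_mode_desc,
       ar.session_timeout         -- seconds
FROM sys.availability_replicas AS ar
JOIN sys.availability_groups  AS ag
  ON ag.group_id = ar.group_id;

-- Switch an unavailable partner to asynchronous commit.
-- Automatic failover requires synchronous commit, so set failover to MANUAL first.
ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```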
After host or IP changes in a multi-subnet AG, what should you verify for clients?
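Listener registration/DNS (every listener IP is registered and resolves) and that clients set MultiSubnetFailover=True in the connection string (or use an equivalent listener/DNS configuration) so failover and subnet routing behave correctly.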
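The listener side can be checked from the primary with the catalog views below. The connection string in the comment is illustrative only; the listener name, port, and database are placeholders.

```sql
-- Listener name, port, and the state of each listener IP address.
-- In a multi-subnet AG, expect one IP per subnet, with only the
-- IP in the current primary's subnet ONLINE.
SELECT agl.dns_name,
       agl.port,
       ip.ip_address,
       ip.ip_subnet_mask,
       ip.state_desc
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_group_listener_ip_addresses AS ip
  ON ip.listener_id = agl.listener_id;

-- Example client connection string (placeholder names):
--   Server=tcp:ag1-listener,1433;Database=SalesDb;
--   Integrated Security=SSPI;MultiSubnetFailover=True
```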
In one sentence, why is confusing REMOVE DATABASE with SET HADR OFF dangerous?
The wrong command can drop HA for the database cluster-wide instead of fixing one secondary’s copy — always match the scope of the operation to the failure mode.