Clustering Restore Node Flashcards

(15 cards)

1
Q

When a replica is “disconnected,” what three layers should you triage first?

A

Network (connectivity, DNS, firewall), WSFC membership (node down or evicted, quorum, witness), and AG / data health (synchronization, suspended, RESTORING / suspect secondary). The fix depends on which layer is broken.
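A first-pass triage query for the layers above might look like the following. It is a minimal sketch, assuming it is run on the primary replica and that the login can read the Always On DMVs; it only reads state and changes nothing.

```sql
-- Read-only triage: one row per replica, as seen from the instance you run it on.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,         -- DISCONNECTED suggests network or endpoint trouble
       ars.synchronization_health_desc,  -- NOT_HEALTHY suggests AG / data health trouble
       ars.last_connect_error_number,
       ars.last_connect_error_description
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;
```

WSFC membership and quorum are not covered by this query; they need the cluster DMVs or the cluster tools.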

2
Q

How is restoring a disconnected node related to adding a new node?

A

Often similar at the end, not always at the start. A short transient outage usually needs no reinstall — fix connectivity and let redo catch up or RESUME. A bad secondary copy on an otherwise healthy instance matches the re-seed path (like add-node §3 manual seeding). A rebuilt OS/VM, evicted node, or new SQL install repeats WSFC + SQL + endpoint, then ADD REPLICA and seed — same family as add node.

3
Q

For network / security issues between replicas, what should you check first?

A

HADR endpoint port (often 5022), Windows Firewall, cloud NSGs / security groups, and DNS resolution for instance names — confirm replicas can reach each other on the expected ports and names.
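To confirm which port and URL the replicas actually expect before testing firewalls, a sketch like this can help. It assumes the instance already has a database mirroring (HADR) endpoint; the port is read from the catalog rather than assumed to be 5022.

```sql
-- Endpoint state and listening port on the local instance.
SELECT dme.name,
       dme.state_desc,            -- should be STARTED
       dme.role_desc,
       dme.connection_auth_desc,
       te.port
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
  ON te.endpoint_id = dme.endpoint_id;

-- Endpoint URL recorded in the AG for each replica; the host name here
-- is the name that DNS must resolve from the partner replicas.
SELECT replica_server_name, endpoint_url
FROM sys.availability_replicas;
```

Run the first query on each replica and compare the port with the `endpoint_url` values; a mismatch or a stopped endpoint explains a disconnect without any firewall change.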

4
Q

Where do you look when SQL is online but the AG database is not synchronizing or is suspended?

A

sys.dm_hadr_database_replica_states (columns synchronization_state and suspend_reason), plus the error logs on the primary and secondary to see why redo or mirroring stalled.
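A sketch of the DMV check described above, assuming it is run on the primary so that rows for every replica are visible. The `_desc` columns are used because they are easier to read than the numeric codes.

```sql
-- Per-database, per-replica synchronization state and queue sizes.
SELECT adc.database_name,
       ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.is_suspended,
       drs.suspend_reason_desc,
       drs.log_send_queue_size,   -- KB of log not yet sent to this secondary
       drs.redo_queue_size        -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
JOIN sys.availability_databases_cluster AS adc
  ON adc.group_database_id = drs.group_database_id
ORDER BY adc.database_name, ar.replica_server_name;
```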

5
Q

Describe Path A — transient disconnect: when does it apply and what is the usual fix?

A

When: Short network blip, controlled reboot, or suspended maintenance — no corruption, no cluster eviction. Fix: Restore connectivity and services; if SUSPENDED, run ALTER DATABASE db SET HADR RESUME (or resume per runbook); watch sys.dm_hadr_database_replica_states until synchronized (or acceptable async lag). No full backup/restore if redo catches up.
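The resume step for Path A could look like this. `SalesDb` is a hypothetical database name, and the sketch assumes data movement was suspended on the replica where the statement is run.

```sql
-- Run on the replica where the database shows is_suspended = 1.
-- [SalesDb] is a placeholder name for illustration.
ALTER DATABASE [SalesDb] SET HADR RESUME;

-- Then watch the local copy until it reports SYNCHRONIZED
-- (or SYNCHRONIZING with acceptable lag on an asynchronous replica).
SELECT synchronization_state_desc, is_suspended, redo_queue_size
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;
```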

6
Q

Give a one-line interview answer for a healthy secondary that was only disconnected.

A

“If the secondary is still in the AG and data is consistent, I fix connectivity and RESUME if needed — redo does the rest.”

7
Q

What is the critical difference between ALTER DATABASE db SET HADR OFF on a secondary and ALTER AVAILABILITY GROUP ag REMOVE DATABASE db?

A

SET HADR OFF on the secondary removes that replica’s local copy from the AG; use it to tear down a bad secondary before re-joining. REMOVE DATABASE at AG scope removes the database from the availability group everywhere: it is destructive for the AG database object and is not the “fix one node” tool.
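Side by side, the two statements look like this. `AG1` and `SalesDb` are hypothetical names; the AG-scope statement is left commented out so the sketch cannot remove the database from the group by accident.

```sql
-- Scope: ONE replica. Run on the affected secondary only.
-- The local copy leaves the AG and stays in RESTORING state.
ALTER DATABASE [SalesDb] SET HADR OFF;

-- Scope: THE WHOLE AG. Run on the primary; every replica loses the database
-- as an AG member. Shown for contrast only.
-- ALTER AVAILABILITY GROUP [AG1] REMOVE DATABASE [SalesDb];
```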

8
Q

Path B — when do you re-seed only one replica, and what is the high-level flow?

A

When: That replica’s databases are stuck, won’t catch up, or you removed the DB on that node only; the primary and other replicas are fine. Flow: On the affected secondary, SET HADR OFF (and DROP DATABASE or empty files per standard if required); keep the database in the AG on the primary; re-join via automatic seeding or a manual restore (full + log WITH NORECOVERY) followed by SET HADR AVAILABILITY GROUP = ag; validate with DMVs and logs.
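The manual variant of this flow can be sketched as follows. All names and paths (`AG1`, `SalesDb`, the backup share) are hypothetical, and the sketch assumes a recent full backup and the following log backups from the primary are reachable from the secondary.

```sql
-- 1. On the affected secondary: take only this copy out of the AG.
ALTER DATABASE [SalesDb] SET HADR OFF;

-- 2. On the same secondary: restore over the old copy, leaving it unrecovered.
--    Paths are placeholders; restore every log backup taken since the full.
RESTORE DATABASE [SalesDb]
    FROM DISK = N'\\backupshare\SalesDb_full.bak'
    WITH NORECOVERY, REPLACE;
RESTORE LOG [SalesDb]
    FROM DISK = N'\\backupshare\SalesDb_log.trn'
    WITH NORECOVERY;

-- 3. On the same secondary: join the restored copy back to the AG.
ALTER DATABASE [SalesDb] SET HADR AVAILABILITY GROUP = [AG1];
```

Nothing is run against the AG on the primary in this path, which is what keeps the other replicas untouched.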

9
Q

Give a one-line interview answer for re-seeding a single bad secondary.

A

“I SET HADR OFF on the bad secondary to drop only that copy, then re-seed: automatic seeding or the same manual restore + JOIN as a new replica.”

10
Q

Path C — when do you treat the work like a full “bring back a server”, and what are the main steps?

A

When: OS rebuild, new VM, SQL reinstalled, node evicted, or instance not trustworthy. Steps: Re-add to WSFC (validation, quorum); install/repair SQL, Always On, HADR endpoint, matching patch level; if stale replica metadata remains, REMOVE REPLICA for the old name then ADD REPLICA; seed (automatic or manual per add-node §3).
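The AG-side part of Path C might look like the sketch below, once WSFC membership, the SQL install, and the endpoint are back in place. `AG1`, `SQLNODE2`, and the endpoint URL are hypothetical; the availability and failover modes are example choices, and `SEEDING_MODE` assumes SQL Server 2016 or later.

```sql
-- On the PRIMARY: drop stale metadata for the old replica, if it still exists.
ALTER AVAILABILITY GROUP [AG1] REMOVE REPLICA ON N'SQLNODE2';

-- On the PRIMARY: add the rebuilt instance back as a replica.
ALTER AVAILABILITY GROUP [AG1]
ADD REPLICA ON N'SQLNODE2'
WITH (
    ENDPOINT_URL      = N'TCP://sqlnode2.example.com:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
    FAILOVER_MODE     = MANUAL,
    SEEDING_MODE      = AUTOMATIC
);

-- On the rebuilt SECONDARY: join the AG and let automatic seeding create databases.
ALTER AVAILABILITY GROUP [AG1] JOIN;
ALTER AVAILABILITY GROUP [AG1] GRANT CREATE ANY DATABASE;
```

With `SEEDING_MODE = MANUAL` instead, the last statement is unnecessary and each database is restored WITH NORECOVERY and joined by hand.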

11
Q

Give a one-line interview answer for a rebuilt node.

A

“After a rebuild, I treat it like add node: WSFC membership, SQL + endpoint, remove stale replica if needed, ADD REPLICA, then seed.”

12
Q

Why must you re-verify quorum and witness after a node was long missing?

A

Quorum and votes can change while a node is away (dynamic quorum / witness behavior). When the node returns, confirm votes, witness reachability, and that the cluster still meets your failover and split-brain assumptions.
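Votes and witness state can be read from SQL Server without opening the cluster tools. This is a read-only sketch and can be run on any replica that is currently a cluster member.

```sql
-- Cluster-level quorum model and state.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each node and witness, whether it is up, and how many votes it holds.
SELECT member_name,
       member_type_desc,        -- node, disk witness, or file share witness
       member_state_desc,       -- UP or DOWN
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```

Compare the vote counts with the design you expect; a returned node with zero votes, or a witness reported DOWN, changes how many failures the cluster can survive.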

13
Q

What happens if a synchronous partner is unavailable, and what should you know operationally?

A

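Commits on the primary wait for the synchronous partner until its session timeout (10 seconds by default) expires; the primary then marks that secondary NOT SYNCHRONIZED and stops waiting, so writes continue without a synchronized copy. If REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT is set above 0 (SQL Server 2017 and later), commits can instead block until a partner returns, a failover, or a mode change. Know the session timeout, business RPO/RTO, and whether you will fail over, change to asynchronous, or accept write unavailability.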
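The settings worth knowing in advance, and one possible mode change, can be sketched like this. `AG1` and `SQLNODE2` are hypothetical names, and whether to switch to asynchronous at all is a business decision, not something the sketch decides.

```sql
-- Read-only: current mode, failover mode, and session timeout (seconds) per replica.
SELECT replica_server_name,
       availability_mode_desc,
       failover_mode_desc,
       session_timeout
FROM sys.availability_replicas;

-- Possible mode change, run on the PRIMARY. A replica set to automatic failover
-- must be switched to manual failover before it can become asynchronous.
ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```

Remember to reverse both changes once the partner is healthy again, or the AG quietly stays in a weaker protection mode.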

14
Q

After host or IP changes in a multi-subnet AG, what should you verify for clients?

A

Listener registration/DNS and that clients use MultiSubnetFailover (or equivalent) so failover and subnet routing behave correctly.
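The listener side of this check can be sketched with the catalog views below; the client connection string in the comment uses a hypothetical listener name and database.

```sql
-- Listener name, port, and the state of each IP address it owns.
-- In a multi-subnet AG, typically only the IP in the primary's subnet is ONLINE.
SELECT agl.dns_name,
       agl.port,
       lip.ip_address,
       lip.ip_subnet_mask,
       lip.state_desc
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_group_listener_ip_addresses AS lip
  ON lip.listener_id = agl.listener_id;

-- Example client connection string (names are placeholders):
-- Server=tcp:ag1-listener.example.com,1433;Database=SalesDb;MultiSubnetFailover=True;
```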

15
Q

In one sentence, why is confusing REMOVE DATABASE with SET HADR OFF dangerous?

A

The wrong command can drop HA for the database cluster-wide instead of fixing one secondary’s copy — always match the scope of the operation to the failure mode.
