Observability Flashcards

(62 cards)

1
Q

What is observability (practical definition)?

A

The ability to understand what’s happening inside a system using external signals (logs, metrics, traces), enabling fast diagnosis, impact assessment, and prevention of recurrence.

2
Q

Why is observability important in production?

A

It reduces time to detect and repair (MTTR), helps isolate root cause, measures user impact, and supports reliable operation of services.

3
Q

What are the three pillars of observability?

A

Logs (discrete events), Metrics (numeric time-series), Traces (end-to-end request flow across components).

4
Q

Why is relying on only one pillar (e.g., logs) insufficient?

A

Each pillar answers different questions; correlating logs + metrics + traces is needed to detect issues, localize root causes, and understand performance across services.

5
Q

What does MTTR stand for?

A

Mean Time To Recovery (or Repair): average time to restore service after an incident.

6
Q

What are logs?

A

Discrete records of events that happened at specific times, often used for debugging, auditing, and providing context.

7
Q

How are logs different from debugging with print statements?

A

Logs are production signals designed for consistency, filtering, and correlation; print statements are ad-hoc and not suitable for scalable production diagnosis.

8
Q

What should ERROR logs represent?

A

Failures that require action—unexpected exceptions, failed operations, or degraded functionality.

9
Q

What should WARN logs represent?

A

Recoverable anomalies or suspicious conditions—retries, fallbacks, partial failures, or unusual but non-fatal states.

10
Q

What should INFO logs represent?

A

Meaningful high-level events or state transitions (not noise): e.g., request completed, job processed, significant business events.

11
Q

When should DEBUG logs be used?

A

Detailed diagnostic information helpful when troubleshooting; typically disabled in production by default.

12
Q

What is structured logging?

A

Logging in a machine-parseable format with key-value fields (e.g., userId=…, status=…, traceId=…), enabling powerful querying and aggregation.
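
A minimal stdlib-only Python sketch of structured logging (the `JsonFormatter` class and the `fields` key are illustrative conventions, not a standard API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one machine-parseable JSON line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Key-value context passed by callers via extra={"fields": {...}}
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON object per event, queryable by userId, status, or traceId
logger.info("request completed",
            extra={"fields": {"userId": "u42", "status": 200, "traceId": "abc123"}})
```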

13
Q

Why is structured logging better than plain text logs?

A

It enables filtering, grouping, dashboards, and faster investigations (e.g., query all logs with traceId=abc123).

14
Q

What is a correlation ID / trace ID?

A

A unique identifier attached to a request and propagated across services so all related logs/traces can be linked.
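
A sketch of minting and propagating a correlation ID in Python using `contextvars` (the `X-Correlation-ID` header name is a common convention, not a standard):

```python
import uuid
import contextvars

# Holds the current request's ID for this execution context (works across async tasks)
_correlation_id = contextvars.ContextVar("correlation_id", default="")

def accept_request(headers: dict) -> str:
    """Reuse the caller's ID if present; otherwise mint a new one."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Attach the same ID to downstream calls so logs across services link up."""
    return {"X-Correlation-ID": _correlation_id.get()}
```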

15
Q

Why are correlation IDs valuable in microservices?

A

They allow you to follow a single request across multiple services to pinpoint where failures or latency occur.

16
Q

What are common things you should NOT log?

A

Passwords, JWT tokens, API keys/secrets, sensitive PII, or large payloads that cause log explosion.

17
Q

What is log explosion and why is it bad?

A

Excessive logging that increases cost and noise, slows debugging, and can impact performance/storage.

18
Q

What is a good rule for logging at INFO level?

A

Keep INFO logs meaningful and high-signal; avoid noisy per-line/per-loop logs.

19
Q

What are metrics?

A

Numeric measurements collected over time (time-series) used for monitoring trends, performance, and alerting.

20
Q

What does the RED method stand for?

A

Rate (traffic), Errors (error rate), Duration (latency), typically for request/endpoint monitoring.
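
The RED signals can be sketched as a toy in-process collector (a real system would use a metrics library such as the Prometheus client; all names here are illustrative):

```python
from collections import defaultdict

class RedMetrics:
    """Toy in-process collector for Rate, Errors, and Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: count of 5xx responses
        self.durations = defaultdict(list)  # Duration: latency samples (seconds)

    def observe(self, endpoint, status, seconds):
        self.requests[endpoint] += 1
        if status >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(seconds)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

m = RedMetrics()
m.observe("/orders", 200, 0.120)
m.observe("/orders", 503, 0.950)  # error_rate("/orders") is now 0.5
```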

21
Q

What does the USE method stand for?

A

Utilization, Saturation, Errors—commonly for resource monitoring (CPU, memory, queues, pools).

22
Q

What are the ‘golden signals’ of monitoring?

A

Latency, Traffic, Errors, Saturation.

23
Q

Why are averages often misleading for latency?

A

Averages hide tail latency; a small % of very slow requests can harm users without changing the mean much.

24
Q

What are latency percentiles (p50, p95, p99)?

A

p50 (the median) is the typical request latency; p95/p99 are the values below which 95%/99% of requests complete, exposing the tail experienced by the slowest 5%/1%.
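
A quick nearest-rank percentile sketch showing why the mean hides the tail (the latency numbers are made up for illustration):

```python
def percentile(samples, p):
    """Nearest-rank percentile (integer p): the ceil(p/100 * n)-th smallest sample."""
    ranked = sorted(samples)
    rank = max(1, -(-p * len(ranked) // 100))  # ceiling division, avoids float error
    return ranked[rank - 1]

# 95 fast requests and 5 very slow ones
latencies_ms = [100] * 95 + [2000] * 5
mean = sum(latencies_ms) / len(latencies_ms)  # 195 ms: looks "fine"
p50 = percentile(latencies_ms, 50)            # 100 ms
p99 = percentile(latencies_ms, 99)            # 2000 ms: the tail users actually feel
```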

25
Q

Why monitor p95/p99 latency?

A

They capture tail behavior, reveal queuing and bottlenecks, and correlate better with user experience under load.

26
Q

Name a few key application metrics for an API.

A

Request rate (RPS), error rate (4xx/5xx), latency (p95/p99), timeouts, retries, and saturation indicators (thread pool, DB pool).

27
Q

Name a few key infrastructure/resource metrics.

A

CPU utilization, memory usage, disk I/O, network I/O, container restarts, and resource saturation signals.

28
Q

What is saturation (in USE/golden signals)?

A

A measure of how ‘full’ a resource is (e.g., queue depth, thread pool exhaustion, DB connection pool at max).

29
Q

What is the goal of alerting?

A

Notify humans only when action is needed; alerts should be actionable and tied to user impact.

30
Q

What is an actionable alert?

A

An alert that clearly states what is broken, who is impacted, and what to check first, ideally linking to relevant dashboards/log queries.

31
Q

Why are ‘CPU > 80%’ alerts often poor?

A

They can be noisy and are not directly tied to user impact; alerts should focus on user-facing signals like error rate, latency, and saturation.

32
Q

Give examples of good alert conditions for an API.

A

5xx error rate > X% for Y minutes; p95 latency > threshold for Z minutes; queue depth increasing steadily; DB pool exhaustion.
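
The "X% for Y minutes" pattern can be sketched as a check over consecutive samples (the threshold and window values are placeholders to tune per service):

```python
def should_alert(error_rates, threshold=0.05, sustained_points=3):
    """Fire only when the error rate stays above the threshold for N
    consecutive samples (e.g., three 1-minute windows), not on a single blip."""
    recent = error_rates[-sustained_points:]
    return len(recent) == sustained_points and all(r > threshold for r in recent)

should_alert([0.01, 0.20, 0.01])        # one spike: no alert
should_alert([0.01, 0.08, 0.09, 0.12])  # sustained breach: alert
```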
33
Q

What is anomaly detection in alerting?

A

Alerting based on deviations from normal baseline patterns rather than fixed thresholds.

34
Q

What is an SLO?

A

Service Level Objective: a target level of reliability/performance (e.g., 99.9% success rate, p95 < 300ms).

35
Q

What is an SLI?

A

Service Level Indicator: the measured metric used to evaluate an SLO (e.g., success rate, latency percentile).

36
Q

What is an error budget?

A

The allowable amount of unreliability in a period; used to balance shipping features vs stability.
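
Error-budget arithmetic is simple enough to sketch directly (a 30-day window is assumed):

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime implied by an availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over 30 days -> about 43.2 minutes
```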
37
Q

What is distributed tracing?

A

Tracking a request across services/components using a trace composed of spans, showing timing and dependencies.
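
A toy tracer illustrating the trace/span structure (real systems would use OpenTelemetry; this only shows the shape of the data):

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: every span shares one trace_id and records its duration."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Inner spans finish (and are recorded) before their parents
            self.spans.append({
                "name": name,
                "trace_id": self.trace_id,
                "duration_ms": (time.monotonic() - start) * 1000,
            })

tracer = Tracer()
with tracer.span("GET /orders"):   # whole request
    with tracer.span("db.query"):  # a slow call shows up as a long child span
        pass
```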
38
Q

What is a span?

A

A timed operation within a trace (e.g., DB query, HTTP call) that records duration and metadata.

39
Q

Why are traces useful for latency debugging?

A

They show where time is spent across services, revealing bottlenecks like slow DB calls or downstream latency.

40
Q

What must be in place for effective tracing across services?

A

Context propagation (traceId/correlationId) across service boundaries and consistent instrumentation.

41
Q

During an incident, what is the #1 priority?

A

Mitigate user impact and restore service first; root cause comes after stabilization.

42
Q

What are the typical incident phases?

A

Detect → Triage → Stabilize/Mitigate → Diagnose → Fix → Verify → Postmortem/Prevention.

43
Q

What does triage mean in incident response?

A

Assess scope and impact: affected users, endpoints, regions, error rates, and urgency.

44
Q

Give examples of stabilization/mitigation actions.

A

Roll back the deployment, disable the feature via a flag, scale up, apply rate limiting, enable a fallback, isolate the failing dependency.

45
Q

What is a postmortem and why do it?

A

A blameless analysis after an incident to identify root cause and create action items to prevent recurrence.

46
Q

What are good postmortem action items?

A

Add tests, improve alerts/dashboards, fix code, adjust timeouts/retries, update runbooks, and refine deployment safety.

47
Q

How do you approach a spike in 500 errors?

A

Check recent deploys; identify affected endpoints; inspect logs with correlation IDs; review metrics for DB pool/timeouts; use traces to find the failing dependency; mitigate (rollback/flag), then fix.

48
Q

How do you approach p95 latency increasing?

A

Check saturation (CPU/memory/thread pools/DB pools), DB slow queries, GC pressure, downstream latency/retries; use traces to locate the bottleneck; mitigate and optimize.

49
Q

How do you debug intermittent failures?

A

Look for timeouts, retry storms, race conditions, and unstable dependencies; add metrics for retries/timeouts; improve logging correlation; consider circuit breakers/backoff.

50
Q

What’s a retry storm?

A

When many retries amplify load and failures, causing cascading degradation across services.

51
Q

Why are timeouts important in service-to-service calls?

A

Without timeouts, requests can hang, tie up threads, cause queueing, and trigger cascades under partial failures.

52
Q

What is a circuit breaker pattern?

A

A resilience pattern that temporarily stops calling a failing dependency to prevent cascading failures and allow recovery.
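
A minimal circuit-breaker sketch (the threshold values and the half-open probe policy are simplified assumptions):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe again after a cooldown."""
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False     # open: fail fast, protect the dependency

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```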
53
Q

Why is backoff important for retries?

A

It reduces immediate pressure on a failing dependency and avoids synchronized retry spikes.
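
Exponential backoff with full jitter can be sketched as follows (the base/cap values are placeholders):

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: wait a random amount in [0, min(cap, base * 2^attempt)],
    so retries from many clients spread out instead of arriving in waves."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```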
54
Q

What is a good logging standard for requests?

A

Log request start/end or completion with status, latency, and correlation ID; avoid logging sensitive payloads.
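
One way to enforce that standard is a decorator around handlers (the request shape and field names here are illustrative, not a framework API):

```python
import logging
import time
import uuid
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("http")

def logged_request(handler):
    """Emit one completion line per request: status, latency, correlation ID."""
    @wraps(handler)
    def wrapper(request):
        cid = request.get("correlation_id") or uuid.uuid4().hex
        start = time.monotonic()
        status = handler(request)
        latency_ms = (time.monotonic() - start) * 1000
        # One high-signal INFO line; no request/response bodies are logged
        log.info("request completed status=%s latency_ms=%.1f correlation_id=%s",
                 status, latency_ms, cid)
        return status
    return wrapper

@logged_request
def get_health(request):
    return 200

get_health({"correlation_id": "abc123"})
```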
55
Q

Why avoid logging full request/response bodies by default?

A

Security/PII risk, cost, and noise; prefer targeted logging and sampling when necessary.

56
Q

What’s a good first step when investigating a production issue?

A

Start with metrics to classify the problem: errors vs latency vs saturation; then drill down with logs/traces.

57
Q

How do you decide if the issue is app, DB, network, or downstream?

A

Correlate metrics (DB pool, error spikes), logs (exception types), and traces (which span is slow/failing) to isolate the component.

58
Q

What should dashboards enable during an incident?

A

Fast narrowing: error rate, latency percentiles, traffic, saturation by service/endpoint, plus links to logs and traces.

59
Q

In AWS, what’s a common place to view logs and metrics?

A

CloudWatch Logs for logs, CloudWatch Metrics/Alarms for metrics and alerting (plus dashboards).

60
Q

Name common tracing approaches/tools (conceptually).

A

Distributed tracing via OpenTelemetry instrumentation; backends like AWS X-Ray, Jaeger, Zipkin, or vendor platforms.

61
Q

What’s a strong one-liner describing your production debugging approach?

A

“Metrics first to classify the failure, then logs with correlation IDs for context, and traces to locate bottlenecks across services—mitigate first, then root cause and prevention.”

62
Q

What makes a ‘good’ alert in one sentence?

A

An alert that is user-impacting, actionable, and points you to the next diagnostic step (dashboard/log query).