Observability Flashcards

(62 cards)

1
Q

What is observability (practical definition)?

A

The ability to understand what’s happening inside a system using external signals (logs, metrics, traces), enabling fast diagnosis, impact assessment, and prevention of recurrence.

2
Q

Why is observability important in production?

A

It reduces time to detect and repair (MTTR), helps isolate root cause, measures user impact, and supports reliable operation of services.

3
Q

What are the three pillars of observability?

A

Logs (discrete events), Metrics (numeric time-series), Traces (end-to-end request flow across components).

4
Q

Why is relying on only one pillar (e.g., logs) insufficient?

A

Each pillar answers different questions; correlating logs + metrics + traces is needed to detect issues, localize root causes, and understand performance across services.

5
Q

What does MTTR stand for?

A

Mean Time To Recovery (or Repair): average time to restore service after an incident.

6
Q

What are logs?

A

Discrete records of events that happened at specific times, often used for debugging, auditing, and providing context.

7
Q

How are logs different from debugging with print statements?

A

Logs are production signals designed for consistency, filtering, and correlation; print statements are ad-hoc and not suitable for scalable production diagnosis.

8
Q

What should ERROR logs represent?

A

Failures that require action—unexpected exceptions, failed operations, or degraded functionality.

9
Q

What should WARN logs represent?

A

Recoverable anomalies or suspicious conditions—retries, fallbacks, partial failures, or unusual but non-fatal states.

10
Q

What should INFO logs represent?

A

Meaningful high-level events or state transitions (not noise): e.g., request completed, job processed, significant business events.

11
Q

When should DEBUG logs be used?

A

Detailed diagnostic information helpful when troubleshooting; typically disabled in production by default.

12
Q

What is structured logging?

A

Logging in a machine-parseable format with key-value fields (e.g., userId=…, status=…, traceId=…), enabling powerful querying and aggregation.
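
A minimal stdlib-only Python sketch of structured logging (the `JsonFormatter` class and the `fields` key are illustrative conventions, not a standard API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one machine-parseable JSON line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Key-value context passed by callers via extra={"fields": {...}}
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON object per event, queryable by userId, status, or traceId
logger.info("request completed",
            extra={"fields": {"userId": "u42", "status": 200, "traceId": "abc123"}})
```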

13
Q

Why is structured logging better than plain text logs?

A

It enables filtering, grouping, dashboards, and faster investigations (e.g., query all logs with traceId=abc123).

14
Q

What is a correlation ID / trace ID?

A

A unique identifier attached to a request and propagated across services so all related logs/traces can be linked.
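
A sketch of minting and propagating a correlation ID in Python using `contextvars` (the `X-Correlation-ID` header name is a common convention, not a standard):

```python
import uuid
import contextvars

# Holds the current request's ID for this execution context (works across async tasks)
_correlation_id = contextvars.ContextVar("correlation_id", default="")

def accept_request(headers: dict) -> str:
    """Reuse the caller's ID if present; otherwise mint a new one."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Attach the same ID to downstream calls so logs across services link up."""
    return {"X-Correlation-ID": _correlation_id.get()}
```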

15
Q

Why are correlation IDs valuable in microservices?

A

They allow you to follow a single request across multiple services to pinpoint where failures or latency occur.

16
Q

What are common things you should NOT log?

A

Passwords, JWT tokens, API keys/secrets, sensitive PII, or large payloads that cause log explosion.

17
Q

What is log explosion and why is it bad?

A

Excessive logging that increases cost and noise, slows debugging, and can impact performance/storage.

18
Q

What is a good rule for logging at INFO level?

A

Keep INFO logs meaningful and high-signal; avoid noisy per-line/per-loop logs.

19
Q

What are metrics?

A

Numeric measurements collected over time (time-series) used for monitoring trends, performance, and alerting.

20
Q

What does the RED method stand for?

A

Rate (traffic), Errors (error rate), Duration (latency), typically for request/endpoint monitoring.
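
The RED signals can be sketched as a toy in-process collector (a real system would use a metrics library such as the Prometheus client; all names here are illustrative):

```python
from collections import defaultdict

class RedMetrics:
    """Toy in-process collector for Rate, Errors, and Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: count of 5xx responses
        self.durations = defaultdict(list)  # Duration: latency samples (seconds)

    def observe(self, endpoint, status, seconds):
        self.requests[endpoint] += 1
        if status >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(seconds)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

m = RedMetrics()
m.observe("/orders", 200, 0.120)
m.observe("/orders", 503, 0.950)  # error_rate("/orders") is now 0.5
```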

21
Q

What does the USE method stand for?

A

Utilization, Saturation, Errors—commonly for resource monitoring (CPU, memory, queues, pools).

22
Q

What are the ‘golden signals’ of monitoring?

A

Latency, Traffic, Errors, Saturation.

23
Q

Why are averages often misleading for latency?

A

Averages hide tail latency; a small % of very slow requests can harm users without changing the mean much.

24
Q

What are latency percentiles (p50, p95, p99)?

A

p50 (the median) is the typical request latency; p95/p99 are the values below which 95%/99% of requests complete, exposing the tail experienced by the slowest 5%/1%.
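
A quick nearest-rank percentile sketch showing why the mean hides the tail (the latency numbers are made up for illustration):

```python
def percentile(samples, p):
    """Nearest-rank percentile (integer p): the ceil(p/100 * n)-th smallest sample."""
    ranked = sorted(samples)
    rank = max(1, -(-p * len(ranked) // 100))  # ceiling division, avoids float error
    return ranked[rank - 1]

# 95 fast requests and 5 very slow ones
latencies_ms = [100] * 95 + [2000] * 5
mean = sum(latencies_ms) / len(latencies_ms)  # 195 ms: looks "fine"
p50 = percentile(latencies_ms, 50)            # 100 ms
p99 = percentile(latencies_ms, 99)            # 2000 ms: the tail users actually feel
```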

25
Q

Why monitor p95/p99 latency?

A

They capture tail behavior, reveal queuing and bottlenecks, and correlate better with user experience under load.

26
Q

Name a few key application metrics for an API.

A

Request rate (RPS), error rate (4xx/5xx), latency (p95/p99), timeouts, retries, and saturation indicators (thread pool, DB pool).

27
Q

Name a few key infrastructure/resource metrics.

A

CPU utilization, memory usage, disk I/O, network I/O, container restarts, and resource saturation signals.

28
Q

What is saturation (in USE/golden signals)?

A

A measure of how ‘full’ a resource is (e.g., queue depth, thread pool exhaustion, DB connection pool at max).

29
Q

What is the goal of alerting?

A

Notify humans only when action is needed; alerts should be actionable and tied to user impact.

30
Q

What is an actionable alert?

A

An alert that clearly states what is broken, who is impacted, and what to check first, ideally linking to relevant dashboards/log queries.

31
Q

Why are ‘CPU > 80%’ alerts often poor?

A

They can be noisy and are not directly tied to user impact; alerts should focus on user-facing signals like error rate, latency, and saturation.

32
Q

Give examples of good alert conditions for an API.

A

5xx error rate > X% for Y minutes; p95 latency > threshold for Z minutes; queue depth increasing steadily; DB pool exhaustion.
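
The "X% for Y minutes" pattern can be sketched as a check over consecutive samples (the threshold and window values are placeholders to tune per service):

```python
def should_alert(error_rates, threshold=0.05, sustained_points=3):
    """Fire only when the error rate stays above the threshold for N
    consecutive samples (e.g., three 1-minute windows), not on a single blip."""
    recent = error_rates[-sustained_points:]
    return len(recent) == sustained_points and all(r > threshold for r in recent)

should_alert([0.01, 0.20, 0.01])        # one spike: no alert
should_alert([0.01, 0.08, 0.09, 0.12])  # sustained breach: alert
```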
33
Q

What is anomaly detection in alerting?

A

Alerting based on deviations from normal baseline patterns rather than fixed thresholds.

34
Q

What is an SLO?

A

Service Level Objective: a target level of reliability/performance (e.g., 99.9% success rate, p95 < 300ms).

35
Q

What is an SLI?

A

Service Level Indicator: the measured metric used to evaluate an SLO (e.g., success rate, latency percentile).

36
Q

What is an error budget?

A

The allowable amount of unreliability in a period; used to balance shipping features vs stability.
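
Error-budget arithmetic is simple enough to sketch directly (a 30-day window is assumed):

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime implied by an availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over 30 days -> about 43.2 minutes
```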
37
Q

What is distributed tracing?

A

Tracking a request across services/components using a trace composed of spans, showing timing and dependencies.
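
A toy tracer illustrating the trace/span structure (real systems would use OpenTelemetry; this only shows the shape of the data):

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: every span shares one trace_id and records its duration."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Inner spans finish (and are recorded) before their parents
            self.spans.append({
                "name": name,
                "trace_id": self.trace_id,
                "duration_ms": (time.monotonic() - start) * 1000,
            })

tracer = Tracer()
with tracer.span("GET /orders"):   # whole request
    with tracer.span("db.query"):  # a slow call shows up as a long child span
        pass
```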
38
Q

What is a span?

A

A timed operation within a trace (e.g., DB query, HTTP call) that records duration and metadata.

39
Q

Why are traces useful for latency debugging?

A

They show where time is spent across services, revealing bottlenecks like slow DB calls or downstream latency.

40
Q

What must be in place for effective tracing across services?

A

Context propagation (traceId/correlationId) across service boundaries and consistent instrumentation.

41
Q

During an incident, what is the #1 priority?

A

Mitigate user impact and restore service first; root cause comes after stabilization.

42
Q

What are the typical incident phases?

A

Detect → Triage → Stabilize/Mitigate → Diagnose → Fix → Verify → Postmortem/Prevention.

43
Q

What does triage mean in incident response?

A

Assess scope and impact: affected users, endpoints, regions, error rates, and urgency.

44
Q

Give examples of stabilization/mitigation actions.

A

Roll back the deployment, disable the feature via a flag, scale up, apply rate limiting, enable a fallback, isolate the failing dependency.

45
Q

What is a postmortem and why do it?

A

A blameless analysis after an incident to identify root cause and create action items to prevent recurrence.

46
Q

What are good postmortem action items?

A

Add tests, improve alerts/dashboards, fix code, adjust timeouts/retries, update runbooks, and refine deployment safety.

47
Q

How do you approach a spike in 500 errors?

A

Check recent deploys; identify affected endpoints; inspect logs with correlation IDs; review metrics for DB pool/timeouts; use traces to find the failing dependency; mitigate (rollback/flag), then fix.

48
Q

How do you approach p95 latency increasing?

A

Check saturation (CPU/memory/thread pools/DB pools), DB slow queries, GC pressure, downstream latency/retries; use traces to locate the bottleneck; mitigate and optimize.

49
Q

How do you debug intermittent failures?

A

Look for timeouts, retry storms, race conditions, and unstable dependencies; add metrics for retries/timeouts; improve logging correlation; consider circuit breakers/backoff.

50
Q

What’s a retry storm?

A

When many retries amplify load and failures, causing cascading degradation across services.

51
Q

Why are timeouts important in service-to-service calls?

A

Without timeouts, requests can hang, tie up threads, cause queueing, and trigger cascades under partial failures.

52
Q

What is a circuit breaker pattern?

A

A resilience pattern that temporarily stops calling a failing dependency to prevent cascading failures and allow recovery.
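
A minimal circuit-breaker sketch (the threshold values and the half-open probe policy are simplified assumptions):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe again after a cooldown."""
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False     # open: fail fast, protect the dependency

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```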
53
Q

Why is backoff important for retries?

A

It reduces immediate pressure on a failing dependency and avoids synchronized retry spikes.
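
Exponential backoff with full jitter can be sketched as follows (the base/cap values are placeholders):

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: wait a random amount in [0, min(cap, base * 2^attempt)],
    so retries from many clients spread out instead of arriving in waves."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```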
54
Q

What is a good logging standard for requests?

A

Log request start/end or completion with status, latency, and correlation ID; avoid logging sensitive payloads.
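
One way to enforce that standard is a decorator around handlers (the request shape and field names here are illustrative, not a framework API):

```python
import logging
import time
import uuid
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("http")

def logged_request(handler):
    """Emit one completion line per request: status, latency, correlation ID."""
    @wraps(handler)
    def wrapper(request):
        cid = request.get("correlation_id") or uuid.uuid4().hex
        start = time.monotonic()
        status = handler(request)
        latency_ms = (time.monotonic() - start) * 1000
        # One high-signal INFO line; no request/response bodies are logged
        log.info("request completed status=%s latency_ms=%.1f correlation_id=%s",
                 status, latency_ms, cid)
        return status
    return wrapper

@logged_request
def get_health(request):
    return 200

get_health({"correlation_id": "abc123"})
```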
55
Q

Why avoid logging full request/response bodies by default?

A

Security/PII risk, cost, and noise; prefer targeted logging and sampling when necessary.

56
Q

What’s a good first step when investigating a production issue?

A

Start with metrics to classify the problem: errors vs latency vs saturation; then drill down with logs/traces.

57
Q

How do you decide if the issue is app, DB, network, or downstream?

A

Correlate metrics (DB pool, error spikes), logs (exception types), and traces (which span is slow/failing) to isolate the component.

58
Q

What should dashboards enable during an incident?

A

Fast narrowing: error rate, latency percentiles, traffic, saturation by service/endpoint, plus links to logs and traces.

59
Q

In AWS, what’s a common place to view logs and metrics?

A

CloudWatch Logs for logs, CloudWatch Metrics/Alarms for metrics and alerting (plus dashboards).

60
Q

Name common tracing approaches/tools (conceptually).

A

Distributed tracing via OpenTelemetry instrumentation; backends like AWS X-Ray, Jaeger, Zipkin, or vendor platforms.

61
Q

What’s a strong one-liner describing your production debugging approach?

A

“Metrics first to classify the failure, then logs with correlation IDs for context, and traces to locate bottlenecks across services—mitigate first, then root cause and prevention.”

62
Q

What makes a ‘good’ alert in one sentence?

A

An alert that is user-impacting, actionable, and points you to the next diagnostic step (dashboard/log query).