What is observability (practical definition)?
The ability to understand what’s happening inside a system using external signals (logs, metrics, traces), enabling fast diagnosis, impact assessment, and prevention of recurrence.
Why is observability important in production?
It reduces time to detect and repair (MTTR), helps isolate root cause, measures user impact, and supports reliable operation of services.
What are the three pillars of observability?
Logs (discrete events), Metrics (numeric time-series), Traces (end-to-end request flow across components).
Why is relying on only one pillar (e.g., logs) insufficient?
Each pillar answers different questions; correlating logs + metrics + traces is needed to detect issues, localize root causes, and understand performance across services.
What does MTTR stand for?
Mean Time To Recovery (or Repair): average time to restore service after an incident.
What are logs?
Discrete records of events that happened at specific times, often used for debugging, auditing, and providing context.
How are logs different from debugging with print statements?
Logs are production signals designed for consistency, filtering, and correlation; print statements are ad-hoc and not suitable for scalable production diagnosis.
What should ERROR logs represent?
Failures that require action—unexpected exceptions, failed operations, or degraded functionality.
What should WARN logs represent?
Recoverable anomalies or suspicious conditions—retries, fallbacks, partial failures, or unusual but non-fatal states.
What should INFO logs represent?
Meaningful high-level events or state transitions (not noise): e.g., request completed, job processed, significant business events.
When should DEBUG logs be used?
Detailed diagnostic information helpful when troubleshooting; typically disabled in production by default.
What is structured logging?
Logging in a machine-parseable format with key-value fields (e.g., userId=…, status=…, traceId=…), enabling powerful querying and aggregation.
Why is structured logging better than plain text logs?
It enables filtering, grouping, dashboards, and faster investigations (e.g., query all logs with traceId=abc123).
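One minimal way to produce structured logs in Python is a custom JSON formatter (the formatter class, logger name, and field names are assumptions for this sketch; production systems typically use a library such as structlog or their platform's JSON logging):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object with key-value fields."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Structured fields attached via logger.info(..., extra={"fields": {...}})
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Queryable fields instead of free text:
logger.info("request completed",
            extra={"fields": {"userId": "u-17", "status": 200, "traceId": "abc123"}})
```

Each line is now machine-parseable, so a log backend can answer queries like "all records where traceId=abc123".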
What is a correlation ID / trace ID?
A unique identifier attached to a request and propagated across services so all related logs/traces can be linked.
Why are correlation IDs valuable in microservices?
They allow you to follow a single request across multiple services to pinpoint where failures or latency occur.
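A sketch of correlation-ID propagation using Python's `contextvars` (the header name `X-Trace-Id` and function names are assumptions; real systems usually follow a standard such as W3C Trace Context):

```python
import contextvars
import uuid

# Context variable holding the correlation ID for the current request.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_request(incoming_trace_id=None):
    """Reuse the caller's trace ID if one arrived, otherwise mint a new one."""
    tid = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(tid)
    return tid

def log(message):
    """Every log line automatically carries the current trace ID."""
    print(f"traceId={trace_id_var.get()} {message}")

def outgoing_headers():
    """Attach the ID to downstream calls so the next service can continue the chain."""
    return {"X-Trace-Id": trace_id_var.get()}
```

With this in place, logs from service A and service B for the same request share one ID and can be joined in the log backend.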
What are common things you should NOT log?
Passwords, JWT tokens, API keys/secrets, sensitive PII, or large payloads that cause log explosion.
What is log explosion and why is it bad?
Excessive logging that increases cost and noise, slows debugging, and can impact performance/storage.
What is a good rule for logging at INFO level?
Keep INFO logs meaningful and high-signal; avoid noisy per-line/per-loop logs.
What are metrics?
Numeric measurements collected over time (time-series) used for monitoring trends, performance, and alerting.
What does the RED method stand for?
Rate (traffic), Errors (error rate), Duration (latency), typically for request/endpoint monitoring.
What does the USE method stand for?
Utilization, Saturation, Errors—commonly for resource monitoring (CPU, memory, queues, pools).
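The RED numbers can be derived from a window of observed requests; a minimal sketch (the `Request` record and field names are invented, and real systems compute these from counters/histograms rather than raw lists):

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    ok: bool

def red_summary(requests, window_s):
    """Compute Rate, Errors, Duration over a window of observed requests."""
    rate = len(requests) / window_s                                   # requests/sec
    error_ratio = sum(1 for r in requests if not r.ok) / len(requests)
    mean_ms = sum(r.duration_ms for r in requests) / len(requests)    # mean latency
    return {"rate_rps": rate, "error_ratio": error_ratio, "mean_ms": mean_ms}
```

In practice Duration is tracked as a distribution (percentiles), not just a mean, for the reasons covered below.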
What are the ‘golden signals’ of monitoring?
Latency, Traffic, Errors, Saturation.
Why are averages often misleading for latency?
Averages hide tail latency; a small % of very slow requests can harm users without changing the mean much.
What are latency percentiles (p50, p95, p99)?
p50 (the median) is the typical request latency; p95/p99 are the values below which 95%/99% of requests complete, so the slowest 5%/1% of requests take at least that long.
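A tiny numeric sketch of why the mean hides the tail (latency values are invented; `percentile` uses the simple nearest-rank method):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

# 98 fast requests, 2 very slow ones (milliseconds).
latencies = [100] * 98 + [2000, 5000]

mean_ms = sum(latencies) / len(latencies)  # 168 ms: looks healthy
p50_ms = percentile(latencies, 50)         # 100 ms: typical request
p99_ms = percentile(latencies, 99)         # 2000 ms: the tail users actually feel
```

The mean (168 ms) barely moves, while p99 reveals that 1% of users wait 2 seconds or more.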