SRE
Site Reliability Engineering: applying software engineering principles to operations work to keep services reliable and performant.
SLI, SLO, SLA
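SLI = measured indicator of service behavior; SLO = target value for an SLI; SLA = customer contract with consequences if the commitment is missed.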
error budget
The amount of unreliability the SLO permits over a time window (1 - SLO); for example, a 99.9% availability SLO leaves a 0.1% budget.
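The arithmetic behind an error budget can be sketched as follows. This is a minimal illustration that assumes an availability SLO measured over a rolling window in days; the function names are hypothetical and not part of any Google Cloud API.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability an availability SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    # The budget is the complement of the SLO applied to the whole window.
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Return the unused budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
```

A team that has already used most of the budget would typically slow down risky releases until the window rolls forward.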
Cloud Operations suite
Monitoring, Logging, Trace, Profiler, Error Reporting, and (now deprecated) Debugger.
export logs for long-term compliance
Aggregated sinks exporting to BigQuery or Cloud Storage.
backup and DR
Backup = a copy of data kept for restore; DR = the plan and standby resources (cold, warm, or hot) for restoring service after a major outage.
RTO and RPO
RTO = maximum acceptable time to restore service; RPO = maximum acceptable data loss, measured as a time window.
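One way to reason about these two objectives is to compare them with what a recovery plan can actually deliver. The sketch below assumes periodic backups, so the worst-case data loss is one full backup interval; `RecoveryPlan` and `meets_objectives` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RecoveryPlan:
    backup_interval_min: float  # time between consecutive backups
    restore_time_min: float     # measured time to restore service


def meets_objectives(plan: RecoveryPlan, rto_min: float, rpo_min: float) -> Dict[str, bool]:
    """Check a plan against RTO and RPO targets, all expressed in minutes."""
    # Worst case: the failure happens just before the next backup would run,
    # so everything written since the previous backup is lost.
    worst_case_loss_min = plan.backup_interval_min
    return {
        "rto_ok": plan.restore_time_min <= rto_min,
        "rpo_ok": worst_case_loss_min <= rpo_min,
    }


if __name__ == "__main__":
    nightly = RecoveryPlan(backup_interval_min=24 * 60, restore_time_min=90)
    print(meets_objectives(nightly, rto_min=120, rpo_min=60))
```

In this example the nightly backup restores quickly enough for a two-hour RTO but cannot satisfy a one-hour RPO, which would call for more frequent backups or replication.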
HA for VMs
Use Regional Managed Instance Groups with load balancing.
HA for Cloud SQL
Enable high availability: a regional instance with a primary and a standby in different zones and automatic failover.
alerts trigger automation
Cloud Monitoring alert → Pub/Sub → Cloud Function → remediation.
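The last two hops of that chain can be sketched as a function that decodes the Pub/Sub message and decides whether to act. This is only an illustration: it assumes the event carries base64-encoded JSON in `data`, as Pub/Sub-triggered background functions do, and that the payload has an `incident` object with `state` and `resource_name` fields. `restart_service` is a hypothetical placeholder for the real remediation call.

```python
import base64
import json


def restart_service(resource_name: str) -> str:
    """Hypothetical remediation step; replace with a real API call."""
    return f"restart requested for {resource_name}"


def handle_alert(event: dict, context=None) -> str:
    """Entry point shaped like a Pub/Sub-triggered background function."""
    # Pub/Sub delivers the message body base64-encoded in event["data"].
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})  # assumed payload layout
    # Act only when an incident opens; ignore close notifications.
    if incident.get("state") != "open":
        return "ignored"
    return restart_service(incident.get("resource_name", "unknown"))


if __name__ == "__main__":
    body = {"incident": {"state": "open", "resource_name": "vm-1"}}
    sample = {"data": base64.b64encode(json.dumps(body).encode("utf-8"))}
    print(handle_alert(sample))
```

Keeping the remediation in its own function makes it easy to test the decoding logic without touching real infrastructure.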
reduce toil
Automate manual, repetitive ops tasks so engineers can spend their time on work that improves reliability.
visualize network performance issues
Network Intelligence Center or Monitoring dashboards.
Blue-Green deployment
Two environments — switch traffic only after verifying the new version.
rolling deployment
Gradually update instances with no downtime.
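The batching idea behind a rolling deployment can be sketched in a few lines. This is a simplified model, not how any particular platform implements it: `update` and `healthy` are hypothetical callbacks, and `max_unavailable` is an assumed parameter that limits how many instances are updated at once.

```python
from typing import Callable, List


def rolling_update(
    instances: List[str],
    max_unavailable: int,
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Update instances batch by batch and stop at the first unhealthy batch.

    Returns the instances that were updated and passed the health check.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    done: List[str] = []
    for start in range(0, len(instances), max_unavailable):
        batch = instances[start:start + max_unavailable]
        for name in batch:
            update(name)
        # Halt the rollout so the remaining instances keep the old version.
        if not all(healthy(name) for name in batch):
            break
        done.extend(batch)
    return done


if __name__ == "__main__":
    fleet = ["vm-a", "vm-b", "vm-c", "vm-d"]
    print(rolling_update(fleet, 2, update=lambda n: None, healthy=lambda n: True))
```

Because only one batch is out of service at a time, the untouched instances keep serving traffic while the update proceeds.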
Infrastructure as Code
Define infrastructure declaratively in version-controlled files (Deployment Manager or Terraform).
service analyzes CPU/memory performance
Cloud Profiler.
service traces latency across distributed apps
Cloud Trace.
Detect suspicious IAM activity
Event Threat Detection (SCC).
Prevent overspend while maintaining uptime
Budget alerts for visibility (they notify but do not cap spend) plus sustained use and committed use discounts.
Self-healing compute layer
Managed Instance Groups with health checks.
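The self-healing decision can be illustrated with a small model: an instance is recreated once it fails a set number of consecutive health checks. The code below is a sketch of that rule only; `instances_to_recreate` and `unhealthy_threshold` are hypothetical names, and a real group performs the checks and the recreation itself.

```python
from typing import Dict, List


def instances_to_recreate(
    check_history: Dict[str, List[bool]],
    unhealthy_threshold: int = 3,
) -> List[str]:
    """Return instances whose most recent checks all failed.

    check_history maps an instance name to its health check results,
    oldest first, where True means the check passed.
    """
    to_heal: List[str] = []
    for name, results in check_history.items():
        recent = results[-unhealthy_threshold:]
        # Require a full run of consecutive failures before acting, so a
        # single dropped probe does not trigger a recreate.
        if len(recent) == unhealthy_threshold and not any(recent):
            to_heal.append(name)
    return to_heal


if __name__ == "__main__":
    history = {
        "vm-1": [True, False, False, False],
        "vm-2": [False, True, False],
    }
    print(instances_to_recreate(history))
```

Requiring several consecutive failures trades slower detection for fewer unnecessary recreations.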