Reliability Engineering (Reliability & Observability)
- System monitoring across metrics, logs, and traces (full observability)
- Log centralization and distributed tracing
- Alerting and proactive response to anomalies
- Defining and managing SLIs, SLOs, and SLAs
- Measuring availability (uptime) and latency
- Incident management (runbooks, operational procedures)
- Full incident response lifecycle
- Post-mortem analysis and process improvement
- Root Cause Analysis (RCA)
- Load testing and stress testing
- Autoscaling and resource management optimization
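
As a rough illustration of the SLI/SLO and availability items in the list above, the sketch below computes a request-based availability SLI and the error budget left under an SLO target. The names `SloReport` and `evaluate_slo`, the request counts, and the 99.9% target are all hypothetical, chosen only for this example; real systems usually pull these numbers from a metrics backend over a rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float      # measured SLI: fraction of successful requests
    error_budget: float      # number of failed requests the SLO tolerates
    budget_remaining: float  # fraction of the budget left (negative if overspent)


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Hypothetical helper for illustration; slo_target is a fraction such as 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: share of requests that succeeded in the measurement window.
    availability = (total_requests - failed_requests) / total_requests
    # Error budget: how many requests may fail before the SLO is violated.
    error_budget = (1.0 - slo_target) * total_requests
    # Remaining budget as a fraction; below zero means the SLO is breached.
    budget_remaining = 1.0 - failed_requests / error_budget
    return SloReport(availability, error_budget, budget_remaining)


# Assumed example window: one million requests, 400 failures, 99.9% target.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"availability: {report.availability:.4%}")
print(f"error budget: {report.error_budget:.0f} failed requests")
print(f"budget remaining: {report.budget_remaining:.0%}")
```

A shrinking `budget_remaining` is one common signal for alerting or for pausing risky releases, though the threshold to act on is a team decision rather than something this sketch prescribes.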