Observability and monitoring for modern systems
As systems become more distributed and complex, traditional monitoring approaches fall short. According to Splunk's State of Observability, organizations with mature observability practices resolve incidents 69% faster and experience 90% fewer outages. Observability has become a competitive advantage.
The three pillars of observability
According to Datadog's Container Report, the average organization now monitors 500+ services, making observability essential for operational success.
Metrics, logs, and traces are the classic three, increasingly joined by profiling, events, and real user monitoring:

- Metrics: numeric measurements over time. What is happening?
- Logs: timestamped records of events. What happened in detail?
- Traces: request paths through distributed systems. How did it happen?
- Profiling: resource usage at the code level. Why is it slow?
- Events: significant occurrences. What changed?
- Real user monitoring (RUM): what do users actually experience?
Beyond Three Pillars: Modern observability extends beyond metrics, logs, and traces to include profiling, real user monitoring, and synthetic testing for complete visibility.
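To make the distinction concrete, the sketch below shows a single request emitting the three core signals using only the Python standard library. It is purely illustrative: a real system would use a metrics client, structured logging, and a tracing SDK (see the later examples), and the handler name, trace ID, and values here are made up.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

request_latency_seconds = []   # metric: numeric samples aggregated over time
trace_id = uuid.uuid4().hex    # trace: one ID that ties work together across services

start = time.perf_counter()
logging.info("checkout started trace_id=%s cart_items=%d", trace_id, 3)  # log: detailed event record
time.sleep(0.05)               # stand-in for real work
elapsed = time.perf_counter() - start
request_latency_seconds.append(elapsed)  # metric sample for this request
logging.info("checkout finished trace_id=%s duration_ms=%.1f", trace_id, elapsed * 1000)
```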
Monitoring vs observability
| Feature | Traditional Monitoring | Observability |
|---|---|---|
| Predefined Questions | ✓ | ✓ |
| Unknown Unknowns | ✗ | ✓ |
| Root Cause Analysis | ✗ | ✓ |
| Correlation Across Signals | ✗ | ✓ |
| High Cardinality | ✗ | ✓ |
| Ad-Hoc Investigation | ✗ | ✓ |
The RED, USE, and Four Golden Signals methods
- RED (request-oriented, for services): Rate (requests/sec), Errors (failed requests), Duration (latency). Best for request-driven services; a minimal instrumentation sketch follows this list.
- USE (resource-oriented, for infrastructure): Utilization (% of time busy), Saturation (queue length), Errors. Best for resources such as CPU, memory, and disk.
- Four Golden Signals (Google SRE): Latency, Traffic, Errors, Saturation. A comprehensive view of service health.
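As a concrete illustration of RED, here is a minimal sketch using the prometheus_client Python library; the metric names, route, simulated failure rate, and port are illustrative choices, not a standard.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter labelled by status; Duration from a histogram.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

def handle(route: str) -> None:
    start = time.perf_counter()
    status = "500" if random.random() < 0.05 else "200"  # simulate ~5% failures
    REQUESTS.labels(route=route, status=status).inc()
    LATENCY.labels(route=route).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle("/checkout")
        time.sleep(0.1)
```

In Prometheus, request rate and error ratio then fall out of rate() queries over http_requests_total, and latency percentiles out of histogram_quantile() over the histogram buckets.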
Building an observability stack
[Chart: Observability tool adoption (%)]
Observability Stack Options
| Feature | Datadog | Grafana Stack | New Relic | OpenTelemetry + Backends |
|---|---|---|---|---|
| Metrics | ✓ | ✓ | ✓ | ✓ |
| Logs | ✓ | ✓ | ✓ | ✓ |
| Traces | ✓ | ✓ | ✓ | ✓ |
| APM | ✓ | ✗ | ✓ | ✗ |
| Open Source | ✗ | ✓ | ✗ | ✓ |
| Managed Service | ✓ | ✓ | ✓ | ✗ |
OpenTelemetry: the future of instrumentation
- Vendor neutral: instrument once, send telemetry to any backend, and avoid vendor lock-in.
- Unified collection: a single SDK for metrics, logs, and traces (see the sketch below).
- Auto-instrumentation: automatic instrumentation for popular frameworks and libraries.
- Industry standard: a CNCF project with broad industry support.
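A minimal sketch of manual instrumentation with the OpenTelemetry Python SDK, exporting spans over OTLP; the collector endpoint, service name, span name, and attribute are placeholders.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to an OTLP endpoint (e.g. an OpenTelemetry Collector); swapping
# backends later is a configuration change, not a re-instrumentation.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount_cents", 1299)  # request-level detail rides on the span
```

For many popular frameworks, OpenTelemetry's auto-instrumentation (the opentelemetry-instrument launcher in Python) can add spans like these without code changes.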
Alerting best practices
- Define SLOs: set service level objectives before writing alerts.
- Alert on symptoms: page on user-facing impact, not internal metrics.
- Reduce noise: every alert should be actionable.
- Use tiered severity: not every alert needs to page someone.
- Write runbooks: link every alert to a troubleshooting guide.
- Review regularly: audit alerts quarterly and remove noisy ones.
Alert Fatigue: Teams that receive more than 20 alerts per on-call shift experience significant fatigue and miss real issues. Quality over quantity in alerting.
SLOs, SLIs, and error budgets
[Chart: Error budget consumption over time]
- Service level indicator (SLI): the metric you measure, such as p99 latency, error rate, or availability.
- Service level objective (SLO): the target for your SLI, such as 99.9% availability or p99 latency under 200 ms.
- Service level agreement (SLA): an external commitment with consequences; in effect a contractual SLO.
- Error budget: the acceptable unreliability implied by the SLO. If the SLO is 99.9%, the error budget is 0.1% of the window (worked example below).
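The arithmetic is simple enough to sanity-check by hand. The sketch below works through a 99.9% availability SLO over a 30-day window, with a made-up current error rate.

```python
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

# Error budget: the downtime (or error fraction) the SLO allows within the window.
budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)  # 43.2 minutes

# Burn rate: how fast the budget is being spent. 1.0 means the budget runs out
# exactly at the end of the window; higher values exhaust it sooner.
observed_error_rate = 0.004  # e.g. 0.4% of requests currently failing
burn_rate = observed_error_rate / (1 - SLO_TARGET)  # 4.0 -> budget gone in ~7.5 days

print(f"error budget: {budget_minutes:.1f} min, current burn rate: {burn_rate:.1f}x")
```

Burn rate is also the usual basis for SLO-based alerting: page when the budget is being consumed far faster than 1x across both a short and a long window, rather than on any single internal metric.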
Distributed tracing deep dive
[Chart: Primary uses of distributed tracing]
1. Instrument services: add tracing SDKs to every service.
2. Propagate context: pass trace IDs across service boundaries (see the propagation sketch below).
3. Collect spans: aggregate spans from all services in a central backend.
4. Visualize traces: display each request's flow through the system.
5. Analyze patterns: find common bottlenecks and failure modes.
6. Sample intelligently: keep interesting traces and sample routine ones.
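Step 2 is where most home-grown tracing efforts break down, so here is a sketch of W3C Trace Context propagation across one HTTP hop using the OpenTelemetry Python API and the requests library; the service names and URL are placeholders, and a configured TracerProvider (as in the earlier example) is assumed.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("frontend")

def call_inventory() -> None:
    """Caller side: start a span and copy its context into the outgoing headers."""
    with tracer.start_as_current_span("call-inventory"):
        headers: dict = {}
        inject(headers)  # writes the W3C `traceparent` header from the current span
        requests.get("http://inventory.internal/stock/42", headers=headers, timeout=2)

def handle_request(incoming_headers: dict) -> None:
    """Callee side: rebuild the caller's context so new spans join the same trace."""
    parent_ctx = extract(incoming_headers)
    with tracer.start_as_current_span("check-stock", context=parent_ctx):
        ...  # spans created here share the caller's trace ID
```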
Cost management for observability
[Chart: Observability cost reduction strategies (%)]
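One of the biggest cost levers is trace sampling. The sketch below shows head-based probability sampling with the OpenTelemetry Python SDK: child spans follow their parent's decision, and roughly 10% of new traces are kept (the ratio is illustrative, not a recommendation).

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the parent's sampling decision; sample ~10% of root traces otherwise.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail-based sampling, which decides after seeing the whole trace (for example via the OpenTelemetry Collector's tail sampling processor), keeps more of the slow and failed traces at the cost of extra buffering.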
Implementation roadmap
1. Foundation: basic metrics and logging with Prometheus, ELK or Loki, and core dashboards.
2. Distributed tracing: add tracing to critical paths with Jaeger or Tempo, plus trace-log correlation (see the sketch below).
3. SLO implementation: define SLOs and error budgets, and move to SLO-based alerts.
4. Automation: auto-remediation, AI-driven insights, and chaos engineering.
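The trace-log correlation mentioned in phase 2 can be as simple as stamping the active trace and span IDs onto every log line so the backend can join the two signals. The sketch below uses the standard logging module and the OpenTelemetry API; it assumes a TracerProvider is already configured, and the format string is illustrative.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace/span IDs to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

OpenTelemetry also ships a logging instrumentation package that performs this injection automatically.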
FAQ
Q: Where should we start with observability? A: Start with metrics—they're the foundation for alerting and SLOs. Add logging for debugging. Add tracing when you have distributed systems and need to understand request flows.
Q: How do we handle observability costs? A: Focus on high-value signals. Sample traces, tune log levels, limit metric cardinality. Use tiered storage for historical data. Costs should scale sublinearly with system growth.
Q: Should we build or buy our observability stack? A: Most teams benefit from managed solutions (Datadog, Grafana Cloud) for faster time to value. Consider open source (Prometheus, Jaeger) if you have strong platform engineering capacity.
Q: How do we instrument legacy systems? A: Start with infrastructure metrics, add application logging, consider sidecar proxies (Envoy) for network-level observability. Full instrumentation may not be worth the investment.
Sources and further reading
- Splunk State of Observability
- Google SRE Book
- OpenTelemetry Documentation
- Observability Engineering by Majors, Fong-Jones & Miranda
- Distributed Systems Observability by Sridharan
Build Comprehensive Observability: Implementing effective observability requires expertise in instrumentation, tooling, and practices. Our team helps organizations build observability platforms that reduce incidents and accelerate debugging. Contact us to discuss your observability strategy.
Ready to improve your observability? Connect with our SRE experts to develop a tailored monitoring strategy.



