Observability and monitoring for modern systems
As systems become more distributed and complex, traditional monitoring approaches fall short. According to Splunk's State of Observability, organizations with mature observability practices resolve incidents 69% faster and experience 90% fewer outages. Observability has become a competitive advantage.
The three pillars of observability
According to Datadog's Container Report, the average organization now monitors 500+ services, making observability essential for operational success.
Metrics, logs, and traces are the classic three, increasingly joined by profiling, events, and real user monitoring:

- Metrics: numeric measurements over time. What is happening?
- Logs: timestamped records of events. What happened in detail?
- Traces: request paths through distributed systems. How did it happen?
- Profiling: resource usage at the code level. Why is it slow?
- Events: significant occurrences. What changed?
- Real user monitoring (RUM): what do users actually experience?
Beyond Three Pillars: Modern observability extends beyond metrics, logs, and traces to include profiling, real user monitoring, and synthetic testing for complete visibility.
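To make the distinction concrete, the sketch below shows a single request emitting the three core signals using only the Python standard library. It is purely illustrative: a real system would use a metrics client, structured logging, and a tracing SDK (see the later examples), and the handler name, trace ID, and values here are made up.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

request_latency_seconds = []   # metric: numeric samples aggregated over time
trace_id = uuid.uuid4().hex    # trace: one ID that ties work together across services

start = time.perf_counter()
logging.info("checkout started trace_id=%s cart_items=%d", trace_id, 3)  # log: detailed event record
time.sleep(0.05)               # stand-in for real work
elapsed = time.perf_counter() - start
request_latency_seconds.append(elapsed)  # metric sample for this request
logging.info("checkout finished trace_id=%s duration_ms=%.1f", trace_id, elapsed * 1000)
```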
Monitoring vs observability
| Feature | Traditional Monitoring | Observability |
|---|---|---|
| Predefined Questions | ✓ | ✓ |
| Unknown Unknowns | ✗ | ✓ |
| Root Cause Analysis | ✗ | ✓ |
| Correlation Across Signals | ✗ | ✓ |
| High Cardinality | ✗ | ✓ |
| Ad-Hoc Investigation | ✗ | ✓ |
The RED, USE, and Four Golden Signals methods
- RED (request-oriented, for services): Rate (requests/sec), Errors (failed requests), Duration (latency). Best for request-driven services; a minimal instrumentation sketch follows this list.
- USE (resource-oriented, for infrastructure): Utilization (% of time busy), Saturation (queue length), Errors. Best for resources such as CPU, memory, and disk.
- Four Golden Signals (Google SRE): Latency, Traffic, Errors, Saturation. A comprehensive view of service health.
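As a concrete illustration of RED, here is a minimal sketch using the prometheus_client Python library; the metric names, route, simulated failure rate, and port are illustrative choices, not a standard.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter labelled by status; Duration from a histogram.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

def handle(route: str) -> None:
    start = time.perf_counter()
    status = "500" if random.random() < 0.05 else "200"  # simulate ~5% failures
    REQUESTS.labels(route=route, status=status).inc()
    LATENCY.labels(route=route).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle("/checkout")
        time.sleep(0.1)
```

In Prometheus, request rate and error ratio then fall out of rate() queries over http_requests_total, and latency percentiles out of histogram_quantile() over the histogram buckets.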
Building an observability stack
[Chart: Observability tool adoption (%)]
Observability Stack Options
| Feature | Datadog | Grafana Stack | New Relic | OpenTelemetry + Backends |
|---|---|---|---|---|
| Metrics | ✓ | ✓ | ✓ | ✓ |
| Logs | ✓ | ✓ | ✓ | ✓ |
| Traces | ✓ | ✓ | ✓ | ✓ |
| APM | ✓ | ✗ | ✓ | ✗ |
| Open Source | ✗ | ✓ | ✗ | ✓ |
| Managed Service | ✓ | ✓ | ✓ | ✗ |
OpenTelemetry: the future of instrumentation
- Vendor neutral: instrument once, send telemetry to any backend, and avoid vendor lock-in.
- Unified collection: a single SDK for metrics, logs, and traces (see the sketch below).
- Auto-instrumentation: automatic instrumentation for popular frameworks and libraries.
- Industry standard: a CNCF project with broad industry support.
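A minimal sketch of manual instrumentation with the OpenTelemetry Python SDK, exporting spans over OTLP; the collector endpoint, service name, span name, and attribute are placeholders.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to an OTLP endpoint (e.g. an OpenTelemetry Collector); swapping
# backends later is a configuration change, not a re-instrumentation.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount_cents", 1299)  # request-level detail rides on the span
```

For many popular frameworks, OpenTelemetry's auto-instrumentation (the opentelemetry-instrument launcher in Python) can add spans like these without code changes.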
Alerting best practices
- Define SLOs: set service level objectives before writing alerts.
- Alert on symptoms: page on user-facing impact, not internal metrics.
- Reduce noise: every alert should be actionable.
- Use tiered severity: not every alert needs to page someone.
- Write runbooks: link every alert to a troubleshooting guide.
- Review regularly: audit alerts quarterly and remove noisy ones.
Alert Fatigue: Teams that receive more than 20 alerts per on-call shift experience significant fatigue and miss real issues. Quality over quantity in alerting.
SLOs, SLIs, and error budgets
[Chart: Error budget consumption over time]
- Service level indicator (SLI): the metric you measure, such as p99 latency, error rate, or availability.
- Service level objective (SLO): the target for your SLI, such as 99.9% availability or p99 latency under 200 ms.
- Service level agreement (SLA): an external commitment with consequences; in effect a contractual SLO.
- Error budget: the acceptable unreliability implied by the SLO. If the SLO is 99.9%, the error budget is 0.1% of the window (worked example below).
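The arithmetic is simple enough to sanity-check by hand. The sketch below works through a 99.9% availability SLO over a 30-day window, with a made-up current error rate.

```python
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

# Error budget: the downtime (or error fraction) the SLO allows within the window.
budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)  # 43.2 minutes

# Burn rate: how fast the budget is being spent. 1.0 means the budget runs out
# exactly at the end of the window; higher values exhaust it sooner.
observed_error_rate = 0.004  # e.g. 0.4% of requests currently failing
burn_rate = observed_error_rate / (1 - SLO_TARGET)  # 4.0 -> budget gone in ~7.5 days

print(f"error budget: {budget_minutes:.1f} min, current burn rate: {burn_rate:.1f}x")
```

Burn rate is also the usual basis for SLO-based alerting: page when the budget is being consumed far faster than 1x across both a short and a long window, rather than on any single internal metric.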
Distributed tracing deep dive
[Chart: Primary uses of distributed tracing]
1. Instrument services: add tracing SDKs to every service.
2. Propagate context: pass trace IDs across service boundaries (see the propagation sketch below).
3. Collect spans: aggregate spans from all services in a central backend.
4. Visualize traces: display each request's flow through the system.
5. Analyze patterns: find common bottlenecks and failure modes.
6. Sample intelligently: keep interesting traces and sample routine ones.
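Step 2 is where most home-grown tracing efforts break down, so here is a sketch of W3C Trace Context propagation across one HTTP hop using the OpenTelemetry Python API and the requests library; the service names and URL are placeholders, and a configured TracerProvider (as in the earlier example) is assumed.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("frontend")

def call_inventory() -> None:
    """Caller side: start a span and copy its context into the outgoing headers."""
    with tracer.start_as_current_span("call-inventory"):
        headers: dict = {}
        inject(headers)  # writes the W3C `traceparent` header from the current span
        requests.get("http://inventory.internal/stock/42", headers=headers, timeout=2)

def handle_request(incoming_headers: dict) -> None:
    """Callee side: rebuild the caller's context so new spans join the same trace."""
    parent_ctx = extract(incoming_headers)
    with tracer.start_as_current_span("check-stock", context=parent_ctx):
        ...  # spans created here share the caller's trace ID
```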
Cost management for observability
[Chart: Observability cost reduction strategies (%)]
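One of the biggest cost levers is trace sampling. The sketch below shows head-based probability sampling with the OpenTelemetry Python SDK: child spans follow their parent's decision, and roughly 10% of new traces are kept (the ratio is illustrative, not a recommendation).

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the parent's sampling decision; sample ~10% of root traces otherwise.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail-based sampling, which decides after seeing the whole trace (for example via the OpenTelemetry Collector's tail sampling processor), keeps more of the slow and failed traces at the cost of extra buffering.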
Implementation roadmap
1. Foundation: basic metrics and logging with Prometheus, ELK or Loki, and core dashboards.
2. Distributed tracing: add tracing to critical paths with Jaeger or Tempo, plus trace-log correlation (see the sketch below).
3. SLO implementation: define SLOs and error budgets, and move to SLO-based alerts.
4. Automation: auto-remediation, AI-driven insights, and chaos engineering.
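The trace-log correlation mentioned in phase 2 can be as simple as stamping the active trace and span IDs onto every log line so the backend can join the two signals. The sketch below uses the standard logging module and the OpenTelemetry API; it assumes a TracerProvider is already configured, and the format string is illustrative.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace/span IDs to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

OpenTelemetry also ships a logging instrumentation package that performs this injection automatically.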
FAQ
Q: Where should we start with observability? A: Start with metrics—they're the foundation for alerting and SLOs. Add logging for debugging. Add tracing when you have distributed systems and need to understand request flows.
Q: How do we handle observability costs? A: Focus on high-value signals. Sample traces, tune log levels, limit metric cardinality. Use tiered storage for historical data. Costs should scale sublinearly with system growth.
Q: Should we build or buy our observability stack? A: Most teams benefit from managed solutions (Datadog, Grafana Cloud) for faster time to value. Consider open source (Prometheus, Jaeger) if you have strong platform engineering capacity.
Q: How do we instrument legacy systems? A: Start with infrastructure metrics, add application logging, consider sidecar proxies (Envoy) for network-level observability. Full instrumentation may not be worth the investment.
Sources and further reading
- Splunk State of Observability
- Google SRE Book
- OpenTelemetry Documentation
- Observability Engineering by Majors, Fong-Jones & Miranda
- Distributed Systems Observability by Sridharan
Build Comprehensive Observability: Implementing effective observability requires expertise in instrumentation, tooling, and practices. Our team helps organizations build observability platforms that reduce incidents and accelerate debugging. Contact us to discuss your observability strategy.
Ready to improve your observability? Connect with our SRE experts to develop a tailored monitoring strategy.



