Observability is the ability to understand your system’s internal state from its external outputs. The three pillars — logs, metrics, and traces — each answer different questions.
Metrics: The First Signal
Metrics are aggregated numbers over time. They tell you that something is wrong.
http_requests_total{status="500", endpoint="/api/users"} 47
http_request_duration_seconds{quantile="0.99"} 2.3
Use metrics for:
- Dashboards. Request rate, error rate, latency percentiles (the RED method).
- Alerting. Fire alerts on symptom-based metrics (error rate, latency), not cause-based ones (CPU, memory).
- Capacity planning. Track resource utilization trends over weeks and months.
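Metrics are cheap to store and query because each series is a stream of numbers rather than a record of every event. As long as label cardinality stays bounded, you can retain months of data at fine granularity. Start here.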
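As a rough illustration of the RED method and a symptom-based alert, here is a minimal sketch in plain Python. The `Request` record, the function names, and the 5% error threshold are assumptions made for the example; the 2-second latency threshold mirrors the checkout alert used later in this post. A real setup would normally compute these values in a metrics backend rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled HTTP request.
    endpoint: str
    status: int
    duration_s: float


def percentile(values, q):
    """Nearest-rank percentile: the value at rank ceil(q * n) in sorted order."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for one window of requests."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p99_s": percentile(durations, 0.99) if total else 0.0,
    }


def should_alert(summary, max_error_ratio=0.05, max_p99_s=2.0):
    # Symptom-based: alert on what users experience, not on a suspected cause.
    return summary["error_ratio"] > max_error_ratio or summary["p99_s"] > max_p99_s
```

The alert rule only looks at error ratio and tail latency, so it fires whatever the underlying cause turns out to be.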
Logs: The Context
Logs are discrete events. They tell you what happened.
Structure them as JSON. Unstructured logs are expensive to search:
{
  "timestamp": "2026-03-01T14:22:03Z",
  "level": "error",
  "message": "payment processing failed",
  "service": "checkout",
  "trace_id": "abc123",
  "user_id": "u_789",
  "error": "card declined",
  "duration_ms": 340
}
Key practices:
- Include a trace ID so you can correlate logs with traces.
- Log at the right level. INFO for business events, WARN for degraded operation, ERROR for failures that need attention.
- Don’t log sensitive data. PII, tokens, and credentials should never appear in logs.
- Set a retention policy. Storing every debug log forever is expensive and rarely useful.
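The practices above can be sketched with Python's standard `logging` module. This is one possible shape, not a recommended library: `JsonFormatter`, `build_logger`, the `fields` key passed through `extra`, and the `SENSITIVE_KEYS` deny-list are all names invented for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list; a real one would come from your own data policy.
SENSITIVE_KEYS = {"password", "token", "card_number", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="seconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Fields passed via extra={"fields": {...}} are merged in,
        # with sensitive values masked before they reach the log.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
        return json.dumps(entry)


def build_logger(service, stream):
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)  # DEBUG is dropped at the source
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# Usage: log one failure with a trace ID and an accidentally included token.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.error(
    "payment processing failed",
    extra={"fields": {"trace_id": "abc123", "token": "secret-value"}},
)
entry = json.loads(buffer.getvalue())
```

Masking by key name only catches fields you anticipated; it does not replace reviewing what each call site logs.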
Traces: The Path
Traces follow a request through your system. They tell you where time is spent.
A trace is a tree of spans. Each span represents a unit of work:
[checkout-service] POST /api/checkout (450ms)
├─ [checkout-service] validate_cart (5ms)
├─ [inventory-service] check_stock (120ms)
│  └─ [database] SELECT inventory (85ms)
├─ [payment-service] charge_card (300ms)
│  └─ [stripe-api] POST /charges (280ms)
└─ [checkout-service] create_order (15ms)
This immediately shows that the payment service is the bottleneck: charge_card accounts for 300ms of the 450ms total, and 280ms of that is the external Stripe call.
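To make the "tree of spans" idea concrete, here is a small sketch that models the trace above and walks down the slowest child at each level. The `Span` class and `slowest_path` function are illustrative names, not part of any tracing library, and the self-time calculation assumes child spans run one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    name: str
    duration_ms: int
    children: list = field(default_factory=list)  # list of Span

    def self_time_ms(self):
        # Time not accounted for by child spans (assumes sequential children).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_path(span):
    """Follow the slowest child at each level; return spans from root to leaf."""
    path = [span]
    while span.children:
        span = max(span.children, key=lambda c: c.duration_ms)
        path.append(span)
    return path


trace = Span("checkout-service", "POST /api/checkout", 450, [
    Span("checkout-service", "validate_cart", 5),
    Span("inventory-service", "check_stock", 120, [
        Span("database", "SELECT inventory", 85),
    ]),
    Span("payment-service", "charge_card", 300, [
        Span("stripe-api", "POST /charges", 280),
    ]),
    Span("checkout-service", "create_order", 15),
])

bottleneck = slowest_path(trace)
for depth, span in enumerate(bottleneck):
    print("  " * depth + f"[{span.service}] {span.name} ({span.duration_ms}ms)")
```

Following the slowest child is a simplification: with parallel spans, the critical path depends on start and end times, which this sketch does not model.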
How They Work Together
A real debugging flow:
- Metric alert fires. p99 latency on /api/checkout exceeded 2 seconds.
- Check dashboard. Latency spike started at 14:15. Only affects the checkout endpoint.
- Search logs. Filter by service=checkout and timestamp > 14:15. See errors from the payment service.
- Follow a trace. Pick a slow request, open its trace. See that stripe-api calls jumped from 200ms to 2000ms.
- Root cause. Stripe is having an incident. Implement a timeout and circuit breaker.
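The remediation in the last step could look roughly like the sketch below. It assumes the wrapped call enforces its own timeout and raises on expiry (most HTTP clients accept a timeout parameter); the breaker then stops sending traffic after repeated failures. `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are illustrative choices, not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the cool-down.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap the slow dependency, for example `breaker.call(charge_card, order)` with a hypothetical `charge_card` function, and treat `CircuitOpenError` as a signal to show a fallback instead of waiting on a call that is likely to time out.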
Each pillar is useful alone. Together, they give you the full picture. Invest in correlation — trace IDs that appear in both logs and traces, and metrics that can be broken down by the same dimensions.
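Correlation is what turns a log search into a set of traces worth opening. A minimal sketch, assuming logs are already parsed into dictionaries with the fields from the JSON example earlier (the records and the `error_trace_ids` helper are made up for illustration):

```python
from datetime import datetime

# Illustrative log records; only the fields needed for the query are shown.
logs = [
    {"timestamp": "2026-03-01T14:10:00Z", "service": "checkout",
     "level": "info", "trace_id": "t1"},
    {"timestamp": "2026-03-01T14:22:03Z", "service": "checkout",
     "level": "error", "trace_id": "abc123"},
    {"timestamp": "2026-03-01T14:23:10Z", "service": "inventory",
     "level": "error", "trace_id": "t3"},
]


def parse_ts(value):
    # Normalise the trailing "Z" so older Python versions can parse it too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def error_trace_ids(records, service, since):
    """Trace IDs of error logs from one service after a given time."""
    return [
        r["trace_id"]
        for r in records
        if r["service"] == service
        and r["level"] == "error"
        and parse_ts(r["timestamp"]) > since
    ]


since = parse_ts("2026-03-01T14:15:00Z")
candidates = error_trace_ids(logs, "checkout", since)
```

Each ID in `candidates` can then be looked up in the tracing backend, which is the jump from step three to step four in the flow above.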