Observability is the ability to understand your system’s internal state from its external outputs. The three pillars — logs, metrics, and traces — each answer different questions.

Metrics: The First Signal

Metrics are aggregated numbers over time. They tell you that something is wrong.

http_requests_total{status="500", endpoint="/api/users"} 47
http_request_duration_seconds{quantile="0.99"} 2.3

Use metrics for:

  • Dashboards. Request rate, error rate, and latency percentiles (the RED method: Rate, Errors, Duration).
  • Alerting. Fire alerts on symptom-based metrics, not cause-based ones.
  • Capacity planning. Track resource utilization trends over weeks and months.

Metrics are cheap to store and query. You can retain months of data at fine granularity. Start here.
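
To make the RED idea concrete, here is a minimal sketch of an in-process recorder. The class name `RequestMetrics` and the nearest-rank p99 calculation are illustrative assumptions; a real service would use a metrics library such as a Prometheus client rather than hand-rolling this:

```python
from collections import Counter

class RequestMetrics:
    """Hypothetical in-process RED recorder: rate, errors, duration."""

    def __init__(self):
        self.status_counts = Counter()  # e.g. {"500": 47}
        self.durations = []             # request durations in seconds

    def observe(self, status, duration_s):
        self.status_counts[str(status)] += 1
        self.durations.append(duration_s)

    def error_rate(self):
        # Fraction of requests that returned a 5xx status.
        total = sum(self.status_counts.values())
        errors = sum(n for s, n in self.status_counts.items() if s.startswith("5"))
        return errors / total if total else 0.0

    def p99(self):
        # Nearest-rank 99th percentile; integer arithmetic avoids float edge cases.
        if not self.durations:
            return 0.0
        ordered = sorted(self.durations)
        idx = min(len(ordered) - 1, (99 * len(ordered)) // 100)
        return ordered[idx]

metrics = RequestMetrics()
for _ in range(99):
    metrics.observe(200, 0.050)
metrics.observe(500, 2.3)        # one slow, failing request

print(metrics.error_rate())      # 0.01
print(metrics.p99())             # 2.3
```

A metrics library also handles the parts this sketch skips: label cardinality, histogram buckets, and exposition to a scraper.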

Logs: The Context

Logs are discrete events. They tell you what happened.

Structure them as JSON. Unstructured logs are expensive to search:

{
  "timestamp": "2026-03-01T14:22:03Z",
  "level": "error",
  "message": "payment processing failed",
  "service": "checkout",
  "trace_id": "abc123",
  "user_id": "u_789",
  "error": "card declined",
  "duration_ms": 340
}

Key practices:

  • Include a trace ID so you can correlate logs with traces.
  • Log at the right level. INFO for business events, WARN for degraded operation, ERROR for failures that need attention.
  • Don’t log sensitive data. PII, tokens, and credentials should never appear in logs.
  • Set a retention policy. Storing every debug log forever is expensive and rarely useful.
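
The practices above can be sketched as a tiny structured logger. The field names match the JSON example; the `SENSITIVE_KEYS` denylist and the `log` helper itself are illustrative assumptions, not a real library API:

```python
import datetime
import json

# Assumption: a denylist you maintain for fields that must never be logged.
SENSITIVE_KEYS = {"password", "token", "card_number"}

def log(level, message, **fields):
    """Emit one structured JSON log line to stdout, redacting sensitive keys."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="seconds").replace("+00:00", "Z"),
        "level": level,
        "message": message,
    }
    for key, value in fields.items():
        record[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
    print(json.dumps(record))
    return record

log("error", "payment processing failed",
    service="checkout", trace_id="abc123",
    error="card declined", duration_ms=340,
    token="sk_live_xyz")   # the token field is redacted before emission
```

Redacting at the logging boundary is a last line of defense; ideally sensitive values never reach the log call at all.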

Traces: The Path

Traces follow a request through your system. They tell you where time is spent.

A trace is a tree of spans. Each span represents a unit of work:

[checkout-service] POST /api/checkout    (450ms)
  ├─ [checkout-service] validate_cart     (5ms)
  ├─ [inventory-service] check_stock      (120ms)
  │   └─ [database] SELECT inventory      (85ms)
  ├─ [payment-service] charge_card        (300ms)
  │   └─ [stripe-api] POST /charges       (280ms)
  └─ [checkout-service] create_order      (15ms)

This immediately shows that the payment service is the bottleneck, and specifically the external Stripe call.
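
Finding that bottleneck can be automated. Here is a sketch that models the trace above as a span tree and walks down the slowest child at each level; the `Span` class and `slowest_leaf` helper are illustrative, not part of any tracing SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        # Time spent in this span itself, excluding its child spans.
        return self.duration_ms - sum(c.duration_ms for c in self.children)

def slowest_leaf(span):
    """Follow the slowest child at each level down to a leaf span."""
    current = span
    while current.children:
        current = max(current.children, key=lambda c: c.duration_ms)
    return current

trace = Span("checkout-service", "POST /api/checkout", 450, [
    Span("checkout-service", "validate_cart", 5),
    Span("inventory-service", "check_stock", 120, [
        Span("database", "SELECT inventory", 85),
    ]),
    Span("payment-service", "charge_card", 300, [
        Span("stripe-api", "POST /charges", 280),
    ]),
    Span("checkout-service", "create_order", 15),
])

hot = slowest_leaf(trace)
print(hot.service, hot.name, hot.duration_ms)   # stripe-api POST /charges 280
```

Self time is just as useful: `charge_card` spends only 20ms of its 300ms outside the Stripe call, confirming the latency is external.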

How They Work Together

A real debugging flow:

  1. Metric alert fires. p99 latency on /api/checkout exceeded 2 seconds.
  2. Check dashboard. Latency spike started at 14:15. Only affects the checkout endpoint.
  3. Search logs. Filter by service=checkout and timestamp > 14:15. See errors from the payment service.
  4. Follow a trace. Pick a slow request, open its trace. See that stripe-api calls jumped from 200ms to 2000ms.
  5. Root cause. Stripe is having an incident. Implement a timeout and circuit breaker.
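
The mitigation in step 5 can be sketched as a minimal circuit breaker. The thresholds and the `CircuitBreaker` class are illustrative assumptions; production code would typically use an established resilience library and pair the breaker with a request timeout:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, then fail fast until a cooldown
    elapses, after which one probe call is allowed through (half-open)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0                    # any success closes the circuit
        return result
```

Wrapping the external payment call this way turns a 2-second hang per request into an immediate local failure while the upstream incident lasts.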

Each pillar is useful alone. Together, they give you the full picture. Invest in correlation — trace IDs that appear in both logs and traces, and metrics that can be broken down by the same dimensions.