Observability is the ability to understand your system’s internal state from its external outputs. The three pillars — logs, metrics, and traces — each answer different questions.

Metrics: The First Signal

Metrics are aggregated numbers over time. They tell you that something is wrong.

http_requests_total{status="500", endpoint="/api/users"} 47
http_request_duration_seconds{quantile="0.99"} 2.3

Use metrics for:

  • Dashboards. Request rate, error rate, and latency percentiles (the RED method: Rate, Errors, Duration).
  • Alerting. Fire alerts on symptom-based metrics, not cause-based ones.
  • Capacity planning. Track resource utilization trends over weeks and months.

Metrics are cheap to store and query. You can retain months of data at fine granularity. Start here.
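
To make the RED idea concrete, here is a minimal sketch of an in-process recorder. The class name `RequestMetrics` and the nearest-rank p99 calculation are illustrative assumptions; a real service would use a metrics library such as a Prometheus client rather than hand-rolling this:

```python
from collections import Counter

class RequestMetrics:
    """Hypothetical in-process RED recorder: rate, errors, duration."""

    def __init__(self):
        self.status_counts = Counter()  # e.g. {"500": 47}
        self.durations = []             # request durations in seconds

    def observe(self, status, duration_s):
        self.status_counts[str(status)] += 1
        self.durations.append(duration_s)

    def error_rate(self):
        # Fraction of requests that returned a 5xx status.
        total = sum(self.status_counts.values())
        errors = sum(n for s, n in self.status_counts.items() if s.startswith("5"))
        return errors / total if total else 0.0

    def p99(self):
        # Nearest-rank 99th percentile; integer arithmetic avoids float edge cases.
        if not self.durations:
            return 0.0
        ordered = sorted(self.durations)
        idx = min(len(ordered) - 1, (99 * len(ordered)) // 100)
        return ordered[idx]

metrics = RequestMetrics()
for _ in range(99):
    metrics.observe(200, 0.050)
metrics.observe(500, 2.3)        # one slow, failing request

print(metrics.error_rate())      # 0.01
print(metrics.p99())             # 2.3
```

A metrics library also handles the parts this sketch skips: label cardinality, histogram buckets, and exposition to a scraper.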

Logs: The Context

Logs are discrete events. They tell you what happened.

Structure them as JSON. Unstructured logs are expensive to search:

{
  "timestamp": "2026-03-01T14:22:03Z",
  "level": "error",
  "message": "payment processing failed",
  "service": "checkout",
  "trace_id": "abc123",
  "user_id": "u_789",
  "error": "card declined",
  "duration_ms": 340
}

Key practices:

  • Include a trace ID so you can correlate logs with traces.
  • Log at the right level. INFO for business events, WARN for degraded operation, ERROR for failures that need attention.
  • Don’t log sensitive data. PII, tokens, and credentials should never appear in logs.
  • Set a retention policy. Storing every debug log forever is expensive and rarely useful.
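
The practices above can be sketched as a tiny structured logger. The field names match the JSON example; the `SENSITIVE_KEYS` denylist and the `log` helper itself are illustrative assumptions, not a real library API:

```python
import datetime
import json

# Assumption: a denylist you maintain for fields that must never be logged.
SENSITIVE_KEYS = {"password", "token", "card_number"}

def log(level, message, **fields):
    """Emit one structured JSON log line to stdout, redacting sensitive keys."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="seconds").replace("+00:00", "Z"),
        "level": level,
        "message": message,
    }
    for key, value in fields.items():
        record[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
    print(json.dumps(record))
    return record

log("error", "payment processing failed",
    service="checkout", trace_id="abc123",
    error="card declined", duration_ms=340,
    token="sk_live_xyz")   # the token field is redacted before emission
```

Redacting at the logging boundary is a last line of defense; ideally sensitive values never reach the log call at all.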

Traces: The Path

Traces follow a request through your system. They tell you where time is spent.

A trace is a tree of spans. Each span represents a unit of work:

[checkout-service] POST /api/checkout    (450ms)
  ├─ [checkout-service] validate_cart     (5ms)
  ├─ [inventory-service] check_stock      (120ms)
  │   └─ [database] SELECT inventory      (85ms)
  ├─ [payment-service] charge_card        (300ms)
  │   └─ [stripe-api] POST /charges       (280ms)
  └─ [checkout-service] create_order      (15ms)

This immediately shows that the payment service is the bottleneck, and specifically the external Stripe call.
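
Finding that bottleneck can be automated. Here is a sketch that models the trace above as a span tree and walks down the slowest child at each level; the `Span` class and `slowest_leaf` helper are illustrative, not part of any tracing SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        # Time spent in this span itself, excluding its child spans.
        return self.duration_ms - sum(c.duration_ms for c in self.children)

def slowest_leaf(span):
    """Follow the slowest child at each level down to a leaf span."""
    current = span
    while current.children:
        current = max(current.children, key=lambda c: c.duration_ms)
    return current

trace = Span("checkout-service", "POST /api/checkout", 450, [
    Span("checkout-service", "validate_cart", 5),
    Span("inventory-service", "check_stock", 120, [
        Span("database", "SELECT inventory", 85),
    ]),
    Span("payment-service", "charge_card", 300, [
        Span("stripe-api", "POST /charges", 280),
    ]),
    Span("checkout-service", "create_order", 15),
])

hot = slowest_leaf(trace)
print(hot.service, hot.name, hot.duration_ms)   # stripe-api POST /charges 280
```

Self time is just as useful: `charge_card` spends only 20ms of its 300ms outside the Stripe call, confirming the latency is external.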

How They Work Together

A real debugging flow:

  1. Metric alert fires. p99 latency on /api/checkout exceeded 2 seconds.
  2. Check dashboard. Latency spike started at 14:15. Only affects the checkout endpoint.
  3. Search logs. Filter by service=checkout and timestamp > 14:15. See errors from the payment service.
  4. Follow a trace. Pick a slow request, open its trace. See that stripe-api calls jumped from 200ms to 2000ms.
  5. Root cause. Stripe is having an incident. Implement a timeout and circuit breaker.
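
The mitigation in step 5 can be sketched as a minimal circuit breaker. The thresholds and the `CircuitBreaker` class are illustrative assumptions; production code would typically use an established resilience library and pair the breaker with a request timeout:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, then fail fast until a cooldown
    elapses, after which one probe call is allowed through (half-open)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0                    # any success closes the circuit
        return result
```

Wrapping the external payment call this way turns a 2-second hang per request into an immediate local failure while the upstream incident lasts.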

Each pillar is useful alone. Together, they give you the full picture. Invest in correlation — trace IDs that appear in both logs and traces, and metrics that can be broken down by the same dimensions.