Cloud & DevOps · 8 min read

Observability for Developers: Logs, Metrics, and Traces

You can't fix what you can't see. A developer's guide to structured logs, the metrics that matter, and distributed tracing with OpenTelemetry.

At 3 a.m. a checkout endpoint starts returning 500s for 4% of requests. Your dashboard shows average latency of 180ms — green, healthy, boring. Meanwhile the on-call engineer is grepping a 6GB plaintext log file across eleven pods trying to find which request, which user, which downstream call actually broke. That gap between "the average looks fine" and "find the broken request" is the whole problem observability solves. Monitoring tells you the system is sick. Observability lets you ask why without shipping new code to find out.

After 17 years of being the person who gets paged, I've learned that observability is not three dashboards you bolt on at the end. It's three signals — logs, metrics, traces — that you wire in from the first commit, and that only pay off when they share a common thread. Let me show you how the pieces actually fit, with code you can paste today.

Structured logs: stop writing prose for machines

The single highest-leverage change most teams can make is to stop logging strings and start logging objects. console.log("user " + id + " failed payment") is a sentence a human wrote for another human, and it's useless at scale. You can't filter it, you can't aggregate it, and grep falls apart the moment your log line spans a stack trace.

Emit JSON instead. Every log line becomes a queryable event with stable fields. Here's a minimal structured logger in TypeScript using Pino (v9), which is the fastest serious option in Node and what I reach for by default:

import pino from "pino";
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
 
const requestContext = new AsyncLocalStorage<{ requestId: string }>();
 
const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  // Thread the request id into every line automatically.
  mixin() {
    const ctx = requestContext.getStore();
    return ctx ? { requestId: ctx.requestId } : {};
  },
  redact: {
    // Never leak secrets or PII, even if a careless caller passes them.
    paths: ["password", "*.password", "authorization", "*.token", "req.headers.cookie"],
    censor: "[REDACTED]",
  },
});
 
export function withRequest<T>(fn: () => T): T {
  const requestId = randomUUID();
  return requestContext.run({ requestId }, fn);
}
 
export { logger };

The AsyncLocalStorage trick is the part people miss. It lets you generate one requestId at the edge (or read the inbound traceparent header) and have every log line emitted anywhere in that request — three call layers deep — carry it without you passing the id around as an argument. When the 3 a.m. page hits, you filter on one requestId and the entire story of that request falls out in order.
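
To see the propagation without a logger in the way, here is a minimal sketch using only Node's standard library. The names logLine, chargeCard, placeOrder, and handleCheckout are hypothetical stand-ins (logLine plays the role of logger.info with the mixin above); none of the inner functions takes a requestId parameter, yet each line is tagged with the id of the request it belongs to, even when two requests overlap.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Stand-in for logger.info: records the id a mixin would attach to the line.
const lines: string[] = [];
function logLine(msg: string): void {
  const requestId = requestContext.getStore()?.requestId ?? "none";
  lines.push(`${requestId} ${msg}`);
}

// Innermost layer: no requestId parameter, and an await in the middle.
async function chargeCard(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 5));
  logLine("card charged");
}

async function placeOrder(): Promise<void> {
  logLine("order started");
  await chargeCard();
}

// Edge of the request: the only place that knows how the id is created.
async function handleCheckout(): Promise<string> {
  const requestId = randomUUID();
  await requestContext.run({ requestId }, async () => {
    await placeOrder();
  });
  return requestId;
}

// Two overlapping requests; each keeps its own id across the await.
async function demo(): Promise<{ ids: string[]; lines: string[] }> {
  const ids = await Promise.all([handleCheckout(), handleCheckout()]);
  return { ids, lines };
}
```

Calling demo() should leave four lines in the array, two per request id, with the two requests interleaved in time but never mixed up.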

Use levels deliberately. error means a human needs to act. warn means something degraded but recovered. info is the request-level narrative. debug is for reproduction and stays off in production. If everything is error, nothing is.

And a rule I treat as non-negotiable: never log secrets or PII. No passwords, tokens, full card numbers, raw auth headers, or unmasked emails. The redact config above is a backstop, not a license to be sloppy — logs get shipped to third-party backends, sit in cheap storage for months, and are a prime target in a breach. Treat your log pipeline as a place attackers will eventually read.
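
Redaction by path only catches fields you anticipated, so it can help to mask values before they reach the logger at all. The helpers below are a hypothetical sketch: the names maskEmail and maskCardNumber and the masking formats are my assumptions, not something Pino provides, and what counts as sufficiently masked depends on your own compliance rules.

```typescript
// Hypothetical helpers; names and masking formats are illustrative only.

// Keeps the first character and the domain: "jane.doe@example.com" -> "j***@example.com".
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  if (at <= 0) return "[REDACTED]"; // not shaped like an email: hide it entirely
  return `${email[0]}***${email.slice(at)}`;
}

// Keeps only the last four digits of a card number.
function maskCardNumber(pan: string): string {
  const digits = pan.replace(/\D/g, "");
  if (digits.length < 12) return "[REDACTED]"; // too short to be a card number
  return `****${digits.slice(-4)}`;
}

// Intended use with the logger from earlier (not executed here):
//   logger.warn({ user: maskEmail(email), card: maskCardNumber(pan) }, "payment failed");
```

Masking at the call site keeps the readable part of the value useful for support work while the redact config stays in place as the second line of defense.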

Metrics: histograms, not averages

Metrics are cheap, aggregated numbers over time — the opposite of logs. You don't want a metric per request; you want counters and distributions that answer "how is the whole system doing right now."
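
To make "distributions" concrete, here is a rough sketch of a fixed-bucket histogram. The bucket bounds and the LatencyHistogram class are assumptions for illustration; real SDKs and backends pick their own bounds and do the aggregation for you. The sample data is made up: 98 ordinary requests and 2 slow ones.

```typescript
// Bucket upper bounds in ms; the last bucket catches everything above 5000.
const BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, 5000, Infinity];

class LatencyHistogram {
  private counts = new Array<number>(BOUNDS_MS.length).fill(0);
  private total = 0;
  private sum = 0;

  record(ms: number): void {
    // First bucket whose upper bound is >= the observed value.
    const i = BOUNDS_MS.findIndex((bound) => ms <= bound);
    this.counts[i] += 1;
    this.total += 1;
    this.sum += ms;
  }

  mean(): number {
    return this.total === 0 ? 0 : this.sum / this.total;
  }

  // Upper bound of the bucket containing the q-th quantile (0 < q <= 1).
  quantileUpperBound(q: number): number {
    if (this.total === 0) return 0;
    const rank = Math.ceil(q * this.total);
    let cumulative = 0;
    for (let i = 0; i < this.counts.length; i++) {
      cumulative += this.counts[i];
      if (cumulative >= rank) return BOUNDS_MS[i];
    }
    return Infinity;
  }
}

const h = new LatencyHistogram();
for (let i = 0; i < 98; i++) h.record(120); // typical requests
for (let i = 0; i < 2; i++) h.record(4000); // slow tail

console.log(h.mean());                   // 197.6
console.log(h.quantileUpperBound(0.5));  // 250
console.log(h.quantileUpperBound(0.99)); // 5000
```

The mean lands just under 200ms while the bucket holding the 99th percentile tops out at 5 seconds. Eight counters per series are enough to expose that gap, which is why a histogram stays cheap where a per-request record would not.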

Two frameworks worth internalizing. Google's four golden signals (latency, traffic, errors, saturation) come from the SRE book. The RED method — Rate, Errors, Duration — is the request-centric subset I instrument first on every service because it's trivial to apply uniformly.

The averages trap deserves its own paragraph. An average latency of 180ms can hide the fact that your p99 is 4 seconds, because a small fraction of very slow requests barely moves the mean. Averages hide exactly the tail you care about. Always record a histogram and alert on percentiles (p95, p99). Here's RED instrumentation with the OpenTelemetry metrics API (@opentelemetry/api 1.x):

import { metrics } from "@opentelemetry/api";
 
const meter = metrics.getMeter("checkout-service");
 
const requestCounter = meter.createCounter("http.server.requests", {
  description: "Total HTTP requests (Rate + Errors via the status_class label)",
});
 
const requestDuration = meter.createHistogram("http.server.duration", {
  description: "Request duration in milliseconds (p95/p99 source)",
  unit: "ms",
});
 
export function recordRequest(route: string, statusCode: number, durationMs: number) {
  const labels = {
    route,                                   // "/checkout" — bounded set, good
    status_class: `${Math.floor(statusCode / 100)}xx`, // "5xx" — 5 values, good
  };
  requestCounter.add(1, labels);
  requestDuration.record(durationMs, labels);
}

Notice what I did not put in those labels: no userId, no requestId, no raw URL with query params. This is the cardinality trap, and it's how teams accidentally take down their own Prometheus. Every unique combination of label values creates a new time series. route has maybe 40 values; userId has millions. Multiply a few high-cardinality labels together and you've got a combinatorial explosion that blows up memory and your bill. The rule: labels are for things you'll group by in a query, and the value set must be small and bounded. High-cardinality identifiers belong in logs and trace spans, never in metric labels.
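
The arithmetic behind that explosion is just a product. The sketch below uses a hypothetical helper, worstCaseSeries, with the rough numbers from this section (about 40 routes, 5 status classes) and an assumed one million users. It is an upper bound, since not every combination of values actually occurs.

```typescript
// Worst-case time series count: the product of each label's distinct values.
function worstCaseSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

const bounded = worstCaseSeries({ route: 40, status_class: 5 });
const withUserId = worstCaseSeries({ route: 40, status_class: 5, userId: 1_000_000 });

console.log(bounded);    // 200
console.log(withUserId); // 200000000
```

Two hundred series is nothing; two hundred million is an outage. For a histogram the real number is higher still, because Prometheus-style histograms store one series per bucket for each label combination.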

Distributed tracing: the missing dimension

Logs tell you what happened in one service. Metrics tell you the aggregate health. Neither tells you that the checkout 500 was caused by a 3-second call to the inventory service, which was itself blocked on a slow Postgres query. That's a trace: a tree of timed spans that follows a single request across every service it touches.

The mechanism is context propagation. The first service generates a trace id and passes it downstream; over HTTP it travels in the W3C traceparent header (Trace Context is a W3C Recommendation). Each service creates child spans under that id. Stitch them together and you get a flame graph of where the time actually went.
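
In practice the OpenTelemetry propagators read and write this header for you, but it helps to know what is on the wire. A traceparent value has four dash-separated hex fields: version, trace id (32 chars), parent span id (16 chars), and flags (2 chars, where the lowest bit means "sampled"). The sketch below handles only version 00, and the function names parseTraceparent and childTraceparent are hypothetical.

```typescript
import { randomBytes } from "node:crypto";

interface TraceParent {
  traceId: string;  // 32 lowercase hex chars, shared by the whole trace
  parentId: string; // 16 lowercase hex chars, the caller's span id
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceParent | null {
  const match = header ? TRACEPARENT_V00.exec(header.trim()) : null;
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero ids are invalid per the Trace Context spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Header to send downstream: same trace id, a fresh span id for this hop.
function childTraceparent(incoming: TraceParent): string {
  const spanId = randomBytes(8).toString("hex");
  return `00-${incoming.traceId}-${spanId}-${incoming.sampled ? "01" : "00"}`;
}

const inbound = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const outbound = inbound ? childTraceparent(inbound) : null;
```

The trace id is the only part that stays constant across hops, which is what makes it the natural key for joining spans and, later, log lines.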

The standard here is not negotiable in 2026: it's OpenTelemetry. It's a CNCF project, vendor-neutral, and supported by every serious backend — Jaeger, Grafana Tempo, Honeycomb, Datadog, Grafana Cloud. You instrument once against the OpenTelemetry API and switch backends by changing an exporter, not your code. That's the whole pitch, and it's why I stopped writing vendor-specific instrumentation years ago.

import { trace, SpanStatusCode } from "@opentelemetry/api";
 
const tracer = trace.getTracer("checkout-service");
 
export async function reserveInventory(sku: string, qty: number) {
  return tracer.startActiveSpan("inventory.reserve", async (span) => {
    span.setAttribute("inventory.sku", sku);   // high-cardinality is fine on spans
    span.setAttribute("inventory.qty", qty);
    try {
      // inventoryClient is the service's own inventory API client, defined elsewhere.
      const result = await inventoryClient.reserve(sku, qty);
      span.setAttribute("inventory.reserved", result.reserved);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      span.end(); // Always end the span, even on the error path.
    }
  });
}

startActiveSpan makes this span the active parent, so any child span — including the auto-instrumented Postgres query inside inventoryClient — nests correctly without you wiring it manually. Add the OpenTelemetry HTTP and pg auto-instrumentation packages and you get cross-service traces almost for free.

Wiring the three together

The magic is the shared id. When you start a request, derive your log requestId from the trace context so a single trace_id links the trace, every span, and every log line. In a Pino + OpenTelemetry setup you inject the active span's context into the log mixin:

import { trace } from "@opentelemetry/api";
 
function traceMixin() {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
}
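
Pino's mixin option takes a single function, so if you want both the requestId from the first logger and the trace fields from traceMixin, one option is to merge them. composeMixins below is a hypothetical helper, and the three mixins are hard-coded stand-ins so the sketch runs without Pino or an active span.

```typescript
type Mixin = () => Record<string, unknown>;

// Runs each mixin and merges the results; later mixins win on key collisions.
function composeMixins(...mixins: Mixin[]): Mixin {
  return () => Object.assign({}, ...mixins.map((m) => m()));
}

// Stand-ins for the real mixins, with fixed values for illustration.
const requestMixin: Mixin = () => ({ requestId: "req-123" });
const activeSpanMixin: Mixin = () => ({
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  span_id: "00f067aa0ba902b7",
});
const noSpanMixin: Mixin = () => ({}); // what traceMixin returns outside a span

const fields = composeMixins(requestMixin, activeSpanMixin)();
const fieldsWithoutSpan = composeMixins(requestMixin, noSpanMixin)();

// Intended use (not executed here):
//   pino({ mixin: composeMixins(requestMixin, traceMixin) })
```

Because a mixin that has nothing to say returns an empty object, log lines emitted outside a span still carry the requestId and simply omit the trace fields.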

Now your workflow at 3 a.m. is: alert fires on the p99 latency SLO → click into the trace exemplar → see the slow span → pivot to logs filtered by that exact trace_id. Three signals, one thread. That's observability working as a system instead of three disconnected tools.

SLOs, alerting, and not going broke

A few hard-won principles that matter more than any tool choice:

Alert on symptoms, not causes. Page on "checkout error rate above 1% for 5 minutes" (something users feel), not "CPU above 80%" (something that may be totally fine). An SLO — say, 99.9% of checkout requests succeed under 500ms — defines your target. The gap between that and 100% is your error budget: permission to spend reliability on shipping fast. Burn it slowly and you're fine; burn it in an hour and you page someone.
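
The error budget is simple arithmetic, sketched here with hypothetical helper names (errorBudget, burnRate) and made-up traffic numbers. A "bad" request is one that failed or took longer than the SLO threshold; the window is whatever period your SLO is defined over.

```typescript
interface BudgetReport {
  allowedBad: number;     // bad requests the SLO tolerates in this window
  budgetConsumed: number; // 0 = untouched, 1 = fully spent, above 1 = SLO missed
}

function errorBudget(total: number, bad: number, sloTarget: number): BudgetReport {
  const allowedBad = total * (1 - sloTarget);
  const budgetConsumed = allowedBad === 0 ? Infinity : bad / allowedBad;
  return { allowedBad, budgetConsumed };
}

// Burn rate 1 spends the budget exactly over the SLO window; 10 spends it ten times faster.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// 2,000,000 checkout requests in the window, 500 of them bad, against a 99.9% SLO.
const report = errorBudget(2_000_000, 500, 0.999);
console.log(report.allowedBad.toFixed(0));     // "2000"
console.log(report.budgetConsumed.toFixed(2)); // "0.25"

// A sustained 1% error ratio against the same SLO.
console.log(burnRate(0.01, 0.999).toFixed(1)); // "10.0"
```

A quarter of the budget spent is a normal week. A burn rate of 10 is the "burn it in an hour" case, and it is the kind of number a symptom-based page can be keyed on.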

Control trace cost with sampling. Tracing every request is expensive at volume. Use tail-based sampling in the OpenTelemetry Collector so you keep 100% of errored and slow traces and sample the boring fast ones at 1-5%. You keep the traces you'd actually look at and drop the ones you never will.
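
The decision itself is easy to state in code. In a real deployment this logic lives in the Collector's tail sampling configuration, not in your service, so treat the sketch below as an illustration of the policy only. The types, the keepTrace name, and the threshold values are assumptions.

```typescript
interface FinishedTrace {
  traceId: string;
  hasError: boolean;  // any span in the trace ended with an error status
  durationMs: number; // end-to-end duration of the root span
}

interface SamplingPolicy {
  slowThresholdMs: number; // assumption: what counts as "slow" for this service
  baselineRate: number;    // fraction of ordinary traces to keep, e.g. 0.05
}

// "Tail-based" means this runs after the trace has finished, when its
// outcome and duration are known. `random` is injectable for testing.
function keepTrace(
  t: FinishedTrace,
  policy: SamplingPolicy,
  random: () => number = Math.random,
): boolean {
  if (t.hasError) return true;                              // keep every errored trace
  if (t.durationMs >= policy.slowThresholdMs) return true;  // keep every slow trace
  return random() < policy.baselineRate;                    // sample the boring ones
}

const policy: SamplingPolicy = { slowThresholdMs: 500, baselineRate: 0.05 };
const failed = { traceId: "a", hasError: true, durationMs: 40 };
const slow = { traceId: "b", hasError: false, durationMs: 3200 };
const boring = { traceId: "c", hasError: false, durationMs: 40 };
```

Head-based sampling has to decide before it knows whether a request will fail or crawl; deciding at the tail is what lets you keep 100% of the interesting traces while paying for only a few percent of the rest.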

Here's how the three signals trade off, which guides what you reach for:

Signal  | Cost per event   | Cardinality   | Retention      | Best for
Metrics | Very low         | Must stay low | Long (months)  | Dashboards, SLOs, alerting
Traces  | Medium (sampled) | High OK       | Short (days)   | Latency breakdown, cross-service flow
Logs    | High             | High OK       | Medium (weeks) | Forensic detail on one request

A starting checklist

If you're standing up observability on a new service, do these in order:

  1. Structured JSON logging with levels and a trace_id field. Add a redaction backstop for secrets and PII on day one.
  2. RED metrics on every endpoint: a request counter and a duration histogram, labels bounded and low-cardinality.
  3. OpenTelemetry tracing with auto-instrumentation for HTTP and your database, plus manual spans around the operations that actually matter.
  4. Link the three via a shared trace_id so you can pivot between them.
  5. One SLO and one symptom-based alert before you add a tenth dashboard nobody reads.
  6. Tail-based sampling in the Collector before trace volume becomes a line item you have to explain.

Get the first three right and you've already escaped the 3 a.m. grep. The rest is refinement.

Further reading

  • OpenTelemetry documentation — opentelemetry.io
  • Google SRE Book, "Monitoring Distributed Systems" (four golden signals) — sre.google
  • W3C Trace Context — www.w3.org/TR/trace-context
  • Prometheus documentation — prometheus.io