Your system went down last Tuesday. Not for long, maybe eighteen minutes. But you found out from a user on WhatsApp, not from your own tooling. When you finally got into the logs, there were 40,000 lines of noise and one relevant error buried somewhere in the middle.
That's not a monitoring problem. That's an observability problem.
Observability rests on three signals: logs, metrics, and traces. Together they are the foundation of any system you can actually debug under pressure. They're not interchangeable. They don't replace each other. And if you only have one of them, you're flying mostly blind when it matters most.
Here's what each one actually does.
- Logs record specific events (what happened). If they aren't structured (JSON), they are useless at scale.
- Metrics measure system health over time (is something wrong?). Monitor percentiles (p99), never averages.
- Traces map a single request across multiple services (where did the time go?). Essential for microservices.
- Observability is not something you add later. Retrofitting it during a production incident is a guaranteed failure.
Logs: The Crime Scene Photos
A log is a timestamped record of something that happened. An event. A state change. An error. Logs are the closest thing software has to a witness statement — and like witness statements, their value depends entirely on how well you structured them before the incident.
They're invaluable for debugging specific moments. When something breaks at 2:47 AM, logs tell you exactly what happened in that window: which request came in, which function threw, which dependency timed out. You can't reconstruct that from a graph.
The problem is that most teams treat logs as their entire observability strategy. They're not. Logs are expensive to store, slow to query at scale, and structureless unless you enforce structure. Searching through unformatted log output isn't engineering — it's archaeology.
A few things worth enforcing from the start:
Structured logs only. JSON or nothing. Plain text logs are unqueryable at scale. If your logging library outputs "ERROR: user not found at line 42", you can't filter by user ID, request ID, or service name without regex nightmares. JSON lets you filter, aggregate, and alert on any field programmatically.
Log levels with discipline. DEBUG stays off in production. INFO for significant events. WARN for expected-but-problematic states. ERROR for things that need a human to act. If everything is ERROR, nothing is.
Correlation IDs on every request. Every request that enters your system gets a unique identifier. That ID travels through every service, every function call, every log line. Without it, you cannot follow a single user's request across a distributed system. With it, you can pull every log line for request req-abc-123 across six services in thirty seconds.
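The structured-logging and correlation-ID points above can be sketched with Python's standard library alone. This is a minimal illustration, not a production setup: the service name "checkout", the `fields` key passed through `extra`, and the `handle_request` function are all hypothetical names chosen for the example.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID of whichever request is currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        }
        # Merge structured fields passed as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, user_id, incoming_request_id=None):
    # Reuse the caller's ID if one arrived; otherwise mint a new one.
    request_id_var.set(incoming_request_id or f"req-{uuid.uuid4().hex[:12]}")
    logger.error("user not found", extra={"fields": {"user_id": user_id}})


stream = io.StringIO()
logger = build_logger(stream)
handle_request(logger, user_id="u-42", incoming_request_id="req-abc-123")

entry = json.loads(stream.getvalue())
print(entry["request_id"], entry["user_id"], entry["level"])
```

Because every line carries `request_id` and `user_id` as real fields, a log backend can filter on them directly instead of running regexes over free text.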
Logs answer: what happened, and when?
Metrics: The Signal Your On-Call Engineer Actually Needs
A metric is a numeric measurement over time. Request rate. Error rate. Latency percentiles. CPU usage. Queue depth. These are the signals you watch continuously — not just when something breaks.
Prometheus is the dominant metric collection system for self-hosted setups. Grafana sits in front of it for dashboards and alerting. If you're on AWS, CloudWatch handles this for managed services, though its query syntax will test your patience.
Metrics are cheap compared to logs. A time series that says "p99 latency is 340ms" takes bytes to store and milliseconds to query. Extracting the same information from logs requires parsing millions of lines.
The patterns that matter most for a growing product:
The four golden signals: rate (requests per second), errors (failed requests per second), latency (distribution, not average), and saturation (how full your system is). If you instrument nothing else, instrument these four per service. They'll catch most production problems before users do.
Percentiles, not averages. Average latency is close to useless. If 99% of requests complete in 50ms but 1% take 30 seconds, your average might look acceptable while users are abandoning checkout. Track p50, p95, p99. For financial transactions, track p99.9 (often written p999).
Business metrics alongside technical metrics. Conversion rate, payment success rate, checkout completion — these belong in the same dashboards as CPU and memory. A 40% drop in successful GoPay transactions is worth more signal than a CPU spike alert.
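To make the percentile point concrete, here is a small worked example. It assumes a made-up sample of 1,000 requests in which 2% are slow (slightly more than the 1% in the text, so that the tail shows up at p99 under a simple nearest-rank calculation). The `percentile` helper is an illustrative sketch, not what Prometheus or any other metrics backend does internally.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 980 fast requests, 20 very slow ones.
latencies_ms = [50] * 980 + [30_000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms}ms p50={p50_ms}ms p95={p95_ms}ms p99={p99_ms}ms")
# → mean=649.0ms p50=50ms p95=50ms p99=30000ms
```

The mean describes a request that never happened: nobody waited 649ms. The median says most users are fine, and p99 shows that the unlucky ones waited thirty seconds.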
Metrics answer: is something wrong, and how wrong is it?
Traces: The Thread Through the Maze
A trace follows one specific request through every service it touches. Not a summary. Not an average. One request, showing exactly where time was spent across your entire system.
OpenTelemetry is now the standard instrumentation layer — it's vendor-neutral, which matters because it means you instrument once and can change backends without re-instrumenting your codebase. Jaeger and Grafana Tempo are common open-source backends. Datadog, Honeycomb, and AWS X-Ray are managed options.
A trace is made of spans. Each span represents a unit of work: an HTTP call, a database query, a function execution. Spans have start times, durations, and parent-child relationships that produce a timeline. For one specific request, that timeline shows: 12ms in the auth service, 8ms parsing the payload, 340ms waiting on a Postgres query, 6ms formatting the response.
That Postgres query. That's your bottleneck. You wouldn't have found it from logs — nothing failed, so nothing was logged at ERROR level. You wouldn't have found it from metrics alone — latency was elevated, but the metric doesn't know which service or which query caused it. The trace tells you exactly where the time went, on that specific request, at that specific moment.
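The span timeline described above can be modelled as a small data structure. This sketch reuses the durations from the text; the span IDs, the root span name "GET /orders", and the `Span` class are hypothetical and far simpler than a real OpenTelemetry span.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int


# One request: a root span and four child spans, timings taken from the text.
trace = [
    Span("s1", None, "GET /orders", 0, 366),
    Span("s2", "s1", "auth service", 0, 12),
    Span("s3", "s1", "parse payload", 12, 8),
    Span("s4", "s1", "postgres query", 20, 340),
    Span("s5", "s1", "format response", 360, 6),
]


def render(spans):
    """Return the spans as an indented timeline, ordered by start time."""
    lines = []
    for span in sorted(spans, key=lambda s: s.start_ms):
        indent = "  " if span.parent_id else ""
        lines.append(f"{indent}{span.name}: starts +{span.start_ms}ms, takes {span.duration_ms}ms")
    return "\n".join(lines)


children = [s for s in trace if s.parent_id is not None]
bottleneck = max(children, key=lambda s: s.duration_ms)
root = next(s for s in trace if s.parent_id is None)
share = bottleneck.duration_ms / root.duration_ms

print(render(trace))
print(f"bottleneck: {bottleneck.name} ({share:.0%} of the request)")
```

No span here has an error status, which is the point: the slow query is visible only because each unit of work carries its own duration.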
Traces become essential once you have more than two services talking to each other. A monolith doesn't need distributed tracing. But the moment your backend calls an auth service, which calls a product catalogue, which calls a payment API, which calls an inventory service, traces become the only tool that gives you a complete picture of a single request's journey.
Traces answer: where did the time go, and in which service?
The Part Most People Get Wrong
The common mistake: treating observability as something to add later, once the system gets complex.
It's not. The right time is before your first production incident.
When your system is small, adding OpenTelemetry spans and structured logging takes two to three days. Retrofitting it into a running distributed system, during an ongoing incident, under pressure, is a week of work done at the worst possible time. The engineers who are most grateful for good observability are the ones who set it up before they needed it.
The second mistake: treating high log volume as thoroughness. Teams log every function entry, every variable state, every routine database call — and then wonder why their Datadog bill is $4,000 a month. More logs aren't better logs. What you want is the right signal at the right verbosity.
Log sampling helps. If you have an endpoint handling 50,000 requests per minute and 99.9% succeed identically, you don't need a log entry for every success. Log failures at 100%. Sample successes at 1% or less. You'll catch every problem without paying to store data you'll never query.
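A sampling rule like this can be sketched as a standard `logging.Filter`. The class name `SuccessSampler` and the choice to treat WARNING and above as "always keep" are assumptions made for the example; the 1% rate comes from the text.

```python
import logging
import random


class SuccessSampler(logging.Filter):
    """Keep every WARNING-or-worse record; keep only a fraction of the rest."""

    def __init__(self, sample_rate=0.01, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # problems are never dropped
        return self.rng.random() < self.sample_rate


# Seeded generator so the demonstration is repeatable.
sampler = SuccessSampler(sample_rate=0.01, rng=random.Random(7))

successes = [logging.makeLogRecord({"levelno": logging.INFO, "msg": "ok"}) for _ in range(10_000)]
failures = [logging.makeLogRecord({"levelno": logging.ERROR, "msg": "boom"}) for _ in range(25)]

kept_successes = sum(1 for record in successes if sampler.filter(record))
kept_failures = sum(1 for record in failures if sampler.filter(record))

print(f"kept {kept_successes} of {len(successes)} successes, {kept_failures} of {len(failures)} failures")
```

Attaching it is one line, `handler.addFilter(SuccessSampler())`. A real system would likely make the keep-or-drop decision once per request rather than once per line, so that a sampled request keeps all of its log lines.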
The third mistake is specific to distributed systems and it's painful: not propagating trace context across service boundaries. If your trace ID doesn't flow from service A to service B through your HTTP headers, your traces fragment. You get a half-trace that looks complete but stops at the first service boundary. OpenTelemetry handles context propagation automatically, but only if every service is instrumented and configured with a compatible propagator from the start. Don't assume the defaults cover every hop.
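To show what actually has to travel between services, here is a hand-rolled sketch of the W3C Trace Context `traceparent` header (version `00`, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a 2-hex-character flags field). In practice an OpenTelemetry SDK does this for you; the `inject` and `extract` helpers below are simplified stand-ins, and the plain dict is case-sensitive where real HTTP headers are not.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read trace context from incoming headers; None if missing or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return trace_id, parent_span_id, flags


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_a = new_span_id()
outgoing = inject({}, trace_id, span_a)

# Service B continues the same trace, or starts a new one if nothing arrived.
context = extract(outgoing)
if context is None:
    b_trace_id, b_parent = secrets.token_hex(16), None  # trace fragments here
else:
    b_trace_id, b_parent, _ = context
span_b = new_span_id()

print(outgoing["traceparent"])
print("same trace:", b_trace_id == trace_id, "| parent is A's span:", b_parent == span_a)
```

If any hop drops the header, `extract` returns `None` and the downstream service starts a fresh trace ID, which is exactly the fragmentation described above.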
What This Looks Like in Practice
A logistics startup had an intermittent latency problem on their shipment status endpoint. Metrics showed p99 climbing from 180ms to 1.1 seconds during certain time windows. The engineering team checked their logs: nothing. No errors. No timeouts. Every request completing successfully.
Without traces, they'd have spent days guessing. With traces, they found the root cause in under an hour. The trace showed that every affected request was spending 900ms waiting on an internal inventory service. That service wasn't throwing errors — it was completing, just slowly. The inventory team had run a database schema migration that turned a fast indexed query into a full table scan on a 12-million-row table.
Logs didn't catch it because nothing failed. Metrics showed the symptom but couldn't show the cause. The trace pointed directly at the slow span, in the right service, triggered by a deployment that happened thirty minutes before the latency appeared.
Gojek's engineering team has written about building this kind of observability infrastructure across their ride-hailing, payments, and logistics systems as they scaled to millions of daily active users. The same principle applies at every scale: logs for forensics, metrics for alerting, traces for diagnosis. You need all three. They each answer a different question.
There's also a compliance angle worth naming, especially for fintech products operating under OJK (Otoritas Jasa Keuangan) regulation: audit trail retention and operational logging are not the same thing. Operational logs — the kind you're using to debug incidents — can be short-lived, 14 to 30 days hot. Transaction audit trails may require seven-year retention and must be stored separately, immutably, and without PII. If your developers are writing customer card details or national ID numbers into operational logs, that's not just a bad practice — it's a regulatory exposure.
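One way to keep card and ID numbers out of operational logs is to swap them for tokens before a line is written. The sketch below is deliberately crude and makes several assumptions: the key `TOKEN_KEY` is a placeholder, the regex simply treats any run of 13 to 19 digits as sensitive, and a keyed hash stands in for whatever tokenisation service a real fintech stack would use. It is not a compliance control in itself.

```python
import hashlib
import hmac
import re

TOKEN_KEY = b"demo-key-do-not-use"  # placeholder; a real key would come from a secret store
LONG_DIGITS_RE = re.compile(r"\b\d{13,19}\b")  # crude match for card-like numbers


def tokenise(value: str) -> str:
    """Map a sensitive value to a stable token that does not contain the original digits."""
    digest = hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def scrub(message: str) -> str:
    """Replace every card-like digit run in a log message with its token."""
    return LONG_DIGITS_RE.sub(lambda match: tokenise(match.group()), message)


raw = "payment failed for card 4111111111111111"
scrubbed = scrub(raw)
print(scrubbed)
```

Because the same input always yields the same token, engineers can still correlate log lines about one card without the number itself ever reaching the log store.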
FAQ
Q: Do I need all three signals from day one, or can I start with just logs?
A: Start with structured logs and the four golden metrics per service — request rate, error rate, latency, saturation. Add distributed tracing once you have more than two services communicating. Trying to implement all three perfectly on a pre-launch product is often overthinking. But structured logs and basic metrics will carry you through early traction, and you should add tracing before you have a production incident that requires it.
Q: What's the difference between observability and monitoring?
A: Monitoring tells you something is wrong. Observability gives you the tools to figure out why. Monitoring fires an alert when p99 latency crosses 500ms. Observability is the combination of logs, metrics, and traces that lets you go find the root cause once that alert fires. You need both — but monitoring without observability just tells you the building is on fire without telling you which floor.
Q: Is OpenTelemetry worth the setup complexity for a small team?
A: Yes, for one specific reason: it's vendor-neutral. You instrument your codebase once and can switch trace backends — from Jaeger to Honeycomb to Datadog — without touching your application code. For a three-person team, initial setup is two to three days. That's a one-time cost. Skipping it and retrofitting later, when you're at ten services and something is broken in production, costs far more than that.
Q: How do we keep log storage costs under control as we scale?
A: Three levers. First, sample high-volume successful requests — you don't need a log entry for every routine success on a high-traffic endpoint. Second, enforce log level discipline — DEBUG off in production, ERROR only when human action is required. Third, set retention policies that match actual query patterns: 14 to 30 days hot for operational logs, archive or discard older data. Most incidents are investigated within 72 hours of occurring. Paying to store three months of logs you'll never query is waste.
Q: We're building a fintech product. Are there special considerations for logging?
A: Two main ones. First, never log PII or payment details — card numbers, national IDs, full account numbers — in operational logs. This is both a security risk and a regulatory exposure under OJK guidelines. Use tokenised identifiers in logs instead. Second, distinguish between operational logs and audit trails. Transaction records may require long-term immutable storage for compliance. Keep them separate from your operational observability stack so cost and access controls can be managed independently.
Most teams treat observability as infrastructure debt — something to fix after the next funding round, after the next hire, after things stabilise. But things don't stabilise. They scale. And the cost of retrofitting good observability into a system under load is always higher than building it in early.
If you're not sure what signal coverage your current system has, an architecture review usually surfaces that gap in the first hour.