// PUBLISHED 22.05.26
// TIME 15 MINS
// TAGS
#INFRASTRUCTURE #STARTUP #CTO #CLOUD
// AUTHOR
Spectre Command

Your CTO just handed you three vendor proposals. One for AWS. One for a managed Kubernetes platform. One from a consultant who wants to "right-size your cloud footprint." They all cost different amounts. They all sound reasonable. You have no idea which one fits your actual situation because nobody showed you the map before asking you to pick a destination.

That's the problem this post solves.

The startup infrastructure stack is not magic. It's a set of layers (compute, data, networking, observability), and every vendor you'll ever evaluate lives somewhere in one of those layers. Once you understand where each layer sits and what decisions belong to it, vendor selection stops being a trust exercise and becomes a comparison you can run yourself.

This isn't a tutorial on how to set up servers. It's the mental model you need before any of those conversations happen.

// EXECUTIVE SUMMARY
  • The infrastructure stack consists of four core layers (Compute, Data, Networking, and Observability), with developer tooling sitting on top.
  • Vendor selection is mechanical once you define your operational requirements for each distinct layer.
  • Compute choices (VPS vs. Serverless) dictate cost structures; Data choices dictate long-term architecture.
  • Never adopt complex solutions like Kubernetes or microservices before hitting actual scale limits.

The Stack Is a Set of Layers, Not a Single Choice

Every application, whether a two-person MVP or a platform serving millions, runs on the same fundamental layers. The vendors change. The layers don't.

Compute is where your code runs. A virtual machine, a container, a serverless function: all compute. The question at this layer: who manages the underlying hardware, and how much of that management do you want to own?

Data is where your state lives. Databases, caches, message queues, object storage. This layer holds everything that needs to survive a server restart.

Networking is how requests reach your application and how your services talk to each other. Load balancers, CDNs, API gateways, DNS: all networking decisions.

Observability is how you know the system is working. Logs, metrics, traces. Without this layer, you're flying blind and you won't know it until something breaks in production.

Developer tooling sits on top of everything else: CI/CD pipelines, infrastructure-as-code, secrets management. This layer is how your team interacts with all the others.

Most founders only think about the first two. That's how you end up with a system that works until it doesn't, and nobody can explain why.

The map matters before the territory.

Compute: Where Your Code Actually Runs

This is where most vendor confusion starts. AWS, GCP, Azure, Vercel, Railway, Fly.io, DigitalOcean: they're all selling compute at different abstraction levels. Three broad categories, each with a real trade-off:

Virtual machines (VPS, EC2, Droplets) give you a full server. You install the OS, configure the runtime, manage security patches. Total control. Total responsibility. A DigitalOcean Droplet at $12/month is a VPS. An AWS EC2 instance is a VPS with more configuration options and a more complicated pricing sheet.

If you have an ops-capable engineer, VMs are cost-efficient and predictable. If you don't, they become expensive babysitting.

Managed containers (ECS, Cloud Run, Fly.io, Railway) abstract the server away. You hand the platform a Docker image; it decides where to run it. You pay for the compute time you use, not for every hour a server exists. This is where most early-stage startups land today: lower ops overhead than VMs, more control than serverless.

The trade-off is cost unpredictability at scale. At 1,000 requests per day, managed containers are cheap. At 10 million, you need to model the numbers carefully before you're locked in.
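One way to "model the numbers" is a small break-even calculation between usage-based pricing and a flat-rate server. The sketch below is illustrative only: the per-request and per-CPU-second prices are hypothetical placeholders, not any vendor's rate card, and `cpu_seconds_per_request` is something you would measure from your own application.

```python
from dataclasses import dataclass


@dataclass
class UsagePricing:
    """Hypothetical usage-based pricing; replace with your vendor's real numbers."""
    price_per_million_requests: float  # USD, assumed
    price_per_cpu_second: float        # USD, assumed
    cpu_seconds_per_request: float     # measured from your own app


def cost_per_request(p: UsagePricing) -> float:
    """Combined request fee and compute fee for a single request."""
    return p.price_per_million_requests / 1_000_000 + (
        p.cpu_seconds_per_request * p.price_per_cpu_second
    )


def monthly_usage_cost(requests_per_day: int, p: UsagePricing, days: int = 30) -> float:
    """Monthly bill under usage-based pricing."""
    return requests_per_day * days * cost_per_request(p)


def break_even_requests_per_day(flat_monthly_cost: float, p: UsagePricing, days: int = 30) -> float:
    """Daily request volume at which usage pricing equals a flat monthly fee."""
    return flat_monthly_cost / (cost_per_request(p) * days)


# All three numbers below are made up for illustration.
pricing = UsagePricing(
    price_per_million_requests=0.40,
    price_per_cpu_second=0.000024,
    cpu_seconds_per_request=0.2,
)

for rpd in (1_000, 100_000, 10_000_000):
    print(f"{rpd:>10,} req/day -> ${monthly_usage_cost(rpd, pricing):,.2f}/month")

print(f"break-even vs a $12/month VPS: {break_even_requests_per_day(12, pricing):,.0f} req/day")
```

The useful output is the break-even line rather than any single bill: it tells you roughly where a flat-rate server starts winning, under whatever prices you plug in.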

Serverless (Lambda, Cloud Functions, Vercel Edge) takes it further. No server, no container, no runtime management. Your function runs when called, stops when done. Genuinely good for spiky, unpredictable workloads. Genuinely painful for long-running processes, large payloads, or anything requiring persistent connections.

Vercel has made serverless the default choice for Next.js apps, and it's a solid default until your usage pattern doesn't match the model. [→ Read: What Is a Server? Cloud, VPS, and Serverless Without the Jargon] When that happens, you have a decision to make that most teams aren't prepared for.

The compute decision shapes every cost and performance conversation that follows. Get it right once and you won't revisit it for years. Get it wrong and you'll be migrating infrastructure while simultaneously shipping features, a combination that destroys team velocity.

Data Layer: The Decision That Ages the Worst

The database you pick in month two will still be running in year three. That's not a metaphor. It's what happens at almost every startup that doesn't plan this layer deliberately.

The data layer has four components, each deserving a separate decision:

Primary database is where your core application data lives. Either relational (PostgreSQL, MySQL) or document-based (MongoDB, Firestore). The relational vs. NoSQL question gets oversimplified constantly.

The honest version: if your data has clear relationships and your queries need to join across entities, start with PostgreSQL. It's not a boring choice; it's a defensible one. Gojek runs PostgreSQL clusters at a scale that would surprise most engineers who assume the unicorns must be doing something more exotic.

Cache layer is Redis sitting in front of your database. Not every startup needs this on day one. You need it when your database starts showing query latency under load, or when you're reading the same data repeatedly and paying the I/O cost each time. The common mistake is adding caching as a band-aid before identifying which queries are actually expensive.
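The "reading the same data repeatedly" case is usually handled with a cache-aside pattern: check the cache first, and only hit the database on a miss. The sketch below uses an in-process dictionary as a stand-in for Redis so it runs without any services; `TTLCache` and `load_profile_from_db` are hypothetical names, and a real deployment would use a shared cache with its own invalidation rules.

```python
import time
from typing import Any, Callable


class TTLCache:
    """In-process stand-in for a Redis-style cache (assumption: real setups share it across instances)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]              # cache hit: the database is not touched
        value = loader()                 # cache miss: pay the query cost once
        self._store[key] = (now, value)
        return value


db_calls = 0


def load_profile_from_db() -> dict:
    """Hypothetical expensive query; counts how often the 'database' is hit."""
    global db_calls
    db_calls += 1
    return {"user_id": 42, "plan": "pro"}


cache = TTLCache(ttl_seconds=30)
for _ in range(5):
    profile = cache.get_or_load("profile:42", load_profile_from_db)

print(f"5 reads, {db_calls} database call(s)")
```

The counter is the point: measure which reads repeat before deciding what to cache, rather than caching everything.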

Message queue is how your services communicate asynchronously. RabbitMQ, Kafka, AWS SQS, Google Pub/Sub. You don't need a queue when you have one service talking directly to a database. You need it when a single user action must trigger multiple downstream processes (send an email, update an analytics record, notify a third-party webhook) and you can't afford to fail the user's request if one of those steps is slow or unavailable.

Object storage is S3, GCS, or an equivalent, where files, images, exports, and blobs live. Not a database decision. A separate layer. Always use managed object storage for files. Never store binary data in your primary database unless you enjoy slow queries and large backups.

Most startups combine a primary database, a cache, and object storage. The right time to add a queue is when you find yourself writing retry logic inside your API endpoints. That's the signal.
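To make the queue signal concrete, here is a minimal sketch of the shape a queue gives you: the request handler enqueues the side effects and returns, and a separate worker processes them. It uses Python's standard-library `queue` as a stand-in for SQS, RabbitMQ, or similar; the task names and `handle_signup` are hypothetical.

```python
import queue
import threading

jobs: "queue.Queue[tuple[str, dict]]" = queue.Queue()
processed: list[str] = []


def handle_signup(user: dict) -> str:
    """Request handler: enqueue the side effects and respond immediately."""
    for task in ("send_welcome_email", "record_analytics", "notify_webhook"):
        jobs.put((task, user))
    return "201 Created"


def worker() -> None:
    """Background consumer; a slow task here no longer delays the user's request."""
    while True:
        task, user = jobs.get()
        try:
            # A real worker would call the email, analytics, or webhook service here.
            processed.append(f"{task}:{user['id']}")
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

status = handle_signup({"id": 7})
jobs.join()  # wait for the worker only so the sketch can print the result
print(status, processed)
```

A managed queue adds what this in-memory version lacks: durability across restarts, retries, and dead-letter handling, which is exactly the logic you would otherwise end up writing inside the endpoint.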

Understanding [→ Read: What Is a Database?] before picking one sounds obvious. It isn't. Many technical founders have committed to a database type based on a recommendation from a developer whose previous job was at a social media company, which has a completely different read/write pattern from a B2B SaaS or a fintech wallet.

Networking: How Requests Actually Get to Your App

Networking decisions are invisible until they fail. Then they're the only thing anyone talks about.

THE STANDARD REQUEST FLOW
DNS & CDN → Load Balancer → Compute (App) → Cache (Redis) → Primary Database

Load balancer sits in front of your compute layer and distributes incoming requests across multiple instances. AWS ALB, Nginx, Caddy: all load balancers. You need one the moment you run more than one instance of anything. Without it, if one instance dies, requests fail.
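As a rough illustration of what a load balancer does, the sketch below rotates through instances and skips any that have been marked unhealthy. It is a toy model, not how ALB, Nginx, or Caddy are implemented; `RoundRobinBalancer` and the instance names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips instances marked as down."""

    def __init__(self, instances: list[str]) -> None:
        self.instances = instances
        self.healthy = set(instances)
        self._ring = cycle(instances)

    def mark_down(self, instance: str) -> None:
        """Simulates a failed health check."""
        self.healthy.discard(instance)

    def pick(self) -> str:
        # Try each instance at most once per pick before giving up.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [lb.pick() for _ in range(3)]

lb.mark_down("app-2")  # one instance dies
after = [lb.pick() for _ in range(4)]

print(before)
print(after)  # app-2 no longer receives traffic
```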

CDN (Content Delivery Network) caches static assets (images, CSS, JS bundles) at edge servers geographically close to your users. Cloudflare, AWS CloudFront, Fastly. For Indonesian users hitting a server in Singapore, the difference between CDN-cached and origin-fetched static assets is real: 20ms vs 300ms per request for something as simple as a homepage image. Multiply that across every asset on every page load. It compounds fast.
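A back-of-the-envelope way to see the compounding: assume assets load in waves limited by the number of parallel connections, with each wave costing one request's latency. The model is deliberately crude (it ignores bandwidth, HTTP/2 multiplexing, and browser caching), and the asset count and connection limit are assumptions; the 20ms and 300ms figures come from the paragraph above.

```python
import math


def asset_load_time_ms(asset_count: int, latency_ms: float, parallel_connections: int = 6) -> float:
    """Rough estimate: assets fetched in waves, each wave costing one request latency."""
    waves = math.ceil(asset_count / parallel_connections)
    return waves * latency_ms


assets = 30  # hypothetical page with 30 static assets

cdn_ms = asset_load_time_ms(assets, latency_ms=20)
origin_ms = asset_load_time_ms(assets, latency_ms=300)

print(f"CDN-cached:     {cdn_ms:.0f} ms")
print(f"Origin-fetched: {origin_ms:.0f} ms")
```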

DNS is the address book that maps yourapp.com to an IP address. DNS changes propagate slowly, sometimes over hours. This becomes painful when you're doing an emergency migration and your users can't reach the new servers because their ISP is still caching the old record.

API Gateway sits in front of your backend services and handles cross-cutting concerns: authentication, rate limiting, request routing, logging. You don't need this on day one. You need it when you have more than two or three backend services and you're duplicating rate limiting logic across all of them.

The networking layer is where regional infrastructure decisions matter most. AWS launched its Jakarta region (ap-southeast-3) in 2021. If your users are in Indonesia and your servers are in Singapore, you're adding 30–50ms of round-trip latency on every request. For a payment flow or a real-time feature, that's noticeable. For an internal document management tool, probably fine. Know your use case before picking a region.

Observability: The Layer Founders Skip Until Production Burns

Observability is not monitoring. Monitoring tells you a server is down. Observability tells you why it went down, which users were affected, how long before they noticed, and what the system was doing in the 30 seconds before the crash.

Three signals, each distinct:

Logs are timestamped records of events. Request received. Query executed. Error thrown. Logs are readable by humans. They're poor for aggregate analysis at scale.

Metrics are numerical measurements over time. CPU usage at 14:32. Request count per second. Queue depth at any given moment. Metrics are how you build dashboards and alerts. Prometheus, Datadog, CloudWatch: all metrics systems.

Traces are the end-to-end journey of a single request through your system. A trace shows you that a user's checkout request spent 800ms in the payment service, 200ms in the inventory service, and 50ms in the database. Without traces, debugging latency in a multi-service system is guesswork.
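The idea behind a trace can be sketched in a few lines: wrap each step of a request in a timed span, then compare the spans. This is a conceptual illustration only; production systems typically use a tracing library such as OpenTelemetry, and the `span` helper and the sleep durations here are hypothetical stand-ins for real service calls.

```python
import time
from contextlib import contextmanager

# (name, duration in milliseconds), appended as each span finishes
spans: list[tuple[str, float]] = []


@contextmanager
def span(name: str):
    """Records how long the wrapped block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))


with span("checkout"):
    with span("payment"):
        time.sleep(0.08)   # stand-in for a slow payment call
    with span("inventory"):
        time.sleep(0.02)   # stand-in for a faster inventory call

for name, ms in spans:
    print(f"{name:<10} {ms:6.1f} ms")

children = [s for s in spans if s[0] != "checkout"]
slowest = max(children, key=lambda s: s[1])
print("slowest child span:", slowest[0])
```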

The mistake almost every early-stage team makes: they add logs, skip metrics, ignore traces. Then something goes wrong in production and they spend four hours manually grepping log files trying to reconstruct what happened. A proper observability setup would have answered the question in two minutes.

You don't need a full observability platform on day one. You need structured logs (JSON, not freeform text), basic uptime monitoring, and one dashboard that shows request rate and error rate. Add traces when you have more than two services and can no longer reason about a request's path by looking at a single application's logs.
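For the "structured logs" part, a minimal sketch using only Python's standard `logging` module might look like the following. The field names (`route`, `status`, `duration_ms`) and the `fields` key passed via `extra` are conventions invented for this example, not a standard; the log is written to an in-memory stream only to keep the sketch self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "fields" is this example's own convention for structured context.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # production code would log to stdout or a file instead
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "request completed",
    extra={"fields": {"route": "/checkout", "status": 200, "duration_ms": 182}},
)

entry = json.loads(stream.getvalue())
print(entry["route"], entry["status"], entry["duration_ms"])
```

Because every line is machine-parseable, request rate and error rate can be computed by filtering on `status` instead of grepping free-form text.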

The Part Most Founders Get Wrong

They evaluate vendors before they understand their own requirements.

An investor tells a founder that their peers are using AWS. The founder asks their CTO to "move to AWS." The CTO interprets that as "migrate everything to EC2." Six months later, they have significantly more complex infrastructure and a $40k/month cloud bill, twice what the previous setup cost, because nobody defined what they were actually trying to solve.

The conversation should go the other way: define the requirement first, then find the vendor that meets it.

Those requirements come from answering four questions honestly:

What is your read/write ratio? Mostly reads, occasional writes? That changes your database and cache strategy. Constant high-volume writes? Different story.

What does your traffic look like? Steady and predictable, or spiky? A payroll SaaS has predictable traffic. A flash sale platform does not. Serverless handles spikes more cheaply. Reserved instances handle steady load more cheaply.

What's your team's operational capacity? Two backend engineers who have never managed production infrastructure are not the same as one DevOps engineer who has. The right vendor choice depends partly on who's going to maintain it.

What's the consequence of failure? An internal tool that's down for an hour affects your team. A payment gateway that's down for an hour affects your revenue. The infrastructure investment should match the failure consequence.

Once you have answers, vendor selection is mechanical. You're not comparing marketing pages; you're comparing whether a vendor's model matches your requirement profile.
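To show what "mechanical" can mean, the sketch below encodes the four answers as a small profile and applies this article's rules of thumb to it. It is a toy heuristic under loud assumptions: the `Requirements` fields, the rule order, and the returned labels are all invented for illustration, and a real evaluation would weigh cost and team context far more carefully.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical profile built from the four questions above."""
    read_heavy: bool             # read/write ratio
    spiky_traffic: bool          # traffic shape
    has_ops_engineer: bool       # operational capacity
    failure_costs_revenue: bool  # consequence of failure


def suggest_compute(req: Requirements) -> str:
    """Maps a profile to a compute category using the article's rules of thumb."""
    if req.spiky_traffic:
        return "serverless"
    if req.has_ops_engineer:
        return "virtual machines"
    return "managed containers"


def suggest_additions(req: Requirements) -> list[str]:
    """Extras implied by the profile; illustrative, not exhaustive."""
    additions = []
    if req.read_heavy:
        additions.append("cache or read replica")
    if req.failure_costs_revenue:
        additions.append("redundant instances behind a load balancer")
    return additions


payroll_saas = Requirements(
    read_heavy=True,
    spiky_traffic=False,
    has_ops_engineer=False,
    failure_costs_revenue=True,
)

print(suggest_compute(payroll_saas))
print(suggest_additions(payroll_saas))
```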

The [→ Read: How to Run a Technical Debt Audit] applies here too. Before recommending any infrastructure change, the first thing we do is map current state: what exists, what it costs, and what it can't handle. Every architecture conversation starts with that map.

Real-World Example: An Indonesian Fintech That Got It Wrong Twice

A wallet product in Jakarta launched on a single VPS. One server, one database, no cache. Fine at 5,000 users.

At 80,000 users, the database started struggling during peak hours (after work and on weekends, for a consumer wallet). The team upgraded the VPS to a larger server. The database still struggled because the problem was query efficiency, not compute.

They brought in a consultant who recommended migrating to AWS and splitting into microservices. Three months of migration work. The performance problem persisted because the underlying queries were never fixed. Now they had the same slow queries running on more expensive, more complex infrastructure.

What would have worked: add a read replica to offload reporting queries, add Redis caching for the five most-called read operations, and fix the three queries doing full table scans without indexes. A two-week project. No migration. No microservices.
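The "full table scans without indexes" part of that fix is easy to demonstrate. The sketch below uses SQLite (bundled with Python) purely as a stand-in; the table, column, and index names are hypothetical, and the team's actual database and schema are not known. It asks the database for its query plan before and after adding an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions (user_id, amount) VALUES (?, ?)",
    [(i % 1000, i) for i in range(10_000)],
)


def query_plan(sql: str) -> str:
    """Returns SQLite's plan description for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the 4th column holds the plan detail


sql = "SELECT amount FROM transactions WHERE user_id = 42"

plan_before = query_plan(sql)  # no index on user_id: expect a scan of the table
conn.execute("CREATE INDEX idx_transactions_user ON transactions (user_id)")
plan_after = query_plan(sql)   # with the index: expect a search using it

print("before:", plan_before)
print("after: ", plan_after)
```

PostgreSQL and MySQL expose the same information through their own `EXPLAIN` output, so the habit transfers: read the plan before buying a bigger server.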

The lesson isn't that AWS was wrong. It's that the diagnosis came before the map. Nobody had drawn the stack clearly enough to see that the bottleneck was in the data layer, not the compute layer.

Understanding [→ Read: How to Build a Backend That Scales from 100 to 10M Users] starts with knowing which layer is the constraint. Almost always, it's the data layer.

FAQ

Q: Do I need to understand all of this before hiring a cloud vendor?

A: You need enough to have the conversation on your own terms. You don't need to be able to implement it; that's what engineers are for. But if you can't articulate what you're trying to solve at each layer, you'll accept whatever the vendor proposes, which is usually the option with the highest margin for them.

Q: My engineering team says we need Kubernetes. Do we?

A: Almost certainly not yet. Kubernetes solves real problems (container orchestration, autoscaling, zero-downtime deployments), but it introduces significant operational complexity in return. Most startups at Seed and Series A don't have the team to operate it properly. Start with managed container services (Cloud Run, ECS, Fly.io) and revisit when you're consistently running more than 20 services in production.

Q: We're on Vercel and costs are climbing fast. Is that an infrastructure stack problem?

A: Possibly. Vercel's pricing model is built for moderate traffic with standard Next.js patterns. If you're seeing unexpected costs, the first question is whether your traffic pattern matches their model, specifically around serverless function invocations and bandwidth. The fix might be architectural (moving long-running work out of serverless functions) or might require a different compute layer. Either way, it's a data question before it's a vendor question.

Q: Should data and compute always be with the same vendor?

A: No. It's common and sensible to run compute on one platform (Fly.io, for example) and use a managed database from another (Neon, Supabase, AWS RDS). The latency between them matters, so they should be in the same region, but being locked into a single vendor for everything usually costs more than it saves.

Q: How do I know if my observability is good enough?

A: Ask this: if your system started responding slowly to users right now, how long would it take your team to identify which component is responsible? If the answer is "more than 15 minutes" or "we'd grep logs manually," your observability layer needs work. Good observability means the answer is on a dashboard before a user files a support ticket.


The infrastructure stack is learnable. You don't need to implement it, but you do need to understand it well enough to ask the right questions when a vendor is proposing a solution and you have 30 minutes to decide.

That's the difference between infrastructure choices that hold up under growth and ones that become the migration project you're managing 18 months from now. It always starts with drawing the map first.

