At some point, someone on your team will walk into a meeting and say "we need to add caching." You'll nod. The engineers will nod. Two weeks later, your database CPU is still at 90% and nobody's quite sure why. That's not a caching problem. That's a "we added caching without understanding what it's actually doing" problem. There's a difference.
Caching is one of those concepts that sounds simple until your production system starts behaving strangely at 2am. The database is slow. Users are complaining. The on-call engineer is staring at a Redis cluster that should be helping but isn't. Getting caching right means understanding not just the tool, but the pattern, and which pattern fits your specific situation.
This post covers how caching actually works, when Redis makes sense versus Memcached, the cache-aside pattern your team is probably already using (correctly or not), and the mistake that causes more cache-related outages than any other single thing.
- Caching is a deliberate trade-off between speed and data freshness. Do not cache financial balances or inventory counts.
- For 99% of startups, Redis is the correct choice over Memcached due to its advanced data structures and persistence.
- The Cache-Aside pattern breaks on the write path. You must explicitly invalidate cache keys when the underlying database row updates.
- Cache Stampedes (Thundering Herds) will crash your database. Always add random jitter to your TTL expirations.
What Caching Actually Does to Your System
Your database is a disk-backed store. Every query hits storage, applies indexes, evaluates conditions, and returns data. Under light load, that's fine: queries take milliseconds and nobody notices. Under heavy load, those milliseconds compound. Connection pools fill up. Query queues grow. Everything slows down together.
A cache sits in front of your database and holds recently accessed data in memory. RAM access is orders of magnitude faster than disk. When your application needs a user profile or a product listing, it checks the cache first. If the data is there (a cache hit), it returns immediately without touching the database at all. If it's not (a cache miss), it goes to the database, gets the data, stores a copy in the cache, and returns it. The next request for the same data hits the cache.
The math is simple. If 80% of your requests ask for the same 20% of your data, and you cache that 20%, you've eliminated roughly 80% of your database load. That's the theory. The practice is messier.
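To make that arithmetic concrete, here is a rough sketch. It assumes the hit ratio is known and stable, which real systems never quite manage; the function name and the request numbers are illustrative, not taken from any measurement.

```python
def residual_db_load(requests_per_min: float, hit_ratio: float) -> float:
    """Requests per minute that still reach the database after caching."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return requests_per_min * (1.0 - hit_ratio)


# Illustrative numbers only: 10,000 requests/min at different hit ratios.
for ratio in (0.60, 0.80, 0.95):
    remaining = residual_db_load(10_000, ratio)
    print(f"hit ratio {ratio:.0%}: about {remaining:,.0f} req/min reach the database")
```

The point of running the numbers is that the hit ratio, not the presence of a cache, decides how much load the database still sees.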
Caching works best for data that's read far more than it's written, doesn't need to be perfectly fresh on every read, and is expensive to generate. User sessions, product catalogues, API responses from third-party services, and expensive aggregation queries are all good candidates. Financial balances, inventory counts, anything where stale data causes real harm: cache those very carefully or not at all.
Redis vs Memcached: The Actual Difference
Redis and Memcached are both in-memory key-value stores. If all you need is a simple cache with string values and a TTL, either works fine. But they're not equivalent.
Memcached is older, simpler, and built for one thing: caching strings. It's multi-threaded, which means a single instance can use multiple CPU cores with little overhead. If you have a large number of small cached objects and you just want raw throughput, Memcached is genuinely fast. It's also easier to reason about because it does less.
Redis does more. It supports multiple data structures: strings, hashes, lists, sorted sets, bitmaps, hyperloglogs, streams. That sounds like marketing copy but it matters in practice. Storing a cached user object as a Redis hash means you can update individual fields without deserialising and re-serialising the entire object. Sorted sets let you implement leaderboards, rate limiters, and priority queues on top of the same infrastructure you're using for caching. Redis also supports persistence: you can configure it to write to disk so cached data survives restarts. And Redis has built-in clustering and replication that's production-grade.
In practice, most startups should use Redis. You'll almost certainly want the extra data structures as your product grows, and the operational overhead of running two cache systems (Memcached for performance-critical paths, Redis for everything else) isn't worth the marginal throughput gain unless your load is truly extreme. AWS ElastiCache supports both, so the choice is yours. Managed Redis on ElastiCache, DigitalOcean Managed Redis, or Upstash (serverless Redis with per-request billing) all remove most of the operational burden for smaller teams.
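One thing Memcached can do better: memory efficiency for plain string values, because it carries less per-key bookkeeping. If all you're caching is HTML fragments or opaque data blobs, Memcached can handle that efficiently, though its default item size limit is 1 MB, so very large values need that limit raised. For everything else, Redis.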
The Cache-Aside Pattern (And How Most Teams Implement It Wrong)
Cache-aside is the pattern you're almost certainly using, whether or not your team named it. The application code is responsible for managing the cache. When you need data, check the cache. On a miss, read from the database, write to the cache, return the result.
Here's what the flow looks like when the requested data is not yet cached:
Simple. The problem is step 4, specifically the TTL (time to live). Most teams set a TTL and assume that's enough. It's not.
The write path is where cache-aside breaks down. When a user updates their profile, your code updates the database. But the cache still has the old data. For up to an hour (or whatever your TTL is), anyone who reads that profile gets stale data. The fix is to invalidate the cache entry on write: delete the Redis key whenever the underlying data changes. Some teams update the cache on write instead of deleting. That works too, but it's more complex to keep consistent and usually not worth it.
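A minimal sketch of both paths, read and write, might look like the following. Everything named here is an assumption for illustration: the `user:{id}:profile` key scheme, the one-hour TTL, the plain dict standing in for the database, and the `InMemoryCache` class, which only mimics the `get`, `set(..., ex=...)`, and `delete` calls that redis-py also provides so the example runs without a server.

```python
import json
import time
from typing import Optional


class InMemoryCache:
    """Tiny stand-in for a Redis client (get / set with ex / delete only)."""

    def __init__(self) -> None:
        self._data = {}  # key -> (value, expiry timestamp)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = (value, time.monotonic() + ex)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


PROFILE_TTL_SECONDS = 3600  # assumed one-hour TTL


def profile_key(user_id: int) -> str:
    return f"user:{user_id}:profile"  # hypothetical key naming scheme


def get_profile(cache, db: dict, user_id: int) -> dict:
    """Read path: check the cache, fall back to the database on a miss."""
    cached = cache.get(profile_key(user_id))
    if cached is not None:
        return json.loads(cached)  # cache hit
    profile = db[user_id]          # cache miss: read the source of truth
    cache.set(profile_key(user_id), json.dumps(profile), ex=PROFILE_TTL_SECONDS)
    return profile


def update_profile(cache, db: dict, user_id: int, changes: dict) -> None:
    """Write path: update the database, then invalidate the cached copy."""
    db[user_id] = {**db[user_id], **changes}
    cache.delete(profile_key(user_id))
```

Deleting rather than rewriting the key keeps the write path simple: the next read repopulates the cache from the database, so there is only one place where cached values are built.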
The other common mistake: not setting a TTL at all. A cache with no expiry grows until it runs out of memory, then starts evicting data based on whatever eviction policy you configured (or forgot to configure; Redis's default policy, noeviction, returns errors on writes once maxmemory is reached rather than evicting anything). At scale, you want predictable expiration and a deliberate eviction policy. allkeys-lru (evict least recently used keys when memory is full) is a reasonable default for most caching use cases.
The Part Most Teams Get Wrong: Cache Stampede
Here's the scenario. Your cache key expires. At that exact moment, 500 concurrent requests come in for the same piece of data. All 500 get a cache miss. All 500 hit the database simultaneously. Your database collapses under 500 concurrent queries for the same row.
This is a cache stampede, also called a thundering herd. It's more common than you'd think, especially in high-traffic systems that get burst traffic. In Indonesia, Lebaran, Harbolnas, and flash sale events make burst traffic a near-certainty for consumer apps.
Three approaches to prevent it:
Probabilistic early expiration. Before the TTL expires, a small percentage of cache reads proactively refresh the cached value in the background. The cache never actually expires from the caller's perspective. Libraries like dogpile.cache tackle the same problem with a related approach: one caller regenerates the value while the others keep getting the old one.
Mutex lock on cache miss. When a cache miss occurs, acquire a distributed lock before querying the database. Only one request does the database query and writes to cache. Other requests wait briefly, then get the freshly cached value. This requires careful lock timeout handling to avoid a different class of problem.
Jitter on TTL. Instead of setting all cache entries to expire at exactly now + 3600, add random jitter: now + 3600 + random(0, 300). This staggers expiration across time and prevents simultaneous mass expiry.
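As a sketch, the jitter calculation is a one-liner. The helper name is made up for illustration; the 3600 and 300 second values are the ones from the formula above.

```python
import random

BASE_TTL_SECONDS = 3600   # base TTL from the formula above
MAX_JITTER_SECONDS = 300  # upper bound on the random offset


def ttl_with_jitter(base: int = BASE_TTL_SECONDS,
                    max_jitter: int = MAX_JITTER_SECONDS) -> int:
    """Return a TTL in seconds, spread uniformly over [base, base + max_jitter]."""
    return base + random.randint(0, max_jitter)
```

With redis-py, the result would typically be passed as the `ex` argument of `set`, so every key written gets its own slightly different expiry.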
For most startups, TTL jitter is the easiest win with the lowest implementation cost. Add it by default and you eliminate a class of production incident before it ever hits you.
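If jitter alone is not enough, the mutex approach can be sketched roughly as follows. This is an illustration under assumptions, not a hardened lock: the `lock:` key prefix, the timeouts, and the `InMemoryCache` stand-in are all hypothetical. The stand-in mimics the `set(..., nx=True, ex=...)` call that redis-py provides, and the token comparison assumes a client that returns strings (for redis-py, `decode_responses=True`).

```python
import threading
import time
import uuid
from typing import Callable, Optional


class InMemoryCache:
    """Thread-safe stand-in for a Redis client: get, set(ex=, nx=), delete."""

    def __init__(self) -> None:
        self._data = {}
        self._guard = threading.RLock()

    def get(self, key: str) -> Optional[str]:
        with self._guard:
            entry = self._data.get(key)
            if entry is None or entry[1] <= time.monotonic():
                self._data.pop(key, None)
                return None
            return entry[0]

    def set(self, key: str, value: str, ex: int, nx: bool = False) -> Optional[bool]:
        with self._guard:
            if nx and self.get(key) is not None:
                return None  # redis-py also returns None when NX fails
            self._data[key] = (value, time.monotonic() + ex)
            return True

    def delete(self, key: str) -> None:
        with self._guard:
            self._data.pop(key, None)


def get_with_lock(cache, key: str, load: Callable[[], str], ttl: int = 3600,
                  lock_timeout: int = 10, wait_step: float = 0.05,
                  max_wait: float = 2.0) -> str:
    """Cache-aside read where only the lock holder queries the database."""
    value = cache.get(key)
    if value is not None:
        return value

    lock_key = f"lock:{key}"  # hypothetical lock naming scheme
    token = uuid.uuid4().hex  # identifies this caller as the lock owner
    if cache.set(lock_key, token, ex=lock_timeout, nx=True):
        try:
            value = cache.get(key)  # re-check: someone may have just filled it
            if value is None:
                value = load()      # the single database query
                cache.set(key, value, ex=ttl)
            return value
        finally:
            # Release only our own lock. Against a real Redis this check and
            # delete should be atomic (for example in a Lua script).
            if cache.get(lock_key) == token:
                cache.delete(lock_key)

    # Someone else holds the lock: wait briefly for the fresh value.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        time.sleep(wait_step)
        value = cache.get(key)
        if value is not None:
            return value
    return load()  # fallback so a stuck lock holder cannot block readers forever
```

The lock expiry (`lock_timeout`) and the waiters' fallback are the "careful lock timeout handling" part: without them, a crashed lock holder would leave every other request waiting indefinitely.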
A Real-World Example: Marketplace Under Harbolnas Load
A logistics-adjacent marketplace in Jakarta ran into this problem during Harbolnas 2023. Their product catalogue was served directly from PostgreSQL. On a normal day, peak load was around 3,000 requests per minute. On Harbolnas, that went to 40,000 in the first hour. The database ran out of connections within twenty minutes.
The fix they implemented was straightforward: cache product listings in Redis with a 10-minute TTL. Sellers update listings infrequently enough that 10-minute staleness was acceptable for their use case. Product detail pages (price, stock count) got a shorter TTL of 60 seconds and explicit cache invalidation on stock changes. The separation mattered: they weren't caching inventory numbers with a 10-minute TTL, which would have caused overselling.
Result: database connection count dropped from ~8,000 at peak to ~400. The database handled the residual load without strain. They also added TTL jitter after the first stampede taught them the hard way.
The key decision was figuring out which data could tolerate staleness and for how long. That's always the real question with caching. The tools are a detail.
FAQ
Q: How much memory does Redis actually need for caching?
A: It depends entirely on what you're caching and how many unique cache keys you have. A rough starting point: if you're caching 100,000 user session objects that are 2KB each, that's 200MB. Add overhead for Redis data structures (roughly 40-50 bytes per key), plus allocator fragmentation and some headroom, and you're looking at about 250-300MB. Start with 1-2GB on a managed instance and monitor actual memory usage; Redis exposes this cleanly via INFO memory.
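The raw part of that estimate can be checked with a few lines. This is a back-of-the-envelope sketch: the function name is illustrative, it uses decimal units (1 MB = 1,000,000 bytes), and it deliberately leaves out fragmentation and headroom, which is why it lands below the 250-300MB budget.

```python
def estimate_cache_memory_mb(num_keys: int, avg_value_bytes: int,
                             per_key_overhead_bytes: int = 50) -> float:
    """Rough lower bound on cache memory in MB (1 MB = 1,000,000 bytes)."""
    total_bytes = num_keys * (avg_value_bytes + per_key_overhead_bytes)
    return total_bytes / 1_000_000


# 100,000 sessions of about 2 KB each, with 50 bytes of per-key overhead.
print(estimate_cache_memory_mb(100_000, 2_000))  # 205.0
```

Treat the result as a floor, then compare it against what INFO memory reports once real traffic is flowing.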
Q: Should I cache database queries or application-level objects?
A: Application-level objects, almost always. Caching raw SQL results tightly couples your cache structure to your database schema. Caching the assembled domain object (the user profile as your application thinks of it) means your cache layer is independent of schema changes. It also lets you cache computed values, not just raw data.
Q: Is Redis safe to use as a session store?
A: Yes, and it's the standard approach. Use Redis with appendonly yes or configure a replica if you need session data to survive a primary restart. If you use Upstash or ElastiCache, persistence and replication are handled for you. Just be deliberate about session TTL: don't let inactive sessions accumulate indefinitely.
Q: Can caching cause data consistency issues?
A: Yes. This is the trade-off you're making. If your application can't tolerate stale data (financial balances, stock inventory, access control decisions), cache with extreme caution or not at all. For everything else, decide the maximum acceptable staleness for each data type and set TTL accordingly. Document the decision. Future engineers will thank you.
Q: When is caching the wrong solution?
A: When the underlying problem is a slow query that touches data that changes frequently. Caching a query that's slow because of a missing index just hides the problem. Fix the query first. Caching is best applied to inherently expensive or high-volume reads, not as a band-aid for queries that should be fast but aren't.
Caching is not magic. It's a deliberate trade-off between freshness and performance, and the decisions you make about TTL, invalidation strategy, and which data to cache determine whether it helps or creates a new class of problem. If you're adding Redis to a struggling system, start by identifying the five most frequently read queries in your database, check whether they're reading data that changes rarely, and cache those specifically. Don't cache everything; cache what the data tells you to cache.
If you're not sure where to start, an infrastructure review helps. That's the kind of work we do at SpectreDev before touching a single line of code.
External Documentation:
- [Redis Key Eviction] — Official Redis documentation on eviction policies and memory management.
- [AWS ElastiCache] — Managed Redis and Memcached on AWS, covers pricing and configuration options.