
// PUBLISHED 27.06.26

// TIME 10 MINS

// TAGS

#DATABASE #NOSQL #MONGODB #CASSANDRA

// AUTHOR

Spectre Command

When someone says "we should use NoSQL," that sentence carries almost no information. It's like saying "we should use a vehicle." A motorcycle and a truck are both vehicles. They solve completely different problems. Picking the wrong one because both qualify as "not a car" is how you end up carrying furniture on a Vespa.

NoSQL is a family of databases that share one characteristic: they don't use the relational model. Beyond that, they differ from each other in fundamental ways — in data model, query language, consistency guarantees, scaling behaviour, and the specific problems they're designed to solve.

The types of NoSQL databases explained here aren't interchangeable. A team that picks MongoDB because they heard it's "more scalable" than PostgreSQL, when they needed Cassandra's write throughput or Redis's latency profile, has made a choice that will quietly degrade their system for years.

Here's how each type actually works, and when each one earns its place.

// EXECUTIVE SUMMARY

>NoSQL is an umbrella term for completely distinct database engines; treating them as interchangeable creates massive structural debt.
>Document stores (MongoDB) excel at heterogeneous data like e-commerce catalogues, but shift relationship enforcement to application code.
>Column-family stores (Cassandra) are built exclusively for extreme write-heavy event ingestion and constrain your query options.
>Map your actual data access patterns on a whiteboard first, and let them drive the choice before you migrate off a stable PostgreSQL layer.

Document Databases: Flexible Structure, Relational Responsibility

Document databases — MongoDB, Firestore, CouchDB — store data as self-contained JSON-like documents. No fixed schema. Each document in a collection can have entirely different fields.

A products collection might contain a running shoe with fields for size_range, sole_material, and drop_height, alongside a laptop with fields for ram_gb, storage_type, and screen_resolution. The database accepts both without complaint.

This is genuinely useful in two scenarios:

Heterogeneous Datasets: Your data is actually heterogeneous. A product catalogue for a marketplace, where thousands of product categories each have different attributes, is a real case for document storage. Forcing every product type into a shared relational schema either results in hundreds of nullable columns or a complex entity-attribute-value pattern that's worse than the alternative.

Rapid Prototyping Cycles: Your schema is changing fast. Early in a product's life, when you're iterating on what data you even need to store, a document database removes the friction of writing and running migrations. This buys speed. It also accumulates debt — because without schema enforcement, your data can drift into inconsistency silently, and cleaning it up later is a manual process.

The honest limitation: relationships between documents are your responsibility. If you delete a user and forget to clean up their associated records in three other collections, those orphaned records sit there indefinitely. The database won't stop you. A relational database would have.
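A minimal sketch of what "your responsibility" means in practice. It uses plain Python dictionaries as a stand-in for a document database, so no real driver API is shown, and every collection name, field name, and helper (`USER_REFERENCES`, `delete_user`, `find_orphans`) is hypothetical. The point it illustrates: the application has to carry its own list of which collections reference a user, because nothing in the store will.

```python
import copy

# In-memory stand-in for a document database: collection name -> list of documents.
# All names below are illustrative assumptions, not a real database API.
db = {
    "users": [{"_id": "u1", "name": "Ada"}, {"_id": "u2", "name": "Lin"}],
    "orders": [{"_id": "o1", "user_id": "u1"}, {"_id": "o2", "user_id": "u2"}],
    "reviews": [{"_id": "r1", "user_id": "u1", "stars": 5}],
}

# The application, not the database, must know which collections point at a user.
USER_REFERENCES = {"orders": "user_id", "reviews": "user_id"}


def delete_user(db, user_id):
    """Delete a user and every document that references it."""
    db["users"] = [u for u in db["users"] if u["_id"] != user_id]
    for collection, field in USER_REFERENCES.items():
        db[collection] = [d for d in db[collection] if d.get(field) != user_id]


def find_orphans(db):
    """Return (collection, _id) pairs whose user reference points at no existing user."""
    known = {u["_id"] for u in db["users"]}
    return [
        (collection, doc["_id"])
        for collection, field in USER_REFERENCES.items()
        for doc in db[collection]
        if doc.get(field) not in known
    ]


# Naive delete: only the user document is removed, so its references are orphaned.
naive = copy.deepcopy(db)
naive["users"] = [u for u in naive["users"] if u["_id"] != "u1"]
orphans_after_naive = find_orphans(naive)

# Application-level cascade: references are cleaned up in the same operation.
delete_user(db, "u1")
orphans_after_cascade = find_orphans(db)
```

Note that even the cascading version is not atomic here: a crash between the loop iterations would still leave orphans, which is exactly the guarantee a relational foreign key with a transaction provides for free.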

Use document databases when your data model is genuinely document-shaped — self-contained records with variable structure. Don't use them because someone told you they scale better. At startup scale, PostgreSQL scales further than most teams expect before becoming the constraint.

Key-Value Stores: Speed First, Everything Else Second

Redis, Memcached, DynamoDB in its simplest form — key-value stores do exactly one thing with extraordinary efficiency. You store a value at a key. You retrieve it by key. That's it.

No joins. No schema. No complex queries. If you know the key, you get the value in microseconds. If you don't know the exact key, retrieval ranges from difficult to impossible depending on the database.

Redis is the most commonly deployed key-value store in production systems, and its role is almost always the same: caching. Put frequently-read data in Redis with an expiry. Your database handles writes and the occasional cache miss. Your API handles everything else from memory. Latency drops from tens of milliseconds to under one millisecond for the cached paths.
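The caching role described above is usually called the cache-aside pattern. Here is a rough sketch of it, assuming an in-memory `TTLCache` class as a stand-in for Redis (it is not the Redis client API) and a hypothetical `load_product_from_db` function standing in for the slow primary-database read.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a key-value cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expiry timestamp)
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like a miss
            return None
        return value


db_reads = 0


def load_product_from_db(product_id):
    """Hypothetical slow read from the primary database."""
    global db_reads
    db_reads += 1
    return {"id": product_id, "name": f"product-{product_id}"}


def get_product(cache, product_id, ttl_seconds=60):
    """Cache-aside: try the cache, fall back to the database, then populate the cache."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    product = load_product_from_db(product_id)
    cache.set(key, product, ttl_seconds)
    return product


cache = TTLCache()
first = get_product(cache, 42)   # miss: reads the database
second = get_product(cache, 42)  # hit: served from memory
```

The expiry is what keeps this honest: without a TTL (or explicit invalidation on write), the cache keeps serving stale data after the database changes.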

Redis also handles session storage (a user's authenticated session lives in Redis, not your primary database), rate limiting (increment a counter per IP per minute, evict it after the window), and pub/sub messaging (lightweight event broadcasting between services without a full message queue).
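The rate-limiting case can be sketched as a fixed-window counter. This is an in-memory illustration only: `FixedWindowRateLimiter` is a hypothetical class, and the dictionary cleanup stands in for what key expiry would do in Redis (where the same idea is commonly built from `INCR` plus `EXPIRE`).

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client per fixed time window."""

    def __init__(self, limit, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters = {}  # (client_id, window_index) -> request count

    def allow(self, client_id):
        window = int(self._clock() // self.window_seconds)
        # Drop counters from earlier windows; a real cache would expire these keys.
        self._counters = {k: v for k, v in self._counters.items() if k[1] == window}
        key = (client_id, window)
        self._counters[key] = self._counters.get(key, 0) + 1
        return self._counters[key] <= self.limit


# A controllable clock makes the window boundary easy to demonstrate.
now = [0.0]
limiter = FixedWindowRateLimiter(limit=3, window_seconds=60, clock=lambda: now[0])
results = [limiter.allow("203.0.113.7") for _ in range(4)]  # fourth call is rejected
now[0] = 61.0  # the next window starts, so the counter resets
after_reset = limiter.allow("203.0.113.7")
```

Fixed windows are the simplest variant and allow short bursts across a window boundary; sliding-window or token-bucket schemes trade a little complexity for smoother limits.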

DynamoDB deserves a separate note. AWS markets it as a general-purpose NoSQL database, and teams use it as one, but it's specifically optimised for high-volume key-based access at massive scale with predictable, single-digit-millisecond performance. The catch: your data model must be designed around your access patterns upfront. Changing access patterns later means redesigning the table. It also has no joins, limited query flexibility, and a pricing model that surprises teams who treat it like a general-purpose database rather than a high-throughput lookup system.

For most startups: Redis for caching and ephemeral data. Avoid DynamoDB unless you have a specific, extremely high-throughput key-lookup use case and an engineer who's operated DynamoDB in production before.

Column-Family Databases: Built for Writes That Never Stop

Apache Cassandra, HBase, Google Bigtable — this family is purpose-built for one thing: absorbing enormous write volumes across distributed nodes without degrading.

Cassandra is the one you'll encounter most. It's designed so that writes go to multiple nodes simultaneously, with no single point of failure, and the data is available for reads almost immediately. Write throughput grows as you add nodes, so a large, properly configured cluster can sustain write rates on the order of a million records per second. Large event pipelines handle click and transaction events at scale using Cassandra for exactly this reason: the write volume is too high for a relational database to absorb without aggressive sharding, and the data shape (events are self-contained records, not relational entities) fits the column-family model naturally.

The trade-off is severe query constraints. In Cassandra, your table design must be driven by your query patterns. You decide upfront which columns you'll query by, design your partition key around those queries, and accept that querying by any other field means a full cluster scan — which is slow and expensive.
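To make the cost difference concrete, here is a toy model of a table partitioned by a single key. It is not Cassandra's storage engine or query language; `PartitionedEventTable` and its field names are hypothetical, and `rows_examined` is only a rough proxy for work done. It shows why a lookup by partition key touches one partition while a lookup by any other field has to visit all of them.

```python
from collections import defaultdict


class PartitionedEventTable:
    """Toy table partitioned by one key; an illustration, not a real storage engine."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self._partitions = defaultdict(list)
        self.rows_examined = 0  # rough proxy for query cost

    def insert(self, row):
        self._partitions[row[self.partition_key]].append(row)

    def by_partition(self, key_value):
        """Lookup by partition key: only one partition is read."""
        rows = self._partitions.get(key_value, [])
        self.rows_examined += len(rows)
        return list(rows)

    def by_other_field(self, field, value):
        """Lookup by a non-key field: every partition has to be scanned."""
        matches = []
        for rows in self._partitions.values():
            for row in rows:
                self.rows_examined += 1
                if row.get(field) == value:
                    matches.append(row)
        return matches


table = PartitionedEventTable(partition_key="device_id")
for i in range(1000):
    table.insert({
        "device_id": f"dev-{i % 100}",
        "event_type": "click" if i % 2 == 0 else "view",
        "seq": i,
    })

table.rows_examined = 0
device_events = table.by_partition("dev-7")
cost_by_key = table.rows_examined

table.rows_examined = 0
clicks = table.by_other_field("event_type", "click")
cost_by_scan = table.rows_examined
```

In this toy run the keyed lookup examines only the rows in one partition, while the scan examines every row in the table. On a real cluster those partitions live on different machines, which is what turns the scan from slow into expensive.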

This is not a general-purpose database. It's a specialist tool. Teams that reach for Cassandra before exhausting what PostgreSQL or a well-structured document database can do are adding operational complexity they don't need yet.

When the requirement is: "we need to ingest millions of time-stamped events per day and query them by a known identifier" — Cassandra is a legitimate answer. For most other use cases, it's premature.

Graph Databases: When the Relationship Is the Data

Neo4j, Amazon Neptune, TigerGraph — graph databases model data as nodes (entities) and edges (relationships between entities). Both nodes and edges can have properties.

This becomes the right tool when the relationships themselves are what you're querying, and those relationships form complex, multi-hop networks that are expensive to traverse in a relational model.

Fraud detection is the clearest example. Detecting whether a new user shares a device fingerprint with a flagged account, which is connected to three other accounts that share a payment method with yet another flagged account — that's a graph traversal. A relational database can do this with recursive CTEs or multiple joins, but it gets expensive quickly as the network depth increases. A graph database does it naturally.
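The traversal described above is, at its core, a breadth-first search over shared attributes. A minimal sketch, assuming a small hand-built graph: the node names, the `edges` list, and `hops_to_flagged` are all made up for illustration, and a real graph database would express this in its own query language rather than application code.

```python
from collections import deque

# Undirected links between accounts and the devices or cards they share.
# All identifiers are illustrative.
edges = [
    ("acct:new", "device:d1"),
    ("device:d1", "acct:a"),
    ("acct:a", "card:c1"),
    ("card:c1", "acct:b"),
    ("acct:b", "device:d2"),
    ("device:d2", "acct:flagged"),
    ("acct:clean", "device:d9"),
]
flagged = {"acct:flagged"}

# Build an adjacency map so each hop is a dictionary lookup.
graph = {}
for left, right in edges:
    graph.setdefault(left, set()).add(right)
    graph.setdefault(right, set()).add(left)


def hops_to_flagged(graph, start, flagged, max_hops):
    """Breadth-first search: hop count to the nearest flagged node, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in flagged and node != start:
            return hops
        if hops == max_hops:
            continue  # do not expand beyond the allowed depth
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


risky = hops_to_flagged(graph, "acct:new", flagged, max_hops=6)
too_shallow = hops_to_flagged(graph, "acct:new", flagged, max_hops=4)
clean = hops_to_flagged(graph, "acct:clean", flagged, max_hops=6)
```

In a relational schema each hop here is another self-join, so the query grows with the depth you want to search. In a graph model, depth is just a parameter.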

Recommendation engines are another case. "Users who bought this also bought that" is a graph problem if the product-purchase-user relationships are the core of your product rather than a reporting feature.

Most startups will never need a graph database. The use cases are real but specific. If your product's core value isn't about traversing entity relationships at depth, the operational overhead of running and querying a graph database isn't worth it.

The Part Most Teams Get Wrong

They evaluate NoSQL databases by reputation, not by access pattern.

"Cassandra scales to billions of rows" is true. It's also irrelevant if your write volume is 10,000 records per day. You've taken on all of Cassandra's operational complexity and query constraints in exchange for write capacity you don't need and won't need for years.

"MongoDB is flexible" is true. It's also irrelevant if your data is relational — in which case, the flexibility is a liability, not an asset, because now consistency is your application's job.

The right question before picking any NoSQL database: what are your access patterns, and which database's data model maps cleanly to them?

Write those patterns down explicitly. "We need to store X, query by Y, and retrieve Z." Then evaluate each database against that list. If a database requires you to redesign your queries to fit its model, that's a signal — not necessarily a dealbreaker, but a real cost you're accepting.

The most common advice we give teams evaluating NoSQL: start with your data model and access patterns on a whiteboard. Most of the time, that exercise reveals that the data is more relational than it first appeared — and PostgreSQL with appropriate indexing is the right answer for the next two or three years.

Real-World Example: One Product, Three Different Answers

A logistics platform serving mid-size e-commerce networks provides a useful illustration. Three different data requirements, three different answers.

Core shipment records — sender, receiver, package details, status history, timestamps, relationships to drivers and warehouses — are relational. Foreign keys matter. Transactions matter. A shipment update must atomically update the shipment record and insert a status history entry. PostgreSQL handles this exactly as designed.
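The atomic update can be sketched with Python's built-in `sqlite3` module. SQLite is used here only so the example runs without a server; it is a stand-in for PostgreSQL, and the table layout, the `update_status` helper, and the `CHECK` constraint on the history table (added deliberately to force a failure mid-transaction) are assumptions, not the platform's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE shipments (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL
    );
    CREATE TABLE status_history (
        id INTEGER PRIMARY KEY,
        shipment_id INTEGER NOT NULL REFERENCES shipments(id),
        status TEXT NOT NULL
            CHECK (status IN ('created', 'in_transit', 'delivered'))
    );
    INSERT INTO shipments (id, status) VALUES (1, 'created');
    """
)


def update_status(conn, shipment_id, new_status):
    """Update the shipment and append a history row in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "UPDATE shipments SET status = ? WHERE id = ?",
            (new_status, shipment_id),
        )
        conn.execute(
            "INSERT INTO status_history (shipment_id, status) VALUES (?, ?)",
            (shipment_id, new_status),
        )


update_status(conn, 1, "in_transit")

# The history insert violates the CHECK constraint, so the UPDATE that
# already ran inside the same transaction is rolled back as well.
try:
    update_status(conn, 1, "teleported")
except sqlite3.IntegrityError:
    pass

status = conn.execute("SELECT status FROM shipments WHERE id = 1").fetchone()[0]
history = [row[0] for row in conn.execute(
    "SELECT status FROM status_history ORDER BY id"
)]
```

After the failed call, the shipment row and its history still agree with each other. That all-or-nothing behaviour is the property the shipment use case depends on.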

Real-time location events (GPS pings from driver apps, arriving at 5,000 events per minute during peak hours) are write-heavy, time-ordered, and self-contained. Each ping is a standalone record: driver ID, timestamp, latitude, longitude. No joins needed. A time-series optimised store or a column-family database handles this without degrading under the write load.

Session and API rate limiting data — which driver tokens are active, how many API calls each integration partner has made in the last minute — are ephemeral key-value pairs. Redis. Sub-millisecond reads, automatic expiry, no persistence needed.

Three use cases. Three databases. Each chosen because the data model matched, not because of marketing claims about scale.

The operational overhead of running three databases is real. You pay it because the alternative — forcing all three use cases into a single database that fits none of them well — costs more in query performance, developer time, and eventual migrations.

FAQ

Q: Is MongoDB still a good choice, or has it fallen out of fashion?

A: MongoDB is a solid, mature database for the use cases it fits. The legacy "MongoDB is bad" sentiment from the industry came from teams that used it for relational data and then paid the consistency debt later. Used correctly — variable-structure documents, fast schema iteration, non-relational data — it's fine. The question isn't whether MongoDB is fashionable. It's whether your data model is document-shaped.

Q: Redis is listed as a key-value store, but I've seen teams use it as a primary database. Is that wrong?

A: It depends on your tolerance for data loss. Redis stores data in memory with optional persistence. If the process crashes before a write is flushed to disk, that write is lost. For cache data or ephemeral sessions, that's acceptable. For order records or payment data, it isn't. Teams that use Redis as a primary database for durable data are accepting a risk that often isn't made explicit when the decision is made.

Q: When should I add a time-series database like InfluxDB or TimescaleDB?

A: When you're storing high-volume time-stamped measurements and querying them by time range with aggregations: IoT sensor data, application metrics, financial tick data. TimescaleDB is a PostgreSQL extension for time-series data, which makes it a low-overhead first step if you're already on PostgreSQL. A dedicated time-series database earns its place when TimescaleDB's query performance no longer meets your needs at scale.

Q: We have a small team. Does running multiple databases make sense for us?

A: Probably not yet. The operational overhead (backups, monitoring, failure modes, different query languages) is real, and the smaller the team, the larger the share of its time that overhead consumes. Start with one database that fits your primary use case. Add Redis when you have a specific caching or session need that's causing measurable problems. Reach for a third database only when the second one is genuinely the constraint. Premature polyglot persistence is a maintenance problem waiting to happen.

Q: How do I explain this database choice to a non-technical co-founder or investor?

A: The filing cabinet analogy works. A relational database is a filing cabinet with labelled folders, strict rules about what goes in each folder, and cross-references between folders that the cabinet itself enforces. A document database is a cabinet where you can put anything in any drawer, in any shape: fast to fill, harder to keep organised. A key-value store is a locker with a combination lock: instant access if you know the combination, useless if you don't. A graph database is a network diagram, where the lines between items are what you look up. Each tool fits a different job.

The types of NoSQL databases each solve a real problem. None of them solve the problem of picking the wrong one. That choice is made on a whiteboard, against your actual access patterns, before any vendor is evaluated.

If you're in the middle of a data layer decision and the conversation has jumped straight to vendor comparison before the access patterns are defined, that's the moment to slow down. The right database chosen deliberately will be invisible. The wrong one will be the story your engineers tell at every quarterly retrospective.
