// PUBLISHED 24.04.26
// TIME 10 MINS
// TAGS
#TECH DEBT #MONOLITH MIGRATION #REFACTORING STRATEGY
// AUTHOR
Spectre Command

At some point, the question stops being "should we rewrite this?" and becomes "how do we do it without the business dying in the process?"

That's the hard part. A greenfield rewrite sounds clean on a whiteboard. In practice, you're replacing the engine of a plane that's already in the air. Customers are still signing up. Revenue is still flowing. Your team is expected to keep shipping product while simultaneously dismantling and rebuilding the thing that powers it.

Most software rewrites that fail don't fail because of bad engineering. They fail because of bad strategy — no clear boundary between old and new, no plan for the transition period, and no honest accounting of how long it will actually take. This post covers how to do it without stopping your business.

Why the "Big Bang" Rewrite Almost Always Goes Wrong

The instinct is understandable. The old system is a mess. Starting fresh sounds like relief. So the team scopes out a full rewrite, estimates six months, and gets executive sign-off.

Twelve months later, the rewrite isn't done, the old system has continued accumulating bugs that nobody's fixing, and the team is exhausted. This is not a hypothetical — it's the most common rewrite story in the industry. Netscape famously did this starting in 1998, spent almost three years shipping the rewritten browser as Netscape 6, and never recovered its market position. The lesson didn't stick.

The core problem with big bang rewrites is that the old system is a moving target. While your team builds the new one, the business keeps adding requirements to the old one. By the time the new system is "done," it's already behind. And you haven't had a single day of reduced risk in the interim — you've had a year of double the operational surface area and half the engineering attention on each.

The alternative isn't to accept the old system forever. It's to replace it incrementally, in a way that lets you keep operating throughout.

The Strangler Fig Pattern: The Right Mental Model

There's a pattern in software architecture called the Strangler Fig. It's named after a tropical tree that grows around a host tree over decades, gradually replacing it — until one day the host is gone and the strangler fig is standing on its own.

Applied to a software rewrite, it means this: you don't replace the old system all at once. You build the new system alongside it, migrate one piece of functionality at a time, and route traffic gradually from old to new. The old system slowly shrinks. The new one grows. At some point — with much less drama than a big bang — the old system handles nothing and can be decommissioned.
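
To make the mechanics concrete, here's a minimal sketch of a strangler routing layer in TypeScript on Node. The hostnames and path prefixes are invented for illustration; in practice this logic usually lives in your existing API gateway or load balancer rather than a hand-rolled proxy, but the shape is the same.

```typescript
// A minimal strangler routing layer: requests whose paths have been
// migrated go to the new system; everything else falls through to the
// legacy monolith. Hostnames and paths are illustrative.
import http from "node:http";

const LEGACY = { host: "legacy.internal", port: 8080 };
const MODERN = { host: "new-platform.internal", port: 8080 };

// Paths already migrated to the new system. This list grows one entry
// at a time as each component cuts over.
const migratedPrefixes = ["/invoices", "/notifications"];

const server = http.createServer((req, res) => {
  const target = migratedPrefixes.some((p) => req.url?.startsWith(p))
    ? MODERN
    : LEGACY;

  // Forward the request unchanged to whichever backend owns this path.
  const upstream = http.request(
    { ...target, path: req.url, method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    }
  );
  upstream.on("error", () => {
    res.writeHead(502);
    res.end("upstream unavailable");
  });
  req.pipe(upstream);
});

server.listen(3000);
```

Decommissioning the old system, in this picture, is nothing more dramatic than the moment the fallthrough branch stops receiving traffic.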

This approach works because it forces you to make decisions incrementally. Each migration is a discrete project with a clear scope, a clear test, and a clear rollback plan. You're never in a position where the new system has to be 100% complete before you get any value from it.

It also works because it keeps the business viable throughout. Users might not notice anything changing. Revenue keeps flowing. Engineers can still ship features — on the new platform, as each piece migrates.

[→ Read: How to run a technical debt audit — a guide for non-engineer founders]

How to Sequence the Migration

The sequence matters more than most people realise. Get it wrong and you'll spend the first six months on the hardest, most interdependent parts of the system — the ones that can't be migrated without touching everything else. You'll burn momentum and trust before you've shipped anything.

Start at the edges, not the core. The edges of your system are the parts with the fewest dependencies: background jobs, reporting pipelines, notification services, internal admin tools. These can often be migrated without touching the core application at all. They're lower risk, faster to move, and they give your team early wins that build confidence in the approach.
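
One rough way to operationalise "edges first": score each component by its dependency count and migrate in ascending order. A sketch, with an invented component list; real dependency graphs come from your own architecture review, not a script.

```typescript
// Score each component by how many components it depends on (outbound)
// plus how many depend on it (inbound), then migrate lowest-score first.
// The component names and edges here are illustrative.
const dependencies: Record<string, string[]> = {
  invoicing: ["orders"],
  notifications: ["orders"],
  reporting: ["orders", "drivers"],
  drivers: ["orders"],
  orders: ["drivers"], // the tangled core: depends on and is depended on
};

function migrationOrder(deps: Record<string, string[]>): string[] {
  const score = (name: string) =>
    deps[name].length + // outbound dependencies
    Object.values(deps).filter((d) => d.includes(name)).length; // inbound

  return Object.keys(deps).sort((a, b) => score(a) - score(b));
}

console.log(migrationOrder(dependencies));
// ["invoicing", "notifications", "reporting", "drivers", "orders"]
// Edges first, core last.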

Identify your seams. A seam is a natural boundary in the system — a place where one part of the software talks to another through a clean interface. These are your migration boundaries. If your payment processing already talks to the rest of the application through a well-defined API, it can be replaced independently. If everything is tangled together with no clear separation, you need to create the seam before you can migrate anything.
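
Here's a sketch of what a seam looks like in code, using payment processing as the example. The interface and class names are hypothetical; the point is that callers depend on the contract, not on either implementation.

```typescript
// A seam, expressed as an interface both implementations satisfy. The
// rest of the application depends only on this contract, which makes
// the legacy and new implementations interchangeable.
interface PaymentProcessor {
  charge(orderId: string, amountCents: number): Promise<{ transactionId: string }>;
  refund(transactionId: string): Promise<void>;
}

// Wraps the tangled legacy code behind the seam. Creating this wrapper
// is often the real migration work: until it exists, nothing can move.
class LegacyPaymentAdapter implements PaymentProcessor {
  async charge(orderId: string, amountCents: number) {
    // ... calls into the monolith's existing payment code path
    return { transactionId: `legacy-${orderId}` };
  }
  async refund(transactionId: string) {
    // ... legacy refund path
  }
}

// The replacement, built against the same contract.
class NewPaymentService implements PaymentProcessor {
  async charge(orderId: string, amountCents: number) {
    // ... calls the new payment service's API
    return { transactionId: `new-${orderId}` };
  }
  async refund(transactionId: string) {
    // ... new refund path
  }
}

// Callers never know which side of the migration they're on.
async function checkout(processor: PaymentProcessor, orderId: string) {
  return processor.charge(orderId, 4999);
}
```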

Tackle the data layer carefully. This is where rewrites most often go wrong. Moving application logic is relatively forgiving — you can test it, run both versions in parallel, compare outputs. Moving data is not forgiving. A mistake in a data migration can mean lost transactions, corrupted records, or a state that can't be easily recovered.

For anything touching financial data, order history, or user accounts, the approach should be: write to both databases in parallel during the transition, validate consistency continuously, and only cut over reads once you're confident the new store is correct. It's slower. It's also the only safe way to do it.
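
A sketch of the dual-write shape, assuming a simple order store; the interfaces are invented. The essential properties: the legacy store stays authoritative, divergence in the new store is logged rather than surfaced to the user, and reads only move once continuous validation comes back clean.

```typescript
// Dual-write during the transition: every write goes to both stores,
// with the legacy store remaining the source of truth for reads until
// the cutover flag is flipped.
interface OrderStore {
  saveOrder(order: { id: string; totalCents: number }): Promise<void>;
  getOrder(id: string): Promise<{ id: string; totalCents: number } | null>;
}

class DualWriteOrderStore implements OrderStore {
  constructor(
    private legacy: OrderStore,
    private modern: OrderStore,
    private readFromModern: () => boolean // cutover flag, flipped only after validation
  ) {}

  async saveOrder(order: { id: string; totalCents: number }) {
    // Legacy write first: it is still authoritative, so its failure
    // must fail the request. A failed write to the new store is logged
    // for reconciliation rather than shown to the user.
    await this.legacy.saveOrder(order);
    try {
      await this.modern.saveOrder(order);
    } catch (err) {
      console.error("dual-write divergence, order", order.id, err);
    }
  }

  async getOrder(id: string) {
    return this.readFromModern()
      ? this.modern.getOrder(id)
      : this.legacy.getOrder(id);
  }
}

// Continuous validation: sample records from both stores and compare.
async function validateSample(legacy: OrderStore, modern: OrderStore, ids: string[]) {
  const mismatches: string[] = [];
  for (const id of ids) {
    const [a, b] = await Promise.all([legacy.getOrder(id), modern.getOrder(id)]);
    if (JSON.stringify(a) !== JSON.stringify(b)) mismatches.push(id);
  }
  return mismatches; // cut over reads only once this stays empty over time
}
```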

Plan your traffic routing. As each component migrates, you need a way to control which traffic goes to which system. This is typically done with a feature flag or a routing layer at the API gateway level. It lets you send 1% of traffic to the new system, watch it, expand to 10%, watch it, and so on. It also gives you an instant rollback path — if something goes wrong, you flip the flag, not the infrastructure.
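
A minimal sketch of percentage-based routing behind a flag, with invented names. Hashing the user ID keeps each user pinned to one system for the duration of the rollout; the kill switch makes rollback a config change, not a deploy.

```typescript
// Percentage-based routing with an instant kill switch. The rollout
// fraction and the flag live in config, so both can change without
// touching infrastructure.
interface RolloutConfig {
  enabled: boolean;      // kill switch: false sends 100% of traffic to legacy
  percentToNew: number;  // 0 to 100, raised gradually: 1, 10, 50, 100
}

// Hash the user ID so each user consistently lands on the same system,
// avoiding flip-flopping between backends mid-session.
function hashToBucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function routeToNewSystem(userId: string, config: RolloutConfig): boolean {
  if (!config.enabled) return false; // instant rollback path
  return hashToBucket(userId) < config.percentToNew;
}

// Example: 10% of users on the new system, stable per user.
const config: RolloutConfig = { enabled: true, percentToNew: 10 };
console.log(routeToNewSystem("user-4821", config));
```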

The Staffing Trap Most Companies Fall Into

Here's a decision that will determine whether your rewrite succeeds or fails: do you use the same team that built the old system, or do you bring in people who will build the new one?

The honest answer is: you need both, used carefully.

The engineers who built the old system carry irreplaceable knowledge. They know why certain decisions were made. They know which parts of the system are actually stable and which ones are held together with intent and luck. Without them, the new team will repeat old mistakes or, worse, accidentally break things they didn't know existed.

But those same engineers are often the most resistant to the rewrite — not out of ego, but because they understand the complexity better than anyone. They know how long things will actually take.

The pattern that works: keep your existing senior engineers as architects and domain experts. Let them define the interfaces, review the new system's design, and own the migration sequencing. Bring in additional capacity — either new hires or an external team — to build against those interfaces. This way, knowledge is transferred in the process of building, not lost.

What doesn't work: treating the rewrite as a separate project, staffing it with a parallel team that's never allowed to talk to the engineers who know the system, and calling it done when the new platform passes a test suite written by people who don't fully understand what the old system does.

[→ Read: How to build a backend that scales from 100 to 10 million users]

A Concrete Example: Migrating a Monolithic Order System

A logistics platform we worked with had a classic problem. Their monolithic backend handled everything — order intake, routing, driver assignment, status updates, invoicing — in a single application on a single database. It had been built fast in the early days and worked well until scale hit. At around 50,000 orders per day, the database started struggling. Deployments required full downtime windows. A bug in the invoicing logic once took down order routing.

They couldn't stop. Orders were coming in around the clock.

The migration started with invoicing — the most isolated component, with clear inputs and outputs. We built a new invoicing service, deployed it alongside the monolith, and ran both in parallel for four weeks, comparing outputs on every invoice. When confidence was high, we cut the monolith's invoicing logic to read-only and switched live traffic to the new service. The monolith didn't notice. Customers didn't notice. But the team had their first working piece of the new architecture in production.
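
A sketch of what that parallel run can look like in code; the types and function names are invented. Every invoice is computed by both systems, the legacy result is what the customer sees, and any divergence is logged for investigation.

```typescript
// Parallel-run comparison: the new system runs in shadow mode, so its
// failures and mismatches never affect the live result.
interface Invoice {
  orderId: string;
  lineItems: { sku: string; cents: number }[];
  totalCents: number;
}

async function generateInvoice(
  orderId: string,
  legacyInvoicing: (id: string) => Promise<Invoice>,
  newInvoicing: (id: string) => Promise<Invoice>
): Promise<Invoice> {
  const legacy = await legacyInvoicing(orderId);

  // Shadow call: compare asynchronously, log divergence, never block.
  newInvoicing(orderId)
    .then((candidate) => {
      if (JSON.stringify(candidate) !== JSON.stringify(legacy)) {
        console.warn("invoice mismatch", orderId, { legacy, candidate });
      }
    })
    .catch((err) => console.warn("new invoicing failed", orderId, err));

  return legacy; // legacy stays authoritative until cutover
}
```

Four weeks of an empty mismatch log is what "confidence was high" actually means in practice.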

From there: driver assignment, then status updates, then order routing. Each migration took four to eight weeks. The core order intake — the most complex, most interdependent part — was last. By the time they got there, the team had run this process four times and were genuinely good at it. The final migration was the smoothest of all.

Total timeline: fourteen months. During that entire period, the business never had a planned downtime window. Order volumes grew 60% while the migration was underway. And when it was done, they had a system they could actually operate at scale.

What the Rewrite Will Cost — Honest Numbers

This is where most rewrite plans fall apart: the estimate.

The mistake is calculating only engineering time. A rewrite costs engineering time, yes — but it also costs product velocity during the transition (features you couldn't build because the team was migrating), operational overhead of running two systems simultaneously, and the management attention required to keep the business aligned through a multi-month architectural change.

A realistic rule: a rewrite of a system your team built over two to three years will take twelve to eighteen months done properly. If someone tells you six months, they're either planning a big bang (risky) or they haven't scoped it honestly.

Budget for the parallel period. Running two systems simultaneously means two infrastructure bills, two monitoring setups, two things that can break at 2am. It's not permanent, but it's not free either.

And protect feature velocity. If you tell the business "we're doing a rewrite, no new features for a year," you will either break the commitment or break the business. The strangler fig approach works in part because it lets you keep shipping features on the new platform as each component migrates. That's not an accident — it's by design.

[→ Read: The real cost of technical debt: how one architectural shortcut became a $2M problem]

FAQ

Q: How do we know when a rewrite is actually necessary versus just refactoring?

A: The threshold is structural. If the current architecture makes it physically impossible to do what the business needs — can't scale to required load, can't add a feature without breaking three others, can't deploy without a downtime window — that's a rewrite signal. If the code is messy but the architecture is sound, refactoring is almost always the better answer. Don't rewrite because the code is embarrassing. Rewrite because the structure is a ceiling.

Q: Should we tell customers we're rewriting the system?

A: Generally, no. Customers care about reliability and uptime, not implementation details. If a migration goes wrong and causes an incident, be transparent about it. But announcing a multi-month rewrite to your users tends to create anxiety without giving them anything actionable. Internally, your key stakeholders — investors, large customers with enterprise contracts, anyone with an SLA — should know the roadmap.

Q: What's the biggest risk during a rewrite?

A: Data inconsistency during the transition period. When you're writing to two systems simultaneously, maintaining consistency takes active effort — and a gap in that effort can mean real business consequences. The second biggest risk is timeline drift: the rewrite stretches, the old system deteriorates further, the team loses confidence. Both risks are managed the same way: short migration cycles, continuous validation, and a clear definition of "done" for each phase.

Q: How do we handle features that customers request during the rewrite?

A: Triage ruthlessly. Features that can be built on the new platform should be — that's actually beneficial, because it accelerates validation of the new system. Features that would require deep work on the old system should be deferred or descoped unless they're genuinely business-critical. The mistake is adding significant new functionality to the old system mid-migration; you're increasing the surface area of what needs to be replicated.

Q: Can we run a rewrite with the same team that handles production support?

A: You can, but you need to protect the rewrite work from being constantly interrupted by support fires. That means at least a partial split: some engineers dedicated to the migration with protected time, others handling ongoing operations and bug fixes. If your entire team is permanently on-call for the old system, the rewrite will never get the sustained attention it needs.


There's no such thing as a risk-free rewrite. But there's a meaningful difference between managed risk — incremental migrations, parallel running, continuous validation — and reckless risk, which is what a big bang rewrite represents. The goal isn't a perfect new system on day one. It's a business that keeps running while you build toward something better.

If you're at the point where the rewrite conversation is happening but you're not sure how to sequence it, that's the exact moment to get architectural help. The decisions made in the first three months of a migration shape everything that follows.


// END_OF_LOG
SPECTRE_SYSTEMS_V1
