6. Infrastructure exit strategy


Twenty-four hours after the Shock, the mood briefly shifts. Against all odds, the IT team has made real progress. A regional provider was identified. Accounts were opened. Networks recreated. Containers pulled. One by one, core services start responding again. It is slower than before, rougher, missing pieces - but alive. For the first time since the Shock, there is visible momentum.

That is, until some of the most critical workloads refuse to start. The binaries were compiled years ago against a specific CPU architecture. In the primary cloud, that detail was invisible. The new provider offers only a different processor family. A few services can be recompiled. Others cannot. Certain legacy components depend on libraries that no one has built in years. The engineers realize the uncomfortable truth: what looked portable was only portable inside its original environment.
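The architecture trap in this story is detectable long before a crisis. As a rough illustration (nothing in this sketch comes from the original text), a script can read the ELF header of every compiled artifact and report which CPU family each one targets; anything that exists for only one architecture is a portability risk. The artifact directory is a hypothetical placeholder.

```python
# Sketch: audit compiled binaries for the CPU architecture they target.
# The ./artifacts directory is hypothetical - point it at your own builds.
import struct
from pathlib import Path

# Common e_machine values from the ELF specification.
ELF_MACHINES = {0x03: "x86", 0x28: "arm32", 0x3E: "x86-64", 0xB7: "aarch64"}

def elf_architecture(path: Path) -> str | None:
    """Return the target architecture of an ELF binary, or None if not ELF."""
    with path.open("rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None  # not an ELF file (script, config, data...)
    # EI_DATA at byte 5 gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is the 16-bit field at offset 18.
    (machine,) = struct.unpack(endian + "H", header[18:20])
    return ELF_MACHINES.get(machine, f"unknown(0x{machine:02X})")

if __name__ == "__main__":
    for artifact in sorted(Path("./artifacts").rglob("*")):
        if artifact.is_file():
            arch = elf_architecture(artifact)
            if arch is not None:
                print(f"{artifact}: {arch}")
```

Run regularly in CI, a report like this turns "only portable inside its original environment" from a crisis discovery into a line item.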

Objective

Too many organizations frame “exit” as something you do after a crisis: when a provider fails, when sanctions hit, when legal access is revoked, when costs explode. That framing is backwards: you don’t design emergency exits when the fire starts - you build them during construction and run fire drills. An exit strategy is not a response mechanism, it is a design principle. 

The objective is simple and unapologetic: you must be able to redeploy elsewhere - even if it’s slow. It does not need to be “seamless” or “automated” - it just needs to be possible, within a time frame of your own choosing.

Solutions

Cold restart beats live multi-cloud

Let’s clear one myth immediately: live multi-cloud is not the goal. Running active workloads across multiple hyperscalers sounds impressive on slides, but in practice it multiplies cost and complexity. Consider a typical SaaS platform attempting live multi-cloud (the sketch after this list shows the duplication involved):

  • Every service must be compatible with multiple provider-specific networking models.
  • Identity, permissions, and secrets must be synchronized across clouds.
  • Managed services must be avoided or replaced with custom equivalents.
  • Debugging becomes exponentially harder, because failures are no longer local.
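To make the first and third bullets concrete, here is a minimal sketch (assumed, not taken from the original text) of what “the same operation on two clouds” looks like in application code: storing a single object already means two provider-specific SDKs, two credential chains, and two error models. The bucket names are hypothetical.

```python
# Sketch: the same "store one object" operation, duplicated per provider.
# Requires boto3 (AWS) and google-cloud-storage (GCP); bucket names are
# hypothetical placeholders.
import boto3
from google.cloud import storage

def store_on_aws(key: str, data: bytes) -> None:
    # AWS path: S3 client, IAM credential chain, botocore error types.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="example-primary-bucket", Key=key, Body=data)

def store_on_gcp(key: str, data: bytes) -> None:
    # GCP path: a different SDK, a different auth flow, different exceptions.
    client = storage.Client()
    client.bucket("example-standby-bucket").blob(key).upload_from_string(data)
```

Every write path, retry policy, and permission boundary now exists twice - and object storage is the easy, stateless case.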

The result is a system that is expensive to run, difficult to evolve, and understood by fewer and fewer people over time. 

Maintaining a fully live secondary environment is expensive and usually unnecessary. A cold standby is enough - and far more realistic. A cold standby environment typically includes (each item is concrete enough to check mechanically, as the sketch after this list shows):

  • An active account with an alternative provider.
  • Validated access paths (VPN, credentials).
  • Infrastructure code that has been applied at least once.
  • Deployment pipelines that have successfully run.
  • Data restoration procedures that have been tested.
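Each item on that list can be verified rather than taken on faith. A minimal sketch follows, assuming Terraform for the infrastructure code; the endpoint, scripts, and commands are placeholders for whatever your environment actually uses:

```python
# Sketch: verify the cold standby is still usable. Commands, endpoints,
# and script names are hypothetical placeholders.
import socket
import subprocess
import sys

def vpn_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Check that the standby provider's access endpoint answers at all."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def command_succeeds(cmd: list[str]) -> bool:
    """Run a check command; only the exit code matters here."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

CHECKS = {
    # Validated access path: can we even reach the standby provider?
    "access": lambda: vpn_reachable("vpn.standby-provider.example"),
    # Infrastructure code still plans cleanly. With -detailed-exitcode,
    # terraform exits 0 (no changes) or 2 (changes); 1 means failure.
    "infra_code": lambda: subprocess.run(
        ["terraform", "plan", "-detailed-exitcode"], capture_output=True
    ).returncode in (0, 2),
    # Deployment pipeline: a hypothetical smoke-deploy entry point.
    "pipeline": lambda: command_succeeds(["./deploy.sh", "--dry-run"]),
    # Data restoration: a hypothetical restore-verification script.
    "restore": lambda: command_succeeds(["./verify_restore.sh"]),
}

if __name__ == "__main__":
    failed = [name for name, check in CHECKS.items() if not check()]
    for name in failed:
        print(f"STANDBY CHECK FAILED: {name}")
    sys.exit(1 if failed else 0)
```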

A cold standby does not necessarily include live workloads or mirrored traffic. If the need arises, you will already be in a strong position: an environment that has the right data but responds slowly. That gives you breathing room to learn how to run the infrastructure under load.

As a practice run, you can spin up your cold environment once per quarter, deploy the MSS, validate that it starts, and tear it down again. This is not redundancy, it is muscle memory. 
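The quarterly drill can be as small as the sketch below. It assumes Terraform plus a hypothetical ./deploy_mss.sh and health endpoint; the point is that every run produces a wall-clock number to compare against the restart window you consider acceptable:

```python
# Sketch: quarterly cold-restart drill - stand up, validate, tear down, time it.
# deploy_mss.sh and the health URL are hypothetical stand-ins.
import subprocess
import time
import urllib.request

RESTART_BUDGET_SECONDS = 4 * 3600  # example target: four hours

def run(cmd: list[str]) -> None:
    print(f"+ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

def mss_responds(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    start = time.monotonic()
    try:
        run(["terraform", "apply", "-auto-approve"])   # recreate the environment
        run(["./deploy_mss.sh"])                       # hypothetical MSS deploy
        assert mss_responds("https://standby.example/health"), "MSS never came up"
        elapsed = time.monotonic() - start
        print(f"cold restart took {elapsed / 60:.0f} min "
              f"(budget: {RESTART_BUDGET_SECONDS / 60:.0f} min)")
    finally:
        run(["terraform", "destroy", "-auto-approve"])  # always tear down
```

The finally block matters: the drill must tear the environment down even when it fails, or the cold standby quietly becomes a warm - and billed - one.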

An exit strategy doesn’t require that you run everywhere. It requires that you can run elsewhere. The right mental model is not “failover”, it is cold restart: the ability to stand up your MSS from scratch, on a different infrastructure, within a bounded and acceptable timeframe. The first time you deploy elsewhere must not be during an actual crisis.

Portable by design, not abstracted to death
