3. Define Minimum Survivable Service (MSS)

When every alarm goes off in the cockpit, a pilot doesn’t try to fix everything. They do one thing: keep the aircraft flying.

The Shock happened eight hours ago. Since then, teams have been fighting furiously, but they are scattered. Each team is restoring its own systems, with no shared definition of what “recovery” actually means. Critical paths remain broken while non-essential components come back online. Leadership sees activity everywhere, yet no clear progress.

But from the outside, none of this matters. Customers still cannot log in. Data is inaccessible. Transactions do not go through. And support is unreachable.

Meanwhile, customers are pressing the company for assurance that it will recover. The company does not even know what data is safe, what money moved, or who did what.

Failure mode

During a large-scale incident, many companies attempt to “restore everything” at once.

  • Infrastructure teams chase full recovery while identity systems, billing, documentation, and internal coordination remain impaired. 
  • Engineering effort gets diluted across non-essential features. 
  • Leadership debates priorities instead of executing them. 
  • The company restores some level of functionality, but fails to meet critical regulatory or contractual commitments. 

Without an explicit definition of what must work for the company to survive, recovery efforts default to guesswork.

A Minimum Survivable Service (MSS) is that missing definition.

Objectives

During a major disruption, the objective is not to keep the product “fully operational.” It is to keep the company alive. High availability focuses on uptime. Minimum survivability focuses on continuity of the business. 

The Minimum Survivable Service (MSS) is not a degraded version of your full product. It is a deliberately reduced operating mode that:

  • preserves core value for customers,
  • prevents irreversible damage,
  • limits financial and legal exposure,
  • and keeps recovery paths open.

In practice, this often means accepting that most features will be unavailable, and that this is an acceptable and intentional outcome during extreme conditions.

Solutions

Core capabilities

Defining an MSS starts with one question: “What must the company do to justify its continued existence?” Identify the single most critical customer outcome, then define the smallest feature set that delivers that outcome. Usually, this includes:

  • the ability to authenticate a limited set of internal operators,
  • read-only access to critical data,
  • and a minimal execution path for core workflows.

Fixing this minimal set lets you explicitly exclude non-essential functionality and document which components may remain offline: convenience features, analytics, integrations, admin tooling, and growth-related capabilities. These may be painful to lose, but they are not existential.

This distinction must be made explicitly and in advance. Under stress, teams tend to overestimate what is “critical” and attempt to save too much, too late.
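
One way to make the distinction explicit ahead of time is to record it as data rather than leave it in people's heads. The sketch below is only an illustration: the component names, the two tiers, and the helpers `mss_scope` and `misclassified` are all hypothetical, and the dependency check is an added assumption about what such a record could be used for. It lists what the MSS actually needs, what may stay offline, and which "deferred" components turn out to be required by an existential one.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EXISTENTIAL = "existential"  # must work for the company to survive
    DEFERRED = "deferred"        # may remain offline during the incident


@dataclass(frozen=True)
class Component:
    name: str
    tier: Tier
    depends_on: tuple[str, ...] = ()


def mss_scope(components):
    """Names of every component the MSS needs, including dependencies."""
    by_name = {c.name: c for c in components}
    needed = set()
    stack = [c.name for c in components if c.tier is Tier.EXISTENTIAL]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(by_name[name].depends_on)
    return needed


def misclassified(components):
    """Components marked DEFERRED that an existential component relies on."""
    scope = mss_scope(components)
    return sorted(
        c.name for c in components
        if c.tier is Tier.DEFERRED and c.name in scope
    )


# Hypothetical catalog, loosely following the capabilities listed above.
CATALOG = [
    Component("operator_auth", Tier.EXISTENTIAL),
    Component("critical_data_read", Tier.EXISTENTIAL,
              ("operator_auth", "primary_db_replica")),
    Component("core_workflow", Tier.EXISTENTIAL, ("critical_data_read",)),
    Component("primary_db_replica", Tier.DEFERRED),
    Component("analytics", Tier.DEFERRED),
    Component("integrations", Tier.DEFERRED),
    Component("admin_tooling", Tier.DEFERRED),
]

scope = mss_scope(CATALOG)
may_stay_offline = sorted(c.name for c in CATALOG if c.name not in scope)

print("MSS scope:", sorted(scope))
print("May stay offline:", may_stay_offline)
print("Review classification of:", misclassified(CATALOG))
```

In this made-up catalog, the database replica was labeled as deferred even though read-only data access depends on it. That is the kind of mismatch that is far cheaper to discover in a calm review than eight hours into an incident.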

Deliverables: written answers to the following questions.

  • Which datasets are existential, and which are not?
  • Which systems may remain offline without threatening survival?
  • What must work for customers to access data, export data, perform transactions, and contact support?
  • What is the smallest version of our product that must function?

Roadmap towards the MSS
