3. Define Minimum Survivable Service (MSS)
When every alarm goes off in the cockpit, a pilot doesn’t try to fix everything. They do one thing: keep the aircraft flying.
The Shock happened eight hours ago. Since then, teams have been fighting furiously, but they are scattered. Each team is restoring its own systems, without a shared definition of what “recovery” actually means. Critical paths remain broken while non-essential components come back online. Leadership sees activity everywhere, yet no clear progress.
But from the outside, none of this matters. Customers still cannot log in. Data is inaccessible. Transactions do not go through. And support is unreachable.
Meanwhile, customers are pressing the company for assurance that it will recover. The company cannot even say what data is safe, what money has moved, or who did what.
Failure mode
During a large-scale incident, many companies attempt to “restore everything” at once.
- Infrastructure teams chase full recovery while identity systems, billing, documentation, and internal coordination remain impaired.
- Engineering effort gets diluted across non-essential features.
- Leadership debates priorities instead of executing them.
- The company restores some level of functionality, but fails to meet critical regulatory or contractual commitments.
Without an explicit definition of what must work for the company to survive, recovery efforts default to guesswork.
A Minimum Survivable Service (MSS) is the answer to that missing question.
Objectives
During a major disruption, the objective is not to keep the product “fully operational.” It is to keep the company alive. High availability focuses on uptime. Minimum survivability focuses on continuity of the business.
The Minimum Survivable Service (MSS) is not a degraded version of your full product. It is a deliberately reduced operating mode that:
- preserves core value for customers,
- prevents irreversible damage,
- limits financial and legal exposure,
- and keeps recovery paths open.
In practice, this often means accepting that most features will be unavailable, and that this is an acceptable and intentional outcome during extreme conditions.
Solutions
Core capabilities
Defining an MSS starts by asking: “what must the company do to justify its continued existence?” Identify the single most critical customer outcome, then define the smallest feature set that delivers that outcome. Usually, this includes:
- the ability to authenticate a limited set of internal operators,
- read-only access to critical data,
- and a minimal execution path for core workflows.
This will enable you to explicitly exclude non-essential functionality, and to document which components may remain offline: convenience features, analytics, integrations, admin tooling, and growth-related capabilities. These may be painful to lose, but they are not existential.
This distinction must be made explicitly and in advance. Under stress, teams tend to overestimate what is “critical” and attempt to save too much, too late. Questions to settle beforehand include:
- Which datasets are existential, and which are not?
- Which systems may remain offline without threatening survival?
- What must work for customers/users to access data? Export data? Perform transactions? Contact support?
- What is the smallest version of our product that must function?
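Answering these questions produces a concrete artifact: an explicit, versioned list of what is existential and what may stay offline. A minimal sketch of such a declaration, with purely hypothetical component names, might look like this:

```python
# Illustrative MSS declaration. Component names are hypothetical examples,
# not a prescribed taxonomy; replace them with your own system inventory.
MSS = {
    "critical": {
        "operator_auth",      # authenticate a limited set of internal operators
        "data_read_access",   # read-only access to critical data
        "core_transactions",  # minimal execution path for core workflows
    },
    "deferrable": {
        "analytics",
        "integrations",
        "admin_tooling",
        "growth_features",
    },
}

def may_stay_offline(component: str) -> bool:
    """A component may remain offline during extreme degradation
    if it is not part of the critical set."""
    return component not in MSS["critical"]
```

The value of writing this down is less the code than the forcing function: every component must land in exactly one bucket, before an incident.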
Roadmap towards the MSS
Once the MSS is defined, you can turn it into a concrete engineering roadmap.
One of the highest-leverage steps is often the ability to switch the product into a read-only mode, preserving data integrity and customer access while disabling complex write paths that amplify failure and recovery risk. Read-only operation dramatically reduces blast radius and buys time without requiring architectural rewrites.
In parallel, you can invest in feature flags that cleanly disable non-core functionality (integrations, automations, analytics, background jobs) so that recovery effort concentrates on existential flows only.
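The two mechanisms above can be combined into a single operating-mode switch. The following is a minimal sketch under assumed names (the mode values, feature names, and `ModeGuard` class are illustrative, not a specific library's API):

```python
from enum import Enum

class ServiceMode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"   # writes disabled, data remains accessible
    SUSPENDED = "suspended"   # only data preservation and operator access

# Hypothetical flag set: non-core functionality that is cleanly
# switched off whenever the product leaves normal operation.
NON_CORE_FEATURES = {"integrations", "automations", "analytics", "background_jobs"}

class ModeGuard:
    def __init__(self, mode: ServiceMode = ServiceMode.NORMAL):
        self.mode = mode

    def writes_allowed(self) -> bool:
        # Complex write paths amplify failure and recovery risk,
        # so they are only open in normal operation.
        return self.mode is ServiceMode.NORMAL

    def feature_enabled(self, feature: str) -> bool:
        if self.mode is not ServiceMode.NORMAL and feature in NON_CORE_FEATURES:
            return False
        return self.mode is not ServiceMode.SUSPENDED

# Switching to the MSS becomes one controlled operation:
guard = ModeGuard(ServiceMode.READ_ONLY)
```

Request handlers then check `guard.writes_allowed()` and `guard.feature_enabled(...)` instead of scattering ad-hoc kill switches across the codebase.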
This roadmap is not about perfect resilience; it is about ensuring that, when systems degrade, the product can be intentionally reduced to its survivable core rather than collapsing unpredictably.
- Draft a high-level roadmap towards the MSS
Defining the recovery horizon
This roadmap can be executed in two very different ways, and the choice directly determines the recovery horizon you are willing to accept:
- Teams may implement these changes ahead of time: building read-only modes, feature flags, and degraded paths while systems are healthy, so that switching to the MSS becomes a controlled operation measured in minutes or hours.
- Alternatively, companies may accept a longer recovery horizon and defer this work until a disaster occurs, effectively redesigning the product under stress, with limited access, incomplete context, and degraded tooling.
Both approaches are valid in theory, but they represent an explicit tradeoff between preparation effort and downtime tolerance. Preparedness is not about eliminating outages; it is about deciding in advance how much time, revenue, and trust the company is willing to lose while rediscovering how to survive.
- Decide your recovery horizon for the MSS (is it 1 hour or 1 day?)
- Confront your MSS engineering roadmap with this horizon
- Arbitrate which changes you want to engage in right now
Payment, billing and revenue continuity
For companies where financial transactions are part of the core workflow, preparedness requires additional steps: you need to decide what happens to transactions under degradation. Do transactions continue, pause, or degrade? Are new transactions accepted, queued, or blocked? Can refunds and payouts continue, or are they suspended? Are balances frozen? Is all of the above legal?
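These answers should be recorded as an explicit policy, decided in advance, rather than improvised per incident. A minimal sketch, with hypothetical transaction types and policy choices shown purely as examples:

```python
from enum import Enum

class TxPolicy(Enum):
    CONTINUE = "continue"   # process normally
    QUEUE = "queue"         # accept now, settle when systems recover
    BLOCK = "block"         # reject with a clear, honest error

# Hypothetical pre-agreed policy table for degraded operation.
# The choices below are examples; yours depend on legal and product constraints.
DEGRADED_POLICY = {
    "new_payment": TxPolicy.QUEUE,     # accept but defer capture
    "refund": TxPolicy.BLOCK,          # suspend until the provider recovers
    "payout": TxPolicy.BLOCK,
    "balance_update": TxPolicy.BLOCK,  # balances frozen
}

def transaction_policy(tx_type: str, degraded: bool) -> TxPolicy:
    if not degraded:
        return TxPolicy.CONTINUE
    # Unknown transaction types default to the safest option.
    return DEGRADED_POLICY.get(tx_type, TxPolicy.BLOCK)
```

A table this small is easy to review with legal and finance before an incident, which is precisely the point.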
If a provider becomes durably unavailable, the company must still be able to answer: what happened, when, and why. This means maintaining its own transaction ledger, recording immutable transaction events with timestamps and identifiers, and storing these records in systems independent from the payment provider itself.
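The essential properties are append-only writes and tamper evidence. One common way to get the latter is to chain each record to the hash of the previous one; the sketch below shows this with an in-memory list, though in practice the records would be persisted outside the provider's systems (field names are illustrative):

```python
import hashlib
import json
import time

class TransactionLedger:
    """Append-only ledger kept independently of the payment provider.
    Each record embeds the hash of the previous record, so any
    after-the-fact modification is detectable."""

    def __init__(self):
        self.records = []
        self._prev_hash = "0" * 64  # genesis marker

    def append(self, tx_id: str, event: str, amount_cents: int, currency: str) -> dict:
        record = {
            "tx_id": tx_id,
            "event": event,            # e.g. "authorized", "captured", "refunded"
            "amount_cents": amount_cents,
            "currency": currency,
            "ts": time.time(),
            "prev_hash": self._prev_hash,
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._prev_hash = record["hash"]
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the hash chain; False means a record was altered."""
        prev = "0" * 64
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "hash"}
            if body["prev_hash"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != r["hash"]:
                return False
            prev = r["hash"]
        return True
```

After a provider outage, this record is what lets the company answer “what happened, when, and why” without depending on the provider's own reporting.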
The extra-territorial reach of the U.S. dollar creates a structural dependency that matters in crisis scenarios. Because many payment rails, financial institutions, cloud services, and intermediaries operate in or clear through dollars, companies can become subject to U.S. legal, regulatory, or sanctions actions even when they operate outside the United States. In practice, this can result in sudden payment freezes, account suspensions, or service interruptions. Preparedness therefore requires recognizing that dollar-denominated dependencies are not neutral infrastructure: they embed jurisdictional risk. Diversifying payment rails, banking relationships, and currencies is not a political choice, but a pragmatic step to preserve operational continuity and control.
For EU-based companies in particular, this often means complementing global providers like Stripe or Adyen with at least one EU-based payment rail: SEPA Credit Transfers, SEPA Instant or emerging EU-native schemes such as Wero. The objective is not full redundancy, but jurisdictional and operational optionality.
- Implement an EU-based payment rail
- Decide what happens to transactions under degradation
Cost, liability, and damage containment
An effective MSS:
- avoids violating contractual or regulatory obligations,
- prevents data corruption or loss,
- preserves evidence and logs for post-incident analysis,
- and limits cloud spend and third-party costs.
In some scenarios, the safest survivable mode is partial or total service suspension, combined with preservation of data and access paths. This decision must be considered before an incident, not improvised during one.
- Identify your critical contractual and regulatory obligations. Does your MSS meet them? If not, what is the magnitude of the risk, and are you willing to take it?
Conclusion
While implementation is technical, defining the MSS is ultimately a business choice. It requires input from founders and executives, legal and finance, product leadership and engineering. The question being answered is not “what can we keep running?” but “what must we keep possible?”
In extreme disruption scenarios, resilience is not about doing everything. It is about doing enough.