4bis. AI, models, and inference portability
The CTO saw it first in the weekly dashboard: median inference latency had drifted from 1.4 to 2.8 seconds over three weeks. No alerts. No incident. The provider's status page was green. But two days later, the internal eval suite came back: contract extraction quality had dropped four points, and the provider had announced nothing. Customer-reported accuracy incidents had tripled over the same window. The Head of Customer Success was forwarding one complaint a day from enterprise clients in Germany and the Netherlands, until one of them terminated the contract for breach. Within two days, a competitor in Austin announced it had won that customer. That was the last straw.
The call with the LLM provider was professionally apologetic. A "Priority access" tier had been added to the enterprise offering; unfortunately, it was not available outside the US "yet". Would it ever be?
The product was built on GPT-4o. The prompts were tuned for it. The retrieval pipeline's embeddings lived in the provider's vector space, so no alternative model could query them; re-embedding the corpus would take three weeks of compute. Migrating off was a six-month project.
In this chapter:
- Failure modes
- Objectives
- Solutions
  - Abstract every production AI call (sketched below)
  - Add an EU-hosted provider alongside the primary
  - Build an open-weights fallback for MSS-critical flows
  - Treat prompt portability as an engineering discipline
  - Preserve the underlying assets
  - Use BYOK and EU residency where available
  - Govern workspace AI deliberately
  - Alignment with the AI Act
- Conclusion
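
To make the first of those solutions concrete before the chapter walks through it, here is a minimal sketch of what abstracting every production AI call can look like. It is an illustration under assumptions, not any vendor's real SDK: the names (`InferenceProvider`, `StubPrimary`, `StubOpenWeights`, `complete_with_fallback`) are all hypothetical, and a real adapter would wrap a vendor client where these stubs return canned answers or simulated failures.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Completion:
    text: str
    model: str         # which model actually answered
    latency_ms: float  # recorded per call, so drift like 1.4 s -> 2.8 s is visible


class InferenceProvider(ABC):
    """Application code depends on this interface, never on a vendor SDK."""

    @abstractmethod
    def complete(self, prompt: str) -> Completion:
        ...


class StubPrimary(InferenceProvider):
    """Stands in for the primary hosted model; a real adapter would call
    the vendor's client here."""

    def complete(self, prompt: str) -> Completion:
        raise TimeoutError("simulated degradation at the primary provider")


class StubOpenWeights(InferenceProvider):
    """Stands in for an EU-hosted or self-hosted open-weights fallback."""

    def complete(self, prompt: str) -> Completion:
        start = time.perf_counter()
        answer = f"[fallback answer to: {prompt[:40]}]"
        return Completion(answer, "open-weights-local",
                          (time.perf_counter() - start) * 1000)


def complete_with_fallback(primary: InferenceProvider,
                           fallback: InferenceProvider,
                           prompt: str) -> Completion:
    """Try the primary; on failure, route to the fallback instead of
    waiting out a vendor incident."""
    try:
        return primary.complete(prompt)
    except Exception:
        return fallback.complete(prompt)


if __name__ == "__main__":
    result = complete_with_fallback(
        StubPrimary(), StubOpenWeights(),
        "Extract the termination clause from this contract.")
    print(result.model, f"{result.latency_ms:.2f} ms", result.text)
```

The point is the dependency direction: application code imports only `InferenceProvider`, never a vendor SDK, so adding or swapping a provider means writing one new adapter rather than running the six-month migration from the story above.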