27 June 2026
What Google and Microsoft taught me about deploying at scale
Personal lessons from Google and Microsoft applied to AI deployments: reliability, observability, documentation, speed, and experimentation for real LLM systems.
Scale starts before traffic
When I think back to six years at Google and two years at Microsoft, I do not first think about datacenters, huge metrics, or the distributed systems that make conference talks sound impressive. I think about concrete design reviews, postmortems without drama, documents that could still be found three years later, and one very practical obsession: will somebody still understand, operate, and repair this system when the original team has moved on?
That question has followed me into almost every AI project I have shipped since. Many mid-size teams approach LLMs as a model question: which provider, which context window, which prompt, which agent framework. That is understandable because the model is the visible part. But once the system touches a real business workflow, scale does not only mean more users. It means more edge cases, more interpretations, more dependencies, more accountability, and less tolerance for answers that nobody can explain.
The first lesson from large engineering cultures is therefore simple: scale starts before traffic. It starts when you decide what must be predictable, observable, documented, and reversible. An internal LLM assistant used by fifty people can already be a scaled system if it influences sales decisions, customer replies, or sensitive operations. On the other hand, a demo used by a thousand visitors can remain a toy if nobody truly depends on its output.
What enterprise engineering gets right
Google taught me the value of reliability as a daily discipline. Not abstract reliability that looks good on a slide, but the kind that forces teams to name failure modes, define invariants, watch the signals that matter, and accept that a system without a clear owner is already drifting. For generative AI, this means you cannot only measure whether the average answer looks elegant. You need to know when the system does not know, when it hallucinates, when source data is missing, when an external tool fails, and what the user should see in each case.
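One way to make "what the user should see in each case" concrete is to name the failure modes in code and check that each one has a defined user-facing response. The sketch below is only an illustration: the `FailureMode` names, the message wording, and the `unmapped_modes` helper are assumptions, not something taken from a specific system.

```python
from enum import Enum


class FailureMode(Enum):
    """Hypothetical failure modes, mirroring the cases listed above."""
    NO_KNOWLEDGE = "no_knowledge"
    SUSPECTED_HALLUCINATION = "suspected_hallucination"
    MISSING_SOURCE_DATA = "missing_source_data"
    TOOL_FAILURE = "tool_failure"


# Illustrative wording only; the real messages belong to the product team.
USER_MESSAGES = {
    FailureMode.NO_KNOWLEDGE:
        "I don't have enough information to answer this reliably.",
    FailureMode.SUSPECTED_HALLUCINATION:
        "This answer could not be verified against a source. Please review it before use.",
    FailureMode.MISSING_SOURCE_DATA:
        "The source data needed for this answer is missing.",
    FailureMode.TOOL_FAILURE:
        "An external tool failed. Please try again or contact support.",
}


def user_message(mode: FailureMode) -> str:
    """Return the message shown to the user for a given failure mode."""
    return USER_MESSAGES[mode]


def unmapped_modes() -> list[FailureMode]:
    """List failure modes that have no user-facing message yet."""
    return [mode for mode in FailureMode if mode not in USER_MESSAGES]
```

Running `unmapped_modes()` in a test suite turns "every failure mode has an answer for the user" into a property that fails loudly when someone adds a new mode and forgets the message.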
Microsoft shaped my view of enterprise environments in a different way: security, integration, governance, and compatibility with habits that already exist. In a large organization, a good product is not the one that requires everyone to change systems tomorrow morning. It is the one that respects identities, permissions, approval flows, document formats, legal constraints, and support teams. LLM projects often miss this. The assistant answers well in a sandbox, then struggles when it has to live inside SharePoint, Teams, Salesforce, an old intranet, or a business validation process.
The third strength of enterprise engineering is documentation. It can be heavy and sometimes frustrating, but it creates operational memory. In an AI project, that memory matters even more than in a classic software product, because behavior is not always visible from the code. Why is this prompt written this way? Which data sources were excluded? Which errors were accepted? Which threshold triggers human review? If these decisions are not written down, the team slowly loses control of the system.
What that culture gets wrong
I also saw the other side of that rigor: large organizations can make every change too expensive. One more review, one more committee, one more dependency, and eventually the organization protects the system from risk so well that it also protects it from learning. This is especially dangerous with LLMs. The domain moves quickly, models change, costs change, usage patterns change, and some decisions can only be made by watching real users work through a real workflow.
Many teams inherit the wrong reflex from big-company engineering: they want to design the perfect architecture before testing human behavior. They spend weeks debating the agent framework, the orchestration graph, or the multi-model strategy, while the main risk sits somewhere else. Maybe users do not trust the output. Maybe the data is too ambiguous. Maybe the time saved is too small to change an existing habit. Serious engineering does not replace experimentation. It should make experimentation safer and more legible.
This is where scale-ups have an advantage if they know how to use it. They can decide quickly, reduce scope, test with a real team, fix issues in days, and then industrialize what deserves it. They lose that advantage when they copy big-company slowness without copying big-company reliability mechanisms. The worst of both worlds is common: many meetings, few logs, an impressive demo, no reproducible evaluation, and nobody who can explain why yesterday's answer was better than today's.
Applying those lessons to LLM deployment
When I deploy an AI system for a mid-size company today, I try to combine both cultures. The initial scope should stay small, but the foundations should not be improvised. I want a use case narrow enough to learn quickly and real enough to expose production constraints: permissions, exceptions, messy data, incomplete conversations, user feedback, and cases where the system should refuse to answer.
In practice, this starts with observability. A production LLM system should leave a useful trace: prompt version, model used, retrieved documents, tool calls, latency, cost, evaluation score when one exists, user feedback, and the reason for any fallback. Without that, teams debate impressions. With it, they can do real engineering work: compare two versions, identify the source of a bad answer, reduce cost without losing quality, or decide that a task should remain human.
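A minimal sketch of such a trace, assuming one record per LLM call. The `LLMTrace` class, its field names, and the `mean_eval_score` helper are hypothetical and chosen to mirror the list above. They are not a standard schema or the API of any tracing library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LLMTrace:
    """One record per LLM call; field names are illustrative assumptions."""
    prompt_version: str
    model: str
    retrieved_doc_ids: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    eval_score: Optional[float] = None      # None when no evaluation exists
    user_feedback: Optional[str] = None     # e.g. "up", "down", or free text
    fallback_reason: Optional[str] = None   # None when no fallback happened

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


def mean_eval_score(traces: list[LLMTrace], prompt_version: str) -> Optional[float]:
    """Average evaluation score for one prompt version, ignoring unscored calls."""
    scores = [
        t.eval_score
        for t in traces
        if t.prompt_version == prompt_version and t.eval_score is not None
    ]
    return sum(scores) / len(scores) if scores else None


# Usage: compare two prompt versions from logged traces.
traces = [
    LLMTrace(prompt_version="v1", model="model-a", eval_score=0.5),
    LLMTrace(prompt_version="v1", model="model-a", eval_score=1.0),
    LLMTrace(prompt_version="v2", model="model-a", fallback_reason="tool timeout"),
]
v1_score = mean_eval_score(traces, "v1")
v2_score = mean_eval_score(traces, "v2")  # None: no scored calls for v2 yet
```

Keeping `eval_score` and `fallback_reason` optional is deliberate in this sketch: a missing score stays distinguishable from a score of zero, so a comparison between versions is not skewed by calls that were never evaluated.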
Then comes product documentation, not only technical documentation. The team needs to write down what the system is supposed to do, what it will not do, what counts as a good answer, what counts as an acceptable error, and what should trigger escalation. For a support agent, this is not a detail. For a contract review assistant, it is not a luxury. For a content generation pipeline, it is the difference between a tool the team improves and a black box the team eventually routes around.
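Part of that product documentation can live next to the code as an explicit, reviewable object. The sketch below is one hedged way to do it: `ProductSpec`, `route`, the topic labels, and the idea of a single confidence score in [0, 1] are all assumptions made for illustration. A real system would need its own definition of scope and its own signal for when to escalate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductSpec:
    """Written-down scope and escalation rule (hypothetical structure)."""
    in_scope: frozenset          # topics the system is supposed to handle
    out_of_scope: frozenset      # topics it must refuse
    review_threshold: float      # below this confidence, a human reviews


def route(spec: ProductSpec, topic: str, confidence: float) -> str:
    """Decide what happens to a request: 'answer', 'escalate', or 'refuse'."""
    if topic in spec.out_of_scope:
        return "refuse"
    if topic not in spec.in_scope:
        # Undocumented territory: escalate rather than guess.
        return "escalate"
    if confidence < spec.review_threshold:
        return "escalate"
    return "answer"


# Usage with made-up topics for a support agent.
support_spec = ProductSpec(
    in_scope=frozenset({"billing", "password reset"}),
    out_of_scope=frozenset({"legal advice"}),
    review_threshold=0.7,
)
decision = route(support_spec, "billing", confidence=0.55)  # "escalate"
```

The point of the sketch is less the routing logic than the fact that the threshold and the scope lists become values someone can read, review, and change in one place.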
The right level of seriousness
The trap for technical leaders is to believe they must choose between startup speed and enterprise rigor. My experience says the opposite. The best AI deployments move fast because they are rigorous about the right things. They do not try to lock everything down. They lock down irreversible decisions, ownership boundaries, measurement mechanisms, and rollback paths. Everything else should remain experimental long enough for the team to actually learn.
In practice, that means shipping a first version that looks less like a strategic announcement and more like an operable system: one precise workflow, a few demanding users, complete traces, a weekly review, a short list of known failures, and an explicit decision about what happens next. You do not earn the trust of a CTO or product team by promising that the model will improve. You earn it by showing that the system can be observed, corrected, limited, and improved without relying on one heroic person.
That may be the most durable lesson from Google and Microsoft: technology changes quickly, but trust is built slowly. LLMs make some capabilities astonishingly accessible. They do not remove the need to design systems people can understand. For scale-ups, the opportunity is large: take the speed of the new AI building blocks, add just enough enterprise engineering culture, and ship tools that do more than impress in a demo. Ship tools that hold up when real work begins.