19 June 2026

What I learned deploying LLMs for real clients

Honest lessons from deploying LLMs with real clients: what breaks between demo and production, and how latency, prompt engineering, cost, adoption, and ownership decide whether an AI system survives in production.

The demo always lies a little

I have rarely seen an LLM deployment fail because the language model could not produce one impressive answer in a demo. The demo is usually convincing: clean data, a carefully chosen scenario, a patient user, and an output that makes the future feel close. Production is different. Production brings incomplete documents, business exceptions, impatient users, access constraints, and cases nobody remembered to include in the script.

That is the first lesson I learned with real clients across media, insurance, education, and cinema: an LLM deployment should not be judged by its best example. It should be judged on the Wednesday morning when the system receives an ambiguous request, stale context, a badly named file, and still needs to help someone do real work. The distance between prototype and production is not a detail. It is the work.

Latency changes the product

On a slide, a few seconds of latency feels acceptable. In a steering committee, nobody worries about three seconds. Inside a real workflow, those three seconds can break attention. A journalist searching archives, an insurance operator processing a case, a teacher preparing material, or a film production team looking for an asset does not experience latency as an infrastructure metric. They experience it as friction.

As an AI engineer, I learned to treat latency as a product constraint, not an implementation detail. Sometimes the right answer is routing to a smaller model, precomputing context, streaming partial output as soon as it is available, caching expensive steps, or accepting a less ambitious response that arrives while the user still needs it. AI in production is not the pursuit of the best abstract answer. It is the pursuit of the most useful answer within the available time.
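
To make the routing and caching ideas concrete, here is a minimal sketch, not a description of any client system. It assumes you have measured a typical latency and an eval-based quality rank for each model yourself. The names `ModelOption`, `pick_model`, `BudgetedAnswerer`, and the `call_model` callback are hypothetical and stand in for whatever client your stack uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ModelOption:
    name: str                 # hypothetical label, not a real product name
    typical_latency_s: float  # assumed: measured by you on real traffic
    quality_rank: int         # assumed: higher means better in your own evals


def pick_model(options: List[ModelOption], budget_s: float) -> Optional[ModelOption]:
    """Return the highest-quality model whose typical latency fits the budget."""
    fitting = [m for m in options if m.typical_latency_s <= budget_s]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.quality_rank)


class BudgetedAnswerer:
    """Routes each question by latency budget and caches results per model."""

    def __init__(self, options: List[ModelOption],
                 call_model: Callable[[str, str], str]) -> None:
        self.options = options
        self.call_model = call_model  # hypothetical: (model_name, question) -> answer
        self.cache: Dict[Tuple[str, str], str] = {}

    def answer(self, question: str, budget_s: float) -> Optional[str]:
        model = pick_model(self.options, budget_s)
        if model is None:
            # Nothing fits: the caller decides how to degrade (search only, retry later).
            return None
        key = (model.name, question.strip().lower())
        if key not in self.cache:
            self.cache[key] = self.call_model(model.name, question)
        return self.cache[key]
```

The budget is passed per request on purpose: an interactive search box and an overnight batch job can share the same models with very different tolerances. The cache key here is a naive normalized string, which is an assumption that only holds when identical questions really should get identical answers.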

The prompt matters, but it cannot rescue a fragile system

The prompt matters. I write prompts, version them, test them, and still believe that good prompt engineering can make a system much more reliable. But part of the public conversation around prompts creates a dangerous impression: if the behavior is bad, a better instruction should fix it. In real projects, that is almost never enough. The prompt usually exposes problems in context, data, architecture, and ownership.

When an assistant confuses two insurance rules, invents a hierarchy between sources, or answers too confidently on an incomplete education corpus, the problem is not only wording. You need to define trusted sources, precedence between documents, refusal conditions, output contracts, logs, and regression tests. A robust prompt looks less like a magic phrase and more like a contract between the model, the product, and the business.
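
One way to picture that contract is to keep precedence and refusal in code around the model rather than in the prompt alone. The sketch below is illustrative and rests on assumptions: the source types, their ordering in `SOURCE_PRECEDENCE`, and the helpers `select_context` and `enforce_contract` are all hypothetical, and the model call that would produce the draft answer is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical precedence: a lower number wins when two sources disagree.
# Any source type missing from this table is treated as untrusted.
SOURCE_PRECEDENCE = {"regulation": 0, "policy_contract": 1, "internal_faq": 2}


@dataclass(frozen=True)
class Passage:
    source_type: str
    doc_id: str
    text: str


@dataclass(frozen=True)
class Answer:
    status: str                  # "answered" or "refused"
    text: str
    cited_doc_id: Optional[str]  # None when the answer is a refusal


def select_context(passages: List[Passage]) -> List[Passage]:
    """Drop untrusted sources and order the rest by precedence."""
    trusted = [p for p in passages if p.source_type in SOURCE_PRECEDENCE]
    return sorted(trusted, key=lambda p: SOURCE_PRECEDENCE[p.source_type])


def enforce_contract(draft_text: str, cited_doc_id: Optional[str],
                     context: List[Passage]) -> Answer:
    """Accept a model draft only if it cites a passage from the trusted context."""
    if not context:
        return Answer("refused", "No trusted source covers this question.", None)
    allowed_ids = {p.doc_id for p in context}
    if cited_doc_id not in allowed_ids:
        return Answer("refused", "The draft did not cite a trusted source.", None)
    return Answer("answered", draft_text, cited_doc_id)
```

Because both functions are plain and deterministic, they are easy to cover with regression tests: each production incident can become a fixed list of passages and an expected `status`.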

Cost is not something to discover after launch

The cost of a language model is easy to underestimate while usage is low. A demo consumes little. A controlled pilot also feels harmless. Then real users arrive, documents get longer, retries happen, tools are called, embeddings are refreshed, evals run, logs accumulate, and one failed intermediate step causes the system to call the model three more times. At that point, cost is no longer a line in a spreadsheet. It is an operating constraint.

The strongest projects I have seen do not discover this with the first monthly bill. They define budgets per workflow, context limits, caching strategies, fallback models, and value metrics early. The useful question is not only how much one call costs. It is how much one resolved task costs, how much one avoided error is worth, and how much user trust is gained or lost with each tradeoff.
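
The shift from cost per call to cost per resolved task is simple arithmetic once calls are logged with a task identifier. The sketch below assumes exactly that; the `CallRecord` fields, the per-1,000-token prices passed in as arguments, and the per-task budget are placeholders, not real provider rates.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class CallRecord:
    task_id: str        # assumed: every model call is logged with its task
    input_tokens: int
    output_tokens: int


def call_cost(call: CallRecord, price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one call, given placeholder prices per 1,000 tokens."""
    return (call.input_tokens / 1000) * price_in_per_1k \
        + (call.output_tokens / 1000) * price_out_per_1k


def cost_by_task(calls: Iterable[CallRecord], price_in_per_1k: float,
                 price_out_per_1k: float) -> Dict[str, float]:
    """Sum the cost of all calls per task, retries included."""
    totals: Dict[str, float] = {}
    for call in calls:
        cost = call_cost(call, price_in_per_1k, price_out_per_1k)
        totals[call.task_id] = totals.get(call.task_id, 0.0) + cost
    return totals


def cost_per_resolved_task(totals: Dict[str, float],
                           resolved_task_ids: Set[str]) -> Optional[float]:
    """Total spend, failed tasks included, divided by the number of resolved tasks."""
    if not resolved_task_ids:
        return None
    return sum(totals.values()) / len(resolved_task_ids)


def tasks_over_budget(totals: Dict[str, float], budget_per_task: float) -> List[str]:
    """Task ids whose accumulated cost exceeds the per-task budget."""
    return sorted(t for t, cost in totals.items() if cost > budget_per_task)
```

Dividing total spend by resolved tasks, rather than by calls, is the design choice that matters: retries and abandoned tasks stay in the numerator, so a workflow that fails often looks as expensive as it really is.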

Adoption depends less on magic than on integration

The most consistent surprise is this: users do not adopt a system because it is impressive. They adopt it because it fits into their day without asking them to become AI operators. In a newsroom, a school, an insurance team, or a studio, people already have tools, deadlines, habits, and quality standards. If the system requires too much explanation, too much copy-paste, or too much verification, it remains a curiosity.

What makes an LLM deployment last is usually less spectacular: a narrow scope, an interface close to the workflow, a human in the loop where risk is real, simple evals, latency monitoring, cost monitoring, and a team that knows who owns what. Production rewards discipline. Clients are not paying to watch a model shine. They are paying for a problem to disappear without three new problems appearing around it.