Apr 20, 2026

LLM Production Deployment: What to Check After the Pilot

A practical guide to moving an LLM from pilot to production: limits, observability, access control, model selection, common mistakes, and a launch checklist.


What changes after the pilot

A successful pilot almost always works because it is carefully scoped. The team takes clear requests, tests one model, and watches how it performs in a quiet environment. Production begins where that comfort ends: real data is noisier, requests are longer, and users do not repeat the demo.

Problems that seem rare in a pilot become part of daily life after launch. As traffic grows, queues appear, latency spikes, and provider errors start to show up. What worked for 100 requests a day can start falling apart at 1,000, especially if some requests carry a large context or require a long answer.

A good example is an internal assistant for customer support. In a demo, it answers short questions from the knowledge base. After launch, employees paste customer emails, spreadsheets, and old threads. The average token count grows several times over, and so does the bill. The team thinks it is paying “per request,” but in reality it is paying for every extra bit of context and every wordy answer.

After the pilot, several things usually change at once:

  • not only the number of requests grows, but also their spread
  • traffic peaks appear at the same times every day
  • cost starts depending on tokens, not on the number of users
  • the same model shows different quality across different tasks
  • provider outages stop being a rare exception

Another common surprise is tied to model choice. In a pilot, it is convenient to take one “smartest” model and run everything through it. In production, that is rarely efficient. A model that writes customer-facing replies well may be poor at extracting fields from a document or may add too much latency in chat. A simpler model often handles classification just as well and costs much less.

So after the pilot, not only the scale changes, but the operating logic changes too. Instead of asking “which model is best overall,” a different question appears: “which model fits this operation under this price and latency limit?” If models can be switched through the same API, the transition is smoother. But even then, without control over tokens, limits, and quality, the pilot quickly turns into an expensive habit.

How to start the move into production

After the pilot, the team usually does not lack a model; it lacks clear rules. If you do not agree on which scenario to launch first and how to tell whether the launch succeeded, production quickly turns into a debate about impressions.

Start by choosing one working scenario. Not the whole list of ideas at once, but one clear flow where the benefit is visible in numbers. For example, an operator gets a draft customer reply instead of writing it from scratch. For such a scenario, one main metric is enough: the share of accepted replies, average handling time, or a reduction in manual edits.

Then estimate the load. Look not only at average requests per day, but also at peaks: Monday morning, month end, a campaign email, or a wave of users returning after an app update. If the pilot handled 300 requests a day and the feature will be released to the whole support team, the load can rise 20x within a few hours. Also estimate prompt and response length separately, because both cost and latency depend on more than request count.

It is better to roll out in stages:

  • internal use for your own team and adjacent teams
  • beta for a small user group or a slice of traffic
  • full release after metrics and incidents are checked

This makes it easier to see where logic breaks, where the model is too slow, and where costs go above plan.

Before launch, assign owners. One person is responsible for model choice and replacement, another for data and masking, and a third for incidents and service degradation. When there is no owner, every problem gets stuck between teams and takes longer to fix than it should.

It also helps to define acceptable failure in advance. For example, the team may decide that 1–2% timeouts are acceptable during peaks, that sensitive scenarios require manual review, that the feature should shut itself off after repeated errors, and that traffic should move to a backup model if latency grows. These boundaries remove unnecessary arguments. In production, it is better not to search for a perfect answer. It is much more important to launch a scenario where the goal, load, stages, and responsible people are clear.

How to set limits without unnecessary blocking

After a pilot, teams often swing to one of two extremes: either they set almost no limits, or they cut traffic so early that the product starts blocking itself. The workable approach sits in the middle. Limits should curb overspending and traffic spikes, but they should not break useful scenarios.

Constraints should be set across several dimensions at once:

  • requests per minute and per hour
  • input and output tokens per day or per month
  • budget per team, product, or feature
  • use of expensive models

One RPM limit will not save you if some of the calls turn into long responses. And token limits alone will not show that one customer is hitting the API too often with tiny requests.

Next, separate limits by purpose, not “for everyone at once.” Production, test, and sandbox should live apart. Support, the internal assistant, and document generation should not share one pool either. Otherwise, tests will eat the working budget, and a secondary feature will block the one customers need every day.
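As an illustration, separated pools that combine several dimensions can be expressed as plain configuration. The pool names and numbers below are placeholder assumptions for the sketch, not recommendations or any specific gateway's format:

```python
# Illustrative limit pools: names and numbers are placeholders, not recommendations.
# Each pool combines several dimensions so no single metric tells the whole story.
LIMIT_POOLS = {
    "prod/support-assistant": {
        "requests_per_minute": 120,
        "input_tokens_per_day": 2_000_000,
        "output_tokens_per_day": 500_000,
        "monthly_budget_usd": 1500,
        "allowed_models": ["cheap-chat", "strong-chat"],
    },
    "prod/document-extraction": {
        "requests_per_minute": 30,
        "input_tokens_per_day": 5_000_000,
        "output_tokens_per_day": 200_000,
        "monthly_budget_usd": 800,
        "allowed_models": ["cheap-extract"],
    },
    "test/sandbox": {
        "requests_per_minute": 10,
        "input_tokens_per_day": 100_000,
        "output_tokens_per_day": 50_000,
        "monthly_budget_usd": 50,
        "allowed_models": ["cheap-chat"],
    },
}
```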

A full stop rarely helps. It is better to build in a soft failure. If the expensive model is unavailable or the team has exceeded its limit, the service can switch to a cheaper model, shorten the reply, turn off a nonessential feature, or queue the task. A user will usually tolerate a reply five seconds later more calmly than a blank error.

Traffic spikes are best checked with a test, not talked through in theory. Push 3–5x normal traffic and see what breaks first: the queue, timeouts, the external API, or your budget stopper. This kind of run quickly exposes weak points that are invisible in the pilot.
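One way to run that check is a small script that sends a burst well above the normal rate and records what fails. A rough sketch, assuming a single internal test endpoint and a synthetic request rather than any particular load-testing tool:

```python
import concurrent.futures
import time

import httpx

# Placeholder endpoint and payload: replace with your own test route and prompts.
URL = "https://llm-service.internal/v1/answer"
PAYLOAD = {"question": "What are the branch working hours?"}

def one_request(client: httpx.Client) -> tuple[int, float]:
    """Send one request and return (status_code, latency_seconds)."""
    start = time.monotonic()
    try:
        resp = client.post(URL, json=PAYLOAD, timeout=30.0)
        return resp.status_code, time.monotonic() - start
    except httpx.HTTPError:
        return 0, time.monotonic() - start  # 0 marks a transport-level failure

def run_burst(concurrency: int, total_requests: int) -> None:
    """Push roughly `concurrency` parallel requests until `total_requests` are sent."""
    with httpx.Client() as client, concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        results = list(pool.map(lambda _: one_request(client), range(total_requests)))
    errors = sum(1 for status, _ in results if status == 0 or status >= 500)
    slow = sum(1 for _, latency in results if latency > 10)
    print(f"errors: {errors}, responses slower than 10s: {slow}, total: {len(results)}")

# Example: roughly 3-5x the normal peak, sent as one burst.
run_burst(concurrency=20, total_requests=500)
```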

Retries and timeouts need discipline. Usually 1–2 retry attempts are enough, and only for temporary errors. Connection and response timeouts should be set separately, and there should be a short pause between retries so you do not hit an overloaded service with another storm of requests. The simple rule here is: the limit should protect the system, not punish the user.
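In code, that discipline can look roughly like this. A sketch assuming httpx and an OpenAI-style endpoint, with placeholder numbers to adapt rather than recommendations:

```python
import time

import httpx

# Separate connect and read timeouts: connecting should fail fast,
# while a long generation is allowed more time to come back.
TIMEOUT = httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=5.0)
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def call_llm(client: httpx.Client, payload: dict, max_retries: int = 2) -> httpx.Response:
    """Call the model endpoint with at most `max_retries` retries, only for temporary errors."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            # `client` is assumed to be created with the gateway's base_url.
            resp = client.post("/v1/chat/completions", json=payload, timeout=TIMEOUT)
            if resp.status_code not in RETRYABLE_STATUSES:
                return resp  # success, or a permanent error like 400/401 that retries won't fix
            last_error = RuntimeError(f"retryable status {resp.status_code}")
        except httpx.TransportError as exc:  # timeouts and connection problems
            last_error = exc
        if attempt < max_retries:
            # Short, growing pause so retries do not hit an overloaded service as one more storm.
            time.sleep(1.0 * (attempt + 1))
    raise RuntimeError("model call failed after retries") from last_error
```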

What to include in observability

After the pilot, looking only at the overall success rate is no longer enough. In production, every request matters: how long it waited, how much it cost, which model it used, and how it ended for the user.

Observability starts not with a pretty dashboard, but with a solid event log. If a request took 12 seconds and then returned an empty response, the team should know that within minutes, not after customer complaints.

What to record for each request

The minimum set of fields should be the same across all services; a minimal sketch of such a record follows the list:

  • start time and total latency until the user receives a response
  • model, provider, prompt version, and task type
  • response status: success, timeout, error, empty text, broken JSON
  • input and output token counts and actual cost
  • request_id, so the full chain from API to screen or CRM can be assembled
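A rough Python sketch of that record; the field names and values here are illustrative, not a required schema:

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass

@dataclass
class LLMRequestRecord:
    """One log entry per model call; field names are illustrative, not a fixed schema."""
    request_id: str
    task_type: str
    model: str
    provider: str
    prompt_version: str
    started_at: float
    latency_ms: int
    status: str            # "ok" | "timeout" | "error" | "empty" | "bad_json"
    input_tokens: int
    output_tokens: int
    cost_usd: float

record = LLMRequestRecord(
    request_id=str(uuid.uuid4()),
    task_type="support_reply",
    model="cheap-chat",
    provider="primary",
    prompt_version="support_reply_v3",
    started_at=time.time(),
    latency_ms=2140,
    status="ok",
    input_tokens=1850,
    output_tokens=240,
    cost_usd=0.0031,
)
print(json.dumps(asdict(record)))  # ship to the log pipeline as one structured event
```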

This data is only useful when it is connected. If you have end-to-end tracing for the whole request path, you can quickly see exactly where the issue occurred: routing, the provider, post-processing, or the client app.

Also watch for silent failures. For LLMs, this is normal: the service returned status 200, but inside there is an empty string, broken JSON, or text in the wrong format. Formally the request succeeded; in practice the task was not completed. It is better to count such cases as a separate error type, otherwise the metrics will look better than reality.
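A small sketch of such a check before a response is counted as a success; the status labels match the log record sketch above and are assumptions, not a standard:

```python
import json

def classify_response(raw_text: str, expect_json: bool = False) -> str:
    """Map a '200 OK' model response to an outcome the metrics can actually trust."""
    if not raw_text or not raw_text.strip():
        return "empty"          # HTTP success, but nothing useful came back
    if expect_json:
        try:
            json.loads(raw_text)
        except json.JSONDecodeError:
            return "bad_json"   # formally a success, practically a failed task
    return "ok"
```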

Cost should also be tracked at the request level, not only at the end of the month. That way you can see which tasks suddenly became more expensive: for example, summarization is using a long context, or field extraction is running on a model that is too expensive. Then compare actual spend against the plan by day, team, and scenario.
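Once token counts are in the log, per-request cost is simple arithmetic. In this sketch the prices are placeholders and must be replaced with your provider's actual rates:

```python
# Placeholder prices per 1,000 tokens; real rates differ by provider and model.
PRICES_PER_1K = {
    "cheap-chat": {"input": 0.0005, "output": 0.0015},
    "strong-chat": {"input": 0.01, "output": 0.03},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Actual cost of one call, computed from logged token counts."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: a summarization call that quietly grew a long context.
print(round(request_cost("strong-chat", input_tokens=18_000, output_tokens=600), 4))  # 0.198
```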

And do not store everything. Keep masked inputs, operational tags, and the audit trail, but remove PII where it is not needed for analysis. Otherwise observability quickly turns from a tool into an unnecessary risk.

How to separate access and environments


As long as the pilot runs on one shared key, the team moves quickly. In production, this approach breaks first. A developer should not accidentally burn the working service limit, and the test environment should not send real data to the same place where customers operate.

At a minimum, you need to separate three environments: development, test, and production. Give each environment its own keys, limits, models, and logs. Then a failure in test will not affect live traffic, and experiments with a new prompt will not ruin production metrics.

Access is better granted by role. An analyst can view logs but not change routing. A developer can use test keys but not touch production limits. A few “just in case” administrators quickly become a problem: later it is hard to tell who changed the model on Friday evening and why.

A useful minimum usually looks like this:

  • separate keys for each environment and each service
  • different limits for test and production
  • log access separate from settings access
  • a clear list of people who can change routes and quotas

Personal data is better masked before it is sent to the model, not after the response. If a support agent pastes a phone number, IIN, or address into a prompt, the system should hide those fields on input. That lowers the risk of leakage and makes internal security review easier.
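A minimal masking sketch for the most obvious fields. The patterns are deliberately simplified assumptions (an IIN, for example, is treated as any standalone 12-digit run), and real masking needs a fuller catalogue and review:

```python
import re

# Simplified patterns: enough for a sketch, not a complete PII catalogue.
PATTERNS = {
    "iin": re.compile(r"\b\d{12}\b"),
    "phone": re.compile(r"\+?\d[\d\s\-()]{9,14}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def mask_pii(text: str) -> str:
    """Replace obvious PII with placeholders before the text is sent to the model."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text

print(mask_pii("Client 880101300123 asks to call +7 701 123 45 67"))
# Client [IIN] asks to call [PHONE]
```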

Audit logs are needed not only for requests themselves, but also for setting changes. It helps to see which key called the model, which route was used, where costs increased, who raised a limit, and who switched the provider. Without this, every incident quickly turns into a memory-based debate.

How to choose models for different tasks

A general test of “which model is better” almost always gives a false answer. For production, it is more useful to collect 3–5 typical tasks that really go through the system every day. Otherwise, the model is chosen based on a nice demo, not real workload.

Usually, a set like this is enough:

  • a short reply in support chat
  • field extraction from a document
  • classification of a request by topic or risk
  • generating an email or summary
  • a task with long context and several rules

For each task, take the same set of examples and compare models on three things: answer quality, latency, and cost per useful result. Looking only at price is dangerous. A cheap model may make more mistakes, and then the team simply pays for manual review. Looking only at quality is also a problem: if the answer takes 8 seconds, the user may not wait.

So models are better assigned by role. One handles fast and cheap requests, another takes complex cases, and a third remains the backup. The rule should be explicit. For example, short FAQs go to the budget model, while contracts and high-risk requests go straight to a stronger one.

Mixing expensive and cheap models without rules is not a good idea. Costs spread quickly, and the system’s behavior becomes unpredictable. If the budget model fails, you need a clear threshold for switching: low confidence, long context, an important customer segment, or a format error.
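Written down as code, such a rule can look roughly like this; the thresholds, signals, and model names are assumptions each team sets for itself:

```python
# Placeholder thresholds and model names: tune them for your own traffic.
LONG_CONTEXT_TOKENS = 6_000
MIN_CONFIDENCE = 0.7

def choose_model(task_type: str, input_tokens: int, classifier_confidence: float,
                 high_risk: bool) -> str:
    """Pick a model by explicit rule instead of sending everything to the strongest one."""
    if high_risk or task_type in {"contract_review", "dispute"}:
        return "strong-chat"          # important cases skip the budget tier entirely
    if input_tokens > LONG_CONTEXT_TOKENS:
        return "strong-chat"          # long context: the cheap model tends to lose details
    if classifier_confidence < MIN_CONFIDENCE:
        return "strong-chat"          # unsure classification: escalate rather than guess
    return "cheap-chat"               # default path for short, routine traffic
```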

A backup option is needed from day one. The provider may suddenly increase latency, hit a limit, or become temporarily unavailable. If the application works through one OpenAI-compatible gateway, such as AI Router on airouter.kz, the primary and backup model are easier to keep under the same API. The route can be changed separately from the code, and the application does not notice.
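From the application side this can look like the sketch below, assuming any OpenAI-compatible endpoint. The base URL, environment variable, and route name are placeholders for the illustration, not AI Router's documented values:

```python
import os

from openai import OpenAI

# Placeholder base URL and key: substitute the gateway endpoint and key you actually use.
client = OpenAI(
    base_url="https://gateway.example/v1",
    api_key=os.environ["GATEWAY_API_KEY"],
)

# The application only knows one OpenAI-compatible API and one logical route name.
# Which provider serves it, and which backup takes over on failures, lives in the
# gateway configuration and can change without touching this code.
response = client.chat.completions.create(
    model="support-replies",  # logical route name, not a hard-coded provider model
    messages=[{"role": "user", "content": "How do I reissue a card?"}],
)
print(response.choices[0].message.content)
```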

The model set should not be fixed forever. New releases quickly change the balance of price and quality. Once a month, it is useful to rerun your typical tasks and check whether a faster or cheaper option has appeared without losing quality.

Example of a working scenario


Imagine a bank chatbot that already responds to customers in the contact center. During the day, load rises sharply: people ask about card reissues, fees, application status, and branch hours. Most of these conversations are short and repetitive. If you send all traffic to the strongest model, replies become more expensive and often slower without much benefit.

The working route here is fairly simple. First, the system identifies the type of request: a routine question or a complex case. Simple topics go to a fast and inexpensive model. It handles FAQs, short instructions, and repeated clarifications. Complex requests, such as disputed charges, fraud signals, or long conversations with large context, go to a stronger model. If the primary provider becomes slow or unavailable, traffic is immediately moved to the backup.

This setup works especially well during peak hours. The customer does not need to wait for a powerful model where a short, accurate answer is enough. And the team does not pay for an expensive call on every question about a fee or branch.

It is better to keep this logic outside the application itself. Then developers do not need to change code every time they need to swap a model, add a backup, or adjust the routing rule. After that, all that remains is to watch a few working metrics: average latency, the share of switches to the stronger model, the number of provider errors, and cases where the bot is unsure and hands the conversation over to an operator. If the share of complex routes suddenly rises, it is worth checking the prompt, classification, and knowledge base.

Mistakes that slow down launch

After the pilot, it is usually not the idea that breaks, but the operational part. In a demo, you can live with manual checks and one lucky configuration. In production, that approach quickly leads to overspending, strange answers, and debates about who changed what.

One of the most common mistakes is sending all requests through one model. That is convenient only at the start. In practice, a short chat reply, field extraction from a document, and long-text summarization require different price, speed, and quality. If one model is used for everything, the team either overpays or loses quality where it really matters.

The second mistake is failing to track tokens separately by user, feature, and channel. Then expenses are visible only as one total line at the end of the month. As a result, nobody knows what is consuming the budget: a test scenario, one large customer, or a new feature that was enabled without a limit.

Teams also often oversimplify the metrics picture. Average latency almost always looks acceptable. But users feel the long tail, not the average. If 90% of requests finish in 2 seconds and the rest wait 18, complaints will still come. So look at least at P95, P99, timeouts, and error share for each scenario.
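Once latencies are logged per request, the tail is easy to compute; a sketch using only the standard library:

```python
import statistics

# Latencies in seconds for one scenario, taken from the request log.
latencies = [2.1, 1.8, 1.9, 2.4, 2.0, 2.2, 1.7, 2.3, 2.0, 17.9]

cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
print(f"avg = {statistics.mean(latencies):.1f}s")
print(f"p95 = {cuts[94]:.1f}s")
print(f"p99 = {cuts[98]:.1f}s")
# The tail values are what users feel during a bad minute, not the average.
```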

A shared API key for the whole team also slows down launch. With it, you cannot properly separate access and audit. If someone accidentally drops the limit or the key leaks, everything stops at once. It is much safer to issue separate keys for the service, environment, and team.

And one more expensive habit is changing the prompt directly in production without versioning and logging. On Monday the answers are good, on Tuesday they are worse, and by Wednesday nobody can explain why. You need simple rules: prompt version, author and change date, a short reason, and an easy rollback path. Otherwise even good logs will not help you find the source of the drop quickly.
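Versioning does not require special tooling to start. Even a plain record kept next to the prompt text covers the rule above, as in this sketch with illustrative field values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    """Minimal metadata kept with every prompt change so a drop can be traced and rolled back."""
    name: str
    version: str
    author: str
    changed_at: str        # ISO date of the change
    reason: str
    previous_version: str  # what to roll back to if quality drops

SUPPORT_REPLY_PROMPT = PromptVersion(
    name="support_reply",
    version="v4",
    author="a.ivanova",
    changed_at="2026-04-20",
    reason="shorter answers for chat, fewer apologies",
    previous_version="v3",
)
```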

Short pre-launch checklist


Before release, it is usually the small settings around the model, not the model itself, that cause trouble. It is worth checking not only answer quality, but also how the system behaves under load, errors, and ambiguous requests.

First, check the limits. The team should know how many requests per minute the service can handle, what token ceiling you give per user or feature, and where the monthly budget is set. Otherwise one successful integration quickly turns into a bill nobody expected.

Then look at the logs. Each request record should show the model, prompt version, response time, and error code. Without that, it is hard to tell what exactly broke: a new release, a routing change, a provider timeout, or a prompt that is too long.

Before launch, it helps to go through a short list:

  • production, test, and development have different keys and separate access rights
  • the product team cannot accidentally change live limits, and developers do not see unnecessary data
  • a backup model has been selected for the main task, and the switching rule has been written in advance
  • the incident on-call person is named, not chosen by “whoever is online”
  • the team knows where to check logs, limits, and error status

A backup model is needed almost all the time. If the primary model takes more than 15 seconds, returns too many 5xx errors, or exceeds the cost threshold, the system should switch according to a clear rule, not according to an engineer’s mood in chat.
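As a rule in code it can look roughly like this; the window size and thresholds combine the examples above with assumptions, not universal values:

```python
from collections import deque

# Rolling window of recent primary-model calls: (latency_seconds, status_code, cost_usd).
recent_calls: deque[tuple[float, int, float]] = deque(maxlen=100)

def should_switch_to_backup(cost_ceiling_usd: float = 0.05) -> bool:
    """Decide on backup routing from recent calls instead of a gut feeling in chat."""
    if not recent_calls:
        return False
    slow = sum(1 for latency, _, _ in recent_calls if latency > 15)
    server_errors = sum(1 for _, status, _ in recent_calls if status >= 500)
    avg_cost = sum(cost for _, _, cost in recent_calls) / len(recent_calls)
    return (
        slow / len(recent_calls) > 0.10              # too many answers slower than 15 s
        or server_errors / len(recent_calls) > 0.05  # too many 5xx from the provider
        or avg_cost > cost_ceiling_usd               # the primary route costs more than planned
    )
```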

The final check is simple: ask someone outside the team to go through the full user journey. One such run often finds what is invisible in staging: an extra permission, an empty log, the wrong rate limit, or a scenario where the backup model returns a different format.

What to do right after launch

After launch, the real work begins. The first few weeks usually reveal what the pilot missed: sharp traffic spikes, unusual user requests, extra costs, and answers that pass tests on paper but get in the way in real use.

It is better to look not at one blended average, but at short slices. Usually three reviews are enough:

  • after 7 days, check errors, latency, token usage, and the share of escalations
  • after 14 days, review real conversations where the answer was weak, too expensive, or too slow
  • after 30 days, rebuild routing rules, limits, and access roles based on actual behavior, not expectations

Also collect examples where the model performs below normal. Not “accuracy dropped” in general, but specific cases: it confused fields in a form, gave an answer that was too long for an agent, or missed a risky clause in a contract. These cases quickly show what needs to change: the prompt, the model, the backup threshold, or the route itself.

After two or three weeks, you often need to adjust the model set as well. One model may be fine for cheap, high-volume requests, while another is needed for document review or medical text. If one rule is used for everything, costs rise and quality becomes uneven.

For teams in Kazakhstan, organizational requirements also surface quickly after launch: storing data inside the country, masking PII, audit logs, key-level limits, and one OpenAI-compatible API for multiple providers. In such cases, a separate gateway like AI Router can reduce operational load and simplify access and route management.

At the end of the first month, put the basic rules into one document and keep it close to the team, not scattered across chats. It should list the approved models by task, limits by team and service, rights for logs and settings, the escalation process, and a clear definition of an incident. Without such a document, the system quickly accumulates exceptions, and in a month nobody remembers why the expensive model turns on for a simple request or who is supposed to fix the quality drop.

Frequently asked questions

Where should you start when moving to production after a pilot?

Start with one scenario where the value is visible in numbers. A good starting point is a draft reply for an agent, field extraction from a document, or simple classification, because it is easy to measure time, the share of manual edits, and the cost of one result.

Why do costs grow faster than expected after launch?

Look not only at the number of requests, but also at spikes, prompt length, and response length. Often the bill grows not because of new users, but because of long context, embedded emails, and verbose model responses.

What limits should be set first?

Set limits along several dimensions at once: requests per minute, tokens per day or month, and budget per team or feature. If a limit triggers, it is better to route traffic to a cheaper model or shorten the response than to return an error.

Which metrics are actually needed for LLMs in production?

Teams often look at the average and miss the long tail. Keep P95 or P99 latency, timeouts, cost per request, token counts, the share of empty responses, and cases where the model breaks the format, such as returning invalid JSON.

What should be logged for every request?

Log response time, model, provider, prompt version, result status, token counts, cost, and the request_id. Then you can quickly see whether the problem is in routing, the provider, post-processing, or the application itself.

Do you need one model for all tasks?

No, one model for everything almost always leads to extra costs or lower quality. For chat, classification, and data extraction, it is better to use different roles: a budget model for simple traffic, a stronger one for harder cases, and a backup for outages.

Why separate test, dev, and production?

Separate development, test, and production with different keys, limits, and logs. That way tests do not eat the production budget, and a random experiment does not damage metrics or customer responses.

When should PII be masked — before the model or after?

Mask personal data before sending it to the model. If a request includes an IIN, phone number, address, or other sensitive fields, the system should hide them on input and keep only what is truly needed in logs for incident review.

How can prompts be changed safely after launch?

Change the prompt only through versioning and a short change log. Otherwise, in a few days the team will not know why the answers got worse: a new prompt, a different model, or a provider issue.

What should be done in the first month after launch?

In the first week, check errors, latency, token usage, and the share of escalations to a human. By the end of the month, rebuild routing rules and limits based on real traffic, because after launch users almost always behave differently than they did in the pilot.