LLM Gateway Metrics in Production: A Short Daily Set
LLM gateway metrics help you see quality, latency, errors, and costs every day. Here is a short set of numbers for making production decisions.

Why a gateway quickly loses control without numbers
Problems in an LLM gateway rarely start with a major outage. More often, things go wrong quietly: one model’s latency grows, another’s quality drops, and a third starts throwing errors. Complaints come later, after the team has already lost time and money.
If you only look at user complaints, you almost always learn about the problem too late. One shared chart does not help much either. A gateway often carries dozens of models, several providers, and different scenarios: chat, summarization, search over a knowledge base, support. On average, everything looks fine, even though one provider is already returning 5xx errors and one model is responding 40% slower than usual. In a unified OpenAI-compatible endpoint, this happens all the time.
Metrics are not for reports. They are for everyday operational decisions. They help you quickly understand:
- where the problem started;
- what the user will notice first;
- what needs to be fixed now;
- when it is time to switch the route to another model or provider.
Without numbers, the team argues based on feelings. Some think quality has slipped. Others blame the network or the provider. A third group only sees the bill rising. All of them may be right at the same time, but without a simple set of indicators, you cannot tell.
Choosing the first step is even harder. If responses got worse, the model is not always the cause. Sometimes latency increased, and the application cuts off the answer because of a timeout. Sometimes routing moved traffic to a cheaper option, and quality fell together with the price. Sometimes quality is fine, but costs have already gone past the daily limit.
A good dashboard should suggest an action within a few minutes. In the morning, the team opens it and immediately sees where the provider slipped, where request cost increased, and where it is time to roll back a route.
Where to start with your metric set
A dashboard that is too broad is almost as unhelpful as having no dashboard at all. For a daily review, 8–10 numbers are enough. They should be enough for a two-minute check and a quick conclusion: everything is calm, or you need to dig deeper.
You should not look at one overall line. The data needs to be sliced along at least three axes: model, provider, and scenario. Otherwise, the average will hide the real problem. A support chat may be working fine, while document extraction is already losing quality and response time.
These are the metrics that usually belong on the first screen:
- number of requests per day and per hour;
- successful response rate;
- p95 latency;
- average time to first token;
- timeout rate;
- 4xx error rate;
- 5xx error rate;
- average cost per request;
- average cost per 1,000 tokens;
- share of requests with a quality complaint, manual or automatic.
This set is enough to see quality, latency, errors, and costs without too much noise. If the gateway routes traffic through multiple providers, the provider split quickly shows where timeouts increased and where the bill went up.
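As a hedged illustration, here is a minimal sketch of how such a first screen could be built from a flat request log with pandas, sliced by model, provider, and scenario. The column names (latency_ms, ttft_ms, status_code, timed_out, cost_usd, total_tokens) are assumptions for this example, not a required schema.

```python
# A minimal sketch: first-screen metrics from a flat request log.
# Column names below are assumptions for the example, not a fixed schema.
import pandas as pd

def first_screen(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily gateway metrics sliced by model, provider, and scenario."""
    grouped = df.groupby(["model", "provider", "scenario"])
    return grouped.apply(
        lambda g: pd.Series({
            "requests": len(g),
            "success_rate": (g["status_code"] < 400).mean(),
            "p95_latency_ms": g["latency_ms"].quantile(0.95),
            "avg_ttft_ms": g["ttft_ms"].mean(),
            "timeout_rate": g["timed_out"].mean(),
            "rate_4xx": g["status_code"].between(400, 499).mean(),
            "rate_5xx": (g["status_code"] >= 500).mean(),
            "avg_cost_per_request": g["cost_usd"].mean(),
            "cost_per_1k_tokens": 1000 * g["cost_usd"].sum() / g["total_tokens"].sum(),
        })
    )

# Usage sketch: df = pd.read_json("requests.jsonl", lines=True); print(first_screen(df))
```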
Test traffic and production traffic should be separated from the start. Otherwise, nightly runs, load tests, or the team’s work on new prompts will distort the picture. A simple rule works well: anything that does not affect users goes into a separate stream and a separate dashboard.
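If it helps, that rule can be enforced in one small function at logging time, so dashboards can filter before any aggregation. The header name and key prefix in this sketch are assumptions, not an established convention.

```python
# A minimal sketch of separating test and production traffic at logging time.
# The "X-Traffic-Env" header and the "test_" key prefix are assumptions for this example.
def traffic_env(headers: dict, api_key: str) -> str:
    """Label a request as 'test' or 'prod' so dashboards can filter on it."""
    if headers.get("X-Traffic-Env") == "test" or api_key.startswith("test_"):
        return "test"
    return "prod"

# At aggregation time, drop everything that does not affect users:
# prod_df = df[df["traffic_env"] == "prod"]
```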
Every metric should have a clear action. If p95 crosses the threshold, the team checks the provider and, if needed, shifts part of the traffic. If 5xx errors rise, they check whether it is a routing failure or a provider-side issue. If request cost rises, they check whether traffic moved to a more expensive model. If the share of successful responses drops in one scenario, they inspect the prompt and the input data.
If you do not define the action in advance, the metric is almost useless.
What counts as response quality
Quality is rarely visible through just one score. A polite response does not necessarily mean the answer helped the user or is even suitable for the product. For daily work, four signals are enough. They are more useful than a long list of rare indicators.
Minimum set
- Pass rate in a manual sample. Once a day or after a release, take 30–50 responses and mark whether each one answered the question properly, avoided making up facts, and followed the rules. If 32 out of 40 passed, that is 80%.
- Format failures. Count separately the cases where the model breaks JSON, skips a field, or violates the schema. For the product, this is often more painful than weak wording, because the code fails later or the scenario freezes.
- Share of repeated requests after failure. If the app retries, sends the request to a fallback model, or the user asks the same question again, quality has already slipped. This signal is good at catching a hidden problem when there are almost no obvious errors in the logs.
- A control set of tasks. Keep 20–50 typical requests and run them after changing the model, prompt, or route. Good examples are field extraction, summarization, classification, and knowledge-base answers.
It is better to read these indicators together. Manual review shows whether the response is useful to a person. Format and retries show whether the system can live with that response without extra workarounds.
A simple example: a support bot gives clear answers, and the manual review holds at 84%. But JSON breaks in 6% of cases, and repeated requests have risen from 9% to 17%. On the surface, the model looks decent, but in practice the team is already losing time on retries and failure analysis.
If you can implement only two checks, start with the manual sample and the broken-format rate. They quickly show whether the model answers the question properly and whether its output can be used in production.
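If those two checks live in code, they can stay very small. The sketch below assumes each reviewed response carries a reviewer verdict and each logged response carries the raw model output; the field names are illustrative.

```python
# A minimal sketch of the two starter quality checks:
# pass rate on a manual sample and the broken-format rate on structured output.
# Field names ("verdict", "raw_output") are assumptions for this example.
import json

def manual_pass_rate(sample: list[dict]) -> float:
    """Share of reviewed responses marked as passing (e.g. 32 of 40 -> 0.8)."""
    passed = sum(1 for r in sample if r["verdict"] == "pass")
    return passed / len(sample)

def format_failure_rate(responses: list[dict], required_fields: set[str]) -> float:
    """Share of responses that are not valid JSON or miss a required field."""
    failures = 0
    for r in responses:
        try:
            data = json.loads(r["raw_output"])
            if not required_fields.issubset(data):
                failures += 1
        except (json.JSONDecodeError, TypeError):
            failures += 1
    return failures / len(responses)
```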
What to watch for in latency
The average almost always lies. One fast response and one very slow response can still produce a “normal” average, even though the user is already frustrated. That is why it is better to keep p50 and p95 side by side.
p50 shows an ordinary day: how quickly the gateway responds to a typical request. p95 shows the tail: how long those who were less lucky waited. It is usually p95 that hurts the product, support, and SLA.
For chat, it is not enough to know only the total response time. The user first notices how quickly the first token arrives. If the answer starts typing after 700 ms and finishes after 8 seconds, that is one experience. If the first token arrives after 4 seconds, the same answer already feels frozen.
That is why it is better to track p50 and p95 for total response time, time to first token for conversational scenarios, full response time for long tasks, and the share of requests that hit a timeout.
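Time to first token is easiest to capture right where the response is streamed. A minimal sketch against an OpenAI-compatible endpoint with the openai Python SDK might look like this; the base URL, API key, and model name are placeholders.

```python
# A minimal sketch: measuring time to first token and total latency
# on a streaming call to an OpenAI-compatible endpoint.
# The base_url, api_key, and model name are placeholders, not real values.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")

def timed_completion(prompt: str, model: str = "your-model") -> dict:
    start = time.monotonic()
    ttft = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.monotonic() - start  # first visible token arrived
    return {"ttft_s": ttft, "total_s": time.monotonic() - start}
```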
One number for the whole system does not tell you much. Latency should be sliced at least by model, provider, and request type. Short chat, document extraction, and long report generation behave differently. If you mix them into one chart, the cause of the problem disappears.
It helps to look at these slices together. If p95 rises only for one model, the problem is often in the model or provider. If the increase appears across all models within one request type, look for a bottleneck in the application, queues, or network.
Timeouts should not be tracked separately from p95. When p95 creeps toward a 20- or 30-second limit, that is already a signal. If the team runs the same scenario through different models and providers in one gateway, this combination of metrics quickly shows where waiting is acceptable and where it is better to switch the route right away.
Which errors should be counted separately
A single overall error rate usually gets in the way more than it helps. If you mix 429s, 5xxs, network timeouts, and schema errors into one number, the team will see that failures are rising, but it will not know what to fix first.
Split errors into at least four groups:
- 429 — you have hit a limit, and a queue, another provider, or a more careful retry usually helps;
- 5xx — the problem is usually on the model or provider side; this is a direct signal to switch the route;
- network failures and timeouts — this is where you look for an unstable connection, DNS, a proxy, or timeouts that are too short;
- validation and schema errors — the request arrived, but your code or contract broke.
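One hedged way to keep this split consistent is to classify every failure at log time into exactly these groups. The signature and field handling below are assumptions for the sketch.

```python
# A minimal sketch: bucket every failure into one of the four groups
# (rate limit, provider-side, network/timeout, validation/schema).
# The signature and field handling are assumptions for this example.
def classify_error(status_code: int | None, exc: Exception | None = None,
                   schema_ok: bool = True) -> str:
    if not schema_ok:
        return "validation_schema"   # the request arrived, but the contract broke
    if exc is not None or status_code is None:
        return "network_timeout"     # no HTTP answer at all
    if status_code == 429:
        return "rate_limit"          # queue, another provider, or a gentler retry
    if status_code >= 500:
        return "provider_5xx"        # direct signal to switch the route
    return "other"
```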
Schema errors should be tracked separately even when the share is low. They are rarely widespread, but they hurt badly: JSON cannot be parsed, a required field is missing, or a tool call arrives in the wrong format. A retry usually does not help here.
You also need a separate line for stream interruptions and empty responses. Formally, the request may have finished without a 5xx, but for the product it is still an error. The user sees a frozen response or an empty window, while the regular success counter says everything went fine.
Another useful indicator is the share of requests that only became successful after a retry. If that grows, the gateway is still holding up, but the margin is already shrinking. For example, a success rate of 99.2% looks calm today. But if 7% of successful responses came only on the second or third attempt, an incident is already near.
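Once the gateway logs the number of attempts per request, that margin takes a few lines to compute; the field names in the sketch are assumptions.

```python
# A minimal sketch: share of successful responses that needed more than one attempt.
# Assumes each logged request carries "success" (bool) and "attempts" (int >= 1).
def retry_success_share(requests: list[dict]) -> float:
    successes = [r for r in requests if r["success"]]
    if not successes:
        return 0.0
    return sum(1 for r in successes if r["attempts"] > 1) / len(successes)

# Example from the text: 99.2% overall success can still hide 7% of answers
# that only arrived on the second or third attempt.
```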
Keep a request_id for every failure. Without it, the investigation quickly turns into guessing: it is hard to find the log entry and match it to the provider, model, status, and response time. With a request_id, an engineer understands in a couple of minutes where it broke: in the client, in the gateway, or at the external API.
How to calculate costs without confusion
The total daily amount says almost nothing. It grows because of traffic, bad routing, and long responses. That is why it is better to tie costs not to all requests in a row, but to the ones that produced a good result.
The first anchor is the cost of 1,000 successful requests. If 12,000 calls went through during the day, but some failed due to a timeout or error, divide the cost only by successful requests. That shows how much the working flow costs, not all the noise around it.
The second metric is closer to the product: the price of one useful response per scenario. For a support bot, a useful response is one after which the person does not ask again and does not call an operator. For internal search, it is a response that the employee opened and used. This number is often more useful than the average request cost.
Tokens are also better tracked separately. Input tokens show where the system prompt, chat history, or internal instructions have grown too large. Output tokens show where the model is answering too verbosely. When everything is merged into one sum, the cause of overspending is almost invisible.
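The same request log covers these cost anchors. In the sketch below, the per-token prices and field names are assumptions; real prices depend on the model and provider.

```python
# A minimal sketch of the cost anchors from this section.
# Prices and field names are assumptions for the example.
def cost_per_1k_successful(requests: list[dict]) -> float:
    """Total spend divided by successful requests only, scaled to 1,000."""
    total_cost = sum(r["cost_usd"] for r in requests)
    successful = sum(1 for r in requests if r["success"])
    return 1000 * total_cost / successful if successful else 0.0

def token_cost_split(requests: list[dict],
                     price_in_per_1k: float, price_out_per_1k: float) -> dict:
    """Separate input and output spend so prompt bloat and verbosity stay visible."""
    tokens_in = sum(r["input_tokens"] for r in requests)
    tokens_out = sum(r["output_tokens"] for r in requests)
    return {
        "input_cost": tokens_in / 1000 * price_in_per_1k,
        "output_cost": tokens_out / 1000 * price_out_per_1k,
    }
```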
It is also useful to keep a few more numbers nearby: the share of requests that went to expensive models without a clear reason, the cost of a useful response for each scenario, and the cache hit rate for prompts, if caching is already enabled.
A small example: a FAQ bot usually handles simple questions on a cheaper model. If, over one week, 25% of such requests started going to a more expensive model and the cache hit rate dropped, the bill will rise even if traffic stays the same. The problem here is not request volume, but the routing rule or a changed prompt template.
These indicators are useful to look at alongside latency and errors. Then the team sees the specific reason for rising costs: long input, overly wordy output, a cache miss, or an unnecessary choice of an expensive model.
Your first dashboard in one day
Do not try to cover all traffic at once. Pick one or two scenarios with real load: a support bot, knowledge-base search, or an internal assistant for operators. If request volume is low, the charts will be too flat, and you will not see where the gateway is losing money or time.
To start, a simple log for each request is enough. Save the model, provider, selected route, input and output tokens, error code, and response time. If the team works through a single OpenAI-compatible gateway, these fields are easier to collect in one place than across several provider dashboards.
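A per-request record like the one in this sketch is usually enough to start. The exact field names are an assumption, but they mirror the list above plus the request_id and retry count mentioned earlier.

```python
# A minimal sketch of a per-request log record, written as one JSON line per call.
# Field names are an assumption; they mirror the fields listed above.
import json
import uuid
from dataclasses import asdict, dataclass

@dataclass
class RequestLog:
    request_id: str
    scenario: str
    model: str
    provider: str
    route: str
    input_tokens: int
    output_tokens: int
    status_code: int | None
    error_code: str | None
    latency_ms: float
    ttft_ms: float | None
    attempts: int
    traffic_env: str  # "prod" or "test", kept separate from the start

def write_log(record: RequestLog, path: str = "requests.jsonl") -> None:
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")

# Usage sketch:
# write_log(RequestLog(request_id=str(uuid.uuid4()), scenario="support_bot",
#                      model="your-model", provider="provider-a", route="default",
#                      input_tokens=512, output_tokens=180, status_code=200,
#                      error_code=None, latency_ms=2400.0, ttft_ms=650.0,
#                      attempts=1, traffic_env="prod"))
```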
The first screen should answer four questions:
- what is happening with quality, based on a simple signal such as a like, a dislike, the share of human escalations, or a repeated request;
- what p95 response time each scenario is holding;
- where errors are rising and which codes are most common;
- how much one successful response costs, including tokens and retries.
Do not make the thresholds too complicated. If p95 has risen by about 30% above the normal baseline, give a yellow signal. If errors go above 2–3% or the cost of a successful response jumps sharply, turn it red.
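Written out, the rule stays short. The 30% and 2–3% figures follow the text; the 1.5x cost jump is an assumed reading of "jumps sharply", and the function name is illustrative.

```python
# A minimal sketch of the yellow/red rule from this section.
# The 1.5x cost jump is an assumption; the other thresholds follow the text.
def dashboard_status(p95_ms: float, baseline_p95_ms: float,
                     error_rate: float, cost_per_success: float,
                     baseline_cost: float) -> str:
    if error_rate > 0.03 or cost_per_success > 1.5 * baseline_cost:
        return "red"     # errors above the 2-3% band or a sharp cost jump
    if p95_ms > 1.3 * baseline_p95_ms:
        return "yellow"  # p95 roughly 30% above the normal baseline
    return "green"
```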
This kind of dashboard already helps in daily work. It quickly shows when one provider has slowed down, when a route starts failing more often due to timeouts, and when a new model eats budget without visible benefit.
Check it in the morning and after every release. The morning review takes only a few minutes, but it often catches overnight failures. The post-release check is needed for the same reason: the team changes a prompt or a route, and p95, errors, and response cost drift within the first hour.
Example: a support bot after launch
After the release, the support bot handled simple questions on its own: order status, plan changes, refunds, password resets. If a question looked complex, the gateway sent it to a stronger model. In the overall summary, everything looked calm: average response time barely changed, and the error rate stayed low.
The problem surfaced when the team broke the data down by response length. Short responses stayed almost the same, but p95 for long responses increased noticeably. A user asking about a refund or contract terms was now waiting not 4–5 seconds, but 11–13. The average almost hid this, because there were more short conversations.
The error picture was uneven too. Code 429 came mostly from one provider and almost always during peak hours. Formally, requests were not failing at scale, but some of them were going through retries, the queue was growing, and long responses were slowing down even more.
Costs showed the same pattern. The cost of a successful response rose not because the models got more expensive, but because the bot was calling the expensive model too often. The escalation threshold had been set too cautiously: even ordinary questions like "where is my order" or "how do I change my email" were often sent the wrong way.
A short manual sample of several dozen conversations revealed another issue. The response format was breaking more often than the logs suggested. HTTP 200 was returned, tokens were billed, but JSON sometimes lost a field, and the template for the operator response drifted. For the business, that is an error, even if technical monitoring marked the call as successful.
After that, the team fixed four things: they started tracking p95 separately for long responses, split 429 by provider and by hour, added the cost of a successful response instead of only the request cost, and introduced a schema check after generation.
That is how useful metrics work in production. They do not vaguely say that "something got worse." They show exactly where seconds, money, and quality are being lost.
Where teams make mistakes most often
Most often, teams spoil the picture themselves. They collect many numbers, but look at them in a way that makes it impossible to make a decision. Metrics should help you choose a model, a route, and limits for today, not live in a pretty report.
The first mistake is living by average response time. The average almost always gives false comfort. If 80% of requests finish in 2 seconds and the rest hang for 15, the user will remember the freeze.
The second mistake is throwing all models and all providers into one chart. One route may have slowed down, another may still be stable, and the combined chart will show "almost fine."
The third mistake hits the budget. The team counts costs only by tokens and celebrates a cheap model. Then that model fails more often, the operator has to repeat more work, and the bot sends more conversations to a human. On paper, tokens are cheaper. In reality, one resolved case costs more.
Another common miss is mixing test traffic and production traffic. Load tests, manual checks, and prompt experiments can easily ruin the latency and error picture. Test traffic should have its own tag, or better yet, a separate key.
After changing routing, teams often look at the new chart and draw conclusions from memory. Memory lies. You need two comparison windows: before and after, with the same traffic type.
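A hedged sketch of that comparison: take the same scenario in a window before the change and an equal window after it, and put the key numbers side by side. Column names follow the earlier log sketch and are assumptions.

```python
# A minimal sketch: compare the same traffic type before and after a routing change.
# Assumes a pandas DataFrame with "timestamp", "scenario", "latency_ms",
# "status_code", and "cost_usd" columns, as in the earlier log sketch.
import pandas as pd

def before_after(df: pd.DataFrame, scenario: str,
                 change_at: pd.Timestamp, window: pd.Timedelta) -> pd.DataFrame:
    d = df[df["scenario"] == scenario]
    windows = {
        "before": d[(d["timestamp"] >= change_at - window) & (d["timestamp"] < change_at)],
        "after":  d[(d["timestamp"] >= change_at) & (d["timestamp"] < change_at + window)],
    }
    return pd.DataFrame({
        name: {
            "requests": len(w),
            "p95_latency_ms": w["latency_ms"].quantile(0.95),
            "error_rate": (w["status_code"] >= 400).mean(),
            "avg_cost": w["cost_usd"].mean(),
        }
        for name, w in windows.items()
    })
```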
If the dashboard shows one graph for all models, only the average without p95, costs without the successful-response rate, test and production traffic in one stream, or route-rule changes without a before-and-after comparison, the data can no longer be trusted.
A good dashboard argues with gut feeling. That is its purpose.
A short daily check
The daily check should not take half an hour. If the dashboard is built well, 5–7 minutes are enough for the team to understand whether the gateway is in normal mode or already drifting.
For such a check, five signals are enough.
- Look at p95 for the main scenarios, not average latency. If the support bot usually finishes in 4 seconds and today p95 rose to 7, there is already a problem.
- Keep 429, 5xx, and timeouts nearby. It is more useful to compare them not with an ideal, but with your usual range over the last few days.
- Check the cost of one useful response after every routing, prompt, or model change. If quality did not improve and the response became 20–30% more expensive, the change is rarely worth keeping.
- Open a small sample of live responses. Ten to fifteen is usually enough to spot broken JSON, extra text instead of the required format, or weaker answers after a silent model change.
- Check whether all traffic has landed on one provider or one model. If one route has taken almost all requests, you now have extra risk around outages, limits, and price.
It helps to keep one simple rule: if two signals break at the same time, the team does not wait for the evening call and immediately rolls back yesterday’s change or turns on the fallback route.
It sounds boring, but it works. One short review in the morning often saves the day better than a long postmortem after user complaints.
The team’s next step
A metric set works only when the team builds it into its normal workflow. Write down 8–12 indicators, give them simple names, and assign one person to check them every day. That person does not need to fix everything alone. Their job is simpler: notice the shift, start the investigation, and keep the problem from getting lost between releases.
It is better to calculate metrics at the gateway level, even if you have several models and providers. Otherwise, the team only sees pieces of the picture. One provider has growing latency, another has more errors, a third is more expensive, and nobody brings the product impact together in one place. First keep a shared view of quality, p95 latency, error share, and cost per 1,000 requests. After that, drill down into the model, route, or client.
Once a month, a short cleanup is useful. If a metric does not lead to a decision, remove it. If the team keeps arguing about the same issue without numbers, add a new metric and set a threshold that triggers action.
Usually, a simple rhythm is enough: one review owner checks the dashboard every day, the team looks only at deviations once a week, removes unnecessary metrics and adjusts thresholds once a month, and adds a new metric only for a specific question.
For teams in Kazakhstan and Central Asia, this is especially convenient when there are many models and billing needs to stay unified. In that setup, AI Router on airouter.kz helps bring routing, audit logs, and invoicing in tenge into one place. That makes it easier to compare providers, track costs, and investigate incidents without switching between different dashboards.
Frequently asked questions
What metrics should I use for the first dashboard?
Start with success rate, p95 latency, time to first token for chat, timeout rate, 4xx and 5xx errors, average request cost, and the cost of a successful response. That is enough to quickly understand in the morning where quality, speed, or costs have slipped.
Split the data by model, provider, and scenario right away. Otherwise, averages will hide the problem.
Why is average latency not enough?
The average almost always smooths out the unpleasant tail. If some requests are fast and others hang for 10–15 seconds, the average may still look fine even though users are already waiting too long.
At minimum, look at p50 and p95. For chat, keep time to first token separate, because that is what people notice first.
How do I slice the metrics without hiding the problem?
You need at least these three cuts: model, provider, and scenario. In most cases, they quickly show whether the issue is in one specific route or in the whole application.
If your traffic varies a lot by request type and length, add another cut, such as short chat versus long generation. That makes the root cause easier to find.
How can I tell whether response quality really dropped?
Do not try to reduce quality to one score. For day-to-day work, a manual sample of responses, the share of broken JSON or schema failures, retries after failure, and a small control set of tasks are enough.
If a response sounds polite but breaks the format or makes the user ask again, quality has already dropped. That signal matters more than pretty wording.
Which errors should be tracked separately?
Do not lump everything into one error rate. Keep 429, 5xx, network timeouts, and schema or validation errors separate.
That way, the team immediately knows what to do. With 429, a queue or another provider often helps; with 5xx, switch the route; with a schema error, fix the contract or the prompt.
How do I avoid mixing test and production traffic?
Tag test traffic separately or send it through a different key. Then nightly runs, manual checks, and prompt experiments will not distort your production metrics.
If you mix test and prod, you will get false spikes in latency, errors, and cost. After that, it is hard to trust the dashboard.
How do I track costs without confusion?
The total daily sum explains very little. Calculate the cost of a successful response and the cost of 1,000 successful requests, not all calls combined.
It also helps to track input and output tokens separately. Then you can see whether costs are rising because of a long prompt, overly verbose output, or routing that sends more traffic to an expensive model.
When should I switch traffic to another model or provider?
Switch routes when p95 clearly moves beyond the normal range, 5xx or timeouts rise, and quality or cost no longer fit your threshold. Do not wait for a flood of complaints.
A simple rule works well: if two signals break at once, such as speed and errors, turn on the fallback route or roll back yesterday’s change.
What should be written to each request log?
Store request_id, model, provider, selected route, input and output tokens, error code, response time, and retry flag. That is already enough to investigate an incident quickly.
If you use streaming or a strict response schema, also add stream status and the result of the format check. Otherwise, some real failures will stay hidden.
How often should the dashboard be reviewed, and who should own it?
A short check in the morning and after each release is usually enough. If the dashboard is built well, it takes only 5–7 minutes to review.
Assign one owner for the review. That person checks deviations every day, starts the investigation, and keeps the issue from getting lost between releases.