Jan 25, 2025 · 8 min read

SLOs for LLM Applications: How to Measure Against Business Goals

SLOs for LLM applications tie latency, the share of valid responses, and cost to business expectations rather than to charts built for reporting.

Where Metrics Start to Lie

The team looks at the dashboard and sees green: average latency is 1.8 seconds, request cost is low, and errors are rare. But users do not need a pretty chart. They need an answer that lets them finish the task without a second request, manual checking, or a trip to an operator.

The first trap is simple: the team measures what is convenient for the system, not what matters to the service. A response-time chart does not show whether the bot understood the question, returned the right format, or sent the person in circles. If 92 out of 100 answers arrive quickly, but 8 break the payment or application flow, the service is already hurting the business even if the dashboard looks calm.

Most often, average latency is the liar. One response in 200 ms and one timeout in 20 seconds produce an average that sounds acceptable. For a live service, the tail matters more: how many requests finish within 2 seconds, how many time out, and how many need a retry. Users remember not the average, but the moment the chat froze and they had to type everything again.

Cost can be misleading too. A cheap model on a single call may end up more expensive overall if it more often asks for clarification, mixes up fields in JSON, or returns text that cannot be sent to the customer without editing. Then one "cheap" request turns into two or three, and manual work appears right after that. In support, you notice it immediately: the operator spends not 30 seconds checking, but 3 minutes saving the conversation.

When latency, valid response rate, and cost live separately, the team argues instead of fixing anything. Engineers say the system is fast. Finance sees a normal token spend. The business sees fewer completed flows. SLOs for LLM applications only start working when those three metrics are tied together.

Even if the team uses an API gateway like AI Router, the problem does not disappear. One OpenAI-compatible endpoint makes it easier to compare models, routes, and providers, but it still does not answer the main question: did the user solve the task on the first try or not?

What the Business Expects from a Model Response

Businesses rarely need the model response itself. They need the result that lets a person finish a clear action: find the right clause in a contract, draft an email, fill in a support case, or quickly check text for risk. If nobody names that action, the metrics quickly lose meaning.

So start with a simple question: why does a person open this screen or send this request in the first place? The answer should describe a concrete job, not a vague benefit. Not "helps employees," but "cuts incoming email review to two minutes" or "provides a first draft reply that a manager edits in 30 seconds."

Then define how long the person is willing to wait. For chat and in-product suggestions, patience is short: after a few seconds, the user starts losing the thread. For overnight document processing, time matters less. The same response can be excellent for a background task and useless in a conversation if it arrives too late.

Next, you need a simple line between "good enough" and "not good enough." Do not look for a perfect answer. Look for an answer after which the person does not have to start over. Usually the debate ends once the team writes down 3–4 signs of a normal result: correct format, no invented facts, required fields, and meaning that matches the request.

Cost is another frequent source of confusion. The cost of one call tells the business almost nothing. If the model is cheap but every fifth request needs a retry, manual correction, or a more expensive fallback, the task already costs noticeably more.

That is why you should measure the cost of a completed task. Include retries, human checks, backup routes to another model, and employee time. This matters especially if the team routes requests through several models or a single gateway and can switch providers easily.

A good expectation sounds grounded: the system prepares a draft answer based on the document in 5 seconds, in 90% of cases a lawyer makes only small edits, and the average cost of one completed review stays within budget. With that kind of description, a pretty latency chart can no longer stand in for the business goal.

Which Metrics Belong in the SLO

For an LLM service, three anchors are usually enough: the latency tail, the share of valid responses, and the cost of a successful flow. The other numbers are useful for debugging, but they rarely help the business understand whether the service is doing its job.

For latency, look at p95 or p99, not the average. The average almost always looks fine, even when a noticeable share of users waits too long. If the chat answers in 2 seconds on average, but every twentieth request hangs for 12 seconds, people will remember exactly that.
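
To make the difference concrete, here is a minimal sketch that computes both numbers over the same set of latencies. All values are illustrative.

```python
# A minimal sketch: the same latencies, one calm average and one honest tail.
import math
import statistics

latencies_s = [0.8, 1.1, 1.3, 0.9, 1.0, 1.2, 0.7, 1.4, 9.5, 12.0]  # two slow outliers

average = statistics.mean(latencies_s)
p95 = sorted(latencies_s)[math.ceil(0.95 * len(latencies_s)) - 1]  # nearest-rank p95

print(f"average: {average:.1f} s")  # ~3.0 s, looks tolerable on a dashboard
print(f"p95:     {p95:.1f} s")      # 12.0 s, what every twentieth user actually waits
```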

The share of valid responses also cannot be measured by "feel." First, define explicit rules: on topic, correct format, no invented facts, no forbidden content, and required fields filled in if it is JSON or a form. Then the quality debate turns into a checklist, not an exchange of opinions.
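
That checklist can live in code as well, so a script and two reviewers apply the same rules. The sketch below assumes each response comes with per-criterion flags filled in by automated checks or a reviewer; the field names are assumptions, not a fixed standard.

```python
# A minimal sketch of the checklist above, applied to one scored response.
def is_valid(scored: dict) -> bool:
    return all([
        scored.get("on_topic", False),               # answers the actual request
        scored.get("format_ok", False),              # expected structure or schema
        not scored.get("invented_facts", True),      # nothing unsupported by the source
        not scored.get("forbidden_content", True),   # policy rules respected
        scored.get("required_fields_filled", False), # mandatory fields present
    ])
```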

Cost should be counted not per token, but per successful scenario. The calculation includes retries, fallback to another model, tool calls, filtering, database search, and any repeated generation. If the first attempt failed and the second one worked, the business paid not for one answer, but for the entire path to a working result.
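
A minimal sketch of that accounting, assuming each task keeps the list of every call made on the way to its result (retries, fallbacks, tool calls). The records and prices are illustrative.

```python
# A minimal sketch: spend on failed attempts still counts toward the successful ones.
def task_cost(calls: list[dict]) -> float:
    """Sum every call made on the way to the result: retries, fallbacks, tools."""
    return sum(c["cost"] for c in calls)

def cost_per_successful_task(tasks: list[dict]) -> float:
    total_spend = sum(task_cost(t["calls"]) for t in tasks)
    successes = sum(1 for t in tasks if t["succeeded"])
    return total_spend / successes if successes else float("inf")

tasks = [
    {"succeeded": True,  "calls": [{"cost": 0.004}]},                   # first try worked
    {"succeeded": True,  "calls": [{"cost": 0.004}, {"cost": 0.012}]},  # retry on a pricier fallback
    {"succeeded": False, "calls": [{"cost": 0.004}, {"cost": 0.004}]},  # never produced a usable answer
]
print(f"{cost_per_successful_task(tasks):.4f} per successful task")  # 0.0140, not 0.0040
```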

Do not mix different task types

Chat, search, classification, and agent workflows have different expectations, so one shared SLO for all of them is almost useless. For chat, p95 latency and the share of responses without meaning errors are usually most important. For retrieval with generation, you also need source citation accuracy and the number of empty responses. For classification, class accuracy and stable output format matter more. For agent tasks, you need to measure completion of the whole flow, not just the model's first response.

If you route requests through a shared gateway, track metrics separately by task type, model, and provider. Otherwise, one fast model will hide a slow one, and a cheap route will mask expensive retries. The dashboard will look nice, but it still will not explain what the business is getting for its money.

How to Build an SLO Step by Step

For LLM applications, it is better to build an SLO not from the dashboard, but from the tasks where the model response affects money, human workload, or the risk of a mistake. If you take all requests at once, you will end up with a nice average and very little value.

Start with 2–4 scenarios where the effect is visible in the team's work. Usually support chat replies, case triage, knowledge base search, and drafts for the operator are enough. More scenarios at the start only create confusion.

For each scenario, write down what counts as a normal result. Not "the model answered," but "the customer got a usable answer in the required time and at an acceptable cost." Then set three thresholds: response time, share of valid responses, and cost. Time is better measured by percentile, for example p95, rather than by average. Cost should be tied to a useful action: per conversation, per request, or per 1,000 successful answers.

Then collect a baseline during a week of live traffic. Tests and manual runs almost always look better than real load. During that week, look not only at averages, but also at drops by shift, channel, and request type.

Even before launch, agree on what breaks the SLO and what counts as an exception. If the provider is unavailable, the user sends a file that is too long, or the operator ends the conversation manually, the rules should be clear to everyone. Otherwise the argument will start in the middle of an incident.

Thresholds should be reviewed after any noticeable change: a new model, a new prompt, different routing, caching, or safety filters. Old numbers are rarely fair after that.

A small example. If a chatbot should save the first-line team's time, the goal can be written like this: p95 no higher than 6 seconds, valid responses at least 92%, and cost no higher than 18 tenge per successful dialog. The team can immediately see the trade-off. The model may answer faster but make more mistakes. Or it may answer better but consume the budget.
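
Written down as configuration, the same example might look like the sketch below. The thresholds are the numbers from the paragraph above; the measured values are made up for illustration.

```python
# A minimal sketch: encode the agreed thresholds and check a week of measurements.
SLO = {
    "p95_latency_s": 6.0,         # p95 no higher than 6 seconds
    "valid_share": 0.92,          # at least 92% valid responses
    "cost_per_dialog_tenge": 18,  # no more than 18 tenge per successful dialog
}

measured = {"p95_latency_s": 5.4, "valid_share": 0.90, "cost_per_dialog_tenge": 16.2}

breaches = []
if measured["p95_latency_s"] > SLO["p95_latency_s"]:
    breaches.append("latency tail")
if measured["valid_share"] < SLO["valid_share"]:
    breaches.append("valid response share")
if measured["cost_per_dialog_tenge"] > SLO["cost_per_dialog_tenge"]:
    breaches.append("cost per successful dialog")

print("SLO met" if not breaches else f"SLO broken on: {', '.join(breaches)}")
```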

If you switch models through AI Router or another compatible gateway, do not move the SLO over to the new route automatically. The same prompt on a different model often changes latency, the share of good responses, and cost. A week of testing on fresh traffic usually saves more time than fixing a poorly chosen threshold later.

How to Measure the Share of Valid Responses

If the model answers quickly but does not solve the task, the latency chart does not help. That is why the share of valid responses is not measured by general impression, but by a simple rule: a response counts only if it completed the task without critical errors.

The formula should be simple too: divide the number of valid responses by the number of responses in the checked sample. If you add canceled requests, test calls, and empty customer retries to the denominator, the metric will start to drift and argue with reality.
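
The same rule as a short sketch: filter out canceled requests and test calls first, then divide valid by scored.

```python
# A minimal sketch of the valid-share formula described above.
def valid_share(sample: list[dict]) -> float:
    scored = [r for r in sample if not r.get("canceled") and not r.get("test_call")]
    valid = sum(1 for r in scored if r["valid"])
    return valid / len(scored) if scored else 0.0
```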

First, build a short checklist. It is not for decoration, but so two people can score the same response the same way. Four checks are usually enough: correct format, facts not distorted, response complete enough for the task, and no forbidden rules broken.

For tasks with a strict structure, it is better to automate the check fully. If the model must return JSON, fields, types, required values, and response length can be checked in code. The same applies to classification, category selection, required warnings, and template-based work. Manual review often just wastes time here.
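
A minimal sketch of such an automated check for a JSON response. The required fields and the length limit are illustrative assumptions.

```python
# A minimal sketch: structural validity checked fully in code, no human needed.
import json

REQUIRED = {"order_status": str, "next_action": str}  # assumed schema
MAX_LEN = 2000

def check_structured_response(raw: str) -> tuple[bool, str]:
    if len(raw) > MAX_LEN:
        return False, "response too long"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "wrong format"
    for field, field_type in REQUIRED.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], field_type):
            return False, f"wrong type for: {field}"
    return True, "ok"
```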

Borderline cases are better handled with a manual sample. This works for expensive scenarios: customer replies, medical suggestions, internal assistants for lawyers, and contract summarization. You do not need to read everything. A small but honest sample is enough.

The "I liked it" rating almost always ruins the metric. The user may dislike the tone of the answer even though the task was solved. The opposite also happens: the text sounds confident, but it contains an error. For the SLO, the more important question is different: did the response complete the task according to the rules?

Mark the reason for failure separately. Otherwise you will see one number but not know what to fix first. In practice, it helps to keep several reasons: refusal without explanation, invented facts, empty response, wrong format, incomplete response. Then the team sees exactly what is hurting quality.

A simple example: a support bot must return the order status and the next action. If it replied politely but did not include the status, the response is invalid. If it invented the status, that is a different kind of failure. Both cases reduce the share of valid responses, but they are fixed in different ways.

A good metric does not argue with operators, product, or finance. It shows how many responses could really be used in the workflow.

Example for Customer Support

A bank operator does not need a perfect text. They need a draft that arrives quickly, does not invent extra details, and genuinely saves time. That is why SLOs for LLM applications in support are better measured by a usable answer, not by a pretty average latency.

Imagine a simple scenario. A customer writes in chat about a disputed card transaction. The model prepares a draft reply for the operator. If the draft takes longer than 4 seconds, the operator usually does not wait and writes the response themselves. Average latency is almost useless here: even if it is 1.8 seconds, the long tail breaks everything. For the team, the more important rule is this: in 95% of cases, the response must arrive in no more than 4 seconds.

Quality works the same way. What counts as valid is not a "generally good" text, but an answer without invented conditions and with the required fields. For example, the draft should include the reason for the reply, the next step for the customer, and the needed clarification about the case. If the model adds a nonexistent rule about the tariff or misses one of the required fields, that response is a failure.

There is also a stricter check for usefulness. The operator may slightly edit the wording, and that is fine. But if they change more than two fragments, the draft did not work. It took time instead of saving it.

In such a case, the SLO can be described with four rules: 95% of responses arrive in 4 seconds or less, a valid response does not contain invented conditions, it includes all required fields, and the cost calculation also includes a repeated request after an error or timeout.

Teams often underestimate the last point. If the service processed 1,000 cases and 60 of them needed a retry, you will pay for 1,060 calls. You should measure not the cost of one request, but the cost of one usable draft.

Suppose that out of 1,000 cases, 950 responses arrived on time. Of those, 890 passed the validity check. In another 80 cases, the operator had to edit more than two fragments. That leaves 810 usable drafts. If the system also made 60 retry requests after timeouts and errors, the business is looking at the cost of 810 useful results, not a report showing 1,000 "successful" requests.
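
The same funnel, written out as a short calculation. The counts come from the example above; the per-call price is an assumed placeholder.

```python
# A minimal sketch of the support funnel from the example above.
total_cases    = 1000
on_time        = 950                    # arrived within the 4-second threshold
valid          = 890                    # passed the validity check
heavily_edited = 80                     # operator changed more than two fragments
usable_drafts  = valid - heavily_edited # 810 drafts that actually saved time

retries    = 60
calls_paid = total_cases + retries      # 1,060 calls billed, not 1,000

price_per_call = 5.0                    # illustrative price per call, in tenge
cost_per_usable_draft = calls_paid * price_per_call / usable_drafts
print(f"{cost_per_usable_draft:.2f} per usable draft")  # ~6.54, not the 5.00 the dashboard shows
```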

This calculation quickly changes the conversation. A cheap model may lose if it makes more mistakes or answers more slowly. And a more expensive model may end up cheaper if it produces more usable responses without retries and extra editing.

Where Teams Go Wrong

Most often, a team takes what is easy to show on a chart and presents it as the real quality of the service. That is how SLOs for LLM applications quickly turn into a display case instead of a working agreement with the business.

The first mistake is to look at average latency. The average is almost always calming. If seven responses arrive in 2 seconds and three arrive in 18, the number still looks acceptable. For the user and the operator, that is already a bad experience: the conversation breaks, the person presses "send" again, and cost and error counts grow.

Money works the same way. The team counts input and output tokens, but not follow-up questions, retries, manual edits, and escalation to an operator. On the dashboard, a request may look cheap, while one real case costs two or three times more. If the model often writes a response that has to be fixed by hand, that is not an "almost good" result. It is extra work.

Another common mistake is changing the prompt and the model on the same day. After that, nobody knows what actually affected the share of valid responses: the new template, the different provider, or the new routing. One variable at a time sounds boring, but then you do not have to guess.

A demo set can also be deceptive. Curated examples are cleaner than live traffic. Real requests include typos, incomplete context, strange phrasing, and in Kazakhstan, often a mix of Russian, Kazakh, and internal company codes. On such data, the metric quickly drops from presentation level to reality.

Another mistake is hiding operator work outside the main metric. The model answered in 4 seconds, and formally everything looks fine. But if the operator then spends 90 seconds fixing the text, removing extras, and adding facts, the service performed badly.

In practice, it is more useful to keep four numbers in front of you, not one "pretty" one: p95 or p99 latency instead of the average, the share of valid responses on live traffic, the cost of one completed case instead of just token cost, and the share of responses the operator had to edit manually.

If those metrics move in the right direction, the service is helping the business. If only the average speed on the chart goes up while support is getting angry, you can close the chart.

Quick Checks Before Launch

A nice dashboard before release means almost nothing if the team has not decided one simple thing: what response counts as good enough for the business, and what response already harms the process. SLOs for LLM applications are checked not by average lines on a chart, but by what happens to a real scenario on a bad day.

First, break the product into separate scenarios. A support chat reply, auto-triage of a request, and generation of an internal summary are different tasks. Each should have an owner who can name thresholds for latency, valid response share, and acceptable result cost. If there is no owner, the quality debate will start after launch.

Logs should also be checked before production, not after the first incident. For each call, the team should see which model answered, which prompt version was sent in the request, and why the flow failed: model refusal, timeout, empty response, schema error, or manual operator edit. Without that, every drop becomes guesswork.
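
A minimal sketch of what such a log record could contain; the field names are assumptions, not a fixed schema.

```python
# A minimal sketch: one structured record per call, so drops stop being guesswork.
import json
import time

def log_llm_call(**fields) -> None:
    record = {
        "ts": time.time(),
        "model": fields.get("model"),                    # which model actually answered
        "provider": fields.get("provider"),
        "prompt_version": fields.get("prompt_version"),
        "scenario": fields.get("scenario"),              # chat reply, triage, summary, ...
        "latency_s": fields.get("latency_s"),
        "failure_reason": fields.get("failure_reason"),  # refusal, timeout, empty, schema error
        "manual_edit": fields.get("manual_edit"),        # did an operator rewrite the text
        "retry_of": fields.get("retry_of"),              # link retries and fallbacks to the first attempt
    }
    print(json.dumps(record))  # in production this would go to your log pipeline
```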

If you use AI Router, check in advance that these fields are preserved in logs and audit records. A single gateway does not solve observability by itself. It only makes route comparison easier if the team has already agreed on what counts as success.

The dashboard should also split outcomes into at least four groups: successful response without edits, refusal or block, timeout or technical error, and response that a person manually edited. If all of that is bundled into one "success" number, the team will see the problem too late. Manual edits often get hidden this way: formally there is an answer, but in reality an employee spent another 3 minutes and rewrote half the text.
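
As an illustration, that split can be a single function that assigns each call to exactly one of the four groups, so edited responses never hide inside "success". The flag names are assumptions.

```python
# A minimal sketch of splitting outcomes into the four groups named above.
from collections import Counter

def outcome(call: dict) -> str:
    if call.get("timeout") or call.get("technical_error"):
        return "timeout_or_error"
    if call.get("refused") or call.get("blocked"):
        return "refusal_or_block"
    if call.get("manually_edited"):
        return "edited_by_human"
    return "success_without_edits"

calls = [{"manually_edited": True}, {"timeout": True}, {}, {}]
print(Counter(outcome(c) for c in calls))
```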

The cost mistake is even simpler. Counting the cost of any call is convenient, but it is a weak metric for the business. Look at the cost of a successful result. If the model is cheap but every fifth answer needs rework, the total cost grows faster than it seems.

Before launch, it also helps to define a response plan. If latency rises, the team temporarily moves traffic to faster models. If the share of valid responses drops, it rolls back the prompt or schema. If the cost of a successful result rises, it revisits routing and limits. If manual edits have become more frequent, the team analyzes specific cases rather than the overall average score.

That set of checks takes a couple of hours. But later it becomes clear exactly where the service stopped doing its job.

What to Do Next

Take one scenario where a model error immediately hits revenue or the queue. Usually this is first-line support, request triage, or customer chat replies. If the model answers slowly, the operator waits. If it gives an invalid answer, the person redoes the work manually. In this situation, the SLO quickly stops being a nice diagram and becomes the simple economics of a shift.

A good first step looks like this: choose one stream of requests with a clear business outcome; measure the current latency, share of valid responses, and cost per request; do not change the model for two weeks so you can see the real baseline; and keep the same request types instead of a random mix.

Two weeks is usually enough to see load peaks, long latency tails, and typical failures. If you start comparing earlier, the team often mistakes noise for improvement. The model may have had a good day and then fail at the end of the month when the queue doubled.

After that, run the same request set through several options. Compare fairly: the same prompts, the same scoring rules, the same response format. Put three numbers into one table instead of a single meaningless average: p95 latency, share of valid responses, and cost per useful answer. The last metric often sobers everyone up better than any presentation.
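
A minimal sketch of building that table from per-request records; the route names and record fields are assumptions.

```python
# A minimal sketch: the same three numbers for every route, side by side.
import math
from collections import defaultdict

def summarize(records: list[dict]) -> dict:
    by_route: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        by_route[r["route"]].append(r)

    table = {}
    for route, rows in by_route.items():
        latencies = sorted(r["latency_s"] for r in rows)
        p95 = latencies[min(len(latencies) - 1, math.ceil(0.95 * len(latencies)) - 1)]
        valid = [r for r in rows if r["valid"]]
        spend = sum(r["cost"] for r in rows)
        table[route] = {
            "p95_latency_s": round(p95, 2),
            "valid_share": round(len(valid) / len(rows), 3),
            "cost_per_valid": round(spend / len(valid), 4) if valid else None,
        }
    return table
```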

It also makes sense to look not only at large providers, but also at open-weight models. Sometimes they lose on overall strength, but win on cost, latency, or data requirements. For teams in Kazakhstan, this comparison is convenient to do through AI Router: you can change only the base_url to api.airouter.kz and run the same requests through different models and providers without changing the SDK, code, or prompts. That makes the experiment easier, but the SLO thresholds still need to be checked on live traffic.
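
For illustration, the switch with the OpenAI Python SDK might look like the sketch below. The exact base path and model identifier are assumptions, so check them against the AI Router documentation for your account.

```python
# A minimal sketch: keep the SDK and prompts, change only where requests go.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.airouter.kz/v1",  # assumed path; only this line changes
    api_key="YOUR_AIROUTER_KEY",
)

resp = client.chat.completions.create(
    model="your-chosen-model",  # swap models here without touching code or prompts
    messages=[{"role": "user", "content": "Draft a reply about a disputed card transaction."}],
)
print(resp.choices[0].message.content)
```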

When the comparison is done, write the thresholds down. Product needs a clear quality level, engineering needs latency and error boundaries, and the business needs a cost limit. If these numbers are not signed off by all three sides, the model argument will return at the first incident.

Frequently asked questions

Why doesn't average latency show the real quality?

Because an average hides failures. One fast response and one timeout can produce a decent-looking number, but the user remembers the frozen chat.

For a live service, look at the latency tail: how many requests finish within the target threshold and how many go into a retry or timeout.

Which is better to track: p95 or p99?

For chat and suggestions, teams usually use p95. This metric shows how long almost all traffic waits, not the ideal average case.

If you have strict requirements or frequent complaints about rare freezes, add p99 as well. You can keep the average for background tracking, but not for the SLO.

How do you know whether a model response is valid?

First, define a simple checklist for the task. The response should stay on topic, return the right format, avoid inventing facts, and fill in the required fields.

If the response breaks even one of those rules, do not count it as valid, even if it sounds confident.

How do you count the cost of a successful scenario, not just tokens?

Measure the cost of the completed task, not the cost of a single call. Include retries, fallback to another model, human review, and employee time spent on edits.

Then a cheap model will stop looking attractive if it often makes mistakes and creates extra work.

Can you use one SLO for all LLM tasks at once?

No, you should not do that. Chat, search, classification, and agent workflows have different expectations for time, quality, and cost.

Split the metrics at least by task type. Otherwise a fast route will hide a slow one, and cheap cases will mask expensive retries.

Which scenarios should you start with for your SLO?

Use 2–4 scenarios where a mistake directly affects money, queue length, or manual work. In most cases, that means support chat, request triage, knowledge base search, and drafts for operators.

If you start with all traffic at once, you will get a nice average and still not know where the service is getting in the way.

When should SLO thresholds be updated?

Review the thresholds after any noticeable route change. A new model, a new prompt, caching, filters, or another provider quickly changes latency, quality, and cost.

A good working approach is not to copy the old numbers automatically, but to test them on fresh traffic for at least a week.

What should be stored in logs for the SLO?

For each call, log the model, provider, prompt version, scenario type, latency, reason for failure, and whether a human made manual edits. Without this, the team will only be guessing about the cause of the drop.

If there was a retry or fallback, keep that path too. Then you can see what the working result really cost.

When is manual review unavoidable?

Keep manual review for expensive and risky scenarios. That includes customer replies, medical suggestions, legal drafts, and document summaries.

Do not read everything. Take a fair sample regularly and mark the reason for the failure so you can fix a specific mistake, not some abstract "quality".

Does AI Router automatically solve the SLO problem?

No, not by itself. A single OpenAI-compatible endpoint makes it easier to compare models, providers, and routes, but it does not answer the question of whether the user solved the task on the first try.

AI Router is useful for experiments: you can change the base_url to api.airouter.kz and run the same traffic through different options without changing the SDK, code, or prompts. But you still need to test the thresholds on live traffic.