Runbook for the On-Call Engineer on an LLM Service: First 15 Minutes
A short runbook for the on-call engineer on an LLM service: how to check error spikes, cost, and latency in 15 minutes, prioritize the right issues, and avoid service disruption.

What usually breaks in the first 15 minutes
At the start of an incident, there is almost never one clean symptom. Errors are already rising, but the cause is still hiding in the noise: some requests fail with timeouts, some go through retries, and some reach a fallback model and change the service's behavior.
Because of that, charts often seem to contradict each other. One place shows a spike in 5xx, another shows growth in 429, and product complaints sound like completely different failures. In reality, the source may be one thing: the provider started responding more slowly, the queue grew, retries added load, and then limit and budget errors started to appear.
Cost can also rise without an obvious traffic spike. Often the service silently switches to a more expensive model, or fallback routes part of the requests to another provider. Sometimes it is simpler: the max_tokens limit stops being enforced, extra context gets into the prompt, and responses get longer. Another common cause is retries. One user request can be billed two or three times if the system is too aggressive in trying to get a response.
Latency is even more unpleasant. It grows even when incoming traffic barely changes. That happens if the provider holds requests longer, the worker pool gets clogged, prompt caching goes down, or a self-hosted model hits a GPU resource limit and starts building a queue. To the on-call engineer, it looks misleading: traffic is flat, but users are already saying the service is "thinking" for 20-30 seconds.
Complaints usually come in several forms at once:
- "Responses stopped coming"
- "Everything got slower"
- "Quality got much worse"
- "The bill is rising too fast"
For an LLM service, that is normal. One product can call several models, providers, and fallback routes at the same time. If requests go through a gateway like AI Router, part of the traffic may already be taking a different path while the rest stays on the old one. That is why users describe different symptoms at the same time.
A good runbook in the first 15 minutes helps you avoid looking for "the one error" and instead quickly choose between three possibilities: the route broke, generation got longer, or the system started spending more attempts on each request.
What to prepare before an incident
When errors, cost, or latency rise, the on-call engineer loses time not on analysis, but on finding basic information. You can remove that almost completely. Before an incident, collect a short data set and keep it in a note, on the dashboard, and in the shift chat.
First define thresholds. Not vague phrases like "something is slow," but numbers: what error rate counts as an incident, what p95 latency is normal for each important scenario, and where the cost limit per request, per thousand requests, or per session lies. If the service uses several models, keep the thresholds separate. A large external model, an open-weight model, and a rerank call do not have the same normal latency.
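If it helps, those thresholds can live next to the dashboard as a tiny config that an alerting script reads. A minimal sketch, assuming illustrative scenario names and numbers (they are placeholders for your own agreed limits, not recommendations):

```python
# thresholds.py - illustrative per-scenario thresholds; all names and numbers are assumptions.
THRESHOLDS = {
    "chat-large-external": {"error_rate": 0.02, "p95_s": 8.0, "cost_per_req": 0.020},
    "chat-open-weight":    {"error_rate": 0.02, "p95_s": 4.0, "cost_per_req": 0.004},
    "rerank":              {"error_rate": 0.01, "p95_s": 0.5, "cost_per_req": 0.0005},
}

def out_of_range(scenario: str, error_rate: float, p95_s: float, cost_per_req: float) -> list[str]:
    """Return the metrics that exceed the agreed thresholds for one scenario."""
    limits = THRESHOLDS[scenario]
    observed = {"error_rate": error_rate, "p95_s": p95_s, "cost_per_req": cost_per_req}
    return [name for name, value in observed.items() if value > limits[name]]

# Example: declare an incident only when more than one signal is out of range.
print(out_of_range("chat-large-external", error_rate=0.06, p95_s=11.0, cost_per_req=0.018))
```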
Next, prepare a route map. The on-call engineer should have a simple table: which model is in production, which provider it goes through, what fallback path exists, and what changes when traffic is switched. If the team works through AI Router, it is useful to note right away which scenarios can be moved to another provider through the same api.airouter.kz endpoint, and which can be sent to a model hosted inside the country when data residency and low latency matter.
The third block is a log of recent changes. The source of the problem is very often simple: a new prompt, higher max_tokens, an enabled flag, a changed default model, a fresh client release, or a new rate limit. The on-call engineer should not have to remember that from memory. You need a short list of changes for at least the last 72 hours, with time, author, and expected effect.
In one place, keep:
- thresholds for errors, latency, and cost
- current models, providers, and fallback routes
- recent releases, flags, and prompt edits
- contacts for the service owner and the person who accepts business risk
The last point is often underestimated. You do not just need a list of names, but a clear calling order: who decides whether it is acceptable to temporarily turn off an expensive model, who approves degraded answers, and who is responsible for customers with strict SLA. Then at 02:15 you do not have to guess whom to message first.
A well-built runbook does not save hours. More often, it saves the first 7-10 minutes, and that is the difference between a local dip and a visible incident for users.
First 5 minutes
First, check that the alert is real and not noise. One chart is not enough. Compare it with at least two signals: error logs, p95 latency, request volume, token spend, or complaints from the product.
If the alert fired for one service and neighboring metrics are calm, do not rush to declare an incident. Often it is not the model itself that is broken, but the monitoring, a limit on one key, or a single provider.
Then look at the time the problem started. Open the stream of releases, feature flags, and background job schedules. Compare that with traffic: was there a spike in requests, a model switch in routing, a large email blast, a nighttime batch process, or a new client? In LLM services, error and latency spikes often begin not with a provider outage, but with a normal load jump after a release.
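If the change log is kept as structured entries, the "what landed right before this" question can be answered mechanically. A minimal sketch, assuming a hypothetical list of timestamped entries with made-up fields and times:

```python
from datetime import datetime, timedelta

# Illustrative change log entries; in practice these come from your release and flag tooling.
changes = [
    {"at": "2024-05-14T01:40", "what": "prompt v12 rollout", "author": "alice"},
    {"at": "2024-05-14T01:55", "what": "max_tokens 512 -> 2048 for /chat", "author": "bob"},
    {"at": "2024-05-13T16:10", "what": "new rate limit for key acme-prod", "author": "carol"},
]

def changes_before(incident_start: str, window_hours: int = 6) -> list[dict]:
    """Return changes that landed within window_hours before the incident started."""
    start = datetime.fromisoformat(incident_start)
    lower = start - timedelta(hours=window_hours)
    return [c for c in changes if lower <= datetime.fromisoformat(c["at"]) <= start]

for change in changes_before("2024-05-14T02:13"):
    print(change["at"], change["what"], change["author"])
```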
At these moments, it helps to quickly sort symptoms into three buckets:
- errors: 4xx, 5xx, timeouts, rate limit, empty or cut-off responses
- cost: request cost, tokens per request, or share of the expensive model
- latency: p95 and p99, queue length, time to first token
That split saves time. If only cost grew, you do not need to fix the network. If only latency grew, you should not immediately roll back the last API release.
The next step is simple and often the most useful: pause anything that changes system behavior on its own. Stop rollout, canary, auto-experiments, A/B tests, and automatic switching to new models if it is not limited by strict guardrails. First freeze the picture, then change settings.
If the team works through a single gateway, in the first minutes it is especially useful to check whether the provider or model route changed. The code may be the same, but latency and cost changed because of a different external path.
And one more thing many people skip: open the incident note right away and record the current state. Four lines are enough — start time, what broke, what you already checked, and what you froze. Ten minutes later memory starts to fail, and a short note removes arguments about the facts.
Minutes 5-10
If the alert has not cleared in the first few minutes, it is time to narrow the cause. Do not stare at the overall chart. Break the failure down into four slices: model, route, key, and client. Often one model causes the spike in 5xx, one route sees latency growth, and one client simply started sending requests that are too long.
In logs and metrics, it helps to answer two questions quickly: is the problem global or local, and what rose first — errors, latency, or cost. If you have a single API gateway, like AI Router, this step matters even more. From the outside, it all looks like one stream, but inside the cause may sit in one provider or even one routing rule.
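The four-slice breakdown is easy to run on exported request logs. A minimal sketch, assuming hypothetical rows with model, route, key, status, and latency fields; the same grouping works in any log query language:

```python
from collections import defaultdict

# Illustrative request log rows; in practice these come from your log store.
rows = [
    {"model": "qwen3", "route": "primary", "key": "acme", "status": 200, "latency_s": 2.1},
    {"model": "qwen3", "route": "fallback", "key": "acme", "status": 500, "latency_s": 9.8},
    {"model": "deepseek-v3", "route": "primary", "key": "beta", "status": 200, "latency_s": 1.4},
]

def error_rate_by(dimension: str) -> dict:
    """Error rate per value of one slice: model, route, key, or client."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[dimension]] += 1
        errors[r[dimension]] += r["status"] >= 500
    return {value: errors[value] / totals[value] for value in totals}

for dim in ("model", "route", "key"):
    print(dim, error_rate_by(dim))
```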
Check the settings that break the service quietly, without an obvious outage. Timeouts may have become too short for the current model. The rate limit at the key level may have hit the ceiling. Response length often hits both metrics at once: latency rises and the bill rises.
A quick check at this point usually looks like this:
- compare p95 latency and error rate by model and route
- find the clients or keys where prompt tokens and completion tokens suddenly grew (a sketch of this check follows the list)
- check whether cache hit rate dropped on frequent requests
- see whether the expensive route started being used more often than usual
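For the token-growth item, one way to flag it is to compare current averages against a baseline per client key. A minimal sketch with made-up numbers and a 1.5x threshold that is purely an assumption:

```python
# Baseline vs current average token counts per client key; all numbers are illustrative.
baseline = {"acme": {"prompt": 900, "completion": 250}, "beta": {"prompt": 400, "completion": 180}}
current  = {"acme": {"prompt": 950, "completion": 260}, "beta": {"prompt": 410, "completion": 700}}

GROWTH_THRESHOLD = 1.5  # flag anything that grew by 50% or more; the threshold is an assumption

for key in current:
    for kind in ("prompt", "completion"):
        ratio = current[key][kind] / baseline[key][kind]
        if ratio >= GROWTH_THRESHOLD:
            print(f"{key}: {kind} tokens grew {ratio:.1f}x vs baseline")
```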
A cache miss often looks like "everything suddenly got more expensive and slower." In reality, the service just stopped reusing repeated answers, and every request went through full generation. That can happen after changing the prompt, system message, or sampling settings.
If latency and cost are rising together, do not wait for a perfect diagnosis. At the 5-10 minute mark, it is better to limit the damage. Lower max_tokens for the problematic route. Remove the most expensive path from the fallback chain. Move part of the traffic to a backup model if recent checks showed acceptable quality.
The temporary fix must be reversible. Write down exactly what you changed: traffic share, token limit, list of clients, or route. An hour later, that will save a lot of nerves and will keep you from "fixing" the service by creating a new, hidden problem.
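Recording the change can be as simple as appending an entry with the old value, the new value, and a timestamp, so the rollback is one lookup away. A minimal sketch with hypothetical targets and settings:

```python
from datetime import datetime, timezone

incident_overrides: list[dict] = []

def apply_override(target: str, setting: str, old, new) -> dict:
    """Record a reversible change: what was touched, what it was, what it is now, and when."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "target": target, "setting": setting, "old": old, "new": new,
    }
    incident_overrides.append(entry)
    # ...here you would actually push the change to your router or config store...
    return entry

apply_override("route:chat-primary", "max_tokens", 2048, 512)
apply_override("fallback-chain:chat", "routes", ["cheap", "expensive"], ["cheap"])

for e in incident_overrides:
    print(e["at"], e["target"], e["setting"], e["old"], "->", e["new"])
```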
Minutes 10-15
If errors or latency have been holding for ten minutes already, the on-call engineer should stop trying everything at once. At that point, you need one step with a quick effect and a clear check a few minutes later.
Degradation mode is turned on not for the report, but so the service can keep breathing. It is needed when full answer quality already hurts SLA, cost, or queue length. In practice, that can mean temporarily lowering max_tokens, turning off tool calling for secondary scenarios, moving non-urgent tasks into a queue, or switching from an expensive model to a faster one. If the team works through AI Router, a fast move is often sending part of the traffic to another provider or another model through the same OpenAI-compatible endpoint.
A bad idea is changing four things at once. If you touch the model, timeout, retries, and limits all together, in five minutes nobody will know what actually worked. It is better to choose one action and watch the same metrics that triggered the alert. For example, reduce traffic to the problematic model from 80% to 20% and check whether error rate, p95, and average request cost dropped.
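If routing weights live in your own application rather than a managed gateway, that single change is just a weight edit. A minimal sketch with assumed route names; with a gateway you would change the equivalent setting in its console or API:

```python
import random

# Before the change the problematic model carried 80% of traffic; after the single change, 20%.
weights = {"problem-model": 0.2, "backup-model": 0.8}

def pick_route() -> str:
    """Pick a route for one request according to the current weights."""
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Quick sanity check that the split is roughly what you intended.
sample = [pick_route() for _ in range(10_000)]
print({route: round(sample.count(route) / len(sample), 2) for route in weights})
```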
The team and support need a short status, without long explanations. It is enough to say when the problem started, which requests are affected, what you already changed, and when the next update will come. For support, a simple phrase is enough: "We are seeing higher latency in part of the requests, an alternate route is active, and we will be back with a new status in 10 minutes."
For escalation, this set is usually enough:
- incident start time
- what is affected: model, provider, region, or request type
- numbers before and after the change
- what you already did and at what minute
- what risk remains right now
Set the next check immediately, not "later." After 5-10 minutes, the on-call engineer looks at the same charts again and makes one of three decisions: keep degradation, roll back the change, or call the next line. This simple rule is often more useful than any pretty diagram.
How to quickly separate cost growth from traffic growth
If the bill rises together with the number of requests, it is easy to make a false conclusion: load went up, so spend went up as expected. The check should not start with the total bill, but with unit cost. Look at cost per request and cost per 1,000 input and output tokens. If traffic rose by 30% and request cost almost doubled, traffic is not the only reason.
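The same check as two lines of arithmetic, with made-up numbers for yesterday and today:

```python
# Yesterday vs today; all numbers are illustrative.
requests_yesterday, bill_yesterday = 100_000, 220.0   # USD
requests_today,     bill_today     = 130_000, 540.0   # traffic +30%, bill up ~2.5x

cost_per_req_yesterday = bill_yesterday / requests_yesterday
cost_per_req_today     = bill_today / requests_today

print(f"traffic growth:   {requests_today / requests_yesterday:.2f}x")
print(f"unit cost growth: {cost_per_req_today / cost_per_req_yesterday:.2f}x")
# If unit cost growth is close to 1x, traffic explains the bill; here it is ~1.9x, so it does not.
```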
Then break spend down by model. A common story: some requests moved from a cheap model to an expensive one after fallback, a provider change, or a routing rule edit. If you work through a single gateway, that shift is easier to see in the share of requests by model and provider. Sometimes just 10% of traffic on an expensive route is enough to blow the budget.
Also check prompt and response size separately. With the same number of requests, spend can rise sharply if the system prompt got longer, users started sending more context, or the model began responding too verbosely. Look at the average and p95 for input tokens and output tokens. Growth only in output tokens often points to long responses or extra tool-call loops.
A quick breakdown is best done across five points:
- cost per request and per 1,000 tokens
- traffic share by model and provider
- average size of the system prompt, user input, and response
- number of retries, timeouts, and duplicates
- recent edits to the system prompt and tool calls
Retries and duplicates burn budget quietly. The user sees one answer, but the service may make two or three calls because of a short timeout, a repeated queue send, or an idempotency issue. If useful answers did not increase and spending did, look there first.
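If each billed provider call carries the id of the user request that triggered it (an assumption about your logging), duplicates are one counter away. A minimal sketch:

```python
from collections import Counter

# Billed provider calls, each tagged with the user request that triggered it; illustrative data.
billed_calls = ["req-1", "req-2", "req-2", "req-3", "req-3", "req-3", "req-4"]

calls_per_request = Counter(billed_calls)
duplicated = {req: n for req, n in calls_per_request.items() if n > 1}

print(f"{len(duplicated)} of {len(calls_per_request)} user requests were billed more than once")
print(duplicated)  # e.g. {'req-2': 2, 'req-3': 3} -> look at timeouts, retries, idempotency
```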
After that, open the recent changes. Often a small edit is to blame: a long system prompt was added, max_tokens was raised, a new tool was enabled, or part of the tasks moved to a more expensive model. That is why thresholds should be defined in advance: how much cost growth per request is acceptable, what share of the expensive model is allowed, and at what prompt size the team immediately rolls back the change.
Example of a night shift
02:13. After the night release, the dashboard shows an unpleasant shift: p95 has risen from 4 to 11 seconds. At the same time, 5xx barely move, there are few timeouts, and there are no customer complaints yet. This is a dangerous case. The service is formally alive, but the user is already waiting too long, and money is starting to burn faster than usual.
The on-call engineer does not look at the overall chart first, but at the breakdown by routes. Requests per minute have increased only slightly, but the average output tokens per response has almost doubled. On the cost chart, another trace appears: part of the traffic after the release moved to fallback, and fallback now leads not to the usual model, but to a more expensive one. If the team works through a single gateway, this imbalance is easy to see in route share and price per 1K tokens.
After a couple of minutes, the picture becomes clear. The new route started triggering more often on the latency threshold, so the system is switching requests to the backup model too early. There are few errors because the expensive model responds reliably. But it takes longer to think and returns longer responses. That creates two problems at once: p95 rises, and the token bill also climbs.
The on-call engineer does not argue with the charts and does not look for a perfect cause. They take two reversible steps. First, they return the previous route for the main request type. Then they cut max_tokens down to the safe limit that was already in production before the release. It is a blunt move, but at night it is often better than a beautiful hypothesis.
After 10 minutes, the metrics begin to recover. p95 drops first to 8 seconds, then lower. The average response size falls, the share of the expensive model returns to normal, and the cost per minute also goes down. Errors barely change, and that is a good sign: the team fixed the route and response length, rather than masking the failure with retries.
After that, the on-call engineer records three facts in the log: which release went out, which fallback was enabled, and which max_tokens limit was restored. In the morning, those three lines will help the team find the root cause quickly and update the runbook so the next night shift takes 5 minutes, not 20.
Where on-call engineers most often go wrong
The most common mistake is simple: a person starts turning every lever at once. They change the model, lower the timeout, adjust the rate limit, and enable a new route. Five minutes later, it is already unclear what helped and what added noise. During an incident, it is better to make one change at a time and immediately record when you made it.
That kind of chaos is expensive. If the team turned on retries and switched traffic to a more expensive model at the same time, the service may start answering again, but the bill can grow so much that by morning you are dealing with a different incident.
The average calms you down too early
Many people look only at average latency. That is a trap. The average can stay normal while p95 and p99 have already moved up, and users are waiting longer than usual.
If the on-call engineer sees "2 seconds average" and relaxes, they miss the real problem in the tails of the distribution. For LLM services, that is a normal pattern: some requests go through quickly, while long or complex prompts get stuck and hurt the experience for real users.
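A small numeric illustration of the trap: a handful of slow requests barely move the mean but shift p95 a lot (all latencies are made up):

```python
import statistics

# 95 fast requests and 5 slow ones; values in seconds, purely illustrative.
latencies = [1.8] * 95 + [28.0] * 5

mean = statistics.mean(latencies)
p95 = statistics.quantiles(latencies, n=100)[94]   # 95th percentile

print(f"mean = {mean:.1f}s, p95 = {p95:.1f}s")
# The mean stays near 3s and looks "normal", while p95 is already above 25 seconds.
```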
Another common mistake is fixing only errors and not looking at cost. Suppose 5xx disappeared after switching to another route. Good. But what happened to cost per 1K tokens, response length, and the number of repeated requests? Sometimes the error goes away only because the system started answering longer, slower, and more expensively.
Switching traffic without a quick quality check
During an incident, it is tempting to move traffic to any live route. That is understandable, but the risk is high. Another model may return a different JSON format, follow instructions worse, or miss required fields more often. If switching providers takes minutes, that is convenient, but a quick quality check is still needed.
Usually a short run with a few real requests is enough: one short request, one long request, and one with a strict response format. That takes less time than untangling a broken integration after a rushed switch.
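A sketch of such a run against an OpenAI-compatible chat completions endpoint; the base URL, environment variable, model name, and prompts below are assumptions to replace with your own:

```python
import json
import os
import requests

# Assumed values: replace with your real gateway URL, key, and backup model name.
BASE_URL = "https://api.example.com/v1"
API_KEY = os.environ["GATEWAY_API_KEY"]
BACKUP_MODEL = "backup-model"

smoke_tests = [
    ("short", "Reply with the word: ok"),
    ("long", "Summarize this support ticket in three sentences: " + "customer reports timeouts. " * 40),
    ("strict-json", 'Return strictly a JSON object {"status": "...", "reason": "..."} for: payment failed'),
]

for name, prompt in smoke_tests:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": BACKUP_MODEL, "messages": [{"role": "user", "content": prompt}], "max_tokens": 300},
        timeout=30,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    if name == "strict-json":
        json.loads(answer)  # raises if the backup model broke the required format
    print(name, "ok,", len(answer), "chars")
```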
And the last mistake, the most boring and the most expensive: the on-call engineer does not leave an exact record of their actions. Without timestamps, the team later argues about when p99 rose, who turned on retries, and after which edit the cost jumped. That is why the runbook should require a simple note: what changed, at what time, and what effect it had after 2-3 minutes.
Short checklist before escalation
Escalation without facts only adds noise. If the on-call engineer writes "the service is down," the service owner will first ask five clarifying questions, and you will lose time. Before sending the message, gather a short summary that makes a decision possible right away.
First, record the exact start time and the current scale. You need one line: when the first spike began, which metric went out of range, and how many requests or customers were affected. The difference between "errors since 02:14" and "something strange happened at night" is huge.
Then separate the main symptom from the side effects. Often errors, latency, and spend rise together, but almost always one signal moves up more strongly than the others. If p95 doubled and errors barely changed, that is one scenario. If request cost rose sharply with the same traffic, look for another model, another provider, fallback, or a higher max_tokens.
Before escalation, check five things:
- you have the start time, current error rate, p95, or request cost
- it is clear what is growing the most: errors, latency, or spend
- you can see which models, API keys, providers, and clients are involved
- you already performed one safe action, and measured its effect after 2-3 minutes
- the service owner received a short message with numbers and the next step
One quick step is needed almost every time. You can remove the expensive fallback, move part of the traffic to a neighboring model, lower the rate limit for a noisy client, or temporarily turn off long responses. What matters is not the step itself, but the measurable result: better, worse, or unchanged.
If traffic goes through a single gateway, it helps to look at the breakdown by model, provider, and key right away. Sometimes the problem is not the whole service, but one route or one client key. That changes escalation a lot.
The message to the service owner should fit in 4-5 lines. For example: "At 02:14 we saw a rise in 5xx, and 12% of requests are now failing. Qwen 3 and DeepSeek V3.2 are affected for two clients. We moved 30% of traffic to the backup route, and in 3 minutes the error rate dropped to 7%. We need a provider limit check and a decision on full failover."
What to add to the runbook after an incident
After an incident, it is best to edit the runbook right away, while the on-call engineer still remembers where time was lost. If someone spent 10 minutes looking for the right chart, remembering the route switch command, or checking the chat, the problem is already in the document.
First, add exact thresholds. Not general phrases, but numbers: what 5xx rate counts as an incident, what p95 latency already requires a switch, what cost increase over 10 minutes is abnormal, and at what number of 429s you should check limits. Next to them, write the commands, dashboard names, and chart snapshots for normal conditions and the failure state. The on-call engineer should see immediately where to look first.
Split the runbook by failure type
One general document for all cases quickly becomes confusing. It is better to keep separate branches for errors, cost, latency, and hitting limits. The symptoms may look similar, but the first actions are different.
Cost growth is often caused not by a broken model, but by long responses, a cache outage, a route change to a more expensive model, or retries. Latency growth more often leads to checking the provider, queue length, timeouts, and response size. If all of that sits in one section, the on-call engineer starts jumping between steps.
For each route, list the main model, the backup model, and the acceptable degradation. For example, you may temporarily lower max_tokens, turn off heavy reasoning, simplify the response format, or move to a faster model with slightly worse quality. The runbook should clearly state the limit: what can be degraded, and what must not be touched.
If you work through AI Router, add three more items: where to view audit logs, how to check provider routing, and what limits are set at the key level. That helps quickly separate an external provider failure from a local traffic spike or a rate limit hit at a specific service.
Update the runbook immediately after the incident
After the incident, a short 15-20 minute review is enough. Record which signal was noticed too late, which step worked immediately, what was unnecessary, and what was missing on the on-call screen.
If the service started throwing 429s at night and the team saved it by moving part of the traffic to a backup model, that path needs to be written into the runbook as concrete actions. Not in the memory of one engineer, but in a document that someone else will open at 03:00 and follow without guessing.
Teams that are already building this kind of setup through AI Router find it easier to keep one source of truth for routes, audit logs, and limits. But even with a good gateway, no one has canceled the runbook: in an incident, the team that wins is not the one with the most settings, but the one that makes one right decision after another in the first 15 minutes.
Frequently asked questions
What should I do in the first 5 minutes of an incident?
First, make sure it is not just noisy alerting. Compare the alert with at least two other signals: error rate, p95 latency, token spend, or product complaints.
Then open the list of recent changes and immediately freeze rollout, auto-experiments, and unnecessary route switching. That gives you the real picture and keeps you from adding new noise yourself.
How can I tell whether cost is rising, not just traffic?
Look not only at the total bill, but at unit metrics. If the number of requests barely changed and the cost per request or per 1,000 tokens went up, the cause is usually the model, fallback, response length, or retries.
It also helps to check traffic share by model and provider. Even a small shift to an expensive route can quickly inflate spend.
Which metrics should I check first?
Usually three groups of metrics are enough: errors, latency, and cost. Inside them, look at 4xx/5xx, p95/p99, time to first token, input tokens, output tokens, and traffic share by model.
If the service goes through a gateway, immediately break the data down by model, provider, API key, and client. The overall chart often hides a local failure.
Why can the bill spike without traffic growing?
Most often the cause is retries, long responses, or a quiet switch to a more expensive model. Spend also rises when extra context gets into the prompt or when fallback starts triggering more often than usual.
Check max_tokens, the size of the system prompt, and the number of duplicates per user request. These things burn budget without an obvious traffic spike.
When should I switch on degradation mode?
Do not wait for a full diagnosis if p95 and cost are rising together or the queue is filling up fast. At that point, it is better to reduce the damage temporarily than to preserve full answer quality and lose SLA.
Usually reversible measures help: lower max_tokens, remove the expensive fallback, or move part of the traffic to a faster model. After that, check the same metrics again a few minutes later.
What should I pause first?
Right away, stop anything that changes service behavior on its own. That includes rollout, canary, A/B tests, automatic model switching, and other background changes.
While the system is changing routes by itself, you lose time chasing false leads. First freeze the state, then make changes deliberately.
How do I narrow down the cause of a failure quickly?
Split the incident into four slices: model, route, API key, and client. That way you can quickly see whether the problem is global or sits in one route, one provider, or even one noisy client.
Then answer a simple question: what grew first — errors, latency, or cost. That order is usually much more useful than a single overall service chart.
How do I safely switch traffic to a backup route?
Do not send all traffic at once. First move a small share of requests and run a few real scenarios: a short request, a long request, and one with a strict JSON response format.
If quality holds, increase the share gradually and watch error rate, p95, and cost per request. This is much easier to roll back if the backup model behaves differently.
What should I tell the service owner and support during an incident?
Write it down briefly and with numbers: when it started, what went up, who was affected, what you already changed, and when the next update will come. Long explanations only slow decisions down.
For support, a simple statement without guesses about the cause is enough. It is better to say there is a latency increase and an alternate route is active than to promise a precise diagnosis too early.
What must be updated in the runbook after an incident?
Right after the review, add exact thresholds, switching steps, and the commands the on-call engineer had to search for manually. If someone spent minutes at night looking for a dashboard or a command, the runbook was already incomplete.
It also helps to record which step had an effect, which one was unnecessary, and which charts should stay on one screen. Then the next on-call person will not have to rely on memory.