LLM regressions: how to catch hidden drift before complaints
LLM regressions are not always obvious right away. Here we break down daily runs, alerts, control cases, and the checks to do before users complain.

Why quiet updates are noticed too late
A provider can change a model or its behavior rules without any separate announcement. On the outside, almost nothing changes: the same endpoint, the same parameters, a similar response style. The team looks at a few successful requests and decides everything is fine.
The problem is elsewhere. A quiet shift rarely breaks the system all at once. It usually starts with small things: the model formats JSON less reliably, skips a field more often, responds more cautiously to edge cases, or trims a long context differently. It is easy to miss by eye, especially if you read the answers the way a person would instead of checking them against clear conditions.
That is why LLM regressions live longer than they should. Inside the team, there is no sense of an outage because the average answer still looks "normal." But for the product, the difference is already there: the parser fails on 2% of requests, classification mixes up neighboring categories, and summarization loses one line that an operator needed to make a decision.
The first signals often come not from engineers but from users, support, or business analysts. One customer writes that the bot has become vague. Another complains that the document is now processed only every other time. By the time those complaints form a clear pattern, the team has already lost several days.
A one-off manual review is almost useless here. People forgive model errors if the answer sounds reasonable overall. It is even worse when the reviewer already knows the "correct" result. The brain fills in the missing pieces and misses the failure.
This is especially visible in production, where the model is part of a chain of steps. If one answer changes the format even slightly, the next part of the system performs worse, even though the model by itself still seems acceptable. That is why teams notice the problem late for a simple reason: they are looking for big failures, while quiet drift almost always starts with small deviations.
What counts as a regression
A regression is not only a gross error or an obvious failure. In production, it is any deterioration compared with yesterday's working state. If the model solves the same task worse, slower, or at higher cost, that is already a problem.
Teams often get a false sense of safety: the answer looks "smart," so everything must be fine. But the user does not need smart text by itself. They need an accurate result in a short scenario: classify a request, fill in fields, return JSON, call the right tool without extra words.
So it is better to look not at abstract "quality," but at stability in repeatable tasks. If yesterday the model correctly extracted a contract number in 95 out of 100 cases, and today it does so in 88, that is a regression. Even if the answers still sound confident.
What breaks first
Usually the shift shows up in a few places:
- accuracy drops in short business tasks where mistakes were rare before
- the response more often comes back in a format the system does not expect
- the number of unnecessary refusals, warnings, and hedges rises
- tool calls, function arguments, or JSON break
- latency and token usage increase for the same request
This hurts not the "beauty" of the text, but the business logic. One extra paragraph can break a parser. The wrong field type in JSON can stop a processing chain. One unnecessary refusal on a safe request can reduce conversion or add work for operators.
A good example is extracting data from an application. Yesterday the model returned short JSON without explanations: name, IIN, amount, term. Today it adds a phrase like "I may be wrong" and sometimes changes the date format. For a person, that is a small thing. For a service, it is already a failure.
When it is already an incident
Do not wait for mass complaints. If the shift repeats on a control set and affects a scenario involving money, documents, request routing, or user replies, that is an incident. The same applies to rising latency and token usage: the meaning of the answer may stay the same, but the model becomes too slow or too expensive.
The rule is simple: if the system has become worse at doing a specific job you already considered reliable, you have a regression.
How to build a control set
Start not with invented examples, but with logs. Take 20–50 real requests from the product and strip out personal data. Such a set is almost always more useful than polished synthetic cases, because it reflects live traffic, odd phrasing, and real user habits.
Regressions are not only dangerous in complex tasks. The model often breaks on simple ones: it mangles strict JSON, stops calling the right tool, or cuts a long response off too early. That is why the set should be uneven. Add short requests, long dialogs, ambiguous phrasing, and edge cases where the model has failed before.
Usually five types of cases are enough: a short question with one exact answer, a long context with an important detail in the middle, an answer in a strict format, a tool-call scenario, and a request with sensitive data where you check PII masking or the correct content label.
Do not keep one perfect answer for every case. Some tasks need a gold answer, while others need only a simple failure criterion. If the model must return JSON with status and amount, check that those fields exist and that the structure is valid. If it must choose one of three classes, compare the label. If it is required to use a tool, record the call itself and its arguments.
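For JSON tasks, such a failure criterion fits in a few lines. A minimal sketch, assuming a schema with status and amount fields; the field names are illustrative, not a real contract:

```python
import json

REQUIRED_FIELDS = {"status": str, "amount": (int, float)}  # illustrative schema

def json_case_passes(raw_answer: str) -> bool:
    """Pass/fail: the answer must be valid JSON with the required fields."""
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        return False  # extra prose around the JSON fails here, by design
    if not isinstance(data, dict):
        return False
    return all(
        field in data and isinstance(data[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

print(json_case_passes('{"status": "paid", "amount": 120.5}'))   # True
print(json_case_passes('I may be wrong, but {"status": "paid"}'))  # False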
For each case, it helps to write down four things: the input, the expected behavior, the check method, and the reason the case belongs in the set at all. A month later, that saves a lot of time. Nobody will wonder why there is an old 8,000-token dialog or a request with a typo here.
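One convenient way to keep those four things together is a small record per case. A sketch, with a made-up structure rather than any required format:

```python
from dataclasses import dataclass

@dataclass
class ControlCase:
    case_id: str
    input_text: str    # anonymized request, exactly as it came from logs
    expected: str      # gold answer or a simple failure criterion
    check: str         # how it is verified: "exact_match", "json_fields", "tool_call"
    reason: str        # why this case is in the set at all

case = ControlCase(
    case_id="typo-classification-014",
    input_text="hi cant log in to my acount since yestarday",
    expected="class=account_access",
    check="exact_match",
    reason="model confused account_access with billing after a past update",
)
```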
If you run the same cases through several providers, it is convenient to keep this on one compatible layer. For example, with AI Router you can send the same requests to different models through one OpenAI-compatible endpoint and see faster where accuracy, format, or speed dropped, without changing the SDK, code, or prompts.
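A minimal sketch of that setup with the official openai Python SDK; the base URL, key, and model names below are placeholders, not real AI Router values:

```python
from openai import OpenAI

# Placeholder endpoint and key: substitute your own gateway values.
client = OpenAI(base_url="https://router.example.com/v1", api_key="YOUR_KEY")

MODELS = ["provider-a/model-x", "provider-b/model-y"]  # hypothetical route names

def run_case(prompt: str) -> dict:
    """Send the same request to every model through one compatible endpoint."""
    answers = {}
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answers[model] = resp.choices[0].message.content
    return answers
```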
And one more rule: if a case cannot be checked in seconds, it rarely survives a daily run. Keep only what gives a clear answer, pass or fail.
How to run a daily check
A daily run works only when there is little randomness in it. Fix everything that affects the answer: the system prompt, the user request template, temperature, top_p, max tokens, seed if available, and the exact route to the model.
If you have a routing layer, control not only the model name but the route itself. Otherwise you will compare different chains and get noise instead of a signal.
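One way to enforce this is to keep the whole run configuration in a single object and fingerprint it, so an accidental prompt or parameter change is visible before you compare results. A sketch with hypothetical values:

```python
import hashlib
import json

RUN_CONFIG = {
    "system_prompt": "You are a support request classifier...",  # full text, not a reference
    "user_template": "Classify the request: {text}",
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 512,
    "seed": 42,  # if the provider supports it
    "route": "primary/provider-a/model-x",  # the exact route, not just the model name
}

# A stable fingerprint: if it differs from yesterday's, you are not comparing
# two runs, you are comparing two configurations.
config_hash = hashlib.sha256(
    json.dumps(RUN_CONFIG, sort_keys=True).encode()
).hexdigest()[:12]
print(config_hash)
```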
Run the same set of cases every day at about the same time. This is not a formality. Provider load, cold caches, and background jobs often change latency and even the shape of the answer. When the run happens on schedule, it is easier to see where the real shift is and where the normal daily variation is.
A good setup compares the new run against two points right away. The first is yesterday's result. It catches a sharp jump. The second is a stable baseline, for example the best known run over the last two weeks. It helps you notice a slow decline in quality, when the model loses 1–2% day by day and that still looks like noise for a long time.
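A sketch of that double comparison for a single pass-rate metric; the tolerances are illustrative:

```python
def detect_shift(today: float, yesterday: float, baseline: float,
                 jump_tol: float = 0.03, drift_tol: float = 0.05) -> list[str]:
    """Compare one pass-rate metric against two reference points."""
    findings = []
    if yesterday - today > jump_tol:   # sharp day-over-day drop
        findings.append(f"jump: {yesterday:.0%} -> {today:.0%} since yesterday")
    if baseline - today > drift_tol:   # slow slide vs the best run of the last two weeks
        findings.append(f"drift: {baseline - today:.0%} below the two-week baseline")
    return findings

print(detect_shift(today=0.88, yesterday=0.95, baseline=0.95))
```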
First calculate everything automatically, and leave manual review for later. Usually four groups of metrics are enough: match rate against the gold answer for tasks with exact answers, refusal and empty-answer rate, format and required-field correctness, plus latency and cost per case.
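A sketch of that aggregation over one run; the per-case field names are assumptions, not a fixed schema:

```python
import statistics

def summarize_run(results: list[dict]) -> dict:
    """Collapse per-case results into the four metric groups."""
    n = len(results)
    gold = [r for r in results if r["has_gold"]]
    return {
        "match_rate": sum(r["matches_gold"] for r in gold) / max(len(gold), 1),
        "refusal_rate": sum(r["refused"] or r["empty"] for r in results) / n,
        "format_pass_rate": sum(r["format_ok"] for r in results) / n,
        "median_latency_ms": statistics.median(r["latency_ms"] for r in results),
        "cost_per_case": sum(r["cost"] for r in results) / n,
    }
```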
After that, the engineer does not need to read the whole run. They need a list of notable deviations: the answer got 40% longer, JSON stopped parsing, the refusal rate rose, the classification shifted to a neighboring class. That way the team spends 15 minutes, not half a day, reading almost identical responses.
It helps to set different thresholds for different tasks. For data extraction, the tolerance is usually strict. For summarization, you can tolerate small textual differences if the meaning, structure, and facts are preserved. One threshold for every scenario almost always creates false alarms.
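In code, per-task tolerance can be as simple as a dictionary that the alerting step reads; the task names and numbers are illustrative:

```python
# Allowed day-over-day drop in pass rate before a case group raises an alert.
TOLERANCE = {
    "extraction": 0.01,      # strict: one lost field already matters
    "classification": 0.02,
    "tool_calls": 0.01,
    "summarization": 0.05,   # lenient: wording may vary if facts and structure hold
}

def alert_needed(task: str, today: float, yesterday: float) -> bool:
    return (yesterday - today) > TOLERANCE[task]
```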
Which alerts actually help
The team quickly drowns in noise if it alerts on every strange response. You need signals that show a shift in one clear metric and lead directly to a check.
The first alert is format pass rate. If yesterday the model returned valid JSON in 98% of control cases and today it dropped to 93%, that is already a noticeable problem. The user will see it later, when forms, parsers, and call chains start breaking in different places.
Also look separately at the refusal rate. A rise in phrases like "I can't help," empty safe answers, or drifting into overly generic text often appears after a hidden provider update. From the outside, it looks like the model has simply become more cautious. For the product, it is a lost scenario.
A minimal set of alerts usually looks like this (a sketch of the checks follows the list):
- format pass rate drops below the set threshold
- refusal rate rises compared with its usual baseline
- median latency and the 95th percentile jump
- response cost rises on the same request set
- tool-call errors appear in a separate category
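A sketch of those five checks over the run metrics; the thresholds are examples, not recommendations:

```python
def evaluate_alerts(today: dict, baseline: dict) -> list[str]:
    """Turn run metrics into a short list of alerts worth reading."""
    alerts = []
    if today["format_pass_rate"] < 0.97:
        alerts.append(f"format pass rate at {today['format_pass_rate']:.0%}")
    if today["refusal_rate"] > baseline["refusal_rate"] + 0.02:
        alerts.append("refusal rate above its usual baseline")
    for key in ("median_latency_ms", "p95_latency_ms"):
        if today[key] > baseline[key] * 1.5:
            alerts.append(f"{key}: {baseline[key]} -> {today[key]}")
    if today["cost_per_case"] > baseline["cost_per_case"] * 1.2:
        alerts.append("cost per case up more than 20% on the same set")
    if today["tool_call_errors"] > 0:
        alerts.append(f"{today['tool_call_errors']} tool-call errors (own category)")
    return alerts
```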
Latency and cost are often underestimated. They should not be. If the same prompt set suddenly answers 800 ms slower or uses 20% more tokens, that is already a shift. Regressions are often visible there first, not in obvious quality complaints.
Tool-call errors should be counted separately from ordinary text failures. The model may give a perfectly normal answer to a person and still break the automated flow: not call the function, pass the wrong arguments, or return broken JSON. That is a different kind of problem, and it is fixed differently too.
If you compare several models or providers, it helps to see these signals in one table. Then it is immediately clear whether the shift happened in one model, with one provider, or in the whole flow at once.
In the morning, the team usually only needs a short summary in the work channel: format, refusals, latency, cost, and tool errors, plus the difference from yesterday. Five lines are read in a minute and often save you from an unpleasant surprise by lunch.
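That summary can be produced directly from today's and yesterday's metric dictionaries. A sketch:

```python
def morning_summary(today: dict, yesterday: dict) -> str:
    """Five lines for the work channel: today's value and the diff vs yesterday."""
    keys = ["format_pass_rate", "refusal_rate", "p95_latency_ms",
            "cost_per_case", "tool_call_errors"]
    lines = []
    for key in keys:
        diff = today[key] - yesterday[key]
        sign = "+" if diff >= 0 else ""
        lines.append(f"{key}: {today[key]} ({sign}{diff:.3g} vs yesterday)")
    return "\n".join(lines)
```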
A real production example
A team has a support bot. It reads an incoming request and returns JSON by template: request type, product, urgency, language, and queue tag. Then the CRM takes this JSON and routes tickets to the right people.
For several weeks, everything runs smoothly. Valid JSON comes back almost every time, and manual review is rarely needed. Then the provider quietly changes the model's behavior, and the bot starts skipping the priority field from time to time.
From the outside, this is almost invisible. The user still gets an answer, the ticket is created, but some requests go to the general queue instead of the urgent one. There are no complaints in the morning, but the daily run shows the shift right away: the share of complete responses drops from 99% to 93%, and on control cases with short complaints, the field goes missing especially often.
This kind of failure is easy to miss if you only look at whether the model responded or not. Formally, it responded. In practice, it did not, because it broke the contract that the ticket flow depends on.
The team does not wait for live incidents to pile up. It temporarily switches the traffic to another route with the same response schema and the same prompt. If the team has routing between models, that takes minutes.
After that, the engineers calmly dig into the cause. They compare yesterday's and today's answers on the same set of requests and quickly see the difference: the new model version more often treats the field as optional if the text does not contain an explicit urgency marker. Sometimes it also writes an explanation next to the JSON, although it did not do that before.
Only then do they decide, without rushing, what to change first: lock in another route, tighten validation, or rewrite the instruction so the model does not skip the field even with a weak signal. Users may not notice anything at all during that time.
Where teams make the most mistakes
The most common mistake is building the control set only from synthetic examples and not looking at live logs. In the test set everything is neat: short requests, clean language, a clear format. In production, people type with typos, mix Russian and Kazakh, add extra context, and expect the answer in the right form.
Because of that, the team sees a green report in the morning and gets complaints in the afternoon. If a chat assistant supports operators in banking, retail, or a call center, dozens of real anonymized requests from logs are often more useful than a hundred polished tests.
The second mistake is checking one perfect answer instead of a simple criterion. For many tasks, you do not need text word for word. You need a result that passes validation: the model chose the right category, did not lose the amount, answered in the right language, did not invent a fact, returned JSON without breaking.
The third mistake is comparing only with yesterday. That is convenient, but not enough. If quality drifts down for two weeks straight, a daily comparison may show nothing. You need a longer reference: the average or the best result over 7, 14, or 30 days on the same control cases.
Another common slip is changing everything at once. In a single day you adjust the prompt, the provider, and the parameters, and then nobody can tell which change actually broke the answers. It is better to keep a simple discipline: one release, one variable.
Finally, teams often lump all failures together. Because of that, the overall score may hold steady even though the response format has already fallen apart and fact extraction has slipped. The easiest way is to split errors into at least five types: factual error, broken format, unnecessary refusal, missing important detail, and answer in the wrong language. That breakdown immediately shows the nature of the shift.
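A sketch of that split as an enum plus a counter, following the five labels above:

```python
from collections import Counter
from enum import Enum

class FailureType(Enum):
    FACTUAL_ERROR = "factual error"
    BROKEN_FORMAT = "broken format"
    UNNECESSARY_REFUSAL = "unnecessary refusal"
    MISSING_DETAIL = "missing important detail"
    WRONG_LANGUAGE = "answer in the wrong language"

# Tag each failed case with one label during review, then count.
failures = [FailureType.BROKEN_FORMAT, FailureType.BROKEN_FORMAT,
            FailureType.MISSING_DETAIL]
print(Counter(failures))  # shows the shape of the shift, not just its size
```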
A daily minimum that actually works
Regressions are often visible within the first 15 minutes of the workday. You do not need a huge report for that. You need a short ritual the team does every day before users notice the problem.
Start with a small smoke set. Usually 20–30 requests are enough: a few simple questions, a few data-extraction tasks, one strict JSON answer, and a couple of cases with sensitive rules where the model has failed before.
After the run, do not read everything in order. Immediately pick the 10 most noticeable differences compared with yesterday or the last stable run. Look not only at meaning, but also at response length, tone, extra refusals, missing required fields, and strange additions such as new disclaimers.
Check two numbers separately: the refusal rate and the broken-format rate. If the model must return JSON, a table, or a set of fields, any crack shows up quickly. Even a rise from 1% to 4% already hits production noticeably.
It is useful to open two long-context cases by hand. Short requests often pass fine, while problems surface where the model has to hold several rules, a long chat history, or a large document excerpt in mind. One case can be summarization, the other precise fact extraction.
If you have a backup route, check that too. Send the same request through the primary and backup paths and compare the answers. In systems like AI Router, this is convenient to do through the same endpoint: you can quickly run identical cases through several providers while keeping audit logs in one place. When a provider quietly changes the model's behavior, this kind of check saves a lot of time.
A daily minimum usually looks like this:
- a short morning run
- 10 notable differences for manual review
- refusal and format numbers
- 2 long cases
- 1 request through the backup route
If anything goes off track at any step, do not wait for the evening report. Freeze the release, shift part of the traffic, and review the examples while the problem is still small.
What to do next
Start not with full coverage, but with the places where an error hits the user immediately. Usually that is 10–20 scenarios: extracting data from a document, answering from a knowledge base, classifying a request, a short summary without hallucinations. If the model shifts there, the team will see the problem before the complaints arrive.
Those scenarios become your first test set. Do not try to build the perfect set in one day. To start, it is enough to include cases where you already had disputed answers, manual edits, or unexpected quality jumps.
The set should have an owner. Not "the team" in general, but one specific person who looks at the results every morning, reviews the failures, and decides what to do next: open an incident, temporarily roll back the route, update the alert threshold, or mark a case as noisy. Usually 15–20 minutes is enough for this.
And keep the data close together. When a test fails, nobody should be piecing the picture together from chats and logs. For each failure, one card is enough: the original prompt, call parameters, model response, review label, date, provider, and the previous run's result. Then you can see not only the failure itself, but also its shape: response length, JSON format, an extra refusal, or an error on an edge case.
For the first week, a simple setup is enough: a morning run, one owner, a clear failure log, and 10–20 risky cases. In a few days, it will become clear which control cases catch drift early and which only create noise. That is already a strong foundation for noticing hidden updates before your users do.
Frequently asked questions
What counts as an LLM regression?
Regression is any deterioration compared with your usual working state. If the model starts breaking JSON more often, mixing up classes, responding more slowly, or using more tokens for the same request, you already have a problem, even if the text looks "smart".
Why does manual review often miss a quiet drift?
Because people easily forgive small model mistakes. A response can sound fine, but the product is already suffering: the parser fails, a field disappears, a tool is not called, or a date arrives in a different format.
How many cases do I need for a first control set?
For a start, 20–50 real requests from logs, cleaned of personal data, is usually enough. That is enough to catch common breakages without turning the morning run into a long chore.
Which is better for the set: real logs or synthetic examples?
Start with live logs and add synthetic examples later. Real requests show typos, long context, mixed languages, and odd phrasing much better, and those are exactly the situations where the model usually fails.
How do I check answers if the exact wording can vary?
Do not chase a word-for-word answer if the task does not require it. Check what actually breaks the product: the class label, required fields, JSON validity, tool calls, answer language, and the absence of invented facts.
Which metrics are useful to check every day?
Look at four things: accuracy on simple tasks, the refusal rate, format correctness, and latency plus cost. That set quickly shows whether the model has become worse for the system, not just for the person reading the answer.
When should a quiet drift count as an incident?
Do not wait for mass complaints. If the control set consistently shows a drop in a scenario involving money, documents, routing, or customer replies, open an incident right away and limit the risk.
How do I run a daily check without too much noise?
Run the same set every day at roughly the same time. Before that, lock the system prompt, request template, temperature, top_p, max tokens, seed if you have one, and the route to the model itself, or you will get noise instead of a comparison.
What should I do if the model suddenly gets worse in production?
First, move part of the traffic to a fallback route if you have one. Then compare yesterday's and today's answers on the same set of cases and look for a concrete shift: a missing field, an extra refusal, longer responses, or a broken tool call.
How do I avoid drowning in false alerts?
Do not alert on every odd response. It is better to keep thresholds for clear metrics, count format errors and tool calls separately, and review a short morning summary of deviations from your usual baseline.