Jul 03, 2024 · 8 min read

Pre-release evaluation pipeline: from golden set to regressions

A pre-release evaluation pipeline helps catch regressions before launch: how to build a golden set, choose metrics, and create a report people can read in 10 minutes.


Why a release breaks what worked yesterday

A new release can noticeably change how a system behaves, even if the team only changes a prompt or switches the model. The same request starts producing answers that differ in tone, length, step order, or confidence. Yesterday the model honestly said, "I don't know," and today it confidently makes something up.

In LLM products, this is especially noticeable. The interface stays the same, the API does not change, but the behavior is already different. For the user, it looks like random instability: yesterday the bot filled out a form correctly, and today it mixes up fields or asks for extra information.

Manual review often misses these failures. Usually people look at 10-20 familiar examples where things are expected to work anyway, and quickly feel reassured. But silent regressions hide in rare phrasings, long conversations, edge cases, and answers that look fine at first glance but already break the business logic.

The problem is that some mistakes do not look like mistakes on first read. An answer can be polite and smooth, but still mix up an amount, date, or customer name, skip a required warning, break JSON for the next step, or give an overly confident answer without evidence.

Users notice this right away. The bot responds slower than usual, repeats the question, goes off topic, ignores the brand voice, or suddenly asks for data it never needed before. In banking, retail, or healthcare, this quickly turns into a complaint, a repeat contact, and extra work for the operator.

That is why pre-release checks are usually built not around the "ideal" answer, but around comparison with the previous version. The ideal is often debatable. Two strong reviewers may rate the same answer differently. But it is much easier and more useful to spot a drop compared to what already worked in production.

If the old version solved 92 out of 100 typical requests, the new one does not need to look prettier in a demo. It just needs to be no worse on the same data. This approach quickly brings everyone back to reality. It helps catch not abstract "model quality," but concrete regressions: where it got slower, more expensive, longer, less accurate, or riskier for the business.

Where to start

Do not start with metrics and tables. First, the team should describe in one sentence what is actually changing. For example: "We are changing the response model in the bank support chat without touching the interface or the prompt." That is often enough to narrow the check and cut out unnecessary debates.

This wording immediately sets the boundaries. You are not evaluating the whole product. You are checking one specific change and its consequences.

If the release goes through a single gateway like AI Router, it is also better to name the scenario directly: the team is moving part of the traffic to another model through the same OpenAI-compatible endpoint and wants to know whether facts, tone, and latency got worse. That is already a workable task definition.

Who is responsible for the check

The first check almost always has the same problem: everyone is involved, but nobody owns the result end to end. It is better to assign roles in advance. One person maintains the golden set, one runs the tests and checks repeatability, and another makes the release decision. Sometimes that is two people, not three. That is fine. What is not fine is when the owner of the set changes examples at the last minute and the team that decides on release has not seen the details.

After roles, write down the risks. Not ten or twenty. For the first pass, 3-5 items that can actually derail the release are enough. The list usually becomes clear quickly: the model mixes up facts, writes too casually for a banking tone, breaks the response format, or responds more slowly than allowed.

It is better to phrase risks as testable expectations. Not "quality may get worse," but "the share of factual errors must not increase" or "the JSON response must not break." Then the decision can be made without a long meeting.

What will stop the release

Before launch, agree on what result means "stop." Otherwise the negotiation starts only after the numbers are in.

Usually a few simple rules are enough:

  • any drop in critical facts stops the release;
  • a format failure above the agreed threshold stops the release;
  • latency above the limit sends the release back for work;
  • disputed cases go to the person who makes the final decision the same day.

If these rules are written down in advance, the report can be read in ten minutes instead of being argued over for half a day.
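
These stop rules are simple enough to encode directly. A minimal sketch, assuming hypothetical metric names and the thresholds the team agreed on; disputed cases still go to a person:

```python
# A minimal release gate. Metric names and thresholds are illustrative;
# substitute whatever your team actually wrote down before the run.
def release_gate(baseline: dict, candidate: dict,
                 max_format_failure_rate: float = 0.01,
                 max_p95_latency_s: float = 4.0) -> list[str]:
    """Return the blocking reasons; an empty list means the release may proceed."""
    blockers = []
    if candidate["critical_fact_accuracy"] < baseline["critical_fact_accuracy"]:
        blockers.append("drop in critical facts")
    if candidate["format_failure_rate"] > max_format_failure_rate:
        blockers.append("format failures above the agreed threshold")
    if candidate["p95_latency_s"] > max_p95_latency_s:
        blockers.append("p95 latency above the limit")
    return blockers
```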

How to build a golden set

A golden set is better built from real user behavior, not invented prompts. Take requests from logs, support conversations, and incident reviews. These are the examples that quickly show where the model gets confused, stays silent, or answers too confidently.

You do not need a huge archive for the first version. Often 150-300 examples are enough if they cover real scenarios well. If you work in banking, telecom, or healthcare, mask PII right away: names, numbers, addresses, accounts, phone numbers.

Then clean the set. Remove duplicates, almost identical phrasings, and noise that tests nothing. If ten users asked the same thing in different words, keep two or three versions. That is enough.

Next, split the examples by scenario and difficulty. Otherwise the set will be flooded with common easy requests, and the hard cases will disappear. In practice, four groups are usually needed: everyday questions, multi-step scenarios with long context, ambiguous requests where the model should ask a clarifying question, and rare but risky cases.

Rare and awkward examples are very useful. Add typos, mixed Russian and English, broken phrases, angry messages, very long questions, and requests without the needed details. If the model works in support, include cases where the user asks for something the system should not do.

Each example should have more than just a "correct answer" field. It is much more useful to record what exactly you expect. For one task, you need an exact reference text. For another, evaluation rules matter more: the model does not invent facts, answers in the right tone, asks a clarifying question, does not reveal personal data, and does not promise an unavailable action.

The minimum for each record is simple: the request itself, the scenario, the difficulty level, the expected answer or a short rubric, and a note on why the example was included.
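
A single record can be a small structured object. A hypothetical example, written as a Python dict; the field names are assumptions, not a fixed schema:

```python
# One golden-set record; field names are illustrative.
record = {
    "id": "gs-0142",
    "request": "I want to close the card but keep the account, what do I do?",
    "scenario": "card_servicing",
    "difficulty": "medium",
    "expected": None,  # no exact reference text for this task
    "rubric": [
        "does not invent fees or conditions",
        "asks which card if the customer has several",
        "does not promise actions the bot cannot perform",
    ],
    "why_included": "real support case: closing a card is often confused with closing an account",
}
```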

Treat the golden set as a living working artifact. After every production incident, add a new example the same day, not "sometime later." After a few releases, such a set will start catching problems before users do.

Which metrics to choose

If the report has 12 numbers, the team will remember none of them. For a release, 2-4 metrics are usually enough to answer the common questions about the result.

A good basic frame is this: did the model solve the task correctly, did it give a complete answer, did it follow safety rules, and did it stay within the required response time? These four metrics are understandable to product, engineering, and the manager who opens the report for ten minutes before release.

Each metric needs a simple way to count it. "Correct" can be measured as the share of answers that match the intended meaning. "Complete" as the share of answers with all required fields or points. "Safe" as the share of answers without personal data leaks, dangerous advice, or policy bypasses. "Fast" as p95 response time or the share of requests that stayed within the limit.
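
Counted over per-example results, the four metrics reduce to a few lines. A sketch assuming each result row carries hypothetical boolean flags and a measured latency:

```python
import statistics

def summarize(results: list[dict]) -> dict:
    """Aggregate per-example results into the four release metrics."""
    n = len(results)
    return {
        "correct":  sum(r["matches_intent"] for r in results) / n,
        "complete": sum(r["all_required_points"] for r in results) / n,
        "safe":     sum(not r["safety_violation"] for r in results) / n,
        # quantiles(n=20) yields 19 cut points; the last is the 95th percentile
        "p95_latency_s": statistics.quantiles(
            [r["latency_s"] for r in results], n=20)[-1],
    }
```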

Thresholds are best set in two layers. First across the whole golden set. Then for the groups where mistakes are especially expensive: long dialogues, mixed languages, legal wording, JSON responses, requests with PII. A general passing score might be: accuracy at least 92%, completeness at least 90%, safety 99.5%, p95 no higher than 4 seconds. But for a group that requires JSON, it is reasonable to require 99% format compliance even if the overall result is higher.
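
Written down, the two layers can be a plain config next to the test code. The numbers are the example thresholds above; the group names are assumptions:

```python
# Layer 1: thresholds for the whole golden set.
# Layer 2: stricter overrides for groups where mistakes are expensive.
THRESHOLDS = {
    "overall": {"correct": 0.92, "complete": 0.90,
                "safe": 0.995, "p95_latency_s": 4.0},
    "groups": {
        "json_responses": {"format_compliance": 0.99},
        "requests_with_pii": {"safe": 1.0},
    },
}
```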

Also count failure modes separately. Otherwise they will disappear inside the average score. At the top of the report, call out unexplained refusals, empty answers, format violations, timeouts, and provider errors.

This is not noise, it is a direct release risk. If the model became slightly more accurate but the share of empty responses rose from 0.2% to 3%, it is too early to ship. The same goes for format: for a team that switches models behind one OpenAI-compatible endpoint, broken JSON is often more painful than a small drop in text quality.

Good metrics do not try to describe everything. They quickly show whether things got better or worse, and tell you where to inspect manually.

How to run the candidate

Compare models without rewriting code
In AI Router, change only the base_url and run the candidate on the same OpenAI-compatible API.
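
In practice that swap is a one-line change. A minimal sketch using the OpenAI Python client; the URL, key, and model names are placeholders:

```python
from openai import OpenAI

# Baseline and candidate share the same code path; only the endpoint/model differ.
client = OpenAI(base_url="https://router.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="candidate-model",  # the baseline run uses the production model id
    messages=[{"role": "user", "content": "How do I block my card?"}],
    temperature=0.0,          # pin generation parameters for repeatability
)
print(response.choices[0].message.content)
```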

Comparison breaks the moment the team changes everything at once: the model, the system prompt, the temperature, and even the test set. For one run, it is better to take one candidate and keep the conditions the same. Otherwise you will not know what caused the improvement or the drop.

First, fix the run conditions. The set of examples, generation parameters, input format, and post-processing should not change during the check. If traffic goes through AI Router, it makes sense to explicitly save the model id, system prompt, temperature, max tokens, and other settings that can shift the result even for the same task.

The process usually looks like this (a short code sketch follows the list):

  1. Take the production baseline and the new candidate.
  2. Run both versions on the same golden set in the same order.
  3. Save not only the final score for each example, but also the raw answer, latency, timeouts, and request cost if cost matters for the release.
  4. Put the results into one table where each row is one test and the columns show the baseline, the candidate, and the difference.
  5. Tag failures by type: factual error, refusal, unnecessary verbosity, bad format, policy violation, slower response time.
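
Steps 2-4 fit in a short loop. A sketch where call_model and score stand in for your own client and evaluator:

```python
import csv
import time

def run(golden_set, call_model, score):
    """Run one model over the whole set, keeping raw answers and latency."""
    rows = []
    for ex in golden_set:
        t0 = time.monotonic()
        answer = call_model(ex["request"])
        rows.append({"id": ex["id"], "answer": answer,
                     "latency_s": time.monotonic() - t0,
                     "score": score(ex, answer)})
    return rows

def compare(baseline_rows, candidate_rows, path="comparison.csv"):
    """One row per test: baseline score, candidate score, and the difference."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["id", "baseline", "candidate", "diff"])
        w.writeheader()
        for b, c in zip(baseline_rows, candidate_rows):
            w.writerow({"id": b["id"], "baseline": b["score"],
                        "candidate": c["score"],
                        "diff": c["score"] - b["score"]})
```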

An average almost always hides unpleasant issues. The overall score may rise by 2% because the FAQ starts answering faster, but in loan applications the model starts mixing up the term and the interest rate more often. Formally, the average is better. For a bank, that is still a bad release.

That is why after the overall summary, split the results by segment. Look separately at long and short requests, dialogues with history, cases with PII, complex instructions, and rare scenarios. One failing segment is often more important than a careful increase elsewhere.
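
The per-segment split needs nothing heavier than a grouped average. A sketch assuming each result row carries a segment label:

```python
from collections import defaultdict

def by_segment(rows: list[dict]) -> dict:
    """Average score per segment; one failing segment should stand out."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["segment"]].append(r["score"])
    return {seg: sum(scores) / len(scores) for seg, scores in groups.items()}
```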

Then open the 10-20 worst examples by hand. Automation does not fully save you here. People quickly see what tables miss: the answer became evasive, the tone shifted, the format is technically correct but awkward for the operator to use.

A normal run summary does not sound like "the model got better." It sounds like a short conclusion: where the candidate won, where it lost, what it costs, and which tests block the release right now.

Example: updating a bank chatbot

A bank decided to update the system prompt for its support chatbot. The goal was clear: make answers shorter, stricter in tone, and less likely to drift into general advice instead of a direct answer.

The overall summary looked calm. The average score barely moved. If you looked only at the final score, the release could easily have moved forward.

The problem appeared after breaking results down by question type. On questions about fees, the new version made more mistakes. The bot mixed up conditions or answered too generally when a customer asked about transfers, cash withdrawals, or card servicing.

What the golden set found

The set was not made up only of short, clean questions. It included long messages in a real chat style, where the customer wrote about the card, the plan, limits, and asked to compare two cases in one message. It also included questions about old plans that still appear in support.

It was exactly on these examples that the new prompt started failing. It more often dropped one part of a long request and confidently answered using current conditions where the customer was asking about an old plan. The error did not look severe, but the customer still received the wrong information and went to complain to an operator.

The team quickly saw why this kind of check matters. Not for a pretty average score, but to find narrow failures that hurt the product badly. The report did not have a hundred rows. It had a short summary: the overall result for the old and new versions, the segments with drops, 14 problematic examples, and the decision to block the release or fix the prompt.

In this case, the release was stopped, the instruction for fee-related answers was adjusted, and the problematic segments were run again together with the full set. That took less time than handling complaints after launch.

Where teams usually go wrong

See the parameters for every run
AI Router audit logs help you verify the model, prompt, and temperature without arguments after the test.

Most testing failures do not come from weak metrics, but from poor discipline. The setup can look neat, and the result still gives false confidence.

The first common mistake is building the golden set from successful examples. It ends up filled with good demo chats, clean requests without typos, and short tasks where the model already performs well. Then the release goes live, and users bring everything that was missing from the set: long context, a mix of Russian and Kazakh, angry wording, disputed requests, and old bugs. If there are too few hard cases, the overall score almost always looks better than reality.

The second mistake is changing everything at once. The team takes a new model, rewrites the system prompt, changes the evaluation rules, and updates the retriever at the same time. After that, nobody knows what helped and what broke the answers. For one cycle, it is better to change one major layer and leave the rest alone.

The third mistake is looking only at the average score. The average is convenient for a slide, but it hides painful failures. A model may gain 4% overall and still answer complaints, requests with personal data, or long multi-turn dialogues noticeably worse. If you use routing between models, it is useful to look not only at the overall result, but also at slices by task type and route.

Another problem is that known failures are not tagged separately. As a result, old bugs keep showing up in every run, get mixed with new regressions, and blur priorities. It is much easier to keep a separate tag for them: "known," "acceptable before release," or "blocker."

And finally, the report is often written as if someone will study it for half a day. In reality, the team has ten minutes. In that time, a person should understand what changed, where it improved, where it dropped, and whether the release should go out now.

A quick pre-release check

Before release, you do not need another big run. You need a short control pass that the team does the same way every time.

First, check the set itself. It should include not only scenarios for the new feature, but also the old cases the product relies on every day. If the bot got better at explaining a new loan product but got worse at card blocking, it is too early to ship.

Then revisit the thresholds. They are written before testing, not after seeing the results. Otherwise the team almost always starts bargaining with the numbers. If it was written down in advance that accuracy in critical scenarios cannot fall below 97%, there is nothing to argue about.

It is convenient when everything is in one place: candidate answers, previous-version answers, automatic evaluations, manual notes, prompt version, and model version. One table or one dashboard is better than five scattered files. If the check goes through AI Router, audit logs help quickly verify the run parameters and avoid arguing about which exact configuration was used.

The final check takes ten minutes if it includes five things: an updated set, pre-fixed quality and cost thresholds, all answers and scores in one place, a manual review of the worst cases, and one person who makes the final decision.

Manual review is mandatory. The average score can look fine while two failures in a legal disclaimer, one dangerous piece of advice to a customer, and a couple of broken answers in Kazakh sit at the bottom of the list. These are the cases that later end up in support.

The final release note should be short. Usually three lines are enough: what improved, where the drop is, and who made the decision. If the drop exceeds the threshold, the candidate is not discussed endlessly. It is fixed or rolled back.

What the report should contain

Choose model hosting
Use open-weight models on your own GPU infrastructure if in-country data storage and low latency matter.

The report after the run answers one question: can the release go live or not? In the first paragraph, write what exactly the team changed, what set it was tested on, and how it ended - whether the candidate passed the threshold or showed a regression.

On the first page, three numbers are usually enough:

  • the overall result across the whole golden set;
  • the result for the hard groups where the model fails more often;
  • the number of regressions compared with the current version.

Add a short comment next to the numbers. For example: "The overall score rose by 1.8%, but we found 12 new errors in long-context scenarios." That saves time and does not hide bad news in tables.

After the summary, add 5-10 examples that actually affect the decision. There is no need to fill the report with dozens of similar cases. It is much more useful to show situations where the error is expensive for the business: a wrong answer about a tariff, a missed restriction, confusion in amounts, or a confident invention in an answer to a customer.

A good example is simple: input, old version answer, new version answer, and a short reviewer note. If the new version got worse in one rare test but much better in twenty common ones, that is immediately visible. If it breaks the scenario for VIP bank customers, that is also immediately visible.

The report template should not change from release to release. Usually four blocks are enough: what changed, the three main numbers, the examples that influenced the decision, and the verdict - ship, fix, or send back for rerun.
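
Kept stable across releases, the template can even live in code. A sketch with placeholder fields and example values:

```python
# A fixed four-block report skeleton; fields are filled in per release.
REPORT_TEMPLATE = """\
What changed: {change}
Three numbers: overall {overall}, hard groups {hard_groups}, regressions {regressions}
Examples that influenced the decision: {examples}
Verdict: {verdict} (ship / fix / rerun)
"""

print(REPORT_TEMPLATE.format(
    change="new model behind the same endpoint",
    overall="+1.8%", hard_groups="see breakdown", regressions=12,
    examples="8 attached cases", verdict="fix",
))
```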

What to do after the report

After the report, there needs to be a clear next step. If the release passed, record this run as the new baseline. If it did not pass, send only the errors that will actually change the decision on the next run into the work queue.

For teams comparing several models through one compatible endpoint, it is convenient to keep runs and logs in one place. In AI Router, you can rely on a single API and audit logs to compare candidates under the same conditions instead of arguing from memory or from different exports. This is especially useful when the same test is run across several providers.

The idea itself is simple: do not try to measure everything. Give the team a set it trusts, a few clear metrics, and a short report that can be read in ten minutes. That is already enough to catch regressions before users do.
