A/B Test Prompt or Model: How to Tell What Worked
A/B tests for a prompt or model can easily produce false conclusions if you change everything at once. Learn how to test the prompt, model, and route separately.

Why A/B tests often lie
An A/B test can easily produce a nice-looking but false conclusion. The team sees a higher score, changes a setting in the live system, and a week later quality starts drifting again. Usually the problem is not statistics, but the test design itself: it compared not one change, but several at once.
The most common mistake is simple. Version A had the old prompt and the old model, while version B had the new prompt and the new model. If the result improved, you still do not know what actually worked. Sometimes the new prompt helps, and sometimes the model drags quality down. Sometimes it is the other way around.
The second problem is different request sets. On Monday, the test included short and simple questions, and on Tuesday it included long requests with tables, nested conditions, and noisy text. The average score will almost certainly shift, even if the prompt and model did not change. For a fair comparison, both versions must answer the same set of requests, in the same order, and under the same evaluation rules.
There is also a less obvious trap: the request route. If you have fallback configured, part of the traffic may go to another model or another provider because of a timeout, a limit, or an error. Then it looks like you are testing one option, but in reality two or three are hidden inside it. If you work through a single gateway such as AI Router, this matters even more: the request path can change without any code changes in the application.
The problem usually comes down to four things: one run changes both the prompt and the model, different request samples are compared, fallback goes unnoticed, and only the average score is reviewed.
An average by itself is often misleading. If 80% of simple requests improved by 3%, while 20% of difficult ones started failing twice as often, the overall result may still look fine. For the business, that is a bad trade. The user will remember not the small rise in the average, but the serious mistake in an important scenario.
An A/B test for a prompt or a model only makes sense when you isolate one variable. Otherwise, you are measuring a mix of the prompt, the model, routing, and traffic composition. That kind of result is hard to reproduce and risky to roll into production.
First define the test question
The most useful test starts with one clear question. If a single run changes the prompt, the model, and the routing rule, you will see the difference, but you will not understand the cause.
So for each run, choose one variable: test a new prompt on the same model and the same route, compare two models with the same prompt, or look separately at how routing changes things once the input data, the prompt, and the model set are fixed.
Before you start, write down what "better answer" means for you. For a support bot, it may mean correctness against the knowledge base. For an internal legal assistant, it may mean fewer hallucinated facts. For ticket classification, it may mean the share of correct labels. If the criterion is not fixed in advance, the team almost always starts arguing after the results come in.
Quality, cost, and latency are better measured separately. One option may be more accurate but cost twice as much. Another may barely change quality and still stay comfortably within the time limit. These are different conclusions, and they should not be mixed into one score.
Before the launch, it helps to lock four things:
- what exactly changes in this test;
- which quality metric decides the outcome;
- what cost and latency limits you accept;
- what counts as a win.
The wording should be simple and testable. For example: "The new prompt wins if accuracy rises by at least 5%, average cost does not rise by more than 10%, and 95% of responses arrive in under 2 seconds."
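If you want to take that rule out of people's heads entirely, you can encode it as a small check before the run. Below is a minimal sketch in Python; the function name, the relative thresholds, and the sample numbers are illustrative, not a prescription.

```python
def new_prompt_wins(base_accuracy: float, new_accuracy: float,
                    base_cost: float, new_cost: float,
                    p95_latency_s: float) -> bool:
    """Pre-agreed rule: accuracy up at least 5%, cost up at most 10%, p95 latency under 2 s."""
    accuracy_ok = new_accuracy >= base_accuracy * 1.05
    cost_ok = new_cost <= base_cost * 1.10
    latency_ok = p95_latency_s < 2.0
    return accuracy_ok and cost_ok and latency_ok

# Example: accuracy 0.78 -> 0.83, average cost 0.012 -> 0.0128 per request, p95 latency 1.7 s
print(new_prompt_wins(0.78, 0.83, 0.012, 0.0128, 1.7))  # True: all three conditions hold
```

Even a detail like whether "5%" is relative or absolute is exactly the kind of thing worth pinning down before the run, not after.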
After that, the test stops being a matter of opinion. You no longer have just a "variant we liked," but a decision you can repeat on the next dataset.
What to lock before starting
If you change two things at once, the test quickly turns into a guessing game. For a fair comparison, first freeze everything that is not part of the hypothesis.
Start with the request set. It must be the same for all variants, in the same order, and with a similar share of simple, medium, and hard cases. If one group has more short questions and another has more long conversations with context, the comparison is already off.
Then lock the generation settings. Temperature, top_p, output token limit, and stop sequences affect the style and completeness of the answer more than people expect. A common mistake looks harmless: the team updates the prompt and also raises max_tokens, then decides the new version is "smarter." In reality, it was just given more room to answer.
Next, check the response format, provider, processing region, timeouts, retries, and fallback rules. If one variant returns free text and another returns strict JSON, you are no longer comparing answer quality, but parsing convenience and the risk of output errors. If one route goes to one provider and another sometimes falls back to a backup path, the test again mixes several causes into one result.
When you work through a gateway like AI Router, it is worth locking not only the model, but also the actual route: which provider is used, where the request is executed, and what happens on failure. Otherwise, part of the traffic will go a different way and blur the conclusion.
It is useful to keep all conditions in one experiment card: prompt version, model, settings, provider, region, response format, timeouts, and retries. A week later, that will save hours and remove unnecessary arguments.
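A sketch of what such a card can look like in code is below. The field names and values are only an example; the point is that variant B differs from variant A in exactly one field.

```python
from dataclasses import dataclass, asdict, replace
import json

@dataclass(frozen=True)
class ExperimentCard:
    """Everything frozen for one variant. Only the field under test may differ between cards."""
    prompt_version: str
    model: str
    provider: str
    region: str
    temperature: float
    top_p: float
    max_tokens: int
    stop: tuple
    response_format: str   # e.g. "json" or "text"
    timeout_s: float
    retries: int
    fallback: str          # e.g. "none" or the name of the backup route

variant_a = ExperimentCard(
    prompt_version="returns-v1", model="model-x", provider="provider-1", region="eu",
    temperature=0.2, top_p=1.0, max_tokens=512, stop=(),
    response_format="json", timeout_s=30.0, retries=1, fallback="none",
)
# Variant B changes exactly one thing -- here, the prompt version.
variant_b = replace(variant_a, prompt_version="returns-v2")

print(json.dumps(asdict(variant_b), indent=2))
```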
How to separate the effects of prompt, model, and route
If you change three things at once, the test loses its meaning. There is a difference in the answers, but the source of that difference is no longer clear. The team may easily decide that the new model helped, when in fact the answer improved because of a better prompt or a different request route.
A good working order is usually this:
- First compare only two prompt variants on the same model and the same route.
- Then keep the winning prompt unchanged and compare models under the same conditions.
- After that, test routing separately: direct call, fallback, different providers, or different regions.
- Run the final combination on a new sample that you did not use before.
This order may seem boring, but it gives a clean result. The prompt changes how the task is framed. The model changes the style and accuracy of the answer. The route changes the provider, latency, limits, and sometimes even the actual model version. When all of that is mixed together, people can argue for a long time later, but prove almost nothing.
You need a new validation set so you do not fool yourself. Suppose prompt A won on 120 bank support requests. Then you take another 120 different requests, with different wording, a different field order, and rarer cases. If the same variant wins again, you can trust the result. If not, you just tuned the test around familiar examples.
How to run the test
Do not start with a couple of convenient examples. Take 100–300 normal requests from real work: customer messages, internal tasks, long and short formulations, simple and controversial cases. Do not choose only the "pretty" requests, or the test will show a showcase instead of the real picture.
First build one shared set and remove only obvious noise: duplicates, empty lines, and broken inputs. Keep the meaningful "awkward" cases. Then split the set into a main group and a control group. The main group is for comparing variants and making the decision. It is better not to touch the control group until the end, so you can verify the conclusion and see whether the effect holds on new data.
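If the set lives in a plain list, the split can be as simple as shuffling once with a fixed seed, as in this sketch (the 70/30 proportion is just an example):

```python
import random

def split_requests(requests: list[str], control_share: float = 0.3, seed: int = 42):
    """One shuffle with a fixed seed, then a main group for comparison and a control group kept aside."""
    shuffled = requests[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - control_share))
    return shuffled[:cut], shuffled[cut:]   # (main, control)
```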
For each request, write down the expected outcome in advance. You do not need to write the perfect answer in full. Short criteria are enough: what the answer must include, where it must not be wrong, which format is required, and what counts as acceptable style.
Then run both variants on the same requests. If you are testing the prompt, do not change the model. If you are comparing models, do not touch the prompt. If requests go through a gateway, save the actual route separately so you do not mix model effects with provider effects.
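A minimal run loop might look like the sketch below. `call_model` here is a placeholder for your own client or gateway call, not a real AI Router API; the important part is that every variant walks the same requests in the same order and that the actual route is written down next to the answer.

```python
import csv
import time

def call_model(request_text: str, card: dict) -> dict:
    """Placeholder: wrap your real client here. It should return the answer
    plus the route actually taken (provider, model, whether fallback fired)."""
    raise NotImplementedError

def run_variant(name: str, card: dict, requests_in_order: list[str], out_path: str) -> None:
    fields = ["variant", "request", "answer", "provider", "model", "fallback_used", "latency_s"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for req in requests_in_order:   # same set, same order for every variant
            started = time.monotonic()
            result = call_model(req, card)
            writer.writerow({
                "variant": name,
                "request": req,
                "answer": result["answer"],
                "provider": result["provider"],
                "model": result["model"],
                "fallback_used": result["fallback_used"],
                "latency_s": round(time.monotonic() - started, 3),
            })
```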
It is best to score answers on one scale for all runs. A 1-to-5 system works, or a simple pass/fail if the criteria are strict. The key thing is to use the same scale for all variants.
The average score is useful, but it rarely tells the whole truth. Look at the spread too: how many answers were weak, how many were excellent, and where the variant started failing. The difference between an average of 4.2 and 4.0 may look nice, but if the new variant fails twice as often on hard requests, that win is not worth much.
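One way to look past the average is to compute the failure rate per difficulty bucket alongside it. A small sketch, assuming each scored answer carries a 1-to-5 score and a difficulty label:

```python
from collections import defaultdict
from statistics import mean

def summarize(rows: list[dict]) -> dict:
    """rows look like {"score": 1..5, "difficulty": "easy" or "hard"} -- same scale for every variant."""
    by_difficulty = defaultdict(list)
    for row in rows:
        by_difficulty[row["difficulty"]].append(row["score"])
    return {
        "mean_score": round(mean(r["score"] for r in rows), 2),
        # share of weak answers (score <= 2) in each bucket
        "fail_rate": {d: round(sum(s <= 2 for s in scores) / len(scores), 2)
                      for d, scores in by_difficulty.items()},
    }

print(summarize([{"score": 5, "difficulty": "easy"}, {"score": 4, "difficulty": "easy"},
                 {"score": 2, "difficulty": "hard"}, {"score": 1, "difficulty": "hard"}]))
# {'mean_score': 3.0, 'fail_rate': {'easy': 0.0, 'hard': 1.0}}
```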
A practical example
A retail team had a returns assistant. It answered customers in chat: which documents were needed, whether an item could be returned, where to send the order, and what to do if the deadline was close. There were two complaints: the answers looked inconsistent, and in borderline cases the assistant mixed up the rules.
The team did not change everything at once. It took the same set of 200 real requests, kept the metrics the same, and started with the simplest step.
In the first test, only the prompt changed, while the model stayed the same. The new text asked the assistant to answer in a clear order: first the decision, then the reason, then the next step for the customer. The format improved almost immediately. The assistant less often forgot to ask for the order number and more often warned about return deadlines. But the rule errors did not go away. When the case was borderline, it could still give the wrong answer.
In the second test, the team kept the same prompt and changed only the model. The picture changed. The answer style barely changed, but accuracy improved on hard cases: there were fewer wrong refusals, fewer false approvals, and less confusion about product category exceptions. It became clear that the prompt affected the form, while the model had a stronger effect on meaning.
The third test was not about text or the model itself, but about the call path. The team compared a direct request to one model with a fallback route: first a faster option for simple questions, then a switch to a stronger model if the answer seemed uncertain or the request was complex. Quality barely changed, but average latency dropped. Simple requests were handled faster, and difficult ones were not lost.
The result was very down to earth. The new prompt improved the format and completeness of the answer. The new model improved accuracy. The fallback route reduced latency. If the team had launched one big update, it would have seen a general improvement and not understood what actually worked.
Which metrics to track separately
If you reduce the test result to one number, the conclusion is almost always distorted. One version may answer more accurately, but break JSON more often. Another may be cheaper, but take so long to produce the first token that users leave.
So it is better to split the metrics into separate columns:
- answer quality - the share of correct answers on your task set;
- format compliance - broken JSON, missing fields, extra text, wrong language, or a missing required tag;
- cost - not just the price of one request, but the whole session;
- latency - time to first token and time to full response;
- reliability - timeouts, empty responses, stream interruptions, and other failures.
It also helps to review these metrics by task type. Extracting fields from a document and giving a free-form answer to a customer behave differently. The same model may keep structure neatly and still get the facts wrong.
In practice, the winner is often not the version with the highest average score, but the one that clears the minimum threshold in each group. If accuracy rose by 2% but the broken-format rate jumped from 1% to 9%, it is too early to ship that version.
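That "minimum threshold in each group" check is easy to make explicit. A sketch, with illustrative metric names and numbers:

```python
def clears_all_thresholds(metrics: dict, thresholds: dict) -> bool:
    """Quality and format thresholds are floors; cost and latency thresholds are ceilings."""
    ceilings = {"cost_per_session_usd", "p95_latency_s"}
    return all(metrics[key] <= limit if key in ceilings else metrics[key] >= limit
               for key, limit in thresholds.items())

thresholds = {"quality": 0.85, "format_ok": 0.95, "cost_per_session_usd": 0.05, "p95_latency_s": 2.0}
candidate = {"quality": 0.87, "format_ok": 0.91, "cost_per_session_usd": 0.04, "p95_latency_s": 1.6}
print(clears_all_thresholds(candidate, thresholds))  # False: format compliance is below its floor
```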
Where people usually go wrong
The first mistake is already familiar: the team changes several things at once and then tries to guess the reason for the result. The prompt was updated, the model was switched, and the routing rules were adjusted too. The metric improved, but the conclusion is no longer clear.
The second mistake is using too small a sample. With 20 or 30 requests, it is easy to get a nice-looking picture that falls apart on real traffic. This is especially true if the requests are very similar.
The third mistake is testing convenient requests instead of real ones. The team uses short, clean examples without typos, extra context, or odd wording. But in real work, users write differently: they send long messages, mix languages, repeat the question, and ask for a strict format.
Another trap is blind trust in the average score. Suppose the average rises from 4.1 to 4.3. Sounds good. But if the new version starts breaking JSON more often, mixing up fields, or giving dangerous answers in rare scenarios, the average hides that.
And finally, there is the invisible route change. You think you are comparing the same path, but in some cases the system sent traffic to another provider or to fallback. Then the test is no longer comparing what you planned.
Usually, three simple checks are enough: one changed factor per test, a real sample instead of a "pretty" one, and a review of failures together with the actual route, not just the average score.
What to do after the test
If the test shows a clear difference, do not ship the winner right away. First make sure only one factor changed. If you changed the model, provider, or route along with the prompt, the conclusion can no longer be considered clean.
Then quickly verify the basics: both branches got the same set of requests in the same order, generation settings matched, the response format was the same, and routing did not send part of the traffic down another path. Save the exact prompt text, system instructions, and call settings. Even one small detail can ruin the comparison.
If the results are debatable, do not try to settle everything with one number. Prepare a simple template for manual review and let experts look at the answers blind, without labels like "variant A" and "variant B." Usually, five points are enough: correctness, completeness, format compliance, risk of a harmful answer, and whether a human needs to edit it.
After the test, save not only the final table, but the entire run configuration. You need the requests themselves, responses, prompt version, model name, provider, route, generation settings, response time, and cost. A week later, the team will not remember why one run performed better if that data is not stored together.
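One simple way to keep all of that together is an append-only log with one record per request. A sketch, with made-up field names and values:

```python
import json

def save_run_record(path: str, record: dict) -> None:
    """Append one request/response pair with everything needed to replay it later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

save_run_record("experiment_runs.jsonl", {
    "request": "Can I return sneakers after 16 days?",
    "response": "...",
    "prompt_version": "returns-v2",
    "model": "model-x",
    "provider": "provider-1",
    "route": "direct",            # or the fallback path that was actually taken
    "settings": {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512},
    "latency_s": 1.4,
    "cost_usd": 0.0009,
})
```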
If you compare models and providers through a unified OpenRouter-compatible gateway such as AI Router, it helps to keep the actual route, latency, errors, and logs for each request in one place. That makes it easier not to mix up prompt effects with routing effects and to investigate disputed cases faster.
Only after that should you take the next step: repeat the best variant on a new set of requests and check whether the effect holds. If it does, you can move the decision into production and watch live metrics instead of relying on one lucky run.
Frequently asked questions
What should you change in one A/B test?
Change one thing per run. Either the prompt with the same model and the same route, or the model with the same prompt, or the route itself with the same inputs.
If you change two or three parameters at once, you will see a difference, but you will not know why.
Why can't you compare variants on different requests?
Because the request set itself shifts the result. Short, simple questions almost always produce a better average result than long cases with noise, tables, and rare exceptions.
Give both variants the same set of requests, in the same order, and score them by the same scale.
How many requests do you need for a fair test?
Usually, teams use 100–300 real requests. That helps you see how the system behaves in normal work instead of just a few lucky examples.
If you only have 20–30 similar requests, the test can easily show a nice-looking but random result.
Which parameters should you lock before starting?
Freeze temperature, top_p, output token limit, stop sequences, response format, timeouts, retries, and the provider. Even one of these can noticeably change the answer.
A common trap is simple: the team changes the prompt and also gives the model more tokens, then praises the new version for being "smarter."
Why does the average score often mislead?
Averages smooth out failures. A variant may raise the overall score a little, but fail more often on hard or rare scenarios.
Look at where the answer failed, how often the format breaks, how much a session costs, and how latency behaves.
How do you know fallback distorted the result?
Check the logs for each request. If part of the traffic went to another model or another provider because of a timeout, limit, or error, you are already comparing a mix of routes.
When you work through a gateway such as AI Router, it helps to keep the actual request path next to the response and the metrics.
In what order should you test the prompt, model, and route?
Start with the prompt. Compare two prompts on the same model and the same path.
Then keep the better prompt unchanged and compare models. Check routing separately after that. This order gives you a clean result.
How do you decide the winner of a test in advance?
Write the rule before the test starts. For example: a variant wins if accuracy rises by at least 5%, cost does not increase by more than 10%, and 95% of answers arrive in under 2 seconds.
When the team sets the threshold in advance, arguments based on gut feeling almost disappear.
What should you save after the run?
Save the requests, responses, prompt version, model, provider, route, generation settings, cost, and response time. Without that, you will not be able to reconstruct the run a week later.
It helps to keep everything in one experiment card so you can quickly review a disputed case.
When can you ship the winner to production?
Do not rush. First, rerun the best variant on a new set of requests that you did not use in the main comparison.
If the effect still holds on new data, then move the decision into production and watch live metrics.