LLM Routing for Production: How to Choose a Strategy
For production LLM routing, choose a strategy based on one frozen task set and on cost, latency, and quality metrics, not on broad benchmarks.

What is the problem with routing in production
One model is almost never the best for every type of request. One does well on short FAQs, another is better at extracting fields from a document, and a third handles long customer conversations. If you send all traffic to one model, you usually overpay in some scenarios and lose quality in others.
A common mistake looks simple: the team picks the cheapest model and hopes it can "handle" everything. For easy tasks, that really does work. But as soon as a request gets messy, requires a strict format, or needs a long context, the cheap model starts cutting corners. It mixes up steps, skips constraints, and fills in details on its own.
The same story applies to speed. A fast model looks great in a demo and on an average-latency chart. But in production, it is not just about getting an answer in 800 milliseconds but about getting an answer in the right structure. If the model sometimes breaks JSON, loses fields, or changes the schema, the system spends more time on retries, checks, and error handling than it saves on the first generation.
Another trap is looking only at broad benchmarks. They are useful, but they do a poor job of showing what will happen on your own data. A bank customer request, a product card in retail, a medical inquiry, and an internal knowledge-base search all behave differently. On public tests, two models may run almost neck and neck, while on your sample the difference becomes obvious in cost, quality, and the number of broken answers.
So production routing is not about finding the "best" model. It is about checking three things for each task type: how often the model makes mistakes on your data, how much each answer costs at the real prompt and output length, and whether the model keeps the required format without extra retries.
Even if a team has access to hundreds of models through one gateway, the problem does not disappear. The choice becomes wider, but without testing on your own task set, routing quickly turns into guesswork. And in production, guesswork gets expensive.
What counts as a good answer
A good answer is not just text that sounds confident. It solves the task, arrives on time, costs a reasonable amount, and does not create extra risk for the business. If you do not agree on that upfront, almost any model will look "best" on random examples.
First, split requests into at least four groups: chat and user help, extracting data from text or documents, classifying tickets, and tasks involving code, SQL, or automation templates. Each group has its own quality standard.
In chat, accuracy, tone, and completeness matter. In extraction, strict structure matters more: the right fields, empty values where data is missing, and no made-up details. In classification, you need one label from the allowed list, with no half-page explanation. In code tasks, you need a working result, not an "almost right" draft.
That is why it is better to define the response format from the start and keep it short. For example: JSON by schema, one class from the list, an answer of up to 5 lines, SQL without comments. A simple rule is usually more useful than a vague instruction like "answer well and carefully."
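As a rough illustration, such format rules can be written down as a small contract per task group and checked mechanically. This is a minimal Python sketch: the task groups mirror the ones above, but the field names, labels, and line limits are made up for the example.

```python
import json

# Task groups mirror the ones above; field names, labels, and limits are illustrative.
OUTPUT_CONTRACTS = {
    "chat": {"format": "text", "max_lines": 5},
    "extraction": {"format": "json", "required_fields": ["customer_id", "amount", "date"]},
    "classification": {"format": "label", "allowed_labels": {"billing", "access", "refund", "other"}},
}

def violates_contract(task_type: str, answer: str) -> bool:
    """Return True if the answer breaks the agreed format for its task group."""
    contract = OUTPUT_CONTRACTS[task_type]
    if contract["format"] == "label":
        # One class from the allowed list, nothing else.
        return answer.strip() not in contract["allowed_labels"]
    if contract["format"] == "text":
        # Keep free-text answers short: at most max_lines lines.
        return len(answer.strip().splitlines()) > contract["max_lines"]
    if contract["format"] == "json":
        # JSON by schema: the answer must parse and contain every required field.
        try:
            data = json.loads(answer)
        except ValueError:
            return True
        return any(field not in data for field in contract["required_fields"])
    return False
```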
Also mark the cases where mistakes are not acceptable: changing facts in customer-facing answers, skipping a required field during extraction, assigning the wrong class and sending a request to the wrong process, or writing code that breaks the query or changes data in the wrong place.
After that, set thresholds for cost and response time. A good answer in 12 seconds can be bad for a live chat. A very accurate model also does not fit if it makes the task 8 times more expensive without clear benefit. It helps to define a simple range: for example, up to 2 seconds for chat, up to 10 seconds for background processing, and a clear price cap per 1000 requests. Then models can be compared calmly instead of based on the feeling after a couple of good answers.
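One way to keep those thresholds honest is to write them down as explicit budgets and compare every candidate model against them. In this sketch the latency limits echo the example above, while the cost caps are placeholder numbers in whatever unit you bill in.

```python
from dataclasses import dataclass

@dataclass
class RouteBudget:
    max_latency_s: float     # upper bound for response time on this route
    max_cost_per_1k: float   # price cap per 1000 requests, in your billing unit

# Latency limits follow the example above; the cost caps are placeholders.
BUDGETS = {
    "chat": RouteBudget(max_latency_s=2.0, max_cost_per_1k=5.0),
    "background": RouteBudget(max_latency_s=10.0, max_cost_per_1k=2.0),
}

def within_budget(task_type: str, p95_latency_s: float, cost_per_1k: float) -> bool:
    """Check a candidate model's measured numbers against the agreed thresholds."""
    budget = BUDGETS[task_type]
    return p95_latency_s <= budget.max_latency_s and cost_per_1k <= budget.max_cost_per_1k
```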
How to build one task set for comparison
Comparison breaks at the very beginning if the request set is assembled in a hurry. You need the same task set for all models and all rules. Otherwise, you are not comparing routing, you are comparing randomness.
The best approach is to use real requests: from logs, a pilot, manual tests by the support team, or internal users. If traffic already goes through a single gateway, it is convenient to export anonymized examples from there. Demo prompts and polished examples from presentations almost always paint an overly smooth picture.
Start by cleaning the sample. Remove duplicates, very similar phrasings, and anything containing personal data: names, phone numbers, email addresses, contract numbers, and addresses. If you do not have much data, that is not a problem. A set of 150-300 real tasks is usually more useful than a thousand repetitive ones.
Keep the length variation. Short requests test simple classifications and quick answers. Medium ones show normal production load. Long ones help reveal where latency, cost, and failure rates grow when the context gets harder.
Also add difficult cases. These are requests where the model often confuses dates, reads a table incorrectly, loses a negation, mixes languages, or answers too confidently when data is missing. There are usually not many of them, but they show the difference between a simple rule and a more complex scheme very clearly.
A good guideline for the sample mix is 60-70% ordinary requests from real traffic, 20-30% borderline cases, and 10-15% rare but expensive mistakes. In each group, keep a few short, medium, and long examples.
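If each request is tagged with its group when it is added to the set, that mix can be checked automatically instead of by eye. A minimal sketch, assuming group tags like "ordinary", "borderline", and "expensive_mistake" that you define yourself:

```python
from collections import Counter

# Target shares from the guideline above, as (low, high) fractions.
TARGET_MIX = {
    "ordinary": (0.60, 0.70),
    "borderline": (0.20, 0.30),
    "expensive_mistake": (0.10, 0.15),
}

def check_mix(task_set: list[dict]) -> dict[str, bool]:
    """Report whether each group's share of the frozen set falls inside its target range."""
    counts = Counter(item["group"] for item in task_set)
    total = len(task_set)
    return {
        group: low <= counts.get(group, 0) / total <= high
        for group, (low, high) in TARGET_MIX.items()
    }
```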
Once the set is built, freeze it until the comparison is over. Do not add new examples in the middle of testing, even if one model suddenly looks weaker than expected. Otherwise, the results are no longer comparable in a fair way.
A strong test set is easy to explain one item at a time. You can show any request and say why it is there. For example, in bank support, one short question checks intent classification, while a long customer thread shows whether the model keeps context by the fifth message. That kind of set exposes weak spots quickly, without extra noise.
Which metrics to watch
If you only look at token price, it is easy to pick the wrong model. It is better to calculate the cost of one request or one completed task. A model that is cheap per token often writes more, makes more mistakes, and sends more requests into retries.
Response time should also be measured by more than the average. The average hides unpleasant spikes, and users notice those spikes most. So alongside the median, also look at p95. If most responses arrive in 2 seconds but every twentieth one waits 12, the service already feels slow.
Quality is easiest to measure with a simple rubric. For example, customer support answers can be scored 0, 1, or 2: did not answer the issue, answered partially, answered correctly and fully. If that scale is hard to formalize, use pairwise comparisons: two answers to one request, and the reviewer picks the better one. That is often more honest than a long 10-point scale.
Technical failures should be counted separately. For production, they are just as important as text quality. If a model regularly times out, refuses to answer without reason, or breaks JSON, it damages the whole route even if it writes well in calm tests.
It helps to put everything into one table:
| Model | Cost per request | Median latency | p95 latency | Quality score | Timeouts | Refusals | Broken JSON |
|---|---|---|---|---|---|---|---|
| A | 0.04 | 1.8 s | 6.9 s | 1.7 | 0.4% | 1.2% | 3.1% |
| B | 0.07 | 2.4 s | 3.5 s | 1.9 | 0.1% | 0.3% | 0.2% |
The trade-off is clear right away. Model A is cheaper and faster on average, but it handles the long tail worse and breaks structure more often. Model B costs more, but it reduces retries, manual checks, and integration errors. If you are testing routing through one compatible API, a table like this helps compare providers using the same scheme, not just by feeling.
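If each test call is stored as one record per request, the columns of a table like this can be computed in a few lines. The record fields in this sketch (cost, latency, a 0-2 quality score, and failure flags) are an assumption about how you log each call, not a required format.

```python
import statistics

def summarize(run: list[dict]) -> dict:
    """Aggregate per-request records from one model run into the table columns above."""
    n = len(run)
    latencies = sorted(r["latency_s"] for r in run)
    return {
        "cost_per_request": sum(r["cost"] for r in run) / n,
        "median_latency_s": statistics.median(latencies),
        # Simple nearest-rank approximation of p95.
        "p95_latency_s": latencies[int(0.95 * (n - 1))],
        "quality_score": sum(r["quality"] for r in run) / n,    # 0-2 rubric from above
        "timeout_rate": sum(r["timed_out"] for r in run) / n,
        "refusal_rate": sum(r["refused"] for r in run) / n,
        "broken_json_rate": sum(r["broken_json"] for r in run) / n,
    }
```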
Where simple rules work better
Simple rules win where the request flow is predictable. If you have a lot of short, routine tasks like "summarize the ticket", "identify the topic", or "answer using the template", there is no reason to calculate a complex score for cost, latency, and quality every time. A cheaper model usually does just as well, and the answer comes faster and costs less.
A practical setup often looks like this: short requests with a clear structure go to the cheaper model, while hard cases go straight to the stronger one. You can define the triggers explicitly: long context, several documents, a request to compare options, high risk of error, legal or medical meaning. That is easier to check and easier to explain to the team than a long chain of weights and thresholds.
For strict JSON, it is best to keep a separate rule. Do not mix those requests with normal chat. If code will read the answer, it is better to send the task right away to the model that reliably holds the schema, or to use a separate route with a retry. In practice, that one rule often saves more time than fine-tuning the router.
The same applies to failures. If the model did not answer within the limit, switch the request to the backup. Do not wait too long and do not build a staircase of five handoffs. One main route and one backup usually give the best result.
For support, it may look like this: a short customer question goes to the cheap model, a request with long context or unclear meaning goes to the strong model, an answer in JSON for the CRM goes to a separate route, and a timeout or provider error triggers the backup model.
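A minimal sketch of that setup, assuming an OpenAI-compatible endpoint. The gateway URL, the model names, the risk keywords, the length threshold, and the timeout are placeholders to be replaced with values from your own tests.

```python
from openai import OpenAI

# Placeholder endpoint and model names; replace with your own gateway and catalog.
client = OpenAI(base_url="https://your-gateway.example/v1", api_key="...")
CHEAP, STRONG, JSON_SAFE, BACKUP = "cheap-model", "strong-model", "schema-safe-model", "backup-model"
RISK_WORDS = ("complaint", "refund", "claim", "billing error")  # illustrative triggers

def pick_model(text: str, needs_json: bool) -> str:
    if needs_json:
        return JSON_SAFE        # strict JSON gets its own route
    if len(text) > 2000 or any(w in text.lower() for w in RISK_WORDS):
        return STRONG           # long context or risky meaning
    return CHEAP                # short, routine requests

def answer(text: str, needs_json: bool = False) -> str:
    try:
        resp = client.chat.completions.create(
            model=pick_model(text, needs_json),
            messages=[{"role": "user", "content": text}],
            timeout=10,         # do not wait too long on the main route
        )
    except Exception:
        # Timeout or provider error: one backup, no staircase of handoffs.
        resp = client.chat.completions.create(
            model=BACKUP,
            messages=[{"role": "user", "content": text}],
        )
    return resp.choices[0].message.content
```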
If a team uses AI Router, these rules are easy to test through one OpenAI-compatible endpoint, and routes can be changed without rewriting the client side. But the principle does not change: start with two or three rules. If they already deliver the needed quality and stay within budget, it is too early to make the scheme more complex.
When a complex scheme only gets in the way
Complex routing often looks smarter than it works in practice. It seems that a separate classifier, a chain of rules, and several thresholds will produce a much better result. In reality, the system gets more failure points, and the gain is often minimal.
The first problem is simple: the router itself also makes mistakes. If the classifier misunderstands the request, it sends it to a weak model and damages the answer before the main generation even starts. For support, this looks ordinary: a customer asks about a refund, but the scheme classifies the request as a general FAQ and picks the cheap model, which answers confidently but misses the point.
The second problem is the extra step before the answer. Every preliminary call adds latency. Sometimes it also adds separate cost, if you use a model to choose the route too. On short and routine requests, that step can easily eat up the entire cost savings.
There is also a more practical reason: it is hard to live with an overcomplicated scheme. It has to be explained to developers, analysts, support, and the people who track quality. If a rule cannot be described in two or three sentences, it is hard to test and hard to fix.
A complex scheme is especially unhelpful when tasks do not differ much, routing mistakes are more expensive than the savings, models change often, or the team simply does not have time to review the rules all the time. This is easy to see when hundreds of models and providers are available through one gateway. The temptation is strong: build fine-grained logic for every request type. But after the main model changes or pricing updates, the rules age quickly. What saved budget yesterday may simply add latency and confusion tomorrow.
A simple scheme fails less often precisely because it has fewer places where things can break.
How to run a comparison on your own data
A random set of requests almost always gives a misleading picture. Instead of "interesting examples," use the normal production flow: frequent requests, long conversations, borderline cases, and tasks where mistakes are expensive.
Collect 100-300 requests for each task type. For support, that may include FAQs, refunds, complaints, document checks, and short internal replies. Do not make the sample too sterile. A few noisy and inconvenient requests are more useful than a perfectly cleaned set.
Then run the same set through several models. Do not change the prompt, temperature, or token limits between runs, otherwise you are comparing settings, not models. If the test runs through an OpenAI-compatible gateway, switching models and rules is easier, but discipline still matters more than the tool.
For each run, record not only the average, but also the spread: the share of answers that pass manual review, median latency and p95, cost per 1000 requests or per conversation, and the number of refusals, interruptions, and answers that had to be corrected manually.
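A simple runner makes that discipline easier to keep, because the prompt and parameters live in one place. This sketch assumes an OpenAI-compatible client and a task set of dicts with "id" and "text" fields; the endpoint, model names, and settings are placeholders.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="...")  # placeholder endpoint

MODELS = ["model-a", "model-b"]                   # candidates under comparison
SYSTEM_PROMPT = "You are a support assistant."    # fixed for the whole run
TEMPERATURE, MAX_TOKENS = 0.0, 512                # fixed as well

def run_comparison(task_set: list[dict]) -> dict[str, list[dict]]:
    """Run every model over the same frozen task set with identical settings."""
    results = {model: [] for model in MODELS}
    for model in MODELS:
        for task in task_set:
            start = time.monotonic()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": SYSTEM_PROMPT},
                          {"role": "user", "content": task["text"]}],
                temperature=TEMPERATURE,
                max_tokens=MAX_TOKENS,
            )
            results[model].append({
                "task_id": task["id"],
                "latency_s": time.monotonic() - start,
                "answer": resp.choices[0].message.content,
                "total_tokens": resp.usage.total_tokens,
            })
    return results
```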
Start by checking simple rules. For example, send short FAQs to a cheap model and long, ambiguous cases to a stronger one. In many cases, that is already enough. A simple rule is easy to explain, quick to debug, and not hard to maintain a month later.
Only after that does it make sense to add a complex router: a classifier, scoring, or a cascade with multiple checks. Look at the difference honestly. If quality improved by 1-2%, but latency got worse, cost went up, and the team now has a harder time finding errors, that setup is unlikely to pay off.
Keep the winning version for a short pilot on real traffic. Usually 1-2 weeks is enough to see failures, rare scenarios, and real costs that do not show up in a lab test.
Example for a support team
In customer support, tasks are usually easy to separate. Questions like "how do I change my password", "where is my invoice", or "when will the payment be charged" can almost always go to a fast and cheap model. It answers with a template, quickly finds the right part of the knowledge base, and does not hold up the queue.
Complaints, refunds, and disputed cases work differently. If a customer writes "I was charged twice", "I disagree with the rejection", or "I want to file a complaint", the cost of a mistake rises sharply. Those requests are better sent straight to a stronger model that handles tone more carefully, sees risk better, and confuses company policy less often.
In practice, teams often start with simple rules. If the text contains words like "complaint", "refund", "claim", "billing error", if the tone is sharp, or if the customer asks for an explanation of a contract decision, the router picks the stronger model. If the request is short, routine, and tied to status, pricing, or account access, the cheap model handles it.
A separate format check is also needed. It catches empty fields in JSON, extra text after the structure, and missing required tags such as ticket category or risk level. Without that check, even good routing quickly breaks in CRM integration.
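The check itself can stay very small. In this sketch the required tags are illustrative; the point is that invalid JSON, extra text around the structure, and empty required fields are all caught before the answer reaches the CRM.

```python
import json

REQUIRED_TAGS = ("ticket_category", "risk_level")   # illustrative required fields

def format_errors(raw_answer: str) -> list[str]:
    """List the format problems in a CRM-bound answer; an empty list means it passes."""
    try:
        data = json.loads(raw_answer.strip())
    except ValueError:
        # Covers both invalid JSON and extra text around the structure.
        return ["not valid JSON, or extra text around the structure"]
    errors = []
    for tag in REQUIRED_TAGS:
        if tag not in data:
            errors.append(f"missing required tag: {tag}")
        elif data[tag] in ("", None):
            errors.append(f"empty value for required tag: {tag}")
    return errors
```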
Say the team ran one day of traffic in that mode. The cheap model handled most routine questions, and the strong one took only the difficult conversations. After that, you should not look at the "overall impression" but at four numbers: daily cost, p95 latency, share of correct routes, and share of answers that passed the format check on the first attempt.
In support, this approach often works better than a complex router with many intermediate scores. The reason is simple: customer intent is often visible from the first words. If the rules cover the common cases and the team reviews misses once a week, the scheme stays clear, cheap, and predictable.
The most common mistakes
The first common mistake looks almost harmless: teams compare models on different request sets. On Monday they use easy support questions, on Friday they add difficult complaints, and then they conclude that the new router got "smarter." In reality, it was not the model-selection scheme that changed, but the sample itself.
The second problem is latency. Many people look only at the average response time and relax. But the user does not notice the average; they notice the long tail, those same 2-5% of requests that take too long. If one scheme gives 1.8 seconds on average and another 2.0, that still means very little. You need p95 and p99.
Another mistake ruins the whole test at once: the team changes the prompt during the comparison. They rewrite the system message a little, add a new example, remove a paragraph, and now the results are no longer comparable. If you are testing model routing, keep the prompt, parameters, and post-processing fixed until the run is over.
A one-metric approach also performs poorly. Low cost without quality checks often leads to more manual reviews. The best quality score can turn out to be too slow for chat. Look at the set of metrics together: answer quality on the same sample, p95 and p99 latency, request and full-flow cost, and the share of format errors, refusals, and empty responses.
Another trap is multi-layer routers with a classifier, rules, fallback chains, and separate branches for rare cases that happen once a month. Such a setup is hard to debug, and it often gives less benefit than two simple rules: send cheap requests to one model, and send complex and risky ones to another.
A good sign of a mature test is simple: you can explain why the router chose that model, and you can show the numbers on one task set.
Short checklist before launch
Before launch, it is worth checking a few things:
- Freeze one task set for repeated runs.
- Describe the quality scale in simple words and add a few typical mistakes next to it.
- Set cost and latency thresholds in advance.
- Test the fallback route when the main model fails.
- Review logs for each request: which model was used, why it was chosen, how many tokens were spent, and how long the answer took.
That is usually enough to remove the chaos before the first release. For example, a support team may run the same set of 200 tickets. A week later, the team changes the latency rule but keeps the same task set and the same quality scale. The comparison stays fair.
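For the per-request logs from the checklist, one JSON line per request is usually enough to answer all of those questions later. The field names and reason codes in this sketch are an assumption, not a fixed format.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RouteLog:
    request_id: str
    model: str                # which model was used
    reason: str               # why it was chosen: "short_faq", "risk_words", "fallback", ...
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cost: float

def log_route(record: RouteLog) -> None:
    # One JSON line per request keeps the log easy to grep and load into a notebook.
    print(json.dumps({"ts": time.time(), **asdict(record)}))
```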
If you use a gateway with audit logs and unified access to multiple providers, do not leave that data "for later." In AI Router, such logs help you quickly see where a simple cost rule works well and where it already starts hurting the answers.
What to do next
Start with the simplest scheme. For most teams, that is enough to get an honest baseline: how much one request costs, how long an answer takes, and where users really notice the difference in quality. Complex routing without those numbers almost always turns into guesswork.
For the first two weeks, a short plan is enough. Take one rule that is easy to explain: short and routine requests go to the cheap model, and long or risky ones go to the stronger model. Log the cost, latency, and answer score for each request on the same task set. After a week, review the cases where the expensive model really gave a better result, not just where it "sounded smarter." After any change to the model, system prompt, or RAG pipeline, run the comparison again.
Usually, after a week, the picture becomes clearer. Often it turns out that the expensive model is not needed for all hard requests, only for a narrow class: disputed answers, long instructions, legal wording, or unusual complaints. Everything else can stay on a cheaper model without losing quality.
If the numbers barely change, do not make the scheme more complicated. A two-branch rule is easier to maintain, easier to explain to the team, and easier to fix when the provider or prompt changes. This is especially noticeable in production, where routing mistakes quickly turn into extra bills and slow responses.
If you need one OpenAI-compatible gateway for several providers, and local model hosting, audit logs, or in-country data requirements matter to you, it may be worth testing AI Router in a pilot. In that setup, it is easier to compare routes on one endpoint and to see separately whether local hosting gives you an advantage in latency and operating costs.
A normal start looks boring: a simple rule, clear logs, and a repeatable test. For production, that is a good sign. If after two weeks the scheme holds quality and stays within budget, it is ready to go without unnecessary magic.
Frequently asked questions
Do I need routing if one model already answers well?
Yes, if you have different types of requests. One model may do well in chat but struggle with strict JSON or long context. Routing is not about adding complexity for its own sake, but about avoiding overspending on simple tasks and losing quality on harder ones.
Where is the best place to start with production routing?
Start with a simple two-branch rule. Send short, routine requests to a cheaper model, and send long, ambiguous, or risky ones to a stronger model. This is easy to test, easy to explain to the team, and easy to adjust later.
How many requests do I need for a fair model comparison?
Usually 150–300 real examples is enough for one task set if you keep the mix of scenarios. Include common requests, borderline cases, and rare mistakes that are expensive. You can use fewer, but then randomness will distort the results more strongly.
What metrics should I watch besides token price?
Look at the cost of one request or one completed task, not just token price. Also keep an eye on median and p95 latency, broken JSON, timeouts, refusals, and manual fixes. That mix shows much faster where a model only looks cheap on paper.
When do simple rules work better than a complex router?
Simple rules work well where the traffic looks similar from request to request. Support, classification, short FAQs, and template answers often do not need a complex decision tree. If the rule already meets your quality and budget goals, do not add complexity without a reason.
When does a complex router only create more problems?
It gets in the way when it often makes mistakes and adds another step before the answer. A classifier can send the request to the wrong place, and every extra check increases latency and cost. If the quality gain is barely noticeable, the complex setup just creates more failure points.
How do I compare models correctly on my own data?
Use one frozen request set and do not change the prompt, temperature, or token limits during the test. Run several models through the same set, then compare quality, latency, cost, and formatting errors. That way you test the models, not lucky coincidences.
What should I do if a model sometimes breaks JSON?
Put those tasks on a separate route. If code or CRM will read the result, send the request to a model that reliably holds the schema and add one retry with a strict format check. That is usually more useful than trying to repair a broken response by hand.
Do I need a backup route for timeouts and errors?
Yes, but keep the setup short. One main route and one backup route is usually enough. If the first model misses the limit or the provider returns an error, switch quickly and do not build a long chain of handoffs.
How do I know the strategy is ready for production?
Run the scheme on a short pilot with real traffic and look at it for one or two weeks. If it holds quality on your data, stays within budget, does not break the format, and does not create a long latency tail, it is ready for production.