May 16, 2025 · 8 min read

When You Don’t Need Fine-Tuning: Data, Prompt, or Routing

When you don’t need fine-tuning: a practical guide to the signs that clean data, a strong prompt, a solid eval, and model routing will solve the task better.


Why fine-tuning gets chosen too early

Fine-tuning often feels like the most direct answer when a metric drops. The team sees accuracy fall by 8–10% and immediately starts discussing a new training cycle. The logic makes sense: if the problem is serious, the solution should be serious too. But in practice, training often treats the symptom, not the cause.

The most common mistake is simple: weak instructions are mistaken for a weak model. If the prompt is vague, the answer format is undefined, and there are no good examples, the model starts to drift even on a basic task. Then it feels like nothing will work without a new version. But sometimes all you need is to rewrite the system prompt, remove ambiguity, and strictly define the response structure.

The second problem sits in the data. The team collects examples from different sources, labels them quickly, and ends up with a set where the same situation is tagged in different ways. In that case, fine-tuning locks in the noise. The model does not get smarter; it just gets better at repeating confusion. If the dataset includes dirty text, old response templates, and questionable labels, a new training cycle only makes the problem more expensive.

Evaluation can be misleading too. One test set often mixes simple requests, borderline cases, and truly hard scenarios. The average metric looks bad, but it does not explain where the failure is happening. For example, the model may already handle routine customer requests well, while errors appear only in long emails with nested context or in mixed Russian and Kazakh. In that situation, it is often more useful to split traffic by request type than to jump straight into training.

This becomes especially visible in production. A bank or retail team may decide the model is not good enough for classifying requests, when the real issue is different: short requests should go to a cheaper model, complex dialogues should go to a stronger one, and personal data should be removed before the call while the input text is normalized. Here, proper input preparation and LLM routing often deliver more than rushed fine-tuning.

Warning signs usually look like this:

  • the metric dropped sharply, but no one broke down the errors by request type
  • the prompt changed on the fly, and the team has no stable version to compare against
  • the labels were collected under different rules and were never checked for consistency
  • one overall test hides where the task is easy and where it is actually hard

If that sounds like your situation, start by tracing the source of the failure. The question is usually not how to train the model faster, but whether training is needed at all.
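
Before deciding anything, it helps to see exactly where the errors concentrate. A minimal sketch of that breakdown, assuming your evaluation log stores the request type, the language, and whether the answer was accepted; the field names below are placeholders to adapt to your own logs:

```python
from collections import Counter

# Assumed structure: one dict per evaluated request from production traffic.
records = [
    {"request_type": "routine", "language": "ru", "correct": True},
    {"request_type": "long_email", "language": "kk", "correct": False},
    # ... more evaluation records
]

totals = Counter()
errors = Counter()
for r in records:
    key = (r["request_type"], r["language"])
    totals[key] += 1
    if not r["correct"]:
        errors[key] += 1

# Error rate per (request type, language) instead of one average score.
for key in sorted(totals):
    rate = errors[key] / totals[key]
    print(f"{key}: {errors[key]}/{totals[key]} errors ({rate:.0%})")
```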

What to check in the data before you start

Fine-tuning rarely fixes confusion in the data itself. If different labelers assign different labels to the same sample, the model will simply learn that noise. It helps to take a slice of 100–200 examples and look at where people disagree most: class boundaries, answer tone, escalation rules, or how they interpret mixed requests.
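
A rough way to measure that disagreement on such a slice, assuming each example was labeled by at least two people; the data layout below is an assumption, so adapt it to whatever your labeling tool exports:

```python
from collections import defaultdict

# Assumed layout: (example_id, labeler, label) tuples from the labeling tool.
annotations = [
    ("req-001", "labeler_a", "transaction_dispute"),
    ("req-001", "labeler_b", "service_complaint"),
    ("req-002", "labeler_a", "status_request"),
    ("req-002", "labeler_b", "status_request"),
]

labels_by_example = defaultdict(set)
for example_id, _labeler, label in annotations:
    labels_by_example[example_id].add(label)

# Examples where at least two labelers chose different labels.
disagreements = {k: v for k, v in labels_by_example.items() if len(v) > 1}
rate = len(disagreements) / len(labels_by_example)
print(f"disagreement rate: {rate:.0%}")
for example_id, labels in disagreements.items():
    print(example_id, sorted(labels))
```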

If the disagreements are obvious, the labeling guide should be fixed first. That is cheaper and faster than a new training cycle. If two specialists understand "complaint" differently, the model will not guess the right answer on its own.

For teams in Kazakhstan, there is another common imbalance: many Russian examples and too few Kazakh ones. That may be almost invisible in the overall metric, but in production the difference shows up quickly. If you expect equal quality in Russian and Kazakh, check not only total data volume but also coverage by request type in each language.

Where data breaks the result

The problem usually shows up in the same places. The dataset contains outdated answers, old policies, and disputed records. High-volume requests get mixed with rare cases, and the metric hides failures in the main flow. Sometimes the knowledge base is wrong, but the team blames the model. Sometimes it is the other way around: difficult examples make up a noticeable share of the dataset, even though they barely appear in real traffic.

Rare cases are better moved into a separate set and measured separately. Otherwise, one unusual scenario pushes the team toward fine-tuning, even though 90% of requests could already be handled with cleaner data, a better prompt, and corrected sources.

This is especially common in support systems. Suppose the assistant answers from an internal knowledge base, but the base stores an old limit, an old application form, or two contradictory templates. The model faithfully reproduces the error. In that case, the issue is not that it lacks training. It was given the wrong source.

There is a simple test: take 50 recent failures and sort them by cause. If most of them are tied to messy labeling, too few Kazakh examples, outdated articles, or knowledge base errors, fix the data first. Fine-tuning makes sense later, once the dataset is cleaner and the rules are the same for everyone.

When the prompt is enough

If the model generally understands the domain and the failures are in answer format, missing fields, or extra text, fix the prompt first. In many cases, that is already enough.

Prompts perform worst when they mix several tasks at once. If you need to analyze a support request, do not ask for classification, summarization, and an operator recommendation all at the same time. First lock down one goal and one output format. Not "analyze the customer's email," but "return JSON with topic, urgency, and reason for contact." Models keep to the expected frame more easily when the output format is clear upfront.

Long rules rarely save the day. Usually, 3–5 real examples work better: a normal case, an edge case, a short input, noisy input, and empty data. Examples define the frame better than a paragraph of instructions like "be precise." If the model confuses fields, show it two good answers in a row instead of ten more general restrictions.

Another common mistake is putting strict fields and free-form comments in one block. Mandatory parts are better separated clearly: first the fixed fields, then a separate field for free text. That way, the model breaks the structure less often. This is especially noticeable where the response is read by code, not by a person.

Edge cases should also be described directly. If there is too little data, the request is ambiguous, or the user asks for something that is not in the input, the model should not invent an answer; it should refuse or ask one clarifying question. This rule often improves quality more than a new training cycle, because it removes the most expensive error: confident hallucination.
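
Put together, those rules fit into a compact system prompt plus a couple of examples. A minimal sketch; the classes, field names, and wording are assumptions to replace with your own:

```python
# One goal, fixed fields first, free text in its own field, and an explicit
# rule for edge cases. All field names and classes here are illustrative.
SYSTEM_PROMPT = """You classify customer support requests.

Return ONLY a JSON object with exactly these fields:
  "topic": one of ["payments", "cards", "loans", "other"]
  "urgency": one of ["low", "normal", "high"]
  "reason": short phrase, max 10 words
  "comment": free-form note for the operator (may be empty)

If the request is ambiguous or the needed information is missing,
do not guess: set "topic" to "other" and ask ONE clarifying question
in the "comment" field."""

# A few strong examples define the frame better than more abstract rules.
FEW_SHOT = [
    {"role": "user", "content": "The money did not arrive on my card"},
    {"role": "assistant", "content":
     '{"topic": "payments", "urgency": "high", '
     '"reason": "payment not credited", "comment": ""}'},
]
```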

One more practical point: keep prompt versions. Even a small change should be tested on the same set of examples. If replacing the instruction with a few strong examples raises accuracy from 72% to 84%, the problem was not model weights. Prompts work especially well where format matters more than knowledge and requirements change every week.
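
Versioning is easier when the comparison itself is a small script. A sketch using an OpenAI-compatible client; the prompts, the example set, and the model name are placeholders, and the accuracy numbers will be your own:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY, or point it at your gateway

# Two prompt versions, always measured on the same frozen example set.
PROMPT_V1 = "Classify the request. Return JSON with a 'topic' field."
PROMPT_V2 = ("Classify the request. Return ONLY JSON with 'topic', "
             "'urgency', and 'reason'. If unsure, set topic to 'other'.")

examples = [
    {"text": "The money did not arrive on my card", "topic": "payments"},
    # ... 50-100 more labeled cases from real traffic
]

def accuracy(prompt: str, model: str = "gpt-4o-mini") -> float:
    correct = 0
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": prompt},
                      {"role": "user", "content": ex["text"]}],
            temperature=0,
        )
        try:
            answer = json.loads(resp.choices[0].message.content)
            correct += answer.get("topic") == ex["topic"]
        except json.JSONDecodeError:
            pass  # a broken format counts as an error
    return correct / len(examples)

print("v1:", accuracy(PROMPT_V1), "v2:", accuracy(PROMPT_V2))
```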

When routing models works better

If some requests pass cleanly while others fail, the problem is often not that the model needs more training. More often, you are sending very different tasks through the same path.

Routing is useful when requests differ a lot in length, language, error cost, and formatting requirements. A 20-token question, a long policy review, and an answer in strict JSON are not the same task. They rarely work well with the same model.

Simple rules are usually enough here. Short, frequent, predictable requests are better sent to a cheaper model. Long documents should go to a model with a larger context window. Russian and Kazakh are worth testing on different models and comparing the real result, not the description in the catalog. If the answer must come back in an exact format, it is smart to have a fallback route to a model that breaks JSON or schemas less often.
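
Those rules are simple enough to express directly in code. A sketch of a rule-based router; the model names, the token threshold, and the Kazakh-letter heuristic are assumptions to tune on your real traffic:

```python
# Letters that appear in Kazakh Cyrillic but not in Russian - a rough
# heuristic for language-based routing, not a real language detector.
KAZAKH_LETTERS = set("әғқңөұүіһ")

def pick_model(text: str, needs_strict_json: bool = False) -> str:
    approx_tokens = len(text) // 4          # rough token estimate
    is_kazakh = any(ch in KAZAKH_LETTERS for ch in text.lower())

    if needs_strict_json:
        return "strict-json-model"          # route that rarely breaks schemas
    if approx_tokens > 2000:
        return "long-context-model"         # long emails, policy reviews
    if is_kazakh:
        return "kazakh-tested-model"        # compared separately on Kazakh
    return "cheap-fast-model"               # short, frequent, predictable requests

print(pick_model("Why was my card payment declined?"))   # -> cheap-fast-model
print(pick_model("Ақша түспеді"))  # "the money did not arrive" -> kazakh-tested-model
```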

This approach pays off quickly. A cheaper model can handle most short requests without losing quality, while a stronger one is only needed for complex cases. You avoid spending budget where there is almost no difference.

The language effect is especially clear. One model handles Russian business language well but struggles with mixed Kazakh phrases. Another is more careful with local phrasing and names. If you see that kind of skew in testing, the first step is to split the flow by language, not to start training.

Response format is also often better solved by routing than by retraining. If the first model reasons well but sometimes breaks structure, send a follow-up request to a stricter model only for the problematic cases. That is much cheaper than training a new version just for one narrow issue.
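
A minimal sketch of that fallback, assuming a hypothetical complete() helper that sends one request and returns the raw text; the required fields and model names are placeholders:

```python
import json

REQUIRED_FIELDS = {"topic", "urgency", "reason"}

def answer_with_fallback(text: str) -> dict:
    """Try the primary route first; re-ask a stricter model only when
    the structure comes back broken. complete() is a hypothetical helper
    that sends one chat request and returns the raw string response."""
    for model in ("primary-model", "strict-json-model"):
        raw = complete(model=model, prompt=text)
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and REQUIRED_FIELDS <= data.keys():
                return data
        except json.JSONDecodeError:
            pass  # fall through to the stricter route
    raise ValueError("both routes returned an invalid structure")
```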

If the team checks these hypotheses through a single gateway like AI Router, the experiment moves faster. In that case, you only need to change the base_url and keep the existing SDKs, code, and prompts. That is convenient when you need to compare 2–3 routes quickly and see what works best on your traffic.
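
In practice the switch can be as small as the client configuration. A sketch with the standard OpenAI-compatible client; the gateway URL, key, and model name below are placeholders, not real AI Router values:

```python
from openai import OpenAI

# Only the base_url and key change; the rest of the code, SDK usage,
# and prompts stay exactly as they are today.
client = OpenAI(
    base_url="https://your-ai-router-gateway.example/v1",  # placeholder URL
    api_key="YOUR_GATEWAY_KEY",
)

resp = client.chat.completions.create(
    model="route-or-model-name",   # whichever routes the gateway exposes
    messages=[{"role": "user", "content": "Classify this request: ..."}],
)
print(resp.choices[0].message.content)
```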

How to decide in one sprint


One sprint is usually enough to understand whether you need a new training cycle. Do not start with the whole product. Pick one scenario with a clear output: classifying an incoming request, extracting an amount from a contract, or answering from an internal knowledge base.

Choose one metric right away. If the scenario is about class selection, look at accuracy on the relevant classes. If the model answers customers, count the share of useful answers and the number of dangerous mistakes. When there are too many metrics, the team starts arguing about taste rather than results.

Next, collect 50–100 real examples from production. You need not only clean requests but also noise: typos, mixed Russian and Kazakh, short replies, long emails, and rare phrasing. If you use only a demo set, the conclusion will almost always be too optimistic.

Then run the baseline prompt across 2–3 models. That is more useful than arguing for weeks about which model is better in theory. If the team already has an OpenAI-compatible gateway like AI Router, this test is quick: the model changes, while the code and SDK stay the same.

After the first run, do not rush into training. First remove noise from the data, normalize the input format, and add few-shot examples or retrieval if the answer depends on internal documents. Often that is enough to eliminate most errors without fine-tuning.

For each option, it helps to record four things, as in the sketch after this list:

  • the result on the chosen metric
  • the price per 100 or 1,000 requests
  • latency under typical load
  • the type of errors that repeat
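
A sketch of such a record as a small data structure; the field names mirror the list above, and the example numbers are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class OptionResult:
    name: str                      # e.g. "baseline prompt + cheap model"
    metric: float                  # accuracy or share of useful answers
    cost_per_1k_requests: float    # in your billing currency
    p95_latency_ms: float          # under typical load
    repeated_errors: dict = field(default_factory=dict)  # error type -> count

# Illustrative values only; fill these in from your own runs.
results = [
    OptionResult("baseline prompt", 0.72, 1.8, 900, {"broken_json": 14}),
    OptionResult("few-shot prompt", 0.84, 2.1, 950, {"broken_json": 3}),
]
for r in results:
    print(r)
```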

If, after cleaning the data, improving the prompt, and switching models, the gap still holds, training starts to look like a sensible next step. Especially if the errors are stable: the model keeps confusing two classes or keeps failing on the same document format.

If the result improves noticeably already at the prompt or routing stage, the answer is clear too. The problem was not in the model weights, but in the input, the model choice, or the context you gave it.

Example: bank requests in Russian and Kazakh

One bank wanted to launch fine-tuning for customer request classification. The task looked straightforward: identify the topic, extract the needed fields, and suggest a draft reply. But the first tests quickly showed that the model was failing not because it lacked training.

The problem was in the data. The same complaint got different labels from different labelers: sometimes it was "transaction dispute," sometimes "service complaint," and sometimes "status request." In Russian and Kazakh, the confusion got worse because short phrases like "card block" or "aksha tuspedi" ("the money didn't arrive") were interpreted differently. If you train a model on that kind of dataset, it will simply memorize the noise.

The prompt was also trying to do too much at once. The team asked the model in a single pass to classify the request, extract the contract number, detect the language, determine urgency, and write a reply to the customer. For short messages, that sometimes worked. For long emails, the result shifted from run to run.

After that, the task was split into three simple steps. First, the model identified the request type. Then a separate call extracted fields: amount, date, card number, and contact channel. Only after that did another model prepare the reply for the operator or customer.
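
A sketch of that three-step split, assuming a hypothetical complete() helper for a single chat call; the prompts and model names are illustrative only:

```python
def handle_request(text: str) -> dict:
    # Step 1: only the request type, nothing else.
    request_type = complete(
        model="cheap-fast-model",
        prompt=f"Identify the request type (one word): {text}",
    )

    # Step 2: a separate call extracts the fixed fields.
    fields = complete(
        model="cheap-fast-model",
        prompt=f"Extract amount, date, card number, contact channel as JSON: {text}",
    )

    # Step 3: a stronger model drafts the reply for the operator.
    draft = complete(
        model="strong-model",
        prompt=f"Request type: {request_type}\nFields: {fields}\n"
               f"Write a short draft reply for the operator.\n\n{text}",
    )
    return {"type": request_type, "fields": fields, "draft": draft}
```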

Routing also had a noticeable effect. Short messages from the mobile app were sent to a fast and cheap model. Long emails mixing Russian and Kazakh, with quotes from earlier correspondence, went to a stronger model with a larger context window. In the end, accuracy improved without a new training cycle, and costs went down because the heavy model was no longer used on every message.

For the bank, the takeaway was simple: if the data disagrees with itself and the prompt tries to do five things at once, fine-tuning rarely saves the day. First, the process needs to be put in order.

Mistakes that lead to an unnecessary training cycle


An unnecessary training cycle almost always starts not with a bad model, but with bad diagnosis. The team sees answers getting worse and immediately builds a dataset for fine-tuning. Two weeks later, it turns out the model was trained on raw logs with duplicates, noise, personal data, and conflicting labels. That kind of run locks in chaos instead of fixing it.

Raw logs are useful as a source of examples, but not as a ready-made training set. If the requests mix operator drafts, auto-replies, old templates, and occasional manual edits, the model will start copying that spread. The dataset should be cleaned and the scenarios split first.

Another common mistake is measuring everything with one average score. An average across ten tasks looks neat, but it hides failures in the most expensive or most frequent scenario. For example, a model may summarize well but confuse intent in customer complaints in Russian and Kazakh. The overall number smooths that out, and the team decides a new training cycle is needed.

Confusion also appears in experiments. If you change the prompt, RAG, and the model itself all at once, you will not know what caused the gain. The same answer may improve because of more accurate context, a new system prompt, or a different model in the routing path. A month later, it becomes hard to reproduce.

A risky setup usually looks like this:

  • the dataset was built straight from logs, without cleaning or clear selection rules
  • one overall score hides different error types
  • the team looks only at offline metrics and does not measure post-release cost
  • the report mixes the effects of the prompt, retrieval, and model switching
  • nobody keeps versions of the dataset, prompt, or test set

Post-release cost often breaks the plan too. Fine-tuning may lift accuracy by a few points, but if inference becomes noticeably more expensive afterward, there is no economic sense in it. For some tasks, it is cheaper to route simple requests to one model and complex ones to another. In that case, routing delivers more than training.

Without versions, everything quickly turns into a memory contest. Keep the dataset number, prompt text, RAG parameters, and model route for every run. Even simple discipline here saves more time than another training job.
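
Even a tiny run log is enough for that discipline. A sketch that appends one JSON line per experiment; the file name and fields are assumptions:

```python
import hashlib
import json
import time

def log_run(dataset_id: str, prompt: str, rag_params: dict, model_route: str,
            path: str = "runs.jsonl") -> None:
    """Append one record per run: dataset version, prompt hash, RAG
    parameters, and the model route used."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "dataset_id": dataset_id,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "rag_params": rag_params,
        "model_route": model_route,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_run("support-v3", "You classify customer support requests...",
        {"top_k": 5, "chunk_size": 512}, "cheap-fast-model -> strong-model")
```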

Quick checklist before you start


To figure out whether a new training cycle is needed, a short 2–3 day check is often enough. It quickly shows where the problem is: in bad examples, a weak prompt, the wrong model, or the task setup itself.

Go through this list before you begin:

  1. Define one error you want to remove. Not "the model answers poorly," but "it confuses request type" or "it fails to keep JSON in 18% of cases."
  2. Build a clean evaluation set. 100–300 examples are enough if they are labeled consistently and reflect real traffic.
  3. Give the model a proper foundation: a good system prompt, a few strong examples, and a strict response format.
  4. Compare at least 2–3 models on the same dataset. Sometimes the task does not need fine-tuning; it needs a different model or simple routing.
  5. Count the cost of the solution. If training will take two weeks of team time and the gain is only a few points, the tradeoff is weak.

Most of the time, the biggest mistakes are hidden in steps two and three. Teams take a raw dataset, add a couple of random examples to the prompt, and conclude that fine-tuning is the only way. In reality, the model might already work better now if the labeling noise were removed and the answer template were clear.

Testing several models is also essential. For classification, field extraction, and short operational replies, one model may be noticeably more accurate than another without any training. For that kind of check, it helps to use the same route and the same set of cases so the comparison is fair.

There is a simple rule of thumb. If, after cleaning the evaluation set, revising the prompt, and comparing models, the quality gets close to the target, training is not needed yet. If the same type of error keeps showing up on clean data, then fine-tuning starts to look like a real step rather than an expensive guess.

What to do next

Start with a small set of real cases, not with the idea of retraining the model right away. Usually 30–50 examples from real requests, chats, or internal workflows are enough. For each case, record the input, expected result, actual model output, cost, and latency.

That set quickly shows the baseline. If half of the errors come from messy labeling, missing context, or different phrasings of the same task, it is probably too early to start another training cycle.

Next, it helps to break the review into four separate lines: data, prompt, model choice, and routing. First, check for duplicates, empty fields, and disputed labels. Then see whether the instruction, examples, and format constraints are strong enough. After that, compare models by language, context length, JSON output, and price. Only then decide whether it would be better to send different task types to different models instead of waiting for one universal answer.

If you need to compare providers and models quickly without rewriting code, you can run the same test set through AI Router. It is a single OpenAI-compatible API gateway: only the base_url changes, while the SDKs, code, and prompts can stay the same. For teams in Kazakhstan and Central Asia, it is also a convenient way to test scenarios with data storage inside the country, PII masking, audit logs, and locally hosted open-weight models if the business requires that.

A new training cycle should only be planned after that comparison. If the data is clean, the prompt is already solid, the model was chosen sensibly, and the error still repeats on the same type of case, then training may pay off. Before that stage, it usually turns out that the problem is solved by data, prompt, or careful routing.
