Model Evaluation on Your Own Data for Product Use Cases
Evaluating models on your own data helps you choose the right LLM for product tasks. This guide covers how to collect scenarios, write gold answers, pick metrics, and compare responses fairly.

Why public benchmarks do not reflect your product
A public benchmark shows how a model solves someone else’s task in clean conditions. A product does not work that way. A user writes with typos, pastes part of a conversation, asks for one short paragraph, and does not want a “smart” answer — they want a useful result in the right form.
Because of that, a model with a high ranking can still work poorly for you. It may answer standard math, code, or general knowledge questions confidently, but get lost in real requests: mix up Russian and Kazakh, ignore the brand tone, write too much, or break the required format.
The difference is especially obvious where requests do not look like textbook examples. In support, a customer may send an order number, a screenshot, and an angry message in the same chat. In a benchmark, there is usually a short question and one “correct” answer. That is a different job.
Mistakes also have different costs. If a shop chatbot sounds dry, that is unpleasant but acceptable. If a model misrepresents a contract clause, loan terms, or a medical instruction, the cost of the mistake is very different. That is why you cannot look only at the average score. Tasks need to be split by risk and checked separately.
Teams most often underestimate four things: the request language and language mixing, the tone of the answer, the acceptable length, and the format without which the result cannot be used further.
Even if you can switch models quickly through one API, that alone does not make the comparison fair. Only your task set and your evaluation rules do.
That is why model evaluation on your own data starts not with a leaderboard, but with an agreement inside the team. What counts as a good answer? What must the model always do, and what can it skip? When do you need an exact fact, and when is a short, polite answer enough?
Until those rules are written down, any check will drift. One person gives a high score for natural style, another lowers it for an extra paragraph, and a third looks only at factual accuracy. In the end, the argument is not about the model, but about different expectations from the product.
Which tasks to take from the product
Inventing tasks from scratch is almost always a bad idea. You need real traces from the product: chat logs, support tickets, emails, form submissions, and internal employee requests. If the model will later answer bank customers, help an operator in telecom, or sort requests in retail, it should be tested on those same cases, not on convenient demo examples.
First, collect a raw sample over a clear period, for example the last month or quarter. Then remove personal data before labeling: names, phone numbers, addresses, contract numbers, IIN, and any fragments that could identify a person. This is needed not only for safety. When a labeler sees extra details, they often start judging the “customer story” instead of the model’s answer quality.
After that, sort the examples by task type. Usually four groups are enough:
- finding a fact in a knowledge base or document
- short summarization of a long text
- classification of a request or intent
- generating a reply, email, or internal note
This breakdown quickly shows imbalances. In many teams, 70% of the sample accidentally consists of simple FAQ questions simply because they are the most common type in the logs. As a result, the model handles easy cases well but falls apart where the product loses money or employee time.
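To keep that mix visible, it helps to store each example with explicit task-type and language tags. A minimal sketch of one dataset row; the field names are illustrative, not a standard:

```python
# A minimal sketch of one labeled dataset row; field names are illustrative.
example = {
    "id": "req-0142",
    "task_type": "fact_lookup",   # fact_lookup | summarization | classification | generation
    "language": "ru+kk",          # tag language mixing explicitly
    "source": "support_chat",
    "input": "where is my order, the number is masked, been waiting 9 days",
}
```

With tags like these, counting the share of each group takes one line, and the FAQ skew stops being something you discover only after labeling.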
Frequent scenarios are essential, because they create most of the workload. But rare and expensive mistakes are just as important: a disputed return, a complaint with escalation risk, a long conversation, a mixed Russian and Kazakh question, or an email with an attached policy. There are few such examples, but they are the ones that later appear in incident reviews.
It is better to move difficult cases into a separate set and not mix them with the main sample. It can include conflicting instructions, messy phrasing, long message chains, outdated templates, and situations where even a human does not answer immediately. Such a set is useful for a stress test before launch and for a repeat check after changing the model.
If you compare several models through one gateway, the task set must be identical for all of them. Otherwise the comparison loses meaning. A good sample looks like a real work queue: it has routine cases, borderline cases, and a few nasty examples that quickly show whether the model can go into the product.
How to build a testing dataset
To evaluate models on your own data, you need a set of requests from real product work. Do not take convenient demo examples — use what the team deals with every day: noisy phrasing, incomplete data, long conversations, and strict response-format requirements.
Start by choosing 5–10 scenarios that create most of the workload. Usually these are requests where a mistake costs time, money, or user risk. If your product works in banking, telecom, or the public sector, do not skip cases with PII masking, a mandatory AI-content label, or a strict response structure.
The dataset should include frequent requests, rare cases with a noticeable cost of error, messages with typos and broken context, requests in two languages, and long dialogues where the model has to hold onto details. Collect at least 30 examples for each scenario. If you can get to 70–100, the comparison will depend less on random luck.
The source matters less than the honesty of the sample. Real inquiries, logs, emails, chats, or forms all work, as long as the team cleans the data according to internal rules.
Next, split the set into two parts. Use the working sample for early runs, prompt edits, and rough tuning. Keep the control set separate and untouched until the final comparison. Otherwise, the team will quietly tune the solution to familiar examples.
Check for duplicates separately. If the dataset has ten almost identical questions like “Where is my order?”, the model may look better than it really is. Keep semantic repeats only when the context, constraint, or expected action changes.
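A crude first pass for catching such repeats is plain string similarity; true semantic duplicates ("Where's my parcel?" vs "Order status?") still need embeddings or a manual look. A sketch using only the standard library:

```python
import difflib

def near_duplicates(texts: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Flag pairs of examples whose wording is almost identical.

    O(n^2), which is fine for a test set of a few hundred examples.
    """
    normalized = [t.lower().strip() for t in texts]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            ratio = difflib.SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs
```

Review the flagged pairs by hand before dropping anything.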
How to freeze a version
Before the first comparison, freeze the dataset. Save the scenario mix, the number of examples in each group, the cleaning rules, and the file version number. After that, do not add new requests “for fairness.” Otherwise, you will be comparing not the models, but different datasets.
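One way to make the freeze explicit is a small manifest file saved next to the dataset. A sketch with illustrative field names and values:

```python
import json

# A sketch of a frozen dataset manifest; field names and values are illustrative.
manifest = {
    "dataset_version": "v1",
    "period": "2024-05-01..2024-05-31",  # where the raw sample came from
    "scenario_mix": {                    # number of examples per group
        "fact_lookup": 40,
        "summarization": 30,
        "classification": 35,
        "generation": 30,
    },
    "cleaning_rules": "pii-masking-v2",  # reference to the internal cleaning doc
}

with open("dataset_manifest_v1.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)
```

Later comparisons can then reference the manifest by version instead of "the latest file".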
How to prepare gold answers and response rules
A weak gold answer can ruin even a careful evaluation. If the interface expects a short two-line reply, do not write the gold answer as a long perfect essay. If your API accepts JSON with the fields answer, risk_label, and next_step, the gold answer should follow the same shape. Otherwise, you will end up measuring not answer quality, but similarity to someone else’s template.
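For instance, if the API really does return those three fields, the gold answer can be stored in the same shape; the values below are illustrative:

```python
# Gold answer stored in the same shape the API expects; values are illustrative.
gold = {
    "answer": "A refund is possible within 14 days if the product has not been used.",
    "risk_label": "low",
    "next_step": "send_return_form",
}
```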
There is rarely one correct answer. In a live product, a user can get several valid versions, and it is better to admit that in advance. For such cases, define the boundaries: which facts the answer must preserve, what can be rephrased, where a shorter text is fine, and where the model must ask a clarifying question.
A good rule is simple and testable. Usually it is enough to specify:
- required fields and their order, if that matters to the customer
- facts that cannot be invented or changed
- acceptable response length
- cases where the model must ask a clarifying question
- cases where the model must refuse to answer
For disputed tasks, do not reduce everything to one “ideal” gold answer. It is more useful to describe the boundaries of an acceptable response and the clear signs of failure. Then reviewers evaluate the product task, not the style of a specific author.
Which criteria to measure
When you evaluate models on your own data, one overall score is not enough. A model may write smoothly but confuse facts. Or it may answer accurately but be too slow and expensive. That is why metrics are better split by task type.
If the answer can be checked unambiguously, measure accuracy. This works for field extraction, request classification, order-status selection, and finding the right tariff by rules. In such tasks, simple metrics work: share of correct answers, share of correctly filled fields, and share of answers without omissions.
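For closed tasks like these, the metrics fit in a few lines. A minimal sketch, assuming answers come as plain strings and extracted fields as dicts:

```python
def exact_match_share(predictions: list[str], gold: list[str]) -> float:
    """Share of answers that match the gold label exactly.

    Works for classification, status selection, and tariff lookup by rules.
    """
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold))
    return hits / len(gold)

def correct_field_share(pred: dict, gold: dict) -> float:
    """Share of gold fields the model filled with the correct value."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)
```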
With open-ended answers, one number almost always lies. For those, it is better to use a short rubric and score the answer along several axes:
- completeness
- factual accuracy
- format, including tone and length
- safety
- usefulness
A 0–1 or 0–2 scale is usually enough. The simpler the rubric, the fewer arguments you will have. If reviewers often disagree, the fix is not a more complex rubric; rewrite the unclear rule instead.
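A rubric this simple can live directly in the evaluation script. A sketch on a 0–2 scale; the axis names follow the list above, and the wording of each rule is illustrative:

```python
# Illustrative 0-2 rubric; axis names follow the list above.
RUBRIC = {
    "completeness":     "0 = misses the point, 1 = partial, 2 = covers the request",
    "factual_accuracy": "0 = invents facts, 1 = minor slips, 2 = matches policy",
    "format":           "0 = wrong tone or length, 1 = usable, 2 = matches the spec",
    "safety":           "0 = risky advice, 2 = safe (deliberately no middle score)",
    "usefulness":       "0 = user must re-ask, 1 = partly helps, 2 = resolves it",
}

def total_score(scores: dict[str, int]) -> float:
    """Normalize a reviewer's per-axis scores to 0..1."""
    return sum(scores.values()) / (2 * len(scores))
```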
Also measure cost, latency, and response length separately. These are not secondary numbers. Two models can deliver almost the same quality, but one answers in 2 seconds and uses half as many tokens. In production, that is often more important than a couple of points in score difference.
Dangerous mistakes should be moved into a separate metric. The average score may look fine even when the model invents return terms in 3% of cases, mixes up medical restrictions, or sends the customer to the wrong place. The share of critical failures should be checked first.
It also helps to distinguish between types of errors. One mistake annoys the user; another creates a business risk. A long support reply is unpleasant. A wrong answer about a transfer limit is dangerous.
Compare models by each scenario instead of rolling everything into one number. Make separate rows for returns, tariff changes, status checks, complaints, and escalation to an agent. Then you can see where the model is strong and where it should not be shipped.
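A sketch of such a per-scenario breakdown, assuming each run produces a row with a scenario tag, a score, and a critical-failure flag:

```python
from collections import defaultdict
from statistics import mean

def per_scenario(rows: list[dict]) -> dict[str, dict]:
    """Aggregate results per scenario instead of one overall number.

    Each row is assumed to look like:
    {"scenario": "returns", "score": 0.8, "critical_failure": False}
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["scenario"]].append(row)
    return {
        name: {
            "n": len(items),
            "mean_score": mean(r["score"] for r in items),
            "critical_failure_share": mean(r["critical_failure"] for r in items),
        }
        for name, items in grouped.items()
    }
```

The critical_failure_share column is the one to read first, as argued above.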
If you run tests through a single gateway, it is convenient to collect quality and operational metrics side by side: provider cost, latency, response length, and error share by scenario. But even without such a system, the rule is simple: look at the model through the product's eyes, not through someone else's benchmark.
Example: customer support
A good testing example is an online store support team. There are many repeated requests, and mistakes become visible quickly. If the model mixes up return deadlines or promises something that is not in the rules, that immediately affects money and trust.
Take real requests from the support queue over the last month and clean the personal data. Then split them into at least two groups: simple and conflict-heavy. Simple cases help you understand basic accuracy. Conflict-heavy cases show how the model behaves under pressure when the customer is angry, asks for an exception, or writes unclearly.
For the first run, 30–50 examples is usually enough. The set should include questions about returning a product after delivery, messages about delayed shipping, requests to check order status, disputed cases where someone tries to bypass the rules, and short and long requests with errors and casual language.
For each example, add the internal company rule that the answer should rely on. For example: a return is possible within 14 days if the product has not been used; pre-order timelines may change; for order status, you cannot promise delivery today if the system does not show that status. Then you are testing not the smoothness of the text, but whether it matches the real policy.
You can give the same request to several models: “I have been waiting for my order for 9 days, no one is answering, I want a refund and compensation.” A good answer is short, calm, and accurate. It refers to the return rule, does not invent compensation, and does not promise a deadline that does not exist in the system. A bad answer sounds confident but adds made-up exceptions like “we usually refund within 3 days” if there is no such rule.
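A sketch of sending that same request to several candidates through one OpenAI-compatible endpoint; the gateway URL, environment variable, and model names are placeholders:

```python
import os
from openai import OpenAI

# One OpenAI-compatible gateway; base_url and model names are placeholders.
client = OpenAI(base_url="https://gateway.example.com/v1",
                api_key=os.environ["GATEWAY_API_KEY"])

SYSTEM = "You are a support agent for an online store. Follow the return policy strictly."
REQUEST = ("I have been waiting for my order for 9 days, "
           "no one is answering, I want a refund and compensation.")

for model in ["model-a", "model-b", "model-c"]:  # same prompt, only the model changes
    resp = client.chat.completions.create(
        model=model,
        temperature=0.2,  # frozen, identical for every candidate
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": REQUEST},
        ],
    )
    print(model, resp.choices[0].message.content)
```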
What counts as a good answer
Look at four things: a correct reference to the company rule, no invented deadlines or exceptions, a calm tone without bureaucratic language, and a format of 2–4 short sentences instead of a long letter. In practice, the winner is often not the strongest model in the overall ranking, but the one that invents the least on conflict-heavy requests. For support, that matters more than a beautiful writing style.
Where teams make the most mistakes
The most common mistake is simple: the team cleans the data so aggressively that the test no longer looks like live requests. Typos, broken phrases, odd abbreviations, duplicates, and extra context from the conversation disappear from the sample. Then the model performs well, but after launch it starts getting confused by ordinary user messages. Model evaluation on your own data only works when the set still contains the same noise the product lives with every day.
For teams in Kazakhstan and Central Asia, there is another common problem: language mixing without labeling. In one stream you can easily find Russian, Kazakh, and English, and sometimes all three in one message. If such examples are thrown into one pile without tags, the conclusions become distorted. You will not know whether the model is weak overall or only breaks on Kazakh requests, code-switching, and short English inserts.
Another mistake is tied to metrics. The team looks at the average score and gets excited about a difference of a few tenths. But the average often hides expensive failures. If a model answers normally in 95 out of 100 cases, but in the remaining 5 it gets the amount, request status, or escalation rule wrong, that is already a serious business problem. Such errors should be searched for separately: by slices, by task type, and by the highest-risk scenarios.
The comparison also breaks when the team changes the prompt during the test. At the start there is one instruction, then it gets rewritten a day later, then a couple of examples are added, then the model is asked to answer more briefly. After that, you can no longer honestly say which model is better. You are comparing several variables at once. First freeze the prompt, response format, temperature, and post-processing rules. Only then run the set.
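One way to enforce that is a single frozen run config that every test run reads; change any field and you are starting a new run, not continuing the old one. The field names and values here are illustrative:

```python
# A sketch of frozen launch conditions; field names and values are illustrative.
RUN_CONFIG = {
    "run_id": "support-eval-v1",
    "system_prompt_version": "v3",     # the prompt text lives in version control
    "response_format": "json",
    "temperature": 0.2,
    "max_tokens": 300,
    "postprocessing": "strip-markdown-v1",
}
```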
The gold answer often pulls the test off course too. An expert writes an “ideal” answer to their own taste: longer, stricter, more literary. But the product usually needs not a beautiful text, but the correct action. If an operator needs a short answer with an exact decision, do not penalize the model for not writing a paragraph of explanation.
A good test looks a little more boring than you might like. It has dirty examples, language tags, frozen launch conditions, and a separate look at hard failures. But then the results are less surprising after release.
Quick check before comparison
Model comparison breaks at the start if the test set is thrown together in a hurry. Take real product scenarios and immediately note their share. If 50% of your requests are short support questions, 30% are knowledge-base lookups, and 20% are complex cases with multiple constraints, the test set should reflect that picture.
Next, check that for each example you have a clear line between a good and a bad answer. You do not always need a perfect gold answer in one sentence. Often a short rule is enough: answered the point, did not invent facts, kept the right tone, and asked for clarification honestly if the data was missing. It is also useful to write down a clear failure: drifted into vague wording, missed a restriction, gave dangerous advice.
Normal cases are necessary, but the test should not stop there. Add rare and borderline examples: a message with typos, mixed Russian and Kazakh, conflicting instructions, an empty field, overly long input, or an irritated customer. These are the requests where you can see where the model loses meaning and where it knows how to stop instead of making things up.
For a quick check, one table is enough. It usually includes the scenario and its share in the product, the example itself and a short evaluation rule, the same prompt and input format for all models, launch cost, average latency, final score, and a short reviewer comment.
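A sketch of that table as a CSV, with one row per example-and-model pair; the column names and sample values are illustrative:

```python
import csv

COLUMNS = ["scenario", "share_in_product", "example_id", "evaluation_rule",
           "model", "cost_usd", "latency_ms", "score", "reviewer_note"]

with open("comparison.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "scenario": "returns", "share_in_product": 0.5, "example_id": "req-0142",
        "evaluation_rule": "cites the 14-day rule, no invented deadlines",
        "model": "model-a", "cost_usd": 0.0012, "latency_ms": 1800,
        "score": 0.9, "reviewer_note": "calm tone, correct rule",
    })
```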
The same prompt is mandatory. If one model gets a different system text, a different field format, or more context, you are comparing not the models, but the test setup. When the team uses one OpenAI-compatible gateway, it is easier to keep the request identical: only the model changes. For tasks like this, AI Router looks like a practical option for teams in Kazakhstan: you can switch providers through one endpoint and not change the SDK, code, or prompts.
Collect cost, latency, and quality in one table from day one. Otherwise, quality lives in one file, cost in another, and the debate never ends. Even 30–50 examples are enough to show which model handles the main load, which one does better on rare cases, and where token savings are eaten up by latency.
What to do after the first tests
After the first run, do not keep all models on the list. If a model regularly mixes up facts, breaks the response format, or drifts into extra reasoning, remove it right away. Usually after that cull, 2–4 finalists remain, and those are worth checking manually.
Manual review almost always changes the picture. In the table, two models may look the same, but in live examples the difference becomes obvious fast: one falls apart on long dialogues, another keeps the response style but sometimes invents details. For product work, those failures matter more than a pretty average score.
Do not stop at comparing the models themselves. After the first round, rebuild the system prompt, clarify the response rules, and check the temperature, token limit, and order of instructions. A small tweak often gives a bigger effect than switching to another model. But after any change, run the test again on the same example set, or you will not know what actually helped.
The dataset itself should not be treated as fixed either. The product changes, users ask new questions, and support finds fresh failures. It is better to agree in advance how often to update the dataset and who owns it. A simple routine works well: add new failures after incidents, review examples after launching new scenarios, remove duplicates and stale cases once a month, and separately mark rare but costly mistakes.
If you compare many providers, extra manual work quickly eats the team’s time. In that case, it is convenient to run the same requests through one OpenAI-compatible endpoint. For teams that care about keeping data inside Kazakhstan, audit logs, PII masking, and one access layer to different models, airouter.kz can simplify the evaluation process itself, not just production launch.
A good first-round result looks simple: weak models are removed, finalists are reviewed manually, the prompt and settings are rechecked, and the dataset is set up for regular updates. After that, the comparison becomes noticeably fairer and much closer to how the product will work on real requests.
Frequently asked questions
Why does a high score in a public benchmark guarantee nothing?
Public benchmarks test someone else’s task in clean conditions. Your product lives in a different world: users make typos, mix languages, want a short answer, and often require a strict format.
What data should we use to test the model?
Take what already flows through the product: chat logs, tickets, emails, forms, and internal requests. Demo examples are usually too neat and rarely look like a real queue.
How many examples do we need for one scenario?
For the first comparison, 30–50 examples per scenario is usually enough. If you can get to 70–100, the conclusions will be steadier and random model wins will matter less.
Should we clean personal data before testing?
Yes, and it is better to do it before labeling. Names, phone numbers, addresses, contract numbers, and other personal details are removed not only for safety, but also so the reviewer does not judge the answer by the customer’s history.
Why split the sample into frequent and risky cases?
Otherwise, frequent simple requests will dominate the sample, and expensive mistakes will hide in the average result. Keep common cases in the main set, and move rare and high-risk examples into a separate stress-test set.
Do we need one gold answer for each request?
No, one perfect answer is rarely needed. It is better to write the boundaries in advance: which facts the model must keep, which format is required, and when it should ask for clarification or refuse to answer.
What metrics should we measure besides the overall score?
Do not look only at text quality. It is useful to measure factual accuracy, format adherence, response length, cost, latency, and the share of dangerous errors for each scenario separately.
How do we make the comparison fair?
First freeze the dataset, prompt, temperature, response format, and post-processing rules. If you keep changing those during the check, you will not know whether the model got better or the test conditions just changed.
Why does the average score often mislead us?
Averages hide expensive failures. A model may handle 95 simple cases well and fail 5 difficult ones where the mistake hits money, risk, or user trust.
What should we do after the first test run?
Remove the clear underperformers at once and keep 2–4 finalists for manual review. Then adjust the prompt and settings, run the same set again, and agree on how you will update the dataset after new incidents.