Synthetic Examples for Testing LLMs Before Production
Synthetic examples help test LLMs when real data is scarce. Learn how to build test cases, write expected results, and catch failures before launch.

Why it is easy to miss a failure without tests
When there is little real data, the risk does not go down. Usually it is the opposite: the team sees weak spots less clearly and trusts the first good answers too early. If the model solved a task well once in a demo, that does not mean it will answer the same way on the twentieth, the hundredth, or the most awkward request.
LLMs usually break not on ordinary phrases, but on rare cases. A user writes with typos, mixes Russian and English, gives incomplete context, asks for almost the same thing as before, but changes one important detail. In those places the model mixes up steps, makes up a fact, forgets a constraint, or sounds too confident where it should ask for clarification.
This is especially noticeable before launch, when there are almost no logs. The team has no live history of requests, repeated mistakes, or a long list of complaints. Empty logs easily create a false sense of safety. It feels like if there are no bad examples, there is no problem. In reality, the problems just have not had time to appear yet.
That is exactly when synthetic examples are needed. They do not replace real data, but they give you a solid starting point. With them, you can check in advance how the model behaves on simple, disputed, and unpleasant requests, and quickly see where it is unstable.
A good test set often exposes what a demo hides. The answer changes on the same task if you slightly rephrase the question. The model skips an important constraint in a long message. A polite tone hides a factual mistake. A rare scenario breaks the logic completely. And two similar models can give very different levels of accuracy.
This is especially useful when the team compares several models before launch. If you only need to change the base_url and run the same set of requests, the differences appear almost immediately. Without a test set, people often choose based on a couple of pretty answers, and that is a weak check.
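As a minimal sketch with the OpenAI Python SDK: the endpoints, API keys, and model names below are placeholders, not real services. The point is that only base_url and model change, while the requests themselves stay the same.

```python
# Minimal sketch: send the same request to two OpenAI-compatible models.
# Endpoints, API keys, and model names below are placeholders.
from openai import OpenAI

request = "Can I process the payment today? Amount 4,900,000, limit 5,000,000."

candidates = [
    {"base_url": "https://api.provider-a.example/v1", "api_key": "KEY_A", "model": "model-a"},
    {"base_url": "https://api.provider-b.example/v1", "api_key": "KEY_B", "model": "model-b"},
]

for c in candidates:
    client = OpenAI(base_url=c["base_url"], api_key=c["api_key"])
    reply = client.chat.completions.create(
        model=c["model"],
        messages=[{"role": "user", "content": request}],
    )
    print(c["model"], "->", reply.choices[0].message.content)
```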
The most common mistake is simple: people test only what is convenient to show. Users do not write like that. They write shorter, messier, and stranger. If you account for that in advance, the first production rollout goes more calmly, and fewer fixes are needed after launch.
When synthetic data really helps
Synthetic data is especially useful at the very beginning, when the team needs a quick answer to a simple question: does the model actually solve your task, or does it only produce answers that sound convincing? At this stage, you do not need a perfect dataset. You need a set of requests that helps you spot weak points before the first release.
It works best for first runs and for comparing models against one another. If you send the same dataset through several models, the difference is obvious right away: where the answer is more accurate, where the model goes off into fantasy, where it loses the format, and where it too confidently suggests dangerous actions. For that kind of comparison, test cases are more useful than long debates on a call.
There is also a more practical reason. In many teams, real data cannot be used freely in tests. That is a common situation for banks, clinics, telecom, and the public sector. Customer correspondence, applications, internal documents, personal data — these are often restricted by policy, contracts, or requirements to keep data inside the country.
In those conditions, synthetic data gives you a safe starting point. You can check response formats and business logic without access to live logs, compare several models on the same scenarios, show the risks to legal and security teams, and build a basic dataset before integrating with production.
But synthetic data has limits. It is almost always cleaner than real life. Users write with errors, jump between topics, mix up dates, add unnecessary details, and sometimes pressure the model. If you rely only on invented cases for too long, the team ends up testing a convenient world, not the real one.
So do not build a set only from correct, polite, and tidy requests. Add noise. Let one user write in fragments, another ask for something the system should not do, and a third mix Russian and English. Even on the first run, you will see how well the model holds up in a real conversation.
Synthetic data is good as the first step. It helps you quickly filter out weak options without touching sensitive data too early. After that, you still need checks on real, even if anonymized, examples.
Where to get ideas for cases
It is better to collect scenarios from real user tasks, not from a list of features. People do not think in terms like “tariff lookup” or “status check.” They write: “Why was I charged more than usual?”, “I need a limit urgently,” “How do I close my card if I’m not in the country?” Those are the kinds of phrases that make proper LLM cases.
The best starting point is wherever customer language already exists. Usually, support emails, operator chats, call-center scripts, complaints, escalations, and search queries in the knowledge base are enough. They quickly show how people ask the same question in ten different ways. For testing, that matters more than a full catalog of features. The model should understand intent, not guess the answer from perfect wording.
If you are building an assistant for a bank, telecom company, or public service in Kazakhstan, bilingual use cannot be left for later. The dataset should include questions in Russian, Kazakh, and mixed form. In real life, someone may easily write: “Карта заблокталды, что делать?” or “Рассрочка бар ма for this product?” If the tests run only on clean Russian, the picture will be too pretty.
Frequent requests give you a base, but you cannot stop there. Add rare cases where a mistake is costly: a refund, exposure of personal data, bad advice about limits, confusion about documents, or promising something the service does not do. These cases are rarer, but they are the ones that break launches later.
You also need inconvenient input variants. People make typos, send one word, send an empty message, write in all caps, or write in anger. A model that confidently answers only neat requests will quickly lose form in production.
A useful trick is simple: for each common task, collect 5–7 phrasings. One normal one, one short one, one with mistakes, one irritated one, and one mixed-language one. That makes the dataset feel more like real life, and even at this stage you can see where the pre-launch evaluation will be honest and where the tests are still too sterile.
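As a rough sketch, one task with its phrasings might be stored like this; the wording here is invented for illustration:

```python
# Hypothetical example: five phrasings of one task, "check transfer limit".
phrasings = {
    "check_transfer_limit": [
        "What is my transfer limit right now?",             # normal
        "limit?",                                           # short
        "whats my trasfer limit",                           # with mistakes
        "WHY is the transfer blocked AGAIN, what limit??",  # irritated
        "Лимит на перевод қанша?",                          # mixed-language
    ],
}
```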
How to build the first dataset
The first dataset does not need to be large. If there are few real logs, it is better to build cases around situations where a mistake directly affects money, deadlines, or user complaints.
For the first pass, 5–7 scenarios is usually enough. That will not give a full picture, but it will quickly reveal major failures: the model mixes up rules, invents facts, or misses an important constraint.
A practical sequence looks like this:
- Choose high-risk scenarios. Look for cases where a wrong answer leads to an extra payout, an error in a document, a policy violation, or another support request.
- For each scenario, write the user’s goal in one short phrase. Not “get advice,” but “check application status,” “calculate the amount,” “verify the limit.”
- Make 3–5 versions of the same request. One user writes briefly, another gives half the story, and a third makes a wording mistake or combines two questions in one message.
- Add context: user role, date, amount, product, channel, and any constraints that affect the answer.
- Put everything in one table. Do not keep cases in notes and chats, or the team will quickly get lost in versions.
Context often decides everything. The request “Can I process the payment today?” without details is almost useless. The same request from a cashier on 31.12 for 4,900,000 tenge with a limit of 5,000,000 can be tested properly.
A convenient template might look like this:
| Scenario | Request | Context | Expected result |
|---|---|---|---|
| Payment against limit | Can I process the payment today? | Role: cashier; date: 31.12; amount: 4,900,000; limit: 5,000,000 | The model confirms the payment is possible and refers to the limit |
| Application status | What about my application? | Role: customer; application submitted yesterday; no number provided | The model asks for the missing identifier and does not invent a status |
In the “Expected result” column, it is better to write a checkable outcome. Not an abstract “answer well,” but specific behavior: ask for missing data, do not invent a contract number, do not promise an action without confirmation.
If the team runs this dataset through several models, comparing answers becomes much easier. You can immediately see where the model follows the rule and where it breaks on the third variation of the same request.
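For instance, the two template rows above could be stored as JSONL, one case per line. The file name and field names here are just one possible convention, not a fixed format:

```python
# Sketch: the template rows as JSONL records, one case per line.
import json

cases = [
    {
        "scenario": "Payment against limit",
        "request": "Can I process the payment today?",
        "context": "Role: cashier; date: 31.12; amount: 4,900,000; limit: 5,000,000",
        "expected": "Confirms the payment is possible and refers to the limit",
    },
    {
        "scenario": "Application status",
        "request": "What about my application?",
        "context": "Role: customer; application submitted yesterday; no number provided",
        "expected": "Asks for the missing identifier and does not invent a status",
    },
]

with open("cases.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```

One file in the repository is easier to rerun after every prompt change than cases scattered across notes and chats.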
How to write the expected result
It is better to describe the expected result as the model's action, not as an "ideal answer." Otherwise you end up judging one nice sentence instead of the overall behavior. For testing, a record like "if the request is missing data, the model asks for clarification" is more useful than a polished reference answer the model is supposedly meant to reproduce almost word for word.
A bad expected result is vague: “It will answer politely and correctly.” A good one describes checkable behavior: “It mentions only known facts, does not invent an order number, asks one clarification question, and answers in three points.” Then the dataset becomes a working tool, not a collection of pretty answers.
You should also write the prohibitions directly in the case card. If you do not record them, the reviewer will start making up rules on the fly. Usually 2–4 clear constraints are enough: do not invent a fact, do not leave the role, do not give advice outside the allowed area, do not promise an action the system cannot do.
A simple scoring scale works well:
- Correct — the model did what was required and did not break any prohibitions.
- Partial — the overall meaning is right, but it skipped a step, added something extra, or slightly drifted from the format.
- Refusal — the model correctly refused to answer where it should not answer.
- Escalation — the model passed the case to a human or another channel when the rule required it.
Do not mix everything into one score. The same answer can be accurate in substance but weak in format. So check accuracy, format, and tone separately. Accuracy answers “is this true?” Format checks the structure: list, JSON, short answer, length limit. Tone matters where the role affects trust and risk, such as in support, banking, or medicine.
It is better to decide who gives the score before the first review. If one analyst marks an answer “partial” and another marks it “correct,” you will not know whether the model improved or you are arguing about taste. Usually the team only needs short rules on one page and 3–5 worked examples. After that, any reviewer judges the case by the same criteria.
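One way to keep reviewers aligned is to record each score in a fixed structure, with the verdict and the three separate checks kept apart. A minimal sketch, with field names chosen here for illustration:

```python
# Sketch: one review record per case; the verdict and the separate
# accuracy / format / tone checks are recorded independently.
from dataclasses import dataclass

VERDICTS = {"correct", "partial", "refusal", "escalation"}

@dataclass
class Review:
    case_id: str
    verdict: str        # one of VERDICTS
    accuracy_ok: bool   # "is this true?"
    format_ok: bool     # list / JSON / short answer / length limit
    tone_ok: bool       # matters in support, banking, medicine
    note: str = ""      # why the reviewer chose this verdict

review = Review(
    case_id="payment-limit-03",
    verdict="partial",
    accuracy_ok=True,
    format_ok=False,
    tone_ok=True,
    note="Amount is correct, but the required short summary is missing.",
)
assert review.verdict in VERDICTS
```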
How to make the dataset feel real
A model rarely breaks on neat prompts in a table. It more often fails on live language, where a person is in a hurry, jumps between thoughts, and contradicts themselves. That is why synthetic cases should be a little uneven, not “classroom perfect.”
Mix requests of different lengths. One user writes, “Where is my account?” Another sends seven lines with the order number, complaint, extra story, and a request to “answer briefly.” If the dataset only contains medium-length requests, you will not see how the model behaves at the edges.
Add conversational words, typos, omissions, and noise. People write “now,” “it’s not loading,” “it’s the thing from yesterday,” forget dates, and mix up names. They also add details that do not help the task: “I already called three times,” “I’m on the road now,” “answer without formalities.” A good test checks whether the model can separate what matters from background noise.
It is also useful to test a topic change within one conversation. A user may start with a return request and then ask two messages later whether their previous applications are still there. If the model only follows the last question, it loses the thread. If it drags the whole conversation into every answer, it may respond to the wrong thing.
It is worth deliberately adding ambiguity and conflicting facts. For example, in the first message a person says the order has not arrived yet, and in the next they complain about a defect in the received item. In that situation, a good model does not make up an answer — it clarifies what actually happened.
The minimum for a realistic dataset looks like this:
- several very short requests without context
- several long conversations with extra details
- cases where the topic changes during the conversation
- examples with contradictions and unclear wording
- requests with overloaded context, where the important detail is hidden in the middle
If a case looks too clean, it is almost always too easy. The dataset starts to resemble reality when the model has to do more than continue the text: it has to untangle the mess, ask a clarifying question, and not mix up facts.
Example: a bank assistant before launch
A banking chatbot rarely breaks on nice demo questions. Problems start with everyday messages, where the customer mixes up terms, writes on the fly, and expects a fast answer. That is why tests should be built not around perfect requests, but around confusion, haste, and disputed situations.
Suppose the team is building an assistant for a mobile bank. The first case: the customer asks about a transfer limit, but the text mixes up the card and the account. If the bot confidently answers about the card limit when the question was about the account, that is not a minor inaccuracy. That kind of answer only adds to the confusion. In a good scenario, the model first clarifies what exactly is being asked and only then answers.
The second case should be less tidy. The customer writes with typos: “need it urgent tell me why money wont go through transfer needed now.” Here the team is checking not just meaning, but tone. The bot should not get annoyed, lecture the user, or get lost because of typos. A normal reply is short: acknowledge the urgency, ask for the minimum needed data, and if there is a risk of a blocked payment or a payment failure, quickly hand the case to a human agent.
The third case is useful for checking boundaries. The customer asks for advice on a disputed transaction, for example whether they should cancel a transfer that is already “stuck” or whether they can dispute a charge right now. This is dangerous when the model starts giving advice too confidently. The team should check four things: where the bot may answer on its own under bank rules, where it must call an operator, how it explains the handoff without a dry refusal, and whether it invents procedures that do not exist.
After the run, weak spots usually show up. One prompt makes the model ask for clarification more often. Another rule stops it from guessing when the customer mixes up the product, channel, or transaction status. Such a test gives not an abstract score, but a clear list of fixes: where to shorten the answer, where to add escalation, and where to remove an overconfident tone.
Where teams make mistakes most often
The most common mistake is simple: the team writes easy questions. Short, clean, and unambiguous. On such a dataset, the model almost always looks better than it does in real work. Then production brings in broken phrases, mixed languages, and extra details, and the confidence runs out quickly.
This happens all the time with synthetic cases. People unconsciously invent examples they themselves want to pass. If a request reads too smoothly, that is a reason to be suspicious. A good test often feels a little annoying: it has noise, uncertainty, or conflicting conditions.
Another mistake is looking at style and missing the facts. The answer can be polite, smooth, and well written, but still contain the wrong amount, the wrong date, or an invented rule. For a business, that is more dangerous than a dry tone. If a bank assistant beautifully explained a fee that does not exist, that is a bad answer even if the text is pleasant to read.
A dataset often grows in length, but not in breadth. The team takes one scenario and rewrites it ten different ways. Formally there are more cases, but the coverage has barely changed. If you have 40 examples about password reset and only 2 about refusal, escalation, and personal data handling, the dataset is skewed.
It helps to split requirements into two groups. Mandatory items include factual accuracy, following prohibitions, the required format, and safety. Desirable items include tone, brevity, order of points, and presentation. When these groups get mixed together, scoring breaks. The model may fail because of wording even though the answer was correct in substance. Or it may pass because the text is smooth while violating an important rule.
Another common miss is not updating the dataset after changes. The team changes the system prompt, enables new model routing, or moves part of the traffic to another configuration, but leaves the tests untouched. After that, the old evaluation does not mean much. The dataset has to change with the system, or it is testing the past version, not the current one.
Quick check for the dataset
A good test dataset can be checked in 15–20 minutes. If it takes half a day, the dataset is almost always overcomplicated: too many similar cases, vague expectations, or an awkward format. For the first pre-launch check, it is better to have a compact dataset that the team understands at a glance.
Start by reviewing the composition. The dataset should include common requests, rare cases, and edge cases. If it only has typical examples, the model will look great in a demo and fail on the first strange request. If it only has difficult cases, you will not understand how it behaves in everyday work.
For synthetic cases, a simple rule works well: each example should fit into one short record. Three fields are enough — user request, context, and expected model action. The expected action should be written as a checkable rule, not as an impression. Not “the answer sounds confident,” but “asks for clarification,” “does not invent the plan,” “hands the request to an operator.”
A short checklist looks like this:
- the dataset includes basic, rare, and edge cases;
- each case has a request, context, and expected action;
- two reviewers assign the same score using the same rules;
- the whole dataset is in one file, such as CSV or JSONL;
- the same file can be run on several models without manual edits.
A practical format always wins. One file is easier to keep in the repository, update after new errors, and rerun. If the team compares multiple models through one OpenAI-compatible gateway, the same dataset can be run in sequence without rewriting the integration.
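A sketch of that loop, assuming the cases.jsonl file from earlier and placeholder model names behind one OpenAI-compatible endpoint:

```python
# Sketch: rerun one JSONL dataset against several models through a single
# OpenAI-compatible endpoint. Endpoint, key, and model names are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")
models = ["model-a", "model-b", "model-c"]

with open("cases.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f]

for model in models:
    for case in cases:
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": f"Context: {case['context']}"},
                {"role": "user", "content": case["request"]},
            ],
        )
        # Keep the raw answer next to the expected action for later review.
        print(model, "|", case["scenario"], "->", reply.choices[0].message.content)
```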
What to do after the first check
The first run rarely gives a clean picture. Usually it shows not average quality, but the places where the model behaves strangely: it answers too confidently, drifts from the format, or gets confused by a simple rule.
Save every disputed answer right away. Not just obvious failures, but also the cases where the team argues: “Is this already an error, or is it still acceptable?” Those examples often become more useful than ten obvious tests.
For each disputed case, it helps to record four things: the original request, the model’s answer, what exactly bothered the team, and what answer you now consider acceptable. That quickly creates a working decision log instead of just a folder of bad answers. After a couple of iterations, it becomes the basis for the next dataset.
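A disputed-answer entry can be as small as this; the field names are invented for illustration, and a row in a shared table works just as well:

```python
# Sketch: one entry in the decision log for a disputed answer.
disputed = {
    "request": "What about my application?",
    "model_answer": "Your application was approved yesterday.",
    "concern": "No application number was given, yet the model stated a status.",
    "accepted_answer": "Ask for the application number instead of guessing the status.",
}
```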
Next, replace part of the synthetic set with live examples. Even a good artificial dataset will eventually start “teaching to the test” instead of real work. If data access is limited, use anonymized logs, short dialog snippets, or manually rewritten requests from real scenarios.
Often 20–30% live cases are enough to see the difference. For example, in a banking assistant, a synthetic question about a card limit is usually neat and complete, while a real user writes: “why is it failing again.” On such short and frustrated requests, weak points are obvious immediately.
Another useful step is to run the same dataset on two or three models. Do not change the prompt, temperature, or scoring rules, or the comparison will become noisy. Look not only at the average score, but also at the error profile: one model handles format better, another invents fewer facts, and a third gives similar quality at lower cost.
If the team needs to quickly test several models without rewriting the integration, AI Router can make the job easier. It is an OpenAI-compatible API gateway for businesses in Kazakhstan and Central Asia: you can switch models through one endpoint at api.airouter.kz and run the same dataset with the same SDKs, code, and prompts.
After that, do not expand the tests chaotically. Add only the cases that caught a real mistake or cover a new risk. That kind of dataset grows more slowly, but it is much more honest.
Frequently asked questions
Why do we need synthetic examples if there are almost no logs?
Because empty logs can be misleading. When there are almost no live requests, the team only sees the successful demos and misses failures on rare, messy, and ambiguous messages.
Synthetic cases give you a starting check: they help you spot where the model mixes up steps, makes things up, or fails to ask for clarification.
When are synthetic cases most useful?
They work best before the first release and when comparing several models. You run the same dataset through each one and immediately see who keeps the format, who drifts into fantasy, and who struggles with edge cases.
If real data is restricted by policy or contains personal information, synthetic cases let you start testing without risk.
Where should test cases come from?
Use real user wording, not a feature list. Good sources are support emails, operator chats, complaints, escalations, call-center scripts, and search queries in the knowledge base.
Look at how people ask the same question in different ways. Those variants are what best test intent understanding.
Should we include Kazakh and mixed-language requests?
Yes, especially if the product works in Kazakhstan. Users often mix Russian, Kazakh, and English in one message, and the model should handle that.
If you test only clean Russian, the result will look too good. Add at least some cases with mixed language and normal typos.
How many cases do we need for the first dataset?
For the first pass, 5–7 high-risk scenarios and 3–5 phrasings for each are usually enough. That is enough to catch major failures quickly without getting lost in a huge table.
Do not chase volume. A small but honest dataset is more useful than a hundred near-identical examples.
How should we write the expected result?
Write the model’s behavior, not an ideal answer. For example: it asks for missing data, does not invent a ticket number, does not promise an action without confirmation, and responds in the required format.
That way, the reviewer evaluates behavior on the task, not a pretty sentence.
How do we make the dataset feel realistic?
Make the cases a little rough. Add short messages without context, long requests with extra details, topic changes in the conversation, contradictions, and typos.
If an example looks too neat, it is almost always too easy. A live dataset should make the model work a bit, not help it along.
Where do teams make the most mistakes?
Teams often write easy questions they themselves want to pass. They also look at polite wording and miss factual errors, or they bloat the dataset with dozens of almost identical cases.
Another common problem is not updating tests after changing the prompt, routing, or model. Then the dataset is checking a system that is no longer the one in production.
How can we quickly tell whether the test dataset is good?
Check that the dataset includes common, rare, and edge cases, and that each case records the request, context, and expected action. Then have two people score a few examples using the same rules.
If they give similar scores and the whole dataset fits in one file, it is already practical to use.
What should we do after the first run?
Save every disputed answer and turn it into a new case. Then replace part of the synthetic data with anonymized live examples so the dataset does not teach the model only to pass the test.
If you compare several models, run the same file without changing the prompt or scoring rules. Through an OpenAI-compatible gateway like AI Router, that is especially convenient: you can switch models without rewriting the integration.