Nov 22, 2025

A Benchmark for the Kazakh Language Based on Real-World Scenarios

A Kazakh-language benchmark should be built on real scenarios: customer requests, forms, search, and support. Let’s look at the dataset, metrics, and common mistakes.

Why polished demos break evaluation

A polished demo is almost always assembled by the team itself. That means the queries in it are neat: no typos, no broken thoughts, no local slang, and none of those odd phrases people type on the fly. In a dataset like that, the model almost always looks smarter than it does in real work.

This is especially noticeable with Kazakh. In a live chat, a person can easily mix Kazakh and Russian in one message, insert transliteration, abbreviations, and fragments from voice input. If your test only includes clean phrases like "Write a polite reply to the customer," you are not testing the real language environment.

A typical work query might look like this: "Salem, yesterday I placed an order, the money went twice, what do I do now?" A person will understand the meaning immediately. For the model, this is already a test of mixed language, spelling noise, and support context. That is where the weak spots show up.

Short and convenient prompts can be misleading too. They do not require keeping long context in mind, do not test multi-turn conversations, and do not show what happens when a phrase is ambiguous. In a demo, the model answers smoothly simply because the task is too easy.

There is another trap: one good answer. If the model handled one example well, that proves nothing. In production, consistency matters: how does it behave on 20 similar requests, on different versions of the same question, after a model or provider change?

A strong dataset includes not only "clean" examples, but also messages with typos, conversational language, mixed Kazakh and Russian, long chains with lost context, and similar requests that require a stable answer.

If the dataset looks like a collection of pretty screenshots for a presentation, it is almost useless. A useful dataset is a bit inconvenient, somewhat noisy, and very close to what customers, operators, and employees actually write inside the company.

Which tasks to include

It is better to start not with rare cases or flashy examples. Take what people do every day. If a wrong answer leads to an extra call, a rejected request, or a lost sale, that task should be tested first.

Usually five task types are enough. They are simple in form, but they quickly show whether the model understands real language, not just a clean prompt.

Everyday support questions almost always belong in the first dataset: order status, returns, card limits, plan terms, delivery times. People ask them briefly, with mistakes, and often in a mix of Kazakh and Russian.

Search tasks for a catalog also work well. The user types something different from the product name in the database: "ақ көйлек" (white dress) in Cyrillic, its Latin transliteration "ak koilek", or the Russian equivalent, and the system should find the closest match.

Another task type is filling and checking forms. Here the model extracts an IIN, a contract number, a date, and an amount, notices missing or invalid fields, and does not confuse similar values.
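
As a sketch, such a check can be a handful of validators over the model's extracted fields. The field names and formats below are illustrative, not a fixed standard (a Kazakh IIN is a 12-digit number):

```python
import re
from datetime import datetime

def _is_date(value: str) -> bool:
    """Accept the two date formats we expect in forms (an assumption)."""
    for fmt in ("%d.%m.%Y", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

# Illustrative validators; field names and formats are assumptions.
# A Kazakh IIN is a 12-digit number.
VALIDATORS = {
    "iin": lambda v: re.fullmatch(r"\d{12}", v) is not None,
    "contract_number": lambda v: re.fullmatch(r"[A-Z0-9/-]{4,20}", v) is not None,
    "amount": lambda v: re.fullmatch(r"\d+([.,]\d{1,2})?", v) is not None,
    "date": _is_date,
}

def check_fields(extracted: dict) -> dict:
    """Return a per-field verdict: 'missing', 'invalid', or 'ok'."""
    verdicts = {}
    for field, is_valid in VALIDATORS.items():
        value = extracted.get(field)
        if not value:
            verdicts[field] = "missing"
        elif not is_valid(str(value)):
            verdicts[field] = "invalid"
        else:
            verdicts[field] = "ok"
    return verdicts

print(check_fields({"iin": "900101300123", "date": "2025-11-02", "amount": "12 000"}))
# {'iin': 'ok', 'contract_number': 'missing', 'amount': 'invalid', 'date': 'ok'}
```

Per-field verdicts like these later feed directly into per-field metrics, which is why this task type pays off early.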

Another useful scenario is a short summary of long texts: a customer email, a request, an internal instruction, or a chat with an operator. The key is for the model to compress the text without losing meaning and without adding anything that was not there.

Finally, it is worth checking how the request is classified by topic and urgency. A question about balance and a complaint about a double charge require different reactions, even though both may look like short chat messages.

Include tasks with language rough edges. For Kazakh, this is especially important: different word forms, colloquial variants, typos, Latin script instead of Cyrillic, and mixed requests. If your test only contains neat phrases like "what are the return terms," the model will look better than it does in real work, and the misses will start on colloquial versions of the same question, like "can I return the item?"

There is a simple filter: keep only tasks where the answer can be checked. If even experts disagree about the right answer, the example is not ready for the dataset. For the first pass, it is better to take cases with a clear outcome: the right product was found or not, the fields were filled correctly or not, the urgency was identified correctly or not.

At the start, 30-50 examples per task type are enough. That dataset will already show where the model stays confident and where it falls apart on ordinary requests.

Where to get real examples

You do not need invented questions for the dataset. The best sources are traces of real work: support chats, customer emails, form submissions, operator comments, short site search queries, or in-app searches. This material quickly shows how the model behaves outside the demo, when people write messily, rush, and mix up words.

If you are building a dataset for the Kazakh language, do not try to "clean" it into a textbook standard first. Keep the typos, conversational forms, Kazakh-Russian mixing, local abbreviations, and awkward phrasing. Users do not write like editors. If you remove that noise, you are testing not the real product, but a neat showcase.

A good starting point often looks like this: 100-200 anonymized support conversations, 50-100 emails or requests with a long description of the problem, search queries from a website or app, several dozen expensive mistakes, and a separate set of rare but unpleasant cases.

It is better to keep rare cases separate. They do not happen every day, but they hurt the most: an incorrect amount, a missed negation, confusion in the delivery address, the wrong request status, or dangerous advice on a health topic. One such example is more useful than ten smooth FAQ questions. In banking, retail, or telecom, these mistakes are usually already visible in complaints, manual escalations, and disputed operator responses.

Before labeling, remove anything that should not enter the dataset: full names, phone numbers, email addresses, IINs, contract numbers, internal IDs, and internal notes. Do it right away. Otherwise, personal data will quickly spread across spreadsheets, discussions, and hints for labelers. For teams in Kazakhstan, this is simply a practical step: it is better to account for data storage and handling requirements at the beginning than fix them later.
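
A minimal masking pass might look like the sketch below. The patterns are illustrative, not a complete PII detector, and masked output still deserves a manual spot check:

```python
import re

# Illustrative masking rules; a starting point, not a full PII detector.
PII_PATTERNS = [
    (re.compile(r"\b\d{12}\b"), "[IIN]"),  # 12-digit IIN
    (re.compile(r"\+?7[\s()-]*\d{3}[\s()-]*\d{3}[\s-]*\d{2}[\s-]*\d{2}"), "[PHONE]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def mask_pii(text: str) -> str:
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(mask_pii("ИИН 900101300123, тел. +7 701 123 45 67, mail: a.b@mail.kz"))
# -> ИИН [IIN], тел. [PHONE], mail: [EMAIL]
```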

It is useful to save a short context next to each item. Not just the question itself, but also where it came from, what the person was trying to do, and how the conversation ended. A couple of lines often help explain why the same text counts as the correct answer in real life but looks questionable in a test.
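
In practice, one stored item might look like this. The schema is only a sketch of what is worth keeping, and the field names are illustrative:

```python
# One dataset item with its short context; the schema is illustrative.
item = {
    "id": "support-0042",
    "source": "mobile app support chat",
    "user_goal": "find out why a payment was charged twice",
    "text": "Salem, yesterday I placed an order, the money went twice, what do I do now?",
    "outcome": "operator confirmed the duplicate charge and refunded it",
    "notes": "mixed Kazakh-Russian in the original, kept as written",
}
```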

How to agree on the right answer

Quality disputes usually start not with the model, but with the gold standard. If the team understands good answers differently, the dataset quickly turns into a collection of random opinions.

Start with a simple rule: one example checks one task. Do not mix fact lookup, polite tone, translation, and formatting in one case. If a user asks in Kazakh how to restore access to their account, that example should test only the access recovery instructions.

For each case, record two things: what counts as correct and what definitely does not fit. A correct answer can be short if the task is simple. An incorrect answer should include not only factual mistakes, but also unnecessary inventions, dodging the question, switching languages without reason, and broken formatting.

This matters especially in LLM evaluation in Kazakh. The same meaning can often be expressed in several normal ways. If the answer preserves meaning, tone, and the needed action, do not punish the model just because it used different wording.

What to record for each example

  • the goal of the answer in one sentence
  • acceptable variations in meaning
  • hard errors
  • the language of the answer
  • the required format

For example, the request "Where can I download the payment receipt?" can be considered solved if the model answers in Kazakh, gives a clear path in the interface, and does not invent sections that do not exist. If it answers in Russian or suggests a button that is not there, that is a miss.
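
Written down as a labeling record, that case might look like this (the field names mirror the checklist above and are illustrative):

```python
# The payment receipt case as a labeling record; values are illustrative.
case = {
    "goal": "tell the user where to download the payment receipt",
    "acceptable": [
        "any wording that points to the correct section of the interface",
        "a short step list instead of a single paragraph",
    ],
    "hard_errors": [
        "inventing sections or buttons that do not exist",
        "answering in Russian when the question was in Kazakh",
    ],
    "answer_language": "kk",
    "format": "short step list",
}
```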

Mark the format separately. Does the answer need to be one paragraph, a step list, JSON, or a short template for an operator? Without this, labelers will quickly start arguing about style, even though you wanted to test usefulness.

A short, one-page instruction is enough for labelers. It usually only needs to describe the goal of the dataset, the language rules, how to mark a partially correct answer, and 3-4 examples with explanations. If a rule cannot be explained in one sentence, it is probably too complicated for the first dataset.

A good gold standard does not try to cover every nuance of the language at once. It helps the team look at the answer in the same way and quickly notice where the model is truly helping and where it only sounds confident.

How to build the first dataset step by step

Do not try to build the perfect dataset right away. For the first run, one scenario and 50-100 examples are enough. If you take fewer, random wins and misses will affect the result too much. If you take more, the team will get tired of labeling and start arguing over minor details.

Sort the examples by difficulty right away: easy, medium, and hard. Easy tasks test basic understanding of the request. Medium ones add context, constraints, or conversational language. Hard ones include mixed Russian and Kazakh, typos, incomplete data, or conflicting instructions. This split quickly shows where the model breaks and where it stays strong.

Then take 2-3 models and run them with the same prompt and the same settings. Do not rewrite the prompt for each model, or the comparison will lose its meaning. If one model got a friendlier wording, you are no longer comparing models, but different conditions.
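
A sketch of such a run over an OpenAI-compatible API. The model IDs are placeholders; the point is that nothing varies between models except the model name:

```python
from openai import OpenAI

client = OpenAI(api_key="...")  # any OpenAI-compatible endpoint

SYSTEM_PROMPT = "Answer the customer in the language of the question."
MODELS = ["model-a", "model-b", "model-c"]  # placeholder IDs

def run(dataset: list[dict]) -> list[dict]:
    results = []
    for model in MODELS:
        for item in dataset:
            resp = client.chat.completions.create(
                model=model,
                temperature=0,  # identical settings for every model
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": item["text"]},
                ],
            )
            results.append({
                "model": model,
                "id": item["id"],
                "answer": resp.choices[0].message.content,
            })
    return results
```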

After the run, do not look only at the overall score. Open the failed answers by hand and read them one after another. Usually that is when it becomes clear what the dataset is missing: real wording, short messages, mixed language, regional words, or non-obvious constraints. Those are the cases to add to the next version.

It helps to mark not only that an error happened, but also what type it was. For example, the model misunderstood the meaning, answered too generally, ignored the no-hallucination rule, or did not understand the Kazakh phrasing. After 20-30 such reviews, the picture is usually very clear.

Once the dataset starts to look like real work, freeze its version before comparing. Give it a name, save it as a separate file, and do not change it until the test is done. Otherwise, the first model will be tested on one dataset and the next model on another.
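
Freezing can be as simple as a versioned copy plus a checksum; a minimal sketch:

```python
import hashlib
import shutil

def freeze(path: str, version: str) -> str:
    """Copy the dataset under a versioned name and record its checksum,
    so every model is compared against exactly the same file."""
    frozen = f"dataset_{version}.jsonl"
    shutil.copyfile(path, frozen)
    with open(frozen, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    print(f"{frozen} sha256={digest[:12]}")
    return frozen

freeze("dataset_working.jsonl", "v1")
```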

A good first dataset does not have to be pretty. It has to catch the mistakes that later hurt support, sales, or internal processes. If, after such a run, the team understands what to change in the prompt, response rules, or model routing, the dataset is already doing its job.

How to measure results without overcomplicating things

For practical evaluation, you do not need one "smart" score. You need simple metrics that the team can check by hand and repeat next month. If a metric does not help choose a model or find a weak spot, it is better not to track it.

When the answer is easy to verify, use pass/fail. This works well when the model must return a fact rather than a nice formulation: a contract number, an amount, a date, a request status, or a yes/no answer based on a rule. Either the model got it right or it did not.

For search and data extraction, it is better to look at individual fields rather than the whole answer at once. If the model extracted the customer name correctly but got the amount and date wrong, you already know exactly where it is breaking.

Usually it is enough to count the share of correct values for each field and separately mark format errors if the field must follow a strict form.
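
A sketch of per-field scoring, assuming the gold and extracted values are stored side by side for each item:

```python
from collections import Counter

# Per-field accuracy: the share of items where the extracted value
# matches the gold value exactly. Field names are illustrative.
def field_accuracy(items: list[dict]) -> dict[str, float]:
    correct, total = Counter(), Counter()
    for item in items:
        for field, gold_value in item["gold"].items():
            total[field] += 1
            if item["extracted"].get(field) == gold_value:
                correct[field] += 1
    return {f: correct[f] / total[f] for f in total}

items = [
    {"gold": {"name": "Aizhan", "amount": "12000", "date": "02.11.2025"},
     "extracted": {"name": "Aizhan", "amount": "1200", "date": "02.11.2025"}},
]
print(field_accuracy(items))  # {'name': 1.0, 'amount': 0.0, 'date': 1.0}
```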

Two practical metrics are often forgotten: latency and cost per successful answer. A cheap model that often makes mistakes can end up being more expensive in real work. A simple formula quickly clears things up: divide the total run cost by the number of successful answers. That makes the comparison more honest.
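
A quick illustration with made-up numbers: if model A costs $2.00 per run and gets 160 of 200 answers right, a success costs $0.0125. If model B costs $0.80 but gets only 40 right, a success costs $0.02. The "cheap" model turns out to be the expensive one.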

If you run models through a single gateway like AI Router, these numbers are easier to collect in one place. This approach has another advantage: you can point the base_url at api.airouter.kz and run different models through one OpenAI-compatible endpoint without changing the SDK, the code, or the prompts. But the logic itself does not depend on the tool: success, cost, and response time should live together, not in separate spreadsheets.
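
In code, the switch is one line; the /v1 path below is an assumption based on the usual OpenAI-style URL layout:

```python
from openai import OpenAI

# Before: the provider's default endpoint.
# client = OpenAI(api_key="...")

# After: one OpenAI-compatible gateway; prompts and the rest of the
# code stay exactly as they were.
client = OpenAI(base_url="https://api.airouter.kz/v1", api_key="...")
```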

The overall percentage for the whole dataset almost always hides the problem. Break results down by typos, mixed language, input length, and task type.

That kind of split quickly shows the real picture. A model may score 85% on short clean requests and drop to 52% on long messages with conversational spelling. For support, that is not an edge case; it is an ordinary workday.

Instead of one final number, it is better to keep a short table by task: field extraction, classification, knowledge-base answers, paraphrasing. Each row gets its own metrics. Then it is clear where the model fits right away and where it needs a different prompt, routing to another model, or human review.
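
Building that table takes a few lines once each result row carries a task label and a pass flag; a sketch:

```python
from collections import defaultdict

def per_task_table(results: list[dict]) -> dict[str, float]:
    """Pass rate per task, instead of one blended score."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in results:
        total[row["task"]] += 1
        passed[row["task"]] += int(row["passed"])
    return {task: passed[task] / total[task] for task in total}

rows = [
    {"task": "field extraction", "passed": True},
    {"task": "field extraction", "passed": False},
    {"task": "classification", "passed": True},
]
print(per_task_table(rows))
# {'field extraction': 0.5, 'classification': 1.0}
```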

Example dataset for customer support

For the first run, 200 Kazakh-language e-commerce support requests are enough. That already gives a realistic picture: where the model understands the customer and where it confuses meaning, response language, or tone. Such a dataset is more useful than a dozen neat demo phrases.

Do not take only "clean" requests. In real support, people write briefly, with mistakes, emotionally, and sometimes mix Kazakh and Russian. One customer will write "Where is my order?", another will send a long seven-line complaint with dates, amounts, and a threat to stop buying.

The sample should cover topics the team sees every day: delivery and timing, order cancellation, product or money refunds, bonuses and discounts, and disputed cases with payment or duplicate charges.

These topics are useful not for variety alone. They test the model in different ways. Delivery requires precise understanding of status. Returns and cancellations quickly show whether the model invents store rules. Bonuses often fail on small details: expiration date, earning conditions, exclusions.

Add requests in different forms to the dataset. About half can be very short, like "When will delivery happen?" or "My bonuses are not showing." The rest should be longer: with order history, frustration, extra details, and several questions at once. That is where real AI assistant quality testing shows up.

What to check in each answer

Do not look only at factual accuracy. A support answer has several other simple layers to check:

  • the model correctly understood the customer's question
  • the answer is in the same language the customer used
  • the tone is calm and polite
  • the text does not promise things the store does not do
  • the answer does not miss an important detail, such as an order number or refund time

Mark errors that cost money separately. If the model incorrectly promised a refund, mixed up cancellation conditions, or gave a discount that does not exist, that is no longer an ordinary miss. It is a risk to revenue and to the team that will later have to sort out the conflict manually.

In practice, it helps to add a simple "money risk" flag. Then you can quickly see the difference between an awkward answer and an error after which the store loses money, the customer, or both.

Where teams usually go wrong

The first mistake is simple: the dataset contains only neat, literary Kazakh. In real life, almost nobody writes like that. Users mix Kazakh with Russian words, drop letters, type on phones, and do not use diacritics. If you only test the model on phrases like "The payment did not go through," it will show a nice result and fail on a message like "tolem otpedi" or "money disappeared from the card."

The second mistake is close behind. The team cleans the data too aggressively. Typos are removed, punctuation is normalized, short complaints are rewritten into smooth text. The dataset becomes easier to check, but it no longer resembles the real stream. For LLM evaluation in Kazakh, it is better to keep some of the noise. Otherwise, you are measuring the work of an editor, not an assistant.

A single score for different tasks often breaks the picture too. Chat, knowledge-base search, and form filling require different behavior. In chat, the model can ask a follow-up question. In search, it must not invent and should rely on the found text. In forms, field accuracy and answer format matter most.

At minimum, separate the scenarios: conversational requests, questions that require document search, and tasks for extracting or filling fields.

Another common mistake is changing the prompt during comparison. One model is tested with the old instruction, another with the new one, and a third already has a couple of examples added. After that, the numbers lose meaning. First freeze the prompt, answer format, and parameters. Then compare models.

And finally, many people look only at answer quality and forget about latency. That is an expensive mistake. For an internal assistant, the difference between 2 and 8 seconds is sometimes more important than 3% accuracy. If a support bot responds slowly, people still go to a human operator.

Quick check before running

Before launching, go through the dataset like an editor, not like a researcher. A good dataset often breaks not on the model, but on small things: duplicates, vague labeling, and cases nobody can reproduce a week later.

Start with contrast. There should be both easy requests and unpleasant ones. If you only have neat phrases like "Where can I download the document?", the model will show a result that is too flattering. Add noise: Kazakh mixed with Russian, typos, short voice transcripts, context-free requests, and blunt wording.

Then remove duplicates. If ten examples differ only by city name or order number, you are not expanding the dataset, just inflating it. It is better to keep one good example per pattern and add a different type of error.

Before running, check five things:

  • the dataset includes both easy and unpleasant cases
  • the examples do not repeat the same scenario under different names
  • any colleague can understand the labeling without a call or verbal explanation
  • you have a separate list of critical errors
  • the team can run the same test again in a week the same way

Be strict with labeling. If one person writes "partially correct" and another does not understand how that differs from "correct but rude," the argument will start before the evaluation. Give a short rule for each label and one example. That is usually enough.

Keep the list of critical errors separate from the overall score. For customer support, that can be an invented tariff, a missed restriction, a wrong address, dangerous advice, or an answer in the wrong language. Even if the model gets a high average score, one such miss can make the result unusable.
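
One way to enforce that separation is a hard gate: a run with any critical error is unusable, whatever its average score. A sketch with illustrative error labels:

```python
# Critical error labels are illustrative; adjust them to your domain.
CRITICAL = {"invented_tariff", "wrong_language", "dangerous_advice",
            "missed_restriction", "wrong_address"}

def run_is_usable(results: list[dict]) -> bool:
    """A single critical error blocks the run, whatever the average score."""
    return not any(set(row.get("errors", [])) & CRITICAL for row in results)
```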

The last check is simple: open the instruction and imagine that tomorrow another person runs the test. If they need to message you and ask, "What counts as an error here?", the dataset is still rough. When the rules are readable on the first try, you will get numbers you can trust.

What to do after version one

The first version of the dataset is almost always weaker than it seems. That is normal. The value starts not on the day it is built, but when the team runs tests after every meaningful change: a new prompt, a different model, a provider switch, changes in retrieval, filters, or post-processing.

If you do not do this regularly, a benchmark for the Kazakh language quickly turns into an archive. It exists, but it no longer shows what actually broke yesterday. A short, living dataset with 80 scenarios that you run every week is much more useful than a folder with 800 examples nobody returns to.

The best source of new cases is production mistakes. The user asked in mixed Kazakh-Russian, the model misunderstood the intent, mixed up the form of address, or invented a fact — that example should be added to the dataset immediately. Do not wait for a quarterly review. Catch the failure, anonymize the data, record the expected answer, and include the case in the next run.

Compare models not by general impression, but by the same dataset version, the same prompt, and the same scoring rules. Only then can you see whether the system got better or simply answered the easy examples more successfully.

If the dataset helps catch mistakes before release, then it is already useful. That is enough for a strong start.