Code-switching in Chats: What Breaks in Russian-Kazakh Dialogue
Code-switching in chats often breaks meaning, tone, and facts in replies. This pre-release review framework helps catch failures in Russian-Kazakh dialogues.

Where meaning is lost when the language changes
With code-switching in chats, meaning usually breaks not on rare or difficult phrases, but on the most ordinary ones. A user starts a message in Russian, adds a clarification in Kazakh, changes the tone to something more polite or, on the contrary, more direct, and the model clings only to individual words.
The problem is usually not dictionary translation, but loss of intent. A person is not writing a set of equivalents, but a request, a doubt, a complaint, or a gentle refusal. If the model translates a phrase word for word, it loses the point of the message.
A phrase in Kazakh can sound softer than its direct Russian version. A request that looks neutral in one language can sound like a demand after a literal transfer into another. For support, that is already a risk: the chat answers more sharply than it should, and the conversation quickly turns into conflict.
The same thing happens with short insertions like "iya" ("yes"), "zhok" ("no"), or "okay, tysyndim" ("okay, got it"). For a person, that is a natural transition. For the model, it is sometimes a false signal. It decides that the user agreed, when in fact they just showed they understood the explanation and are waiting for the next step.
The address and role get mixed up
In Russian-Kazakh chats, tone depends not only on the words, but also on the form of address. The model can confuse the formal "you" (Russian "vy", Kazakh "siz") with the informal one ("ty", "sen"), lose the distance, or start speaking as if the customer had already agreed. Even worse, it can mix up roles: who is asking, who is confirming, and who should take the next step.
In a banking conversation, this shows up right away. A customer writes: "Karty buqtalyp qalypty, ne isteymyn?" ("My card seems to be blocked, what do I do?"). Then they add in Russian: "I am on a business trip, I need the restriction removed urgently." If the model does not hold the context well, it answers with a general instruction about reissuing the card instead of helping with unblocking.
One mistake like that rarely stays alone. The model misunderstood the intent, chose the wrong tone, then mixed up the role, and after that it builds the whole conversation on a false version of the request. From there, even formally correct phrases sound off.
Meaning is lost not at the level of vocabulary, but at the level of intent, politeness, and sequence of actions. If the model once took "just checking" as "confirming", from that point on it is answering a different question.
What breaks most often
The most visible failures in Russian-Kazakh dialogue are rarely tied to spelling. Usually it is the meaning that breaks: the model takes words from two languages but loses the connection between facts. The text looks smooth, but the answer is already wrong.
A common problem is mixing up the names of services, products, and documents. The user writes in Russian about card reissue, and then in Kazakh asks about a certificate or a power of attorney. The model merges this into one request and suggests the wrong set of steps. In the end, the person asked about a service and got an answer about a document.
Another typical failure is when details move into the next reply. A date from a booking question ends up in an answer about payment. A fee amount gets attached to a different operation. Even a grammatical form can shift the meaning: "for my mother" may be read by the model as a request from the mother, not for her. In a bilingual chat, these shifts are easy to miss because each part of the phrase sounds fine on its own.
When languages are mixed, the model often loses track of who the request is about. That is one of the most unpleasant errors. A person may be asking about themselves, a child, or an employee, but the model attaches the action to the last mentioned person. After that, it asks for someone else’s documents, gives the wrong order of approvals, or confuses who must show up in person.
There is also a more visible failure: the reply is in Russian, but a Kazakh fragment suddenly slips into the middle for no reason. Sometimes this looks harmless, but in practice it breaks a term or instruction. A phrase like "Submit the application today, then qol qoyu kerek" leaves an unnecessary question: what exactly needs to be done next and where.
When the model replies in Kazakh, it often simplifies the meaning more than it should. The main idea remains, but deadlines, exceptions, amount limits, or the list of required documents disappear. For an everyday conversation, that may be acceptable. For money, documents, and consent, that loss of detail already changes the user’s decision.
The same failures tend to repeat:
- the service name got mixed up with the document name
- the date, amount, or word form stuck to a different topic
- the model confused who is acting and for whom
- the Russian reply inserted a Kazakh fragment in the wrong place
- the Kazakh reply shortened the meaning and dropped details
The worst part is that these mistakes look "almost correct". They do not stand out, but they change the action, deadline, or recipient. For chats in a bank, clinic, or government service, one such inaccuracy is already enough for a person to do the wrong thing.
Which messages cause the most failures
The model is more likely to make mistakes not on long, polished questions, but on the live messages people send in a hurry. Russian-Kazakh chats contain broken phrases, local slang, mixed keyboard layouts, and words without obvious context. That is exactly where the meaning drifts first.
Short messages break conversations more often than it seems. A phrase like "did not open", "bolmady" ("didn't work"), "once more", or "that card" makes sense to a person only together with the previous messages. If the model has weak context tracking, it starts guessing: it suggests the wrong step, changes the topic, or answers too generally.
Slang, transliteration, and simple typos cause a lot of problems. A user may type "perevod otmen bola ma" ("can the transfer be cancelled"), "kaspi ga tuspedi" ("it didn't arrive in Kaspi"), "sms kelmed" ("the SMS didn't come"), or mix Cyrillic and Latin in one word. For a person, that is a small thing. For the model, it is several tasks at once: identify the language, restore the word form, and not lose the intent.
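The mixed-layout case is the easiest of these to detect automatically. A minimal Python sketch that flags tokens containing both Cyrillic and Latin letters (a heuristic for catching messages worth adding to the test set, not a full language identifier):

```python
import unicodedata

def script_of(ch: str) -> str:
    """Classify a single letter as CYRILLIC, LATIN, or OTHER by its Unicode name."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "CYRILLIC"
    if name.startswith("LATIN"):
        return "LATIN"
    return "OTHER"

def mixed_script_tokens(text: str) -> list[str]:
    """Return tokens that contain both Cyrillic and Latin letters."""
    flagged = []
    for token in text.split():
        scripts = {script_of(ch) for ch in token if ch.isalpha()}
        if {"CYRILLIC", "LATIN"} <= scripts:
            flagged.append(token)
    return flagged

# "Menде" mixes Latin "Men" with Cyrillic "де" in one word
print(mixed_script_tokens("Menде Gold karta, limit kandai?"))  # ['Menде']
```

Messages that trip this check are good candidates for the test set, because they force the model to restore the word form before it can even guess the intent.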
A separate risk area is names, addresses, and branch names. These parts of the text look like noise, but they often change the meaning of the request. If a person writes "I was at the branch on Abaya, not on Al-Farabi" or "My name is not Асылжан (Assylzhan), it is Асылхан (Assylkhan)", one wrong word leads to the wrong check, the wrong office search, or a profile mix-up.
Failures often hide in replies where the question and the clarification live in the same line. For example: "Is the card blocked, if I paid the debt yesterday?" The model may answer only the first part and miss the condition "if I paid yesterday". After that, the answer sounds smooth, but it does not solve the problem.
Here are a few everyday examples where meaning often shifts:
- "I need a refund, but I bought the product yesterday"
- "My address changed, can I do it without a visit?"
- "The SMS did not come, the number is the same"
- "Is the branch on Seifullin open today?"
When the language changes right in the middle of a sentence, the model sometimes keeps the overall tone but loses the logic of the request. It latches onto the Russian part and ignores the Kazakh part, or the other way around. That is why it is better to test not polished sample phrases, but the messages people actually send from their phones: short, uneven, and a little messy. Those show fastest whether the system understands meaning, not just individual words.
How to build a live dialogue set for testing
Synthetic examples rarely catch real failures. It is better to build the set from real support and sales chats where people are in a hurry, mix languages, shorten words, and jump from one topic to another.
That is especially important for Russian-Kazakh chats. A user may start in Russian, clarify a detail in Kazakh, and finish with a mixed phrase. If the set is too "clean", code-switching will pass the test and break in production.
It is best to start with topics that already put pressure on the team: order or request status, card blocking, limits, transfers, refunds, plan changes, service activation, documents, verification, IIN (the individual identification number), complaints, cancellations, and disputed charges.
Do not include only successful conversations. You need both simple and tricky cases. Simple ones check the basics: greetings, language choice, short questions. Tricky ones break meaning: negation, deadlines, amounts, promotional conditions, legal wording, polite requests, and sharp messages.
A good set is truly bilingual. Add dialogues entirely in Russian, entirely in Kazakh, and mixed replies within the same sentence. Do not over-correct the spelling. Typos, colloquial forms, and transliteration often create the failure in the first place.
Alongside each reply, it helps to store not only the "correct answer", but also the expected meaning. Otherwise it is hard to tell later whether the model got the language wrong, the intent wrong, or the details wrong. Usually five short notes are enough: what the user wants, which language is easier to continue in, which facts must not be lost, what counts as an acceptable answer, and what the answer must not do.
For example: "Men kartany zhapsym ('I closed the card'), but will the charge still go through?" The meaning here is that the client already closed the card, is asking about a future auto-payment, and wants a short explanation without changing the topic. If the model answers only about closing the card and misses the auto-payment, the test should catch it.
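The five short notes can live as a plain record next to each dialogue. A minimal Python sketch (field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass

@dataclass
class TestDialogue:
    """One bilingual test case plus the five short notes described above."""
    turns: list[str]            # the user messages, in order
    user_intent: str            # what the user wants
    reply_language: str         # which language is easier to continue in
    must_keep_facts: list[str]  # facts that must not be lost
    acceptable_answer: str      # what counts as an acceptable answer
    must_not: str               # what the answer must not do

case = TestDialogue(
    turns=["Men kartany zhapsym, but will the charge still go through?"],
    user_intent="asks whether an auto-payment will still be charged on a closed card",
    reply_language="mixed, continue in Russian",
    must_keep_facts=["card is already closed", "question is about a future auto-payment"],
    acceptable_answer="short explanation about the auto-payment, no topic change",
    must_not="answer only about closing the card and miss the auto-payment",
)
print(case.user_intent)
```

Keeping the expected meaning next to the dialogue makes it possible later to say whether the model got the language, the intent, or the details wrong.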
It is useful to split the set by dialogue stage. At the start of the chat, the model should understand the language and the topic correctly. In the middle, it should keep the context when the person switches between Russian and Kazakh. At the end, it should summarize correctly: what has already been done, what is left, and what next step was promised.
Such a set ends up uneven. That is a good thing. It looks more like real people and therefore shows what will break before release.
How to run the check before release
Before launch, you do not need a huge dataset. For the first run, 30-50 short dialogues are enough, where people really mix Russian and Kazakh: a greeting in one language, a clarification in the other, numbers, dates, and a request to "explain it simply". Such a set is quick to collect and easy to read by hand.
For each dialogue, define one expected result. One, not five. Otherwise the team will start arguing over wording instead of catching failures. One line is enough: "the bot asks for IIN", "the bot explains the fee without changing tone", "the bot does not confuse the payment date and the statement date".
A simple table with four fields works well:
- test dialogue
- expected result
- model answer
- failure note
After that, run the same set on several models. The point of comparison is to look at the same input, not at similar examples. If the team uses a shared LLM gateway, this step gets easier: you can switch models without changing the integration. For example, in AI Router, for such a comparison it is enough to keep the same compatible call and run the scenarios through a different route or model without rewriting the SDK and prompts.
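This comparison loop is easy to sketch. A minimal Python harness, assuming any callable that sends the same input to a given model; here it is stubbed so the sketch runs standalone, while in production `call` would wrap the gateway's OpenAI-compatible chat request with only the model name changing (endpoint and model names are assumptions):

```python
def run_suite(call, models, dialogues):
    """Run the same dialogues against several models and fill the
    four-field review table: dialogue, expected result, answer, failure note."""
    rows = []
    for model in models:
        for d in dialogues:
            answer = call(model, d["dialogue"])
            rows.append({
                "dialogue": d["dialogue"],
                "expected": d["expected"],
                "model": model,
                "answer": answer,
                "failure_note": "",  # filled in by a human reviewer
            })
    return rows

# Stub instead of a real gateway call, so the harness runs without a network.
def fake_call(model, dialogue):
    return f"[{model}] reply to: {dialogue}"

dialogues = [{"dialogue": "Karty buqtalyp qaldy, but the SMS did not arrive",
              "expected": "explains unblocking steps, does not suggest reissue"}]
table = run_suite(fake_call, ["model-a", "model-b"], dialogues)
print(len(table))  # 2 rows: one per model for the single dialogue
```

The point of the structure is that every model sees the identical input, so any difference in the table is a model difference, not an input difference.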
Do not look only at obvious errors. In Russian-Kazakh chats, meaning often breaks quietly. The model may replace a calm tone with a sharp one, mix up a fact after a language switch, answer only the Russian part of the question, or translate a term in a way that changes the customer’s action.
For example, a customer writes: "Karty buqtalyp qaldy ('the card got blocked'), but the SMS did not arrive." The expected result here is simple: the bot explains the unblocking steps and does not steer the conversation into card reissue. If one model advises waiting for the SMS and another immediately suggests reissuing the card, the difference is already caught before release.
After that, do not change everything at once. Fix the prompt first. Then check the model or route change separately. And each time, run the same set again. That way the team sees exactly what broke the answer: the instruction, the model choice, or the request path.
This kind of run takes a couple of hours, but it saves days of complaint handling. If at least 3-5 dialogues out of 30 change meaning when the language switches, it is better to delay release and fix the problem cases in the test set.
A simple example from bank support
A customer writes to the bank chat in Russian: "What is the fee for cash withdrawal from a credit card?" A couple of messages later they switch to Kazakh: "Menде Gold karta, limit kandaI?" ("I have a Gold card, what is the limit?", typed with a mixed keyboard layout). For a live dialogue, this is normal. The person does not think about the language. They are simply clarifying the question in the way that is easiest for them.
The bot reads the request almost correctly, but makes one quiet substitution. It sees the words "Gold", "card", and "limit" and decides it is about a Gold debit card. In the reply, it writes about the cash withdrawal limit for a debit card, even though the customer originally asked about a credit card and its terms.
On the surface, everything looks fine. The answer is polite, the sentences are smooth, and the numbers seem plausible. That is exactly why this kind of mistake is often missed. The reviewer glances at the text and thinks the meaning was preserved. But the meaning has already shifted. The bot changed the card type, and with it the rules changed too.
One wrong word breaks the whole advice. If "credit" turned into "debit", everything else starts to fall apart: the fee, the available limit, the grace period, and the warning about interest. The customer makes a decision on the wrong basis. For a bank, that is no longer a small inaccuracy, but a risk of a complaint and extra support load.
This kind of failure is easy to catch before launch. No complex setup is needed. It is enough to take a short dialogue and check a few simple things: which product the customer named, what exactly they asked after the language switch, whether the bot kept the card type in the reply, and whether it added conditions that were not in the question.
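One of those checks, "did the bot keep the card type", can be automated with a few lines. A simplified Python sketch; the keyword lists are illustrative, not a production rule:

```python
def kept_card_type(question: str, answer: str) -> bool:
    """Check that a credit-card question is not silently answered
    as a debit-card question. Keyword matching is a rough heuristic."""
    credit_words = ("credit", "kredit")
    debit_words = ("debit",)
    q = question.lower()
    a = answer.lower()
    asked_credit = any(w in q for w in credit_words)
    answered_debit = any(w in a for w in debit_words)
    # If the customer asked about a credit card, the reply must not
    # switch to a debit card without mentioning credit at all.
    if asked_credit and answered_debit and not any(w in a for w in credit_words):
        return False
    return True

print(kept_card_type(
    "What is the fee for cash withdrawal from a credit card?",
    "The cash withdrawal limit for your Gold debit card is ...",
))  # False: the card type was silently swapped
```

A human reviewer still reads the flagged replies; the heuristic only makes sure the quiet substitution does not slip past a tired eye.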
If the team runs such dialogues through several models via one gateway, comparison becomes faster. The same scenario can be run before release and you can immediately see where the model loses context when the language switches.
The value of this test is that it catches not a loud failure, but a quiet one. The chat does not crash. The answer does not look strange. The mistake hides in one word, and that is why it is dangerous.
How to evaluate the answer without a complex scheme
For a quick check, you do not need a table with twenty points. In practice, five short questions are enough.
First, look at the meaning of the original request. If a person asks in Kazakh to temporarily block a card, and the model in Russian starts explaining how to reissue the card, the answer is already bad, even if the language is formally understandable.
Then verify the facts. Numbers, dates, amounts, names, tariff names, and department names break more often than general text. If the question had an amount of 15,000 tenge and the reply says 50,000, that is not an inaccurate translation, but a meaning error. The same goes for product names and internal terms: sometimes the model translates them on its own and only makes things worse.
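The amount check is easy to automate if number formats are normalized first, since users and models write "15 000", "15,000", and "15000" interchangeably. A minimal Python sketch (integers only, no decimals or currency parsing):

```python
import re

def extract_amounts(text: str) -> set[int]:
    """Pull integer amounts out of text, tolerating '15 000', '15,000', '15000'.
    Separators inside a digit run are treated as thousand separators."""
    found = set()
    for m in re.finditer(r"\d[\d\s,.]*\d|\d", text):
        digits = re.sub(r"[^\d]", "", m.group())
        if digits:
            found.add(int(digits))
    return found

def amounts_preserved(question: str, answer: str) -> bool:
    """Every amount from the question should reappear in the answer."""
    return extract_amounts(question) <= extract_amounts(answer)

print(amounts_preserved("I sent 15 000 tenge", "Your 50,000 tenge transfer ..."))
# False: the amount changed between question and answer
```

A failed check here is exactly the "not an inaccurate translation, but a meaning error" case: the text may read fine, but the number the customer acts on is wrong.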
Tone should also be checked separately. The user may have written calmly and politely, but the reply suddenly became sharp, too casual, or commanding. For bank support, a clinic, or a government service, that is already a noticeable failure, even when the facts are right.
Another common marker is unnecessary translation. If a person writes in a mixed way and uses familiar names in Russian, there is no need to force everything into Kazakh. That makes the answer feel unnatural and can even confuse the user. It is better to keep stable names when they help the meaning stay clear.
Simple scale
To avoid arguing over every example, it is useful to split errors into three levels:
- Critical: the meaning changed, the number is wrong, the action is risky, the reply is in the wrong language and that gets in the way.
- Tolerable: the sentence sounds a little awkward, but the meaning, facts, and steps are still correct.
- Clean: the request is understood, the facts match, the tone is appropriate, and there is no unnecessary translation.
If you are unsure, ask one question: can a person take the correct next step after this answer? If yes, the answer usually passes. If not, it should not go to production.
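The three-level scale can be encoded directly from the quick questions above. A minimal Python sketch, assuming a reviewer answers each question with yes/no; a wrong-language reply that blocks understanding counts as a meaning failure:

```python
def grade(meaning_ok: bool, facts_ok: bool, action_safe: bool,
          tone_ok: bool, translation_ok: bool) -> str:
    """Map reviewer answers onto the three-level scale.
    A reply in the wrong language that gets in the way should be
    recorded as meaning_ok=False or action_safe=False."""
    if not (meaning_ok and facts_ok and action_safe):
        return "critical"   # meaning changed, wrong number, or risky action
    if tone_ok and translation_ok:
        return "clean"      # facts match, tone fits, no unnecessary translation
    return "tolerable"      # a bit awkward, but facts and steps are correct

print(grade(meaning_ok=True, facts_ok=True, action_safe=True,
            tone_ok=False, translation_ok=True))  # tolerable
```

Encoding the scale keeps reviews consistent: two people disagreeing about a reply end up arguing over one yes/no answer, not over the whole verdict.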
Team mistakes before launch
Teams often test a chat on two comfortable paths: separate clean Russian and separate clean Kazakh. In a demo, that looks neat, but in real life people write differently. One customer starts a sentence in Russian, adds a Kazakh clarification, and enters the contract number or name in Latin letters. That is exactly where the model most often loses meaning, confuses intent, or answers only half the question.
The problem is rarely in the whole dialogue. Usually it hides in one short spot: a language switch in the middle of a sentence, an address in one language and a detail in another, or a term the model decided to translate even though it should not have. That is why overly long test dialogues get in the way. The team sees that the answer is bad, but no longer knows where the failure began.
It is better to cut a long conversation into short chunks of 2-4 turns. That makes it easier to find the cause: the model did not understand the question, lost context after the language switch, or handled an entity like a sum, date, or name incorrectly.
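Cutting conversations into chunks is a one-liner worth standardizing so everyone slices the same way. A minimal Python sketch:

```python
def chunk_turns(turns: list[str], size: int = 4) -> list[list[str]]:
    """Cut a long conversation into pieces of at most `size` turns
    (2-4 works well), so a failure can be localized to one short spot."""
    return [turns[i:i + size] for i in range(0, len(turns), size)]

conversation = ["Hello", "Karta zhonynde suragym bar", "Which card do you mean?",
                "Gold", "What is the limit?", "It depends on your plan",
                "And for a credit card?"]
for chunk in chunk_turns(conversation, size=4):
    print(chunk)
```

When a chunk fails, the cause is usually visible inside those few turns: a misunderstood question, lost context after a language switch, or a mishandled sum, date, or name.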
Another common mistake: the team edits the system prompt, gets one good answer, and moves on. That almost always creates a false feeling that everything is better. Any change should be followed by a rerun of the old tests. Otherwise you fix one scenario and quietly break three others that used to work.
The average score is also misleading. If 90 dialogues passed and 10 rare cases failed, the overall score looks decent. But those rare cases are exactly what later reaches support, complaints, and manual handling. Look not only at the final number, but also at the slices: language switches within one reply, a Russian request with Kazakh named data, a Kazakh question with a Russian service term, and a repeat request after an unsuccessful first answer.
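Per-slice pass rates make this concrete. A minimal Python sketch that takes tagged results and reports each slice separately, so a 20% slice cannot hide behind a 92% average (slice names are illustrative):

```python
from collections import defaultdict

def pass_rate_by_slice(results):
    """results: list of dicts with 'slice' and 'passed' keys.
    Returns the pass rate per slice instead of one overall number."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for r in results:
        totals[r["slice"]][1] += 1
        if r["passed"]:
            totals[r["slice"]][0] += 1
    return {s: p / t for s, (p, t) in totals.items()}

results = (
    [{"slice": "clean_russian", "passed": True}] * 45 +
    [{"slice": "clean_kazakh", "passed": True}] * 45 +
    [{"slice": "mid_sentence_switch", "passed": False}] * 8 +
    [{"slice": "mid_sentence_switch", "passed": True}] * 2
)
print(pass_rate_by_slice(results))
# overall score would be 92%, but the switch slice is only 20%
```

The slices worth tagging are the ones named above: switches within one reply, Russian requests with Kazakh named data, Kazakh questions with Russian service terms, and repeat requests after a failed first answer.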
Check PII masking separately. This is often forgotten when the customer writes in mixed language. The model may hide the card number in a Russian template but miss the IIN, name, or phone number if part of the text is in Kazakh and part is written in Latin letters or with typos. For a bank, clinic, or telecom operator, that is no longer a minor defect, but a direct risk.
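A PII leak check can run over every model reply automatically. A simplified Python sketch; the patterns are illustrative and much narrower than real masking rules (a Kazakhstani IIN is 12 digits, a card number 16 digits, a local mobile starts with +7 or 8):

```python
import re

# Simplified patterns, for the test harness only; real masking is broader.
PII_PATTERNS = {
    "card":  re.compile(r"\b(?:\d[ -]?){15}\d\b"),            # 16 digits, optional separators
    "iin":   re.compile(r"\b\d{12}\b"),                        # 12-digit IIN
    "phone": re.compile(r"(?:\+7|8)\s?7\d{2}[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}"),
}

def leaked_pii(masked_reply: str) -> list[str]:
    """Return which PII kinds survived masking in the model reply."""
    return [kind for kind, pat in PII_PATTERNS.items()
            if pat.search(masked_reply)]

# The card number is masked here, but the IIN slipped through in the
# Kazakh part of the sentence.
print(leaked_pii("Kartanyz **** **** **** 1234, IIN 990101350123 boyynsha ..."))
```

Running this over replies in both languages catches exactly the case described above: the Russian template hides the card number, while the IIN or phone in the Kazakh fragment stays visible.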
A good pre-launch test looks boring, and that is normal. Short live dialogues, repeated runs after every change, and separate labeling of rare failures do more good than one pretty overall score in a report.
Quick pre-release check
Before release, it is not enough to check that the chat "answers in general". In a Russian-Kazakh dialogue, the model often breaks on small things: it loses the amount, mixes up the date, changes the customer’s name, or answers only in one language even though the user has already switched.
A good pre-release check should be short and repeatable. If the team changes the prompt, the model, or the route through an API gateway, the same run should be launched after every change, no exceptions.
What the check should include
- The set includes mixed replies in both directions: from Russian to Kazakh and from Kazakh to Russian.
- The tests include messages with amounts, dates, numbers, and names.
- The team has already agreed on which errors block release: switching the reply language, distorting the amount, the wrong deadline, losing the negation in a phrase like "do not translate", and confusing the account owner with the recipient.
- The same set is run after every change.
- Tricky cases are stored separately in the master list and later turned into a new test or marked as an acceptable variation.
That is enough to cut off most unpleasant surprises before release. You do not need a complex setup at the start. You need a set of live messages, clear failure rules, and simple discipline: change something, run the whole list again.
If you want to quickly test the system with one example, take this line: "I transferred 15,000 tenge yesterday, but the recipient has not seen the money." If the model loses the amount, the reply language, or the meaning of "yesterday", it will also make mistakes on simpler messages in production.
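Even that single line can be checked semi-automatically. A minimal Python sketch with three heuristics, not a metric: the amount survives, the timing is acknowledged ("keshe" is assumed here as a Kazakh rendering of "yesterday"), and the reply is not empty of letters:

```python
import re

def quick_smoke_check(reply: str) -> dict:
    """Rough checks for the single test line above."""
    # The amount may come back as '15 000', '15,000', or '15000'.
    amount_kept = bool(re.search(r"15[\s,.]?000", reply))
    # The timing should be acknowledged in either language.
    timing_kept = any(w in reply.lower() for w in ("yesterday", "keshe"))
    # The reply should contain actual letters, Cyrillic or Latin.
    non_empty = bool(re.search(r"[A-Za-zА-Яа-яЁё]", reply))
    return {"amount": amount_kept, "timing": timing_kept, "non_empty": non_empty}

print(quick_smoke_check(
    "Your 15,000 tenge transfer from yesterday may take up to 24 hours."
))  # all three checks pass
```

If a model fails any of these on such a simple line, the test set above will almost certainly surface more failures.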
What to do next
After the first wave of checks, do not rush to build a large evaluation system. For Russian-Kazakh dialogue, a short regression set works better, one that the team runs before every new model version, prompt, or route. Even 20-30 dialogues catch a lot of failures if they come from real cases.
Start small and keep the set stable. Let it include phrases with language switching inside one message, clarifications after a misunderstanding, short two-word replies, and messages with terms like tariff, card, delivery, or limit. Then add new examples only after real failures in production or a pilot.
The working habit here is simple: compare models only on the same bilingual set, keep the evaluations next to the answers themselves, mark tricky cases separately, and record exactly what changed, whether it was the model, the system prompt, the temperature, or the routing.
If you do not do that, in two weeks nobody will remember why one answer was accepted and another was not. One shared file or a simple table already removes half the confusion. It is also useful to keep a short note there: "switched the reply language", "ate the negation", "translated the term too loosely", "was polite but missed the meaning".
When the team compares models from different providers, extra integration differences only get in the way. In that situation, a single compatible gateway helps. For example, AI Router lets you run the same set through different models and providers via one OpenAI-compatible endpoint. That is convenient for comparison: you do not need to rewrite the code for a new route each time, and the results are easier to compare.
Do not wait for a perfect set. To begin with, a short selection that can be run in 10-15 minutes and discussed quickly with support, product, and QA is enough. In a month, the team will have not a vague impression of how code-switching in chats works, but a list of specific failures that should no longer make it into release.