Transliteration in Search: How to Account for Three Versions of a Term
Transliteration in search helps people find articles even when they type a term in Russian, Latin script, or with a typo. Here we break down the dictionary, the index, and the checks.

Why one term turns into several variants
People almost never type a term exactly the way it was once written in the glossary. They type from memory, on the fly, from a phone, or after a call where they only heard the name by ear. That is why one term quickly turns into two or three forms: the original, the Russian version, and a mixed one.
It starts with ordinary behavior. A user does not switch keyboard layouts right away, forgets the official spelling, or does not know it at all. They may search for "OpenRouter", "опенроутер", or "open router" and be sure it is the same query.
This happens especially often with brands. People remember the name with their ears, not their eyes. After a meeting, one person writes "DeepSeek", another writes "Дипсик", and a third splits the word into two parts because it is easier to read that way. That is not always a mistake. Often it is just a convenient way to pass along a familiar sound using the letters that are available at the moment.
Inside a company, the teams themselves make the confusion worse. Developers keep the term in Latin script because that is how it appears in code and documentation. Support and editors convert it to Cyrillic so the text reads more easily. Sales and account managers often write it the way customers say it.
Even short names spread quickly. "AI Router" in notes can become "аи роутер", "AI роутер", or "эйай роутер". For a person, the difference is small. For search in a knowledge base, these are already different strings.
Small differences also pile up: someone writes a term as one word, someone else uses a space; one person adds a hyphen, another does not; part of the team uses a number, another uses a word or an abbreviation. Search will not guess on its own that all of these forms are close in meaning. If the index stores them as separate words, an article with the official spelling will not answer a query in spoken form.
So variants do not appear because users are careless. They are created by speech, keyboard layout, habits across teams, and the environment itself, where one term lives at the same time in the interface, in code, in chats, and in articles.
Where search loses the right answer
The failure usually starts not in the article, but in the word match. In a knowledge base, a piece of content may be called "OpenRouter API", while a person searches for "оупенроутер", "open router", or "опен роутер". If search looks only for an exact spelling, it does not understand that this is the same term.
The problem is especially noticeable in short queries. The user types one word, and search has very little context. An extra space, a different keyboard layout, or a Russian rendering of an English name can already break the match.
Autocomplete often makes things worse. It usually pulls the most common forms from titles or past searches. If only the "OpenRouter" form is fixed in the index, the suggestion will not help someone who started typing "оупен". They do not see the right direction and decide that the topic is not in the knowledge base.
Old articles also create noise. Editors may once have written "Chat GPT", then switched to "ChatGPT", while leaving "чат гпт" in older instructions. As a result, one term breaks into several isolated variants, and each article lives with its own spelling.
Losses usually happen in the same places: exact title matching, autocomplete that knows only one form of a term, older materials with a previous name, and filters that treat rare forms as noise.
For the user, it all looks simple. They enter a familiar word, get zero results, and close the tab. They do not think about the index, tokens, or normalization. They just see an empty result page and decide the knowledge base is weak.
That is especially unpleasant in work systems where people look for answers on the move. If an engineer searches for an article about Qwen and uses the version the team is used to, but search stays silent, they will not try five more spellings. They will go to chat, ask a colleague, or miss the needed instruction. That is how one unaccounted-for term variant turns into extra questions and lost time.
Which variants are worth collecting ahead of time
Transliteration in search rarely comes down to just a pair like "router" and "роутер". People enter terms the way they saw them in chat, in the interface, in an old guide, or in a message from a colleague. That is why the same meaning quickly splits into several forms.
You should collect not every possible distortion, but only the forms people really use. Usually a few groups are enough:
- different alphabets: "router" and "роутер", "API" and "апи";
- merged and split spelling: "файнтюнинг", "файн тюнинг", "fine tuning";
- common typos and missing letters, if they repeat in real queries;
- abbreviations and full forms: "LLM" and "large language model", "база знаний" and "БЗ";
- old and new names, if a section or feature has already been renamed.
It is especially useful to collect variants in advance for product names, model names, roles, legal terms, and internal abbreviations. Search usually handles ordinary words without a variant dictionary; terms need one.
If your knowledge base has an article about PII masking, some people will enter "PII", some will enter "персональные данные", and someone may simply type "пд". For them, it is one meaning. For an index without normalization, these are three different queries.
The same thing happens with more ordinary cases. An article is called "Настройка prompt cache", but users enter forms like "промпт кеш", "promptcache", and "кэш промптов". If these forms are not connected, the user will decide there is no answer in the knowledge base, even though the text has long been published.
The rule is simple: add variants where the term affects article discovery, not the beauty of the glossary. Start with titles, tags, feature names, and common zero-click queries. The list is usually small, but it quickly improves knowledge base search.
How to build a variants dictionary
It is better to build the dictionary from real queries, not guesses. Take search logs from the last month and pick frequent queries that point to the same term but look different. Usually you will quickly see pairs like "OpenRouter", "open router", and "оупенроутер", or "Qwen", "квен", and "qwen 3". That is already enough to show the scale of the problem.
Then merge duplicates into one term card. One card equals one meaning. In it, it is useful to store the main form, the found variants, and a short note about where the term is used. If you mix two close but different concepts into one record, search will start confusing documents.
What to store in the card
The minimum is simple: the official form from the interface or documentation, conversational forms from chats and tickets, Latin and Cyrillic forms, repeated typos, and the list of sections where the term appears. That is usually enough to connect a user query with the right article.
Do not try to guess every variant at once. First collect the forms that have already appeared at least a few times. That way the dictionary grows from live data, not from the team’s imagination.
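The term card described above can be sketched as a small data structure. The field names here are illustrative, not a required schema; the point is that one card holds one meaning with its main form, live variants, and a short note.

```python
from dataclasses import dataclass, field

# A minimal sketch of a term card: one card = one meaning.
@dataclass
class TermCard:
    canonical: str                                     # official form from the interface or docs
    aliases: list[str] = field(default_factory=list)   # conversational, Latin/Cyrillic, repeated typos
    sections: list[str] = field(default_factory=list)  # where the term appears in the knowledge base
    note: str = ""                                     # short note about usage

# Hypothetical card based on the examples in this article
qwen = TermCard(
    canonical="Qwen 3",
    aliases=["qwen3", "квен 3", "квен3"],
    sections=["models", "api"],
    note="Model family; team chat often uses the Cyrillic form.",
)
```

Keeping cards this small makes the quarterly cleanup easy: a card with no used aliases is a candidate for removal, and two cards with the same canonical form are candidates for merging.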
A good source of new variants is support. Ask the team to flag queries where the user clearly looked for the right article but search found nothing. This is especially useful where Russian, English, and local spellings live side by side. For products with many model and provider names, the problem appears very quickly.
How to keep the dictionary from spreading out of control
Once a quarter, it helps to clean the dictionary. Remove forms nobody uses, merge duplicate cards, and check whether new official names have appeared. If a term is obsolete, do not delete it right away. Leave it as an old form so search can still find the needed material.
This kind of dictionary is rarely large. But it quickly covers the most visible search misses and reduces the number of queries where the user "almost found it" but did not reach the answer.
How to account for variants in the index
If a user searches for "Qwen 3", "Квен 3", or "qwen3", search should lead to the same article. To do this, you do not need to rewrite the text itself. It is better to keep a separate normalization layer and separate index fields.
First, bring the query and the indexed terms to one format. Usually lowercasing, using one space instead of several, and removing extra dots, hyphens, and random symbols is enough. After this processing, the query " qwen-3 " becomes the same form as "Qwen 3".
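The normalization step above can be sketched in a few lines. This is a minimal version under the assumptions just listed: lowercase, collapse whitespace, and turn common separators into spaces.

```python
import re

def normalize(term: str) -> str:
    """Minimal normalizer: lowercase, collapse whitespace,
    replace dots, hyphens, and underscores with spaces."""
    term = term.lower().strip()
    term = re.sub(r"[.\-_/]+", " ", term)  # separators -> single space
    term = re.sub(r"\s+", " ", term)       # collapse runs of whitespace
    return term

# " qwen-3 " and "Qwen 3" collapse to the same form
assert normalize(" qwen-3 ") == normalize("Qwen 3") == "qwen 3"
```

Apply the same function to both the query and the indexed terms, so the two sides always meet in the same format.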
Then define clear rules for Cyrillic and Latin script. Do not try to guess everything on the fly. It is much more reliable to keep a mapping table for common cases: "q" -> "к", "w" -> "в" or "у" depending on your rule, "xai" -> "икс аи", "llama" -> "лама". This approach works better when the rules are simple and the same for all documents.
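A mapping table along these lines can be sketched as a plain dictionary of whole tokens rather than single letters. The entries below are illustrative assumptions; a real table should come from your own logs and follow one rule for the whole index.

```python
# Explicit Latin -> Cyrillic renderings for common tokens.
# Entries are examples, not a complete table.
TRANSLIT_MAP = {
    "qwen": "квен",
    "llama": "лама",
    "xai": "икс аи",
    "router": "роутер",
}

def cyrillic_variants(term: str) -> list[str]:
    """Return a Cyrillic rendering if any token has a known mapping;
    unknown tokens pass through unchanged."""
    tokens = term.lower().split()
    mapped = [TRANSLIT_MAP.get(tok, tok) for tok in tokens]
    variant = " ".join(mapped)
    return [variant] if variant != term.lower() else []

# "Qwen 3" -> ["квен 3"]
```

Whole-token mappings avoid the trap of one rigid letter table: "xai" maps to "икс аи" as a unit instead of letter by letter.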
It is better to store each term with two things: the main name and the additional variants. The main form is needed to keep the index organized; the additional variants are needed for real queries. For example, for "Qwen 3" you can store the variants "qwen3", "квен 3", "квен3", and the spelling your team uses internally.
In practice, it is useful to index content in at least four fields: display_title for the original title, term_canonical for the main normalized form, term_aliases for the list of variants, and body for the article text. Then search can find the document by the main form and by variants, while the user still sees a clean title without artificial edits.
If you have many terms, do not dump all variants into one large field. Otherwise ranking will become noisy. It is better to give the main form the highest weight, the variants a medium weight, and the article text a lower one.
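The field weighting described above can be sketched as a toy scorer. The weights and field names are assumptions for illustration; a real engine (for example, BM25 with field boosts) does this properly, but the ordering idea is the same: canonical form above aliases, aliases above body text.

```python
# Toy weighted match over the fields described above.
FIELD_WEIGHTS = {"term_canonical": 3.0, "term_aliases": 2.0, "body": 1.0}

def score(doc: dict, query: str) -> float:
    """Sum the weight of every field that contains the query."""
    q = query.lower().strip()
    total = 0.0
    for fld, weight in FIELD_WEIGHTS.items():
        value = doc.get(fld, "")
        values = value if isinstance(value, list) else [value]
        if any(q in v.lower() for v in values):
            total += weight
    return total

doc = {
    "display_title": "Qwen 3 setup",
    "term_canonical": "qwen 3",
    "term_aliases": ["qwen3", "квен 3", "квен3"],
    "body": "How to configure Qwen 3 in your project.",
}

# A canonical-form match outranks an alias-only match
assert score(doc, "qwen 3") > score(doc, "квен 3")
```

The display_title field stays out of scoring here on purpose: the user sees the clean title, while matching runs on the normalized fields.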
This is especially noticeable in systems where dozens or hundreds of model names live side by side. In AI Router, teams work with a large set of models and providers through one API, so forms like Qwen, Llama, DeepSeek, or xAI quickly spread across documentation, chats, and tickets. Without normalization and aliases, search starts understanding only the "official" language, not the one people actually use.
Example with one article
It is easy to see how this works with one simple record. The knowledge base contains an article called "Настройка QazTech Pay". For the editor, everything is clear: the brand is in Latin script, the title is clean, and the term is consistent.
For users, the picture is different. One person types "QazTech Pay" as it appears in the product interface. Another writes "Казтех пэй" because that is easier to hear and remember. A third types "qaz tech pay" with a space in the middle because they do not remember the exact spelling. The meaning of the queries is the same, but the string is different each time.
If search looks only for an exact title match, it will work only for the first version. The second and third queries will either return nothing or push the right article lower. The user will not spend time figuring out why. They will simply decide the article does not exist.
In practice, it is useful to store more than one representation of the term for a single article: the original name, the Cyrillic version, the split form, and the normalized version without extra spaces or shifting case. Then the index connects all of these forms to the same article.
It can look like this:
title: Настройка QazTech Pay
aliases: [QazTech Pay, Казтех пэй, qaz tech pay]
normalized_terms: [qaztechpay, казтехпэй]
This approach solves more than the "Latin vs. Cyrillic" debate. It catches spaces, different capitalization, and the habit of writing a brand the way it sounds. For a knowledge base, that is a small thing. For a person, it is the difference between "found the answer in 10 seconds" and "your search does not work."
If the term is important and appears often in support, do not rely on one nice title. It is better to connect all the live variants people actually use to the article right away.
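The matching behind this example can be sketched as follows. The record mirrors the title/aliases/normalized_terms layout shown above; the normalization is a minimal stand-in (lowercase, collapse spaces, and a squashed no-space form).

```python
import re

def norm(s: str) -> str:
    """Lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", s.lower().strip())

# Record mirroring the example article above
article = {
    "title": "Настройка QazTech Pay",
    "aliases": ["QazTech Pay", "Казтех пэй", "qaz tech pay"],
    "normalized_terms": ["qaztechpay", "казтехпэй"],
}

def matches(article: dict, query: str) -> bool:
    """True if the query hits an alias or a squashed normalized form."""
    q = norm(query)
    squashed = q.replace(" ", "")
    return (
        q in [norm(a) for a in article["aliases"]]
        or squashed in article["normalized_terms"]
    )
```

With this layout, "QazTech Pay", "Казтех пэй", and "qaz tech pay" all resolve to the same article, and the squashed form catches spellings like "qaztechpay" that nobody bothered to list as an alias.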
Mistakes that break search
Search usually breaks not in ranking, but earlier — when the team changes the term too aggressively before indexing, or right at query time. For transliteration, this is especially noticeable: one wrong edit hides the needed article deeper than any weak algorithm would.
The first common mistake is translating a term that users usually do not translate. If someone searches for "OpenRouter" or "ОпенРоутер", but the index stores only the word "маршрутизатор" as a translation of router, search loses the meaning. Product names, libraries, and brands are almost always better kept as a set of variants, not as one "correct" translation.
The second mistake is relying on one rigid letter-substitution table. On paper it looks neat: q = к, w = в, x = кс. In real life, it does not work that way. People write "Qwen" as "квен", "куэн", and simply "Qwen". "xAI" may be typed in Latin script, Cyrillic, or with a space. If you run every word through one scheme, you will lose the forms people actually enter.
Another common failure is ignoring short and conversational forms. Users rarely write a term exactly as it appears in documentation. They shorten it, remove hyphens, change case, and write by sound. "AI Router", "эйай роутер", "аироутер", and "ai-router" are often the same thing for a person. For an index without a dictionary, these are four different queries.
A variants dictionary ages quickly if nobody owns it. New terms come from support, sales, implementation, and customer conversations. If nobody collects fresh forms at least once a month, the knowledge base starts understanding only the language of the article authors.
The most unpleasant mistake is changing the query so aggressively that the meaning changes. Someone searches for "Gemma", and the system decides they meant "gamma". Or they type "RAG", and search sends them to general articles about the common word rag. It is safer not to replace the original query, but to expand it: keep the original, add variants, and rank exact matches above any guess.
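Expansion instead of replacement can be sketched like this. The alias dictionary below is a hypothetical stand-in; the key point is that the original query always stays in the list and comes first, so exact matches outrank any guess.

```python
# Hypothetical alias dictionary keyed by the normalized term.
ALIASES = {
    "gemma": ["джемма"],
    "rag": ["retrieval augmented generation"],
}

def expand_query(query: str) -> list[str]:
    """Expand, never replace: keep the original form first,
    then append known aliases."""
    q = query.lower().strip()
    return [q] + ALIASES.get(q, [])

assert expand_query("Gemma") == ["gemma", "джемма"]
assert expand_query("unknown term") == ["unknown term"]
```

Because the original form is preserved, a query for "Gemma" can never be silently rewritten into "gamma": the worst case is that the extra variants simply match nothing.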
If search finds less after changes than it did before, the problem is usually not the index itself. More often, the team decided too early how the user "should" spell the term.
Quick checks before launch
Before release, you do not need to guess whether people will find the right article. It is easy to check with a short set of real queries.
Collect the 20 most frequent terms from logs, support chats, and article titles. Then test them manually: a person enters the query, opens search, and reaches the right material on the first try. If the article hides on the second page or loses to a less exact match, the problem is already visible.
A minimal set of checks looks like this:
- run common terms without extra words and check that the right article appears immediately;
- enter old names, abbreviations, and previous versions of the term;
- review autocomplete and check whether it shows the forms people actually use;
- save all zero-result queries in a separate log;
- give the editor a simple way to add a new alias without releasing the app.
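The checklist above can be wrapped in a minimal harness. The cases and the search() stand-in are assumptions for illustration; the shape is what matters: each case pairs a real query with the article that should come first.

```python
# Each case is (query, expected first result). Cases are examples only;
# build yours from the 20 most frequent terms in your logs.
CASES = [
    ("Qwen 3", "Настройка Qwen 3"),
    ("квен 3", "Настройка Qwen 3"),
    ("опен роутер", "OpenRouter API"),
]

def run_checks(search, cases=CASES) -> list[str]:
    """Return the queries whose expected article is not the first result."""
    failures = []
    for query, expected in cases:
        results = search(query)
        if not results or results[0] != expected:
            failures.append(query)
    return failures
```

Run it before release and after every index change; a non-empty failures list is the short, honest report of where search still misses.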
Also check renames separately. This is a common break after migrations and documentation updates. A user searches for the familiar term, and search stays silent even though the needed article is already there under a new name.
Autocomplete is also worth checking by hand. If the system knows a variant in the index but does not show it in suggestions, the user often decides the material does not exist. The suggestion should guide, not guess for the person.
For teams working with a large stack of models and APIs, this kind of check saves a lot of support time. One evening of tests gives a much more honest picture than a week of broad metrics. After launch, the routine is simple: once a week, review zero-result queries and add new variants right away.
What to do next
Do not try to cover the entire language of your knowledge base at once. It is better to take a narrow area and bring it to a working state. When transliteration in search is weak, the first noticeable improvements usually come not from complex logic, but from order in the term dictionary.
Start with 50 words and names people search for most often. Usually the same problems appear quickly: Latin and Cyrillic mixed together, old and new spelling, abbreviations, brand forms, and one-letter mistakes. Even such a short list already reduces the number of empty answers.
After that, you need a simple working rhythm. Gather common terms from search logs and support, record the live variants, assign one dictionary owner, and review zero-result queries once a week. It is better to test changes on a small set of typical queries before and after the index update.
One dictionary owner is not bureaucracy; it is a way to avoid arguing from scratch every time. For teams that deploy LLMs in production and work with many model and provider names, like in AI Router on airouter.kz, this is especially useful: new terms appear constantly, and without an owner the search quickly starts understanding only the language of the documentation. That person does not need to invent "perfect" names. Their job is simpler: review zero-result queries, add aliases, and make sure search speaks the language of users.
Frequently asked questions
Why do people search for one term in several different ways?
People write a term the way they remember it from a conversation, chat, or interface. That is why Latin, Cyrillic, spaces, merged spelling, and spoken-style versions all live side by side. For a person it is one meaning, but for search without normalization they are different strings.
Which term variants should we collect first?
Start with the forms that already make articles hard to find: Latin and Cyrillic, merged and split spelling, old and new names, and common abbreviations. That is usually enough to remove most empty results without adding noise to the index.
Should we add all typos?
No, do not collect every typo. Add only the typos and forms that really repeat in logs, support chats, or tickets. Otherwise the dictionary grows too large and search starts confusing similar but different queries.
Where can we get variants for the dictionary?
The best source is search logs from the last month. After that, check support chats, tickets, and conversations where people clearly tried to find an article but could not reach it through search. That is where live forms like “Qwen,” “квен,” and “qwen3” show up quickly.
What is the best way to store variants in the index?
Keep the original title separate from the normalized form and from the aliases. That way the user sees a clean headline, while search matches both the official name and conversational variants like “open router” or “qwen3.”
Should product and model names be translated?
Usually not. Brands, models, and product names are better stored as a set of variants, not as a single translation. If someone searches for “OpenRouter,” search should lead them to the OpenRouter article, not send them to the generic word “router.”
What should we do with old names after a rename?
It is better to leave the old name as an alias. Users often remember the previous term longer than the documentation team does, and they keep searching for it. If you remove the old form right away, you will get zero results even though the answer is already there.
Why does search not find an article even though it exists?
Most often, search only looks for an exact match or does not know the conversational version of the term. The problem gets worse with autocomplete if it only suggests the official form from the title and does not show what people actually type.
How can we quickly test search before launch?
Take a short set of common queries and run them by hand. Check that the right article opens on the first try for the official form, the old name, the abbreviation, and the Cyrillic version. If the query returns nothing or the article drops too low, the fix is needed before release.
Who should maintain the variants dictionary?
Assign one owner for the dictionary, even if the team is small. They do not need to argue about perfect names; they just review zero-result queries, add new aliases, and make sure search understands the language users speak, not just the language articles use.