Aug 24, 2024·8 min read

Open-Weight Model: Choosing for the Internal Stack

How to choose an internal LLM: compare open-weight models by size, languages, response format, and GPU needs on real-world tasks.


Why it is easy to get this wrong

A general ranking almost always leads you in the wrong direction. A model may look great in tests and still be weak at the thing your company will actually pay for. For business, the gap between “writes well” and “solves the task reliably” is huge. If you choose an open-weight model only by its place in a table, the team often buys someone else’s scenario, not its own.

The problem usually starts earlier than it seems. The same internal LLM stack may include a chat for employees, knowledge-base search, field extraction from contracts, and request classification. That is four different tasks. A model that holds a pleasant conversation does not always keep the response format without mistakes. And a model that neatly pulls out dates, amounts, and document numbers may answer in a dry way and handle multi-step dialogue poorly.

Teams often mix a pilot with real production load. On 20 manual prompts, almost any decent model looks convincing, especially if the project team picked those prompts itself. In real work, everything changes: long documents arrive, users write off-template, the queue grows, and answers must come back in one format without manual cleanup. Because of that, a model that looked good in a demo can start breaking the process a week later.

Inside the internal environment, a mistake costs more than in an external chat. It is not enough to look at text quality. You need to know where data is stored, who can access prompts, whether auditing is possible, and how the system handles personal data. For teams in Kazakhstan, this is often not a formality. Requirements around storing data in the country or within the company immediately rule out part of the options, even if they are strong on quality.

A good example is handling incoming emails in a bank or retail company. The team takes a large model with a high benchmark score, tests it on a few emails, and is happy with the result. Then it turns out that the model confuses categories in short messages, sometimes breaks JSON, and the security team does not allow the chosen access scheme. They have to roll back and compare again.

Teams do not fail because they evaluate models carelessly. The mistake happens when they compare on the wrong task, the wrong data, and the wrong operating mode. If you separate the short pilot from the real workload right away, half of the false leaders will disappear on their own.

What to lock down before comparing

Comparison breaks at the very beginning if the team starts with something too broad, like “we need a good model.” An open-weight model should be tested on what people do every day: handle requests, extract fields from a contract, write a short document summary, or find an answer in an internal policy.

Usually 3-5 scenarios are enough. For each scenario, gather 30-50 real examples. Include not only neat documents but also messy cases: typos, OCR after scanning, internal abbreviations, emails with chunks of tables, and files where Russian and Kazakh are mixed in the same text. Otherwise the test will go smoothly, and in production the model will start confusing terms and breaking structure.

Then define the answer you need. For request routing, a strict JSON with set fields works best. For comparing reports, a table is more convenient. For a short answer to an employee, free text is fine, but even there it is better to set the length, style, and a ban on made-up facts. If the format is not set in advance, the team often chooses the most “pleasant” answer instead of the most useful one.

For each scenario, it helps to write down five things in advance: what input the model gets, what language or language mix the text comes in, what output format is needed, how many seconds you can wait for an answer, and how much a request may cost.
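
It helps to write these boundaries down as a small record rather than keep them in someone's head. A minimal sketch in Python; the field names and values here are illustrative, not a required format:

# One test scenario, written down before any model comparison starts.
# Field names and values are illustrative; adapt them to your own process.
scenario = {
    "name": "procurement_document_extraction",
    "input": "scanned contract or invoice, OCR text, 1-10 pages",
    "languages": ["ru", "kk", "mixed ru/kk with English product names"],
    "output_format": {
        "type": "json",
        "fields": {"supplier": "string", "amount": "number", "date": "YYYY-MM-DD"},
        "missing_value": None,  # the model must return null, not "not specified"
    },
    "max_latency_seconds": 3,
    "max_cost_per_request_kzt": 5,  # placeholder ceiling in your billing currency
    "examples": "30-50 real documents, including messy OCR and mixed languages",
}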

A small example shows why this matters. Suppose the procurement team checks incoming documents and expects JSON from the internal LLM stack with the fields “supplier,” “amount,” and “date.” If one model answers in 2 seconds and makes 3 mistakes out of 100, while another answers in 7 seconds and often returns free text instead of JSON, the second one loses, even if its wording sounds better.

These boundaries remove taste-based arguments. After that, you compare not the model’s “smartness” in general, but its behavior on your data, in your languages, and within your budget.

How model size affects the result

Model size affects more than just quality. It changes latency, response cost, and how much GPU you need in the environment. For an internal LLM stack, this quickly becomes a practical question: wait 1-2 seconds or 10, run one card or a whole node.

Small models, such as 7B-8B, are often a good fit for simple tasks. They answer quickly, are cheaper to run, and fit more easily into local infrastructure. If you need to label requests, extract a contract number, or turn text into a template, this size often gives a solid result.

But small models have a weak spot. They are more likely to lose the instruction on a long prompt, mix up fields in JSON, and handle long documents less well. This is especially noticeable when one request contains rules, exceptions, and pieces of source text in Russian and Kazakh.

Medium models, around 14B-32B, are usually more stable in work scenarios. They follow response formats better, hallucinate less during data extraction, and handle long context more calmly. For requests, acts, emails, and internal policies, this is often the most sensible range.

Large models, 70B and above, tend to win where a document needs deeper analysis. For example, when the model has to compare contract clauses, find a conflict in the terms, and explain the risk in simple language. On tasks like that, the difference between 14B and 70B can be noticeable even without fine-tuning.

At the same time, a bigger model is not always the best choice for your task. Sometimes a 14B model with good fine-tuning answers more accurately than a larger base version. If you are choosing an open-weight model, look not at the number of parameters alone, but at the errors in your own set of examples.

A rough rule is simple: 7B-8B works for simple classification, short answers, and basic field extraction; 14B-32B handles strict instructions, longer context, and a stable format better; 70B+ is needed for complex documents, ambiguous cases, and multi-step reasoning.

Also check quality after quantization. In a pre-compression test, a model may look excellent, but after 4-bit compression it may start breaking JSON, skipping numbers, or handling terminology worse. If the team wants to reduce GPU requirements, compare the quantized version you will actually run in production. That is the version that shows the real balance of speed, cost, and quality.

Language and terminology

Even a strong model often fails not on logic, but on the words people use every day. For an internal LLM stack, this shows up immediately: in a demo, everything looks fine, but in real conversations the model mixes up meaning, translates terms badly, or loses the thread of the email.

If you are choosing an open-weight model for Kazakhstan, do not test only clean Russian. Give it Kazakh, mixed input, and normal office language, where one paragraph may contain Russian text, Kazakh phrases, English product names, and a couple of abbreviations. That is the kind of material that quickly shows whether the model understands context or is just guessing from a pattern.

A good test set includes several request types: a short question in Russian, the same question in Kazakh, a mixed request with terms from your industry, a long service email with nested context, and text with typos, colloquial words, and abbreviations.

Industry terms break especially often. In banking, these may be “scoring,” “delinquency,” “KYC,” and internal product names. In telecom, it may be tariffs, service bundles, and operational codes. In medicine, abbreviations can change the meaning with a single letter. If the model replaces a term with something broader, that is already a signal. The answer may sound smooth, but it will be useless for an employee.

Long emails test something else. A model can often handle a short question using general statistics. But an email half a page long, with customer history, dates, amounts, and a request to prepare a response, quickly shows whether it can hold the context to the end. Some models answer the first two paragraphs well and miss the details in the middle.

Also stress-test the model on errors in the input. People write “the contract didn’t load,” “pls send it,” “invoice,” switch keyboard layouts, and leave out punctuation. If the model starts inventing translations, correcting words that were not wrong, or changing the meaning of a phrase, log that case. What you need is not a vague “like it / don’t like it” score, but a list of specific failures: where it lost the term, where it mixed up Russian and Kazakh, and where it simplified the wording so much that the legal meaning disappeared.
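
It is easier to keep those failures in a structured log than in chat threads. A minimal sketch of one entry; the fields are an assumption, adjust them to your own process:

# One entry in the shared error log. Field names and values are illustrative.
error_case = {
    "model": "model-a-14b",                # placeholder model name
    "scenario": "support_email_summary",
    "input_language": "mixed ru/kk",
    "failure_type": "term_replaced",       # e.g. term_replaced, languages_mixed, format_broken
    "detail": "replaced the internal product name with a broader generic term",
    "example_id": "email_0137",            # link back to the original request
}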

In practice, one such error log is more useful than ten abstract benchmarks. It shows which model can go into a pilot and which one should stay for draft tasks.

Response format without manual cleanup


If the model’s output goes into CRM, scoring, or an internal service, pretty text only gets in the way. The team does not need a “smart” paragraph, but a stable JSON that code can accept without guessing or extra checks.

This is where models often slip. One adds an explanation before the JSON, another changes a field type, and a third writes an empty string where you expected null or an empty array. In testing, that looks minor, but in production it turns into hours of post-processing.

Set a strict response template from the start. Specify field names, the type of each value, and rules for empty cases. If a field is not found, the model should return null, not “not specified,” “no data,” or its own comment.

A good test is simple: the same set of documents, the same prompt, and the same response schema for several models. If you run such comparisons through one OpenAI-compatible endpoint, as in AI Router, the work becomes much easier: the model changes, but the code around it stays the same.
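
A minimal sketch of such a comparison in Python, assuming any OpenAI-compatible endpoint; the base URL, key, model names, and field set here are placeholders, not a fixed API:

import json
from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible gateway; the URL, key, and model names are placeholders.
client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")

PROMPT = (
    "Extract fields from the document and return only JSON with the keys "
    '"supplier" (string), "amount" (number), "date" (YYYY-MM-DD). '
    "Use null for any field that is not present."
)

def count_broken_json(models, documents):
    """Run the same documents through several models and count format failures."""
    results = {}
    for model in models:
        broken = 0
        for doc in documents:
            resp = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": PROMPT},
                    {"role": "user", "content": doc},
                ],
                temperature=0,
            )
            text = resp.choices[0].message.content or ""
            try:
                data = json.loads(text)
            except json.JSONDecodeError:
                broken += 1  # commentary before the JSON, or broken JSON
                continue
            if not isinstance(data, dict) or set(data) != {"supplier", "amount", "date"}:
                broken += 1  # missing, extra, or renamed fields
        results[model] = broken
    return results

# Example: count_broken_json(["model-a-8b", "model-b-32b"], real_documents)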

Check at least three task types: extracting fields from a contract, invoice, or form; assigning labels such as document class, risk, topic, or text language; and writing a short summary in 2-3 sentences without fluff or repetition.

This is where it becomes clear who keeps the structure and who writes “like a person” but in a way that is inconvenient for the system. For business, the second option is usually worse, even if the text sounds nicer.

Edge cases need separate testing. Give the model a document with a missing date, two amounts in one paragraph, mixed Russian and Kazakh, and abbreviations from your industry. Then see how it behaves under uncertainty: does it leave the field blank honestly, or does it confidently invent an answer?

It also helps to count not only accuracy but the cost of integration after the model’s response. If the team has to add date normalization, currency cleanup, removal of introductory phrases, and broken JSON handling, then the format does not fit. Even if the model itself is cheap, integration will eat up the savings.

A simple rule helps filter out weak options quickly: if a developer cannot connect the model’s output to the existing pipeline in one evening without a layer of hacks, pick another model or tighten the schema. A convenient response format saves more than a few percentage points on a nice demo.

How to estimate GPU requirements

Start the estimate not with the GPU name, but with three things: how much memory the model weights take, what context the real task needs, and how many requests will run at once. For an internal LLM stack, that is usually enough to rule out half the wrong options quickly.

Quantization lowers the base memory footprint, but it does not solve everything. A hypothetical 8B-parameter model in 4-bit may fit on a single mid-range card if you are testing one short request. The same model with a 16k context, JSON output, and several parallel users may hit memory limits because of the KV cache and cause a sharp rise in latency.
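
A back-of-envelope calculation makes the effect of the KV cache visible. The architecture numbers below are illustrative and only roughly match an 8B-class model, so treat this as a sketch, not a sizing tool:

# Rough VRAM estimate: weights plus KV cache. The architecture numbers are
# illustrative (roughly an 8B-class model); check your model's own config.
def estimate_vram_gb(params_b=8, bits_per_weight=4,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     kv_cache_bytes=2,  # fp16 KV cache
                     context_tokens=16_000, parallel_requests=4):
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_cache_bytes  # K and V
    kv_gb = kv_per_token * context_tokens * parallel_requests / 1e9
    return weights_gb, kv_gb

weights, kv = estimate_vram_gb()
print(f"weights ~{weights:.1f} GB, KV cache ~{kv:.1f} GB")
# With these assumptions: about 4 GB of weights but over 8 GB of KV cache for
# four parallel 16k-token requests - the cache, not the weights, sets the limit.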

A single request and a whole department’s work are different modes. If a lawyer or analyst sends one prompt every few minutes, what matters is stable latency on one stream. If 30 employees are classifying emails or request summaries at the same time, you need to count throughput: how many requests per minute the system can handle without a queue or slowdown.

Then leave headroom. In practice, part of memory goes not only to generation, but also to support processes, monitoring, embeddings, reranking, and overnight batch jobs. If the server is loaded to the limit on a quiet day, it will slow down at peak time exactly when the business expects an answer in 2-3 seconds.

Deployment options

Local deployment on one server is convenient when the load is steady and data cannot leave the company. The downside is simple: if the model sits idle at night, the hardware still comes out of the budget.

A shared cluster is better when several teams share GPUs across tasks. But then you need to watch the queues carefully. One heavy fine-tuning job or batch inference job can easily hurt latency for an interactive scenario.

External deployment works if you do not want to run your own GPU team. For companies in Kazakhstan, this is usually considered together with data residency and auditing requirements.

The final evaluation only comes from a real log-based test. A toy test with short prompts almost always lies. Take 100-200 real requests: long emails, documents, Russian and Kazakh text, and the required response format. Then look at p95 latency, memory usage, and peak behavior. Usually this is the point where it becomes clear whether one GPU is enough or a very different plan is needed.
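
To read off p95 rather than the average, record per-request latency while replaying the log at a realistic level of parallelism. A minimal sketch; send_request stands in for whatever client call you already use:

import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(send_request, requests, parallel=10):
    """Replay real requests at a fixed level of parallelism and record latencies.

    send_request is a placeholder for your own client call, for example a thin
    wrapper around an OpenAI-compatible endpoint.
    """
    def timed(req):
        start = time.perf_counter()
        send_request(req)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        latencies = sorted(pool.map(timed, requests))

    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"avg": statistics.mean(latencies), "p95": p95}

# Example: measure_latencies(my_client_call, real_requests, parallel=30)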

Example for a typical business task


Let’s take a common internal support task: an employee asks a question, the system searches the knowledge base and old emails, and then returns a short summary for a manager. In a real company, such questions are rarely “clean.” Part of the text is in Russian, part of the terms are in Kazakh, and the emails contain lots of abbreviations, product names, and internal codes.

A manager usually does not need a long model answer. They need a strict JSON that can be sent straight into a CRM or ticketing system. For example, like this:

{
  "topic": "delivery delay",
  "risk": "medium",
  "next_action": "check the status with logistics and reply to the customer by 12:00"
}
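
Before an answer like this goes into CRM, it is worth checking it in code rather than by eye. A minimal sketch of such a check; the field names follow the example above, and the allowed risk values are an assumption:

import json

ALLOWED_RISK = {"low", "medium", "high"}  # assumed set of risk labels

def parse_ticket(raw_answer: str):
    """Return the parsed ticket, or None if the model broke the format."""
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        return None  # commentary before the JSON, or broken JSON
    if not isinstance(data, dict) or set(data) != {"topic", "risk", "next_action"}:
        return None  # missing or extra fields, or not a JSON object at all
    if data["risk"] not in ALLOWED_RISK:
        return None  # risk written as free text instead of an allowed label
    return data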

This example quickly shows the difference between model classes. A small 7B-8B model often does a decent job of finding obvious matches in the knowledge base and making a simple summary. But it more often confuses similar topics, handles mixed Russian and Kazakh text less well, and sometimes breaks the response format. Instead of JSON, it may return a paragraph, add an extra field, or write risk in free text.

A medium model, such as 14B-32B, usually behaves more steadily. It understands better that “invoice,” “contract,” “act,” and internal Kazakh terms may belong to the same process. It loses context from the email less often when an employee forwards a long thread, and it more often returns JSON without manual cleanup.

A large model is not always needed. If requests are repetitive, the knowledge base is clean, and the response schema is strict, a large model gives little benefit for its GPU cost. This is especially noticeable in the morning and at the end of the month, when the queue grows and response time immediately affects the SLA. Under that load, the team often wins not with the largest model, but with a good search layer and a medium-sized model.

You can check this without a long pilot. Take 100-200 real requests, keep the mixed Russian-Kazakh wording as it is, check not only text quality but JSON validity, and then measure latency during a normal hour and at peak time.

For an internal LLM stack, this is a good test because it exposes weak points right away. If the small model gets the topic and risk wrong, the savings disappear fast: employees spend time checking it. If the large model gives the same result but answers twice as slowly, there is no point paying for it. For comparisons like this, teams in Kazakhstan often run several open-weight models on the same requests inside local infrastructure or through a gateway like AI Router, where the model can be changed without rewriting the code.

Where teams lose time and money

Most of the money is not lost on the model itself, but on bad decisions around it. The team takes the biggest model, connects it to every scenario, and expects one option to cover employee chat, contract review, and field extraction from requests. Usually the opposite happens: costs rise, latency rises, and quality in some tasks barely changes.

A common mistake is looking only at a nice demo. In the demo, the model answers smoothly, keeps structure, and understands the request the first time. Real data brings up abbreviations, broken PDFs, internal terms, mixed Russian and Kazakh, and old document templates. After that, it turns out the choice was made for someone else’s scenario, not the team’s own.

Another source of loss is response format. If the team did not check JSON, fields, value order, and resilience to small failures in advance, developers end up fixing the parser for weeks. One extra comment in the answer, one line like “let me explain my choice” — and automated processing breaks. In the end, the model may be smart, but the product works worse.

Money is most often wasted in the same places:

  • One large model is used for everything.
  • Testing is done on presentation examples instead of your own data.
  • Latency is not measured at 10, 50, and 100 parallel requests.
  • It is assumed the parser will survive any output format.
  • Edge cases are not kept in a shared error log.

Teams also often misjudge latency. One request in 3 seconds looks acceptable. But when 40 operators start the same scenario at once, response time starts getting in the way of work. If the model runs in an internal environment, GPU requirements, queueing, and cost growth at every new peak are added to that.

Many people think an error log is unnecessary bureaucracy. That is a bad kind of saving. Without it, the team argues by feeling over and over: “this model seems better,” “this one confuses fields more often.” A proper log quickly shows where the model loses meaning, where it breaks format, and which requests lead to long answers.

A simple example: the procurement team asks to extract the amount, deadline, and contract number from incoming files. If you choose a model that is too large without checking format and load, you can end up with an expensive and slow service that still needs manual cleanup of the output. It is much cheaper to take a model suited to the task, run it on your own documents, and immediately check response stability.

Quick checklist before the pilot


A pilot usually breaks not on model choice, but on input data and evaluation criteria. If you test on examples that are too clean, an open-weight model almost always looks better than it will in real work.

  • Collect 100-200 live requests from the real flow. Do not fix typos, broken phrases, mixed Russian and Kazakh, extra fields, or strange wording.
  • For each scenario, define the response format in advance. If the system needs JSON with the fields “category,” “reason,” and “risk_score,” check exactly that.
  • Before launch, set a limit for latency and cost. For an internal assistant, 2-3 seconds may be normal, while for batch processing the cost difference becomes noticeable quickly.
  • Run the same set separately on Russian and Kazakh input. Watch where the model mixes up terms, blends languages, and loses meaning in short replies.
  • Decide right away how the team will handle security and auditing: personal data leaks, log storage, access to keys, AI-content labels, and a trace for every request.

A small example makes the difference clear very quickly. A bank may test request summaries on neat Russian texts and get a good result. In production, voice transcriptions, slang, errors, and short messages in Kazakh will arrive. Then it becomes clear whether the model is suitable for the internal LLM stack.

If at least two items on this list are not ready yet, the pilot is better delayed by a few days. That pause usually costs less than replacing the model, reworking the response format, and fixing the audit after the first launch.

What to do after the first comparison

After the first table of scores, do not choose one model for everything. In practice, a set of 2-3 models for different tasks almost always wins. One handles cheap high-volume requests, another keeps strict JSON better, and a third is needed for complex documents, long context, or more confident work with Russian and Kazakh.

This approach is usually cheaper and calmer to operate. If you keep one open-weight model for all scenarios, the team will quickly run into either cost, latency, or quality problems on narrow tasks.

A short pilot on the real request stream brings more value than another week of lab testing. Usually 5-10 working days on live tasks is enough: support requests, contract review, request routing, and search in the internal knowledge base. Look not only at the average quality score, but also at how much manual cleanup employees still need to do.

During the pilot, it helps to put the metrics into one simple card for each model:

  • answer quality on your request set
  • latency, preferably with p95 rather than only the average
  • cost per useful result, not just the price per million tokens
  • share of broken response format
  • how much time was spent on integration, logs, and access control

Often it is the last point that ruins the beautiful choice on paper. A model may answer well, but if the team spends a week fixing parsing, auth, retries, and audit, it is already a bad candidate for the internal environment.

If you need one OpenAI-compatible endpoint and the data must stay in Kazakhstan, you can run the same pilot through AI Router or compare scenarios on airouter.kz. This is convenient when the team wants to switch models without rewriting the SDK, code, and prompts, and at the same time check whether the chosen setup is really production-ready.

After the pilot, define routing rules. Short classification tasks can go to a more compact model, field extraction from documents to a model that keeps the response format stable, and complex legal or medical texts to a stronger model with a separate budget limit.
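
Written down as code, these rules stay visible and easy to change. A minimal sketch; the task types and model names are placeholders:

# A simple task-based routing table. Task types and model names are placeholders.
ROUTING = {
    "classification": "compact-8b-model",      # cheap, high-volume requests
    "field_extraction": "format-stable-14b",   # keeps strict JSON
    "complex_documents": "large-70b-model",    # separate budget limit applies
}

def pick_model(task_type: str) -> str:
    # Fall back to the mid-size model when the task type is unknown.
    return ROUTING.get(task_type, "format-stable-14b")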

And one more practical step: refresh the test set every month. Real requests change quickly. If you do not do that, yesterday’s winner will start losing on new documents, new wording, and new user mistakes.

Frequently asked questions

Why not choose a model only by benchmark?

Because a general ranking shows the average picture, not your task. A model may write well, but still fail on JSON, confuse categories in short emails, or lose meaning in long documents with mixed Russian and Kazakh.

How many scenarios do we need for a proper comparison?

Usually 3-5 real scenarios are enough. Use what employees do every day: classify requests, extract fields, search the knowledge base, or summarize a document briefly.

How many examples should we use for testing?

For each scenario, collect at least 30-50 live examples, and for a pilot it is better to use 100-200 requests from the real flow. Do not clean up typos, noisy OCR, or mixed languages, otherwise the test will look too good.

What model size is best to start with?

For simple classification and short answers, 7B-8B is often enough. If you need a stable format, a longer context, and fewer mistakes on working documents, it makes sense to start with 14B-32B.

Should Russian and Kazakh be tested separately?

Yes, and it is better to test both right away. Give the model clean Russian, clean Kazakh, and mixed input with abbreviations, internal terms, and typos, because that is where meaning often breaks down.

Why is response format more important than nice text?

If the answer goes into CRM, a ticketing system, or scoring, code expects a strict structure, not a nice paragraph. One extra comment before JSON can break the pipeline and create extra work for developers.

How do we know if one GPU will be enough?

Look at three things: memory for the weights, context length, and the number of simultaneous requests. Even a small model fits on one card in a simple test, but with long context and parallel load it can quickly run into memory and latency limits.

What should we check after quantization?

Compare the version you will actually run in production. After 4-bit quantization, a model may break JSON more often, skip numbers, and handle terms less reliably, so a pre-quantization test says little about real operation.

Should one model cover all tasks?

A combination of 2-3 models often works best. One is used for cheap high-volume requests, another for strict formatting, and a third for complex documents that need higher accuracy and longer context.

When should the pilot be postponed?

Move the pilot if you do not yet have live requests, a fixed response format, a latency limit, and security checks. A couple of days of preparation is usually cheaper than changing the model, fixing the parser, and reworking the audit after launch.