Jul 25, 2024·8 min read

LLM Cost in Tenge: How to Build an Annual Budget

We show how to calculate the cost of LLMs in tenge for a year: tokens, exchange rates, traffic spikes, a test buffer, and a clear budget for the team.

Why the annual budget often differs from reality

The problem usually is not the math. The team takes the model’s average price, multiplies it by the current request volume, and gets a neat number. Three months later that number no longer looks like reality, because a live LLM product almost never runs on the average scenario.

The mistake is usually built in from the start. The calculation uses a calm month instead of a growth plan. If support, sales, and internal teams use the service, traffic rarely grows in a straight line. New channels, seasonality, feature launches, and even a successful pilot all push it up. In the end, the budget is based on yesterday’s load, while you pay for tomorrow’s.

There is a second trap too: the same scenario can use very different numbers of tokens. A customer request can be short, or it can pull in a long thread, a system prompt, chat history, parts of the knowledge base, and a large response. Even a simple support case can cost differently on Monday and at the end of the quarter. Add retries, repeated requests, and prompt A/B tests, and the spread becomes obvious very quickly.

The third reason is very practical: teams almost always underestimate experiments. While a project exists on paper, it seems as if the money goes only to production. In reality, spending also goes to test environments, model comparisons, new use cases, manual runs, and debugging. If the team wants better answers, part of the limit will almost inevitably burn not on users, but on finding a working setup.

For Kazakhstan and Central Asia, there is another source of mismatch: the exchange rate. Many models are priced in dollars, while the budget is approved in tenge. The rate moves faster than the financial plan. That is why you cannot treat expenses in tenge as a fixed amount for the whole year. They move even if token volume does not grow.

A simple example makes this clear. The team planned for 10 million tokens per month at an average rate and got a clean estimate. Then it added RAG, stored more context, increased answers for complex requests, and tested two new models at the same time. Volume did not rise by 10-15%, but almost doubled. If you add an exchange-rate jump on top of that, the difference between plan and actual no longer looks surprising.

That is why a good budget is built not on one average price, but on ranges: normal load, peaks, tests, and a currency buffer. Otherwise, the estimate only looks neat until the first month of operation.

What to include in the calculation

A yearly budget falls apart when the team calculates one averaged number instead of real scenarios. What you need is not a single monthly forecast, but a set of clear expense lines for each request type.

Start by breaking the load into scenarios: support chat, knowledge base search, document summarization, application review, and an internal assistant for employees. Each scenario has its own prompt length and response size. So input and output tokens should be counted separately, not merged into one number.

In practice, the difference can be large. A short customer question may use 600-900 input tokens and produce 150-300 output tokens. A long complaint or contract can easily consume several thousand input tokens and trigger a much longer response.

Also record the model price and the date you used for the calculation. It is a simple thing, but it saves you during budget approval a month or a quarter later. If the provider changes pricing, you can immediately see where the estimate changed.

A basic table usually needs five fields: the scenario and chosen model, average input and output tokens, the number of requests on a normal day, the number of requests at peak and how long that peak lasts, plus the pricing date and the currency from which tenge will later be calculated.

Do not forget the "dirty" costs that tend to disappear between departments. In a real system, some requests are repeated because of timeouts, and others go to A/B tests, quality review, manual restarts, and prompt tuning. If these are not in the model, the budget is almost always too optimistic. For a more mature estimate, set aside a separate reserve instead of hiding it in the average spend.

If you host the model yourself, add another layer of costs. You do not only need GPUs, but also storage, networking, monitoring, reserve capacity, maintenance, and team time. Even if it feels like the model has already been "bought," it still costs money to run every month.

For companies in Kazakhstan, it is useful to separate two buckets from the start: costs for external models and costs for local hosting. For example, with AI Router you can count API calls separately from tasks that require data residency, audit logs, or your own open-weight models on local infrastructure. That makes the financial model cleaner and leaves fewer gray areas.

How to build the calculation step by step

The usual mistake is familiar: the team takes the average price for 1 million tokens and multiplies it by total volume. That approach is too rough. A working model should show which scenario costs what, when traffic grows, and by which rule the total is converted into tenge.

Build the calculation using the same structure for every scenario. Then it becomes clear where the main load is and where spending is barely noticeable.

1. List the scenarios with the biggest request flow. Usually these are support chat, knowledge base search, document review, and an internal copilot for employees. Do not start with rare cases. Budgets are usually eaten by 2-3 high-volume scenarios.
2. Pull the actual input and output size from logs. Look not at the team’s impression, but at the average number of tokens in the prompt and in the response. It is also useful to take the 90th percentile if responses sometimes grow a lot. Otherwise, the average will paint too pretty a picture.
3. Calculate the monthly cost for each scenario separately. The formula is simple: (requests per month × average input tokens × input price) + (requests per month × average output tokens × output price). If you use two models in one scenario, calculate them separately.
4. Spread the volume across months. User growth rarely moves in a straight line. Support gets spikes at the end of a quarter, retail sees them before promotions, and banks see them during reporting periods. If you expect 30% growth over the year, do not smear it into a thin layer. Show which month the load really goes up.
5. Convert everything into tenge using one exchange-rate rule. Do not change the rate from table to table. Choose one internal company rule: a planned rate for the year or a conservative rate with a buffer. Then finance will see the logic, not a pile of random figures.
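To illustrate why the step about logs suggests looking at the 90th percentile as well as the average, here is a minimal Python sketch. The token counts are invented for illustration, and the nearest-rank percentile is only one of several common definitions, chosen here because it is easy to check by hand.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical output-token counts for one scenario, as they might come from logs.
output_tokens = [180, 210, 220, 240, 250, 260, 300, 320, 900, 1400]

mean_tokens = sum(output_tokens) / len(output_tokens)
p90_tokens = nearest_rank_percentile(output_tokens, 90)

print(f"mean: {mean_tokens:.0f} tokens, p90: {p90_tokens} tokens")
```

In this made-up sample, two long responses pull the 90th percentile to roughly twice the mean, which is the kind of gap an average-only budget hides.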

A small example. If support handles 200,000 requests a month, the average prompt is 900 tokens, and the response is 350 tokens, you can already get a clear monthly amount, then multiply it by the seasonal pattern and build a yearly budget without guessing.
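As a rough sketch of that example in Python: the request volume and token sizes come from the paragraph above, while the prices (1,200 ₸ and 4,800 ₸ per 1 million input and output tokens, borrowed from the lower-cost model in the support example later in this article) and the seasonal percentages are assumptions made up for illustration.

```python
def monthly_cost_tenge(requests, avg_in, avg_out, in_price_per_m, out_price_per_m):
    """Monthly cost in tenge; input and output tokens are priced separately."""
    input_cost = requests * avg_in * in_price_per_m / 1_000_000
    output_cost = requests * avg_out * out_price_per_m / 1_000_000
    return input_cost + output_cost

# 200,000 requests, 900 input and 350 output tokens on average (from the text).
# Prices per 1M tokens are assumed, not quoted from any provider.
base_month = monthly_cost_tenge(200_000, 900, 350, 1_200, 4_800)

# Hypothetical seasonal pattern: load in each month as a percentage of the base month.
seasonal_pct = [100, 100, 110, 100, 100, 120, 100, 100, 130, 100, 110, 140]

yearly_budget = sum(base_month * pct / 100 for pct in seasonal_pct)

print(f"base month: {base_month:,.0f} tenge, year: {yearly_budget:,.0f} tenge")
```

Keeping the seasonal pattern as a separate list makes it easy to show finance which months carry the growth instead of spreading it evenly.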

If the team works through AI Router, it is convenient to use actual volumes by model and avoid rebuilding the calculation when the provider changes. But the principle is always the same: first scenarios, then tokens, then price, then monthly growth, and only after that the amount in tenge.

How to budget for exchange rates without guessing

The mistake usually does not start with the exchange rate, but with the spreadsheet. The team takes provider prices from March, the exchange rate from June, and the demand forecast from September. The result looks tidy, but it does not line up well with real invoices.

First choose one exchange-rate source and stick to it across the whole model. Do not switch sources between sheets and do not plug in a rate "by eye" just to get the total you want. If the company’s finance team already uses an internal benchmark, use that one. Then any disagreement about the LLM budget will be about demand and token usage, not about how accounting converts currency.

One update rhythm is better than "precision"

Next decide how often you update the exchange rate in the calculation. For a yearly budget, one rhythm across the whole document is usually enough: once a month if spending is large and invoices come regularly; once a quarter if the launch is just starting; or whenever the budget is revised, if the company rarely changes its plan.

The problem is not that the exchange rate changes. The problem is chaos. One sheet gets recalculated, another is forgotten, a third is adjusted manually. After that, the budget loses its meaning.

The practical version is simple: keep three scenarios. The base case shows the plan at a calm exchange rate. The cautious case adds a moderate buffer. The stress case is not for panic, but for discussions with finance and procurement. If the difference between the base and stress cases breaks the service’s economics, it is better to see that before launch.

For example, if you calculate 120 million input and output tokens per month and pay for part of the models in foreign currency, do not multiply the entire year by today’s rate. Take a base rate for the main plan, separately calculate the same volume at a higher rate, and see how many tenge get added to the annual spend. That is what currency risk looks like in a clear form.
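A minimal sketch of that comparison in Python might look like this. Only the 120 million tokens per month comes from the text. The blended price of 2 USD per 1 million tokens and the three exchange rates are hypothetical placeholders, not forecasts, and for simplicity the sketch treats the whole volume as paid in foreign currency.

```python
TOKENS_M_PER_MONTH = 120   # million tokens per month (from the text)
USD_PER_M_TOKENS = 2       # hypothetical blended price, USD per 1M tokens

# Hypothetical planning rates, tenge per 1 USD.
RATES = {"base": 480, "cautious": 520, "stress": 580}

def annual_spend_tenge(tokens_m_per_month, usd_per_m, rate):
    """Annual spend in tenge for a fixed monthly volume at one planning rate."""
    return tokens_m_per_month * usd_per_m * 12 * rate

annual = {name: annual_spend_tenge(TOKENS_M_PER_MONTH, USD_PER_M_TOKENS, rate)
          for name, rate in RATES.items()}

# Currency risk: extra tenge over the base case at the same token volume.
currency_risk = {name: total - annual["base"] for name, total in annual.items()}

for name in RATES:
    print(f"{name}: {annual[name]:,} tenge (+{currency_risk[name]:,} vs base)")
```

The token volume is identical in all three cases, so the whole difference is the exchange rate, which is the number worth showing to finance.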

When the risk can be reduced

If the vendor issues B2B invoices in tenge, part of the uncertainty disappears. With AI Router, calculations are in tenge, while the rates remain at provider level without an API markup. For yearly planning, that is convenient, because the discussion then is not about the exchange rate, but about load, model choice, and the reserve for experiments.

One final rule is simple: do not mix prices from different dates in the same table. If you updated the exchange rate, update the price date too. If you are not ready to recalculate everything, it is better to keep the previous version intact. One honest date is better than a mixed table from different months.

How to calculate peak load

For the budget, the dangerous part is not the average day, but the busiest hour. That is when limits break, latency rises, and token spending changes sharply. This matters more than a pretty monthly average.

Start with logs from the last 2-3 months. Look for the hour with the highest request volume: a sale, month-end, a mass mailing, Monday after a weekend. If you have little data, use a realistic peak scenario, not a normal workday.

Total request volume by itself does not tell you much. You need concurrent load: how many requests the system handles at the same minute or even the same second. One service can make 10,000 requests a day and stay calm, while another hits limits at just 40 simultaneous requests.

In practice, it is useful to collect at least four numbers: the maximum requests per minute during the busy hour, the number of simultaneous requests, the average and upper response length, and the share of retries and timeouts.

Response length often grows during peaks. Users write longer messages, agents ask for more detail, and the bot calls extra tools more often. If a response normally takes 400 tokens, it may grow to 650-800 during the peak hour. That changes the calculation noticeably.

Also add a reserve for retries. If the app repeats a request after a timeout, you pay again. In practice, it is convenient to budget a 5-15% buffer for retries, and more for unstable integrations. If you have chains of several calls for one user request, count the full path, not just the first model response.
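To make "count the full path" concrete, here is a small Python sketch. The three-step chain, its token sizes, and the 10% retry share are hypothetical. The prices reuse the two models from the support example later in this article. Applying the retry buffer uniformly to the whole chain is a simplifying assumption.

```python
def chain_cost_tenge(chain, retry_pct):
    """Cost of one user request that triggers several model calls.

    chain: list of (input_tokens, output_tokens, in_price_per_m, out_price_per_m).
    retry_pct: retry buffer in percent, applied to the whole chain.
    """
    one_pass = sum(
        (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        for tokens_in, tokens_out, price_in, price_out in chain
    )
    return one_pass * (100 + retry_pct) / 100

# Hypothetical chain for one support request.
chain = [
    (300, 20, 1_200, 4_800),       # classification on the cheaper model
    (1_000, 400, 12_000, 36_000),  # main answer on the stronger model
    (500, 50, 1_200, 4_800),       # response check on the cheaper model
]

main_only = chain_cost_tenge(chain[1:2], 0)  # what a "first response only" budget sees
full_path = chain_cost_tenge(chain, 10)      # full chain plus a 10% retry buffer

print(f"main call only: {main_only:.2f} tenge, full path: {full_path:.2f} tenge")
```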

After that, check not only the budget, but also the technical ceilings. The model and the provider have request and token limits per minute. Your application has its own queues, timeouts, and API-key-level restrictions. If the team works through a gateway like AI Router, check both layers: the selected model’s limits and the rate limits at the key level.

A simple test is more useful than long debates. Take the busiest hour, multiply peak requests per minute by the token length of the full exchange (input plus output), add a retry buffer, and see whether that volume fits within the per-minute limits. If it does not, the yearly budget is already too low, even if the average month looks fine.
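This test fits in a few lines of Python. The sketch uses requests per minute as the load figure because the limit is per minute. All numbers are hypothetical, including the 200,000 tokens-per-minute limit, which should be replaced with the actual limit of your model, provider, or key.

```python
def peak_tokens_per_minute(requests_per_minute, tokens_per_exchange, retry_pct):
    """Tokens needed per minute at peak, including a retry buffer (rounded up)."""
    raw = requests_per_minute * tokens_per_exchange * (100 + retry_pct)
    return -(-raw // 100)  # integer ceiling division

# Hypothetical busiest hour: 120 requests per minute,
# 1,000 input + 800 output tokens per exchange, 10% retries.
needed = peak_tokens_per_minute(120, 1_000 + 800, 10)

TPM_LIMIT = 200_000  # hypothetical tokens-per-minute limit
fits = needed <= TPM_LIMIT

print(f"needed: {needed:,} tokens/min, limit: {TPM_LIMIT:,}, fits: {fits}")
```

With these made-up figures the peak does not fit, even though the same service could look comfortable on a monthly average.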

Example calculation for a support team

Support almost always has mixed traffic: short chat questions and long emails with lots of context. If you calculate them using one average number, the budget will quickly drift. It is better to separate simple and complex requests across different models from the start.

Chat brings in 24,000 requests a month, of which 75% are simple and 25% are complex. Email adds another 8,000 requests: 60% simple and 40% complex. Simple requests go to a fast, lower-cost model priced at 1,200 ₸ per 1 million input tokens and 4,800 ₸ per 1 million output tokens. Complex requests go to a stronger model: 12,000 ₸ per 1 million input and 36,000 ₸ per 1 million output.

For chat, let’s use the following usage: a simple request uses 1,000 input and 250 output tokens, while a complex one uses 2,500 and 900. For email, the numbers are higher: a simple email uses 1,800 and 600, while a complex one uses 4,500 and 1,400.

That means simple requests per month produce 26.64 million input tokens and 7.38 million output tokens. That comes to 67,392 ₸. Complex requests produce 29.4 million input tokens and 9.88 million output tokens. That comes to 708,480 ₸. The base monthly total is 775,872 ₸.

Now add the evening peak. Suppose 35% of all chats arrive between 6:00 PM and 10:00 PM, and the team does not want people waiting in a queue. So 30% of simple chat requests in that window are sent not to the cheaper model, but to the stronger one. That is 1,890 conversations a month. The price difference for one such conversation is 18.6 ₸. The peak alone adds another 35,154 ₸, bringing the monthly total to 811,026 ₸.

It is better not to hide pilot runs for new prompts inside the main expense line. Suppose the team runs 700 complex emails through a new prompt chain once a quarter. At 6,000 input and 1,500 output tokens per email, that is another 88,200 ₸ for one pilot. In a yearly plan, it is more convenient to keep that amount as a separate reserve: 352,800 ₸ per year.
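The figures in this example can be reproduced with a short Python sketch. All volumes, token sizes, and prices come from the example above. The names `PRICES`, `cost`, and `flows` are just illustrative.

```python
# Tenge per 1M tokens: (input price, output price), as given in the example.
PRICES = {
    "fast": (1_200, 4_800),
    "strong": (12_000, 36_000),
}

def cost(requests, tokens_in, tokens_out, model):
    """Cost in tenge; input and output tokens are priced separately."""
    price_in, price_out = PRICES[model]
    return requests * (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# (requests per month, input tokens, output tokens, model)
flows = [
    (18_000, 1_000, 250, "fast"),      # chat, simple: 75% of 24,000
    (4_800, 1_800, 600, "fast"),       # email, simple: 60% of 8,000
    (6_000, 2_500, 900, "strong"),     # chat, complex: 25% of 24,000
    (3_200, 4_500, 1_400, "strong"),   # email, complex: 40% of 8,000
]
base = sum(cost(*flow) for flow in flows)

# Evening peak: 35% of chats fall in the window, and 30% of the simple ones
# in that window are rerouted from the cheaper model to the stronger one.
rerouted = 18_000 * 35 * 30 // 10_000
peak_extra = (cost(rerouted, 1_000, 250, "strong")
              - cost(rerouted, 1_000, 250, "fast"))

# Quarterly pilot: 700 complex emails through a new prompt chain.
pilot = cost(700, 6_000, 1_500, "strong")
pilot_reserve_year = 4 * pilot

print(f"base: {base:,.0f}  with peak: {base + peak_extra:,.0f}  "
      f"pilot reserve per year: {pilot_reserve_year:,.0f}")
```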

This template shows not an abstract price, but the real cost by channel, request type, and load mode. It is much easier to defend the budget when you can show what the normal flow costs, what the evening peak consumes, and what goes to experiments.

Where teams most often make mistakes

The most common mistake is simple: teams count only input tokens. That is not enough for a budget. In most working scenarios, the model’s response costs as much as, or sometimes more than, the request.

If a support bot gets 700 input tokens and returns 250-300 output tokens, you cannot count input only. Across hundreds of thousands of conversations, the difference is no longer a minor detail, but extra millions of tenge a year.

The second mistake appears when the team takes the price of one model and acts as if that will stay true all year. In practice, request routing changes. Some traffic goes to a cheaper model, some to a stronger one, and some to locally hosted open-weight models if latency, data residency, or your own fine-tuned versions matter.

For teams working through AI Router, this is especially noticeable: technically you can change the base_url quickly and keep working through one OpenAI-compatible endpoint, but the financial model must account for more than one price and the share of traffic on each route. Otherwise the number on paper will be one thing, and the invoice another.
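A minimal Python sketch of a route-aware calculation is shown below. The 80/20 traffic split and the monthly volume of 100,000 requests are hypothetical. The token sizes follow the support bot figures above, and the two price pairs are borrowed from the support example earlier in this article.

```python
def routed_monthly_cost(requests, tokens_in, tokens_out, routes):
    """Monthly cost in tenge when traffic is split across several routes.

    routes: list of (share_pct, in_price_per_m, out_price_per_m);
    the shares must add up to 100.
    """
    if sum(share for share, _, _ in routes) != 100:
        raise ValueError("route shares must add up to 100%")
    total = 0.0
    for share_pct, price_in, price_out in routes:
        route_requests = requests * share_pct / 100
        total += route_requests * (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return total

# Hypothetical split: 80% on the cheaper model, 20% on the stronger one.
routes = [(80, 1_200, 4_800), (20, 12_000, 36_000)]

one_price = routed_monthly_cost(100_000, 700, 300, [(100, 1_200, 4_800)])
with_routing = routed_monthly_cost(100_000, 700, 300, routes)

print(f"single price: {one_price:,.0f} tenge, with routing: {with_routing:,.0f} tenge")
```

In this illustration a 20% share on the stronger model more than doubles the monthly total, which is why the share of traffic per route belongs in the table next to the price.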

Another common miss is not counting service calls. The team budgets only the main request to the model and forgets retries, moderation, PII masking, classification, response checking, and other helper steps. Each such call costs money. Sometimes they add 10-25% to the volume, and even more with unstable integrations.

The reserve for tests is also often set to zero. That is almost always a mistake. Before launch, the team runs prompts, compares models, changes system instructions, and checks quality on new datasets. After launch, experiments do not disappear. They increase, because the product is alive and business requirements keep changing.

And one more mistake even strong teams make: they calculate the year using today’s exchange rate. That is convenient in a spreadsheet, but not in real life. If provider rates are tied to foreign currency and you approve the budget in tenge, you need not one rate, but at least a base and a stress scenario.

A good working model looks boring, and that is normal. But it saves you from surprises. It counts input and output, traffic shares, service requests, peaks, a test reserve, and two exchange-rate scenarios. That is the kind of table that usually survives both the pilot and the growth phase.

Quick check before budget approval

What usually breaks a budget approval is not the total amount, but the gaps in the calculation. If the table does not show where the number came from, who owns it, and how quickly it can be recalculated, the discussion immediately turns into an argument.

A good model does not look smart. It looks checkable. Any manager should understand in a couple of minutes where the base case is, where the peak is, which exchange rate you fixed, and how much money you deliberately left for tests.

Before the meeting, check five things. Every line should have an owner. Every number should have a source: a log from the last quarter, product data, a contact center forecast, or the vendor’s pricing. The table should show three scenarios: base, peak, and stress. The test reserve is better kept separate, and the exchange rate and fix date should be on the first screen.

A separate reserve line usually saves the conversation. For example, support plans for 12 million tokens a month for production load and another 2 million for testing new prompts. If you add them together, the reserve quickly disappears. If you separate them, the decision becomes clear: production is protected as a mandatory expense, tests as a controllable one.

If you use a gateway or a provider that invoices in tenge, it is worth noting that in the data source as well. For teams in Kazakhstan, this format makes yearly budgeting easier, because finance sees the amount in local currency right away, not a conversion on the last day.

The final check is very simple: give the spreadsheet to a colleague and ask them to update one model, the exchange rate, and the peak load. If the person can do it in 15 minutes, the calculation is ready for approval.

What to do next

Once the draft budget is ready, do not take it straight into the next year. First test it against real data for 2-4 weeks. Logs will quickly show where you underestimated volume and where you overestimated.

Put four things into one table: number of requests, average input tokens, average output tokens, and the share of expensive scenarios. Then recalculate spending not from assumptions, but from actual data. Even a short observation period is usually more useful than the team’s "average estimate."

The working rhythm is simple: once a month, update the actual token spend for each scenario, separately pull in the exchange rate you really pay with, compare plan and actual in percentages and in tenge, and then immediately revise the yearly forecast. Do not let the error accumulate until the end of the quarter.
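The monthly comparison can be sketched in Python as follows. The plan and actual amounts are hypothetical. Scaling the remaining months by the actual-to-plan ratio is only one possible revision rule, assumed here for illustration, and it should not be applied blindly if the deviation came from a one-off event.

```python
def plan_vs_actual(plan_tenge, actual_tenge):
    """Return the deviation in tenge and in percent of the plan."""
    diff = actual_tenge - plan_tenge
    return diff, diff * 100 / plan_tenge

def revised_year(actual_so_far, plan_so_far, remaining_plan):
    """Revised yearly forecast: actuals so far plus the remaining plan,
    scaled by the actual-to-plan ratio (an assumed, simplistic rule)."""
    return actual_so_far + remaining_plan * actual_so_far / plan_so_far

# Hypothetical first month: 800,000 tenge planned, 920,000 tenge actually spent.
diff, pct = plan_vs_actual(800_000, 920_000)

# Eleven remaining months planned at 800,000 tenge each.
forecast = revised_year(920_000, 800_000, 11 * 800_000)

print(f"deviation: {diff:,} tenge ({pct:.1f}%), revised year: {forecast:,.0f} tenge")
```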

Another useful step is to separate scenarios by model. Do not keep all tasks on one expensive model just because it is simpler at the start. Knowledge base search, request classification, field extraction from documents, and draft responses can often go to a cheaper model. Complex dialogues, ambiguous cases, and high-risk tasks are better left to a stronger one.

This reduces spending volatility and makes the forecast calmer. It also makes it easier to explain to finance why one part of the load is cheap and another is not.

If you want one API, invoices in tenge, and data storage in Kazakhstan, it makes sense to compare two approaches: direct integrations with several providers and working through a single gateway like AI Router. The gateway option gives you one OpenAI-compatible endpoint, monthly B2B invoicing in tenge, and the ability to keep data inside the country. For some teams, that is not about convenience, but about accounting, compliance, and speed of launch.

And the final test. Open the estimate and ask which three numbers you can update on the first business day of next month. If there is no answer, the budget model is still too rough.