How to Calculate an LLM Budget for Multiple Teams
We’ll show how to break an LLM budget down by team, limits, and use cases so costs don’t balloon once the pilot moves to production.

Why the pilot gives a misleading picture
A pilot is almost always cheaper and simpler than real work. Only a few people are involved, they ask careful questions, and they quickly stop bad attempts. Production is different: there are more users, requests come in all day, and conversations run longer. People clarify answers, rephrase their questions, and come back to the same task several times.
There is another problem too. In a pilot, a team usually tests one narrow scenario. For example, legal only tests a short contract summary, and support only looks at a draft reply to a customer. After launch, the full workflow appears: uploading long files, extracting fields, classification, text generation, rechecking, sometimes translation, and internal rule checks. One scenario quickly turns into a chain of several model calls.
That is why a budget built from pilot data is often too low. In testing, people ask for a short answer just to see whether the idea works. A month after launch, they already need a detailed breakdown, a table, links to document sections, and an explanation of the choice. Both input tokens and output tokens grow. The number of requests per person grows too: if the tool is built into daily work, an employee uses it not 2-3 times a day, but 15-20.
Finance also often sees a picture that is too smooth. A report may have one line for the LLM, even though it hides very different types of costs: expensive reasoning models for hard tasks, high-volume cheap requests, retries, logging, PII masking, audits, and key-level limits. If you add all of that together into a single number, the pilot looks understandable, and then the budget starts to drift for no obvious reason.
Even if the team works through a single gateway, you should not count just one number, but several layers of spend. AI Router, for example, simplifies access to different models through one OpenAI-compatible endpoint and provides unified B2B invoicing in tenge, but it does not replace a proper cost structure on its own. Otherwise, after the pilot it becomes hard to tell what actually grew: the number of users, answer length, document volume, or the price of the chosen models.
What the costs are made of
Many people look only at the price per 1 million tokens and seriously underestimate the total. In reality, you pay not only for the model’s response, but also for everything around the call: long instructions, repeats, logs, quality checks, and a buffer for switching providers.
It’s better to calculate the budget by scenario. A call summary, knowledge base search, and operator reply draft all create different token usage. One scenario has a long input and a short answer; another is the other way around. If you plan from a single blended average, the plan breaks quickly.
Four things usually affect the cost of a single request:
- input tokens from the user and context
- the system prompt, which lives in every call and quietly inflates the bill
- output tokens, especially when the model writes long explanations
- repeated requests, retries after timeouts, and regenerated answers because of a poor result
Repeats are usually what create the unpleasant surprise. A user asks to "make it shorter," the service does a retry, the developer adds a second call to check the format, and one business request turns into three or four model calls. If this is not in the estimate, the pilot looks cheap, and production does not.
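To make this concrete, here is a minimal sketch of the cost of one business request, assuming placeholder per-token prices (not real provider rates) and a call multiplier that stands in for retries, format checks, and regenerations.

```python
# Sketch: cost of one business request, with placeholder prices.
# Prices are per 1M tokens and are NOT real provider rates.
PRICE_INPUT_PER_M = 0.50    # assumed input price, USD per 1M tokens
PRICE_OUTPUT_PER_M = 1.50   # assumed output price, USD per 1M tokens

def request_cost(system_prompt_tokens: int,
                 user_input_tokens: int,
                 output_tokens: int,
                 calls_per_business_request: float = 1.0) -> float:
    """Cost of one business request in USD. The multiplier covers retries,
    format checks, and regenerations: one task often means 3-4 model calls."""
    input_cost = (system_prompt_tokens + user_input_tokens) / 1_000_000 * PRICE_INPUT_PER_M
    output_cost = output_tokens / 1_000_000 * PRICE_OUTPUT_PER_M
    return (input_cost + output_cost) * calls_per_business_request

# The same task in a pilot vs. in production, with retries and longer output:
pilot = request_cost(800, 400, 300)
prod = request_cost(800, 1_200, 900, calls_per_business_request=3.5)
print(f"pilot: ${pilot:.5f}  production: ${prod:.5f} per business request")
```

Even with made-up prices, the multiplier is what matters: the same task becomes several times more expensive once retries and checks are counted.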
Prompt caching can noticeably lower the price, but only where part of the request repeats. It works for long system instructions, templates, and large chunks of context that the team sends again and again. If requests are unique every time, the savings will be modest.
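A rough way to estimate that saving, assuming the provider discounts cached input tokens; the 50% discount here is an assumption, not a quoted rate.

```python
# Sketch: estimated monthly saving from prompt caching.
# The cached-token discount varies by provider; 50% here is an assumption.
PRICE_INPUT_PER_M = 0.50   # assumed input price, USD per 1M tokens
CACHED_DISCOUNT = 0.5      # assumed: cached input tokens cost half as much

def caching_savings(repeated_tokens_per_call: int, calls_per_month: int) -> float:
    """Approximate monthly saving when a repeated prefix (system prompt,
    templates, shared context) is served from the prompt cache."""
    full_price = repeated_tokens_per_call * calls_per_month / 1_000_000 * PRICE_INPUT_PER_M
    return full_price * CACHED_DISCOUNT

# 3,000 repeated instruction tokens on 50,000 calls a month:
print(f"~${caching_savings(3_000, 50_000):.2f} saved per month")
```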
Quality control is another separate budget line. The team spends money not only on generation, but also on eval sets, regular runs, log storage, audits, incident review, and support work. In banking, telecom, or the public sector, this often also includes PII masking, content labels, and data retention under local requirements. In a pilot, these costs are almost invisible, but in the live environment they become permanent.
You also need a reserve for switching models or providers. Prices change, limits change, and sometimes a model just performs poorly on a specific task. If a company can quickly switch routing between providers, the risk of downtime is lower, but a financial buffer is still needed. A practical buffer is at least 10-20% on top.
A good estimate accounts not for one "ideal" call, but for the whole request path in production. Then costs stop being a surprise within the first few weeks of work.
How to split the budget by team and task
One total bill for all teams almost always gets in the way of control. Support, legal, and product have different request frequency, different context sizes, and different costs of mistakes. That is why the budget is better split not by department on instinct, but by specific scenarios.
Start by breaking the workload into separate flows: chat, knowledge base search, summarization, and document review. These scenarios cost differently. Chat creates many short requests, search spends budget on retrieval and long context, summarization often produces a large output, and document review may require a more expensive model and repeated runs.
Next, assign each scenario an owner. Not an abstract department, but the person responsible for the metric and the spend. In support, this might be the contact center lead; in legal, the owner of the contract review process; in product, the manager of internal search. Then one person watches the prompt, the model, answer quality, and the monthly burn rate.
A simple accounting scheme works well:
- support: operator chat and dialogue summarization
- legal team: contract and attachment review
- product team: search, internal assistant, and tests
After that, give each team its own monthly ceiling. It’s better to have two limits: a soft one and a hard one. The soft limit warns that the team has reached, for example, 80% of the budget. The hard limit stops expensive scenarios, moves them to a cheaper model, or requires separate approval.
If traffic goes through a shared gateway, it’s useful to issue a separate API key for each team and cap spending at the key level. That makes it easier to separate one group’s spending from another’s and avoid a situation where one department consumes the entire pool.
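A minimal sketch of how per-team keys with a soft and a hard limit might be tracked on your side; the team names, keys, and limit values are illustrative, and most gateways also offer their own key-level limits.

```python
# Sketch: per-team spend tracking with a soft (warn) and hard (block) limit.
# Team names, keys, and limit values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TeamBudget:
    api_key: str              # separate key per team, issued at the gateway
    hard_limit: float         # monthly ceiling, in the billing currency
    soft_ratio: float = 0.8   # warn at 80% of the hard limit
    spent: float = 0.0

    def record(self, cost: float) -> str:
        """Add a request's cost and return the resulting state."""
        self.spent += cost
        if self.spent >= self.hard_limit:
            return "hard_limit"   # stop expensive scenarios or require approval
        if self.spent >= self.hard_limit * self.soft_ratio:
            return "soft_limit"   # notify the budget owner
        return "ok"

budgets = {
    "support": TeamBudget(api_key="key-support", hard_limit=60_000),
    "legal":   TeamBudget(api_key="key-legal",   hard_limit=240_000),
}
print(budgets["legal"].record(200_000))   # -> "soft_limit"
```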
Do not mix R&D and production load. Internal tests, evals, red teaming, model selection, and load testing should come from a separate budget. Otherwise, after launch you may suddenly discover that engineering experiments were added on top of user requests.
Another common mistake is forgetting seasonality. In retail, the peak comes before promotions; in banking, during reporting periods; in an internal service, after a company-wide release. Keep at least a 15-25% reserve for such weeks. Then the financial plan will not crack in the first month of growth.
Which limits to set in advance
If you do not set limits before launch, the team quickly gets used to the generous pilot mode: they choose the expensive model "just in case," ask for long answers, and keep sending the same request around in circles. Then the bill grows not because of one big mistake, but because of a hundred small habits.
One shared monthly ceiling is not very helpful. It shows the problem too late. It’s better to set limits at several levels right away:
- for the user - for manual requests and experiments
- for the service - for a bot, internal API, or product feature
- for the team - so one group doesn’t eat the whole budget
- for expensive models - with separate access or approval
- for answer length - for tasks where long output is not needed
Most of the time, it’s not the requests themselves that become expensive, but poor discipline. If an employee sends a large document to the model and asks to "answer in detail," spend grows both on input and output. That is why a max tokens limit is almost always necessary. For classification, field extraction, or a short summary, it’s better to lock in a cheaper model and a short response format from the start.
Test and production access should also be separated. In testing, the team can try different models, but within a small quota and on anonymized data. In production, separate keys, an approved list of models, and clear rules are needed for who can turn on a more expensive route. If the company uses AI Router, such limits are easy to tie to the key and traffic type.
A useful rule is simple: if a task does not affect the customer directly, send it to the cheap model first. Only move the request to a stronger model when the first one fails on quality or confidence. This approach saves a noticeable part of the budget without manual review of every request.
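A sketch of that cheap-first rule against an OpenAI-compatible endpoint. The model IDs, base URL, and the quality gate are assumptions; the point is the escalation order, not the specific models.

```python
# Sketch: route to a cheap model first, escalate only when the answer
# fails a quality check. Model names and the base URL are assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="key-support")

CHEAP_MODEL = "cheap-model"     # assumed model IDs, replace with real routes
STRONG_MODEL = "strong-model"

def answer(question: str) -> str:
    # First attempt: cheap model, short output enforced by max_tokens.
    draft = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=300,
    ).choices[0].message.content

    # Escalate only if the cheap answer fails a simple quality gate.
    # The gate here is a placeholder; in practice use an eval or a rubric.
    if draft is None or len(draft.strip()) < 20 or "I don't know" in draft:
        return client.chat.completions.create(
            model=STRONG_MODEL,
            messages=[{"role": "user", "content": question}],
            max_tokens=600,
        ).choices[0].message.content

    return draft
```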
Another required control layer is notifications. It’s better to trigger them not at the end of the month, but on a sharp change in behavior. For example, if the service spent twice as much as usual in a day, the team started choosing the expensive model more often, or the average token volume grew by 30-40%. These signals catch the problem on the day when it is still easy to fix.
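A sketch of those signals as a daily check against a trailing baseline; the thresholds follow the rules of thumb above, and the data fields are assumed.

```python
# Sketch: daily spend anomaly checks against a trailing baseline.
# Thresholds follow the rules of thumb in the text; data fields are assumed.
from statistics import mean

def daily_alerts(history: list[dict], today: dict) -> list[str]:
    """history/today entries: {'spend', 'avg_tokens', 'expensive_share'}."""
    alerts = []
    base_spend = mean(d["spend"] for d in history)
    base_tokens = mean(d["avg_tokens"] for d in history)
    base_expensive = mean(d["expensive_share"] for d in history)

    if today["spend"] > 2 * base_spend:
        alerts.append("daily spend more than doubled")
    if today["avg_tokens"] > 1.3 * base_tokens:
        alerts.append("average tokens per request grew by 30%+")
    if today["expensive_share"] > base_expensive + 0.15:
        alerts.append("expensive model used noticeably more often")
    return alerts

history = [{"spend": 100.0, "avg_tokens": 1_500, "expensive_share": 0.10}] * 7
today = {"spend": 230.0, "avg_tokens": 2_100, "expensive_share": 0.32}
print(daily_alerts(history, today))
```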
How to calculate a quarterly budget
Quarterly planning almost always breaks in one place: the team takes the pilot average bill and multiplies it by three months. That is too rough. An LLM budget needs scenario-based planning, because a short support chat request and a long report generation task have very different costs.
First, list recurring tasks, not teams. One team usually has several: knowledge base search, ticket analysis, email drafting, summary generation, and document review. The budget is calculated for these scenarios and then rolled up by department.
- For each team, list 3-5 common scenarios and estimate how often people or systems will run them.
- For each scenario, measure the average input and output in tokens, as well as the number of calls per day. If you don’t have much data, use a week of real work, not a demo.
- Calculate three monthly modes: base, high, and peak. The base mode shows normal load, the high mode is for an active month, and the peak mode is for a campaign, reporting period, or spike in requests.
- Add request retries, integration errors, manual restarts, and a growth reserve to the spend. If you have prompt caching or response reuse, subtract the expected savings separately.
- Before launch, agree on limits with finance: a monthly ceiling per team, a warning at 70-80%, and rules for what gets turned off first if spending goes over.
It’s convenient to put the calculation into a simple table: scenario, model, tokens per call, calls per day, cost per call, and monthly cost in three modes. After that, the quarter is calculated without guesswork: add up the three months for each scenario and include a reserve. Usually 10-20% is enough if the load is still fluctuating.
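A sketch of that table as data, with the three modes and the quarterly rollup; every scenario number and price below is a placeholder, and the quarter mix (two base months plus one high month) is just one example.

```python
# Sketch: scenario table and a quarterly rollup with a reserve.
# All scenario numbers and prices are placeholder assumptions.
SCENARIOS = [
    {"name": "support chat",    "t_in": 1_000,  "t_out": 500,
     "calls": {"base": 600, "high": 800, "peak": 1_100},
     "p_in": 0.50, "p_out": 1.50},     # price per 1M tokens, in/out
    {"name": "contract review", "t_in": 30_000, "t_out": 2_000,
     "calls": {"base": 12, "high": 15, "peak": 25},
     "p_in": 3.00, "p_out": 15.00},
]
RETRY_FACTOR = 1.3   # retries, restarts, background reruns
RESERVE = 0.15       # 15% quarterly reserve

def monthly_cost(s: dict, mode: str, days: int = 22) -> float:
    per_call = (s["t_in"] / 1e6 * s["p_in"]) + (s["t_out"] / 1e6 * s["p_out"])
    return per_call * s["calls"][mode] * days * RETRY_FACTOR

# Example quarter: two base months plus one high month, plus the reserve.
quarter = sum(monthly_cost(s, "base") * 2 + monthly_cost(s, "high") for s in SCENARIOS)
print(f"quarter estimate: {quarter * (1 + RESERVE):,.2f}")
```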
If you work through a single API gateway and see several providers in one environment, add another column: the backup model if price or latency increases. That is useful both for cost control and for discussions with finance. When the limits and assumptions are written down in advance, there are far fewer arguments about the bill at the end of the month.
Example for three teams
A company has three teams, and each has its own load profile. Support answers customers in chat, sales drafts emails after calls, and legal reviews contracts. If you put all of that into one shared budget, the number is almost always wrong.
Let’s take a hypothetical month. Support handles 18,000 conversations. The context is short: the customer’s question, a couple of replies, and a ready answer. On average, one conversation uses 1,500 tokens. If the team works on a low-cost model, spending stays moderate and fairly steady from day to day.
Sales makes fewer requests, about 2,500 per month, but each request is longer. The manager uploads meeting notes, asks for a summary, and then writes an email to the customer. One such cycle can easily use 4,000-6,000 tokens. In money terms, that is already a meaningful line item, even though there are fewer requests than in support.
Legal often has the heaviest scenario. One contract, its appendix, the old version, the new version, and a list of edits quickly expand the context to tens of thousands of tokens. If the legal team reviews 300 documents a month, it may spend more than support and sales combined.
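The arithmetic behind figures like the ones in the list below can be sketched in a few lines. The blended per-1M-token prices here are assumptions chosen purely to make the illustration add up; only the request counts and token sizes come from the example above.

```python
# Sketch: rough monthly cost per team from the profiles above.
# Blended per-1M-token prices are assumptions picked for illustration only.
teams = {
    "support": {"tasks": 18_000, "tokens_per_task": 1_500,  "price_per_m": 2_200},
    "sales":   {"tasks": 2_500,  "tokens_per_task": 5_000,  "price_per_m": 8_800},
    "legal":   {"tasks": 300,    "tokens_per_task": 50_000, "price_per_m": 16_000},
}
for name, t in teams.items():
    tokens_m = t["tasks"] * t["tokens_per_task"] / 1_000_000
    cost = tokens_m * t["price_per_m"]   # in the billing currency, e.g. KZT
    print(f"{name}: ~{tokens_m:.1f}M tokens, ~{cost:,.0f} per month")
```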
For clarity, the budget by function might look like this:
- support: 60,000 KZT per month
- sales: 110,000 KZT per month
- legal: 240,000 KZT per month
- reserve for peaks and repeat runs: 90,000 KZT
The weak spot is obvious right away. One expensive scenario inside the legal function can use up the team’s entire limit in a week. For example, if legal starts running every contract twice — first for risk review and then for version comparison — the monthly spend can quickly double.
A total limit of 500,000 KZT does not protect you from that situation. It only stops overspending after one function has already taken money away from the others. Support may end up without the budget to answer customers, even though it stayed within plan.
That is why the budget is better divided not only by team, but also by task type. Sales should have a separate limit for emails and another one for meeting summaries. Legal should have one limit for short checks and another for long documents. If traffic goes through a single gateway, this is easy to split across different keys and access rules.
Where budgets usually start breaking
Budgets do not break on token price alone, but on bad assumptions. The team takes the average price of one request, multiplies it by the number of users, and gets a picture that is too calm. In real work, requests are rarely "average": one short question costs almost nothing, while analyzing a long contract, email, or chat history can raise spend many times over.
A common mistake is mixing pilot, internal testing, and production into one expense model. In a pilot, people behave carefully, traffic is small, and scenarios are simple. After launch, repeated attempts, more attachments, longer conversations, and new features appear. If this is missing from the estimate, the budget starts drifting in the very first month.
What is usually underestimated
- Pricing is uneven: 80% of requests may be cheap, while the rest consume half the monthly bill.
- One expensive model for all teams almost always creates extra costs.
- Chat history and attachments increase spend quietly until the bill arrives.
- Night batches, retries, and background tasks often do not make it into the first estimate.
Another weak point is shared access to one strong and expensive model. It is done for simplicity, but then HR, support, analysts, and developers solve very different problems through the same route. For drafts, classification, summarization, and tagging, that is usually too expensive. It is much more sensible to split scenarios by quality level right away and set separate limits.
Long attachments also break the plan faster than it seems. One 120-page PDF, two weeks of correspondence, or a full CRM history in every request changes the economics of the entire process. If the system sends the whole context again every time, spend grows even when the user only wants one paragraph fixed.
Night processes are rarely noticed in time. During the day, the team only counts live user requests, but at night batches run: ticket labeling, answer rechecks, embedding recalculation, dialogue summarization, retries after timeouts. These calls are invisible to managers, yet they steadily add up to a large volume.
In practice, it helps to split the budget along two axes: by team and by load type. If you have one gateway and shared OpenAI-compatible access, this is easy to do through separate keys, limits, and audit logs. Then you can see where the money goes to production, where it goes to internal testing, and where background tasks consume it.
The most useful question before launch is simple: which requests will be rare but very expensive. That is usually what breaks the quarterly plan.
Checks before launch
Before launch, an LLM budget often looks neat only on paper. One missed limit or unclear approval flow, and in a peak month the spending goes above plan.
First, assign a budget owner for each team. Not a group and not "everyone a little bit," but one specific person. That person confirms limits, sees deviations, and decides when access can be expanded or a model can be changed.
Then prepare a short cost card for each scenario. Four lines are usually enough: which model is used, how much one request or 1 million tokens costs, what monthly volume is expected, and what counts as normal. If the team has employee chat, a call summarizer, and internal search, count them separately. Otherwise, one expensive scenario will be hidden inside the total.
Next, check the limits not on an average month, but on the heaviest one. Take a period with reporting, promotions, a release, or seasonal peak. If both the number of requests and the length of answers grow during that time, the daily and monthly ceilings should handle the load without manual panic.
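A minimal arithmetic check for that, with assumed peak multipliers of 1.5x for request volume and 1.3x for answer length; both multipliers are placeholders to replace with your own seasonal data.

```python
# Sketch: sanity-check a team's ceiling against its heaviest projected month.
# The growth multipliers are assumptions for a peak period.
def peak_month_fits(base_monthly_cost: float, ceiling: float,
                    request_growth: float = 1.5, answer_growth: float = 1.3) -> bool:
    """True if the ceiling covers a peak month where both request volume
    and answer length grow at the same time."""
    return base_monthly_cost * request_growth * answer_growth <= ceiling

print(peak_month_fits(base_monthly_cost=240_000, ceiling=400_000))  # -> False
```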
Agree separately on escalation rules. Who gets the alert at 80% of the limit? Who can temporarily raise the threshold? Who turns off secondary scenarios if spending rises? If the answers live only in the tech lead’s head, that is already a risk.
Finally, plan in advance how you would replace a model if prices rise, latency increases, or the provider changes terms. It’s better to decide upfront which tasks can be moved to a cheaper model without a visible drop in quality, and which ones must not be touched. If the company uses an OpenAI-compatible gateway like AI Router, such switches are usually easier: you do not need to change the whole SDK stack and prompts, just the route or model for a specific scenario.
This check takes a few hours, but later it saves weeks of arguments after the pilot.
What to do next
Build a working scenario table. Not a quarterly forecast in general, but a living document: team, task, model, average request size, expected frequency, limit, and owner. Update it once a month, or the budget will drift back into guesswork.
A shared pool almost always creates disputes. One team quickly burns through the limit, another freezes its launch, and the financial picture becomes muddy. It is much easier to give each team its own monthly ceiling and separately agree on overage rules: who approves it, in what case, and for how long.
The minimum that really works looks like this:
- split spending by teams and task types
- give each team a base limit and an emergency reserve, but do not mix them
- once a month, compare plan vs. actual not only by total, but also by the cost of each scenario
- look for cases where it is cheaper to switch models than to cut access for people
The last point often gives the biggest effect. If support uses an expensive model for simple replies and quality barely changes with a cheaper one, cutting access makes no sense. It is better to change the model for that specific scenario and keep the team working at a normal pace.
If different teams already use their own set of models, do not try to move everyone to one provider right away. That takes time and is rarely worth it. First, clean up the accounting and limits, and only then decide where it makes sense to unify traffic, routing, and access rules.
For companies in Kazakhstan and Central Asia, AI Router can simplify accounting: one OpenAI-compatible API for different models and monthly B2B invoicing in tenge. It does not solve the budget by itself, but it makes reconciliation much easier if product, analytics, and support work with different models.
If you do three things — build a scenario table, set limits by team, and review model choices once a month — the financial plan usually stops breaking right after the pilot.
Frequently asked questions
Why can’t we just use the pilot budget as is?
Because a pilot is almost always cleaner than real work. There are fewer people, conversations are shorter, and there are almost no retries. After launch, the number of requests grows, the context gets longer, and more calls are needed for checks, formatting, or regeneration.
What should be counted separately in an LLM budget?
It’s best to build the budget separately for each scenario. Usually you count input tokens, the system prompt, output tokens, retries, repeated user requests, logs, evals, PII masking, and audits. If you put everything into one line, you won’t know what actually got more expensive.
How should the budget be split between teams?
The simplest way is to divide the budget by tasks and owners, not by departments at a glance. Support, legal, and product have different request costs and different usage frequency. Give each team its own scenarios, its own limit, and a separate API key so you can see spending clearly.
What limits should be set before launch?
Set both a soft limit and a hard limit right away. The soft limit warns you when a team is getting close to the ceiling, and the hard limit stops an expensive route or moves requests to a cheaper model. Also limit max tokens and access to expensive models, or the bill will quickly grow from small habits.
Do we need a reserve in the budget, and how much should we set aside?
Yes, without a reserve the plan often breaks in the very first busy month. For normal uncertainty, 10–20% extra is often enough, and for seasonal peaks it’s better to keep 15–25%. That buffer covers traffic spikes, model changes, and extra retries without urgent reapproval.
How do you calculate a quarterly budget without major mistakes?
Don’t use the pilot average. Use real scenarios across three modes: base, high, and peak. For each scenario, measure input and output tokens, calls per day, and model price. Then add retries, manual restarts, background tasks, and a reserve.
Where does the budget usually start to crack?
It usually breaks under long documents, chat history, repeated requests, and one expensive model for everything. Nightly batches, response rechecks, and background processes also quietly raise spend. On paper it looks small, but over a month it adds up to a meaningful amount.
When should requests be sent to a cheaper model?
That switch is useful for tasks that don’t need the best model on every request. Classification, short summaries, field extraction, and drafts can often go to a cheaper route first. It only makes sense to move to a stronger model when the first one doesn’t meet quality requirements.
How can we tell in time that spending is going off track?
Watch not only the bill at the end of the month, but also sharp changes during the day and week. It helps to catch growth in average token count, a jump in daily spend, frequent use of an expensive model, and spikes in retries. If traffic is separated by keys, the source of the problem shows up much faster.
Will a single API gateway help control spending better?
Yes, it makes accounting easier because it gives you one entry point for different models and one consolidated bill. But the gateway itself does not solve budget discipline. You still need to track scenarios separately, issue different keys to teams, and keep clear limits; then reconciliation becomes much simpler.