Mar 31, 2025 · 8 min read

LLM expense report for accounting and the CTO without manual reconciliations

An LLM expense report brings tokens, models, and teams into one format so accounting and the CTO can reconcile the numbers without manual work.

Why LLM numbers do not match

Discrepancies do not start with arithmetic, but with different units of measurement. Engineering looks at tokens, requests, and models. Finance looks at amount, currency, period, and cost center. Both sides can be right, but their numbers will not match until you bring the data into one format.

Even the same usage volume does not produce the same bill. The same workflow can run through different models, and those models can have different rates for input and output tokens. In some places cache is billed separately, in others it is not. That is why the line "2 million tokens" by itself explains almost nothing. For engineers, that is usage; for accounting, it is not yet an expense.

The exported data also adds confusion. One provider calculates by requests, another by tokens, a third shows a price per 1 million tokens, and a fourth sends only a monthly invoice. Even model names do not always match: one name in the logs, another on the bill. Add UTC, local time, and different month-end cutoffs, and the numbers drift even more.

From the outside, this looks like a small operational issue. In practice, it regularly eats hours. At the end of the month, someone exports CSV files, someone compares them with app logs, and someone else manually allocates costs across departments. One such reconciliation can easily take half a day.

Usually the mismatch appears in four places: engineers track usage while finance tracks payment; the model in the report and the one on the invoice have different names; amounts arrive in different currencies and for different periods; and the expense cannot be quickly tied to a specific team.

That is already enough for a monthly report to stop making sense. The CTO wants to see which product and which team burned through the budget. Accounting needs a document that can be closed without long back-and-forth. If the format is not agreed in advance, manual reconciliation turns into a monthly routine.

Companies in Kazakhstan often have an extra layer of complexity. Part of the spend goes through foreign providers, while internal B2B accounting is needed in tenge. Without one conversion rule and one collection point, the data diverges even when the requests themselves ran correctly.

What should be in one format

A good report should answer two questions at once: how much was spent, and why the amount increased. If a table contains only money, the CTO cannot see the reason. If it contains only tokens and models, that is not enough for finance.

Start with the period. In every export, specify the reporting month and the date when you pulled the data. This immediately removes part of the dispute when provider usage is processed later or some requests move to the next day in UTC. A simple note like "May 2026, export on June 3, 2026" saves a lot of time.

Next, add business context. Every row or group of rows should have a team, a product, and a budget owner. Otherwise, expenses quickly turn into a generic "AI" bill that nobody feels responsible for. If one service is used by support, mobile, and the data team, split the cost right away instead of doing it manually at month-end.

The technical part should also stay simple. Record not only the model, but also the provider and the request type. The same model can cost different amounts with different providers and be billed under different rules. Request type also affects the total: chat, embeddings, rerank, batch, and fine-tuned inference should not be mixed in one column.

Token tracking is where confusion usually starts. Show input tokens, output tokens, and cached tokens separately if the provider bills them separately. If cache is hidden inside the total, the report looks more expensive than it really is, and the argument quickly shifts to the calculation method.

For money, there should be one rule across the whole file. If some amounts come in dollars and others in tenge, convert everything into one currency using one exchange rate and write that rule in the report header. For companies in Kazakhstan, the total in tenge is usually more convenient, especially if part of the spend already arrives as a monthly B2B invoice in the same currency.
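The one-rate rule can be sketched in a few lines. This is a minimal example, assuming made-up row values and a hypothetical fixed rate written into the report header; the point is that every row goes through the same conversion function.

```python
# Sketch: convert mixed-currency rows to one total in tenge using a single
# fixed rate for the reporting period. Rate and row values are illustrative.
PERIOD_RATE_KZT_PER_USD = 450.0  # fixed once for the whole month

rows = [
    {"team": "support", "amount": 120.50, "currency": "USD"},
    {"team": "search", "amount": 38_000.0, "currency": "KZT"},
]

def to_kzt(row, rate=PERIOD_RATE_KZT_PER_USD):
    """Apply one conversion rule to every row, regardless of source currency."""
    if row["currency"] == "KZT":
        return row["amount"]
    if row["currency"] == "USD":
        return row["amount"] * rate
    raise ValueError(f"no conversion rule for {row['currency']}")

total_kzt = sum(to_kzt(r) for r in rows)
```

Anything that cannot go through `to_kzt` fails loudly instead of silently landing in the total at an ad-hoc rate.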

A good format usually fits into 8-10 columns and can be read without verbal explanations. If accounting can see the money total and the CTO can immediately understand which team, which model, and which request type produced it, the report is structured well.

How to choose a unit of measurement

When tokens, requests, dollars, and tenge sit next to each other in one table, arguments are almost unavoidable. You need one base that all models and all teams can be mapped to. For LLMs, the most convenient option is to calculate cost per 1 million tokens, split by token type.

The same volume of requests can cost different amounts even within one model. The reason is simple: input and output are billed separately, and cache often has its own rate. If you compress everything into one line called "total tokens," the report becomes shorter but loses meaning. A team may reduce input by 20%, and the bill will not change if expensive output goes up.

That is why, for each model and each team, it is better to keep four fields: input tokens, output tokens, cached tokens if the provider bills them separately, and the rate per 1 million tokens for each type. Then you can see where dialog costs increased, where cache kicked in, and where the model is simply answering too verbosely.
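The four-field calculation can be expressed as a short function. The model name and the per-million rates below are illustrative, not any provider's actual pricing; the structure is what matters: each token type is priced at its own rate instead of being collapsed into one "total tokens" line.

```python
# Sketch: cost of one (team, model) row from separately billed token types.
# Model name and rates per 1 million tokens are made up for illustration.
RATES_PER_1M = {
    "model-a": {"input": 2.0, "output": 8.0, "cached": 0.5},
}

def row_cost(model, input_tokens, output_tokens, cached_tokens):
    """Price each token type at its own per-1M rate."""
    r = RATES_PER_1M[model]
    return (
        input_tokens / 1_000_000 * r["input"]
        + output_tokens / 1_000_000 * r["output"]
        + cached_tokens / 1_000_000 * r["cached"]
    )
```

With this breakdown, a 20% drop in input that is offset by more expensive output shows up as two moving lines, not one flat total.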

Do not mix different task types

Chat, embeddings, and batch jobs should not be lumped together without a separate note. Their pricing logic is different, and so is their business value. For chat, input and output usually matter most. For embeddings, the spend is often tied to large-scale indexing. For batch jobs, the price may be lower, but the volume is higher.

A simple example: the support team sent 8 million tokens to a chat model, while catalog search processed 40 million tokens through embeddings. If you write that as one total, it will look like search is more expensive in usage, even though the chat model may have consumed the actual budget.

There is another rule that noticeably reduces month-end disputes: fix one exchange rate for the whole reporting period. Do not convert some operations at the daily rate and others at the invoice rate. Choose the rule once and apply it to every row.

A good unit of measurement makes the report boring. That is a good thing. When the format raises no questions, people discuss the actual spending instead of the arithmetic.

How to build the report step by step

If you pull data from different dashboards at different times, the numbers will almost always diverge. So define three things in advance: one reporting period, one time zone, and one rate table.

Start with raw usage data, not money. You will calculate money later, after you bring the rows into one structure.

First, choose the period, for example a calendar month, and a time zone. Then export usage for all providers for the same interval. In every export, keep the same set of fields: date, model, API key or project ID, number of input and output tokens, cached tokens, number of requests, and the source cost if the provider reports one.

Next, link each API key to a team, a product, and, if needed, an internal cost center. It is better to keep this mapping in a separate reference table rather than editing it by hand at month-end. Then normalize model and plan names. One provider writes the full version name, another uses a short alias. For the report, this must be one entity, otherwise spend for one model will be split across several rows.

After that, convert tokens into money using one rate table. It should include the price for input, output, and cached tokens, the currency, and the rate effective date if the pricing changed during the month.

When the calculation is ready, roll everything into a summary table. Usually these columns are enough: period, team, product, model, provider, input tokens, output tokens, cached tokens, number of requests, and amount. Add a monthly total and a reconciliation by team separately.
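The whole roll-up step can be sketched as one function. The key-to-team mapping, the alias table, and the rates below are all hypothetical placeholders for the reference tables described above; in practice they would live in separate files with their own owners.

```python
from collections import defaultdict

# Sketch of the roll-up: normalized usage rows in, one summary row per
# (team, product, model, provider) out. All names and rates are illustrative.
KEY_MAP = {"key-123": ("support", "helpdesk-bot")}            # api key -> (team, product)
MODEL_ALIASES = {"gpt-x-2025-01": "gpt-x", "gpt-x": "gpt-x"}  # normalize model names
RATES_PER_1M = {"gpt-x": {"input": 2.0, "output": 8.0, "cached": 0.5}}

def summarize(usage_rows):
    summary = defaultdict(
        lambda: {"input": 0, "output": 0, "cached": 0, "requests": 0, "amount": 0.0}
    )
    for row in usage_rows:
        team, product = KEY_MAP[row["api_key"]]     # business context from the mapping
        model = MODEL_ALIASES[row["model"]]         # one entity per model, not per alias
        rates = RATES_PER_1M[model]
        s = summary[(team, product, model, row["provider"])]
        for kind in ("input", "output", "cached"):
            s[kind] += row[kind]
        s["requests"] += row["requests"]
        s["amount"] += sum(
            row[kind] / 1_000_000 * rates[kind] for kind in ("input", "output", "cached")
        )
    return dict(summary)
```

Because aliases are normalized before grouping, spend for one model lands in one row even when two providers export it under different names.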

What to check before finalizing

The sum of all rows should match the sum by provider for the same period. If it does not, do not look for an "accounting error" first; look for time-zone differences, missing keys, or an old rate in the calculation.
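This check is easy to automate. A minimal sketch, assuming each report row carries a `provider` and an `amount`, and that a small tolerance absorbs rounding:

```python
# Sketch: reconcile report rows against per-provider invoice totals before
# finalizing. Any difference beyond the rounding tolerance needs investigation.
def reconcile(report_rows, provider_totals, tolerance=1.0):
    """Return per-provider differences that exceed the tolerance."""
    by_provider = {}
    for row in report_rows:
        by_provider[row["provider"]] = by_provider.get(row["provider"], 0.0) + row["amount"]
    mismatches = {}
    for provider, invoiced in provider_totals.items():
        diff = by_provider.get(provider, 0.0) - invoiced
        if abs(diff) > tolerance:
            mismatches[provider] = diff
    return mismatches
```

An empty result means the totals agree; a non-empty one points at the provider to investigate first.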

If traffic goes through a single gateway, part of the work becomes easier. For example, with AI Router you can route requests through one OpenAI-compatible endpoint and then reconcile monthly costs in tenge from one B2B invoice. That does not replace internal team-level breakdowns, but it does significantly reduce the number of manual joins between different dashboards.

How to allocate costs by team

When all LLM spend sits in one total, disputes start quickly. Finance sees the bill, the CTO sees rising costs, and teams see only their own slice of work. You need a way to split expenses so the rows can be read without extra calls.

Start with the simplest rule: project API keys versus shared keys. A project key immediately ties the expense to a team or product. Keep shared keys only for things that cannot honestly be assigned to one owner, such as a sandbox, a short pilot, or an internal bot.

If the shared pool becomes too large, the report quickly loses meaning. This happens often: expensive requests go through the shared key, and then people try to split them by intuition. In the end, the numbers exist on paper, but nobody trusts them.

Keep test and prod separate. Otherwise, the team that spent a week testing a new feature on an expensive model looks the same as the team with real user traffic. Two separate columns usually remove half the questions during approval.

For each team, it helps to show the monthly limit, actual spend, variance in money and percent, plus a short comment if the amount changed significantly. You do not need a long analysis. One clear sentence is usually enough: "We launched knowledge base search for 12,000 users" or "We tested a long-context model in QA for two weeks." Without that note, the spike looks like a mistake.
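The per-team line is simple enough to compute mechanically; only the comment needs a human. A sketch, with illustrative field names:

```python
# Sketch: one team line with limit, actual spend, and variance in money
# and percent, plus the one-sentence comment for significant changes.
def team_line(team, limit, actual, comment=""):
    variance = actual - limit
    variance_pct = round(variance / limit * 100, 1) if limit else None
    return {
        "team": team,
        "limit": limit,
        "actual": actual,
        "variance": variance,
        "variance_pct": variance_pct,
        "comment": comment,  # one clear sentence when the amount moved a lot
    }
```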

If you already have a single source of logs and audit records, building this breakdown is easier. But the rule does not change: every expense should have an owner, an environment, and a clear explanation.

Monthly report example

Imagine July. The company has two major spending items: a support chat for online store customers and summarization of internal reports for analysts. When these numbers are brought into one monthly report, both accounting and the CTO look at the same figures.

| Team | Scenario | Model | Input tokens | Output tokens | Input, ₸ | Output, ₸ | Total, ₸ |
|---|---|---|---|---|---|---|---|
| Retail | Support chat | Model A | 18.2M | 6.1M | 38,000 | 96,000 | 134,000 |
| Analytics | Report summarization by day | Model B | 4.8M | 11.7M | 27,000 | 168,000 | 195,000 |
| Analytics | Report summarization at night | Model C | 5.1M | 12.4M | 19,000 | 103,000 | 122,000 |
| Total | - | - | 28.1M | 30.2M | 84,000 | 367,000 | 451,000 |

This kind of summary quickly shows where the budget is going. At first glance, support might seem more expensive because it has more conversations and incoming messages. But after consolidation, the opposite becomes clear. The main cost comes from output, because the model is not just reading input text, but writing long answers, rewrites, and summaries.

This is especially noticeable for analysts. They upload reports, tables, and meeting notes, and in return they get long summaries. They have fewer requests than support, but the answers are longer, so the output line grows faster. In this example, output makes up more than 80% of the monthly bill.

For the CTO, this report is useful not only for the total, but also as a clue for where to find savings. Nighttime summarization tasks rarely need the most expensive model, so part of the workload can be moved to a cheaper one. The table already shows it: during the day analysts work on one model, at night on another, and the cost for the same type of work is lower.

For accounting, the advantage is different: the report format does not change after such a switch. Even if you moved part of the workload between models or providers during the month, the overall template stays the same — period, team, scenario, tokens, and amount.

Where mistakes happen most often

Most problems start not in finance, but in the source data. The team tracks tokens by one rule, finance closes the month by another, and then nobody understands why the report does not match.

The first common mistake is mixing periods. Logs may run in UTC, while the company closes the month using Almaty time. Because of this, requests from the last hours of the month land in one token report, but in the bill they end up in another. With a small volume, this is barely noticeable. In production, the difference quickly becomes obvious.
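One way to avoid this mistake is to convert every log timestamp to the company's reporting time zone before bucketing it into a month. A minimal sketch, assuming the company closes the month on Almaty time:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Sketch: bucket a UTC log timestamp into the company's reporting month
# (Asia/Almaty here) so the last hours of the month land in the same
# period as the bill, not in the next one.
def reporting_month(utc_timestamp, tz="Asia/Almaty"):
    local = utc_timestamp.astimezone(ZoneInfo(tz))
    return (local.year, local.month)
```

A request logged at 20:00 UTC on May 31 belongs to June in Almaty time, which is exactly the kind of row that otherwise shows up in one token report and a different invoice.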

The second mistake is working only with model aliases. In code, the team may see something like "support-main" or "gpt-4.1", but billing depends on the actual model, provider, and version at the time of the request. If the same alias pointed to different models on different days, the average price for that alias gives a false picture. The report needs a chain: alias, actual model, provider, and rate.

The third mistake is mixing internal experiments with customer traffic. Then testing a new feature, prompt evaluation, and load testing all look like product expenses. The CTO sees rising costs, while accounting does not understand what belongs to R&D and what can be assigned to a specific service.

The fourth mistake is currency. Even careful teams add invoices in dollars and tenge together, and each time they use a rate from a different source. As a result, the same token volume produces different amounts in different tables. It is better to fix the exchange rate for the reporting period once and write it into the methodology.

And one more trap: treating all tokens as the same. For many models, price depends on token type — input, output, cached. Sometimes different request types are also included in the calculation, but the summary table collapses everything into one column called "tokens." After that, cache savings disappear on paper even though they exist in money terms.

Before sending, check five things:

  • do the reporting period, billing data, and invoices match;
  • do the rows contain the actual model, not just the alias;
  • is customer traffic separated from tests and internal runs;
  • is one currency conversion rate fixed;
  • are tokens split by type instead of merged into one total.

If even one item is missing, the report will almost certainly need to be rebuilt.
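The five checks lend themselves to a mechanical pass over the rows. The field names below are assumptions about your template, not a standard; adapt them to the columns you actually use.

```python
# Sketch: the pre-send checklist as per-row validation. Field names
# ("period", "model", "environment", "rate", token columns) are illustrative.
REQUIRED_FIELDS = {"period", "model", "environment", "input", "output", "cached"}

def check_row(row, expected_period, known_models, period_rate):
    problems = []
    if row.get("period") != expected_period:
        problems.append("period mismatch")
    if row.get("model") not in known_models:          # alias instead of actual model
        problems.append("unknown model name")
    if row.get("environment") not in ("prod", "test"):  # customer vs internal traffic
        problems.append("environment not set")
    if row.get("rate") != period_rate:                # one conversion rate per period
        problems.append("unexpected exchange rate")
    if not REQUIRED_FIELDS <= row.keys():             # tokens split by type
        problems.append("missing token breakdown")
    return problems
```

Running this before sending turns "the report will need to be rebuilt" into a list of specific rows to fix.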

Quick check before sending

Before sending, do one more quick pass. It takes 10-15 minutes, but it saves hours of emails and calls later.

First, look at the rows as regular expenses, not just as a set of tokens. If a row has no team, product, or budget owner, it will be hard to approve. For the CTO, it is an unclear cost; for accounting, it is a line without an owner.

Then check the rates. Each model should have one rate for the chosen period. If the price changed in the middle of the month, split the usage into two rows with different dates instead of mixing everything into one amount.

Next, compare the total with the provider or gateway data. The amount should match except for a clear difference, such as rounding or a separate VAT line. Any noticeable mismatch is better explained immediately in one sentence: "Up 42% due to the support pilot launch, 18 million tokens over 6 days."

There is a simple test that works well: show the report to someone outside the AI team. If the accountant or product leader understands the rows without a half-hour call, the format is already working.

What to change next month

If you built this month’s report manually, do not try to rebuild everything at once. Start with the basic rules. Fix the field dictionary: one model name, one team name, one currency, and one way to calculate usage. As long as one file says "gpt-4.1-mini," another says "GPT 4.1 mini," and a third just says "mini," reconciliation will keep eating hours.

Then move new teams to one template. Usually this set of fields is enough: period, team, project, model, provider, input tokens, output tokens, cached tokens, and amount. That is enough for finance to see the money and for the CTO to see where the load is growing and why the bill is changing.

It helps to assign one owner to the reference table. Without that, names quickly drift, especially if one team tests new models every week and another exports data from its own system however it can.

A next-month plan usually looks simple: approve the dictionary of model and team names, give everyone one report template, agree on who checks usage before month-end, and reconcile the report total with the invoice in advance.

If manual reconciliation is still too heavy, it may be worth routing requests through one API gateway. When traffic goes through a single entry point, usage comes back in a more consistent format, and you do not need to gather numbers from different dashboards. For teams already working with an OpenAI-compatible stack, this is especially convenient: for example, in AI Router you can change only the base_url to api.airouter.kz, keep your current SDKs and prompts, and then receive one monthly B2B invoice in tenge. In reporting, that is useful not because of a nice diagram, but because there is less manual stitching and fewer disputed rows.
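For an OpenAI-compatible stack, the switch described above is roughly a one-line change. This is a sketch: the exact URL path and the environment variable name are assumptions, so check the gateway's documentation for the real values.

```python
import os
from openai import OpenAI

# Sketch of the base_url swap: only the endpoint changes, the SDK,
# prompts, and call sites stay as they were. The "/v1" path and the
# AIROUTER_API_KEY variable name are assumptions, not confirmed values.
client = OpenAI(
    base_url="https://api.airouter.kz/v1",  # only this line changes
    api_key=os.environ["AIROUTER_API_KEY"],
)

# Existing calls keep their shape, e.g.:
# client.chat.completions.create(model=..., messages=[...])
```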

A good target for next month is simple: the report should be built in one pass. If by the end of the period you can explain any large amount in a couple of minutes, the system is already working.