Unified Token Accounting Across Providers Without Disputes
Unified token accounting brings input, output, cache, and service fields into one data model so invoices, logs, and reports match.

Where token disputes come from
Disputes rarely happen because of one big mistake. Usually they come from a dozen small differences in accounting. One provider reports prompt_tokens and completion_tokens, another splits spend into input, output, cache, and service fields right away, and a third only returns usage at the end of the stream or in a separate export. Finance and engineers look at the same requests but see different numbers.
Even the same prompt does not produce the same token count. Models have different tokenizers, system inserts, and rules for handling JSON, tool calling, and reasoning. Sometimes the provider adds its own service instructions, and the team never sees them in the application log. On paper it is one request, but the token count is already higher.
Most of the time the problem comes down to four things:
- providers name and group usage differently
- the same request passes through different tokenizers
- cache is tracked separately only by some providers
- the invoice includes things that are not in the application logs
Cache is especially confusing. For one provider, cached input tokens appear in a separate field and are cheaper. For another, they are part of the general input, and the discount is only visible on the bill. For a third, cache exists, but the API response barely shows it. So the engineer says, "we sent 2 million input tokens," while finance sees an amount as if part of them were billed at a different rate.
There is another reason too: teams reconcile different sources. Finance uses the monthly invoice. Engineers pull request logs, often only successful ones. But invoices may include retries, background security checks, service calls, minimum billing rounds, and delayed asynchronous jobs. That is not necessarily an error. It is simply a different accounting rule.
When a company works with several models through a single gateway, the gap becomes even more obvious. For the application it is one endpoint, but behind it live dozens of providers with their own usage formats and billing rules. If you do not agree on a shared scheme in advance, month-end close turns into a chat dispute.
What to count in one record
One record should describe one actually billable model call. Not the whole chat and not a user's session for the day, but one request to one specific model from one specific provider. Otherwise reconciliation breaks quickly: engineers look at app logs, finance looks at the provider invoice, and the numbers do not line up.
For unified token accounting, it is not enough to store only the grand total. You need separate fields for the parts of the request that really affect price.
Start with input tokens. That includes not only the user's text, but also the system prompt, role template, safety instructions, and any inserts the application added before sending the request.
Store output tokens separately too. That is the model's answer the user received, plus any extra parts of the response if the provider includes them in billing. When the team only sees the combined input and output total, by month-end it is no longer clear what grew: longer prompts or overly verbose answers.
Cache needs its own column. If a repeated part of the prompt landed in the prompt cache, do not mix those tokens with regular input. In money terms this is often a different rate, and for analysis it is a different kind of spend. One request may have three different numbers: full input, cache hit, and the actually billed uncached input.
Do not hide service tokens inside the total either. These usually include tokens for tool calls, JSON schemas, service messages between agent steps, and internal passes if the provider returns them separately. In agent scenarios this is especially visible: from the outside the user asked one question, but inside the system made several more service exchanges.
Usually this set of fields is enough:
- user input tokens
- system input tokens
- model output tokens
- cache tokens and service tokens
- rate, billing unit, and currency
It is better to store the rate and currency next to the counters rather than in a separate table that half the system later references. Rates change, providers bill differently, and retroactive recalculation almost always ends in a dispute.
If you work through a single OpenAI-compatible layer like AI Router, it is easy to normalize these responses into one record at the entry point. Then later you do not have to figure out what each counter meant for each provider.
How to normalize provider responses into one format
Different LLM providers send the same numbers under different names. One uses input_tokens, another prompt_tokens, and a third returns only the total. If the team does not normalize everything into one shape, billing quickly turns into an argument about terminology.
You need one internal record that every provider response gets mapped into. Do not try to store things "as everyone else does." Store them "the way you do," and map other fields into your own format.
Minimal schema
Usually these fields are enough:
- request_id and attempt_id
- provider_id, model_id, tariff_version
- input_tokens, output_tokens, cache_read_tokens, cache_write_tokens
- service_tokens or separate service-token fields
- flags is_stream, is_retry, is_fallback
- raw_response and normalized_at
This schema solves half the problems. Finance sees the same counters across all providers. Engineers can see which response produced those numbers.
It is best to store the raw response next to the normalized record. When a provider changes the usage format or suddenly adds a new counter, the team can quickly recheck older calculations. This is especially useful in months with retries and fallbacks: the final amount is there, but without the raw response it is hard to tell which call produced it.
Do not mix zero and unknown values. If the provider explicitly returned 0, store 0. If it did not send the field at all, use null and note the reason. Otherwise the report will make it look like cache or service tokens were definitely not used, when the provider simply did not disclose them.
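As a minimal sketch, mapping provider aliases into one internal record while preserving the zero-versus-null distinction might look like this (the alias lists and field names are illustrative, not any provider's full format):

```python
# Map each provider's usage field names onto one internal schema.
# Aliases here are illustrative; extend the table per real provider.
FIELD_ALIASES = {
    "input_tokens": ["input_tokens", "prompt_tokens"],
    "output_tokens": ["output_tokens", "completion_tokens"],
    "cache_read_tokens": ["cache_read_tokens", "cached_tokens"],
    "cache_write_tokens": ["cache_write_tokens"],
    "service_tokens": ["service_tokens"],
}

def normalize_usage(provider_usage: dict) -> dict:
    """One internal record. None means 'provider did not report it';
    0 means 'provider explicitly reported zero'."""
    record = {}
    for internal_name, aliases in FIELD_ALIASES.items():
        value = None  # unknown until the provider says otherwise
        for alias in aliases:
            if alias in provider_usage:
                value = provider_usage[alias]  # may legitimately be 0
                break
        record[internal_name] = value
    return record

usage = normalize_usage({"prompt_tokens": 1200, "completion_tokens": 300,
                         "cached_tokens": 0})
# cache_read_tokens stays 0 (confirmed zero), cache_write_tokens and
# service_tokens stay None (never reported)
```

The same function is also a natural place to stamp normalized_at and attach the raw response.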
Also record provider_id, model_id, and tariff version separately. The model may keep the same name, but the price may change in the middle of the month. Without tariff_version, it is hard to prove later why two identical requests cost differently.
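A tariff lookup keyed by version makes the "why did two identical requests cost differently" question answerable from the data. A sketch, with made-up providers, models, and rates:

```python
# Rates per 1,000 tokens keyed by (provider_id, model_id, tariff_version).
# Providers, models, versions, and prices below are all made up.
RATES = {
    ("prov_a", "model_x", "2024-06"): {"input": 0.5, "output": 1.5},
    ("prov_a", "model_x", "2024-07"): {"input": 0.6, "output": 1.5},
}

def rate_for(provider_id, model_id, tariff_version):
    key = (provider_id, model_id, tariff_version)
    if key not in RATES:
        # Fail loudly instead of silently reusing an old tariff.
        raise KeyError(f"no tariff recorded for {key}")
    return RATES[key]

# Two identical requests can legitimately cost differently when their
# records carry different tariff_version values.
assert rate_for("prov_a", "model_x", "2024-06")["input"] == 0.5
assert rate_for("prov_a", "model_x", "2024-07")["input"] == 0.6
```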
Streaming, retries, and fallback need strict discipline. For streams, calculate the total only after the response completes. For retries, store each attempt separately and tie them together with a shared request_id. For fallback, record the original route: which model did not answer, which one received the request, and whose tokens were billed.
If several providers sit behind one endpoint, this setup is much easier to maintain when normalization is centralized instead of implemented differently in every service.
How to implement accounting step by step
Start with one shared usage record, not with a dashboard. As long as each team has its own fields and its own naming, there will be no unified accounting. First agree on a minimal set: request_id, date, provider, model, input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, service_tokens, currency, unit_price, total_cost.
This schema should live in one place and change only through explicit versions. If you rename a field or decide to count cache differently in the middle of the month, finance will get one number and engineers another. The rule is simple: new rules take effect only with a new reporting period.
What to do in the first week
- For each provider, document the field mapping in a table, not in the team’s memory. For example, map prompt_tokens to input_tokens and completion_tokens to output_tokens, and keep cache and service tokens separate.
- Calculate cost at the request level. Do not wait until the end of the day and do not multiply an average price by the total volume. For the same provider, price may differ by model, token type, and cache mode.
- Save not only the final amount, but also the calculation formula. Then any dispute is resolved quickly: you can see which tokens were billed and at what rate.
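The second and third points can be sketched together: compute cost per token type at request time and store the formula next to the total (the rates and token-type names here are illustrative):

```python
def cost_breakdown(tokens, rates_per_1k):
    """Cost per token type, plus a human-readable formula for dispute time."""
    parts, total = [], 0.0
    for kind, count in tokens.items():
        amount = count / 1000 * rates_per_1k[kind]
        total += amount
        parts.append(f"{kind}: {count} * {rates_per_1k[kind]}/1k = {amount:.6f}")
    return {"total_cost": total, "formula": "; ".join(parts)}

result = cost_breakdown(
    {"input": 2000, "output": 500, "cache_read": 1000},  # one request
    {"input": 0.5, "output": 1.5, "cache_read": 0.1},    # price per 1k tokens
)
# each token type is billed at its own rate, and the formula string
# records exactly how the total was built
```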
After that, introduce a daily reconciliation. The total from the logs for the day should match your aggregated table and what later appears in the invoice. A small difference from rounding is normal, but it is better to set a threshold in advance, for example within 0.5-1%.
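The daily check itself can stay tiny; a sketch using the 1% end of the suggested threshold:

```python
def reconcile(log_total, invoice_total, threshold=0.01):
    """True when the relative gap between log and invoice totals is within threshold."""
    if invoice_total == 0:
        return log_total == 0
    return abs(log_total - invoice_total) / abs(invoice_total) <= threshold

assert reconcile(99_500, 100_000)      # 0.5% gap from rounding: fine
assert not reconcile(97_000, 100_000)  # 3% gap: go look for retries or filters
```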
If the team sends requests through a single gateway, this stage is easier: you already have one request stream and centralized logs. But that still does not remove the need to reconcile with the invoice.
A small example quickly shows whether the scheme works. A bank sends some requests to OpenAI, some to Anthropic, and an internal chat uses a hosted open-weight model. For the day’s report, all of these calls must become records of the same type. Then finance sees the amount in tenge, and the engineer understands which tokens it came from.
At the end, assign roles. One person updates the mapping, another checks the daily total, and a third approves rule changes for the new month. When responsibility is clear from the start, LLM billing stops being a constant source of arguments.
What money rules to agree on immediately
Money usually drifts away from engineering numbers not because of complicated formulas, but because of different accounting rules. Those rules need to be fixed at the very beginning.
First rule: use the price of the model that actually answered the request. An alias is convenient for code, but it is a poor reference for money. A label like "smart-default" tells finance nothing if the gateway sent traffic to a specific model from a specific provider. The event should include provider, model, tariff version, and request time.
Second rule: do not throw all tokens into one bucket. Input, output, cached, and service tokens often cost differently. Sometimes service tokens do not need to be shown to the client as a separate line, but inside the system it is still useful to keep them separate. If the provider sent fields for cached input or reasoning tokens, keep them as they are.
Agree on credits and free quotas before the first invoice. You can subtract them from the monthly total. You can show them as a separate discount line. You can charge them to the internal budget of the team that received the quota. Any of these approaches works. The only bad option is changing the interpretation every month.
Rounding needs ordinary discipline. Keep the raw token volume and cost with at least 6 decimal places, and round to 2 decimals only in the final report total. If one dashboard rounds each line and another rounds only the total, a dispute will appear even with honest data.
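With Python's standard decimal module, the "store at 6 decimals, round only the report total to 2" rule might be sketched like this:

```python
from decimal import Decimal, ROUND_HALF_UP

SIX = Decimal("0.000001")   # storage precision for raw cost
TWO = Decimal("0.01")       # precision of the final report total

def q(value, step):
    return value.quantize(step, rounding=ROUND_HALF_UP)

lines = [q(Decimal("0.004999"), SIX)] * 3    # three cheap requests

# Right: keep full precision per line, round once at the report total.
right = q(sum(lines), TWO)                   # 0.014997 -> 0.01

# Wrong: round every line first, then sum. The spend silently vanishes.
wrong = sum(q(line, TWO) for line in lines)  # 0.00 + 0.00 + 0.00 = 0.00
```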
Also separate test and production traffic. Not by project name in the table, but by environment tag, key, budget, and export rules for reports. Then load tests, sales demos, and nightly experiments will not end up in the product’s financial report.
The minimum rules look like this:
- one accounting record per request or batch
- the actual model and provider in every record
- separate fields for input, output, cache, and service tokens
- one rule for credits, quotas, and rounding
- a clear environment tag: test or prod
If traffic goes through a single gateway, some fields can be collected centrally. But money rules are still defined by your team. Set them once, and month-end close becomes a normal operation rather than an argument about whose numbers are more correct.
Where teams most often make mistakes
Disputes almost always start not because of large sums, but because of small differences in accounting rules. One report counts everything returned by the provider, another tries to "clean up" the numbers, and by the end of the month no one understands where the mistake is.
If the team routes requests through several providers, the difference is visible right away. Especially when one provider reports cache read separately, another counts service tokens inside input, and a third returns usage only after a retry.
Typical mistakes look like this:
- cached tokens are added to regular input, and the cache-hit discount disappears from the summary table
- repeated requests after a timeout are not marked separately, and spend appears twice as high
- the price is taken from the documentation instead of the actual invoice
- usage is reconciled in UTC while the money report is built by local date
- service tokens are mixed with user text
In practice it looks simple. The team sees 2 million input tokens for the month and calculates budget at the full rate. Then it turns out that 600,000 of them came from cache and should have been cheaper. Another 120,000 came from retries after timeouts. And part of the spend belongs to service overhead that could be reduced without affecting the answer.
If you combine several providers through one layer, these mistakes are easier to catch at a single accounting level. One useful rule here is simple: keep both the provider’s raw fields and your normalized schema side by side. Do not replace one with the other.
A simple test is more useful than any argument: take one day, convert all timestamps to one time zone, separate cache tokens, retries, and service tokens, then recalculate the amount using the actual rates from the invoice. Usually after that the mismatch either disappears or becomes clear within a few minutes.
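That one-day test can be scripted. A sketch, assuming illustrative rates, ISO-8601 timestamps with offsets, and an is_retry flag on each raw record:

```python
from datetime import datetime, timezone

TOKEN_KINDS = ("input", "cache_read", "output", "service")

def day_recalc(records, day, rates_per_1k):
    """Recompute one UTC day's spend from raw records; keep retry spend visible."""
    total = retry_part = 0.0
    for rec in records:
        ts = datetime.fromisoformat(rec["ts"]).astimezone(timezone.utc)
        if ts.date().isoformat() != day:
            continue  # one time zone for everyone before comparing dates
        cost = sum(rec.get(kind, 0) / 1000 * rates_per_1k[kind]
                   for kind in TOKEN_KINDS)
        total += cost
        if rec.get("is_retry"):
            retry_part += cost
    return total, retry_part

rates = {"input": 0.5, "cache_read": 0.1, "output": 1.5, "service": 0.5}
records = [
    {"ts": "2024-06-10T23:30:00+05:00", "input": 1000, "output": 200},
    # Local date says June 11, but in UTC this call still belongs to June 10.
    {"ts": "2024-06-11T01:00:00+05:00", "input": 1000, "is_retry": True},
]
total, retry_part = day_recalc(records, "2024-06-10", rates)
```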
A month without manual disputes
A support bot answers customers all month. It sends routine questions to one provider and complex long-context conversations to another. The same system instructions repeat thousands of times, so part of the long prompts goes into cache.
By month-end, engineers look at usage for each request. They see input and output tokens, cache-hit frequency, and average response length. Finance looks at something else: how much each token type costs at provider rates, where cache is cheaper than regular input, and whether service tokens are a separate line or hidden inside input.
If the team works through a single gateway, this data is easier to reconcile. Responses from different providers already pass through one format, and all you need is one reporting table.
One summary at month-end
| Provider | Input | Cache | Output | Service | Total |
|---|---|---|---|---|---|
| A | 1 200 000 | 420 000 | 310 000 | 0 | 184 200 KZT |
| B | 640 000 | 150 000 | 290 000 | 86 000 | 233 900 KZT |
| Total | 1 840 000 | 570 000 | 600 000 | 86 000 | 418 100 KZT |
In this picture, the engineer sees 2,440,000 tokens if you add input (with cache already counted inside it) and output. Finance sees 418,100 tenge and asks why the amount is higher than expected. The answer is in one line: provider B has 86,000 service tokens, and the 570,000 cache tokens must be billed at the cache rate, not the normal input rate.
After normalization, the dispute disappears. The team follows a simple rule: input, output, cache, and service tokens live in separate columns, and the final amount is calculated only after applying rates to each column. Then the engineer quickly finds requests with rising usage, and finance can see what made up the bill without manually comparing CSV files and screenshots.
This summary is useful not only at month-end. If the total suddenly jumps in week two, the table immediately shows the source of the difference: more complex conversations moved to the second provider, cache dropped, or service overhead got longer after a new prompt.
Quick checks before closing the month
You do not need a large audit before month-end. A few short checks are enough to catch almost every cause of disputes. If unified accounting already exists, these checks take 15-20 minutes, not half a day of back-and-forth between finance and engineering.
First check the arithmetic. The sum of daily requests should match the monthly total. If the daily numbers add up to 3,842,190 tokens and the summary report says 3,861,000, the problem is almost always filters, duplicates, or logs that arrived late. Do not look for an explanation in the rates until the base numbers match.
Then look for records without provider_id and model_id. Even a small share of such rows distorts the picture. Finance sees spend, but the engineer cannot tie it to a tariff or source. It is better to move those records into a separate list right away and keep them out of the final report until they are manually labeled.
Check retries separately. A repeated request should not disappear inside the overall counter. Otherwise one team thinks the model got more expensive, while another knows it was just more retries because of timeouts.
Rounding often creates a quiet but annoying error. One report rounds every record, another rounds only the daily or monthly total. At scale the difference becomes noticeable. The rule should be one and written clearly: where you round, to how many decimals, and at which stage.
Another trap is cache. If cached tokens are mixed with regular input, the team only sees total input and loses the real cost picture. Cache should be shown separately: regular input, cached input, output, and service tokens. Then it is clear where you are really paying more and where the system is saving money.
A good final test sounds boring, but it works: take one day from the middle of the month and trace the path from the raw log to the report total. If that path is readable without guesswork, the month usually closes smoothly. If even one day already has gaps and different counting logic, a dispute at month-end is almost guaranteed.
What to do next
Do not try to cover every team, model, and bill at once. For a pilot, one service and one closed month are enough. That makes it easier to spot mismatches, check formulas, and understand where accounting breaks before a large rollout.
Start with raw data. Put usage from all providers into one storage without on-the-fly "improvements." First save the response as it is, and only then build the normalized record. It is a boring step, but it is exactly what saves you when finance asks why the report shows 12,430,000 tokens and the invoice shows a different number.
A minimal plan looks like this:
- choose one product with stable traffic and a clear owner
- collect all usage responses, including input, output, cache, and service tokens
- fix the recalculation rules in one scheme and do not change them within the month
- reconcile the result not only by tokens, but also by the money in the invoice
Also agree with finance on which fields count as the basis for billing. Usually period, provider, model, project or cost center, token type, unit price, currency, and total amount are enough. If this list does not exist, the dispute will start again at the first 2-3% deviation.
I would not start with a pretty dashboard. First you need a simple register where every row is explainable. If an analyst or engineer cannot manually trace the path from the provider’s raw response to the amount in the statement, the scheme is still too rough.
When there are many providers, manual stitching gets tiring fast. In that case it is convenient to unify routing and accounting through AI Router at airouter.kz: a single OpenAI-compatible API gateway where you can keep your familiar SDKs, code, and prompts, and then centrally normalize usage across providers. For teams in Kazakhstan and Central Asia, monthly B2B invoicing in tenge is also useful.
A good pilot outcome is simple: after one month you have a single table, clear rules, and a list of discrepancies with a reason for each case. After that, you can expand the scheme to other services without manual disputes over every row.
Frequently asked questions
Why do tokens in logs and in the invoice often not match?
Usually you are comparing different sources. Engineers look at request logs, while finance looks at the invoice, which already includes retries, cache, service calls, rounding, and sometimes late asynchronous tasks.
What counts as one usage record?
Count one record as one actually billable model call. Do not merge the whole chat, a user session for the day, or several attempts of one request into a single row.
What fields do you need in a unified accounting schema?
At a minimum, store request_id, attempt_id, provider, model, input tokens, output tokens, cache tokens, service tokens, price, currency, and total cost. That is enough to check both volume and money without guesswork.
How should cached tokens be accounted for?
Do not mix cache with regular input. Keep the full input, cache_read_tokens or a similar field, and the amount billed at the normal rate separate, otherwise the cache discount will disappear in the total.
What should you do if the provider did not send part of usage?
Set null if the provider did not send the field, and 0 if it explicitly returned zero. That difference saves you from falsely assuming there were definitely no cache or service tokens.
How do you count retries and fallback without confusion?
Store each attempt separately under a shared request_id. For billing, use the model and provider that actually handled that attempt, not the alias from the code.
When should cost be rounded?
Round only the final report total, and keep tokens and cost in the database with enough decimal precision. If one report rounds every row and another rounds only the total, a dispute will appear even with honest data.
How do you avoid mixing test and production traffic?
Separate environments by tag, key, budget, and export rules, not by project name. That way demos, load tests, and night experiments do not end up in the production financial report.
Should you keep the provider's raw response?
Yes, keep it next to the normalized record. When a provider changes the usage format or the price changes in the middle of the month, you can quickly verify the old calculation and see where each number came from.
Where should you start if you want unified accounting without a big project?
Start with one service and one closed month. Collect the raw responses, introduce one schema for all providers, calculate cost at the request level, and reconcile the log total with the aggregated table every day.