Nov 08, 2024 · 8 min read

Token Spikes After a Release: How to Find the Cause Before the Bill Arrives

A token spike after a release is easy to miss. Learn how to check prompt length, call frequency, retries, and strange post-release behavior before the bill arrives.

When token growth becomes a real problem

Growth by itself does not mean something is broken. If a release brought in more users and token usage went up by roughly the same 10-20%, that looks like normal load. The problem starts when spend grows faster than the number of useful actions: the number of requests stays the same, but tokens per scenario are suddenly 1.5x to 2x higher.

This kind of spike is often noticed too late. Billing lives separately from production: providers and gateways first collect usage, then roll it up by hour or by day, and finance sees the total even later. By the time the team checks the money, the bug may have been running for a full day.

The first signals usually show up earlier. One of three things normally changes: the prompt gets longer, the LLM call frequency goes up, or the release introduces strange behavior. For example, the bot suddenly makes two requests instead of one, a background job runs through old conversations, or retries jump because of a new error.

A quick test is simple: look at tokens per useful action, not total tokens. For chat, that means tokens per conversation or per message. For classification, it means tokens per document. If the business metric barely changed but spend went up, that is no longer growth — it is a defect or a bad setting.
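
A minimal sketch of that check, assuming your usage logs already record per-request token counts and some id for the useful action (the field names here are illustrative, not a specific provider's schema):

```python
from collections import defaultdict

def tokens_per_useful_action(usage_rows, action_key="conversation_id"):
    """Average tokens per useful action (conversation, document, screen).

    `usage_rows` is assumed to be an iterable of dicts such as
    {"conversation_id": "c42", "input_tokens": 812, "output_tokens": 164};
    adapt the field names to whatever your logging actually stores.
    """
    totals = defaultdict(int)
    for row in usage_rows:
        totals[row[action_key]] += row["input_tokens"] + row["output_tokens"]
    return sum(totals.values()) / len(totals) if totals else 0.0

# Run it over the day before and the day after the release:
# if the ratio is 1.5x-2x while requests are flat, treat it as a defect.
```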

You do not have to go straight into code. First check the basic numbers: how many requests per minute there are now, how many tokens each call uses, which models are being called more often, whether there is a spike in errors and repeated requests, whether cache disappeared, and whether the system prompt got bigger.

If you have a single gateway with logs and usage stats by key, these shifts are usually visible within the first few hours. If you do not have that layer, use application logs and usage data from the provider.

Normal growth rarely looks like a sharp step. A bug does: at 11:40 the release went out, at 11:45 tokens per request were up 70%, and by evening the counter was already far off target. In that situation, you cannot wait for billing. You need to catch the deviation on the day of release.

How to find what actually changed

When you see a spike after a release, do not look at total spend first. Split the problem into two parts: did the number of requests go up, or did each request become heavier? For billing, that is almost the same thing. For finding the cause, it is a completely different story.

The useful formula is simple: total tokens = number of requests x average tokens per request. Sometimes both numbers rise a little and together create a very ugly result. If calls went up by 30% and the average request length went up by 25%, total spend already looks like an incident.

Compare four metrics by day or by hour:

  • input tokens per request
  • output tokens per response
  • number of requests per user, screen, or task
  • number of requests per product action
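
As a rough way to see both factors of the formula at once, here is a small sketch, assuming call-level logs loaded into a pandas DataFrame (the column names are illustrative):

```python
import pandas as pd

# df: one row per LLM call with columns "ts", "input_tokens", "output_tokens"
# (column names are illustrative; adapt them to your logs).
df["ts"] = pd.to_datetime(df["ts"])
hourly = df.groupby(pd.Grouper(key="ts", freq="1h")).agg(
    requests=("input_tokens", "size"),
    avg_input=("input_tokens", "mean"),
    avg_output=("output_tokens", "mean"),
)
# total tokens = number of requests x average tokens per request,
# so both factors are visible side by side, hour by hour
hourly["total_tokens"] = hourly["requests"] * (hourly["avg_input"] + hourly["avg_output"])
print(hourly.tail(24))
```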

If input tokens increased, check the prompt length first. After a release, the system prompt often grows without anyone noticing: new rules were added, a large JSON schema was included, tool-calling instructions got longer, more chat history was added, or extra knowledge-base text was appended. A particularly bad case is when the app starts sending the full conversation instead of just the last 3-5 messages.
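
If the culprit is the full conversation being sent on every call, the fix is usually a small helper like the one below. This is a sketch with OpenAI-style message dicts; the function name and the default cap are illustrative:

```python
def build_messages(system_prompt, history, user_message, max_history=5):
    """Keep only the tail of the conversation instead of the full history.

    Sending every past turn on every call is a common post-release
    regression; keeping the last few messages caps input tokens per request.
    """
    trimmed = history[-max_history:]  # the last 3-5 messages is usually enough
    return (
        [{"role": "system", "content": system_prompt}]
        + trimmed
        + [{"role": "user", "content": user_message}]
    )
```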

If input barely changed, look at response length. The model may have started answering at greater length because of a new template, a different model, a removed max_tokens limit, or broader instructions. A task that used to take a couple of lines can suddenly turn into 400-600 tokens of explanation.

Also check call frequency separately. After a release, one screen sometimes makes not one request but three or four: the main answer, classification, summarization, title generation, moderation. On top of that, the frontend may repeat the call when the response is slow, and background jobs may run twice. From the outside, it looks like a general spike, even though prompt length had nothing to do with it.

A good data split is by key, model, and route. That makes it faster to see where the logic broke: one endpoint grew its context size, while another simply doubled its request rate.

For cost control, it helps to keep a simple table by scenario: requests per action, input tokens per request, and output tokens per response. Usually one column quickly shows what went wrong.

What usually breaks right after a release

After a release, tokens almost never grow without a reason. Usually a small setting breaks, and tests did not catch it.

A common example is a new flag that was supposed to enable detailed answers only for part of the traffic, but ended up turning on for everyone. The issue looks harmless until you look at the numbers: the model used to answer briefly, and after deployment it started writing 4-5 times longer. The team is happy that the answers are more complete, while spend is already climbing.

Retries after a timeout create just as many problems. The client did not get a response in time, assumed the request was lost, and sent it again. But the first call had already gone to the model and was also billed. The user sees one answer, while the system pays for two or three.

After a release, this often happens because of a new retry policy in the SDK, mobile app, or backend. If nobody compared timeout counts with repeated-call counts, a token spike is easy to confuse with normal traffic growth.
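
One way to check this without waiting for billing is to look for identical payloads repeated within a short window. A minimal sketch, assuming you can export call logs with a timestamp and the serialized prompt (field names are illustrative):

```python
import hashlib

def duplicate_call_share(calls, window_seconds=60):
    """Share of calls that repeat an identical payload within a short window.

    `calls` is assumed to be a list of dicts like
    {"ts": 1730972400.0, "payload": "<serialized prompt>"}.
    """
    calls = sorted(calls, key=lambda c: c["ts"])
    last_seen = {}
    duplicates = 0
    for call in calls:
        digest = hashlib.sha256(call["payload"].encode()).hexdigest()
        prev_ts = last_seen.get(digest)
        if prev_ts is not None and call["ts"] - prev_ts <= window_seconds:
            duplicates += 1
        last_seen[digest] = call["ts"]
    return duplicates / len(calls) if calls else 0.0
```

If this share moves together with timeouts and 429s after the deploy, retries are the first suspect rather than audience growth.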

Another common source of spend is junk in the prompt. Suddenly it starts including logs, raw JSON, the full chat history, or an overly long search context. One such tail can add hundreds or thousands of input tokens, even though there is very little useful information in it.

You should also check the fallback model. If the primary model is unavailable, the service may switch to a backup that answers much more verbosely or does not enforce a strict response-length limit. The switch itself is useful, but the rules need to be set very precisely.

A quick check usually looks like this:

  • compare the average input and output length before and after release
  • find repeated requests with the same identifier or payload
  • open a few real prompts and check whether logs, JSON, or extra history were included
  • confirm that the model did not change and the response-length limit is still in place

If only one chart increased, the cause is often simple. If prompt length, call frequency, and fallback-model usage all rose at the same time, the problem is no longer in one place — it is in the whole release chain.

How to check the cause step by step

Start with traffic shape, not with the bill. Higher spend almost always comes down to three reasons: the request got longer, the number of calls went up, or the code started doing extra work after the release.

The fastest path is to compare average input and output tokens for the same scenario before and after the release. If input grew, look for a new system prompt, extra context, chat history, or duplicated documents. If output grew, check response limits, output format, and whether the app is asking for a longer explanation.

Next, break the traffic into clear pieces:

  1. Compare the 24 hours before release and the 24 hours after it by average tokens per request.
  2. Split the data by endpoint, model, and API key.
  3. Tie the spike in time to the deploy, the feature flag, or a route change to another model.
  4. Take 20-50 live requests and inspect what actually went into the prompt.
  5. Check retries, timeouts, and parallel calls for the same action.
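
Steps 1 and 2 are easy to script. A sketch with pandas, assuming call-level logs; the column names and release time are illustrative:

```python
import pandas as pd

# df: one row per call with "ts", "endpoint", "model", "api_key",
# "input_tokens", "output_tokens" (illustrative column names).
df["ts"] = pd.to_datetime(df["ts"])
release_at = pd.Timestamp("2024-11-05 11:40")  # your deploy time
day = pd.Timedelta("24h")

before = df[(df["ts"] >= release_at - day) & (df["ts"] < release_at)]
after = df[(df["ts"] >= release_at) & (df["ts"] < release_at + day)]

def profile(frame):
    return frame.groupby(["endpoint", "model", "api_key"]).agg(
        requests=("input_tokens", "size"),
        avg_input=("input_tokens", "mean"),
        avg_output=("output_tokens", "mean"),
    )

# Positive deltas show which route got heavier and which got more frequent.
delta = profile(after).subtract(profile(before), fill_value=0)
print(delta.sort_values("avg_input", ascending=False).head(10))
```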

At step three, you often find a cause nobody expected. For example, one flag turned RAG on for only part of the users, and instead of 3 short fragments, the prompt started receiving 12 long ones. On the chart, that looks like a general spike, even though the issue lives in just one flow.

Looking at real requests is almost always more useful than a clean aggregate metric. Take a few normal requests and a few from the spike zone. Compare the system instruction, history, embedded context, response format, and number of tool calls. Sometimes one extra log block adds more tokens than the entire user message.

Then check the call mechanics themselves. After a release, the app may retry on timeout, launch two parallel calls instead of one, or send the same request again through a queue. From the outside, it all looks like ordinary load, but LLM call frequency has already gone up by 1.5x to 2x.

If traffic goes through a single gateway, the analysis is usually easier: you can quickly inspect logs at the key, model, and request-time level. That makes it easier to separate a long prompt from extra retries and understand where the request economics broke.

Which metrics to keep in view

One overall spend chart is almost useless. It shows that money went out, but not why. During a spike, you need several metrics at once, all in the same cut: by service, API key, model, and scenario.

The first layer is tokens per request. Look at input and output separately. Growth in input tokens per request usually points to a longer system prompt, extra context, duplicate conversation history, or new logic that sends more data than before. Growth in output tokens is more often tied to the model answering in more detail, a lost response-length limit, or extra regeneration.

The second layer is call frequency. Even if prompt length did not change, requests per minute can easily double the bill. That is why it helps to keep two charts side by side: requests per minute by service and requests per minute by key. Then it is easier to tell whether the problem is one release, one client, or one specific integration.

A practical metric set usually looks like this:

  • input tokens per request by model and scenario
  • output tokens per request with p50 and p95
  • requests per minute by service, API key, and key routes
  • error and retry share, including retries after 429, 500, and timeouts
  • cost by model and scenario, not just total daily spend
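
If the logs are already in a DataFrame, the first three metrics take only a few lines each. A sketch with illustrative column names:

```python
import pandas as pd

# df: one row per call with "ts", "service", "api_key", "model", "output_tokens".
df["ts"] = pd.to_datetime(df["ts"])

# Output length per model: watch p95 as well as p50, the tail moves first.
out_stats = df.groupby("model")["output_tokens"].quantile([0.5, 0.95]).unstack()
out_stats.columns = ["p50_output", "p95_output"]

# Requests per minute by service and by key, kept side by side.
rpm_by_service = df.groupby([pd.Grouper(key="ts", freq="1min"), "service"]).size()
rpm_by_key = df.groupby([pd.Grouper(key="ts", freq="1min"), "api_key"]).size()

print(out_stats)
print(rpm_by_service.tail(10))
```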

The error metric is not for reporting — it is for money. If 429s or 5xxs increased after the release, clients and background jobs often start repeating calls. The same is true for timeouts: one user request can turn into two or three LLM operations instead of one.

It is better to track cost not only by model, but by scenario as well. If a knowledge-base search suddenly became more expensive, you will immediately see that the problem is in that path, not in the whole service.

Where teams usually go wrong

A token spike rarely appears out of nowhere. More often, the cause stays hidden because the team looks only at the total bill for the day. That number shows that spend increased, but not why: the prompt got longer, calls became more frequent, or one bad scenario started burning tokens in bursts.

Averages often get in the way too. If nine requests are short and the tenth suddenly uses 20,000 tokens, the average still looks acceptable. In reality, those spikes are what create the painful bill after a release. So do not look only at average — check the upper tail too: p95, p99, and the maximum tokens per request.

Another common mistake is mixing test traffic with production traffic. After a release, QA runs scenarios, developers test new prompts, and automated tests may hit the same API key as prod. Then the team looks for the problem in users, even though internal checks burned the tokens. Separate keys, environment tags, and separate dashboards help here.

A long debug prompt in production is more common than people think. It gets added for a couple of hours to capture more context: raw logs, full message history, internal instructions, answer examples. Then the fix ships, but the extra text stays behind. If the system prompt grew from 900 to 3500 tokens, the bill will jump very fast even without traffic growth.
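
Measuring this takes a minute. A sketch using tiktoken, assuming it is installed and that the cl100k_base encoding is a close enough approximation of your model's tokenizer; the file path is illustrative:

```python
import tiktoken

# cl100k_base is only an approximation; the exact tokenizer depends on the model.
enc = tiktoken.get_encoding("cl100k_base")

with open("prompts/system_prompt.txt") as f:  # illustrative path
    system_prompt = f.read()

print(f"system prompt: {len(enc.encode(system_prompt))} tokens")
# Tracking this number per release catches a 900 -> 3500 jump before it
# multiplies across every single request.
```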

Teams also forget max_tokens limits and retries all the time. One bad release can easily start an expensive loop: the model does not answer as expected, the app retries 3-4 times, each time sending the same long context and asking for an even longer response. In logs, that looks like normal instability. On the bill, it is a separate cost line.
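
A small sketch of how to keep both under control, with a placeholder `client.chat(...)` call standing in for whatever SDK method you actually use:

```python
import time

MAX_RETRIES = 2           # fixed, small retry budget
MAX_OUTPUT_TOKENS = 300   # hard per-scenario cap, adjust to the task

def call_llm_with_budget(client, messages):
    """Call the model with a hard output cap and a bounded retry loop."""
    last_error = None
    for attempt in range(1 + MAX_RETRIES):
        try:
            # max_tokens is always set, so a "more detailed" prompt change
            # cannot silently multiply response length
            return client.chat(messages=messages, max_tokens=MAX_OUTPUT_TOKENS)
        except TimeoutError as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # back off instead of hammering
    raise last_error
```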

It helps to split data not by one number, but by several cuts:

  • prompt and completion separately
  • prod and test separately
  • successful responses and retries separately
  • new and old release versions separately
  • spikes by user, feature, or cron job separately

If your logs include the release version, environment, and scenario type, the cause can often be found in 10 minutes. Without that labeling, the team usually argues from intuition and loses half a day.

The most painful mistake is simple: checking spend only the next morning. By then, the bug has already been heating up all traffic. It is better to alert on a sharp rise in tokens per request, on the number of retries, and on unusually long completions right after deployment.

A simple example from a work week

On Tuesday, the team shipped a small release for an internal support assistant. Nothing risky was planned: slightly more detailed answers, a more convenient input field, a few error-handling tweaks. An hour after launch, spend jumped so sharply that it could no longer be explained by a normal daily peak.

At first, they thought it was only because responses were longer. And yes, the average answer did get much larger: instead of short 120-150-token replies, the assistant often reached 400-500. The reason was simple: the prompt now included the instruction “explain in more detail and suggest the next step.” That is annoying on its own, but not catastrophic.

Then they looked at call frequency and found the more expensive bug. After a frontend change, the request was sent not on button press, but almost on every typed character. A person typed a 40-character phrase, and the system managed to send dozens of calls. The user saw one conversation, while the backend handled a whole series of extra requests.

The growth did not stop there. That same release added a new retry layer. If the model answered slower than usual, the app repeated the request, and then an intermediate service did the same. One failure turned into two or three identical calls.

Within one hour, several metrics grew at once:

  • tokens per request
  • requests per user
  • share of repeated calls with the same text

That combination is what creates the worst bill. Not one big prompt, but several small mistakes that amplify each other.

Fixing it took less time than finding the cause. The team returned sending to explicit user action only, removed the extra retry layer, and set a hard response-length limit. After that, traffic returned to normal almost immediately.

What to check before raising an alert

You should raise a signal not after the first spike, but after a quick manual check. Very often the reason is simple: the release added extra text to the request, raised the response limit, or started a new background flow that quietly burns tokens.

Start with a few simple questions. Did the current system prompt get longer than the previous version? Is the model receiving the full chat history instead of the last 6-8 messages? Did max_tokens change? Did a new cron, batch, or background job get turned on? Are there retries after 429, 500, or timeouts?

It helps to watch not only the total tokens, but also the combination of three numbers: tokens per request, requests per minute, and retry share. If only the first number grew, look at prompt and response length. If the second grew, check background processes and new scenarios. If the spike comes with errors, retries are almost always to blame.
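
That decision logic is simple enough to write down, for example as a small triage helper. The thresholds below are illustrative starting points, not universal rules:

```python
def triage(tokens_per_request_ratio, rpm_ratio, retry_share):
    """Suggest where to look first, given post/pre-release ratios.

    Ratios are "after / before" for the same scenario; retry_share is the
    fraction of calls that repeat an earlier one.
    """
    hints = []
    if tokens_per_request_ratio > 1.3:
        hints.append("prompt or response length: system prompt, history, max_tokens")
    if rpm_ratio > 1.3:
        hints.append("call frequency: background jobs, new scenarios, frontend sends")
    if retry_share > 0.1:
        hints.append("retries: timeouts, 429/5xx handling, duplicate queue jobs")
    return hints or ["looks like normal growth; keep watching the tail (p95/p99)"]
```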

In practice, this check takes less time than reviewing the bill after the fact. And it almost always finds the cause before finance sees it.

What to do next

After a release, do not wait for the bill at the end of the month. Put in simple signals that catch the problem on the same day: tokens per request and requests per minute. If either metric goes up, the team can quickly understand where to look — in the prompt, in traffic growth, or in strange behavior from the new version.

It is useful to set not only an alert for total spend, but also an alert for deviation from the normal baseline. If the average response suddenly becomes twice as long, check the prompt template, system instructions, and response format. If response length did not change but the number of requests increased, look at background jobs, repeated frontend sends, and extra calls after a timeout.
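
A baseline alert does not need a monitoring platform to start with; even a scheduled script that compares the last hour to a pre-release window is enough. A sketch, with an illustrative 1.5x threshold:

```python
def check_deviation(current, baseline, name, threshold=1.5):
    """Flag a metric that deviates sharply from its pre-release baseline."""
    if baseline > 0 and current / baseline >= threshold:
        print(f"ALERT: {name} is {current / baseline:.1f}x the baseline")
        return True
    return False

# Typical calls, one per metric and scenario:
# check_deviation(avg_output_now, avg_output_baseline, "output tokens per response")
# check_deviation(rpm_now, rpm_baseline, "requests per minute")
# check_deviation(retry_share_now, retry_share_baseline, "retry share")
```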

Usually four rules are enough:

  • track tokens per request and requests per minute separately
  • set a maximum response length for each scenario
  • fix the allowed number of retries
  • review anomalies after every release with a short checklist

Limits are better set in advance, not after an incident. In many teams, the spike does not come from one big mistake, but from two small ones at the same time: the response grew by 40%, and the client started making an extra retry. Together, that creates a noticeable cost jump within just a few hours.

The release checklist should include a short block: compare average input and output length, call frequency for the main scenarios, retry count, and error share by code. It takes very little time, but it lets you catch anomalies before accounting sees them.

If traffic goes through AI Router, the analysis is usually easier. Audit logs, model spend, and key limits are all visible in one place, so it is easier to see which service caused the spike and where it started. For teams in Kazakhstan, this also removes part of the operational confusion: data stays inside the country, and a suspicious spike can be checked without manually gathering information from multiple systems.

If you make these checks part of every release, a token spike rarely has time to turn into a painful bill.

Frequently asked questions

How can I tell quickly that this is not just normal traffic growth?

Look at tokens per useful action, not just the total volume. If users and requests stayed about the same, but tokens per dialog, document, or screen increased by 1.5x to 2x, that points more to a bug than to normal growth.

What should I check first after a release?

Start with three things: tokens per request, requests per minute, and the share of retries after errors or timeouts. Then compare those numbers with the period before release for the same scenarios and the same models.

How do I tell a long prompt from extra calls?

Split spend into two parts: how many calls increased, and how many tokens each call uses. If input grew, look for a longer system prompt, extra history, or a large context block; if frequency grew, look for retries, background jobs, and extra frontend requests.

What mistakes most often inflate spend?

The usual culprits are detailed responses for all traffic instead of only some users, retries after timeouts, and prompt clutter like logs or raw JSON. Also check whether traffic was routed to a fallback model with longer answers.

Is it enough to look at the overall daily bill?

No, a daily total only shows that spend went up. It is much better to investigate by hour, by release, by API key, by model, and by specific scenario, otherwise you will waste time guessing.

Which metrics should I keep on the dashboard?

Keep input tokens per request, output tokens per response, requests per minute, and the retry rate side by side. It also helps to track p95 response length, because the long tail is often what drives the bill up.

How do I check whether retries caused the higher spend?

Compare the number of timeouts and 429 or 5xx errors with the number of repeated calls in the same period. If one user request often turns into two or three identical calls, the problem is almost certainly retry logic, not audience growth.

What should I do if the model suddenly starts answering too long?

Check max_tokens, the prompt template, and the model the service is calling after the release. If the task does not need long text, restore a hard limit and remove instructions that make the model explain too much.

Which alerts are needed first?

Usually an alert on a sharp rise in tokens per request, a jump in requests per minute, and a spike in retries after errors is enough. These signals should be reviewed on the day of release, not the next morning.

Why is a single gateway like AI Router useful here?

A single gateway gives you logs and stats in one place, so the team can quickly see where tokens increased: in the prompt, in call frequency, or in one problematic integration. If traffic goes through AI Router, it is easier to connect the spike to the model, key, and release time without manually collecting data from different systems.