LLM Limits Between Teams: A Quota Scheme Without Downtime
LLM limits between teams: how to split quotas by product, environment, and time of day so production never stalls and tests and batches don’t eat the shared pool.

Why a shared pool breaks quickly
A shared pool seems convenient as long as traffic is steady and everyone behaves carefully. In real life, that almost never happens. One bad batch, a massive prompt rework, or a nightly test run can take nearly all the token budget in a few minutes.
After that, the familiar conflict begins: prod and tests fight for the same resource. For the gateway, there is no difference between a request from a customer-facing service and one from a test environment where someone is checking a new feature. If you don't separate limits in advance, test activity quickly pushes out production traffic.
At first it looks like a small annoyance. Latency rises somewhere, retries start somewhere else, and then 429s appear. Teams rarely connect these symptoms into one problem, because the failure doesn't look like one big outage. It spreads across different services and hides as separate errors.
The scenario is usually very simple. The search team starts batch processing 200,000 product descriptions. At the same time, the support chat is answering customers, and the engineering team is running load tests on staging. If everyone shares one limit, the noisiest stream wins, not the most important one.
After that, everything turns into manual control. The on-call engineer watches the graphs and starts cutting traffic on the fly: disables tests, lowers request frequency, asks teams to pause tasks. This works badly. Decisions are made in a hurry, without clear rules and without room for the next spike.
Even a single OpenAI-compatible gateway doesn't fix this by itself. AI Router, for example, gives you one compatible endpoint for accessing different models, but the fight for tokens doesn't disappear when the pool is shared. You need boundaries: who can spend how much, in which environment, and at what hours. Otherwise, it becomes first come, first served.
Which axes to split limits by
One company-wide limit almost always gets eaten by the noisiest consumer. That's why quotas are better split along several axes at once. Then the support chat won't lose to a nightly batch run, and prod won't go down because of experiments in dev.
First, list every product that calls the models and assign an owner to each one. This is not bureaucracy; it's a way to quickly understand who is responsible for spend, traffic spikes, and quota changes. If a product is missing from the list or has no owner, it usually starts consuming the limit quietly until something breaks.
Then split traffic by environment. Prod, stage, dev, and sandboxes should not share one lane. Prod should have its own guaranteed limit and the highest priority. Stage can have lower but stable limits so releases go through without a manual fight for resources. Dev and sandboxes are better off with stricter limits, otherwise one bad test can burn through the whole team's budget quickly.
Set rules by time as well. During the day, load is more often driven by interactive scenarios where users expect an immediate answer. At night, batch jobs are easier to run: labeling, archive summarization, mass checks. During release windows, stage and prod get a temporary buffer, while background jobs should be slowed down.
It makes sense to split requests by traffic type too:
- interactive traffic with answers in seconds
- batch jobs without a hard SLA
- internal experiments and tests
- emergency reserve for critical services
In practice, the setup looks simple. Suppose you have a support chat, an internal assistant for employees, and nightly document processing. The chat gets a separate quota in prod during the day and a reserve during peak hours. The employee assistant shares its limit with office activity and can wait a bit. Nightly document processing gets a large limit after 22:00, but during the day it yields to interactive services.
A good order is this: product first, then environment, then time, then traffic type. In that form, quotas last longer and break less often.
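To make that ordering concrete, here is a minimal sketch in plain Python of what a quota record keyed by those four axes might look like. The products, numbers, and the lookup itself are illustrative assumptions, not tied to any particular gateway.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quota:
    product: str           # who owns the spend
    environment: str       # prod / stage / dev
    window: str            # "day" (09:00-20:00) or "night"
    traffic: str           # interactive / batch / experiment
    tokens_per_day: int
    requests_per_minute: int

# Illustrative values: the chat gets a guaranteed daytime slice,
# nightly document processing gets its volume after hours,
# and dev experiments sit under a strict cap.
QUOTAS = [
    Quota("support-chat", "prod", "day", "interactive", 4_000_000, 120),
    Quota("doc-processing", "prod", "night", "batch", 3_000_000, 60),
    Quota("employee-assistant", "prod", "day", "interactive", 1_000_000, 40),
    Quota("experiments", "dev", "day", "experiment", 500_000, 20),
]

def find_quota(product: str, environment: str, window: str) -> Quota | None:
    """Look up the quota for a product in a given environment and time window."""
    for q in QUOTAS:
        if (q.product, q.environment, q.window) == (product, environment, window):
            return q
    return None
```

Even at this size, the structure already answers the key questions: who owns the spend, in which environment, and in which hours.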
Product quotas
It's better to count limits by product, not by people. One internal assistant can survive a queue and some slowdown, but a customer chat cannot. If everyone shares one pool, the noisiest service quickly burns through the budget, and the ones with SLAs suffer.
First, give each product a guaranteed minimum. This is not the desired volume, but the lower bound that covers a normal day. It's convenient to record it in two numbers: tokens per day and the allowed spike per minute. Then the service doesn't go hungry during working hours and doesn't fail from a short burst of requests.
Then set a ceiling. Even successful products need one. Without a ceiling, one launch, a bug in the client, or an unlucky batch process can use up the entire shared budget in an hour. For internal tools, the ceiling should be strict. For customer-facing services, you can leave a small extra buffer, but only within a separate reserve.
At the start, a ratio like this often works: 50-60% of the pool is reserved for products with external users and SLAs, 20-30% goes to internal services, 10-15% is kept in reserve for customer traffic and emergency spikes, and another 5-10% is set aside for pilots and experiments separate from production services.
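A few lines of arithmetic are enough to sanity-check those shares against your pool. The pool size and the exact percentages below are assumptions for illustration only.

```python
pool = 10_000_000  # daily token budget, illustrative

shares = {
    "external products with SLAs": 0.55,
    "internal services": 0.25,
    "reserve for customer traffic and spikes": 0.12,
    "pilots and experiments": 0.08,
}

# The shares must sum to 1.0, otherwise part of the pool is silently
# unallocated or promised twice.
assert abs(sum(shares.values()) - 1.0) < 1e-9

for name, share in shares.items():
    print(f"{name}: {int(pool * share):,} tokens/day")
```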
Pilots should not be mixed with services where downtime immediately hurts sales, support, or operations. A pilot has a different rhythm: little traffic today, then tomorrow the team runs a big evaluation or prepares a demo. If such a project sits in the same pool as prod, the conflict starts very quickly.
These shares are not carved in stone. After a big release, a new channel, or a seasonal peak, the old setup often no longer works. Check actual consumption once a month and again after launches that change traffic. If a product keeps hitting the ceiling, that's a reason not only to raise the quota, but also to check prompts, cache, and unnecessary calls.
Quotas by environment
One of the most common failures starts not in prod, but next to it. If prod, stage, and dev pull tokens from the same pool, tests and experiments can easily eat the budget that production traffic needs.
Prod should live separately. Give it a main limit and a small buffer for spikes: the morning peak, a marketing email, a rise in support requests. The buffer doesn't need to be huge. It's enough to cover a short surge, not every possible record.
Stage usually doesn't need a large volume, but it does need predictability. The team should know that nightly tests, manual pre-release checks, and new prompt runs will definitely fit within the allocated limit. If stage competes with prod every time, release discipline breaks down quickly.
Dev and sandboxes are better kept under a strict ceiling. That's where people most often run long prompts, try new models, and forget to turn off loops. Soft limits hardly work here. A hard limit is more honest: you hit the cap, wait for the next window, or ask for a temporary increase.
On release days, stage almost always needs more than usual. Don't make that a permanent setting. It's better to raise the limit manually for a few hours and then put it back. That way you won't leave extra capacity open for the whole week.
A simple rule is this: prod should survive someone else's mistake, stage should handle releases calmly, and dev should be cheap and manageable. If one environment can drain the whole pool, the scheme still isn't working.
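One way to keep release-day raises from turning into permanent settings is to store them with an expiry, so the limit reverts on its own. A minimal sketch with an in-memory override table; the numbers and the self-expiring override are assumptions, not a feature of any specific gateway.

```python
from datetime import datetime, timedelta

# Base per-environment ceilings (tokens per day), illustrative numbers.
BASE_LIMITS = {"prod": 6_000_000, "stage": 1_500_000, "dev": 500_000}

# Temporary overrides: environment -> (limit, expires_at).
_overrides: dict[str, tuple[int, datetime]] = {}

def raise_limit_temporarily(env: str, limit: int, hours: int) -> None:
    """Grant a higher ceiling that expires on its own, e.g. for a release window."""
    _overrides[env] = (limit, datetime.now() + timedelta(hours=hours))

def current_limit(env: str) -> int:
    """Return the override while it is still valid, otherwise the base limit."""
    override = _overrides.get(env)
    if override and datetime.now() < override[1]:
        return override[0]
    _overrides.pop(env, None)  # expired: clean up and revert automatically
    return BASE_LIMITS[env]

# Release evening: stage gets 3M tokens for 4 hours, then reverts by itself.
raise_limit_temporarily("stage", 3_000_000, hours=4)
print(current_limit("stage"))  # 3000000 until the window closes
```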
Quotas by time of day
The same limit at 11 a.m. and 2 a.m. almost always fails. The load is different, and the cost of delay for the business is different too. That's why quotas should be split not only by service, but also by hour.
During the day, protect everything that works in real time: support chats, knowledge-base search, guidance for agents, internal assistants for employees. If such a service waits in line for tokens because of a background task, users notice it immediately.
Usually a simple rule helps: during business hours, keep a separate capacity reserve for interactive scenarios, and set a lower burst for background jobs. Their daily volume can stay the same, but during peak hours they should not suddenly take over the whole channel.
As a starting point, a setup like this is often enough:
- from 9:00 to 20:00, chats, search, and agent tools have priority
- at night, batches, reindexing, mass document processing, and test runs go first
- in intervals like 11:00-14:00 and 16:00-18:00, non-priority tasks get a tighter burst
- on weekends, a window opens for heavy runs that would have disrupted prod during the day
This mode is especially useful where there are many short requests during the day and long ones at night. For a bank, that might mean customer chat during the day and batch processing of requests at night. For retail, search and an agent assistant run during the day, and on Saturday night the team launches a catalog reindex.
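Such a schedule is easy to encode as a small function that picks limits for a traffic class by hour. The hours, classes, and numbers below are placeholders to adapt, not a recommendation.

```python
def limits_for(traffic: str, hour: int, weekend: bool) -> dict:
    """Return per-minute limits for a traffic class at a given hour (0-23)."""
    business_hours = 9 <= hour < 20 and not weekend
    peak = hour in range(11, 14) or hour in range(16, 18)

    if traffic == "interactive":
        # Protected during the day, still bounded at night.
        return {"rpm": 120 if business_hours else 60}
    if traffic == "batch":
        if business_hours:
            # Allowed to run, but with a tight burst during peak intervals.
            return {"rpm": 10 if peak else 20}
        return {"rpm": 80}  # night and weekends: the wide window
    # experiments and everything else
    return {"rpm": 5 if business_hours else 20}

print(limits_for("batch", hour=12, weekend=False))  # {'rpm': 10}
print(limits_for("batch", hour=2, weekend=False))   # {'rpm': 80}
```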
There's also a classic trap: the end of the month. During this period, reports, reconciliations, exports, and internal checks suddenly increase. If these exceptions are not defined in advance, teams start asking for manual limit increases, and the scheme breaks at the worst possible time.
It's better to decide right away who gets an extended window and for how many hours. Not for everyone, only for tasks with a clear deadline and owner. Then night and weekend quotas stay predictable.
How to build the scheme step by step
Start not with numbers, but with a list of everyone who calls the LLM API at all. Usually that quickly reveals not only production services, but also test environments, internal bots, batch jobs, and manual scripts from analysts. Every consumer should have an owner. Otherwise, when there is a shortage, nobody will be able to quickly decide what to cut first.
Then capture the load profile for at least a week. Look not only at the total volume, but also at peak hours, spike length, and the share of limit errors. One service may create a steady background all day, while another spikes every 15 minutes after a mass mailing or when a queue starts.
First, set the base protection
Once the demand picture is clear, assign each consumer a priority and a guaranteed minimum volume. This protects important services from starvation. If you have a customer chat and nightly batch processing, there isn't much to debate: the chat gets reserve capacity, the batch waits for a free window.
It's useful to keep three priority levels:
- critical for revenue or the customer journey
- an internal working service with a clear SLA
- background or experimental load
For each level, set not just one overall ceiling, but several limits at once. One daily limit is almost useless: a service can burn through it in an hour and stop for the rest of the day. So set limits per minute, per hour, and per day. The per-minute limit cuts sharp spikes, the hourly limit smooths long peaks, and the daily limit protects the budget.
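A minimal sketch of the multi-window idea: one stream of token spend checked against a per-minute, per-hour, and per-day cap, with the strictest window deciding. Fixed caps and a simple in-memory log are assumptions for brevity; a production limiter would normally use sliding windows or token buckets in shared storage.

```python
import time
from collections import deque

# Illustrative caps for one consumer: (window in seconds, max tokens).
WINDOWS = [(60, 20_000), (3_600, 500_000), (86_400, 4_000_000)]

_events: deque[tuple[float, int]] = deque()  # (timestamp, tokens spent)

def try_spend(tokens: int, now: float | None = None) -> bool:
    """Allow the request only if it fits every window; otherwise reject with a 429."""
    now = time.time() if now is None else now
    # Drop events older than the largest window.
    while _events and now - _events[0][0] > WINDOWS[-1][0]:
        _events.popleft()
    for window_s, cap in WINDOWS:
        spent = sum(t for ts, t in _events if now - ts <= window_s)
        if spent + tokens > cap:
            # The minute cap cuts sharp spikes, the hour cap smooths long peaks,
            # the day cap protects the budget.
            return False
    _events.append((now, tokens))
    return True
```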
Check the scheme against old spikes
Don't roll out quotas blindly. Take the 2-3 heaviest days from history and replay them against the new scheme as a dry run. See where prod starts refusing requests, where tests consume too much, and how much reserve is actually left during peak hours.
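The dry run can be as simple as replaying a per-minute usage log through the proposed budgets and counting what would have been rejected. A rough sketch, assuming the log has already been aggregated per service and minute:

```python
def replay(usage_log, proposed_budget):
    """Count how many minutes each service would have been throttled."""
    throttled = {}
    for minute, service, tokens in usage_log:
        if tokens > proposed_budget.get(service, 0):
            throttled[service] = throttled.get(service, 0) + 1
    return throttled

# (minute index, service, tokens) taken from the heaviest historical days.
log = [
    (0, "support-chat", 9_000), (0, "nightly-batch", 40_000),
    (1, "support-chat", 14_000), (1, "nightly-batch", 42_000),
]
print(replay(log, {"support-chat": 12_000, "nightly-batch": 30_000}))
# {'nightly-batch': 2, 'support-chat': 1} -> the batch clearly belongs in a night window
```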
After such a check, people usually adjust three things:
- raise the minimum for services with a strict SLA
- reduce the burst for test and dev environments
- move background processing to nighttime hours
At the end, document the escalation rules. Who gets a temporary limit increase, who approves it, how long it lasts, and who rolls the setting back. Otherwise, every exception quickly becomes the new normal, and quotas turn into a shared uncontrolled pool again.
Example allocation for three services
Let's take a daily budget of 10 million tokens and 200 requests per minute. It's easier to discuss rules with a real example.
| Service | Environment | Time | Base quota | Reallocation rule |
|---|---|---|---|---|
| Support chat | prod | 08:00-22:00 | 4 million tokens per day, 120 rpm | First draws from the shared reserve |
| Analytics batch | prod | 22:00-08:00, weekends | 3 million tokens, 60 rpm at night | Barely runs during the day; gets reserve after the chat |
| R&D sandbox | dev/stage | 10:00-19:00 | 1 million tokens, 20 rpm | Cannot rise above the cap during the day |
| Shared reserve | shared | all day | 2 million tokens | Priority: chat, then batch, then experiments |
During the day, the rules should be strict. Support chat gets a guaranteed pool because any pause is immediately visible to the customer. Even if analytics or tests suddenly grow, they should not eat the chat's daytime budget.
The batch is better moved to night and weekends. For it, a delay of a few hours is usually not as painful as it is for a live conversation with a user. During the day, you can keep a small limit only for urgent tasks so the team doesn't have to wait until night for every fix.
The R&D team works in a sandbox with a daytime cap not because it matters less, but because its load is often bursty. One bad test with a long context can easily blow through the shared budget in an hour. A fixed cap protects prod, and after 22:00 this team can get the free capacity for experiments.
Imagine a normal weekday. At 17:30, the support chat sees a spike in requests. The system first gives reserve to the chat, then cuts the nightly batch if it has already started early. The sandbox gets nothing above its daytime limit at that moment.
After 22:00, the setup changes. If the chat has returned to its usual nighttime level and the batch hasn't used its full volume, the remainder can be opened up for experiments. But the shutdown order must also be strict: first stop the tests, then move the batch, and only at the very end touch the chat.
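The reallocation rule from the table fits into a small function: the reserve is handed out in a fixed priority order, and during the day the sandbox never gets any of it. Names and numbers here are illustrative.

```python
RESERVE_PRIORITY = ["support-chat", "analytics-batch", "rnd-sandbox"]

def allocate_reserve(reserve: int, demands: dict[str, int], hour: int) -> dict[str, int]:
    """Split the shared reserve in priority order; skip the sandbox during the day."""
    daytime = 8 <= hour < 22
    grants = {}
    for name in RESERVE_PRIORITY:
        if daytime and name == "rnd-sandbox":
            grants[name] = 0  # never above its cap while interactive traffic is running
            continue
        take = min(demands.get(name, 0), reserve)
        grants[name] = take
        reserve -= take
    return grants

# The 17:30 spike: the chat is served first, the batch gets what is left,
# the sandbox gets nothing above its daytime limit.
print(allocate_reserve(500_000,
                       {"support-chat": 400_000, "analytics-batch": 300_000,
                        "rnd-sandbox": 100_000},
                       hour=17))
# {'support-chat': 400000, 'analytics-batch': 100000, 'rnd-sandbox': 0}
```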
Where teams most often go wrong
Most often the problem starts not with a lack of total limit, but with poor division. When every service gets the same quota, the setup looks fair only on paper. In reality, support bots, internal search, and nightly analytics use the LLM in very different ways, and an equal approach quickly leaves prod without room to breathe.
The second common mistake is the same limit for prod and dev. That almost always hurts users. In development, people run tests, change prompts, restart scenarios, and can easily burn through the volume that prod needs for live traffic. Dev needs its own ceiling, lower and stricter, especially during business hours.
Problems also start when teams only look at the daily limit. If a service spends half its daily quota in the first hour after release, it will live on leftovers until evening. So it's worth keeping limits not only for a day, but also for short windows: 1 minute, 5 minutes, 1 hour. Then a spike won't eat the entire daily pool.
Another common miss is no reserve. A release, an incident in a neighboring system, or a rise in retries after a timeout quickly changes the picture. If the entire volume is already divided among teams, there is nothing left to save prod with. It's normal to keep a small unallocated buffer for incidents, rollbacks, and days with unusual load.
The scheme is also broken by the lack of control over request retries. A bad release can double traffic not because of users, but because of retries, duplicates, and overly aggressive timeouts. The team looks at the graph and thinks the product is growing, when in reality the system is just hitting the gateway again and again.
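On the client side, the cheapest protection is a bounded retry with backoff and jitter, so a bad release cannot quietly multiply traffic. A generic sketch, not tied to any particular SDK; `send` is a placeholder for your actual client call.

```python
import random
import time

def call_with_retries(send, max_attempts: int = 3):
    """Call send() at most max_attempts times with exponential backoff and jitter.

    send is any zero-argument function that raises on a retryable error,
    such as a 429 or a timeout.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts:
                raise  # give up instead of hammering the gateway
            # Exponential backoff with jitter keeps retries from synchronizing
            # into a second spike right after the first one.
            time.sleep(min(2 ** attempt, 30) + random.uniform(0, 1))
```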
A quick self-check is simple: prod and dev have different limits, there are short windows in addition to the daily one, 10-20% of the volume is not preallocated, and retries, duplicates, and spikes by key are visible in the logs.
Short checklist for a review
It's better to review quotas using one template. Otherwise, they quickly turn into a set of manual exceptions, where nobody remembers why one service gets more and another gets less.
Before each review, check five things:
- each service has one owner
- the priority is written down explicitly, not just agreed in chat
- prod is separated from dev, stage, and tests
- batch jobs live in their own window and don't compete with user traffic during the day
- the reserve exists separately from working quotas and is not used by default
After that, look at 429s not as the problem itself, but as a signal. Sometimes they show that the quota is too small. But just as often, 429s mean the service picked the wrong window, retried too aggressively, or accidentally used someone else's reserve. Fix the cause, not just the limit.
What to do in the next sprint
Don't try to redesign the whole shared pool at once. One critical product and one sandbox are enough for a single sprint. That way you'll see how the scheme behaves under load, and you won't break neighboring services.
A good first candidate is the product that already affects revenue, SLA, or support. Give it a separate limit, and leave a small sandbox nearby for the team's experiments. The sandbox is not for convenience; it's so tests, new prompts, and random runs don't eat the budget of the production service.
Next, separate access not by a shared account, but by separate keys. One key for prod, another for stage, a third for dev or sandbox. If the key is shared, you almost always notice too late who exactly burned the limit and in which environment it happened.
For one sprint, this plan is usually enough:
- choose the service with the strictest SLA
- give it separate keys by environment
- set limits for prod and a lower ceiling for the sandbox
- turn on simple alerts as you approach the threshold
After setup, test the scheme not on a quiet day, but in two unpleasant modes. The first is a release day, when the team is running smoke and regression tests and users are already going to prod. The second is a load peak, for example in the morning on weekdays or during batch processing. If the priority service does not dip in those two scenarios, you have the right base.
It's useful to write down one simple rule in advance: which traffic gets cut first when the limit is getting close. Usually, dev is cut first, then stage, and only then less important production tasks like internal assistants.
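That cut order, plus the alert threshold from the sprint plan, also fits into a few lines. The thresholds and the order itself are assumptions to adapt, not a fixed rule.

```python
# Traffic classes in the order they get cut as the shared limit fills up.
SHED_ORDER = ["dev", "stage", "internal-assistants", "prod-critical"]

def classes_to_cut(used: int, limit: int) -> list[str]:
    """Return which classes to throttle, starting with the cheapest to lose."""
    usage = used / limit
    if usage >= 0.95:
        return SHED_ORDER[:3]  # everything except critical prod traffic
    if usage >= 0.85:
        return SHED_ORDER[:2]  # dev and stage
    if usage >= 0.75:
        return SHED_ORDER[:1]  # dev only; this is also a good point to alert the owner
    return []

print(classes_to_cut(used=8_800_000, limit=10_000_000))  # ['dev', 'stage']
```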
If you have many teams, providers, and keys, AI Router is handy as a single control layer: rate limits at the key level and audit logs help you quickly see who used the quota and where the imbalance started. It doesn't replace quota policy, but it makes enforcement much easier.
By the end of the sprint, you don't need a perfect scheme. You need a working version where one important service is already protected from other people's experiments and where it's clear what to change in the next cycle.