Jul 23, 2025 · 8 min read

Controlled Failures in LLM Infrastructure Before Peak

Controlled failures in LLM infrastructure help uncover weak spots before peak demand. We’ll walk through gateway, provider, queue, and retriever checks.


What breaks before the traffic peak

A quiet day almost always fools you. Traffic is steady, requests look similar, and rare errors get lost in the noise. The system looks stable, even though some problems simply do not have time to show up.

At peak, everything changes at once. Users arrive in waves, prompts get longer, parallel calls increase, and response times start jumping even in places that were quiet before. A 300 ms delay, which seemed minor in testing, can easily turn into a queue that lasts minutes under real load.

The first things to fail are usually not the most visible parts. Most often the problem starts in one of four places:

  • the gateway holds connections longer and accumulates timeouts
  • the provider starts enforcing limits or becomes unstable
  • the queue grows because of repeated retries
  • the retriever gets slower at finding context and returns worse results

One failure rarely stays alone. If the provider slows down, the gateway waits longer. If the gateway waits longer, clients and workers start retrying. Retries clog the queue with requests that no longer matter. Then the retriever gets more simultaneous calls, latency grows, and the model receives context later or not at all.

Usually it starts boring and ends expensive. Before a sale, the team notices a small rise in errors from one LLM provider. It enables a backup route, but does not check how that route behaves on long requests. The new route responds more slowly, the queue swells, and clients resend requests. Fifteen minutes later, the problem is no longer with the provider, but with the whole chain.

That is why controlled failures are not about a pretty report. They show where the system breaks first, what drags everything else down, and where you actually have no margin. If the team works through a single gateway, it may survive the loss of one provider. But it will not save you if queues, timeouts, and limits were set blindly.

Where to start

Start not with a huge stress test, but with 3–5 scenarios where the failure would be noticed immediately by a user or support team. The point is not to test an abstract stack, but the live request path: from the client to the model response and back.

Usually this is enough: chat in the personal account, dialog summarization for an operator, knowledge base search, text moderation, and response generation in a CRM or help desk. Do not try to test everything at once. If the team attempts ten flows, the test will spread out and the discussion afterward will get lost in details. Five scenarios are enough to expose weak spots before peak.

For each scenario, record two numbers in advance: how long the user is willing to wait and what percentage of errors you can tolerate without visible damage. The wording should be direct. Internal document search may be fine with a 6–8 second delay, while a chat suggestion for an operator is not. If an answer takes more than 3 seconds, the operator simply moves on without it.

Next, assign an owner to each part of the chain. The gateway, routing, queue, retriever, and provider should each have one specific person watching metrics and making decisions during the test. When responsibility is shared, nobody knows who cuts traffic, who switches to another provider, and who stops retries when a failure happens.

A good test looks simple. It has a launch window, a time limit, and a clear rollback plan. The team decides in advance what to do if the queue grows, the provider returns 5xx errors, the retriever slows down, or the gateway hits a rate limit. Rollback should also be simple: restore the old route, reduce load, disable the heavy scenario with a flag, or move some requests to a faster model.

If you use a single OpenAI-compatible gateway, check this before the test: which routes are enabled, where key-level limits are set, who can see audit logs, and who can quickly switch providers without changing code. In a setup like AI Router, this is especially convenient: the team changes only the base_url to api.airouter.kz and keeps using the same SDK, code, and prompts. Before peak, that saves not hours, but sometimes the team’s entire evening.
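For reference, here is what that single change looks like with the standard OpenAI Python SDK. A minimal sketch: the /v1 path and the model name are assumptions for illustration, not documented values.

```python
from openai import OpenAI

# Minimal sketch: same SDK, same prompts, only the endpoint changes.
# The /v1 path and the model name are assumptions for illustration.
client = OpenAI(
    base_url="https://api.airouter.kz/v1",
    api_key="GATEWAY_API_KEY",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(resp.choices[0].message.content)
```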

How to test the gateway and routing

You should not test the failure itself, but the behavior of the whole chain. The gateway should switch traffic quickly and predictably. If the team has to edit the config manually after disabling one route, the test has already failed.

First, disable one working route completely. This could be one provider or one model in the routing rules. Send the same request set and see where the traffic went, how latency changed, and whether the response shape changed for the app. If you have an OpenAI-compatible gateway, do not touch the client code during this check. Otherwise you are testing the manual workaround, not the routing.

A full outage happens less often than a slow failure, so also simulate growing latency for a subset of models. For example, increase latency from 2–3 seconds to 12–15 seconds for 20% of requests. Then check whether the gateway waits too long, triggers extra retries, and clogs the queue with requests that no longer make sense.
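If you do not have proxy-level fault injection, a client-side wrapper is enough for a first run. A minimal sketch in Python; the share of slow calls and the delay range mirror the numbers above and are otherwise arbitrary:

```python
import random
import time

def with_injected_latency(call, *args, slow_share=0.2, extra_s=(10.0, 13.0), **kwargs):
    """Delay roughly `slow_share` of calls by 10-13 s before forwarding them.

    On top of a normal 2-3 s response this lands in the 12-15 s range
    described above. A client-side stand-in for proxy-level fault injection.
    """
    if random.random() < slow_share:
        time.sleep(random.uniform(*extra_s))
    return call(*args, **kwargs)

# Usage: with_injected_latency(client.chat.completions.create,
#                              model="gpt-4o-mini", messages=[...])
```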

During the run, do not look at just one chart. You need several metrics at once: how many requests moved to the fallback route, how p95 and p99 changed, whether retries and cancellations increased, whether request cost jumped after the switch, and whether response quality dropped in typical scenarios.

Also check key-level limits separately. One noisy client or internal service should not consume all available throughput. Give one key a load above the limit and make sure the gateway cuts only that flow, not all traffic. If you have several teams or products on the same gateway, this shows up very quickly.
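A rough way to run this check: push one key well past its limit while a second key sends light traffic, then compare status codes per key. A sketch with hypothetical key names; the endpoint path is an assumption:

```python
import collections
import concurrent.futures
import requests

URL = "https://api.airouter.kz/v1/chat/completions"  # endpoint path is an assumption
KEYS = {"noisy": "KEY_OVER_LIMIT", "quiet": "KEY_NORMAL"}  # hypothetical keys

def fire(key_name):
    r = requests.post(
        URL,
        headers={"Authorization": f"Bearer {KEYS[key_name]}"},
        json={"model": "gpt-4o-mini",
              "messages": [{"role": "user", "content": "ping"}]},
        timeout=30,
    )
    return key_name, r.status_code

# A burst from the noisy key plus steady traffic from the quiet one.
jobs = ["noisy"] * 200 + ["quiet"] * 20
counts = collections.Counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for name, status in pool.map(fire, jobs):
        counts[(name, status)] += 1

# Expect 429s only for "noisy"; any ("quiet", 429) means the limit cut all traffic.
print(counts)
```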

Logs and tracing should answer one simple question: what happened to a specific request in 30 seconds of searching? A request should have a trace ID, the selected route, provider, model, retry count, reason for switching, and reason for failure. If the company must mask PII or apply content labels, those events should also be visible in the chain. Compare records in the app, gateway, and queue. If the trace ID does not pass through the whole system, incident analysis turns into guesswork.
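One low-effort way to verify this before the real test is to stamp a request with your own trace ID and grep each layer's logs for it afterward. A sketch; the header name is an assumption, since gateways differ in which headers they accept or generate:

```python
import uuid
from openai import OpenAI

client = OpenAI(base_url="https://api.airouter.kz/v1", api_key="GATEWAY_API_KEY")

trace_id = str(uuid.uuid4())
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "ping"}],
    extra_headers={"X-Request-ID": trace_id},  # header name is an assumption
)
# Log trace_id in the app, then search the gateway and queue logs for it.
# If it does not show up in all three places, fix that before the test.
print(trace_id, resp.id)
```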

A good result looks ordinary. One route fails, some models slow down, one key hits a limit, and the service still stays alive while the logs show a clear picture.

What to check with the provider

Before peak, a provider rarely goes down all at once. More often latency increases, 429s become more frequent, and long responses get cut off or take much longer than usual. That is why a set of short checks for the provider’s weak spots is more useful than one broad stress test.

The first test is fallback to a second provider. It should trigger not only on a full outage, but also on partial problems: a series of 429s, a spike in 5xx errors, or a sharp increase in response time. If the main provider slows down for 40 seconds and the backup responds in 8, the system should switch on its own, without manual intervention and without breaking the response format.
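Ideally the gateway handles this switch itself. If yours does not, the trigger logic is worth prototyping client-side first, if only to agree on thresholds. A minimal sketch; the failure streak limit and the slow-response threshold are illustrative:

```python
import time
import openai

SLOW_S = 8.0      # a response slower than this counts as a failure
FAIL_LIMIT = 3    # consecutive failures before traffic moves to the backup
_streak = 0

def call_with_fallback(primary, backup, **req):
    """Route to the backup after a streak of 429s, 5xx, timeouts, or slow answers."""
    global _streak
    client = backup if _streak >= FAIL_LIMIT else primary
    start = time.monotonic()
    try:
        resp = client.chat.completions.create(**req)
    except (openai.RateLimitError, openai.InternalServerError, openai.APITimeoutError):
        _streak += 1
        return backup.chat.completions.create(**req)
    if client is primary:
        _streak = _streak + 1 if time.monotonic() - start > SLOW_S else 0
    return resp
```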

A minimal set of scenarios is simple: the main provider returns 429 on every third request, a long generation runs 3–4 times slower than normal, streaming cuts off in the middle, the backup provider returns the same meaning with a different field schema, and the main quota runs out in the middle of the test.

What surprises teams most often are not the errors themselves, but the limits nobody remembered to check. Compare input and output token pricing, per-minute limits, concurrent request limits, and daily quotas. A common story: the test passes at low volume, but real traffic hits the account-level rate limit rather than any model limit.

Also check SDK compatibility when changing the base_url. If you use an OpenAI-compatible gateway, the code should not need changes just because the API address is new. But that needs to be confirmed with a test, not assumed. Try a regular completion, streaming, tool calling, and structured output if you have it. Sometimes the basic call works, but streaming or error handling breaks only in production.
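A smoke test for those four paths fits in one short script. A sketch against an OpenAI-compatible endpoint; the model name and the tool schema are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.airouter.kz/v1", api_key="GATEWAY_API_KEY")
MODEL = "gpt-4o-mini"  # illustrative; use a model you actually route

# 1. Plain completion.
r = client.chat.completions.create(
    model=MODEL, messages=[{"role": "user", "content": "Reply with OK."}])
print(r.choices[0].message.content)

# 2. Streaming: confirm chunks actually arrive, not just that the call succeeds.
stream = client.chat.completions.create(
    model=MODEL, messages=[{"role": "user", "content": "Count to 5."}], stream=True)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
print()

# 3. Tool calling: expect a tool_calls entry, not plain text.
r = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is the weather in Almaty?"}],
    tools=[{"type": "function", "function": {
        "name": "get_weather",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}}}}}])
print(r.choices[0].message.tool_calls)

# 4. Structured output via JSON mode.
r = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": 'Return {"status": "ok"} as JSON.'}],
    response_format={"type": "json_object"})
print(r.choices[0].message.content)
```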

After this run, the team should know three things for sure: who will take traffic on 429s, how much it will cost, and which client features survive the provider switch without surprises.

Queues, timeouts, and retries


Peak traffic usually does not break a system in one place. The failure often starts small: the queue grows by 200–300 tasks a minute, one worker hangs, retries inflate the load, and users see the same answer twice. That is why it is better to start checking not with the model, but with the request path after acceptance.

First, hit the system with a sudden spike and watch not only average latency, but also queue length by the second. If the queue grows faster than workers can process it, there is already a problem even if errors are not visible yet. Sometimes 5–10 minutes of this test reveal more than an hour of calm normal traffic.

Retries are even more dangerous. A retry should save a request, not create duplicates. If the client, gateway, and worker all retry the same request independently, you can easily end up with three provider calls instead of one. For tasks that write results, you need an idempotency key or at least a clear check for whether the job was handled before.
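A minimal sketch of that check, assuming a shared Redis instance as the claim store; any store with an atomic set-if-absent works the same way:

```python
import hashlib
import json
import redis  # hypothetical choice; any store with atomic set-if-absent works

r = redis.Redis()

def handle_once(job: dict, process) -> bool:
    """Process a job only if its idempotency key has not been claimed yet."""
    # Derive a stable key from the payload if the producer did not set one.
    key = job.get("idempotency_key") or hashlib.sha256(
        json.dumps(job, sort_keys=True).encode()).hexdigest()
    # SET NX is atomic: only the first worker to claim the key proceeds.
    if not r.set(f"job:{key}", "claimed", nx=True, ex=24 * 3600):
        return False  # duplicate: another retry already handled or is handling it
    process(job)
    return True
```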

Then compare timeouts across the whole chain: the client, the LLM gateway, the queue and worker, the retriever, and the provider. If the client times out after 15 seconds, the gateway after 30, and the worker after 60, the request is already dead for the user, but the backend is still burning resources. In systems with a single OpenAI-compatible gateway where requests go to different providers, this imbalance is common. The team changes only the base_url, while the old limits and timeouts remain from the previous setup.
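A cheap guard is to keep the chain's timeout budgets in one place and assert that every inner layer gives up before the layer that called it stops waiting. A sketch with illustrative numbers:

```python
# One source of truth for the chain's budgets (numbers are illustrative).
# Budgets must shrink as a request travels deeper: every inner layer
# should give up before the layer that called it stops waiting.
TIMEOUTS_S = {
    "client": 15.0,
    "gateway": 12.0,
    "worker": 10.0,
    "provider": 8.0,
    "retriever": 3.0,  # called by the worker, so it must fit inside the worker budget
}

CHAIN = ["client", "gateway", "worker", "provider"]

def check_budgets() -> None:
    for outer, inner in zip(CHAIN, CHAIN[1:]):
        assert TIMEOUTS_S[inner] <= TIMEOUTS_S[outer], (
            f"{inner} ({TIMEOUTS_S[inner]}s) outlives {outer} "
            f"({TIMEOUTS_S[outer]}s): the backend keeps burning resources "
            "after the caller has already given up"
        )
    assert TIMEOUTS_S["retriever"] <= TIMEOUTS_S["worker"]

check_budgets()
```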

There is also a very simple test: stop one worker manually during load. Watch how quickly the others pick up tasks, whether the queue starts growing without stopping, and whether repeated processing appears. The good sign to look for: the system slows down but does not enter a retry loop and does not lose tasks.

A small example. Before a sale, an online store handles normal traffic fine, but fails when generating product cards because the retriever responds 2 seconds slower and the queue is already full. One extra retry at each step quickly turns 5,000 requests into 9,000. That is what you want to catch in a test, not on the day of the peak.

The retriever under real load

The retriever often fails more quietly than anything else. The gateway responds, the model is technically alive, the queue is moving, but the final answer is already worse because search returned the wrong documents or nothing at all. That is more dangerous than an obvious error: the team sees a normal status, while the user gets a weak answer.

Start with bad queries. Not only cleanly written ones, but also empty strings, broken phrases, typos, duplicates, overly long queries, and noisy prompts. That is how people write when they are in a hurry. If the retriever returns random documents for that kind of input, the model will confidently fill in the gaps.

A good check is simple: mix normal queries with noisy ones in at least one load series. Then look not only at latency, but also at the share of empty results, the number of irrelevant documents, and how often the answer falls back to generic wording instead of a precise response.

Delays and degradation

Under load, search rarely fails all at once. First latency grows in the vector database, then the cache stops helping, then reads hit limits, and only after that does quality drop. So it is useful to add artificial delays of 100, 300, and 800 ms and see what the app does: wait, reduce top-k, use stale cache, or go to the model without context.
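A sketch of that delay-and-degrade loop; the retriever function, the budget, and the fallback ladder are all illustrative:

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def retrieve_degraded(query, retriever, budget_s=0.3, injected_delay_s=0.0):
    """Retrieve under a latency budget and degrade instead of waiting forever."""
    def _call():
        time.sleep(injected_delay_s)       # the 100 / 300 / 800 ms test steps
        return retriever(query, top_k=8)   # hypothetical search function
    future = _pool.submit(_call)
    try:
        return future.result(timeout=budget_s)          # fast path: full context
    except concurrent.futures.TimeoutError:
        pass
    try:
        return future.result(timeout=2 * budget_s)[:3]  # slow: shrink the context
    except concurrent.futures.TimeoutError:
        return []  # far over budget: go to the model without context

# Run the same query set at 0.1, 0.3, and 0.8 s of injected delay and
# record which branch fires, not just the latency numbers.
```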

Routing between models does not solve this on its own. If the retriever returns weak context, the user sees a poor result, not the reason. So check caching, deduplication, read limits, and how the system behaves when search returns zero documents.

When quality drops

The most common mistake in these tests is that the team measures only speed. That is not enough. Pick a set of questions with a known good answer in advance and compare the result before and after load. For basic control, four metrics are enough: the share of answers with the correct document in top-3, the share of empty results, average latency, and the number of cases where the model answered without a source.
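A before-and-after comparison on a fixed question set can be this small. A sketch; the golden set format and the retriever and answer functions are stand-ins for your own:

```python
import statistics
import time

def eval_run(golden, retriever, answer):
    """golden: list of {"question": ..., "doc_id": ...} with a known good source."""
    top3 = empty = no_source = 0
    latencies = []
    for case in golden:
        start = time.monotonic()
        docs = retriever(case["question"], top_k=3)  # hypothetical search call
        reply = answer(case["question"], docs)       # hypothetical generation call
        latencies.append(time.monotonic() - start)
        empty += not docs
        top3 += case["doc_id"] in [d["id"] for d in docs]
        no_source += not reply.get("sources")        # assumed reply shape
    n = len(golden)
    return {
        "top3_hit_rate": top3 / n,
        "empty_share": empty / n,
        "avg_latency_s": statistics.mean(latencies),
        "no_source_answers": no_source,
    }

# Run once at idle and once under load, then diff the two result dicts.
```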

A small example: before a seasonal sale, a retailer checks search across delivery and returns documents. At 50 RPS, everything looks fine. At 200 RPS, the database responds 400 ms slower, cache misses happen more often, and deduplication lets through similar cards. Response time increased by only one second, but accuracy dropped noticeably. That is exactly the kind of shift you want to catch early.

An example before a sale


The day before a major promotion, a support chat usually gets 4–6 times more questions. People ask about payment, delivery, returns, and stuck orders. If the LLM-powered bot responds slowly for even 10 minutes, operators quickly switch to manual mode, and the queue starts growing almost without pause.

Such a team often has a main model that handles complex questions well, but during peak hours it starts responding much more slowly. Instead of the usual 2–3 seconds, the answer comes back in 8–12. For the user, that is already a failure, even if the model is technically available.

A week before the sale

The team does not wait for a total outage. It sets thresholds in advance: at what latency the gateway should move part of the traffic to a faster model, at what queue size the bot should answer more briefly, and at what knowledge base search time the retriever should be skipped entirely in favor of a safe template response.
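Writing those thresholds down as data, not tribal knowledge, makes the test repeatable. A sketch; every number here is an example, not a recommendation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DegradationPolicy:
    # Every number is an example, not a recommendation.
    slow_p95_s: float = 6.0     # above this, new chats go to a faster model
    brief_queue_len: int = 800  # above this, the bot answers more briefly
    skip_search_s: float = 1.5  # above this, skip the retriever, use a template

def choose_actions(p95_s: float, queue_len: int, search_s: float,
                   policy: DegradationPolicy = DegradationPolicy()) -> list[str]:
    actions = []
    if p95_s > policy.slow_p95_s:
        actions.append("route_new_chats_to_fast_model")
    if queue_len > policy.brief_queue_len:
        actions.append("short_answers_only")
    if search_s > policy.skip_search_s:
        actions.append("skip_retriever_use_template")
    return actions
```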

If the company uses an OpenAI-compatible gateway, that kind of test is easier to run ahead of time. For example, in AI Router you can keep the same SDK and client code, then test how the system behaves when the provider and route change. That is useful for a very simple reason: during a sale, nobody wants to rewrite an integration on the fly.

On sale day

Picture the usual situation. The main model slowed down, the queue grew from 200 to 1,800 requests, and knowledge base search that took 300 ms now takes almost 2 seconds. If the rules were not defined ahead of time, users will get a mix of timeouts, empty replies, and strange repeats.

If the rules are in place, the system behaves predictably:

  • new conversations go to the backup model after a set delay
  • long responses are cut down to short instructions
  • the retriever is disabled for simple intents like "where is my order"
  • retries stop after a fixed limit

This mode does not make the service perfect. But it keeps the flow moving and prevents one bottleneck from taking down the entire setup. For sale day, that is often enough: the user still gets an answer, the operator sees a clear reason for degradation, and the team understands which threshold was triggered too late.

Mistakes in the tests

These checks can easily create a false sense of readiness. The team simulates one neat failure, for example by disabling a provider, sees traffic switch to the backup, and considers the test successful. In peak traffic, almost nothing breaks in isolation. The provider gets slower, the queue grows, some requests time out, and the retriever returns empty context.

Single tests are useful only at the beginning. After that, you need to test failure combinations. If the gateway survives the loss of one route but starts faltering when latency rises and partial errors appear, the problem is still there.

Average latency often lies, too. p50 may look fine while p95 and p99 have already drifted so far that users close the window or press "send" again. For LLMs, not only the total response time matters, but also time to first token. If the interface stays silent for the first 8–10 seconds, people think the service is broken, even if the answer eventually arrives.

Another issue is retry storms without protection. When every layer retries on its own, load grows faster than the original traffic. One slow provider can quickly turn into a flood of duplicates, a clogged queue, and cascading timeouts.

Before the test, record at least three limits: how many retries are allowed per request, when the system stops calling the problem provider, and what queue limit triggers traffic cutting or a simpler scenario.
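The second limit is essentially a circuit breaker. If you do not have one at the gateway level, a minimal client-side version looks like this; the failure limit and cooldown are placeholders:

```python
import time

class ProviderBreaker:
    """Stop calling a provider after a failure streak; retry after a cooldown.

    fail_limit and cooldown_s are placeholders; record your real numbers
    before the test, not during the incident.
    """
    def __init__(self, fail_limit=5, cooldown_s=60.0):
        self.fail_limit, self.cooldown_s = fail_limit, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # half-open: try once more
            return True
        return False

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.fail_limit:
            self.opened_at = time.monotonic()
```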

The most underrated mistake is not visible in charts, but on the user’s screen. Teams check metrics, but they do not look at what the person actually sees during degradation. An empty reply box, an endless spinner, and an error without explanation hurt more than an honest message about a delay. It is better to decide in advance what your fallback mode is: a short response without the retriever, a switch to another model, or saving the request to a queue with a clear status.

If you do not have a screenshot of the user flow, a graph of tail latency, and a retry log for each layer after the test, then the run was only half done.

Short pre-season checklist


The day before peak, the team should be able to confirm five simple statements without debate. If any of them does not clearly hold, the failure test will quickly turn into an ordinary incident.

  • Switching between providers works in a real test, not only on paper.
  • The queue grows predictably and does not build up thousands of tasks before a full jam.
  • The retriever can handle real volume, not just a pretty lab scenario.
  • Logs collect one request story from entry to model response.
  • Rollback has a specific owner who makes the decision.

Each point should have simple checks behind it. Disable the main route and make sure the gateway sends requests to the backup without code changes and without a sharp rise in errors. Check the queue length limit, task lifetime, and the rule for when the system cuts traffic. Look at search latency, the share of empty responses, and result quality at peak request volume, especially if the database updates during the day.

Logs are worth checking by hand too. For one request ID, the team should see the route through the gateway, provider selection, retries, timeouts, retriever access, and the reason for failure. For teams in Kazakhstan, this is also an audit question if they need to show where the failure happened and what data the system processed.

A good sign is when this list can be handled not only by SRE or an ML engineer, but also by the on-call shift. Then, during load, nobody spends 20 minutes looking for the responsible person or arguing whether the spike is temporary.

If you use a single OpenAI-compatible gateway, check one more detail: whether SDKs, timeouts, and error codes behave the same way after switching providers. It is often these details that break rollback, even when the infrastructure is technically ready.

What to do next

Do not put off the full run until the end of the month. One short test this week is more useful than one large perfect run that gets postponed again. Often 30–40 minutes is enough if the team already knows the scenario: gateway failure, slow provider, growing queue, and an empty retriever result.

After the run, record not general conclusions, but thresholds and actions. At what latency do you switch routes? How many 5xx responses in a row do you wait for before removing a provider from rotation? When do you disable an expensive model and move traffic to the fallback option? These numbers are best written down in one place and tied to a role right away: who watches the dashboard, who changes limits, and who informs the product team.

A useful rule is simple:

  • if a step repeats in every run, it should be automated
  • if a decision depends on one on-call person, you need a clear runbook
  • if the team argues about a threshold during an incident, the threshold is not defined yet
  • if switching takes more than 5 minutes, the setup is too manual

Teams often leave small things manual that later eat up time: enabling the backup provider, changing the route for long requests, lowering the rate limit for one client, or turning off a heavy retriever. These actions are better moved into flags, routing policies, and ready-made alert templates.

If you plan to run these tests regularly, it is also worth looking at the entry point itself. Sometimes the problem is not the model, but the fact that you have too many separate integrations, different limits, and manual rules for each provider. For teams in Kazakhstan that care about keeping data inside the country, masking PII, audit logs, and key-level limits, it may make sense to look at a single gateway like AI Router on airouter.kz. The point is not to switch tools for the sake of switching, but to reduce manual handoffs and make system behavior predictable before peak.

If the next test leaves you with fewer manual actions and fewer disputed decisions, you are moving in the right direction.

Frequently asked questions

Where should you start when preparing for controlled failures?

Start with 3–5 scenarios where a person will notice the failure right away: chat, knowledge base search, operator summarization, moderation. Do not try to cover the whole stack at once.

Set the waiting threshold and acceptable error rate for each scenario right away. Then assign one owner for the gateway, queue, retriever, and provider so nobody argues about who makes the call during the test.

Which metrics should you watch during the test?

Do not look only at average response time. Problems show up fastest in p95, p99, time to first token, queue length by second, retry count, and cancellations.

It also helps to see how much traffic moved to the fallback route, whether request cost increased, and whether the retriever started returning more empty answers. One smooth line almost always hides tail latency.

How do you know routing is really working?

Disable one working route completely and send the same request set again. Do not touch the client code, otherwise you are testing the manual workaround, not the routing itself.

Then compare where the traffic went, how latency changed, and whether the response format broke. If the app behaves differently after the switch, the route only works on paper.

When should you enable a backup provider?

Do not wait for a full outage. Switching should also kick in for a series of 429 responses, a chain of 5xx errors, or a clear jump in response time.

Set the thresholds ahead of time. For example, if the main provider trips its rate limit several times in a row, the gateway should move part of the traffic to the backup option automatically. After that, the team should know how much it costs and which client features will survive the switch without surprises.

Why do timeouts often break systems worse than the outage itself?

Because mismatched timeouts keep burning resources even after the user has already left. The client may wait 15 seconds, while the worker keeps going for another minute, and all that time the system is still holding the queue and connections.

Align the timeouts across the chain: client, gateway, queue, worker, retriever, provider. Then the request either finishes on time or quickly frees up space for the next one.

How do you avoid a retry storm?

Give retry rights to one layer, not to all of them at once. If the client, gateway, and worker all retry the same request independently, load can triple very quickly.

Limit the number of attempts, add a pause between them, and use an idempotency key for jobs that write results. When the queue grows or the provider is clearly struggling, stop retries before they clog the whole system.

How should you test the retriever under real load?

Run more than just clean requests. Mix in empty strings, typos, broken phrases, duplicates, and overly long inputs, because that is how people really write when they are in a hurry.

During the load test, do not measure only latency. Watch the share of empty results, the number of irrelevant documents in top-3, and cases where the model answers without relying on a source. It also helps to add 100, 300, and 800 ms of artificial delay to the retriever and see what the app does in each case.

What should be in the logs for fast incident analysis?

With one request, you should be able to see the full story in half a minute. For that, the logs need a trace ID, the selected route, provider, model, retry count, timeout or switch reason, and retriever access.

If the system masks PII or applies content labels, keep those events nearby too. When the trace ID gets lost between the app, gateway, and queue, incident analysis turns into guesswork.

How should you prepare a rollback before peak season?

A proper rollback should be short and clear. The team decides in advance who starts it, within what time window, and at which threshold.

Usually simple actions are enough: restore the previous route, reduce traffic, disable a heavy scenario with a flag, or move part of the requests to a faster model. If rollback takes more than a few minutes and people need to edit code on the fly, the plan is too manual.

Do you need a single OpenAI-compatible gateway before the high season?

Yes, if it removes manual switching and does not break the current integration. Before a peak, that is especially useful: the team changes the base_url, while the SDK, code, and prompts stay the same.

But do not take that on faith. Test the regular call, streaming, tool calling, structured output, key-level limits, and error codes after changing the provider. In a setup like AI Router, this kind of test helps you quickly see where you are really ready for peak traffic and where you are only hoping for the best.