Sep 26, 2024 · 8 min read

LLM Service Load Testing: Peak, Queues, Bottlenecks

Load testing an LLM service helps you find where queues grow, what breaks under peak load, and where the bottleneck sits in the API, network, and retries.


What breaks at peak load

At peak load, it is not only the model response time that grows. A request often slows down before it even reaches the model. The gateway checks the key, applies rate limiting, masks PII, writes an audit log, chooses a route to the model, and only then sends the request onward. If any of these steps slows down by 50–100 ms, total latency spreads out fast.

Average latency is almost useless in this situation. It smooths out spikes and hides the queue. On the chart, everything looks calm: 2.3 seconds on average. But some users are already waiting 8–12 seconds or getting a timeout. That is why load testing looks not only at average, but also at p95, p99, and the timeout rate.
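
As a sketch of what that per-run summary can look like (the latency samples below are invented, and a timeout is marked with None):

```python
# Percentiles and timeout rate instead of a single average.
samples = [2.1, 2.4, 1.9, 2.2, 8.5, 2.3, 11.7, 2.0, 2.2, None, 2.1, 9.4]

completed = sorted(s for s in samples if s is not None)
timeout_rate = (len(samples) - len(completed)) / len(samples)

def percentile(values, p):
    """Nearest-rank percentile: small, dependency-free, good enough here."""
    k = max(0, min(len(values) - 1, round(p / 100 * len(values)) - 1))
    return values[k]

print(f"avg : {sum(completed) / len(completed):.1f}s")  # hides the tail
print(f"p95 : {percentile(completed, 95):.1f}s")        # shows the tail
print(f"p99 : {percentile(completed, 99):.1f}s")
print(f"timeouts: {timeout_rate:.0%}")
```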

The queue rarely lives in just one place. It builds up in the API gateway, in the database connection pool, in the log write queue, and even in post-processing workers. Often the stage everyone thought was secondary is the one that hits the limit first. While the audit queue or content labels grow, processes hold memory, connections stay busy, and in a few minutes it starts to look like the model is failing, even though the model itself may still be responding at almost the same pace.

Retries make things worse very quickly. One timeout triggers a retry, then another, and traffic can grow in a couple of minutes not by 10%, but by almost double. If the client, SDK, and gateway all retry independently, the system amplifies the peak on its own. From the outside it looks simple: latency rises, then the error rate rises, and then the queue does not clear even after the incoming traffic drops.
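
If you control one of those retry layers, a capped, jittered policy keeps it from amplifying the peak. A minimal sketch, with call standing in for the actual request function:

```python
import random
import time

MAX_RETRIES = 1  # one small retry budget for the whole path;
                 # if the SDK already retries, set this to 0

def call_with_backoff(call):
    """Retry at most MAX_RETRIES times, with full-jitter backoff."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == MAX_RETRIES:
                raise  # give up instead of adding to the storm
            # full jitter: sleep a random time up to the backoff cap
            time.sleep(random.uniform(0, 0.5 * 2 ** attempt))
```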

Usually one slow stage holds up the whole request. For example, the model keeps generating tokens at a steady pace, but the service waits too long for a free worker to mask PII or write the audit record to local storage. The user sees a slow LLM API, while the bottleneck is in the surrounding stack, not the model.

At peak load, it is not the most expensive component that breaks first, but the narrowest one. That is why a good run looks for the moment when the queue starts growing faster than the system can drain it.

Which scenarios are worth running

For load testing, one "average" request is not enough. At peak load, the system fails where you did not expect it: short chats may pass fine, while long responses fill the queue. Streaming may hold up, while tool calling sharply increases the number of backend operations.

A good run mirrors real traffic, not a neat lab case. If production is still quiet, build a reasonable mix and test at least five modes.

A short chat with a fast answer is needed in almost any service: support, an internal assistant, knowledge-base search. It shows how many requests the system can handle without extra delay.

Long context and large output expose very different problems. A user uploads a contract or a long conversation and asks for a 1,500–2,000 token summary. This kind of test quickly shows growth in generation time, memory use, and queues.

The same request is worth running both in streaming and non-streaming mode. For the user, that is a different experience; for the infrastructure, it is different pressure on connections, proxies, and client timeouts.

A mixed flow of regular responses and tool calling is also a must. When the model calls search, CRM, or a calculator, one request can easily turn into a chain of several operations.

And finally, you need a sharp burst. Not a gentle ramp-up, but a jump after a campaign, a push, or the start of the workday. That is the mode that often breaks rate limits, the load balancer, and the connection pool.

A practical template looks like this: 70% short chats, 20% long requests with large responses, and 10% requests with tool calls. After that, add a burst: for two minutes, raise incoming traffic by 3–5x and see what degrades first.
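
As a sketch, the same template expressed as a load-generator config; the weights and burst factor come from the text above, and everything else is a placeholder:

```python
import random

# 70/20/10 scenario mix plus a 2-minute burst on top of the baseline.
SCENARIOS = [
    ("short_chat", 0.70),    # short prompt, fast answer
    ("long_request", 0.20),  # long context, 1,500-2,000 token output
    ("tool_call", 0.10),     # request that triggers tool calling
]

def pick_scenario():
    names = [name for name, _ in SCENARIOS]
    weights = [w for _, w in SCENARIOS]
    return random.choices(names, weights=weights)[0]

BASELINE_RPS = 20   # placeholder: use your own normal-hour rate
BURST_FACTOR = 4    # the 3-5x jump from the text; pick one and hold it
BURST_SECONDS = 120

def target_rps(elapsed_s, burst_start_s):
    in_burst = burst_start_s <= elapsed_s < burst_start_s + BURST_SECONDS
    return BASELINE_RPS * (BURST_FACTOR if in_burst else 1)
```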

Also test request cancellation and resubmission separately. Users often hit "stop", change the prompt, and run it again. If the service does not close old connections quickly, you get hidden load even at moderate RPS.

If you work through a gateway like AI Router, do not limit the test to the model alone. Run the same scenarios through streaming, through routing to an external provider, and on your own hosted model. Then you can see immediately where the system slows down: in generation, in the proxy layer, in audit logs, in PII masking, or in the external backend for tools.

What to measure before the first run

Before testing, record not one number but a set of metrics. RPS alone tells you almost nothing. For an LLM service, both requests per second and concurrent requests matter. A system may comfortably handle 20 RPS with short responses and fail at the same 20 RPS with long generation.
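
The gap comes from Little's law: requests in flight ≈ arrival rate × time each request stays in the system. With illustrative numbers:

```python
rps = 20
short_response_s = 2   # short chat
long_response_s = 25   # long generation

print(rps * short_response_s)  # 40 concurrent requests
print(rps * long_response_s)   # 500 concurrent requests at the same RPS
```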

Measure time to first token and time to the end of the response separately. For the user, these are two different delays. If the first token arrives quickly, the interface feels alive even during a long response. If TTFT grows, the issue is often not in the model, but earlier: in the load balancer, routing, key check, PII masking, or the queue in front of the GPU.
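
A minimal sketch of measuring both timings at once, assuming an OpenAI-compatible endpoint and the openai Python SDK; base_url, api_key, and the model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="test")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time to first token
total = time.perf_counter() - start         # time to full response

print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```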

The queue also needs to be visible by stage, not as one line in a dashboard. Usually five points are enough: entry into the API, routing and model selection, the preprocessing queue, the inference queue, and post-processing plus logging. Then you can immediately see where the backlog is building. If the queue grows only in front of inference, look at the GPU limit or the provider cap. If requests pile up earlier, before they ever reach the model, check CPU, network, rate limits, the log database, and all synchronous checks on the way in.

It helps to break errors down by type. 429, timeouts, and 5xx look similar to the client, but they need different fixes. 429 often points to limits on the key, the provider, or the model pool. Timeouts often come from long responses, slow retries, and a crowded network. 5xx usually appear when the proxy, worker, logging, or an external dependency fails.

Input and output size in tokens matters for almost every metric. Without it, it is hard to understand why the same load behaves differently. A request with 300 input tokens and 150 output tokens is one load class. A request with 8,000 input tokens and a long generation is another.

Before you start, also record CPU, GPU, network, and disk usage. Watch not only averages, but peaks too. Disk is often forgotten, even though it is what slows down audit logs, cache writes, and tracing. If you have a routing layer like AI Router, it is useful to split metrics by provider, by your own GPU infrastructure, and by specific model. Otherwise, you only see an averaged picture across everything.

How to build a test plan

It is better to start not with abstract RPS, but with real traffic. Take logs from a normal day and from the busiest hour. Look not at the average, but at the request profile: how many short chats you have, how many long conversations, how many document-backed requests, how many requests end in timeout.

If you go through a single gateway, this is easier to collect. In one place, you can see which models are called most often, where the context length grows, and at which minutes the queue starts to swell.

After that, group the traffic. For LLMs, this matters more than the total volume. A request with 200 tokens and a request with 20,000 tokens put completely different pressure on the system, even though both count as one API call.

A working plan usually looks like this:

  1. Choose 3–5 request types from the logs: short chat, long chat, RAG with a large context, batch task, streaming generation.
  2. For each type, record the input length and the typical response size. Do not stop at averages; keep the heavy tail too.
  3. Set two load levels: normal and peak. The first gives you a baseline, the second shows where the system starts to fail.
  4. Put warm-up into a separate stage. First warm up the connections, cache, worker pool, and model, then start the main run.
  5. Increase traffic in steps, for example every 5–10 minutes. One sudden jump creates a lot of noise and does a poor job of showing the moment the queue begins to build; a minimal ramp sketch follows this list.
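
A minimal sketch of such a ramp, with run_level standing in for however your load generator holds a fixed rate:

```python
import time

STEP_SECONDS = 5 * 60
LEVELS_RPS = [10, 20, 30, 40, 50]  # from baseline toward expected peak

def ramp(run_level):
    for rps in LEVELS_RPS:
        run_level(rps=rps, duration_s=STEP_SECONDS)
        # between steps: snapshot p95, p99, error rate, queue length
        time.sleep(30)  # short pause so steps do not bleed together
```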

Steps help reveal the turning point. Before it, latency rises gradually. After it, p95 and error counts shoot up almost immediately. That is the area worth digging into: ingress queue, provider limit, slow retrieval, overheated GPU, or a narrow connection pool.

It also helps to add one scenario that looks like a real peak hour, not a lab load. For example, 70% short support requests, 20% RAG over documents, and 10% long analytical requests. That mix usually gives a more honest picture than running one "average" request.

A good test plan answers one question: at what point does the service stop handling your real traffic without a queue, a sharp latency jump, and extra errors.

How to estimate queues without complex math

A queue grows when more work enters the system than it can finish. For a first estimate, two numbers are enough: arrival rate and processing rate. If you receive 50 requests per second and the service finishes 40, the queue grows by 10 requests per second. In one minute, that is already 600 requests.

This rule works both in testing and in production. Average latency may still look fine at first, but the tail gets worse quickly. The user does not see that on the RPS chart, but they do notice when the answer takes 20–30 seconds instead of the usual 3–5.

Look not only at requests, but also at tokens. Two streams at 20 RPS can load the system very differently. Short requests with 200–300 tokens will pass easily, while long generation with a large context will clog the model even if RPS does not change.

It is useful to count four stages separately: request intake (authorization and rate limiting); the steps before the model (routing, PII masking, database search); the model itself (prompt tokens/s and completion tokens/s); and the steps after the model (content checks, logging, packaging, and response delivery). That way you can see much faster where the blockage forms.

Sometimes the model is free, while the queue is building before it. For example, the service may be spending too long on audit logs or slowing down while masking personal data. The reverse also happens: the API front end accepts requests quickly, but token generation is slow, and the queue already lives inside the model workers.

Watch how long a request spends in the queue, not just how many requests are sitting there. A simple estimate is this: if there are 400 requests in the queue and the stage handles 40 per second, a new request will wait about 10 seconds. It is a rough estimate, but it is good enough for a first pass.
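
Both estimates are plain arithmetic:

```python
arrival_rps = 50
service_rps = 40

growth_rps = arrival_rps - service_rps  # queue grows by 10 requests/s
print(growth_rps * 60)                  # 600 requests after one minute

queue_len = 400
stage_rps = 40
print(queue_len / stage_rps)            # ~10 s wait for a new request
```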

Do not mix cold start and steady state. In the first few minutes, workers, connections, caches, and the model itself may still be warming up. If you test both modes together, the numbers get muddy. Warm up the service first, then measure stable load. Cold start is better tested as a separate scenario.

Where to look for bottlenecks besides the model

Very often the latency comes not from the model, but from the chain around it. If you break the request into stages, you can quickly see where the backlog is building: API gateway entry, key check, context search, model call, response validation, logging, and sending the result back to the client.

The API gateway is often the first limit. At peak, even a simple check for the key, quota, and key-level rate limit can add 20–50 ms to each request. On a single request this is barely noticeable, but with hundreds of concurrent calls, that small delay quickly turns into a queue in front of the model.

RAG search also often takes more time than expected. Vector search, reranking, document reads, and assembling a long context can easily take a full second before the first token. If the model responds steadily but p95 grows, the cause may be in the database, the cache, or the prompt builder.

Post-processing hurts throughput almost as much as inference. JSON validation, parsing long responses, schema checks, filters, and even simple regexes on a heavy stream load CPU and memory. Teams often look only at GPU usage, while one validation-heavy service can noticeably reduce RPS across the whole pipeline.

Audit logs, PII masking, and content labels create their own tail. If the service writes logs synchronously or runs every response through a separate validation module, latency rises in steps. This happens in banks, telecom, and the public sector, where control is required, but the control itself also needs to be tested under peak load.

Provider limits also distort the picture. Your servers may handle the load comfortably, while the external provider hits its RPM or TPM limit first and starts returning 429s or slowing down. In a setup with a single gateway, it helps to measure local processing time separately from upstream time. Otherwise, it is easy to blame the wrong layer.

Network is checked less often than it should be. DNS, TLS handshake, short connections without keep-alive, extra hops between regions, and a slow load balancer create an ugly, uneven tail. This is especially visible on short requests, where the model answers quickly and almost all the time goes into connection setup.

How to tell where the bottleneck really is

The most practical way is to run the same traffic with the model replaced by a stub fixed at 100–200 ms. If latency barely changes, look for the problem before or after the model.
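
A minimal sketch of such a stub, assuming a FastAPI service exposing an OpenAI-compatible route; point the gateway at this instead of the real model:

```python
import asyncio
import random

from fastapi import FastAPI

app = FastAPI()

@app.post("/v1/chat/completions")
async def stub_completion():
    await asyncio.sleep(random.uniform(0.1, 0.2))  # stand-in for inference
    return {
        "choices": [{"index": 0, "finish_reason": "stop",
                     "message": {"role": "assistant", "content": "stub"}}]
    }

# Run with: uvicorn stub:app --port 8001
```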

Then disable parts of the pipeline one by one: RAG, logging, PII masking, JSON validation. This kind of test usually gives a more honest answer than a single latency graph.

Capture stage timings in one trace: gateway, auth, retrieval, upstream, postprocess, logging, response flush. If model latency stays flat while the gateway queue grows from 30 to 300 requests, you have already found the bottleneck.
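
A minimal sketch of collecting those timings into one trace per request; the stage names are illustrative, and how you ship the trace (logs, OpenTelemetry, your own tracing) depends on your stack:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(trace, name):
    start = time.perf_counter()
    try:
        yield
    finally:
        trace[name] = time.perf_counter() - start

trace = {}
with stage(trace, "gateway"):
    ...  # auth, rate limit, routing
with stage(trace, "retrieval"):
    ...  # RAG search and context assembly
with stage(trace, "upstream"):
    ...  # the model call itself
with stage(trace, "postprocess"):
    ...  # validation, masking, logging, response flush
print(trace)  # one line per request: where the time actually went
```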

Example of a peak hour

At 19:00, an online store launches a promotion, and traffic changes within a couple of minutes. Before the launch, the service was handling, for example, 40 requests per minute. After the push and email blast, the flow quickly grows to 500–700, but the requests are no longer the same.

About half of the users ask short delivery questions: "When will it arrive?", "Is pickup available?", "Can I pay in installments?" These answers are short, and the model handles them easily. The other part asks for product recommendations from a long description: "I need a laptop for design work, under this budget, with quiet cooling and a good battery." Here the request is longer, more context goes into the prompt, and RAG more often searches the catalog and stock database.

At the same time, operators are not waiting. The CRM keeps sending conversations for summarization so managers can see a short recap of the dialogue. For the model, this is normal work. For the whole system, it is one more flow of tasks using the same queues, the same workers, and often the same logging database.

At 19:03, the model responds at almost the same speed as before. At 19:05, the queue for context search in the catalog grows. At 19:07, logging starts writing more slowly because of the spike in events. By 19:10, users already see the delay, even though the GPUs have not hit their ceiling yet.

That is the whole point of load testing. If the RAG stage can process 80 requests per second and 100 are coming in, the queue grows by 20 requests every second. In one minute, there are already 1,200 requests waiting. The model itself may still look "healthy": its p95 barely changes, while the user’s full response latency grows from 2–3 seconds to 10–15.

The worst effect is that short delivery questions start slowing down too. They are cheap, but they sit in the same queue next to heavy product recommendations and conversation summaries. If logging is synchronous and personal-data masking runs in the same pipeline, the delay gets even worse.

In a peak hour like this, the bottleneck is often not in the model. It usually ends up in context search, logs, the database connection pool, or a shared worker pool where fast and heavy tasks were mixed together.

Common testing mistakes

Many teams get neat charts and still miss the real load. The reason is usually simple: the test looks like a lab, not live traffic.

The most common mistake is to run the same prompt hundreds of times. That is convenient, but it tells you almost nothing about the peak. In production, requests vary in length, number of messages in history, expected response size, and enabled tools. A short 30-token question and a long conversation with deep context create completely different queues.

Because of that, the team sees a "normal" average latency and relaxes. But users are hit by the tail: p95, p99, timeouts, and error spikes. If 90% of requests are fast and 10% wait 20 seconds, the service already feels broken.

Another miss is forgetting that the client and SDK do retries on their own. In the graph, you see 100 requests per second, but in reality the system gets 130 or 150 because some calls were repeated after a timeout. The queue grows fast, and it feels like the model is the only cause. In practice, retries are what amplify the load.

Problems often sit outside the model. Teams test only inference, but not the whole path of the request: load balancer, API gateway, authorization, PII masking, audit logs, rate limiting, network to the provider, and sending the stream back to the client. If the company works through a single OpenAI-compatible gateway, the test should cover the full route, not just the model call in a vacuum.

Another trap is provider limits and the network. Even if your side of the system handles the peak, the external provider may cut RPS, tokens per minute, or the number of concurrent requests. A few extra milliseconds at each hop also add up into a queue.

And finally, do not mix cached and uncached traffic into one pile. If half the test hits prompt cache, the numbers look too good. Then the release ships, the cache cools down, and latency suddenly doubles. It is fairer to measure the two profiles separately and see how the tail changes in each one.

A load test only works when it resembles an ordinary day in your system, not a convenient synthetic run.

Quick check before release

An hour before release, do not try to "tune a little more". It is better to quickly confirm that the service can handle an ordinary day and a peak, and that the team knows where it will start to break.

Keep two traffic profiles ready. The first shows normal load: average prompt size, familiar response length, steady request flow. The second simulates peak: more concurrent sessions, sudden bursts, long responses, and a share of heavy operations such as JSON output or calling a large model. If you only have one averaged scenario, the test almost always looks too calm.

Before deployment, a short checklist is enough:

  • you can see separate queues at each stage: input gateway, PII masking, routing, provider or your own model call, logging;
  • the team knows the limit for concurrent requests in real numbers, not "by feel";
  • timeouts are set explicitly for each step: connection, first token, and full response (see the sketch after this list);
  • retries do not create a storm when upstream is already slow;
  • monitoring catches not only average latency, but also the tail, 429s, and 5xxs.
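
For the timeout item above, a minimal sketch assuming an httpx client; the values are placeholders. The read timeout caps the gap between chunks, which on a streaming response bounds the wait for the first token; the full-response budget still needs its own deadline:

```python
import httpx

timeout = httpx.Timeout(connect=2.0, read=10.0, write=5.0, pool=2.0)
client = httpx.Client(timeout=timeout)

# For the full-response budget, keep a separate deadline, e.g.
# check time.monotonic() against a cutoff while iterating the stream
# and cancel the request when it is exceeded.
```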

Queues are easier to check by stage than on one total line. If overall latency went from 4 to 12 seconds, the model may not be the reason. Often the time is spent earlier: the request waits too long for a free worker, hits the provider rate limit, or gets stuck in post-processing. In a setup with an LLM gateway, this is especially visible: the model itself may be steady, while the bottleneck hides in routing, key-level limits, or the external provider.

The final check is simple and very useful: run the same test again. If the second result is close to the first one in p95, errors, and queue length, the test is believable. If the numbers swing too much, first fix the environment and the load generator, and only then ship the release.

What to do after the run

Look not at average latency, but at the first stage that stalled. Sometimes the model is slow, but more often the input limiter, connection pool, worker queue, logging, or audit storage gives out first. Record the failure point in numbers: at what RPS or number of concurrent requests the queue grew, where timeouts started, and which component produced the first error spike.

Then remove the things that amplify the incident on their own. Extra retries are the usual culprit: one slow response turns into three new requests and clogs the system faster than any peak load. If the queue is already past the safe threshold, turn on backpressure: reject new work sooner, return 429 or 503 earlier, set a short wait timeout in the queue, and do not keep a request hanging for 30–60 seconds "just in case".
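
A minimal sketch of that admission logic for an asyncio service; the queue size, wait budget, and status codes are all placeholders:

```python
import asyncio

MAX_IN_FLIGHT = 200
QUEUE_WAIT_S = 2.0
slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def admit(handle_request):
    try:
        # wait briefly for a slot instead of queueing indefinitely
        await asyncio.wait_for(slots.acquire(), timeout=QUEUE_WAIT_S)
    except asyncio.TimeoutError:
        return 503, "overloaded, retry later"  # reject early
    try:
        return 200, await handle_request()
    finally:
        slots.release()
```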

It is also useful to split traffic across different routes right away. Short requests like classification, field extraction, or a short chat should not sit behind long summaries and agent chains. If you mix them in one queue, users of short scenarios will see poor latency even when overall load is still acceptable.

After the run, four fixes are usually enough:

  1. Limit automatic retries on the client and in the gateway.
  2. Introduce backpressure based on queue length instead of waiting for everything to hit timeouts.
  3. Split routes for short and long requests.
  4. Recheck fallback between models and key-level limits.

Fallback needs to be tested separately. You cannot just assume it "will work". If the primary model slows down, the backup route should not double the cost, break the response format, or send all traffic to the same provider under a different alias. Key-level limits matter too: one noisy client can easily consume capacity and hurt SLA for everyone else.

A small example: if 80% of traffic is short requests under 300 input tokens and 20% are long tasks with several thousand tokens, keeping them in one pool is usually a bad idea. Splitting queues often gives a noticeable win even without changing the model.

If you need to compare the same load profile across different routes, it is convenient to do it through a single OpenAI-compatible gateway. For example, in AI Router on airouter.kz, you can run scenarios separately through an external provider and through your own open-weight models, without changing the SDK, code, or prompts. That kind of test quickly shows exactly where the bottleneck sits: in routing, upstream limits, local GPU infrastructure, or client retries.

Frequently asked questions

How can I tell whether the model is fine and the surrounding stack is the problem?

Look at timing by stage, not just overall latency. If model latency stays almost flat while the queue grows at the gateway, in RAG, in logs, or in postprocess, the problem is in the surrounding stack.

What should I measure besides average latency?

Average latency hides the tail. Keep p95, p99, timeout rate, TTFT, full response time, number of concurrent requests, and queue length by stage next to it.

Why measure time to first token separately?

TTFT shows when the user sees the first sign that the interface is alive. If the first token arrives late, look for the cause before inference: in auth, routing, PII masking, queues, or the network.

What scenarios should a good load test include?

Use a mix of live traffic, not one "average" request. Usually a short chat, long-context requests with large outputs, streaming, tool calling, and a sharp burst after a quiet baseline are enough.

How can I estimate queue growth without complex math?

Compare request arrival speed with processing speed. If 50 requests per second arrive and the stage finishes 40, the queue grows by 10 per second and each new request starts waiting longer.

Why are retries so dangerous during peak load?

Because retries amplify the peak on their own. One timeout triggers another request, then another, and in a few minutes the system spends its effort on its own storm instead of useful traffic.

Do I need to test streaming separately?

Yes, it is a separate scenario. Streaming keeps connections open longer, loads proxies differently, and stresses client timeouts more, so numbers without streaming often look too calm.

Where is the bottleneck usually hiding besides the model?

Often the first bottleneck is the API gateway, RAG search, audit logs, PII masking, JSON validation, the connection pool, or the network to the provider. Even an extra 20–50 ms at each step adds up quickly on a large stream.

How do I test fallback between models?

Run the fallback with the same traffic as the primary path. Watch not only errors, but also cost, response format, TTFT, and whether the backup route still goes to the same provider under a different name.

What should I do after the run if the queue is already growing?

Turn on backpressure right away and cut unnecessary retries. Then split short and long requests into different queues, record the breaking point in numbers, and check which stage stalled first.