Microbatching LLM Calls: How to Cut Costs Without Breaking SLA
Microbatching LLM calls helps cut the cost of internal tasks without adding too much latency. We’ll look at where batches make sense, how to protect SLA, and what to measure.

Why single calls get expensive fast
An LLM call is not just the price of the output tokens. In most cases, you also pay for the overhead: the system prompt, message formatting, limit checks, logging, the network hop, and sometimes PII masking. If the task itself is short, that fixed overhead starts taking up too much of the bill.
Imagine an internal service that tags incoming requests. One short piece of text, one short answer. There is not much useful work, but the wrapper around the request is almost the same as for a much longer task. When you have thousands of these small calls every day, the budget goes to repeating the same overhead instead of the actual work.
The problem grows faster during working hours. At 10 a.m., employees launch email summaries, meeting breakdowns, document classification, and CRM card checks at the same time. Many of these tasks are not urgent, but the system treats them as urgent because they arrive one by one and immediately. The queue swells, rate-limit pressure rises, then come 429s, timeouts, and retries.
For internal processes, this is especially annoying because some tasks can easily tolerate a delay of 2–10 seconds. The user is not waiting for an answer in chat. They are opening a report, switching tabs, or starting a batch check. But the app still sends every request separately, as if every millisecond mattered.
Where the money disappears quietly
Most of the budget is usually eaten not by long conversations but by small, repeated calls. The same system context goes into every short call, every small task takes its own place in the queue, one timeout triggers a resend of the same prompt, and retries duplicate the history and service fields.
That is why single calls get expensive quickly. The price is not just the model’s answer, but also the shape in which you send the work. When requests are small and frequent, the shape starts costing too much.
Where small batches help
Small batches work best where no one is waiting for an answer at that very second. If the task lives in an internal queue and an extra 2–10 seconds does not break anything, microbatching often brings a noticeable cost win. This is especially clear with short and similar requests.
A good example is processing reviews, tickets, and internal requests as they come in. Let’s say 12 new support messages arrive in the queue within 3 seconds. Instead of 12 separate calls, the system collects them into one small batch, asks the model for the topic, urgency, and a short summary, and then maps the answers back to the cards. The employee opens the ticket only after that, so the extra delay is almost invisible.
Bulk short summaries for emails and documents also fit this mode well. In the morning, sales or procurement gets dozens of similar emails: price request, deadline clarification, attached contract, follow-up. If the prompt is the same for all and the answer needs to be short, the model usually handles it consistently. A fixed result format, such as 2–3 lines of text or a fixed JSON, helps a lot.
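For illustration, a fixed JSON shape with a strict parser might look like the sketch below; the keys here are assumptions for this example, not a recommendation:

```python
import json

# Hypothetical fixed answer shape for a batch of short email summaries.
# The exact keys are an assumption; the point is that every item in the
# batch must come back in the same, easily checkable shape.
EXPECTED_KEYS = {"topic", "action_needed", "summary"}

def parse_batch_answer(raw: str) -> list[dict]:
    """Parse the model's answer (a JSON array, one object per email) and
    drop any item that breaks the fixed format instead of failing the batch."""
    items = json.loads(raw)
    return [it for it in items if isinstance(it, dict) and EXPECTED_KEYS <= it.keys()]
```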
Classification of requests, leads, and cases also often pays off. It is routine work with a clear set of labels: "new lead", "repeat", "complaint", "needs a call", "spam". When there are only a few possible answers and no long text is needed, batches lower the cost without hurting internal SLAs in any noticeable way.
Another strong use case is checking an answer before it is shown to an employee. For example, one model drafts a reply to a customer, and a second model quickly checks the tone, PII, or internal policy violations. If this check runs in the background and fits into a couple of seconds, the operator simply sees a cleaner draft. For banking, telecom, and healthcare, that is often more useful than chasing the fastest single call.
Background extraction of fields from forms is also close to ideal. Loan applications, forms, insurance cases, and internal service forms usually have a similar structure. You need to pull out the name, ID number, contract number, date, and request type. The work is repetitive, the answer is short, and mistakes can be sent back for reprocessing.
Usually these are tasks with similar prompts, short formal answers, queue-based processing rather than live chat, and tolerance for a small delay. Small batches love boring, repetitive work. That is where their main strength lies.
When it is better to stay in single-call mode
Microbatching does not help everywhere. If a person is waiting for an answer right now, an extra 300–800 ms spent collecting a batch can cost more than the token savings. For internal tasks, this is especially noticeable where the model works inside a live interface.
The first example is an operator chat during a live conversation. The operator writes to the customer, looks at the suggestion, and replies immediately. If the system waits for a few more nearby requests, the conversation starts to feel off: the operator pauses, the customer notices the delay, and the pace drops. In that case, it is better to keep single-call mode and reduce cost in other ways: shorten the context, choose a cheaper model, or cache repeats.
The same applies to prompts during a call. When a contact-center employee asks the model, "how should I answer this objection" or "what should I ask next", the answer is needed almost instantly. Even a small queue gets in the way. On a call, every second is audible, and team frustration builds quickly.
Be careful with steps where the model’s answer immediately triggers the next action. For example, the model checks the text of a request, and then the system either opens a form for revision or sends it further down the process. If you add waiting just to batch it, the whole chain gets stretched. The more dependent steps you have in a row, the worse batch logic works there.
Another poor candidate is rare requests with very long context. There are too few of them to build a batch from. And the request itself is heavy: it uses memory, takes longer to process, and can slow down nearby tasks if mixed into one batch.
Single-call mode is usually the better choice when a person is watching the screen or talking to a customer, when the next step starts right after the model responds, when requests arrive rarely and do not form a steady queue, or when a single request is already expensive in time because of its long context. A simple rule works well: if a person notices the delay, or the next step gets blocked by it, this part is better left alone.
How to launch microbatching
It is better to start microbatching not on all traffic at once, but on one calm scenario. Good candidates are internal tasks where an extra second does not break the process: request triage, document tags, short summaries, or checking product cards. If you start with a customer chat, complaints will come very quickly.
First, split the flow into two parts. Send urgent requests one by one, without a queue. Put everything that can tolerate a short pause into a separate queue. This simple rule removes most of the risk before the first tests even begin.
- Set a maximum wait time for the queue. Often 1–3 seconds is enough. If the batch is not full by then, send what you have (see the worker sketch after this list).
- Start with small batches of 4–8 requests. That is usually enough to reduce cost without stretching the response into uncomfortable territory.
- Group similar tasks together. The best results usually come from grouping by text length, prompt type, and expected answer size. If you mix short and very long requests, the whole batch will wait for the heaviest one.
- Keep separate rules for errors. If one request in a batch times out or hits a limit, do not block the others. Retry only the failed items.
- Compare more than the average cost. Look at the cost per useful answer, p50, p95, and the error rate. p95 usually shows whether things improved or whether the problem just moved into the latency tail.
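The size cap and the wait cap from the first two points fold naturally into one small worker loop. Here is a minimal sketch, assuming an asyncio.Queue of already-grouped tasks and a hypothetical send_batch coroutine that makes one model call per batch:

```python
import asyncio

MAX_BATCH = 4      # start small, as suggested above
MAX_WAIT_S = 2.0   # hard cap: the first task never waits longer than this

async def batch_worker(queue: asyncio.Queue, send_batch):
    """Collect up to MAX_BATCH tasks or wait at most MAX_WAIT_S, whichever
    comes first, then flush whatever has accumulated in one model call."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]          # block until the first task arrives
        deadline = loop.time() + MAX_WAIT_S  # the timer starts with that first task
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break                        # timer fired: send a partial batch
        await send_batch(batch)              # one call for the whole batch
```

The useful property here is that the timer starts with the first queued task, so a lonely request is never stuck waiting for a full batch.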
If you already have an OpenAI-compatible gateway such as AI Router on airouter.kz, it is convenient to place this batching step in front of model selection. That way, you first collect the right batches and then decide where to send them, without changing your main code or SDK.
A good start looks boring, and that is a good sign: one queue, one task type, batch size 4, a 2-second timer, and measurement over several days. If p95 stays within normal limits and the cost per task drops, try a batch size of 6 or 8. If the queue grows at the start of the hour or after lunch, reduce the batch size instead of waiting for SLA to slip.
How not to break SLA
SLA usually breaks not because of the batch itself, but because people put everything into one batch and wait too long. If a request needs a response in 2–3 seconds, it does not belong in the same queue as tasks that can calmly live for 20–30 seconds.
The first rule is simple: keep urgent requests separate. Knowledge-base search for an employee, live operator suggestions, and quick text checks should go through their own flow. Summarizing emails, labeling archives, extracting fields from documents, and nightly reports can be safely collected into small batches.
Microbatching usually works without pain if you set two hard limits right away: the maximum batch size and the maximum wait time. For example, up to 8 requests or up to 150 ms, whichever comes first. That way the queue does not blow up, and users do not feel that the system is "thinking" just to save money.
Another trap is mixing short and very long prompts. One heavy request drags the whole batch down, and p95 grows faster than the API bill falls. It is better to split traffic into at least two classes: light requests and long ones. It is a rough rule, but it often protects SLA better than fine-tuning.
There is a second source of failures: retries and provider limits. If the provider returns 429 or a temporary error, the batch will go through again, and response time will rise immediately. So you need to budget not only for normal model latency, but also for retries, pauses between them, and RPM or TPM limits. Without that, the spreadsheet looks nice, but during working hours the queue suddenly turns to concrete.
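One way to keep a single 429 or timeout from re-sending everything is to retry only the items that failed, with backoff between attempts. A minimal sketch, where call_model and its return shape are assumptions for illustration:

```python
import asyncio
import random

MAX_RETRIES = 3

async def process_batch(batch, call_model):
    """Retry only failed items, never the whole batch. call_model is a
    hypothetical coroutine that sends one batched request and returns
    (succeeded_results, failed_items) for that attempt."""
    results, pending = [], list(batch)
    for attempt in range(MAX_RETRIES + 1):
        ok, pending = await call_model(pending)
        results.extend(ok)
        if not pending:
            break
        # Exponential backoff with jitter, so retries do not all land on the
        # provider's RPM/TPM window at the same moment after a 429.
        await asyncio.sleep(min(2 ** attempt, 10) + random.random())
    return results, pending  # leftovers go to a dead-letter queue, not back in line
```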
A practical minimum looks like this: do not mix urgent and background tasks, limit both batch size and wait time, separate short and long prompts, count SLA with retries and rate limits in mind, and turn batching off if p95 grows faster than the savings. If savings are 8% and p95 rises by 40%, it is better to disable batching for that scenario.
What to measure before and after
If you look only at token cost, it is easy to draw the wrong conclusion. For a business, the more important number is the cost of one completed task: how much one case classification, one short email summary, or one support answer costs. Microbatching may lower the price on paper, but if some tasks time out or need to be resent, the savings disappear quickly.
Measure metrics at the task level, not just at the model-request level. Then it becomes clear that cheap tokens can still produce an expensive result because of delays, empty runs, and retries.
It helps to keep a few numbers in front of you:
- cost of one successfully completed task
- average response time and p95
- share of requests that landed in full or nearly full batches
- retry, timeout, and cancellation rates
- breakdown by task type and by time of day
Average time by itself does not say much. The user or internal service suffers not from the average, but from the long tail. If the average drops from 2.1 to 1.8 seconds, but p95 rises from 4 to 11 seconds, the setup got worse even if the price went down a little.
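These numbers are cheap to compute from per-request logs. A small sketch with illustrative field names, assuming one latency, cost, and success flag per task:

```python
import statistics

def batch_report(latencies_s: list[float], costs: list[float], ok: list[bool]) -> dict:
    """Task-level view: cost per *completed* task plus the latency tail.
    quantiles(n=20) yields 5% steps, so index 9 is p50 and index 18 is p95."""
    q = statistics.quantiles(latencies_s, n=20)
    return {
        "cost_per_completed_task": sum(costs) / max(sum(ok), 1),  # retries inflate this, as they should
        "p50_s": q[9],
        "p95_s": q[18],
        "error_rate": 1 - sum(ok) / len(ok),
    }
```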
Look at batch fill rate separately. If you set a batch size of 8 tasks but usually collect only 2 or 3, you are paying in waiting time without getting the full benefit. During working hours the picture often changes: in the morning traffic is dense, after lunch it gets uneven, and in the evening batches form faster again.
Do not mix different tasks into one report. Short moderation, document field extraction, and long-form answer generation behave differently. For one group, microbatching is useful; for another, it only adds delay.
If traffic goes through AI Router, audit logs and key-based limits help you quickly see where the savings are real and where they were eaten by timeouts, repeated calls, or too little traffic. A good rule of thumb is simple: the cost of one useful task should go down, and p95 should stay within SLA.
Example for an internal queue
For a quality team, there is usually no point in sending every conversation to the LLM one by one. If the team checks about 600 conversations a day, and each one needs a short review using the same template, single calls simply repeat the same instruction hundreds of times.
Usually, people do not need an answer in one second. They need the finished result in the interface within 5 minutes so they can review the shift, mark mistakes, and pass edge cases to a manager. That is a good fit for microbatching: there is a waiting window, the template is the same, and the result is short and predictable.
The workflow is simple. The queue collects new tasks, the worker waits for either 6 conversations or 30–45 seconds, and sends them as one batch. In the prompt, the shared instruction is written once, and then six conversations follow with the same answer structure.
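As an illustration, the prompt assembly can be this simple; the separators and the keyed-JSON instruction are assumptions, not a required format:

```python
def build_batch_prompt(instruction: str, conversations: list[str]) -> str:
    """Write the shared instruction once, then number each conversation so
    the answers can be mapped back to the right cards afterwards."""
    parts = [instruction,
             "Review each conversation below. Answer with one JSON object "
             "per conversation, keyed by its number."]
    for i, text in enumerate(conversations, start=1):
        parts.append(f"--- Conversation {i} ---\n{text}")
    return "\n\n".join(parts)
```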
The price drops for two reasons: the shared instruction is no longer repeated in every request, and the system makes fewer network calls. In practice, a few simple rules are often enough: normal checks go into a shared queue, the batch size is limited to six conversations, the batch has a timer, and the model returns the answer in the same format for every conversation.
At this pace, the team usually does not complain about speed. If one conversation used to go into processing immediately and now waits up to 45 seconds inside the queue, the user often does not even notice. They care about the result within SLA, not the exact moment the request started.
Urgent checks are better kept separate. A customer complaint, an incident review, or a request from management should not sit in the same batch as routine work. For those, teams create a second queue without microbatching, or with a batch size of one.
This example is good because it does not require a complex process overhaul. You do not change the evaluation template, you do not teach employees a new workflow, and you do not promise an instant answer. You simply give the system 30–45 seconds to gather a small batch and get a noticeable reduction in LLM cost without missing the 5-minute deadline.
Common mistakes
Microbatching usually breaks not because of the idea itself, but because of rough setup. The team sees savings in tests, turns batching on for all traffic, and a week later gets complaints about delays in places where everything used to work smoothly.
The most common mistake is putting everything into one batch. Short classifications, long summaries, field extraction from documents, and internal operator prompts all live at different speeds. If you mix such requests, the whole batch starts waiting for the heaviest prompt. In the end, the savings exist only on paper, and the queue behaves worse.
No less trouble comes from setting the batch size to "maximum". In a staging environment, a batch of 32 requests may look cheap. During working hours, it often takes too long to fill. For internal tasks, you almost always need two limits at once: how many requests can be collected and how many milliseconds they can wait. Otherwise the system chases a full batch and loses SLA.
Another trap is looking only at average latency. The average almost always looks reassuring, but people feel p95, not the average. If 80% of requests are fast and the rest sit for 4–6 seconds, the report looks fine, while employees are already bypassing the system manually. So you need to watch not just latency, but queue age, batch size, and the share of overdue requests.
People also often forget a separate route for urgent tasks. A suspicious transaction check, a live chat suggestion, or a short support reply should not sit in the same line as nightly report summaries. Urgent requests need their own path: single-call mode or very small batches.
The last mistake seems minor, but it hurts badly: the team takes conclusions from one dataset and applies them to every process. Microbatching may work well on support tickets, but on legal documents or medical records it can behave very differently. Prompt length, answer size, and traffic spikes are not the same there. That is why each flow should be tested separately, not by analogy.
Pre-launch checklist
Microbatching should not be turned on out of habit. It works well where a task can wait 2–5 seconds and the user will not notice. For internal labeling, email summaries, draft replies, and nightly checks, that is often fine. For a live operator chat, it is already questionable.
Before launch, it helps to check four things. A delay of a few seconds must not break the process. Prompts should be similar in shape and size, otherwise batches quickly become uneven. The team should define an acceptable p95 in advance, without vague words like "fast". And single-call mode must be recoverable in one minute with one flag or routing rule.
There is another practical test. Take 100–200 real requests from a normal working hour and look at the spread in length. If some prompts are 300 tokens and others are 10,000, one shared batch will quickly become awkward. In that case, it is better to split the flow by task type first.
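A quick way to run that test on a sample, sketched here with character counts standing in for tokens:

```python
import statistics

def length_spread(prompts: list[str]) -> tuple[float, float]:
    """Return (p50, p95) prompt length over a sample of real requests.
    Character counts are a cheap stand-in for token counts here."""
    q = statistics.quantiles([len(p) for p in prompts], n=20)
    return q[9], q[18]

# A rough reading, as a rule of thumb rather than a hard threshold: if p95
# is an order of magnitude above p50, split the flow by task type first.
```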
A good start looks boring, and that is a plus. One internal queue, a small batch size, a strict wait limit, and a simple rollback. For example, the team batches only employee request summaries with a 1–2 second window. If p95 goes above the agreed threshold or errors grow, traffic immediately returns to single-call mode. That kind of launch gives an honest answer to the key question: does it actually reduce LLM cost without SLA risk?
What to do next week
Do not start with the most visible scenario. For the first pilot, it is better to choose one internal process with a simple, clear SLA: ticket triage, call summaries, draft support replies, or document classification. If the task can tolerate a 10–60 second delay and does not affect the customer in real time, it is the best fit.
A good pilot should not drag on for a month. Five to ten working days are usually enough to see the difference in cost, latency, and number of handled tasks. During that time, the team will understand whether the waiting time bothers people or whether no one notices it while the system collects requests in small batches.
Split the traffic into two routes right away. Send urgent calls one by one, without waiting in a queue. Put background tasks into a short queue and collect them into small batches by size or by timer. This approach is easy to explain to the business: you do not touch what must answer immediately, and you save where people are not sitting at the screen with a stopwatch.
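A sketch of that split, with hypothetical task types and names; the background queue is the same one the batch worker drains:

```python
URGENT_KINDS = {"live_chat_suggestion", "call_prompt"}  # illustrative task types

async def route(task, call_model_now, background_queue, batching_enabled=True):
    """Urgent work goes straight to the model; everything else waits in the
    batching queue. The single flag makes rollback to single-call mode trivial."""
    if task.kind in URGENT_KINDS or not batching_enabled:
        return await call_model_now(task)  # single-call path, no queue
    await background_queue.put(task)       # picked up later by the batch worker
```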
The plan for next week can be very simple. On day one, choose one internal process and record its SLA, current cost, and average response time. On day two, create separate routes for urgent and background requests. On day three, turn on microbatching only for the background flow with very modest settings. Over the next few days, watch cost, latency, errors, and team complaints. At the end of the pilot, compare the numbers with the baseline and decide where to keep batch mode.
If the gain is noticeable, keep batching in one or two internal flows and do not push it further too early. If the benefit is barely visible, do not complicate the architecture for no reason. A normal first pilot ends in a grounded way: one flow moved to batches, urgent requests stayed as they were, and the team understood the method’s limits. That is more than enough for the first week.
Frequently asked questions
What is microbatching in simple terms?
It’s a mode where the system does not send every request to the model right away. It waits for a short window, such as 1–3 seconds, collects several similar tasks, and sends them together. This approach usually works best for background processes where no one is waiting at the screen.
Where does microbatching save the most money?
It usually pays off on short, repetitive tasks: ticket tags, short email summaries, field extraction from forms, and simple request classification. In those cases, the overhead around the request takes up too much of the cost, while a small delay hardly bothers anyone.
When is it better to keep single calls?
Do not use it where the answer is needed immediately. Operator chat, live call prompts, and steps that trigger the next action right away are better kept in single-call mode. Even a few hundred milliseconds can make the system feel slower.
What batch size and timer should I start with?
For a first pilot, a batch of 4 to 8 requests and a waiting window of 1 to 3 seconds is usually enough. That gives you a good test of savings without creating too much latency. If traffic is uneven, keep the batch smaller rather than waiting too long for a full batch.
How do I avoid breaking SLA after launch?
First separate urgent and background tasks. Then set two hard limits: how many requests a batch can hold and how long it can wait. If p95 grows faster than the cost per task falls, turn batching off for that scenario and do not try to force it with tuning.
Can I mix short and long requests in one batch?
It’s better not to mix them in one queue. One long prompt slows down the whole batch, and fast tasks end up waiting for someone else’s heavy work. It’s easier to keep at least two flows: light requests separately and long ones separately.
What should I do if one request in a batch times out or hits a limit?
Do not let one failed item block the whole batch. Process the successful items right away and send the problematic request for a separate retry. It is even better if each item has its own ID, so a retry does not create a duplicate in the business process.
Which metrics should I track before and after implementation?
Do not look only at token price. It is more useful to track the cost of one completed task, average response time, p95, retry and timeout rates, and batch fill rate by hour. These numbers quickly show where you are really saving money and where you are just hiding delay in the tail.
How do I choose the first process for a pilot?
Choose a calm internal process where an answer can wait at least a few seconds. Good options are ticket triage, short summaries, document tags, and checks of standard cards. The prompts should be similar, the answer should be short, and switching back to single-call mode should take a minute, not a week.
How do I know the pilot really worked?
The pilot worked if the cost of one useful task dropped noticeably, p95 stayed within SLA, and the team did not start complaining about delays. If the savings are small, while the queue gets stale and errors increase, it is better to switch that flow back to single-call mode and keep the system simpler.