Oct 31, 2024 · 8 min read

When a reranker pays off: recall, latency, and cost

Let’s look at when a reranker pays off in search: how to measure recall gains, the impact on latency, request cost, and when the extra step is not worth it.

Traditional search often does only half the job. It finds documents by words, topic, or the general meaning of the query. But that is where the debate starts: the right passage may not be first, but sixth or eighth.

For the system, that may look acceptable. For a person, it does not. Users almost always look at the top of the list, and a mistake in the first positions hurts more than a miss near the bottom. If the correct answer is in the results but hidden too low, people still feel the search failed.

That is why teams add a reranker. It takes a short list of candidates, say 20 or 50 documents, and reorders them not by rough similarity, but by a more precise match to the question. The idea is simple: do not search again, just move a result that was already found to where it belongs.
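As a rough sketch of that two-stage shape, here is a toy version in Python. A crude keyword-overlap retriever stands in for the fast first stage, and the second stage accepts whatever scoring function you plug in (a cross-encoder in a real system). Everything here is illustrative, not a specific library's API.

```python
# Toy two-stage sketch: keyword overlap as the cheap first stage,
# rerank() as the slower, more precise second stage.

def keyword_retrieve(query: str, docs: list[str], top_k: int = 50) -> list[str]:
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(d.lower().split())), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    # score_fn(query, doc) -> float is whatever precise model you use.
    ordered = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ordered[:top_k]

# Only the reranked working top reaches the user or the RAG prompt:
# top5 = rerank(query, keyword_retrieve(query, docs), score_fn=my_reranker)
```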

The debate is usually not about the idea itself, but about the cost of that step.

If a reranker moves the right answer from position 7 to position 2, the user notices it right away. In support, internal knowledge base search, or RAG for an operator, the difference shows up fast. The same set of documents starts to feel "smarter" simply because the order improved.

But an extra step is never free. The team pays in response time, a separate model call, and a more complex search pipeline.

And even a good reranker does not always pay off. If the base search already puts the right document near the top consistently, the second step changes almost nothing. It only adds milliseconds and inference cost.

Sometimes the opposite happens. The base search finds "similar" documents but confuses close wording, instruction versions, or nearly identical product cards. Then the reranker often fixes exactly the mistake a person sees: it puts the more accurate passage higher instead of just the most similar text.

That is why the debate never really dies down. Some people look at recall across the whole list and say the base system is already good enough. Others look at the top of the results and see that users still do not reach the right answer. In practice, everything comes down to the trade-off between a better working top and the losses in latency and cost.

What counts as a gain, and what counts as a cost

You can only tell whether a reranker pays off at the task level, not from a pretty metric in a report. Teams often see quality improve across a long candidate list and draw the wrong conclusion. Users do not read the top 100. They see the first 3-5 results, or they get a final answer right away.

The metric that matters more

Focus first on recall in the working top. If the reranker moved the right document from position 18 to position 4, that is real value. If it only shuffled documents inside top-50 while nothing changed in the top-5, the benefit is close to zero.

It is better to tie metrics to what happens after search. What matters is not whether the system found a document somewhere in the list, but whether the task got done: the operator found the return policy in one query, the bot gave the correct answer without extra clarification, the manager opened the right customer card on the first try, and the employee did not end up manually scanning a long results list.

That view quickly brings things back to reality. A rise in recall on candidates does not pay for the extra step by itself. Only the improvement that changes the outcome of the task pays.

A good test is simple. Take a set of real queries, look at the top results without a reranker and with one, and then mark not just the position of the correct document, but also whether the person or bot was able to finish the task. For support, the difference between a document in 2nd place and 9th place is huge. For an analyst who will open 20 cards anyway, it may be almost zero.
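A minimal version of that check might look like the sketch below. The query set, the reference document IDs, and the two search functions are assumed to come from your own logs and pipeline.

```python
# Before/after check on the working top. queries, search_without, and
# search_with are assumptions: your own log sample and the two pipelines.

def hit_rate_at_k(queries, search_fn, k=5):
    hits = 0
    for q in queries:  # q looks like {"text": ..., "relevant_id": ...}
        top_ids = [doc["id"] for doc in search_fn(q["text"])[:k]]
        hits += q["relevant_id"] in top_ids
    return hits / len(queries)

# baseline = hit_rate_at_k(queries, search_without, k=5)
# with_rr  = hit_rate_at_k(queries, search_with, k=5)
# The number that matters is the difference in the working top, not in top-100.
```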

Cost and latency by layer

You also need to count cost step by step. If you compress everything into one daily number, the picture breaks quickly. Measure the cost per request and split it into candidate retrieval, reranking, and final answer generation.

The same logic applies to latency. Measure separately how long the first search took, how much the reranker added, and how much time the model needed to answer. Otherwise, it is easy to miss an unpleasant scenario: the reranker gave a 4% boost to recall in top-5, but added 350 ms, and the bot started feeling noticeably slower.

Do not look only at the average; look at the tail too. If 80% of queries are fast but hard queries with the reranker sometimes take twice as long, that is what the operator will feel.
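A sketch of that per-stage measurement, assuming retrieve, rerank_stage, and generate are your own pipeline stages and queries is a sample from real logs:

```python
import statistics
import time

# Per-stage latency sketch. queries, retrieve(), rerank_stage(), and
# generate() are placeholders for your own log sample and pipeline.

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000  # milliseconds

def p95(samples):
    return statistics.quantiles(samples, n=20)[18]  # 95th percentile

stage_ms = {"retrieve": [], "rerank": [], "generate": []}
for query in queries:
    candidates, t1 = timed(retrieve, query)
    ordered, t2 = timed(rerank_stage, query, candidates)
    _, t3 = timed(generate, query, ordered)
    stage_ms["retrieve"].append(t1)
    stage_ms["rerank"].append(t2)
    stage_ms["generate"].append(t3)

for name, samples in stage_ms.items():
    print(f"{name}: mean {statistics.mean(samples):.0f} ms, p95 {p95(samples):.0f} ms")
```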

A simple rule of thumb: there is value when the extra step meaningfully increases the chance of solving the task without breaking response time on real queries. If quality improves only on paper while cost and latency increase on every call, it is too early to put the reranker into production.

When a reranker really helps

A reranker does not expand search. It will not find a new document if the first stage did not put it in the candidate list at all. Its value starts somewhere else: when recall among candidates is already decent, but the ordering is weak. The needed passage is in top-50 or top-100, but it sits too low and does not make it into the working top that a person or RAG will see.

That is where the second step often has a noticeable effect. A fast first search gathers a broad set of similar chunks. The reranker reads them more carefully and orders them by the meaning of the query, not just by rough closeness. For the metric, that can mean moving from 24th place to 4th. For the user, the difference is even bigger: the answer appears right away, with no extra clicks or clarifications.

This is most obvious with short, noisy, and ambiguous queries. Phrases like "card limit," "return," "invoice," or "vacation" are too short. They carry little context but many possible meanings. The first stage often finds several very similar passages, but puts the wrong one first. The reranker is better at telling whether the text is about a rule, an exception, or an old version.

The benefit is usually larger if the database itself creates confusion. That happens when there are many duplicates and nearly identical documents, the texts are long, the answer is hidden in one paragraph, nearby passages have small but important differences, or old and new versions are mixed together.

In such a database, the first stage honestly finds "something close," but it does a poor job of deciding what to put first. The reranker fixes exactly that ranking mistake.

The picture is usually simple: without it, the needed fragment is already among the candidates, but rarely makes it into top-5 or top-10. With it, the same candidate set gives a better order. If, however, the needed material rarely even reaches top-100, the second step is almost useless. Then the search itself needs work: chunking, filters, fields, embeddings, and the synonym dictionary.

How to calculate payback

Use live queries from logs over 2-4 weeks, not a synthetic set. For each query, you need a reference: which document should appear at least in top-3 or top-5. If the team did not agree on the reference in advance, the discussion will quickly turn into opinions.

Then run a clean test. Do not change the index, chunks, embeddings, or filters between runs. Otherwise, you will no longer know what was caused by the reranker.

  1. First, measure the baseline without a reranker. Track recall@k, the share of queries with the correct document in the working top, average latency, and p95. Also calculate cost per 1000 queries.
  2. Then enable the reranker and run the same query set without any other changes. Even a small prompt tweak or chunk-size change breaks the comparison.
  3. Try a few candidate-list sizes, usually 20, 50, and 100. At 20, the reranker often has nothing to save. At 100, it may improve recall, but cost and latency rise much faster.
  4. Convert quality gains into money. If recall in the working top-3 rose by 4 percentage points and that removes 30 manual checks per 1000 queries, multiply that by the cost of one check. Add the cost of errors too: a wrong answer, a repeat contact, or a lost request caused by bad search.
  5. Compare that gain with the cost of the step itself. Cost includes the reranker call, extra CPU or GPU load, and the price of added latency. For internal search, an extra 150 ms may be acceptable, while in an online chat those same 150 ms often hurt SLA and conversion.

It is useful to calculate not only absolute recall growth, but the cost of one point of improvement. Sometimes moving from 0.78 to 0.82 pays off quickly, while growth from 0.82 to 0.83 is already too expensive. In two-stage search, that is a normal pattern.

The formula is simple: money saved from fewer errors and manual actions minus extra reranker cost minus the loss from latency. If the result is consistently positive on real queries, the step is worth keeping. If not, it is better to invest in search itself, labeling, or filters.
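As a back-of-the-envelope sketch, the same formula in code. Every number below is an illustrative assumption, not a measured value; plug in your own measurements.

```python
# Illustrative payback per 1000 queries. All inputs are made-up examples.

manual_checks_avoided = 30        # fewer manual lookups per 1000 queries
cost_per_manual_check = 1.50      # money saved per avoided check
errors_avoided = 5                # fewer wrong answers per 1000 queries
cost_per_error = 4.00             # repeat contact, lost request, etc.

reranker_cost_per_1000 = 12.00    # model calls, extra CPU/GPU
latency_penalty_per_1000 = 8.00   # estimated cost of slower responses

gain = manual_checks_avoided * cost_per_manual_check + errors_avoided * cost_per_error
cost = reranker_cost_per_1000 + latency_penalty_per_1000
net = gain - cost                 # keep the step only if this stays positive on real traffic
print(f"net value per 1000 queries: {net:+.2f}")
```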

An example from support


In support, the value of a reranker is easiest to see on tricky queries. A customer writes: "I paid for the item partly with bonus points and partly by card. How do I process a return?" The question is short, but it contains an important exception: the normal return flow does not apply if part of the amount was paid through a different payment scheme.

Base search usually latches onto frequent words: "product return," "payment," "rules." The top results include a general return article, a page with payment terms, and a standard reply template. The operator opens them one by one and still does not see the needed line about partial payment and amount recalculation.

At the same time, the right fragment often did not disappear completely. It sits somewhere in top-50, just too low. The reason is simple: it contains fewer popular words and more rare details. In two-stage search, the reranker reads the full question and moves that fragment into top-3. For the operator, the difference is very practical: no need to scan several almost-right documents, because the answer is visible right away.

That kind of case makes it easy to see when the step is justified. If the operator already waits around 2 seconds for a suggestion, an extra 300-700 ms rarely gets in the way. In return, the system more often puts the exact document at the top, and the employee spends less time on manual checking. Over a shift, that matters more than it seems from a single query.

But in a chat that is supposed to feel almost instant, the same step becomes a problem. If the product promises a response in 300-500 ms, even a good reranker eats too much of the time budget. Recall and latency start fighting each other: the bot answers more accurately, but the user feels the pause and thinks the reply is slow.

So do not look at the ranking improvement alone; look at the scenario. For an operator interface, this step is often justified. For a fast chat, it is better to enable it only for ambiguous or rare questions where base search fails more often.

When a reranker only adds cost and latency

A reranker is not needed by default. If the first search already puts the right document near the top most of the time, the second step changes very little. It only adds another model call, 100-400 ms of latency, and extra money on every request.

This is most obvious on small, clean databases. When the document collection is tidy, duplicates are removed, names are clear, and metadata is filled in, good search already does almost all the work. In that kind of database, a reranker often only polishes the top but does not rescue the results.

A bad sign is a candidate list that is too long. The team takes top-100 or top-200, sends it to the reranker, and expects a meaningful boost in quality. In practice, the needed document was already in top-5, and the long tail only inflates latency and the bill. If the improvement in the working top-3 or top-5 barely moves, the step does not pay off.

There is also a product angle. Not every search needs perfect document ordering. In some scenarios, the user values speed more than a rare quality improvement. If someone is looking for a short help article, an internal policy, or a form number, an answer in 0.7 seconds is often better than a slightly more accurate answer in 1.8 seconds.

A reranker looks especially bad where the cost of an error is low. If a user can quickly rephrase the query, open another document, or start a new search without serious harm, the team can live with some misses. There is no point paying for expensive accuracy if a miss is cheap.

Usually, the second step only adds cost and latency in five cases: when the first search already has high recall on a short list, the database is small and well labeled, the reranker gets too many candidates, users react strongly to extra hundreds of milliseconds, and mistakes in results rarely lead to lost money or time.

The most honest test is simple: compare results without a reranker and with one, not by an average metric, but by what a person sees in the real interface. If the second step barely lifts the useful document higher but consistently slows down the response, it is better to remove it or enable it only for hard queries.

Where teams go wrong in tests


Most often, the team looks at average recall and celebrates a gain of a couple of points. The problem is that the average hides the most painful queries. That is where typos, short phrases, rare terms, and queries with extra words live. If the reranker only helps on easy queries, the number goes up, but the person doing the work barely notices.

It is better to break the test into query groups. Look separately at frequent, rare, long, conversational queries, and the ones that used to fail. That shows where the second step really rescues the results and where it only rearranges documents that were already good.
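A sketch of that kind of breakdown follows. The grouping rule here is a deliberately crude assumption; real groups should come from your own traffic (frequent, rare, long, conversational, previously failing).

```python
from collections import defaultdict

# Hit rate per query group instead of one average.
# The length-based grouping below is an illustrative assumption.

def query_group(text: str) -> str:
    words = len(text.split())
    if words <= 2:
        return "short"
    if words >= 12:
        return "long"
    return "medium"

def hit_rate_by_group(results):
    # results: list of (query_text, hit_in_working_top: bool) from a test run
    groups = defaultdict(list)
    for text, hit in results:
        groups[query_group(text)].append(hit)
    return {name: sum(hits) / len(hits) for name, hits in groups.items()}
```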

Another common mistake is comparing rerankers on different candidate lists. That should not be done. If the first stage found one set of documents in one test and a different set in another, you are no longer comparing just the reranker. First freeze the same candidate list, then check how the order changes. Candidate retrieval and reranking should be evaluated in separate experiments.

Synthetic questions also often distort the picture. They are too clean. Users do not write like a dataset. They write "the account won't open," "where do I cancel a request," or "why is yesterday's payment not visible." On such wording, the effect can be very different. If you have query logs, use them. If the data is sensitive, mask personal information first and then build the test.

Teams also often mix search quality with final answer quality from the model. Those are different things. A reranker can move the right document higher, but the generation will not improve if the answer model reads the context poorly. The opposite can also happen: the answer got better only because you switched to another model, while search stayed the same.

That is why the evaluation should be split into two layers. First measure search itself: recall@k, the position of the needed document, and the share of misses. Then measure the final answer separately on the same queries. For both layers, it helps to look not only at the average but also at p95 latency, and to calculate cost per 1000 queries rather than per one lucky example.

The last trap appears after launch. In testing, the team does not account for cache, batching, or queueing, so production latency turns out different. The same reranker may add 40 ms during a quiet hour and 250 ms at peak when queries wait in line. If you run through a shared LLM gateway or a shared GPU, this becomes especially visible. Only a test under production-like load shows whether the step brings real value or just makes search more expensive and slower.

A quick checklist before launch


Do not enable a reranker based on intuition; do it after a short check on your own data. If you do not have a set of real queries with known correct answers, almost any conclusion will be random. For the first pass, 100-300 queries is usually enough if they come from live traffic, not from a table invented in half an hour.

A good test looks at more than quality. Users do not see your recall separately from response time and cost. If the answer arrives 700 ms later and costs twice as much, a gain of a few percent may not matter.

Before launch, it is enough to check five things:

  1. Build a set of queries with reference documents or correct answers. It is better if it includes short, long, vague, and genuinely hard queries.
  2. Set a latency limit per request. For internal search, that is one number; for a support chat or checkout flow, it is a very different one.
  3. Count the full chain cost, not just the reranker cost. Include search, reranker, generation, and repeat calls if they happen.
  4. Run at least 2-3 candidate-list sizes, for example top-20, top-50, and top-100, as in the sketch after this list.
  5. Split the results by query type. Rerankers often work better on long questions with context and are almost useless on simple navigational queries.
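For the candidate-list sweep in step 4, a minimal harness might look like this. evaluate_run is a placeholder for your own measurement code; it is assumed to return the working-top hit rate, p95 latency, and cost per 1000 queries for one configuration.

```python
# Run the same frozen query set at several candidate-list sizes and
# compare the working top, p95 latency, and cost side by side.
# evaluate_run() is an assumed helper wrapping your own metrics.

for k_candidates in (20, 50, 100):
    m = evaluate_run(queries, k_candidates=k_candidates, k_final=5)
    print(
        f"top-{k_candidates}: hit@5={m['hit_at_5']:.2f}, "
        f"p95={m['p95_ms']:.0f} ms, cost/1000={m['cost_per_1000']:.2f}"
    )
```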

A small example. A team tests search over a support database and sees an average 4% gain in recall@10. That sounds good. But after breaking it down, they find that almost all of the win came from rare, wordy questions, while for common queries like "reset password" the reranker changes nothing. In that case, it is smarter to enable it not for all queries, but through a simple routing rule.

If you compare several rerankers and models through a single OpenAI-compatible gateway like AI Router on airouter.kz, this kind of test is easier to keep clean. You can change the route or model without rewriting the SDK, code, or prompts, and it is easier to see where the gain came from the reranker and where it came from another part of the pipeline.

After that, decisions usually become calmer. Either the step pays off in specific scenarios, or you honestly remove it and stop paying for extra milliseconds.

What to do next

Do not make the reranker the default first step. First, check the foundation: how you split documents into chunks, what metadata you give each fragment, and what top-k you use in the first search. Very often, recall improves at this stage already, while latency barely changes.

If answers still miss after that, add a reranker only where it helps. It usually pays off best on long, noisy, and ambiguous queries where the first search brings back many "almost right" fragments. Simple queries like a contract number, a plan name, or an exact error message are better handled through the fast path without an extra step.

A good practice is to split traffic into two scenarios. Keep ordinary search for simple queries. Turn on two-stage search for complex ones with a simple rule: long query, few exact matches, many similar chunks, and a high risk of error in the answer.
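That rule can start as a very plain heuristic, something like the sketch below. The thresholds and the candidate fields (text, score) are illustrative assumptions to tune on your own traffic.

```python
# Routing rule sketch: send only "hard" queries through the reranker.
# Thresholds and candidate fields are assumptions, not tuned values.

def should_rerank(query: str, candidates: list[dict]) -> bool:
    long_query = len(query.split()) >= 6
    q_terms = set(query.lower().split())
    exact_hits = sum(q_terms <= set(c["text"].lower().split()) for c in candidates[:5])
    top_scores = [c["score"] for c in candidates[:5]]
    scores_bunched = len(top_scores) >= 2 and (top_scores[0] - top_scores[-1]) < 0.05

    # Long query, few exact matches near the top, many near-identical scores:
    # the cheap first stage is probably guessing, so pay for the second step.
    return long_query and (exact_hits <= 1 or scores_bunched)
```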

Before a full launch, you need a test on live traffic. A shadow run works well, where the new setup computes results in parallel but does not affect the user. If you have enough traffic, run an A/B test and look not only at the average metric, but also at the latency tail, cost per 1000 queries, and the share of cases where the reranker truly changes the working top.

It helps to reduce the decision to a simple table. Usually, four columns are enough: quality gain on your tasks, added p95 or p99 latency, cost per request or per 1000 requests, and the risk of error if the reranker is turned off.

That table quickly brings things back into focus. If the reranker gives a 6-8% lift in recall on hard cases and adds 150 ms only for part of the traffic, it is worth keeping. If it improves the metric by a fraction of a percent but makes search noticeably slower and more expensive, it is better to remove the step and invest in chunks, filters, and top-k.