Oct 24, 2025·8 min read

Document Chunking for RAG: How to Test It

Compare chunk sizes, overlap, and reranking on one question set to choose RAG document chunking based on data, not opinion.


Why chunking debates take so much time

In almost every team that has already launched knowledge base search, the argument over document chunking lasts longer than anyone wants. The same corpus can produce different answers just because you changed the chunk size or overlap. With 200–300 tokens, search more often finds the exact passage, but loses context. With 800–1000 tokens, there is more context, but the results often include extra text.

Because of that, the discussion quickly moves away from data and into opinions. One engineer shows a question where small chunks worked better. Another brings an example where a long chunk saved the answer. For those two queries, both are right. For the whole system, it means almost nothing.

Intuition fails here often. It feels like a “logical” piece of text should be easy to find. In practice, search does not work the way a person reads. It sees embeddings, nearby phrases, repetition, filler text, and an imprecise question. That is why a polished explanation in a meeting can easily lose to a boring option in a real test.

Teams waste the most time discussing good examples. One good answer proves almost nothing. If the system answered a vacation question well, that does not mean it will just as confidently find an SLA rule, access limits, or an older version of a policy.

The argument usually ends only when everyone looks at the same question set and the same document corpus. Then you can see which option finds the right chunk more often, where search misses, and whether reranking helps. That conversation takes an hour. Without it, teams can spend weeks arguing about taste instead of results.

If the knowledge base is live and documents change often, the cost of mistakes goes up. A bad chunking setup later gets masked with prompt hacks, extra top-k tuning, and manual exclusions. It is much easier to test the hypotheses once on a shared benchmark than to fix system behavior later based on user complaints.

What counts as a good result

A good result is not “I feel like the answers got better.” It has numbers chosen before the test. Otherwise one team will look at answer completeness, another at cost, and a third will start arguing about chunk size again.

Usually four metrics are enough:

  • hit@k — did the right chunk appear in the top-k results
  • answer accuracy — did the pipeline give the correct answer
  • latency — how long the request takes; median and p95 are better to watch
  • cost — how much one run or 1,000 questions costs

First look at retrieval, then at the answer itself. This matters. If the right chunk did not make it into top-k, the problem is almost always chunking, indexing, or reranking. If the chunk is there but the answer is still bad, then it makes sense to look at the prompt, the model, or the context format.

For retrieval checks, you need a simple rule: for each question, you should have a reference chunk or at least the document where the answer lives. After that, you can count how often that chunk appears in top-3 or top-5. For chunking tests, that is often enough. If a variant does not push the right text upward, a nice answer on a couple of lucky examples proves nothing.
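For the retrieval check, hit@k reduces to a few lines of code. Below is a minimal sketch, assuming each question is already labeled with the ID of its reference chunk and that `retrieve` is a placeholder for your own retrieval call returning chunk IDs in ranked order:

```python
# Minimal hit@k sketch: assumes each question carries the ID of its labeled
# reference chunk and that retrieve(question, k) is your own retrieval call
# returning chunk IDs in ranked order.

def hit_at_k(questions, retrieve, k=5):
    hits = 0
    for q in questions:
        retrieved_ids = retrieve(q["text"], k)    # top-k chunk IDs from one index variant
        if q["gold_chunk_id"] in retrieved_ids:   # did the labeled chunk appear at all?
            hits += 1
    return hits / len(questions)

# Example: compare hit@3 and hit@5 on the same frozen question set
# print(hit_at_k(questions, retrieve, k=3), hit_at_k(questions, retrieve, k=5))
```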

It is better to define the threshold in advance. For example, a variant passes if it:

  • delivers hit@5 of at least 85%
  • stays within your SLA for p95
  • fits the cost limit
  • does not reduce answer accuracy compared with the baseline

The numbers depend on the task. In an internal banking knowledge base and in an online store catalog, the right threshold will be different. But the approach is the same: agree on the pass/fail line first, then run the experiment.
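That agreement can also be written down as a small gate so nobody relitigates it after the run. A sketch with the example thresholds from this section; the numbers and field names are illustrative, not recommendations:

```python
# Hypothetical pass/fail gate using the example thresholds from this section.
# `result` is one row of the test table; baseline_accuracy comes from the
# configuration currently in production.

def passes(result, baseline_accuracy, sla_p95_ms=2000, cost_limit_per_query=0.01):
    return (
        result["hit_at_5"] >= 0.85                           # retrieval floor
        and result["latency_p95_ms"] <= sla_p95_ms           # your own SLA, not a universal number
        and result["cost_per_query"] <= cost_limit_per_query # budget per request
        and result["answer_accuracy"] >= baseline_accuracy   # no regression vs the baseline
    )
```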

If you are testing reranking in RAG separately, do not mix its effect with the effect of chunking. Otherwise you will not know what actually helped: smaller chunk size or a reranker that simply repaired weak retrieval at the cost of extra milliseconds and money.

A good result usually looks boring. It shows how many questions the system covers, how much it costs, and whether it meets response-time limits. For a working RAG system, that is enough.

Build one question set

Arguments about chunking are often not really about chunks, but about different examples in the minds of the people involved. One person remembers a short chat query, another remembers a long question with a lot of conditions. So first freeze one question set and do not touch it while you compare variants.

Usually 30–100 questions is enough. If there are fewer, the result depends too much on chance. If there are more, that is great, but only if you can truly label the correct source for each question.

It is better to use real work queries rather than made-up ones: knowledge base search questions, support tickets, user chat, employee questions about internal instructions, common onboarding questions, sales questions, or compliance questions.

Mix different query types. You need short phrases like “return limit,” long questions with context, ambiguous queries where one word can mean different things, and questions where numbers, dates, deadlines, thresholds, and exceptions matter. These are exactly the cases that most often break chunking.

A good test set rarely looks neat. That is a plus. If it only contains clean questions in one style, the test will be too soft. In real work, people write briefly, mix up terms, leave out details, and combine several conditions in one message.

For each question, mark in advance where the correct answer lives: the document, section, paragraph, or even a table. Do not leave this for later. Otherwise, after the run, the team will start arguing not about search, but about which answer should count as correct.

A simple example: a company may have a rule that returns are possible within 14 days, but promotional items have a different deadline. Then the question “Can I return an item after 10 days if it was bought on discount?” should point not to the general return policy, but to the paragraph with the exception.
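One simple way to store that labeling is a list of records kept next to the question text. The field names below are illustrative, not a required schema:

```python
# Illustrative labeled questions: each record points to the exact place where
# the answer lives, so hit@k can be counted later without arguing about it.
labeled_questions = [
    {
        "id": "q-001",
        "text": "Can I return an item after 10 days if it was bought on discount?",
        "gold_doc": "return-policy.md",
        "gold_section": "Exceptions for promotional items",
        "gold_chunk_id": None,   # filled in once the corpus is chunked and indexed
    },
    {
        "id": "q-002",
        "text": "return limit",
        "gold_doc": "return-policy.md",
        "gold_section": "General return rules",
        "gold_chunk_id": None,
    },
]
```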

Once the set is ready, freeze it until the comparison is finished. Do not remove awkward questions and do not add new ones after the first results. Otherwise you are no longer testing chunk size, overlap, or reranking—you are testing your reaction to interim numbers.

Prepare the chunking variants

Chunking is better tested on a few clear variants, not twenty at once. If there are too many configurations, you quickly lose the connection between cause and result. For the first round, 2–4 schemes is usually enough.

A good starting point is to compare different chunk sizes and check overlap separately. Size tells you how much meaning fits into one piece. Overlap shows whether you are breaking a thought at the boundary between neighboring chunks.

In practice, it is convenient to take the same document set and build several index versions:

  • 300–400 tokens with no overlap
  • 300–400 tokens with 10–15% overlap
  • 700–900 tokens with no overlap
  • 700–900 tokens with 10–20% overlap

That is already enough to see the difference. Zero overlap shows how badly search suffers at text boundaries. Small overlap often gives a noticeable gain without making the index much larger. Medium overlap is worth testing too, but it does not always pay off: the database grows, similar chunks become more common, and the results get noisier.
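Building these variants does not require anything more complex than a fixed token window with an optional overlap. A minimal sketch that splits on whitespace for simplicity; a real index would count tokens with the tokenizer of your embedding model:

```python
# Simplified chunker: fixed window with optional overlap, measured in
# whitespace tokens. A production version should count tokens with the
# embedding model's own tokenizer instead.

def chunk_text(text, chunk_size=350, overlap=0):
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```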

It is very important not to change anything else. Keep the same documents, the same text cleaning, the same rules for removing menus, footers, duplicates, and system blocks. If you change chunking and cleaning at the same time, you will not know what affected the result.

The same goes for embeddings. At this stage, it is better not to touch the embedding model. Otherwise you will not be comparing chunking schemes, but a mix of two factors.

To avoid confusion, give each variant a short name and write it into the test table. Labels like S350-O0, S350-O40, S800-O0, S800-O80 work well, where S is chunk size and O is overlap. Keep three more fields next to them: which documents were indexed, what cleaning was applied, and which embedding model was used.
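As a sketch, that test table can be a small registry kept next to the code; the values here are placeholders for your own corpus export, cleaning rules, and embedding model:

```python
# Illustrative registry of chunking variants plus the shared conditions that
# must stay identical across all of them.
chunking_variants = {
    "S350-O0":  {"chunk_size": 350, "overlap": 0},
    "S350-O40": {"chunk_size": 350, "overlap": 40},   # roughly 10-15% overlap
    "S800-O0":  {"chunk_size": 800, "overlap": 0},
    "S800-O80": {"chunk_size": 800, "overlap": 80},   # roughly 10% overlap
}

shared_conditions = {
    "documents": "kb-export-2025-10-01",                         # same corpus for every variant
    "cleaning": "strip menus, footers, duplicates, system blocks",
    "embedding_model": "your-current-embedding-model",           # do not change it mid-test
}
```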

Test reranking separately


A common mistake is simple: the team changes chunk size, overlap, and reranker in one run. The metric goes up or down, but nobody understands what actually worked. That is how debates over tuning last longer than the test itself.

First run all chunking variants without reranking. Keep the embeddings, top-k, question set, and evaluation method the same. That way you see the pure effect of chunks: which option more often pulls the right paragraph into the candidate list.

After that, repeat the same test with the same reranker for all variants. Do not change the reranker model between runs and do not tweak its settings halfway through. If you are comparing chunking, the reranker should be a constant background factor, not a second variable.
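A small sketch of that two-phase comparison, where `retrieve` and `rerank` are placeholders for your own retrieval call and reranker, and each question carries a labeled reference chunk:

```python
# Rank of the labeled chunk before and after reranking, per question.
# `retrieve` and `rerank` are placeholders; both must stay identical across
# all chunking variants so the reranker is a constant background factor.

def rank_of(chunk_id, ranked_ids):
    return ranked_ids.index(chunk_id) + 1 if chunk_id in ranked_ids else None

def rerank_effect(questions, retrieve, rerank, k=20):
    rows = []
    for q in questions:
        candidates = retrieve(q["text"], k)        # base retrieval: top-k candidate chunk IDs
        reranked = rerank(q["text"], candidates)   # same candidates, new order
        rows.append({
            "question_id": q["id"],
            "rank_before": rank_of(q["gold_chunk_id"], candidates),
            "rank_after": rank_of(q["gold_chunk_id"], reranked),
        })
    return rows
```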

Look not only at the final answer, but also at the path to it:

  • did the right chunk make it into top-k before reranking
  • did the reranker move it higher
  • did the rank of the right paragraph improve
  • did false but similar-looking passages disappear

This quickly brings everyone back to reality. A reranker will not save a document that the base search never found. If the right paragraph did not even make it into top-20, the problem is almost certainly chunks, overlap, the query, or embeddings—not ranking.

It helps to inspect specific misses. For example, a rule may be in one paragraph and the exception in the next. With chunks that are too small, search often finds the rule but loses the exception. A reranker cannot help if the second piece is not among the candidates. With moderate overlap, both pieces can appear in the results, and then the reranker can lift the right one.

If the reranker only helps at one chunk size, that is also a signal. It is not that the reranker is somehow “smarter” on its own, but that one chunking option gives it a useful candidate set while the other gives it noise.

How to run the test step by step

The idea is simple: change only chunking and keep everything else the same. If embeddings, the prompt, the answer model, and the selection threshold all change at once, it becomes hard to trust the conclusions.

  1. Build a separate index for each chunking variant. For example, one for 400-token chunks with no overlap, another for 800-token chunks with a 100-token overlap, and a third for 1200-token chunks with a 200-token overlap. Do not mix them in one store.

  2. Fix the run conditions. The question set, top-k, embedding model, reranker or no reranker should be the same for all variants.

  3. Run the full question set without manual edits. Do not rephrase on the fly, do not remove “bad” questions, and do not replace a query with a clarifying one. If a question is ambiguous, that is part of reality, not noise.

  4. For each query, save the same set of fields: which documents appeared in top-k, what rank the correct chunk got, whether you found it, how long the answer took, and how much the request cost if cost is tracked separately.

  5. Repeat the same run for the other variants and put the results into one table. Each row is a configuration: chunk size, overlap, reranking on or off. Keep hit@k, the share of queries where the correct chunk was found, time, and cost in the columns.

After that table, the argument usually narrows fast. Instead of “I think 1,000 tokens is better,” you get a real picture. For example, 800/100 may deliver almost the same retrieval as 1200/200, but answer faster and cost less. That is enough to choose one or two leaders and send them into the next test on the full RAG answer.
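Steps 4 and 5 reduce to one run loop per configuration and one aggregated row for the table. A sketch under the same assumptions as before (labeled questions, your own `retrieve` call); cost tracking is left out because it depends on how you meter requests:

```python
import statistics
import time

# One run for one configuration: save the same fields per query (step 4),
# then aggregate them into a single row of the comparison table (step 5).

def run_variant(name, questions, retrieve, k=5):
    records = []
    for q in questions:
        started = time.perf_counter()
        retrieved_ids = retrieve(q["text"], k)
        elapsed_ms = (time.perf_counter() - started) * 1000
        records.append({
            "question_id": q["id"],
            "top_k_ids": retrieved_ids,
            "gold_rank": retrieved_ids.index(q["gold_chunk_id"]) + 1
                         if q["gold_chunk_id"] in retrieved_ids else None,
            "latency_ms": elapsed_ms,
        })
    latencies = sorted(r["latency_ms"] for r in records)
    return {
        "variant": name,   # e.g. "S800-O80, reranker off"
        "hit_at_k": sum(r["gold_rank"] is not None for r in records) / len(records),
        "latency_p50_ms": statistics.median(latencies),
        "latency_p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "records": records,   # keep per-query rows for failure analysis
    }
```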

Example on an internal knowledge base


It is better to test on a real internal knowledge base rather than on training text. Take three document types: employee instructions, an FAQ with common questions, and policies with exceptions. On this mix, chunking usually fails more often because the rule is in one place, the deadline in another, and the exception is hidden in a note.

A question like this works well: “Can I get a subscription payment refunded after 14 days if the service was already activated, but there was a confirmed outage on our side?” It contains a deadline, a condition, and an exception. If RAG answers confidently but does not find the exception, the test is already useful: the problem is not the model, but finding the right text chunk.

On the same document set, build at least two chunking variants. For example, short chunks of 250–350 tokens with small overlap and long chunks of 700–1000 tokens with the same embeddings. Top-k should stay the same, as should the question set.

Then look beyond the final answer. Check which chunks made it into retrieval. Short chunks often find the exception paragraph better, but sometimes lose neighboring context and the model confuses the deadline. Long chunks more often carry the full rule, but they bring extra text along, and the model latches onto the general “Refunds” section instead of the exact exception.

After that, turn on reranking and repeat the same case. A good reranker lifts the specific paragraph, not the whole “Refund Policy” section, to the top, especially the line that says the 14-day limit does not apply or is treated differently when the outage was confirmed. If the top result after reranking is still the general section and the exact paragraph is lower down, the problem remains.

This kind of example quickly ends the debate in the team. Everyone can immediately see which option really finds the rule and which only looks convincing in a demo.

Mistakes that break the conclusions

Conclusions often break not because of the chunks themselves, but because the experiment is weak. The team changes several things at once, gets a different result, and then argues about what actually worked.

The most common mistake is changing embeddings, the prompt, top-k, or even the answer model together with chunk size. After that, you can no longer say whether the new chunking helped or whether the gain came from a different retriever. One test — one variable.

The second mistake is using too convenient a question set. If the queries are short, direct, and almost copy the wording in the documents, almost any chunking scheme will look “fine.” In real systems, people ask in long, messy ways, with team-specific terms and missing context. Those are the queries that should be part of the check.

The third mistake is looking only at the average. The average hides failures. Suppose the larger-chunk scheme gave a slightly better overall score, but it regularly misses tables, lists of conditions, or date-related passages. For a bank knowledge base, a telecom team, or internal support, that is not a minor issue—it is a risk.

The fourth mistake is judging only the model’s final answer. That is a weak check. The model may guess from general knowledge, fill gaps with assumptions, or assemble a plausible answer from irrelevant pieces. So you need to look separately at which chunks were retrieved, what rank they got, and whether they contained enough facts to answer.

A quick filter is simple:

  • keep everything fixed except one parameter
  • include both easy and awkward questions in the set
  • look at failures, not just the average
  • check retrieved chunks separately from answer quality
  • do not draw conclusions from 5–10 queries

That last point is often underestimated. On ten queries, one lucky answer changes the overall picture too much. You need a set where recurring problems are visible: loss of context in long instructions, separation of definitions and exceptions, noise from overlap, and the value or uselessness of reranking.

Even 50–100 questions usually give a much fairer result than a small demo set. That test takes more time, but then you do not have to change chunk size every week based on someone’s impression.

A quick check before choosing


Choosing chunking without a short check is almost always a bad idea. On paper, the scheme may look neat, but on hard questions it may start losing the answer or pulling noise into the results.

You can tell a good option not by the average score on easy queries, but by questions where the answer is buried deep in the text, split across two neighboring paragraphs, or phrased with different words than the question. If the scheme holds up there, it has a chance to survive in production.

Before deciding, it is worth checking four points:

  • the winner does not fall apart on hard questions, even when the passage is long or the question shares few terms with the document
  • overlap brings value instead of just inflating the index and search time
  • the reranker earns its latency because the quality improvement is noticeable
  • the test can be repeated in a month on new documents and produce a similar result

Overlap is where people most often make the same mistake: they set it too high “just in case,” then the index gets heavier and the answer quality barely changes. If 10–15% overlap gives the same result as 30%, the extra volume is not needed. It only makes sense where chunk boundaries often cut one thought in half.

The logic is the same for reranking. If the relevant chunk is already steadily near the top without it, the reranker may only add latency and cost. But if it moves the right chunk from eighth place to second or third, that is a real gain. Especially on questions with similar phrasing, where regular search gets confused.

And one more filter: the team should understand why the option won. Not just “because the metric is higher,” but something like “because the chunk keeps the table and explanation together” or “because the reranker separates similar policies better.” Then a month later you can repeat the test on new data without restarting the whole argument.

What to do after the test

After the test, do not leave the result in chat or in the team’s memory. Record the scheme that performed best in your working documentation: chunk size, overlap, splitting rule, top-k, reranking, question set, and the metric used for comparison. In a month, nobody will remember why you chose that mode unless it is written down.
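A sketch of what that record can look like as a small JSON file stored next to the test results; every value here is illustrative:

```python
import json

# Illustrative record of the chosen configuration. The point is that every
# field that affected the comparison is written down, not remembered.
chosen_config = {
    "date": "2025-10-24",
    "chunk_size_tokens": 800,
    "overlap_tokens": 80,
    "splitting_rule": "fixed window, split on whitespace",
    "top_k": 5,
    "reranking": True,
    "question_set": "kb-benchmark-v1.jsonl",
    "decision_metric": "hit@5, p95 latency, cost per query",
    "notes": "wins on policy exceptions; weaker on long tables",
}

with open("rag_chunking_decision.json", "w", encoding="utf-8") as f:
    json.dump(chosen_config, f, ensure_ascii=False, indent=2)
```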

Save the tricky examples too. If one option answered policy questions better while another confused similar documents, note that next to the numbers. That makes it easier for the team to understand where the chunking works well and where its weak spots are.

Add the question set to your regular run. Then every change can be checked on the same benchmark instead of on feelings. This is especially useful when the team changes embeddings, rebuilds the index, or tries a different reranker.

A few rules are enough after that:

  • run the test after major knowledge base changes
  • repeat it after changing the embedding model or reranker
  • look not only at the average score, but also at failures on important questions
  • keep results from past runs so you can spot regressions

Recheck chunking after major content updates. Knowledge bases rarely stay still. Today you have short FAQs; two months later you have long policies, tables, letter templates, and legal documents. On that kind of shift, a scheme that used to be the best can easily start losing accuracy.

If you are comparing several models and rerankers, it helps to keep the experiment in the same wrapper. For teams in Kazakhstan, this often makes work easier: for example, AI Router on airouter.kz gives you one OpenAI-compatible endpoint, so you can switch providers and models without rewriting your SDK, code, or prompts. When the data is sensitive, another part of the job matters too: storage inside the country, PII masking, and audit logs should be checked just as carefully as hit@k or latency.
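As a sketch of what an OpenAI-compatible endpoint means in practice: with the official OpenAI Python SDK, only the base URL and the model name change, while the calling code stays the same. The URL and model names below are placeholders, not the provider's real values:

```python
from openai import OpenAI

# Hypothetical example of switching providers behind an OpenAI-compatible
# endpoint: only base_url and the model name change, the calling code does not.
# The URL and model names are placeholders, not real values.
client = OpenAI(
    base_url="https://example-router.example/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="provider-a/some-model",   # swap the model without touching the rest of the code
    messages=[{"role": "user", "content": "Can I return an item after 10 days?"}],
)
print(response.choices[0].message.content)
```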

If the test produces close results, do not argue over tenths of a point. Pick the option that is easier to maintain, and set the next review date right after the next major knowledge base update.

Frequently asked questions

How many questions do you need to test chunking?

For an initial comparison, 30–100 questions is usually enough. Fewer is risky because one lucky or unlucky query can swing the result too much. If you can label more questions honestly, use more.

What should you check first: retrieval or answer quality?

Look at search first. If the right chunk does not make it into top-k, the problem is usually chunking, indexing, or reranking. It only makes sense to review the prompt and the answer model after that.

Which chunk sizes should you compare first?

A good first round is 2–4 variants. Often it is enough to compare short chunks of about 300–400 tokens and long chunks of about 700–900 tokens, then test overlap separately. That makes it easier to see what caused the difference.

Do you need overlap between chunks?

Overlap helps when meaning breaks at the boundary between chunks. Start with 10–15% and check whether hit@k improves. If there is almost no difference, do not bloat the index just in case.

Why freeze one question set for the whole experiment?

Because without one shared set, the team is really arguing about different examples. The same test makes the comparison fair: all variants answer the same questions from the same corpus. Otherwise the numbers are not comparable.

When should you add reranking in the test?

First run the variants without a reranker. Then repeat the same test with the same reranker for all schemes. That way you can see whether the chunking itself helps or whether the reranker is simply covering up weak retrieval.

Why not change embeddings and chunking at the same time?

If you change both at once, the conclusion gets blurry. A metric gain can come from a new retriever, a different embedding model, or even a different top-k, not from chunk size. One test should answer one question.

How do you know a chunking variant passed the test?

Set the threshold before the run. For example, a scheme passes if it keeps the required hit@5, stays within your p95 SLA, does not exceed the cost limit, and does not reduce answer accuracy. Then the decision is based on rules, not impressions.

What if two variants show almost the same result?

Choose the scheme that holds up on hard questions and does not make the system more complex without clear benefit. If the metric difference is small, it is better to pick the easier option to maintain and record the reason next to the test results.

How often should you revisit the chosen chunking?

Repeat the check after major knowledge base updates, after changing the embedding model, or after switching the reranker. A live corpus changes, and an old scheme can start missing exceptions, tables, or long rules even if it used to work fine.