Source-based fact checking: how to build a test suite
Source-based fact checking helps you build tests where the answer is compared against a document, table, or database. We’ll cover the test suite structure, common mistakes, and a checklist.

Why checking by eye does not work
A careful-sounding answer is easy to mistake for a correct one. The model sounds confident, keeps a good tone, repeats words from the document, and still gets one date, amount, rate, or deadline wrong. That is enough to break the meaning.
This is especially noticeable in tasks where the answer must rely on the source, not on the model’s "common sense." If the policy says the deadline is 14 days and the answer says 10, good style changes nothing. For a bank, a telecom team, or a support service, that is already a risk.
A model often reads the source differently from a person. It recognizes a familiar pattern and fills in the wording from memory. That is why manual review quickly starts to drift. One reviewer looks at the overall meaning and marks it as "correct." Another notices the wrong limit, a missing condition, or a changed order of steps and marks it as "incorrect."
The worst mistakes are the quiet ones. The answer looks calm and convincing, almost matches the source, but replaces one detail. Teams miss these errors most often.
Manual review usually breaks in four places:
- the style of the answer affects the score more than the fact
- reviewers interpret "almost right" differently
- small mismatches are not noticed right away
- the same error is accepted today and rejected tomorrow
If the team does not have a strict reference, quality seems higher than it really is. In RAG evaluation, this is especially visible: retrieval found the right excerpt, the answer sounds reasonable, so everything seems fine. In reality, the model may have relied on a guess, not the document.
Facts need to be checked by the rule of matching the source. Otherwise, you are measuring the impression the answer makes, not its accuracy.
What counts as a correct answer
A correct answer is almost never just "sounds plausible." For source-based checking, only what can be tied to a specific place in the document or a field in the database is suitable. If there is no such link, you are evaluating impression again.
First, break the question into short statements. One question often contains two or three independent facts, and the model may guess one while getting another wrong. Check each fact separately.
A simple example with AI Router: the question "How many open-weight models does the platform host on its own GPU infrastructure, and in what currency is B2B invoicing done?" contains two separate checks. The first is the number of models. The second is the invoicing currency. An answer like "a lot of models, invoices are issued in the local currency" sounds confident, but it fails the test. The source says "20+ open-weight models" and "monthly in tenge."
For each fact, define the source of truth in advance. It can be a database field, a table row, a paragraph in a policy, or a specific FAQ excerpt. The more exact the link, the fewer disputes during review.
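A minimal sketch of what splitting one question into independent fact checks can look like; the question and the expected values come from the AI Router example above, while the file name, field names, and helper function are illustrative assumptions.

```python
# One question, two independent fact checks. Each fact carries its own
# expected value and its own pointer into the source, so the answer can
# pass on one fact and fail on the other.
test_case = {
    "question": (
        "How many open-weight models does the platform host on its own GPU "
        "infrastructure, and in what currency is B2B invoicing done?"
    ),
    "facts": [
        {
            "name": "model_count",
            "expected": "20+",
            "source": {"doc": "platform_overview.pdf", "section": "Models"},  # hypothetical file
        },
        {
            "name": "invoicing_currency",
            "expected": "tenge",
            "source": {"doc": "platform_overview.pdf", "section": "Billing"},
        },
    ],
}

def check_facts(answer: str, case: dict) -> dict:
    """Return a per-fact verdict instead of one pass/fail for the whole question."""
    return {
        fact["name"]: fact["expected"].lower() in answer.lower()
        for fact in case["facts"]
    }

print(check_facts("A lot of models, invoices are issued in the local currency.", test_case))
# {'model_count': False, 'invoicing_currency': False} -> the confident answer fails both checks
```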
Also define the answer format separately. Without that, the same fact will be compared in different ways. Usually, a few simple types are enough:
- number: 500+, 68, 20+
- date: 2025-04-01 or 01.04.2025
- list: provider names, plans, documents
- exact phrase: wording where words cannot be changed
An exact string is not always necessary. If the question is about a product code, contract number, rate, legal wording, or a database field, it is better to require a full match. If the question is about an ordinary fact from text, normalization is often enough: remove extra spaces, convert the date to one format, ignore case, and treat "20 +" and "20+" as the same value.
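One possible normalization helper, sketched in Python; which date formats to unify and which symbols to glue together depends on your data, so treat the rules below as examples rather than a fixed set.

```python
import re
from datetime import datetime

def normalize(value: str) -> str:
    """Bring the answer and the reference to one comparable form:
    trim and collapse spaces, lower the case, glue '20 +' into '20+',
    and rewrite DD.MM.YYYY dates as ISO."""
    value = re.sub(r"\s+", " ", value.strip().lower())
    value = re.sub(r"(\d)\s*\+", r"\1+", value)
    match = re.fullmatch(r"(\d{2})\.(\d{2})\.(\d{4})", value)
    if match:
        value = datetime.strptime(value, "%d.%m.%Y").strftime("%Y-%m-%d")
    return value

assert normalize("20 +") == normalize("20+")
assert normalize("01.04.2025") == normalize("2025-04-01")
```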
Lists are a separate source of disagreement. Decide in advance whether order matters, whether a partial answer is acceptable, and what to do with extra items. If the reference has three providers and the model names two correctly and adds one extra, that is not a full match.
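A sketch of a list-comparison rule under those decisions; the verdict names and the provider list are illustrative.

```python
def compare_list(expected: list[str], answered: list[str], ordered: bool = False) -> str:
    """Classify a list answer: exact, partial, partial with extra items, or miss."""
    exp = [e.lower() for e in expected]
    ans = [a.lower() for a in answered]
    if (ans == exp) if ordered else (set(ans) == set(exp)):
        return "exact"
    hits = [a for a in ans if a in exp]
    extras = [a for a in ans if a not in exp]
    if hits and extras:
        return "partial_with_extras"
    if hits:
        return "partial"
    return "miss"

# The reference has three providers; the model names two and adds one extra.
print(compare_list(["openai", "anthropic", "mistral"],
                   ["openai", "anthropic", "cohere"]))   # partial_with_extras, not a full match
```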
If you cannot write the reference as "fact -> source -> format -> comparison rule," the test is still too rough.
Where to get the reference answer
The reference answer should not come from the team’s memory or from a "roughly current" version of the source. It must be tied to a specific state of the data on the test date. If the question refers to a document, save the exact file version. If it refers to a table or directory, take a snapshot of the database on the same date.
Otherwise, the test quickly turns into an argument. The model answered according to an older version of the policy, while the reviewer is looking at the new PDF. Or the answer about a limit was correct for yesterday’s table, but the number changed in the morning and the test suddenly became falsely bad.
Some data does not live long enough. It is better to remove it from the main suite or move it into separate tests with a short lifespan. Usually this includes:
- stock levels
- current balance
- request status
- today’s price
- last login date
If you keep such fields in the main suite, you will be fixing the test itself instead of the model. For regular evaluation, it is better to use things that change rarely: contract terms, tariff rules on the publication date, statuses from a fixed archive, or versioned reference data.
Alongside each question, store not only the correct answer but also a pointer to the source. For a database record, this is the row id, table name, and snapshot date. For a document, it is the file name, version, page, paragraph, table, or line range if you extracted the text automatically.
This kind of link saves a lot of time. When a test fails, the team immediately sees where to look for the correct fact. No need to read the whole document again or guess which table the number came from.
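One way to store such a pointer, sketched as a small dataclass; the field names and the example sources are assumptions, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourcePointer:
    """Where the reference fact lives, pinned to a specific version or snapshot date."""
    kind: str                   # "document" or "database"
    name: str                   # file name or table name
    version: str                # document version or snapshot date
    locator: str                # page/paragraph/line range or row id
    note: Optional[str] = None  # free-form hint for the reviewer

deadline_source = SourcePointer(
    kind="document", name="refund_policy.pdf", version="2025-03-12",
    locator="section 4.2, paragraph 3",
)
limit_source = SourcePointer(
    kind="database", name="tariffs", version="snapshot 2025-06-01",
    locator="row id 18452",
)
```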
Mixing rules from the document and live data from the database is only safe if both parts have an explicit timestamp. Otherwise, the answer may look reasonable on paper, but you cannot verify it. For example, the commission calculation rule was taken from a March policy, while the rate came from a June table. That reference is already questionable.
In production systems, this is especially visible in RAG. If a company checks answers against internal instructions and customer data through a single LLM gateway like AI Router, disagreement usually comes not from the model but from different source versions. Fix the source version first, and only then evaluate the answer.
How to build an automated test suite
An automated test suite does not start with invented questions. It is built from what people already ask in real work: chat logs, support tickets, internal policies, the knowledge base, and common employee requests. That way, the suite checks not abstract "smartness," but the real places where the system fails.
Not every question is suitable for source-based checking. First, select the ones where the document or the database has one clear answer. If the question sounds like "what is better" or "why did the company choose this process," automatic comparison will fall apart quickly. But "what is the request processing time," "which plan applies to segment B," or "what is the status of order 18452" are good fits.
It is convenient to store each test in one template:
- question
- reference answer
- source where the answer was found
- comparison rule
It is better to write the comparison rule right away. For one question, exact matching of a number or date is enough. For another, you need to check only mandatory fields: amount, currency, limit, status. If a bank document says "up to 5 business days," the answer "5 days" cannot be counted as fully correct.
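A sketch of one test stored in that template, with the comparison rule as a named strategy the harness can apply mechanically; the rule names and the sample document are assumptions.

```python
# One test = question + short reference + source pointer + comparison rule.
# The rule is a name the harness understands, so "5 days" is not silently
# accepted where the document says "up to 5 business days".
test = {
    "question": "What is the maximum processing time for a transfer?",
    "reference": "up to 5 business days",
    "source": {"doc": "transfer_terms.pdf", "version": "2025-02", "section": "3.1"},
    "rule": "exact_phrase",   # other rules might be "number_with_unit", "any_of", "field_match"
}

def apply_rule(rule: str, reference: str, answer: str) -> bool:
    answer, reference = answer.strip().lower(), reference.strip().lower()
    if rule == "exact_phrase":
        return reference in answer
    if rule == "any_of":
        return any(option.strip() in answer for option in reference.split("|"))
    raise ValueError(f"unknown rule: {rule}")

print(apply_rule(test["rule"], test["reference"], "Transfers take 5 days."))                  # False
print(apply_rule(test["rule"], test["reference"], "Transfers take up to 5 business days."))   # True
```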
Do not include only simple cases. The most useful tests are usually the unpleasant ones: similar product names, small-print footnotes, exceptions for specific branches, and conditions like "except," "if," and "when available." These are the cases that show whether the system actually reads the source or just guesses from the general meaning.
A small example: a policy contains two similar tariffs, and the difference is hidden in a note below the table. If the model answers from memory, it will almost certainly mix them up. One test like that is often more valuable than ten ordinary ones.
When the test base grows, split it by speed and risk:
- a fast suite for every prompt or retriever change
- a regression suite for daily runs
- a pre-release suite for the hardest and most expensive scenarios
The fast suite can include 20-30 questions and take a couple of minutes. The regression suite is broader and catches old mistakes that tend to come back. The pre-release suite is needed before deployment, when the cost of failure is higher: in banking, telecom, medicine, or any service where the answer must match the document, not just sound plausible.
How to compare an answer with a document or database
The comparison should check not "does it sound plausible," but "does it match the source." To do that, first bring the answer and the reference into the same format, and then compare them using rules that are easy to reproduce in code.
Start with normalization. Remove extra spaces, convert text to one case, and bring dates into a single format. Convert currencies to one notation. Otherwise, the system will start arguing over details: "15.01.2025" and "2025-01-15" mean the same thing, but a string comparison will treat them as different.
With numbers, it is better to be stricter. Compare not only the number itself, but also the unit of measure. "12" and "12 months" are not the same. "500" without clarification is also risky: it could mean 500 requests, 500 tenge, or 500 models. If the bot answers about a tariff, limit, or deadline, the unit should be right next to the number.
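A minimal helper that keeps the unit attached to the number, so "12" and "12 months" do not compare as equal; the regex is a simplification (it does not handle thousands separators) and assumes the unit follows the number.

```python
import re

def parse_amount(text: str):
    """Pull out the first number together with the word or symbol that follows it."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]+)?", text)
    if not match:
        return None
    value = float(match.group(1).replace(",", "."))
    unit = (match.group(2) or "").lower()
    return value, unit

assert parse_amount("12 months") == (12.0, "months")
assert parse_amount("12") == (12.0, "")
assert parse_amount("12 months") != parse_amount("12")  # unit mismatch should fail the check
```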
A source reference is mandatory
An answer without a link to a specific place in the source should not be considered verified. For a document, this can be a line number, paragraph, or section. For a database, it can be a record ID, primary key, or a combination of fields used to find the record.
If an internal assistant says that invoicing in AI Router is monthly in tenge, the test should check not only the phrase itself, but also the link to the right excerpt from the document or the record ID in the tariff directory. That way, you can see where the fact came from.
Partial matches and extra facts
It is useful to store at least four statuses:
- exact match
- partial match
- miss
- extra facts outside the source
It is important to separate partial matches from misses. An answer like "data storage within the country" may be correct in meaning but incomplete if the document also mentions audit logs, PII masking, and rate limits. That is not zero points, but it is also not a full pass.
Also catch added facts. The model often writes confidently and adds details that are not in the document or database record. If the source says only "monthly in tenge," and the answer adds "with payment due by the 5th," that is an error even if the first part is right.
A good comparison usually checks three things at once: whether the fact is correct, whether there is a source reference, and whether the answer invented anything extra.
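A sketch of such a grader over sets of short facts; the verdict names mirror the four statuses above, and the example facts are illustrative.

```python
def grade(reference_facts: set[str], answer_facts: set[str]) -> str:
    """Four verdicts instead of pass/fail: exact, partial, miss, or extra facts."""
    invented = answer_facts - reference_facts
    correct = reference_facts & answer_facts
    if invented:
        return "extra_facts"
    if correct == reference_facts:
        return "exact"
    if correct:
        return "partial"
    return "miss"

reference = {"data storage within the country", "audit logs", "pii masking"}
print(grade(reference, {"data storage within the country"}))      # partial
print(grade(reference, reference | {"payment due by the 5th"}))   # extra_facts
print(grade(reference, set()))                                     # miss
```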
One scenario as an example
Imagine a bank support chat. A customer asks: "How much free transfer limit do I have left this month, and what rule is it calculated by?" The answer should be short: one number and one rule, without retelling the whole tariff.
The data lives in two places. The PDF tariff says that under the "Smart" plan, fee-free transfers are available up to 300,000 tenge per month, after which the fee is 1%. The database stores the customer’s current state: they have already used 120,000 tenge this month.
What one test looks like
A test card usually needs four fields:
- question: "How much free transfer limit do I have left, and what rule applies?"
- document excerpt: "Fee-free transfers are available up to 300,000 tenge per month. After the limit is exceeded, the fee is 1%"
- database snapshot: monthly_limit = 300000, used_amount = 120000
- expected answer: "180,000 tenge. You can transfer fee-free until the total monthly amount exceeds 300,000 tenge"
The number 180,000 is not taken from the document. It is calculated from the customer’s data. The rule, on the other hand, must not be reconstructed from the model’s own logic: it has to come from the PDF and be checked against the wording or a normalized template.
This test fails in two typical cases. First: the model confuses the current remaining amount with the full limit and answers "300,000 tenge." Second: it adds an extra condition that is not in the source, such as "only for internal bank transfers" or "only after verification."
A good comparison does not argue with the overall impression of the answer. It asks two simple questions: does the number match the calculation from the database, and does the text contain only the rule that exists in the document? If even one point does not line up, the case is marked as failed, even if the answer sounds convincing.
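The same two checks, written out as a small script; the snapshot fields and the string checks are simplified assumptions, not a production comparison.

```python
# The remaining limit comes from the database snapshot; the rule comes from the PDF excerpt.
# Both checks must pass independently for the case to count as correct.
snapshot = {"monthly_limit": 300_000, "used_amount": 120_000}
answer = ("180,000 tenge. You can transfer fee-free until the total monthly "
          "amount exceeds 300,000 tenge")

# Check 1: the number matches the calculation from the snapshot.
expected_remaining = snapshot["monthly_limit"] - snapshot["used_amount"]   # 180000
number_ok = str(expected_remaining) in answer.replace(",", "").replace(" ", "")

# Check 2: the rule restates only what the document says (the 300,000 tenge monthly
# threshold) and adds no invented conditions such as "after verification".
invented_conditions = ["verification", "internal bank transfers"]
rule_ok = "300,000 tenge" in answer and not any(c in answer.lower() for c in invented_conditions)

print(number_ok and rule_ok)   # True -> the case passes
```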
This approach quickly shows where the system breaks: in extracting the fact from the PDF, in the database query, or in the final answer assembly. For RAG evaluation, that is much more honest than reading answers manually and marking them as "probably right."
Where teams most often make mistakes
Many teams undermine their tests at the question-selection stage. They choose a question where the source answers with a caveat, in two places at once, or does not give one exact conclusion at all. Then the model picks one acceptable option, and the test counts it as a mistake. For source-based checking, such cases are better removed right away.
The reference answer itself is another common failure point. If you write the correct answer as a long paragraph, the team quickly turns the review into a word dispute. The model may correctly name the amount, date, or status, but the test will fail because of different wording. It is much better to store the reference as a short value, a few allowed variants, or a set of fields that can be compared without guesswork.
Databases have their own trap. The rule has already changed, the limit has been updated, the status has been renamed, but the tests are still looking at the old database snapshot. After that, any report loses meaning: you no longer know whether the answer is wrong or the reference is stale. If the comparison is based on data, the snapshot must be versioned and updated by a clear rule.
Another common mistake is lumping retrieval failure and generation failure into one bucket. These are different problems. If the system did not fetch the right document, you need to fix search, filters, or the index. If the document was found but the model mixed up a number or invented an extra detail, the issue is in generation, the prompt, or the model itself.
Normalization is another separate topic. Without it, tests start complaining about details that do not change the meaning. Usually, it is worth bringing these to one form in advance:
- company and product names
- date formats
- currencies and numbers
- codes, statuses, and abbreviations
A simple example: in the database the tariff is stored as PRO_UNL, in the document it is called Pro Unlimited, and in the answer the model writes "Pro Unlimited Plan." If you do not map these variants to one canonical form, the test will mark an error where there is none.
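A minimal canonicalization map for that example; in practice, the table would be built from the tariff directory rather than hard-coded.

```python
# Map every known spelling of a tariff to one canonical code before comparing.
CANONICAL_NAMES = {
    "pro_unl": "PRO_UNL",
    "pro unlimited": "PRO_UNL",
    "pro unlimited plan": "PRO_UNL",
}

def canonical(name: str) -> str:
    key = name.strip().lower().strip('"')
    return CANONICAL_NAMES.get(key, name)

assert canonical("PRO_UNL") == canonical("Pro Unlimited Plan") == "PRO_UNL"
```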
These mistakes are common in teams that already compare several models and measure quality in batches. If you remove ambiguous questions, describe the reference briefly, keep track of the database snapshot, and separate error types, the test report becomes much more honest. Then you can see exactly what broke: search, generation, or the comparison rules themselves.
A quick checklist before launch
Before the first run, it helps to stop and check five things. If even one of them is vague, the tests will quickly start arguing with people instead of catching model errors.
- Choose one source of truth for each question: a specific document, a table version, or an SQL query result. If the answer can be justified by two different places, the test will be disputed.
- Write the reference answer in one short form. The reference should not be a "good customer answer." It should test a fact: a date, amount, status, field name, or exact quote.
- Save the comparison rule next to the test. Some cases require an exact string, some need a matching number in the right currency, and some can accept one of two wording options.
- Include not only simple examples. You also need edge cases: an empty field, an old document version, similar names, or two almost identical records in the database.
- Fix the data version. The team should be able to rerun the suite a week later and get the same result on the same set of documents and the same export.
The second point is the one that breaks most often. People write the reference too broadly, and the model passes the test with smooth wording. If the question is "what is the payment term under the contract?" the reference is better written as "10 banking days," with a separate note that no extra details are needed. Then document-based checking evaluates the fact, not the style.
It is also useful to add a short comment to each test. One line is enough: "check only the limit value from table tariffs_2025_02" or "accept the answer if the model returns the exact status name from section 4.2." That saves hours when the team is arguing over who is right: the model, the checker, or the test itself.
If you run RAG evaluation on internal policies or database data, this checklist is often more important than a complex metric. It immediately shows where the model confuses the source, where retrieval brings back the wrong excerpt, and where the error sits in the tests themselves.
What to do next in production
One successful run means nothing. In a live system, answers drift after small changes: you tweak the prompt, update the retriever, add a metadata filter, switch the model - and yesterday’s correct answer no longer passes source comparison.
That is why tests need to be built into the normal release cycle. Any change that can affect the answer should trigger the same set of checks. Otherwise, the team only learns about the break after a user complaint.
An overall score is useful, but it is too coarse. If the system scored 82%, that is not enough to make a decision. You need to see exactly where it failed: picked the wrong database row, missed a number, mixed up units, invented a value, or merged two sources into one answer.
That kind of breakdown quickly shows what to fix first. Sometimes it is the model. But often the reason is more mundane: bad chunking, an extra SQL join, incorrect post-processing, or a comparison rule that is too loose.
It is useful to keep three separate test sets:
- for answers based only on documents
- for answers based only on SQL and other database data
- for mixed scenarios where the model reads both a document and a table
Mixing them into one suite is usually harmful. The average score hides failures. The system may work well on documents and still consistently mix up filters and dates in SQL.
If the team compares many models and providers, it is better not to make the setup complicated. It is simpler to run the same suite through one OpenAI-compatible gateway. In the case of AI Router, it is enough to change the base_url to api.airouter.kz and then keep running the same SDKs, code, and prompts across different models without separate integration work for each provider. That saves time and makes comparison cleaner.
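A sketch of that setup with the OpenAI Python SDK; the exact base URL path, API key handling, and model names are assumptions to check against the gateway's documentation.

```python
from openai import OpenAI

# Same suite, same SDK; only the base_url changes.
client = OpenAI(base_url="https://api.airouter.kz/v1", api_key="YOUR_KEY")

def ask(model: str, question: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Run the identical questions against several models without separate integrations.
for model in ["model-a", "model-b"]:
    answer = ask(model, "What is the payment term under the contract?")
```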
Another practical step is to keep run history. Then you can see not only the current result, but also the exact moment when the system started making mistakes on a specific fact. Such a log is more useful than a pretty dashboard: it helps you quickly link a regression to a commit, a new model, or a data schema change.
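A minimal run-history logger, assuming a flat CSV is enough; the column set and file name are illustrative.

```python
import csv
from datetime import datetime

def log_run(path: str, run_id: str, results: dict) -> None:
    """Append one row per test: when it ran, which run, which test, which verdict.
    A flat file is enough to link a regression to a commit or a model change."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        timestamp = datetime.now().isoformat(timespec="seconds")
        for test_id, verdict in results.items():
            writer.writerow([timestamp, run_id, test_id, verdict])

log_run("run_history.csv", "commit-a1b2c3", {"limit_remaining": "exact", "rule_text": "miss"})
```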
If the test suite lives next to the code, runs after every change, and shows the type of error, the team stops debating based on intuition. You can immediately see which answer matched the document or database and which did not.
Frequently asked questions
Why does manual visual checking often get things wrong?
Because confident, polished text can easily hide an error in a date, amount, limit, or deadline. Reviewers also judge differently: one may accept the overall meaning, while another spots a changed detail and rejects the answer.
What counts as a correct answer in a source-based test?
Count as correct only the fact that can be tied to a specific place in the document or to a field in the database. If you cannot show the source and the comparison rule, you are checking impression, not accuracy.
How should you check a question that contains several facts?
Split the question into separate claims and check each one with its own rule. If the question includes a number and a currency, the model may get one right and miss the other, so a single overall pass is not useful here.
Where should the reference answer come from if the data changes?
Take the reference answer from a fixed version of the data on the test date: a specific file, table snapshot, or export. Otherwise, it may pass today and fail tomorrow simply because the source has already changed.
Which questions are better left out of the automated suite?
Do not automate questions that do not have one clear answer. Opinions, explanations like "why it was chosen," and live fields such as current balance or today’s price are better moved into separate short-lived tests or removed from the main suite.
What should be stored in one test case?
Keep the question itself, a short reference answer, an exact source pointer, and the comparison rule in one test case. That is enough for the team to quickly see where the correct fact lives and why the test passed or failed.
Should the answer be normalized before comparison?
Yes. Without normalization, tests will start arguing over spaces, casing, and date formats. First bring the answer and the reference into the same shape, and always compare numbers together with their unit and currency.
What should you do with a partly correct answer and extra details?
Separate exact match, partial match, and miss right away. If the answer names the correct fact but adds a detail that is not in the document, that is already an error, even if the first part sounds right.
How can you tell whether search failed instead of the model itself?
Look at the path to the answer. If the system did not retrieve the right excerpt or record, fix search, filters, or the index; if the source was found but the number or rule is wrong, the problem is in generation, the prompt, or the model.
When should these tests run in production?
Run the same suite after every change that can affect the answer: the prompt, retriever, model, post-processing, or SQL. It is also useful to keep run history so you can immediately see when and after which change the system started making mistakes.