Judge Model for Auto-Evaluation: Where to Trust and Where to Check
Judge models for auto-evaluation help you check answers quickly, but not everywhere. Here is how to use a rubric and manual sampling, and how to spot signs of systematic errors.

Why a judge model makes mistakes
A judge model for auto-evaluation is convenient for a simple reason: it gives one number. That number is easy to put on a dashboard, compare across releases, and discuss in a meeting. The problem is that this number reflects not how good the answer is for a real user, but how another model understood your rubric and read the text.
That is the difference between a convenient metric and real quality. A metric rewards compression: one score, one table, one conclusion. An answer to a user is more complex. It can be clear and polite but factually wrong. It can be plain but solve the task exactly.
If the rubric is vague, the judge model starts evaluating the overall impression. A smooth and confident answer often gets a high score even when there is a mistake in logic, calculation, or a rule reference inside. Style can easily hide a content problem. For summarization, that is already unpleasant. For contract extraction, bank scoring, or a medical summary, that is a risk.
An average score also hides failures well. Imagine 100 answers: the judge scores 90 simple cases almost perfectly, but in 10 rare cases it systematically misses a dangerous mistake. On the chart everything looks fine because the average stayed high. For a team, that is a trap: rare cases are often the most expensive ones.
The same thing shows up when comparing models. A team can run several variants through one gateway, get a neat table, and pick the leader by average score. Then it turns out that the leader writes more polite answers, but is worse at recognizing refusal, ambiguity, or missing data. The judge rewarded form, not meaning.
Blind trust in an automatic score is dangerous for another reason too: the team starts tuning the system to the judge's taste. Then quality does not rise; only the ability to pass the test does. If the score goes up but users still complain about strange answers, the problem is almost always in the evaluation setup.
When you can trust auto-evaluation
Auto-evaluation is most useful where the answer can be checked almost by template. The less room there is for taste and guesswork, the steadier the result. If the task is reduced to whether the required field is present, whether the format is correct, and whether the answer stayed within the limit, the judge model usually handles it well.
The most reliable case is checking the answer shape. Did the LLM return JSON, are the required fields present, do the types match, is an attribute missing, did the text exceed the limit? Here the model looks for concrete signs instead of arguing about meaning.
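Many of these signs can be verified without a model at all, which is one reason this layer is so reliable. As a rough sketch, a shape check might look like this; the field names, expected types, and length limit below are assumptions for the sake of the example:

```python
import json

# Hypothetical requirements for an extraction task: the field names,
# expected types, and length limit are illustrative assumptions.
REQUIRED_FIELDS = {"contract_number": str, "date": str, "status": str}
MAX_CHARS = 2000

def check_shape(raw_answer: str) -> list[str]:
    """Return a list of format problems; an empty list means the shape is fine."""
    problems = []
    if len(raw_answer) > MAX_CHARS:
        problems.append(f"answer longer than {MAX_CHARS} characters")
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        return problems + ["answer is not valid JSON"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems
```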
Cross-checking against a short reference also works well if the correct answer is almost unambiguous. For example, you need to extract a contract number, a date, or one status from a fixed list. When the reference is short and there are few allowed variants, the judge can usually separate the correct answer from the wrong one.
Pairwise comparison is another good scenario. It is easier for models to pick the better answer out of two than to assign a subtle score from 1 to 10. Especially if the question is direct: which answer follows the instruction more accurately, which has fewer factual errors, which does not break the format.
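A minimal sketch of such a pairwise check through an OpenAI-compatible client might look like this; the base URL, API key, model name, and prompt wording are placeholders, not a recommendation:

```python
from openai import OpenAI

# The base URL, API key, and model name are placeholders for any
# OpenAI-compatible gateway; the prompt wording is an illustrative assumption.
client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")

PAIRWISE_PROMPT = """You compare two answers to the same request.
Request: {request}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer follows the instruction more accurately and contains fewer factual errors?
Reply with exactly one letter: A or B."""

def pick_better(request: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="judge-model",  # placeholder model name
        temperature=0,
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
            request=request, answer_a=answer_a, answer_b=answer_b)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B"} else "undecided"
```

In practice, the same pair is usually scored twice with A and B swapped, so that position bias does not decide the winner.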
At the start, three safe uses are usually enough:
- filter out answers that failed format checks;
- choose one of two drafts for an operator;
- check short classifications against fixed labels.
If the judge model makes a mistake in such tasks, the team loses time, not a customer or money. That is the right mode for the first launches.
In a production system, such a layer is often placed first. If a team sends traffic through one LLM gateway, these checks are easy to automate early, while manual review is kept for disputed tasks where the meaning of the answer matters more than the form. For example, in AI Router you can run several models through one OpenAI-compatible endpoint and keep the same evaluation flow without rewriting the SDK, code, or prompts.
When you need manual sampling
Manual sampling is needed where a good answer does not have one correct form. A judge model likes a templated style: short, even, without controversial details. Because of that, it often downranks a strong answer simply because it was written differently from what the rubric expected.
This is especially obvious in open-text tasks. Two replies to a customer complaint can be equally useful, but one sounds softer and the other more direct. A person will see the appropriate tone and meaning. The model will often choose the one that looks more like the examples in the prompt.
There are cases where it is better not to remove humans at all:
- medical, legal, and financial answers, where one inaccurate phrase changes the meaning;
- fact checking when you have no reliable ground truth or database to compare against;
- requests where the answer could accidentally reveal PII, violate internal rules, or break legal requirements;
- rare errors with high damage, even if they happen once in 500 answers.
Domain tasks are dangerous for a simple reason: the judge model can sound confident and still miss the error. In a medical summary, it may not notice that the advice contradicts the symptoms. In a banking answer, it may approve an ambiguous explanation for a refusal. In legal text, it can sometimes mark the answer as complete even though a needed caveat is missing.
Facts are the same story. If there is no reference, the model starts judging not truth, but plausibility. For news, industry overviews, and short market briefs, that is not enough. A person should read at least part of the sample and separately check sources, dates, numbers, and names.
Another risk area is security and compliance. If the system works with requests, customer chats, or internal documents, manual review should check not only answer quality, but also whether a full name, account number, phone number, or other personal data leaked. For banks, telecom, the public sector, and healthcare, this is ordinary operational checking, not extra caution.
The rule here is simple: the rarer the mistake and the higher the price, the less you should trust auto-evaluation without human review. Let a person read not everything, but risky segments, disputed answers, and edge cases where the judge gives an overly confident score.
How to build a rubric without vague wording
A judge model starts to get confused when you ask it to score everything with one number. A phrase like "answer quality" sounds convenient, but it usually mixes accuracy, completeness, style, format, and safety. It is better to split the evaluation into separate criteria and score each one on its own scale.
For an ordinary LLM task, four criteria are often enough:
- factual accuracy;
- instruction following;
- format compliance;
- answer safety.
If your workflow carries higher risk, the criteria should be tied to real failures. For a banking or medical case, it is useful to score PII leakage separately. For API integrations, score JSON schema compliance separately. Then the rubric stops judging the overall impression and starts catching specific mistakes.
Each criterion needs a short scale with examples. Usually 0, 1, and 2 are enough. More levels often only get in the way: both people and the model start arguing about neighboring scores instead of the substance.
For the criterion "format compliance," the scale can look like this: 0 — the answer is not in the right format or breaks parsing; 1 — the format is almost correct, but a field is missing or there is extra text; 2 — the answer fully passes validation. For "factual accuracy," the logic is similar: 0 — there is hallucination or a direct contradiction with the data; 1 — a minor inaccuracy without harm to the task; 2 — the facts match the source.
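Written down as data rather than prose, the same rubric is easier to version and to hand to labelers. The structure below is one possible shape, with only the two scales described above filled in:

```python
# The same rubric as data: each criterion has its own short 0-1-2 scale
# with observable signs. Only the two scales described above are filled in;
# the remaining criteria would follow the same pattern.
RUBRIC = {
    "format_compliance": {
        0: "not in the required format or breaks parsing",
        1: "format almost correct, but a field is missing or there is extra text",
        2: "fully passes validation",
    },
    "factual_accuracy": {
        0: "hallucination or a direct contradiction with the data",
        1: "a minor inaccuracy without harm to the task",
        2: "the facts match the source",
    },
    # "instruction_following" and "answer_safety" follow the same 0-1-2 pattern.
}
```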
Vague words are better removed right away. "Good," "fairly complete," "high-quality," "reasonable" sound clear only in theory. Two labelers will read them differently. It is more useful to write observable signs instead: "all three steps are named," "there are no new facts," "the final answer appears in the first paragraph," "no personal data is present."
The rubric should say in advance when the judge gives a zero and when a human is needed. A zero is for obvious failures: dangerous advice, PII leaks, invented facts, broken format. Manual review is for cases where even a good model often fails: complex calculations, disputed wording, legal and medical answers, subtle cases with sarcasm or refusal.
Before launch, give the rubric to two labelers on the same sample. Often 30–50 answers are enough. If people regularly disagree on one criterion, the problem is almost always in the wording of the scale, not in the people. Fix that before auto-evaluation goes live. Otherwise the judge model will simply inherit the same confusion.
How to set up the check step by step
If you check the judge only on curated training prompts, it almost always looks more accurate than it will in live work. For setup, you need a set of real requests: support chats, call summaries, field extraction from documents, assistant answers in the product. In such a dataset, the noise, odd phrasing, and borderline cases become visible right away.
- Collect 100–300 examples from the product. Include not only normal requests, but also short, messy, and disputed ones.
- Label part of the dataset manually. For the first pass, 30–50 examples are often enough if two people review them using the same rubric.
- Review the disagreements between people. If labelers disagree, the cause is usually a vague criterion.
- Run the same set through the judge model and compare scores for each criterion, not only the overall score.
- Fix the rubric and repeat the cycle on a new slice, not the same examples.
Do not look only at the match rate. The judge model may systematically overrate neat but empty answers. It may confuse polite tone with quality, punish a different style, and miss a factual error in a long text.
A good check looks for types of failures. For field extraction, it is useful to track which fields the judge misses most often. For support, separate a wrong fact from weak wording. For summarization, move the required details into separate criteria: date, amount, reason for contact, next step.
A small example quickly reveals the bias. Suppose the judge gives high scores to short call summaries even when the text does not include the payment amount. Overall agreement with human labeling is 82%, which sounds decent. But on the criterion "required facts present," agreement drops to 54%. That means the rubric is too soft, or the judge itself does not understand that missing the amount breaks the answer.
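Computing agreement per criterion rather than overall takes a few lines once human and judge labels share the same keys; the record shape and the example numbers below mirror the case above and are otherwise assumptions:

```python
def agreement(records: list[dict], key: str) -> float:
    """Share of examples where the judge score equals the human score for one key.

    Each record is assumed to look like
    {"human": {"overall": 2, "required_facts": 1, ...},
     "judge": {"overall": 2, "required_facts": 2, ...}}.
    """
    matches = [r["judge"][key] == r["human"][key] for r in records]
    return sum(matches) / len(matches)

# Overall agreement can look fine while one criterion collapses,
# as in the example above:
#   agreement(records, "overall")        -> 0.82
#   agreement(records, "required_facts") -> 0.54
```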
If the team compares several models, do not merge their answers into one table without keeping track of which model produced which answer. One model writes smoothly, another keeps facts better. The judge often likes smooth style and shifts the picture because of it.
The repeat cycle is better run on a new data slice: different dialogs, a different scenario, a different language, a different answer length. If the error repeats in the same type of task after the changes, that part of the evaluation still needs manual sampling.
An example of a systematic error in a real task
Imagine a support chat that answers only from the company knowledge base. It works smoothly on ordinary questions, and errors appear on requests with exceptions.
Customer request: "I have already paid for the order. Can I change the delivery address if the package has been handed to the courier?" The knowledge base says the address cannot always be changed. If the order is already with an external delivery service, a new order or separate approval from an operator is needed.
The bot replies very politely: "Yes, of course. I will help change the address. It usually takes up to 2 hours." The answer sounds good, but it is inaccurate. The bot promised an action that the rules may not allow.
The judge model often overrates such an answer. It sees the polite tone, clear structure, and attempt to help. If the rubric only includes usefulness, clarity, and tone, the judge will easily give 4.5 out of 5. People give 2 out of 5 because the customer received a false promise.
Where people and the judge diverge
On a sample of 200 dialogs, the difference usually shows up immediately. On simple FAQs, the scores almost match. On requests with exceptions, the gap is already large.
| Request type | Human average score | Judge average score |
|---|---|---|
| Ordinary knowledge-base question | 4.4 | 4.5 |
| Question with an exception | 2.7 | 4.3 |
The reason is simple: people penalize a factual mistake more strongly than a dry tone. Without a separate fact criterion, the judge does the opposite.
What to change in the rubric
After such a failure, the team needs a rubric where factual accuracy is separate from politeness. Usually a few direct rules help:
- the answer does not promise an action that is not in the knowledge base;
- if there is an exception, the bot names the condition directly;
- if data is missing, the bot asks a clarifying question;
- a factual error immediately lowers the final score, even if the tone is good.
After that, manual sampling no longer feels like pointless caution. For exception cases, it is better to review 100% of dialogs manually for at least the first 2–3 weeks. If the volume is too large, check at least 30% of such cases and keep 10% of ordinary dialogs as a control group.
Otherwise the judge model's error will remain invisible: the bot will seem polite and helpful until support starts dealing with customer complaints.
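A sampling plan like this can be implemented as a one-function routing rule; the category labels are assumptions, and the percentages mirror the numbers above:

```python
import random

def needs_manual_review(dialog_type: str, high_volume: bool = False) -> bool:
    """Route a dialog to manual review according to the sampling plan above.

    `dialog_type` is a hypothetical label; "exception" marks requests that
    hit a knowledge-base exception, everything else counts as ordinary.
    """
    if dialog_type == "exception":
        # review all exception cases; fall back to 30% if volume is too large
        return True if not high_volume else random.random() < 0.30
    # keep 10% of ordinary dialogs as a control group
    return random.random() < 0.10
```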
Common launch traps
Auto-evaluation usually breaks not because of the model itself, but because of the evaluation setup. The most common mistake is to roll style, accuracy, and completeness into one overall score. Then a beautiful but wrong formulation gets the same rating as a dry but accurate answer. It is easier for the judge to assign a middle number than to honestly separate answer qualities along different axes.
That is why the rubric is better kept separate. Score factual accuracy on its own. Coverage of the request on its own. Format or tone too, if they truly matter for the task. Otherwise it becomes hard to tell what exactly dropped after a model or prompt change.
The second trap appears when the team checks the rubric on one type of request and then expects the same stability everywhere. On FAQs, call summaries, and legal drafts, the judge makes different mistakes. If the rubric was only tested on short support questions, it will easily drift on long answers with several conditions and caveats.
The average score also lulls people into complacency. It hides rare but expensive misses. A model can look fine by the overall number and then, once every 50 answers, miss a dangerous error: a made-up rate, a wrong deadline, a swapped diagnosis.
To keep control, a simple discipline is usually enough:
- look not only at the average, but also at the tail of bad scores;
- store examples where the judge disagrees with a human;
- test the rubric on several request types;
- record the version of the prompt, model, and parameters;
- calculate the cost of evaluation per thousand or ten thousand answers.
Versioning often has a very practical problem. The team slightly changes the judge instruction, sees a 4–5 point increase, and thinks quality improved. In reality, the scale changed. If you run evaluations through one OpenAI-compatible gateway, it is useful to record not only the prompt text, but also the exact model name, provider, and the date of the configuration change.
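One way to pin the evaluation version, assuming nothing about your stack, is a small record that travels with every run and can be hashed into a fingerprint; the field names are an illustrative assumption:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class JudgeConfig:
    # Field names are an illustrative assumption; the point is to capture
    # everything that can shift the scale, not just the prompt text.
    prompt_text: str
    model_name: str
    provider: str
    temperature: float
    rubric_version: str
    changed_on: str  # ISO date of the last configuration change

    def fingerprint(self) -> str:
        """Stable short hash so any score can be traced to the exact setup."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

If a 4–5 point jump appears after an instruction tweak, the fingerprint tells you whether the setup moved or the model did.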
There is another sober point too: an expensive judge model does not always pay off. If auto-evaluation eats a noticeable part of the budget, people start running it less often, and blind spots grow. It is often more sensible to keep a strong judge for disputed cases and send the main volume through a cheaper model with manual sampling on failures.
What to check before launch
Before launch, do not look at the average case, but at the answers where the mistake is most expensive. One hour of such review often saves a week of arguments about why the metrics suddenly stopped matching the real quality.
First, build a manual sample. It should include not only neat and short answers, but also difficult cases: disputed requests, incomplete data, long context, several constraints at once. If the team is introducing an LLM in a bank, telecom, the public sector, or healthcare, risky cases are better moved into a separate set. That is exactly where the judge model most often makes a mistake with a confident face.
Before launch, it is useful to check a few things:
- will a new labeler understand the rubric without verbal explanations;
- can the model tell a factual error from style;
- can you see in the table which criterion and task type the failure starts on;
- do people disagree because of a weak scale definition;
- is it already clear where auto-evaluation only sorts the flow and where it affects a decision.
A small test quickly clears things up. If the model consistently forgives factual mistakes but strictly punishes tone or format, you will see it in the first few dozen examples. It is better to catch that bias before launch, not after a quarter's worth of reports.
If the cost of an error is higher than the speed gain, stop auto-evaluation on that part. Go back to the rubric, add examples, rebuild the manual sample, and only then turn the automatic judge back on. A fast launch by itself does not help if the team then manually sorts through incorrect evaluations.
What to do next
A judge model is useful as long as its responsibility is narrow. Keep automatic checks for simple tasks: format, required fields, clear policy violations, and matching a reference under straightforward rules. Anything involving semantic nuance, disputed completeness, or unclear usefulness for the user is better checked with manual sampling.
Manual review does not have to be large. Often 30–50 answers per disputed scenario are enough to reveal a recurring error. If the judge model keeps giving a high score to a polite but empty answer, the problem is no longer one prompt, but the rubric itself or the task class.
Before choosing one judge model, compare several candidates on the same dataset. Use one rubric without rewriting the criteria for each model, give the same prompt and the same input field order, run one control sample with human labels, and count disagreements with humans, not only the average score. Also watch the error type: one model misses facts more often, another clings too much to style.
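Run side by side, the comparison can reuse the same per-criterion agreement idea. In the sketch below, `run_judge` is a hypothetical helper that scores a control sample with one candidate, `agreement` is the helper sketched earlier, and the candidate names are placeholders:

```python
CANDIDATE_JUDGES = ["judge-a", "judge-b"]  # placeholder names
CRITERIA = ["factual_accuracy", "instruction_following",
            "format_compliance", "answer_safety"]

def compare_judges(control_sample, run_judge, agreement):
    """Disagreement with human labels per criterion for each candidate judge.

    `run_judge(candidate, sample)` is a hypothetical helper returning records
    with per-criterion human and judge scores; `agreement` is the helper
    sketched earlier.
    """
    report = {}
    for candidate in CANDIDATE_JUDGES:
        records = run_judge(candidate, control_sample)
        report[candidate] = {
            c: round(1 - agreement(records, c), 2) for c in CRITERIA
        }
    return report
```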
After that, fix the evaluation version. Keep the rubric itself, the prompt text, the control sample, and the reason for every change. Otherwise, in two weeks the team will see a metric shift but will not know what caused it: a new criterion wording, a different judge, or a fresh request set.
If you test the judge across different providers, it is convenient to run everything through one compatible interface. In AI Router, you can send these runs through one OpenAI-compatible endpoint and avoid changing the SDK, code, or prompts when switching models. That is useful when you compare several candidates on the same rubric and want the differences to come from the model, not the infrastructure.
Good auto-evaluation does not replace common sense. It simply shows faster where to look with human eyes first.
Frequently asked questions
Can you really trust a judge model at all?
Yes, but only for narrow tasks. If an answer can be checked by clear signs like JSON, required fields, text length, or a short label from a fixed set, a judge model usually gives a useful signal.
When an answer requires understanding meaning, checking facts, or spotting risk for the customer, rely on manual sampling rather than a single score.
Which tasks is auto-evaluation usually good at?
It works best with the form of the answer. Whether the right format was returned, whether all fields are present, whether the types match, and whether parsing is broken — these checks are much more reliable than judging how good the answer feels.
Pairwise comparison also works well, when the judge chooses the better of two options based on accuracy or instruction following.
When is manual review still necessary?
Do not remove humans from medical, legal, and financial answers. There, one inaccurate phrase changes the meaning and can be costly.
The same applies to facts without a reliable ground truth, PII leaks, rare rule exceptions, and any case where the mistake is uncommon but expensive.
Why does the average score often mislead?
An average smooths out rare failures. A judge can score 90 simple answers well and still regularly miss 10 dangerous mistakes, while the final number still looks decent.
Look not only at the average score, but also at the tail of bad cases, disputed segments, and the kinds of requests where the cost of error is higher than usual.
What rubric should you use for the first launch?
For a start, four separate criteria are enough: factual accuracy, instruction following, format, and safety. Do not merge them into one score, or style will start hiding meaning errors.
Keep the scale short, for example 0, 1, 2. For each score, give observable signs rather than words like "good" or "fairly complete".
How do you know the judge values style more than substance?
Compare the judge's scores with human labels for each criterion. If the model forgives false promises, missing required facts, or ambiguous answers, but likes polite tone, then the balance is too far toward form.
This kind of failure is especially visible in exception cases, where the answer sounds smooth but breaks knowledge-base rules.
How many examples do you need to set up evaluation?
Usually 100–300 real examples are enough for a dataset, and 30–50 answers are enough for the first manual pass. Have two people label the same part of the sample using the same rubric.
Do not collect only clean requests. Add short, messy, disputed, and borderline cases, or the judge will perform worse in live traffic than it did in testing.
What should you do if labelers often disagree?
First fix the rubric, do not argue about people's taste. If labelers regularly disagree on one criterion, the wording is too vague.
Rewrite the scale in observable terms. For example, not "the answer is complete," but "the date, amount, and next step are all named."
Should you rate an answer with one overall score?
No, one number is convenient for a dashboard, but it hides the cause of failure. You will not know what dropped after a model or prompt change: facts, format, coverage, or safety.
Keep the overall score only as a supporting signal. Make decisions based on separate criteria and on risky segments.
How do you reduce the risk of costly mistakes after launch?
Keep auto-evaluation for simple checks and send disputed cases to humans. For rare exceptions, expensive mistakes, and high-risk answers, define a separate review path before launch.
A standing control sample also helps. If the judge starts giving high scores to empty or incorrect answers, you will notice before complaints arrive.