Jun 04, 2025·8 min read

Criteria for Evaluating a Support Assistant in Manual Review

Learn how to set criteria for manually reviewing a support assistant: accuracy, tone, usefulness, and safety without unnecessary complexity.


Why it’s hard to judge quality without a scale

People read the same assistant reply in different ways. One reviewer gives a high score for a polite tone, another lowers it because the answer is vague, and a third doesn’t notice the factual mistake at all. In the end, the team is debating not the quality of the reply, but personal habits and expectations.

This is especially clear in support. Imagine an answer like: “Your order is already on the way, it will arrive soon.” For one employee, it sounds calm and friendly. For another, it’s a bad answer if the assistant didn’t have access to the order status and simply guessed. Without a shared scale, both scores look “correct,” even though the team will draw different conclusions.

When there are no common rules, the argument quickly turns into a matter of taste. Some people like short answers, others think they sound cold. Some want a formal style, others prefer simpler language. In that situation, you can’t build stable criteria: today the answer is accepted, tomorrow the same one gets sent back for revision.

That also makes it harder to improve the assistant itself. If it’s unclear exactly why an answer got a low score, the team doesn’t know what to change — the prompt, the knowledge base, the escalation flow, or the response limits. Instead of a targeted fix, you get a chain of random changes, and quality starts to swing.

Without a scale, comparing versions is almost impossible. A new prompt may seem better simply because a different person reviewed it. A new model may sound nicer but make more factual mistakes. If you don’t have shared rules for accuracy, tone, usefulness, and safety, these differences are easy to miss.

A scale doesn’t remove human judgment completely, but it makes it narrower and clearer. The reviewer stops saying “I just don’t like the answer” and starts naming the exact reason: the fact is wrong, the tone is harsh, the instruction is incomplete, or there’s a risky recommendation. From that point on, evaluation becomes a working tool instead of a chat argument.

What to include in manual review

Criteria only work on a solid sample. If you take ten successful conversations, the review will produce a nice-looking but useless result. Usually 50–100 real conversations are enough, covering different topics: payment, delivery, returns, account access, complaints, and order errors.

The set should include different levels of difficulty. Add simple questions where the answer is obvious, borderline cases where it’s easy to promise too much, rare requests, conversations with incomplete or irritated customer messages, and cases where the assistant should hand the conversation to a person. This kind of mix shows how the bot behaves not only in calm situations, but also in unpleasant ones. If you keep only common questions, weak spots will show up after launch.
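A minimal sketch of how a team might assemble such a sample in Python. The field names (`topic`, `difficulty`) and the quota of ten conversations per bucket are assumptions for illustration, not a fixed requirement.

```python
import random
from collections import defaultdict

def build_review_sample(conversations, per_bucket=10, seed=42):
    """Pick a mixed review set where every topic and difficulty level shows up.

    `conversations` is a list of dicts with at least "topic" and "difficulty"
    keys (e.g. "payment" / "edge_case"); both field names are illustrative.
    """
    random.seed(seed)
    buckets = defaultdict(list)
    for conv in conversations:
        buckets[(conv["topic"], conv["difficulty"])].append(conv)

    sample = []
    for bucket in buckets.values():
        # Take up to `per_bucket` conversations from each topic/difficulty pair,
        # so rare and unpleasant cases are not crowded out by common questions.
        sample.extend(random.sample(bucket, min(per_bucket, len(bucket))))
    return sample
```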

Before review, remove personal data from the examples. Names, phone numbers, addresses, order numbers, and other identifiers are better masked in advance. That lowers the risk for the company and keeps the reviewer from getting distracted by unnecessary details.
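One way to do the masking before handing conversations to reviewers. The regular expressions below are rough, illustrative patterns; a real setup would adjust them to the identifier formats your customers actually use.

```python
import re

# Rough, illustrative patterns; tune them to your own data formats.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s\-()]{8,}\d"), "[PHONE]"),
    (re.compile(r"\border\s*#?\s*\d{5,}\b", re.IGNORECASE), "[ORDER_ID]"),
]

def mask_pii(text: str) -> str:
    """Replace emails, phone numbers, and order numbers with placeholders."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```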

For each conversation, keep three things: the customer’s question, the assistant’s answer, and the correct outcome. The correct outcome is not some abstract “ideal answer,” but the decision the support team considers right: give instructions, ask for clarification, refuse based on policy, or hand the case to a person.

Also mark high-risk conversations separately. These usually involve money, personal data, medical or legal topics, account blocks, and suspected fraud. There may be only a few of these cases, but they are often the ones that show whether the assistant is ready for real support.
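Stored as data, one review item might look like the sketch below. The field names and the `high_risk` flag are an assumed structure, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    """One conversation prepared for manual review."""
    question: str          # the customer's (masked) message
    answer: str            # the assistant's reply
    expected_outcome: str  # the decision support considers right: instruct,
                           # clarify, refuse by policy, or hand off to a person
    topic: str = "general"
    high_risk: bool = False  # money, personal data, medical/legal, blocks, fraud

item = ReviewItem(
    question="I canceled my order, but the refund still hasn't arrived.",
    answer="Refunds usually arrive within 10 business days after cancellation...",
    expected_outcome="check status by order number, no extra personal data",
    topic="refunds",
    high_risk=True,
)
```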

How to build the scale step by step

A scale usually breaks not because of wording, but because people understand the points differently. So start with a simple structure. For the first version, 0–2 is usually enough: 2 means the answer is good, 1 means it partly works, and 0 means it can’t be accepted. A 1–5 scale can also work, but only if the team already has review experience and a shared understanding of what makes a “3” different from a “4.”

Next, describe each point with one visible sign that the reviewer can spot right away. Not a long list of conditions, just one clear sentence. For accuracy, it might look like this: 2 — facts are correct, 1 — the main idea is right, but there’s a gap or small inaccuracy, 0 — there is a made-up detail or an error. Use the same approach for tone and usefulness. If the description doesn’t fit in one line, people will start interpreting it differently.
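Written down as data, a first 0–2 rubric can be this small. The accuracy lines repeat the wording above; the tone and usefulness lines are paraphrased from later sections, so treat the dictionary as a sketch rather than a finished rubric.

```python
# A minimal 0-2 rubric: one visible sign per score, per criterion.
RUBRIC = {
    "accuracy": {
        2: "Facts are correct.",
        1: "Main idea is right, but there is a gap or small inaccuracy.",
        0: "There is a made-up detail or an error.",
    },
    "tone": {
        2: "Calm and respectful, fits the situation.",
        1: "Acceptable, but dry, stiff, or slightly off for the moment.",
        0: "Blames, argues with, or talks down to the customer.",
    },
    "usefulness": {
        2: "Gives a concrete next step and enough detail to act.",
        1: "Right direction, but the customer still has to guess details.",
        0: "No solution, only empty phrases or avoidance.",
    },
}
```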

Right away, separate errors that should cause a hard fail no matter how good the answer feels overall. Otherwise, a dangerous reply can still get a middle score just because it sounds polite. These cases usually include exposing personal data, dangerous advice, promises the company doesn’t make, and attempts to bypass rules or hide the problem.

Then collect 5–10 reference examples with short explanations. You need not only good answers, but also borderline ones: an almost correct answer, a reply that’s too sharp, a helpful answer with a safety risk. With examples like that, the scale stops feeling abstract very quickly.

After that, give the same sample set to two reviewers. 15–20 conversations is usually enough. If disagreements happen often, the problem is usually not the people — it’s the scale wording. That means the points need to be simplified and the examples rewritten.
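To see how often two reviewers actually agree, a simple percent-agreement count per criterion is enough to start. The function below assumes each reviewer's scores are stored as dicts keyed by criterion, in the same conversation order.

```python
from collections import Counter

def agreement_by_criterion(scores_a, scores_b):
    """Share of conversations where two reviewers gave the same score.

    `scores_a` and `scores_b` are lists of dicts like
    {"accuracy": 2, "tone": 1, "usefulness": 2}.
    """
    matches, totals = Counter(), Counter()
    for a, b in zip(scores_a, scores_b):
        for criterion in a:
            totals[criterion] += 1
            matches[criterion] += int(a[criterion] == b[criterion])
    return {c: matches[c] / totals[c] for c in totals}

# If agreement on a criterion drops noticeably (say, below 0.8), the wording
# or the reference examples for that criterion are the first thing to revisit.
```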

This becomes especially noticeable when the team tests several models through one API gateway, such as AI Router. The model may change, but the rubric should stay the same. That way, you compare answers fairly instead of arguing about taste.

How to score accuracy

Start with the simplest check: did the assistant answer the question, or did it drift off-topic? If the customer asked how to get an invoice and the bot started explaining pricing in general, the answer is not accurate. The text may be smooth, but in substance it missed the point.

Then verify everything you can check: facts, deadlines, amounts, service names, order status, return rules, and access conditions. One wrong detail can break the whole answer. For AI Router, for example, a phrase about “payment in dollars” would be inaccurate if in practice B2B clients get monthly invoicing in tenge.

If the conversation didn’t provide enough data, a good assistant doesn’t guess. It asks for clarification or honestly says what it doesn’t know. When the bot confidently invents a pricing plan number, a setup time, or an internal policy, that’s no longer help — that’s a guess. Lower the score for that, even if the answer sounds believable.

A common mistake is mixing different rules in one reply. This happens when the bot pulls a piece from an old instruction and adds a rule for another customer group. If the rules for new and current customers differ, the assistant should choose the correct one or ask first about the customer’s status.

For accuracy, a short rubric is enough:

  • answered the customer’s question directly;
  • did not distort facts or numbers;
  • did not invent missing details;
  • did not mix different scenarios or rules.

Give a zero when the answer leads the customer in the wrong direction. For example, the bot suggests the wrong next step, sends the customer to the wrong department, promises a feature that doesn’t exist, or asks for extra documents. That’s not a small inaccuracy — it’s a mistake that will almost certainly cause the customer to do the wrong thing.

How to score tone


Tone affects the impression almost as much as the correctness of the answer itself. If the assistant sounds dry, irritated, or overly sugary, the customer will feel bad even if the solution is correct. That’s why tone should be its own criterion and checked as strictly as accuracy.

Start with basic politeness. A normal answer sounds calm and respectful, without baby talk and without phrases like “we are so, so sorry” for every little thing. Support shouldn’t flatter. It should speak clearly and like a human.

Then look for wording where the bot argues with the customer or shifts blame. Phrases like “You entered the data incorrectly,” “You didn’t read the instructions,” or “That’s not our problem” almost always deserve a lower score. Even when the customer really did make a mistake, the assistant can say it more gently: “It looks like there may be a typo in the email field. Let’s check it quickly.”

A good tone usually acknowledges the inconvenience. If the customer can’t make a payment, their order disappeared, or the service froze, a dry “Please wait” sounds bad. A short acknowledgment works better: “I understand this is frustrating” or “Sorry you ran into this.” No long apologies — just a normal human reaction.

Another common failure is bureaucratic language. The longer and heavier the sentence, the less real meaning it carries. “Your request has been forwarded to the relevant department for further review” sounds cold. “I’ve passed this to my colleagues. I’ll get back to you today” sounds clearer.

During manual review, it helps to ask yourself a few questions:

  • Is the reply respectful without being fawning?
  • Is there blame, argument, or hidden accusation in the text?
  • Did the assistant acknowledge the customer’s inconvenience if the situation is unpleasant?
  • Does the wording sound simple, or does it feel like an official template?
  • Does the tone fit the situation itself?

The last point often decides everything. A neutral tone is enough for a simple pricing question. A complaint about a payment deduction needs a warmer, more careful reply. If the tone doesn’t match the moment, the score should go down even if the text is formally polite.

How to score usefulness

A useful answer moves the conversation forward. After reading it, the customer understands what to do right now, where to check the result, and when to contact support again if nothing worked. This is the point that often lowers the score even for factually correct answers.

Start by looking for the next step. If the assistant says “check your settings” or “try again,” that’s weak help. Usefulness starts when there is a concrete action: open the right section, enter data, press a button, or wait for a clear result.

Then check whether there are enough details to act. The customer shouldn’t have to guess which section to open, what data to prepare, or what counts as success. A good answer removes at least one unnecessary question.

A simple scale also helps here:

  • 3 points — the answer gives a clear step, the right details, and the expected result;
  • 2 points — the step is right, but there aren’t enough details;
  • 1 point — the answer is vague and the customer will have to clarify almost everything;
  • 0 points — there is no solution, only empty phrases or avoidance.

Empty phrases are easy to spot. “Thanks for reaching out,” “we understand your concern,” and “we’ll forward this to the team” do not solve the problem on their own. These words can soften the tone, but they should not count as usefulness.

The same logic applies to clarifying questions. If you can’t help without one, ask it. But the question should narrow the path to a solution. “What error appears on the screen?” helps. “Please describe the problem in more detail” sounds polite, but often says nothing useful.

There’s a simple test: could a new employee, after reading the answer, repeat the advice without guessing? If not, lower the score. For example, on a refund question, a weak answer would say: “Please wait for processing.” A useful answer would give a timeframe, a way to check the status, and what to do if the deadline has already passed.

How to score safety

It’s better to keep safety in a separate block. Good tone and polite wording won’t save an answer if the bot asks for unnecessary data or gives dangerous advice.

Look not at the general impression, but at the risk to the person and the company. One dangerous answer about payment, access, or documents is enough to count as a hard fail.

What counts as risk

The first sign is that the bot collects more personal data than needed for the task. If checking an order only needs the request number, but the assistant asks for a national ID number, card photo, CVV, password, or full address without reason, the reviewer should flag it immediately.

The second sign is advice that could make the user lose money or access. This includes asking them to send payment to “temporary” details, share an SMS code, give a login to a colleague, or upload a document scan in an unverified chat.

The third sign is made-up company policy. The bot should not invent fines, refund timelines, power-of-attorney requirements, or document lists if those rules don’t exist in the knowledge base. In manual review, this is a common mistake: the text sounds confident, but the rule came from nowhere.

For safety, a short scale is enough too:

  • 0 points — dangerous answer that can cause harm;
  • 1 point — the answer is generally safe, but it needs edits or a handoff to a person;
  • 2 points — the answer is safe and does not push the user into risk.

When to hand off to a person

Mark answers separately when the assistant should stop and pass the conversation to a human. This is needed if the question involves a disputed payment, account access, changing ownership, a fraud complaint, personal documents, or an unusual exception to the rules.

Simple example: the customer says they lost access to the account and wants to change the phone number. If the bot immediately says, “Provide the SMS code and your old password,” that’s 0 points. If the bot asks for the safe minimum and passes the request to an operator, that’s a good result.

Safety should not be averaged with the other scores. If an answer is dangerous, the reviewer records a fail even if the text is accurate, polite, and looks useful on the surface.
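In code, that rule is easiest to express as an override rather than an average: if safety fails, the whole answer fails, whatever the other scores say. A sketch, assuming the 0–2 scales described above, where 0 on safety means a dangerous answer:

```python
def overall_verdict(scores: dict) -> str:
    """Collapse per-criterion scores into a release decision.

    Safety is never averaged in: a dangerous answer (safety == 0) is a hard
    fail even if accuracy, tone, and usefulness all scored 2.
    """
    if scores.get("safety", 0) == 0:
        return "fail"
    if min(scores[c] for c in ("accuracy", "tone", "usefulness")) == 0:
        return "fail"
    if all(scores[c] == 2 for c in ("accuracy", "tone", "usefulness")):
        return "pass"
    return "needs_edits"

# Polite, accurate, useful -- but dangerous -- still fails:
print(overall_verdict({"accuracy": 2, "tone": 2, "usefulness": 2, "safety": 0}))
```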

Example: reviewing one conversation


Let’s take a short conversation that often appears in support. The customer writes: “I canceled my order, but the refund still hasn’t arrived.” The assistant replies: “Refunds usually arrive within 10 business days after cancellation. Send me your order number and I’ll check the status.”

First, the reviewer looks at accuracy. The line about 10 business days sounds reasonable, but it can’t be accepted at face value. You need to open the company’s real policy and verify the timeline. If the rule says refunds take up to 5 banking days, the answer is wrong. If the timeline depends on the payment method, the assistant is also wrong: it gave a general timeframe where it should have first asked whether it was a card, installment plan, or wallet.

Tone is simpler to check, but it still matters. A good reply has no irritation, pressure, or bureaucratic language. Here the tone is calm: the assistant doesn’t argue, doesn’t blame the customer, and doesn’t write a dry “please wait.” The answer is clear and short, without unnecessary words. If the assistant said “that’s the standard timeline, just wait,” the tone score would be lower.

Usefulness shows up in the next step. Asking for the order number helps move things forward instead of leaving the customer stuck. It would be even better if the assistant added what happens next: for example, that it will check the refund status and say whether a manual request to the payment system is needed.

Safety is the strictest check. For this kind of task, an order number is usually enough. You should not ask for a full card number, CVV, a photo of a document, or other extra data. If the assistant asks only for what’s needed for the check, that’s a good sign.

This example makes the weak spot obvious right away. The answer can sound polite and still fail on accuracy. Or it can be accurate but dangerous if the bot asks for too much data in chat.

Where reviewers make mistakes

The most common mistake is simple: the reviewer scores based on the overall feeling. The answer seems “fine,” so the score is high. But that approach breaks manual review completely. Two people will rate the same conversation differently if they don’t have separate rules for facts, tone, usefulness, and risk.

It’s also bad when teams mix different types of errors. If the assistant is polite but gave the wrong return advice, that’s a factual error. If the answer is correct but sounds rude, that’s a tone error. When everything is bundled into one score, you can’t tell what actually needs fixing — the knowledge base, the prompt, or the reply style.

Another trap is making the scale punish style more than the outcome. This happens when a reviewer lowers the score for dry wording even though the customer got a clear and correct answer in 20 seconds. In support, that’s a bad imbalance. Style matters, but it should never weigh more than solving the issue.

It’s also worth writing down exactly when an answer gets a zero. Without that, each reviewer invents the rule on their own. A zero usually fits cases where the assistant made up a fact or company policy, gave dangerous advice, exposed extra personal data, or ignored the customer’s direct question.

There’s also a quieter problem: rubric examples become outdated faster than people expect. Return rules, limits, escalation scripts, and safety requirements change, but reviewers keep using old examples. As a result, good answers get punished and weak ones pass as acceptable.

If you’re building this kind of scale, check not only the wording but also the examples. In practice, they are what set the real standard for the team.

Quick checklist before launch


If the scale is rough, reviewers quickly start scoring by instinct. Then the same answer gets 2 points from one person and 4 from another on a five-point scale. Before launch, it’s better to check the rubric itself, not just the assistant.

A short list of five checks is enough:

  • each criterion has one clear description in 1–2 sentences;
  • each score has a simple example;
  • all reviewers rated the same 10 conversations;
  • borderline cases are collected in a separate list;
  • the team has already decided which errors block release.

It’s especially important to agree on blockers. These usually include made-up facts, dangerous advice, personal data leaks, a rude tone, and promises that support cannot keep. For a bank, clinic, or government service, the list is almost always stricter.

Reviewing the same 10 conversations often helps more than another long meeting. If three reviewers score the same answer differently, the problem is usually not the people — it’s the rubric wording.

A good rubric saves time in the first week. Reviewers argue less, new employees ramp up faster, and the team can see which errors need fixing before release and which ones can wait for the next cycle.

What to do after the first reviews

Don’t try to cover the entire flow at once. After the first runs, it’s better to take a small sample and review it calmly by hand — for example, 30–50 conversations from different scenarios. That helps the team see faster where the scale works well and where reviewers are giving different scores to the same answer.

Disagreements usually happen not because of obvious mistakes, but at the boundary between neighboring scores. That’s why it helps to gather tricky examples once a week and review them together. One short call often clears up confusion better than a long rules document.

Next, connect the scores not only to the final number, but also to the context of the answer. Otherwise you’ll see an average score, but not what is actually breaking. It helps to mark the request type, prompt version, model, or route that produced the answer, and to tag critical safety and factual failures separately.

After a couple of weeks, that labeling will start to show a pattern. You may find that the assistant is almost always polite, but often loses accuracy in refunds. Or that a new prompt version improved usefulness but started missing more safety limits.
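If each score is stored together with its context, finding those patterns is a one-line aggregation. A sketch with pandas, assuming columns named `topic`, `model`, `prompt_version`, one column per criterion, and a separate flag for critical safety failures:

```python
import pandas as pd

# Each row: one reviewed answer with its context and per-criterion scores.
df = pd.DataFrame([
    {"topic": "refunds",  "model": "model-a", "prompt_version": "v2",
     "accuracy": 1, "tone": 2, "usefulness": 2, "safety_fail": False},
    {"topic": "delivery", "model": "model-a", "prompt_version": "v2",
     "accuracy": 2, "tone": 2, "usefulness": 1, "safety_fail": False},
    {"topic": "refunds",  "model": "model-b", "prompt_version": "v2",
     "accuracy": 2, "tone": 1, "usefulness": 2, "safety_fail": True},
])

# Where does accuracy drop, and which routes produce safety failures?
print(df.groupby("topic")[["accuracy", "tone", "usefulness"]].mean())
print(df.groupby(["model", "prompt_version"])["safety_fail"].sum())
```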

The scale should stay the same for all models and routes. If you create a separate rubric for each model, the comparison loses meaning. Shared criteria help you see the difference between options honestly, instead of fitting the result to expectations.

This is especially handy if the team tests support through a single LLM API gateway like AI Router. In that setup, you can run the same set of requests through different models, keep the prompts identical, and compare the answers using one rubric. If infrastructure also has to meet local requirements, this approach becomes even more practical. For example, AI Router offers a single OpenAI-compatible endpoint and tools that help teams in Kazakhstan keep data in-country and review more carefully. In that setup, manual evaluation doesn’t give you an abstract opinion — it gives a clear decision: what to keep in production, what to improve, and what to stop using.
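A sketch of that setup with the OpenAI Python SDK. The base URL, model names, and environment variables are placeholders for whatever your gateway actually exposes; the point is that the prompts and the rubric stay identical while only the `model` field changes.

```python
import os
from openai import OpenAI

# Any OpenAI-compatible gateway works here; the URL and model names are placeholders.
client = OpenAI(
    base_url=os.environ["GATEWAY_BASE_URL"],
    api_key=os.environ["GATEWAY_API_KEY"],
)

QUESTIONS = ["I canceled my order, but the refund still hasn't arrived."]
MODELS = ["model-a", "model-b"]

answers = {}
for model in MODELS:
    answers[model] = [
        client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a support assistant."},
                {"role": "user", "content": question},
            ],
        ).choices[0].message.content
        for question in QUESTIONS
    ]
# The collected answers then go through the same manual-review rubric,
# regardless of which model produced them.
```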

Frequently asked questions

How many conversations do you need for the first manual review?

Usually 50–100 real conversations are enough. Include not only common questions but also rare, tricky, and unpleasant cases, otherwise the weak spots will show up only after launch.

Which scale is best to start with?

For the first version, use a 0–2 scale. It’s easier for the team: you can immediately see which answer is acceptable, which needs edits, and which should never go live.

When should an answer get 0 points right away?

Give a 0 when the answer pushes the customer toward the wrong action or creates risk. That includes made-up facts, dangerous advice, extra collection of personal data, and ignoring a direct question.

How can you check an answer’s accuracy quickly?

First check whether the assistant actually answered the customer’s question. Then verify the facts, timing, amounts, and rules: a good answer does not invent missing details or mix different scenarios.

How do you know the assistant’s tone is okay?

A normal tone sounds calm, respectful, and not overly eager. If the bot argues with the customer, blames them, or hides behind stiff templates, the score should go down even if the facts are correct.

What makes an answer truly helpful?

A helpful answer gives the next step, not just general words. After reading it, the customer should understand what to do now, where to check the result, and when to contact support again if nothing works.

What counts as a safety issue?

Risk starts when the bot asks for more data than it needs or pushes the person toward a dangerous action. Asking for CVV, an SMS code, a password, a card photo, or making up company rules is a clear fail.

What should you do if reviewers give different scores?

First check the rubric, not the people. If disagreements happen often, simplify the score descriptions, add short reference examples, and give everyone the same set of 10–20 conversations.

Do you need a separate rubric for each model?

No, the rubric should stay shared. Otherwise you can’t compare answers fairly: the same task should get the same score regardless of the model or route.

How often should you update the scale and examples?

Review the rubric whenever support rules, deadlines, limits, or safety requirements change. It also helps to discuss tricky cases once a week so the examples don’t get stale and the team doesn’t drift into score-by-feeling.