Oct 06, 2024·8 min read

Customer Complaint Classification: How to Combine Rules and LLMs

Customer complaint classification helps assign queues and SLAs faster when you combine simple rules, LLMs, confidence checks, and manual review.


Why a simple setup breaks quickly

On paper, everything looks simple: there are 8–10 categories, an agent picks one, and the request goes to the right queue. In practice, that setup does not last long.

Customers do not write in templates. They write emotionally, jump between facts, forget details, and describe the problem in their own words. A user does not write “billing error” or “authentication failure.” They write: “You charged me again, nothing loads, and the operator has been silent for an hour.”

For strict rules, that is awkward text. It has a lot of noise, few exact phrases, and the meaning is often hidden between the lines. If a person is angry or in a hurry, they do not lay out events in order at all. They just dump everything into one message.

That leads to a second problem: one text often contains two or three topics at once. A customer may start with a duplicate charge, then add that the card was not linked, and finish by complaining about a rude reply in chat. If the system asks them to choose one label, it cuts the meaning down. Part of the request gets handled, and part of it gets lost.

Categories also change more often than you might think. The business launches a new service, changes promo terms, adds a new order status, or moves a process into the app. Within a few weeks, the old scheme no longer matches the way people really describe problems.

Even carefully written rules start to fail at that point. They catch familiar words, but get confused by new scenarios. Yesterday’s similar complaint was about delivery; today it is about a subscription or bonuses. Formally, the text is almost the same, but the route needs to be different.

A wrong label seems minor until you look at the consequences. The request goes to the wrong team, priority is set too low, the SLA clock starts under the wrong scenario, and the customer gets extra questions instead of an answer.

For a bank, that can delay the review of a disputed charge. For retail, it can miss the deadline for a return response. If there are many such mistakes, not only request routing breaks, but the whole picture of queues, workload, and response times does too.

A simple setup breaks not because it was designed badly. It breaks because live text and live business change faster than rules do.

What the system decides for each request

A single customer message rarely requires just one decision from the system. Usually, it makes several at once, and an error in any of them hits deadlines, queues, and review quality.

A working setup should answer four questions: what the customer is really writing about, who should get the request right now, what response time should be set, and where a human is needed before the first answer or before closure.

The complaint topic seems simple until people start writing in their own words. One customer writes “the money never arrived,” another says “the refund is stuck,” and a third says “after payment, silence.” The meaning is the same, the wording is different. So the system should not stop at a broad topic like “payments.” It also needs a subcategory: duplicate charge, delayed refund, disputed transaction, payment error.

Next comes routing. That is no longer about the meaning of the text, but about action. A complaint about a rude reply from an agent can stay with the front line. A question about a charge should go straight to billing. If the text shows signs of fraud, card compromise, or an attempt to bypass verification, the request should move to risk. If the customer is complaining about a courier or a missing order, the ticket goes to delivery.

Then the system sets the SLA. One deadline for everything does not work here. Urgency depends on risk, amount, channel, and customer type. A message about a duplicate charge for 500 tenge and a complaint about a 480,000 tenge transaction should not wait the same amount of time. A complaint from a public channel or from a large B2B client also often needs a faster response.

The last decision is often forgotten, even though it saves you from costly mistakes. The system should know where an agent must check the output manually. That is needed when the model is unsure, when the rules and the LLM give different results, when the amount is high, or when the topic is disputed. In those cases, it is better to send the request for manual review than to close it automatically with the wrong category.

A good setup does not just slap a label on text. It chooses the topic, the support queue, the response time, and the point where a person must step in.

Where to keep rules and where to let the LLM speak

If you build classification only on phrase dictionaries, the setup starts to crack fast. People write with typos, fragments of thought, screenshots in the text, broken dates, and amounts. Today a customer writes “charged twice,” tomorrow “what’s this duplicate on the card,” and the day after “payment went through again, please refund it.”

Rules are best kept where the signal is obvious and there is nothing to argue about. If the text contains words about blocking, an exact amount, a contract number, an order number, duplicate charging, or urgency, the system should latch onto that immediately. Those things are easy to extract with regular expressions and simple conditions. They are predictable and do not depend on how the customer formed the sentence.

The LLM handles the other part of the task. It reads meaning even when the text is uneven and messy. A customer may complain about three things at once, jump from detail to detail, and mix up cause and effect. The model still often understands that this is a payment dispute, a delivery issue, or a risk of missing the SLA.

Usually the working boundary looks like this: rules extract required signals and urgent cases, the LLM chooses the category from free text, and then rules verify the result and, if needed, forcibly change the route.
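That boundary can be sketched in a few lines. Here the LLM call is stubbed, and all pattern, category, and queue names are hypothetical, not a real taxonomy:

```python
import re

# Hypothetical urgent-signal patterns; a real deployment maintains these per language.
URGENT_PATTERNS = {
    "duplicate_charge": re.compile(r"charged (me )?(twice|again)|double charge", re.I),
    "blocking": re.compile(r"account (is )?blocked|card blocked", re.I),
}

def extract_signals(text: str) -> dict:
    """Rules: pull unambiguous signals out of the raw text."""
    return {name: bool(p.search(text)) for name, p in URGENT_PATTERNS.items()}

def llm_classify(text: str) -> dict:
    """Placeholder for the LLM call; in production this hits a model endpoint."""
    return {"category": "payments", "confidence": 0.85}

def classify(text: str) -> dict:
    signals = extract_signals(text)
    result = llm_classify(text)
    # Rules verify the model's answer and force the route for mandatory cases.
    if signals["duplicate_charge"]:
        result["queue"] = "billing"   # locked in by rule, not by the model
    elif signals["blocking"]:
        result["queue"] = "risk"
    else:
        result["queue"] = result["category"]
    result["signals"] = signals
    return result
```

The point is the order: rules run first and last, and the model only fills the middle.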

This is especially useful where topic names change almost every week. Support adds a new queue, lawyers update complaint types, or the business splits one category into two — then the phrase dictionary has to be rewritten again. The LLM handles such changes more calmly because it relies on context, not exact word matches.

But do not give the model all routing decisions. If there is a mandatory route, it should be locked in by rule. A complaint about account blocking, a hint of fraud, a threat to go to the regulator, a repeat charge of a large amount — these cases are better sent to the right queue without free interpretation.

A normal setup looks more like a framework than an endless list of phrases. Rules set the boundaries: what is urgent, what must not be confused, what fields need to be extracted from the text. The LLM fills in the middle and understands the customer’s intent where language keeps changing. In most cases, that combination gives better quality and less manual cleanup.

How to build the scheme step by step

Classification works better when the scheme stays boring and clear. If the team cannot quickly explain how one category differs from another, the model will also start to get confused.

First narrow the categories

Do not start with the full reference list. For the beginning, a working set of 10–20 categories is enough. Keep only the groups that support staff really distinguish in daily work. If two categories almost always go to the same queue and get the same response time, it is better to combine them.

For each category, describe three things right away: where the request goes, what SLA it has, and under what condition a senior review is needed. Otherwise you get a nice label with no process value. For example, the category “Payment did not go through” can go to the finance queue, have a 30-minute response target, and an extra escalation if the customer writes about duplicate charging or card blocking.

After that, add hard rules for urgent and unambiguous cases. They work well for words and signals where mistakes are not acceptable: “double charge,” “charged twice,” “cannot log in,” “data leak,” mention of a court, regulator, or medical risk. There is no need to ask the model here if the route and priority are obvious.
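Such hard rules can be a short ordered list that is checked before any model call. The patterns, queue names, and priorities below are illustrative, not a recommended set:

```python
import re

# Hard rules for urgent, unambiguous cases; first match wins.
HARD_RULES = [
    (re.compile(r"double charge|charged twice", re.I), "finance", "urgent"),
    (re.compile(r"cannot log in|can't log in", re.I), "access", "high"),
    (re.compile(r"data leak|personal data", re.I), "security", "urgent"),
    (re.compile(r"\b(court|regulator)\b", re.I), "legal", "urgent"),
]

def apply_hard_rules(text: str):
    """Return (queue, priority) if a hard rule fires; None means 'ask the LLM'."""
    for pattern, queue, priority in HARD_RULES:
        if pattern.search(text):
            return queue, priority
    return None
```

Everything that returns `None` here is exactly the free text the model is for.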

Then add the LLM

Once the rules cover the edge cases, give all the remaining texts to the LLM. Ask it not just to choose a category, but to return the category, confidence, and a short reason for the choice in one sentence. That helps check disputed answers and fix the scheme faster when the model confuses similar topics.
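If the model is asked to return JSON, a thin validation layer keeps its answer usable even when it drifts. The category list and field names here are assumptions for the sketch:

```python
import json

# A closed list of subcategories; anything outside it degrades to "other".
ALLOWED = {"duplicate_charge", "delayed_refund", "disputed_transaction",
           "payment_error", "other"}

def parse_llm_answer(raw: str) -> dict:
    """Validate the model's JSON: known category, clamped confidence, short reason."""
    data = json.loads(raw)
    if data.get("category") not in ALLOWED:
        data["category"] = "other"   # unknown label -> honest "other", not a guess
    conf = float(data.get("confidence", 0.0))
    data["confidence"] = min(max(conf, 0.0), 1.0)
    data["reason"] = str(data.get("reason", ""))[:200]  # one short sentence is enough
    return data
```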

At the start, a simple logic is usually enough: rules handle urgent requests first, the LLM analyzes free text, doubtful answers go to manual review, and once a week the team checks where the model made the most mistakes.

The confidence threshold should not be too low. If the model is wavering, it is better to send the complaint to a person than to break the SLA and put the request in the wrong queue. Many teams start with a cautious mode: anything below 0.7 or 0.8 goes to manual review.
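In code, that cautious mode is one comparison. The threshold value is a starting assumption to tune, not a recommendation:

```python
REVIEW_THRESHOLD = 0.75  # a cautious start, between the 0.7 and 0.8 mentioned above

def route(answer: dict) -> str:
    """Send low-confidence answers to a person instead of auto-closing them."""
    if answer["confidence"] < REVIEW_THRESHOLD:
        return "manual_review"
    return answer["category"]
```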

If the flow is large, save both the short reason for the choice and the employee’s final decision. After a couple of weeks, it will become clear which categories need to be split, where the rules have gone stale, and which new topics customers have started to describe in their own words.

How to handle uncertainty and new topics


When complaints are written freely, the model almost never makes the same mistake twice. Today it confuses a refund with a cancellation; tomorrow it sees fraud risk in a complaint about a courier. So the system needs not one answer, but honest handling of uncertainty.

Two thresholds instead of one

Teams often set one general confidence threshold and stop there. That is not ideal. Queue assignment and SLA assignment are better handled separately.

A queue can be assigned at a lower confidence level if the mistake is not very costly. The SLA should be set only where the model is clearly more certain. Otherwise you get a dangerous situation: the request went to “almost the right place,” but the system already promised a response time for a critical case.

A simple example: the text “The money was charged twice, and there has been no answer for three days” looks like both billing and an escalation over delays. The billing queue can be chosen early. A strict SLA for a financial incident is better set only after a stricter check.
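The two-threshold idea fits in a small function. The specific values 0.6 and 0.85 are placeholders; the real ones come from the cost of each mistake:

```python
QUEUE_THRESHOLD = 0.60  # cheaper mistake: a ticket can be re-queued
SLA_THRESHOLD = 0.85    # expensive mistake: a promised deadline is hard to walk back

def decide(category: str, confidence: float) -> dict:
    """Assign the queue early, but promise an SLA only at high confidence."""
    decision = {"queue": None, "sla": None, "needs_review": False}
    if confidence >= QUEUE_THRESHOLD:
        decision["queue"] = category      # e.g. "billing"
    else:
        decision["needs_review"] = True
    if confidence >= SLA_THRESHOLD:
        decision["sla"] = "strict"        # only now does the SLA clock get set
    return decision
```

For the duplicate-charge example above, a 0.7-confidence answer gets the billing queue but no strict SLA yet.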

If the text does not fit known categories well, the model should be allowed to answer “other.” That is better than false precision. For complaint classification, that outcome is more useful than a pretty but wrong label.

How to catch new topics

Manual review is not a backup plan — it is a source of new rules. Keep requests that the model often sends for human review, confuses between two categories, or marks as “other.”

Once a week, review that set and look for repeated patterns. Usually you will quickly spot either new wording for an old problem or a completely new complaint type. After that review, the team can add a new category, refine the old description, write a rule for obvious cases, update examples in the prompt, or adjust the threshold for one route.

There is a simple rule that saves a lot of time: do not change the rules and the prompt on the same day without checking old examples. If you change everything at once, you will not know what actually improved the result and what broke neighboring categories.

It is better to keep a short change log and run the same set of past requests after each update. Then you can see whether the system is mixing up queues less often, whether fewer requests go to manual review, and whether SLA performance has dropped where everything used to work fine.
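That replay can be as simple as running a fixed set of past requests through the current setup and counting changed routes. The classifier here is a toy stand-in for the real rules-plus-LLM pipeline:

```python
def toy_classify(text: str) -> str:
    # Stand-in for the real pipeline, for illustration only.
    return "billing" if "charge" in text.lower() else "general"

def regression_report(golden: list, classify=toy_classify) -> dict:
    """Replay past requests and report which ones no longer get the expected queue."""
    changed = [case["id"] for case in golden
               if classify(case["text"]) != case["expected_queue"]]
    return {"total": len(golden), "changed": len(changed), "changed_ids": changed}
```

Run it after every rule or prompt update; a sudden jump in `changed` is the signal that a neighboring category broke.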

Example: a duplicate charge complaint

The customer writes: “You charged me twice, and there has been no answer for three days.” For a person, there are two problems right away. The first is a possible payment error. The second is that support has been silent for longer than promised.

If the system looks only at one topic, the request can easily go to the wrong place. Billing sees a disputed transaction, but nobody notices the missed response deadline. Or the opposite: the complaint lands in general support, even though the money needs to be checked immediately.

A rule catches the first part well. It looks for words like “charged,” “twice,” “again,” checks the transaction amount, and looks for payment signals: date, order number, card, receipt. If the amount is above the threshold, the rule raises urgency without long reasoning. That case is better not left entirely to the model’s free interpretation.
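A sketch of that rule, with illustrative patterns and an amount threshold that would come from real dispute data:

```python
import re

DUP_WORDS = re.compile(r"charged|twice|again|double", re.I)
AMOUNT = re.compile(r"(\d[\d\s]*)\s*(tenge|kzt)", re.I)
URGENT_AMOUNT = 100_000  # illustrative cutoff, in tenge

def duplicate_charge_rule(text: str) -> dict:
    """Flag a possible duplicate charge and raise urgency for large amounts."""
    hit = bool(DUP_WORDS.search(text))
    m = AMOUNT.search(text)
    amount = int(m.group(1).replace(" ", "")) if m else None
    return {
        "duplicate_charge": hit,
        "amount": amount,
        "urgent": hit and amount is not None and amount >= URGENT_AMOUNT,
    }
```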

The LLM is needed for the second part of the text. It sees that the customer is not only complaining about a charge, but also about there being no answer for three days. A rule often misses that, because people write in different ways: “silence,” “nobody replied,” “they have been ignoring me for three days.” The model brings those phrases together as one theme — support delay.

After that, the system can make several decisions at once. The main queue goes to billing, the SLA gets urgent status, the request gets the label “support response delay,” and a copy goes to service control if needed.

That route is better than the “one complaint — one category” scheme. The billing team gets the request immediately, and the service manager sees that the customer already has a second reason to be unhappy.

Sometimes the text is contradictory. A customer might write: “It was charged twice, but then one charge was refunded, although I’m not even sure now.” Or they may mix payment, bonuses, and old chat history in one message. In those cases, the system should not push a guess through to the end. It marks the request as doubtful and asks an agent to confirm the route manually.

It is exactly on such complaints that the combination of rules and LLM works best: the rule quickly catches the financial risk, and the model notices the second layer of the problem that affects the queue and SLA.

Mistakes that cost a lot


The most common expensive mistake is not the model, but the category scheme itself. The team creates too many similar labels: “payment error,” “duplicate charge,” “payment dispute,” “card refund.” On paper, the difference seems clear, but in live text it quickly blurs. In the end, even people start getting confused, not just the LLM.

If categories are similar, the same request can easily end up in different queues. That often happens when the rules overlap: by the wording, the email looks like billing; by the tone, it looks like an urgent complaint; by the amount, it looks like a VIP case. The system sends the complaint sometimes to finance, sometimes to general support, while the SLA clock runs on its own. That does not help the customer. The team also loses time reassigning the ticket.

Another problem is changes that nobody follows through to the end. The business changes SLAs, adds a new queue, revises priorities, but the prompt and test set stay the same. A month later, the dashboard still looks okay, but complaints are already moving by yesterday’s rules. If you change routing, update the rules, prompt, examples, and validation set at the same time.

Another costly habit is forcing the model to guess what a rule can decide in a second. If the email came from a partner address, if the text contains a case number, if the customer already chose a topic in the form, there is no need to ask the LLM to “understand the meaning” again. Rules and LLM should split the work honestly: rules take the obvious stuff, the model handles free text, mixed cases, and noise.

Worst of all is when nobody reviews disputed cases. Then false escalations pile up quietly: urgent tickets get sent upstairs for no reason, simple questions get extra priority, and real risks sink into the background.

Once a week, it helps to look at at least four groups: requests that went to different queues because of similar signals, cases with low model confidence, escalations that were later canceled manually, and SLA breaches after rule or prompt changes.

A small review often brings more value than another round of fine-tuning. If the system made the same mistake 30 times on the same complaint type, the cause is usually not a “weak model,” but a bad scheme, an old prompt, or overlapping rules.

Quick check before launch


Before going live, check not only the model, but also the operational side. Every category should have a specific owner, its own queue, and a first-response time. If a category has no responsible person or team, that category is not ready for production yet.

The same applies to SLAs. A phrase like “we’ll respond quickly” is not good enough. For a payment blocking complaint, the deadline may be 15 minutes; for a disputed refund, 2 hours; for a general service complaint, one business day. When deadlines are stated clearly, the system stops being a lottery.
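Those deadlines are easiest to keep honest as an explicit mapping. The category names are illustrative, and “one business day” is simplified to a calendar day:

```python
from datetime import timedelta

# First-response deadlines per category, matching the examples above.
SLA = {
    "payment_blocking": timedelta(minutes=15),
    "disputed_refund": timedelta(hours=2),
    "general_service": timedelta(days=1),  # "one business day", simplified
}

def first_response_deadline(category: str) -> timedelta:
    # Unknown categories fall back to the slowest tier, never to "no deadline".
    return SLA.get(category, SLA["general_service"])
```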

Also check the rules separately. Urgent, risky, and legally sensitive cases are better caught before the LLM. If the customer writes about a data leak, fraud, duplicate charging, a complaint to the regulator, or a request to delete personal data, the rule should immediately send the text to the right queue. The model helps here, but should not be the only filter.

The LLM is not very useful either if you ask it for only one label. For real work, you need the category, a confidence score, a short reason for the choice, and a signal that the topic is new or the text does not match known examples well.

Low confidence should lead not to a random queue, but to manual review. That route also needs a deadline. Otherwise, all the disputed requests will start piling up there, and the team will see them too late. A simple rule is often enough: everything below the set confidence threshold gets a human in the loop and a separate SLA.

Another test is often skipped: can the team see its own mistakes? Without that, the scheme degrades quickly, and you only notice it through missed deadlines. The dashboard should show how many requests went to the wrong queue, where the model gets confused most often, and which complaint types produce the most misses.

If you do not have that picture of queues and complaint types, it is too early to launch the system. It may look accurate in tests and still break support work every day.

What to do next

Do not try to cover the entire complaint flow right away. It is better to take one support line and the 3–5 most common topics where mistakes are costly: payments, delivery, refunds, and account blocking. That makes it easier to see where rules already help and where the LLM gets confused.

Test the scheme not on an old training set, but on fresh complaints from the last 2–4 weeks. People change how they phrase things all the time: today they write “charged twice,” tomorrow “the money went through again,” and the day after they attach half a screen of chat history. If you test only on clean examples, the system will look accurate on paper and weak in the live queue.

For a start, a simple cycle is usually enough: take one request flow, keep the number of categories small without extra detail, run fresh complaints blindly, review the disputed cases manually, and then update the rules and model prompts once a week.

Metrics are also better split early. A queue error and an SLA error are not the same thing. If a fraud complaint went to “general questions,” that is one problem. If the topic was identified correctly but the request did not get urgent priority, that is another. When you count those misses separately, the team finds the cause faster: a rule failure, a weak SLA description, or a bad model answer.
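Counting the two miss types separately is a few lines once each case records both the predicted and the true queue and SLA; the field names here are assumptions:

```python
from collections import Counter

def count_misses(cases: list) -> Counter:
    """Count queue errors and SLA errors separately; one ticket can have both."""
    misses = Counter()
    for c in cases:
        if c["predicted_queue"] != c["true_queue"]:
            misses["queue_error"] += 1
        if c["predicted_sla"] != c["true_sla"]:
            misses["sla_error"] += 1
    return misses
```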

Before production, it is worth thinking through the service side as well. You need audit logs so you can later see why the system made a given decision. You need PII masking so card numbers, phone numbers, and addresses do not leak into prompts and logs. And it is better to decide in advance how you will switch models if the current one starts performing worse on new complaint types.
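A minimal sketch of that masking step. These patterns are deliberately rough; production masking needs locale-aware rules and review against real traffic:

```python
import re

# Very rough patterns, for illustration only.
CARD = re.compile(r"\b(?:\d[ -]?){13,19}\b")   # 13-19 digit card-like runs
PHONE = re.compile(r"\+7[\d ()-]{9,}")         # Kazakhstan-style numbers

def mask_pii(text: str) -> str:
    """Replace card and phone numbers before the text reaches prompts or logs."""
    text = CARD.sub("[CARD]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```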

At that point, it is more convenient to have one compatible API gateway instead of building separate logic for each provider. For example, AI Router on airouter.kz provides an OpenAI-compatible endpoint, so the team can change base_url and compare different models without rewriting client code, SDKs, or prompts. That is especially useful if you need to test a new model quickly on real complaints, keep data inside Kazakhstan, and maintain audit logs in one format.

If the pilot shows that 80% of common requests go through without manual edits and urgent complaints do not miss their SLA, the scheme can be expanded to neighboring queues. If not, do not add new categories. First fix the mistakes that are already hurting deadlines and routing.

Frequently asked questions

When is it better to handle everything with rules instead of an LLM?

Use rules where the signal is clear and mistakes are expensive. Double charges, signs of fraud, a complaint to the regulator, data leaks, a large amount, a contract or order number — these are things a rule can catch immediately and send to the right queue without debate.

When does an LLM perform better than a phrase dictionary?

LLMs work well on real-world text where the customer writes messily, mixes several issues, and does not use your category names. The model helps where a phrase dictionary can no longer keep up with new wording and changing processes.

How many categories should we start with?

Start with 10–20 working categories. Pick only the topics that support truly distinguishes in day-to-day work and for which you already have a queue, a response time, and a clear escalation rule.

Should we use different confidence thresholds for routing and SLA?

Yes, if the cost of an error is different. You can assign a queue with a softer threshold, and set SLA only with a stricter one. That way you do not promise urgent handling when the model is still unsure.

What should we do if one complaint has several problems at once?

Do not force the system to choose just one topic at any cost. Let it set a main category, add a second label for the related issue, and, if needed, send a copy to service control or manual review.

How do we know a new complaint topic has appeared?

Watch manual review and requests marked as “other.” If the model often confuses two topics and people describe a new problem in the same wording, the team will quickly spot it in the disputed cases and update the scheme.

When is a human in the loop mandatory?

Bring in a person when the model is not confident, when the rule and the LLM disagree, when the amount is high, or when the topic is risky. Manual review costs less than a wrong route and a broken SLA.

What data should we store so we can fix errors later?

Save the request text, the rule output, the model output, the confidence score, a short reason for the choice, and the final employee decision. Then the team will know what broke: the threshold, the prompt, rule overlap, or the category design itself.

How do we test the scheme before production?

Use fresh complaints from the last 2–4 weeks, not an old training set. First run one flow of requests blindly, then review the misses manually, and only after that change rules or the prompt one at a time.

Why does the team need a single API gateway to test different models?

If you compare models often and do not want to rewrite the integration for each provider, a single gateway makes life much easier. The team changes base_url, runs the same SDK and prompts, and then compares quality, latency, and price faster on real complaints.