Jul 12, 2025·8 min read

Extracting Attributes from Price Lists Without Manual Cleanup

Attribute extraction from price lists helps bring units, brands, and pack sizes into one format, even when suppliers send Excel, PDF, and CSV files in different shapes.

Why price lists break the catalog

Catalogs rarely break because of one big mistake. More often, they are damaged by hundreds of tiny mismatches that arrive with every new supplier file. For a person, these are just different ways of writing the same thing. For a system, they are already different products.

The same item may arrive as "Jacobs Monarch coffee 95 g", "Jacobs Monarch coffee 95g", and "Monarch Jacobs 0.095 kg". The meaning is the same, but the rows do not match. If you load such data without preparation, duplicates appear in the catalog, prices drift apart, and stock gets split across similar product cards.

The problem is not only in the name. Suppliers mix units of measure in one field: kg, g, gr., pack, pcs. One supplier writes the weight with a space, another glues it to the number, and a third converts grams into kilograms. Even simple product comparison starts to fail, because "500 g", "0.5 kg", and "500g" look different.

The brand is also rarely in the same place every time. In one row it comes first, in another it hides in the middle, and in a third it moves to the end after the flavor, color, or series. If the catalog cannot separate the brand from the rest of the text, search and filters quickly lose their meaning.

Pack size creates even more confusion. A supplier may write "12x500 ml", "0.5 l x 12", "12 pcs of 500 ml", or "pack of 12 bottles 0.5 l". For a buyer, this is a normal notation. For import, it is a set of fragments, and it is unclear what describes the volume of one unit, what refers to the number in the case, and what describes the package type.

Column names change just as often. Today it is "Product name" and "Price", tomorrow it is "Item", "Supplier SKU", "Price incl. VAT", "Unit". The day after, part of the fields may even be inside one column. Import cannot guess on its own what lives where.

As a result, the catalog gets the same problems over and over: duplicate cards, broken filters and search, damaged analytics for brands and pack sizes, and manual cleanup after every upload.

A price list almost never arrives as ready-to-use data. It is raw text with meaning inside, delivered in a new form every time.

Which fields to normalize first

When suppliers put everything into one line, the catalog quickly fills with duplicates. It is not enough to just pull text out of the file. You need to split it into fields that can later be compared, filtered, and checked.

First, separate the brand from the product’s general name. In the row "White sugar Almaz 1 kg", the word "sugar" describes the product type, while Almaz is the brand. If you keep them together, search by category and merging identical items will start to get confused. One supplier will put the brand at the beginning, another at the end, and a third will write it in Latin letters.

Next, choose base units of measure and do not change them from file to file. Usually a simple set is enough: l, ml, kg, g, pcs. The most useful rule here is simple: the number is stored separately from the unit. Then "0.5 l", "500 ml", and "0.500 L" can be brought to one format without guessing whether it is the same product.

A minimal field set usually looks like this: the original row without edits, the general product name, the brand, the numeric value and unit of measure, and the quantity in the package. In many cases, that is already enough to stop creating duplicates.
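
The minimal field set above can be sketched as a small record type. This is an illustrative schema, not a fixed one; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedRow:
    raw: str                        # original row, kept unchanged
    name: Optional[str] = None      # general product name, e.g. "sugar"
    brand: Optional[str] = None     # e.g. "Almaz"
    value: Optional[float] = None   # numeric value in the base unit
    unit: Optional[str] = None      # base unit: "g", "ml", "pcs"
    pack_qty: Optional[int] = None  # units per package, if known

# "White sugar Almaz 1 kg" after normalization (weight stored in grams)
row = NormalizedRow(
    raw="White sugar Almaz 1 kg",
    name="sugar", brand="Almaz", value=1000.0, unit="g",
)
```

Keeping `raw` alongside the parsed fields is what makes questionable cases reviewable later.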

Pack size is best moved into separate fields right away. "Apple juice 1 l 12 pcs" contains at least two different attributes: the volume of one unit and the number of units in the box. If you merge them into one field, the system will not understand that 12 refers to the package and not to liters.

It helps to keep two layers of data. The first describes a single sellable unit: 1 l, 500 ml, 250 g. The second describes the shipment package: 6 pcs, 12 pcs, 24 pcs. This greatly reduces errors in purchasing and stock tracking. Otherwise, "water 0.5 l x 12" can easily turn into either 12 liters or 12 separate cards.

Always save the original row. Even if you have already extracted the brand, unit, and pack size, the raw entry is needed to check questionable cases, train rules, and handle complaints from buyers. When normalization makes a mistake, that original row shows where the parsing broke.

In short, you should normalize not the whole product at once, but its parts. Then the catalog lasts longer and needs less manual cleanup.

How to build the dictionary and rules step by step

A normalization dictionary is not written from memory. It is built from fresh files that suppliers are actually sending now. Old exports often only get in the way: they still contain brands, pack sizes, and unit labels that are no longer used.

Start by collecting the latest price lists from all suppliers and putting them into one working set. You do not need a random handful of rows, but a proper sample: Excel, CSV, PDF after OCR, and emails with attachments. The broader the starting set, the fewer surprises after launch.

Then go through the fields and list every variation you see in raw form. The fastest way is to start with three groups: units of measure, brands, and pack sizes. Even at this stage, you can see where most of the noise is: "ml", "ml.", "мл", "liter", "l", "L"; "Nestle", "NESTLE", "Нестле"; "12x500 ml", "12 pcs of 0.5 l", "pack 12".

After that, merge identical values into one list and assign each variant a single canonical value. If you decide to store all volumes in milliliters, do not leave some rows in liters "for convenience". If the brand in the catalog is written as "Nestle", then "Нестле" and "NESTLE" should lead to the same value, not live as three different records.
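
The variant-to-canonical mapping can be as simple as a dictionary keyed by lowercased raw text. A minimal sketch, using the spelling variants from the text; the structure and lookup rule are assumptions.

```python
# Every known spelling variant points at one canonical brand value.
BRAND_CANON = {
    "nestle": "Nestle",
    "нестле": "Nestle",
}

def canon_brand(raw: str):
    """Map any known spelling variant to the canonical brand, else None."""
    return BRAND_CANON.get(raw.strip().lower())

assert canon_brand("NESTLE") == canon_brand("Нестле") == "Nestle"
```

A miss (`None`) is a signal to send the row for review rather than to guess.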

It is better to separate questionable cases from clear ones right away. If the system is not sure, it is cheaper to send the row for manual review than to silently assign the wrong brand or pack size. One questionable row almost always costs less than a hundred damaged product cards.

Another important thing is the order of rules. First clean the row: remove extra spaces, normalize case, replace commas in numbers, and remove service noise if it gets in the way. Only then parse the brand, units, and pack size. If you do it the other way around, parsing will start latching onto garbage and produce extra matches.
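
The cleaning pass described above can be sketched as one function that runs before any brand, unit, or pack parsing. The specific cleanup steps are illustrative; a real pipeline will have more of them.

```python
import re

def clean(row: str) -> str:
    """Pre-cleaning pass: run BEFORE any brand/unit/pack parsing."""
    row = row.strip()
    row = re.sub(r"\s+", " ", row)  # collapse repeated spaces
    row = row.replace(",", ".")     # decimal commas -> dots
    return row.lower()              # normalize case

assert clean("  Coca-Cola   0,5 L  ") == "coca-cola 0.5 l"
```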

A good dictionary is easy to spot by one simple sign: a new file does not break the scheme, but only occasionally adds a few new variants to an already understandable system.

How to normalize units

First, choose one base form for each measurement type and store it in the data, not only in the product name. For weight, this is usually grams; for volume, milliliters; for count, pieces. Then "0.5 kg", "500 g", and "0.500 kg" all become the same value: 500 g.

This is exactly where price list parsing often breaks. One supplier writes "1 l", another writes "1000 ml", and a third writes "1.0 liter". If you leave all of that as is, the catalog gets duplicates and filter errors.

It is useful to store four things separately: the numeric value in the base unit, the original unit from the file, the unit type — weight, volume, pieces, or package — and the normalized record for the catalog. That way you do not lose the source text and can review questionable cases.

For weight, it is best to bring kg and g into one logic right away. If the product arrives as "2 kg", the system converts it to 2000 g. If the row says "250g" or "250 g.", the result is the same: 250 g. All abbreviations like "g", "gr", "g.", "kg", and "кг" should be collected in the dictionary in advance.

The same rule applies to volume. "1 l", "1l", "1000 ml", "1000ml", and "1.0 l" should all be stored the same way, for example as 1000 ml. On the storefront, you can later show "1 l", but inside it is better to keep one number and one base unit.

Decimal numbers also do not need a debate. Price lists contain both "0,5" and "0.5". If the parser understands only one version, some values will simply fall out. The simple rule is: replace the comma with a dot first, then save the number in a numeric field, not as a string.
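
The unit conversion and comma rules above can be sketched together. The factor table and regex are a minimal assumption; real dictionaries hold many more spelling variants.

```python
import re

# Each source unit maps to (base_unit, multiplier).
FACTORS = {
    "kg": ("g", 1000), "g": ("g", 1),
    "l": ("ml", 1000), "ml": ("ml", 1),
}

def to_base(text: str):
    """Parse '0,5 kg' / '500g' / '1.0 l' into (value, base_unit)."""
    m = re.search(r"(\d+(?:[.,]\d+)?)\s*(kg|g|ml|l)\b", text.lower())
    if not m:
        return None
    value = float(m.group(1).replace(",", "."))  # comma -> dot first
    base_unit, factor = FACTORS[m.group(2)]
    return value * factor, base_unit

assert to_base("0,5 kg") == to_base("500g") == (500.0, "g")
assert to_base("1.0 l") == (1000.0, "ml")
```

Note that "ml" must be tried before "l" in the alternation, otherwise "1000 ml" would match as 1000 liters.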

Another problem is "pcs" and "pack". They must not be mixed, even if the supplier did it that way. "12 pcs", "1 pack", "pack", "pcs", and "2 packs of 6 pcs" describe different meanings. In one case it is a sellable unit, in another it is a package, and in the third it is the structure of the pack size. If you put all that into one field, it becomes unclear what is actually being sold.

For questionable records, a simple rule works: store the sellable unit and the pack size separately. Then the row "2 packs x 6 pcs" becomes a clear structure: 2 packs, 6 pieces in each. These small rules are what save hours of manual cleanup later.
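
Turning "2 packs x 6 pcs" into a structure can look like this. The pattern is an assumption and covers only this one notation; each new supplier format gets its own rule.

```python
import re

def parse_pack(text: str):
    """Parse rows like '2 packs x 6 pcs' into a pack structure."""
    m = re.search(r"(\d+)\s*packs?\s*x\s*(\d+)\s*pcs", text.lower())
    if not m:
        return None
    return {"packs": int(m.group(1)), "pcs_per_pack": int(m.group(2))}

assert parse_pack("2 packs x 6 pcs") == {"packs": 2, "pcs_per_pack": 6}
```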

How to parse brands and pack sizes

Brands and pack sizes most often break where the supplier writes the product the way their manager is used to writing it. One row will say "Coca-Cola 12x500 ml", another "Coca Cola 0.5 l x 12", and a third "LLP Romashka / Coca Cola regular 6 x 1 l". If you do not split this into parts, the catalog quickly fills with duplicates.

For brands, it is best to start with a dictionary of synonyms and common mistakes. You need it even for well-known names. Suppliers write "Coca-Cola", "Coca Cola", "CocaCola", and simply "Кока-кола". All these variants should be normalized to one value, and the original text should be kept separately for review.

Legal entity forms in a brand usually get in the way. "LLP", "ТОО", "ООО", "IP", and similar parts belong to the seller or manufacturer in the documents, not to how the buyer searches for the product. It is better to remove them before matching the brand. Otherwise, "LLP Apple City" can easily become a false brand.

You also need to decide where the boundary is between the brand and the line. In the row "Nestle NAN Optipro 3 800 g", the brand can be "Nestle" or "NAN" — it depends on the catalog model. And "Optipro 3" will be a series or variant. If there is no such rule, the same line will end up in the catalog as a brand, a series, and even a model.

Pack size is easier to handle with patterns. Entries like "12x500 ml", "6 x 1 l", "4*250 g", and "2 pcs x 5 kg" describe the same thing: how many units are in the package and what volume or weight each unit has. These parts should be stored separately. For "12x500 ml", a convenient record is: quantity — 12, base volume — 500 ml, total volume — 6000 ml. You can calculate total volume later, but the source parts should not be lost.
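
The pattern-based parsing described here can be sketched with one regex covering the "N x V unit" shapes from the text ("12x500 ml", "6 x 1 l", "4*250 g", "2 pcs x 5 kg"). The regex and factor table are illustrative assumptions.

```python
import re

FACTOR = {"l": ("ml", 1000), "ml": ("ml", 1), "kg": ("g", 1000), "g": ("g", 1)}

def parse_case(text: str):
    """Parse 'NxV unit' pack sizes, keeping the source parts separate."""
    m = re.search(
        r"(\d+)\s*(?:pcs)?\s*[x*]\s*(\d+(?:[.,]\d+)?)\s*(kg|g|ml|l)\b",
        text.lower(),
    )
    if not m:
        return None
    qty = int(m.group(1))
    unit, k = FACTOR[m.group(3)]
    per_unit = float(m.group(2).replace(",", ".")) * k
    # total is derived, but qty and per_unit are stored as source parts
    return {"qty": qty, "per_unit": per_unit, "unit": unit,
            "total": qty * per_unit}

assert parse_case("12x500 ml") == {
    "qty": 12, "per_unit": 500.0, "unit": "ml", "total": 6000.0,
}
```

Because `qty` and `per_unit` survive as separate fields, "6 x 1 l" and "12x500 ml" become comparable instead of looking like different products.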

This approach makes search and comparison much easier. Then "6 x 1 l" and "12x500 ml" no longer look like different products just because of the way they are written. If the row contains both a brand and a series as well as pack size, it is better to parse it in order: first remove service noise, then find the brand in the dictionary, then extract the series, and only after that parse the package.

Example with one product in three files

If you look only at the original rows, it seems like three different products. One supplier writes "BrandX water 0.5 l 12 pcs", another writes "Water Brand X 500 ml pack 12", and a third writes "BrandX water 12x0.5 L". Without normalization, the catalog often creates three cards, and purchasing later gets duplicates and price confusion.

The point of parsing is to separate the product from the writing style. In these three rows, the word order, language, spaces inside the brand, and the pack-size notation all change. The product itself is the same.

First, the system extracts the brand. "BrandX" and "Brand X" need to be merged into one form, otherwise the brand splits into two records. Usually a dictionary of common spelling variants and a simple rule for joining spaces inside known names is enough.
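
The "joining spaces inside known names" rule can be sketched by stripping spaces and hyphens before the dictionary lookup. `KNOWN_BRANDS` here is a hypothetical dictionary you maintain yourself.

```python
KNOWN_BRANDS = {"brandx": "BrandX", "cocacola": "Coca-Cola"}

def match_brand(raw: str):
    """Collapse spaces/hyphens so 'Brand X' and 'BrandX' hit one entry."""
    key = raw.replace(" ", "").replace("-", "").lower()
    return KNOWN_BRANDS.get(key)

assert match_brand("Brand X") == match_brand("BrandX") == "BrandX"
```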

Then the system brings the volume to one unit. The entries "0.5 l", "500 ml", and "0.5 L" all describe the same volume — 500 ml. After that, the pack size is parsed: "12 pcs", "pack 12", and the pattern "12x0.5 L" all point to a package of 12 units of 500 ml each.

What the result looks like after parsing

After normalization, the card no longer looks like a price list row, but like a set of fields:

  • brand: BrandX
  • product type: water
  • volume of one unit: 500 ml
  • pack size: 12 pcs

The word "water" is not a problem here either. If you have a synonym dictionary, the system will map "water" and "вода" to the same product group. Without that, the same item can easily end up in different categories.

In practice, this parsing solves a very simple task: all three rows can be merged into one product card. After that, you compare supplier prices for one item instead of manually searching for where "Brand X" matches "BrandX".

Where teams most often make mistakes

Most failures start not with hard rows, but with overly bold assumptions. The team sees a column called "Brand" or "Volume" and believes that is exactly what sits inside. In practice, "Volume" may easily contain "6 pcs", and "Comment" may contain half of the needed attributes. Until you look at the actual rows, the parsing scheme will be built on wrong assumptions.

Another common mistake is mixing up product size and transport packaging. The row "water 1.5 l x 6" describes a 1.5-liter bottle and a box of 6 pieces, not a product with a volume of 9 liters. If the system puts that into one field, the catalog breaks quickly: filters show strange pack sizes, purchasing compares incompatible items, and the warehouse receives extra boxes.

A bad habit is overwriting the original text after cleanup. Suppose you turned the row "0.5l/12 pcs" into "500 ml". It looks neat, but the trace of the error is gone: now it is unclear where the "12 pcs" went and why the product looks like a single item. The raw text, the parsing result, and the reason for the edit should live side by side, not replace one another.

Automatic correction of questionable rows without a review queue is also expensive. A model or rule may guess the brand correctly in 95% of cases, but the remaining 5% are often the most painful. If parsing uses an LLM through an OpenAI-compatible gateway like AI Router, the rule remains the same: the model should not silently edit a row when confidence is low.

There is a simple minimum that protects you from repeat mistakes:

  • keep the original row unchanged;
  • separate the sellable unit and the case;
  • mark doubtful parses with a confidence level;
  • send questionable rows into a review queue;
  • record the dictionary version and rules for each upload.
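
The confidence-based split from the list above can be sketched as a simple routing step. The 0.9 threshold is an arbitrary example, not a recommendation.

```python
CONFIDENCE_THRESHOLD = 0.9

def route(rows):
    """Split parsed rows into auto-accepted and review-queue lists."""
    accepted, review = [], []
    for row in rows:
        (accepted if row["confidence"] >= CONFIDENCE_THRESHOLD
         else review).append(row)
    return accepted, review

rows = [
    {"raw": "water 0.5 l x 12", "confidence": 0.97},
    {"raw": "??? 12 pc pack 0.5", "confidence": 0.41},
]
accepted, review = route(rows)
```

The point is that the low-confidence row never silently enters the catalog; it waits for a person.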

The last problem looks boring, but it hits hard. The team changes the brand dictionary, adds a new rule for "g", "gr", and "g.", and a week later nobody remembers what exactly changed. Without versioning, you cannot tell after which step the errors grew or how to roll back a bad change.

Do not trust the file at face value. Look at real rows, keep a trace of every decision, and leave a human the right to review anything doubtful. That is cheaper than fixing the catalog by hand later.

Quick check before upload

Before import, it helps to look not only at the final columns, but also at a pair of "before" and "after" values for each row. If the system turned "12 pcs x 0.5 l" into clean fields, a person should quickly understand where each piece came from. Otherwise errors stay hidden until the product is already in the catalog.

A good check takes a few minutes and catches most failures. It is enough to go through five points:

  • keep the original text next to the normalized result;
  • keep all units of measure in one reference list, not in rules spread across different files;
  • pass brands through a synonym dictionary so that "P&G", "Procter and Gamble", and the Russian spelling all map to one value;
  • split pack size into quantity and base unit separately;
  • put questionable rows into a separate list instead of mixing them with clean ones.

Pack size is especially likely to break. The rows "6x1.5l", "6 x 1.5 l", and "pack of 6 bottles, 1.5 l each" look different, but the meaning is the same. If you store quantity 6 and base unit 1.5 l separately, it becomes easier to calculate price per liter, compare products, and find duplicates.
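
The price-per-liter calculation becomes a one-liner once quantity and base volume live in separate fields. The numbers below are made up for illustration.

```python
def price_per_liter(case_price: float, qty: int, unit_ml: float) -> float:
    """Price per liter for a case of qty units of unit_ml milliliters each."""
    return case_price / (qty * unit_ml / 1000)

# "6x1.5l" case priced at 9.0 -> 9 liters total -> 1.0 per liter
assert price_per_liter(9.0, qty=6, unit_ml=1500) == 1.0
```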

The same rule applies to units. One reference list for "kg", "кг", "kilogram", "g", "г", and other variants removes half the noise. Do not make the model or the rules guess again every time. Let them map the row to the values you have already accepted.

The list of questionable rows is not for reporting. It saves the operator time. Instead of reviewing the whole file, the person opens only the 3–7% of rows where the system is unsure: it did not find a brand, saw two possible pack sizes, or met a new unit.

The final review screen should show one simple number: what share of rows the system parsed without manual editing. If it was 92% yesterday and 61% today, the file is better left unimported. That signal is more useful than a long report, because it immediately shows what changed: the rules, the dictionary, or the supplier format.
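
That single review number is easy to compute and gate on. The 0.90 threshold and the `needs_edit` flag are assumptions for the sketch.

```python
def auto_parse_share(rows) -> float:
    """Share of rows the system parsed without manual editing."""
    parsed = sum(1 for r in rows if r.get("needs_edit") is False)
    return parsed / len(rows)

# Yesterday's file was 92% clean; today's dropped to 61%.
rows = [{"needs_edit": False}] * 61 + [{"needs_edit": True}] * 39
share = auto_parse_share(rows)
if share < 0.90:
    print("do not import: parse rate dropped, check rules and dictionary")
```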

What to do after the first run

After the first pass, do not try to cover the whole catalog at once. It is better to narrow the task: take one product category and the 1,000 most frequent rows. That way you will quickly see where the system already works and where it confuses the brand, unit, and pack size.

That kind of start gives you an honest picture. If those 1,000 rows need a lot of manual edits, there will be even more errors in the rest of the files. If the frequent items parse cleanly, you can move forward without unnecessary noise.

Once a week, it helps to review all new unknown variants that accumulated after uploads. Do not leave this for a month. After a week, the list is usually still short, and the team can understand where the new entries came from: a new supplier, a strange abbreviation, extra text in the name, or a different pack size.

A simple cycle helps here: collect rows that did not parse or got low confidence, group them by error type, add new rules and synonyms, and then rerun only the problematic set. After that, check whether the number of duplicates increased.

On the dirtiest rows, it is useful to compare two approaches: rules and LLM. Rules usually behave consistently on clear patterns, while a model is better at pulling meaning out of chaotic names like "White sugar bag 25kg GOST" or "Coca Cola can 0.33 x12". In practice, one thing rarely wins on its own. Most often, the best result comes from a combination: simple rules first, then a model for the hard cases.

It is worth saving metrics from day one, even if the process is still rough. The minimum set is this: the error rate after parsing, the share of manual edits, the number of duplicates after normalization, and the share of rows the system could not parse. If these numbers do not go down after two weeks, the problem is usually not the volume of data, but weak rules or poor parsing of rare formats.

If the team is testing several models, it is convenient to run the same set of rows through one OpenAI-compatible API and compare the result using logs. For example, AI Router lets you test different models through one endpoint and keep audit logs without changing your usual SDK and code. That is useful when you need to compare parsing quality on the same dataset instead of arguing by feel.

Frequently asked questions

Can I upload price lists into the catalog as they are?

No. First bring the product name, brand, units, and pack size into one format. Otherwise the catalog will end up with duplicates, prices will drift apart, and stock will split across similar cards.

Which fields should be extracted first?

Usually the source row, the overall product name, the brand, the number and unit of measure, and the quantity in the package are enough. That set already helps merge identical items and catch import errors.

What units are best for weight and volume?

For weight, it is better to store grams, for volume — milliliters, and for count — pieces. Then 0,5 kg, 500 g, and 0.500 kg all become the same value, and comparisons work without extra conditions.

Why save the original row after normalization?

It is needed for checking borderline cases. If a rule makes a mistake, you can immediately see what came from the supplier and where the parsing went wrong.

How do you avoid mixing up product size and transport packaging?

Separate them right away. A product has its own weight or volume, while a shipment has the number of units in a carton or package. For a row like 1.5 l x 6, that means one 1.5-liter bottle and a box of 6 pieces, not a 9-liter product.

What should I do with rows the system parsed with low confidence?

Don’t silently fix them. Mark such rows with a confidence level and send them to a separate review queue. That way you can correct a few difficult records by hand without damaging hundreds of cards.

Do I need a separate brand dictionary?

Yes, without it brands quickly split into variants like Nestle, NESTLE, and Нестле. The dictionary should include synonyms, common typos, and remove legal forms like ТОО or ООО if they get in the way of search.

When should I use rules and when should I use an LLM?

Start with rules. They handle clear patterns like 12x500 ml or 250 g well. Use a model where names are messy and word order changes. If the model is unsure, send the row to a person instead of the catalog.

How can I quickly check parsing quality before uploading?

Look at the before and after pairs for a sample of rows. If the system suddenly parses a file much worse than usual, stop the import and check the dictionary, the rules, or the supplier’s new format.

Where should I start after the first normalization run?

Don’t take the whole catalog at once. It is easier to start with one category and the most frequent rows, then review new unknown variants once a week. At the same time, track manual edits, duplicates, and unparsed rows — those numbers show whether things are improving.