
Extracting Tables from PDFs: How to Build Clean Data

Extracting tables from PDFs takes more than parsing: you also need line normalization, total checks, and manual review of ambiguous cases.


Why a table in a PDF falls apart

A PDF almost never stores a table as a real table. Most of the time it is just a set of separate text blocks placed at coordinates on the page. A person sees a neat grid, but the program sees words and numbers sitting next to each other. Because of that, when you extract tables from a PDF, column boundaries are easy to lose: a total slips into the next column, a date sticks to the title, and one record suddenly splits into two.

The problem starts with the file type itself. If the PDF was created from text, the parser can still rely on character positions, spacing, and grid lines. If it is a scan or an image, there are no cells at all. First you need to run text recognition on the image, and only then try to build a table. And that is where an 8 turns into a 3, and a dot becomes a comma. For a report with totals, that is not a small issue.

Most often, a table breaks because of a few common things:

  • a long service name wraps to a new line, and the parser treats the continuation as a new record;
  • a footnote below the table gets pulled into the last row and corrupts the total;
  • a merged header cell shifts columns to the left or right;
  • an empty cell looks like the end of a row, even though the data is simply missing.

In reports, this is especially noticeable in subtotal sections. An expense block may end with a row labeled "Total", and a note in small type may sit underneath it. If the parser captures both lines as one, the totals check will show a false mismatch.

In pricing tables and contract appendices, the failure is even more visible. A line item name often takes two or three lines, with the rate, unit of measure, VAT, and term beside it. Shift one column, and the price ends up attached to a different service. Sometimes a table merges cells for a section title, such as a service package or a group of rates. Then the program cannot tell where the section ends and where the normal rows begin.

That is why parsing PDF tables rarely comes down to an "export" button. First you need to understand what you are looking at: a text-based PDF, a scan, a complex layout, or a mix of all three. Only then does it make sense to gather rows, bring fields into one format, and check whether the numbers add up.

How to tell what type of PDF you have

Start by checking what is in front of you: a text-based PDF or a scan. Almost your entire approach depends on that. If the file is text-based, the program sees letters, numbers, and object boundaries. If it is a scan, the page usually contains only an image, even if the text looks sharp.

The quickest check is simple: try selecting text with the mouse. If selection follows words and lines, regular parsing often gives a decent result. If the cursor only grabs the whole area like an image, OCR is unavoidable.

There is also an in-between case. A contract or report is often assembled from different pages: part exported from Excel, part scanned later, signed, and saved again as a PDF. In that case, one page may parse well while the next one breaks the entire flow. You need to check not only the file as a whole, but each page with a table.
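
The same check can be scripted so you do not open every page by hand. A minimal sketch using pdfplumber (one library choice among several; the 50-character threshold is an assumption, not a standard):

```python
import pdfplumber

def classify_pages(path: str) -> dict[int, str]:
    """Label each page 'text' or 'scan' by the amount of extractable text."""
    labels = {}
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            # A page with almost no extractable characters is most
            # likely a scanned image and will need OCR.
            labels[i] = "text" if len(text.strip()) > 50 else "scan"
    return labels
```

Running this over a mixed contract quickly shows which pages can go through regular parsing and which need the OCR path.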

Before you start extraction, it helps to quickly check a few things:

  • whether text can be selected by line or cell;
  • whether numbers copy without extra line breaks or spaces;
  • whether stamps, signatures, or seals cover totals or column names;
  • whether there is tilt, shadow, gray background, or low contrast;
  • whether all pages in the document are structured the same way.

Stamps and signatures cause more trouble than you might expect. A round seal can cover a couple of digits in a total, and a signature over a table line can break the column layout. Date stamps often land right inside cells and then look like extra numbers. If those obstacles exist, it is better to mark the problem areas right away instead of expecting a clean result on the first pass.

Regular parsing works when the text is already embedded in the PDF and the table is reasonably neat. OCR is needed when the page is stored as an image, text cannot be selected, or the scan is poor quality. If the document is mixed, the sensible approach is to use two flows: process text pages directly and send scans through OCR. That reduces errors, and it also reduces manual cleanup later.

What to use to extract the table

The same PDF can be handled in different ways, but there is no single universal tool. The choice usually comes down to two questions: can the text be selected, and how stable is the table layout?

For a one-off task, simple copying may be enough. For a stream of documents, it is better to choose a method that gives predictable results rather than a beautiful first run.

  • Copying from a PDF is fine for a quick test if the file is digital, the table is small, and the columns do not fall apart when pasted.
  • A coordinate-based parser works well on standard documents: recurring reports, pricing sheets, and contract appendices with the same layout.
  • OCR is needed for scans and pages stored as images. Without it, you cannot even reach the cells, but it is also the step that most often misreads numbers and small text.
  • LLMs are useful in borderline cases where there are no clear boundaries, rows have merged, and the cells contain notes or footnotes.

Usually the best order is this: first try to get text and coordinates, then add OCR, and keep the LLM for difficult fragments and normalization. It is cheaper, clearer, and easier to review.

On multi-page tables, it is usually not the extraction itself that fails, but the page stitching. The first page may contain the full header, the next only a continuation, and the third the total. If the columns match, the field order stays the same, and the rows continue logically, the pages can be combined into one table. If the column widths shift, headers change names, or a row is cut off at a page break, that fragment should be treated as questionable right away.
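
Before concatenating fragments, it helps to write the stitching rule down explicitly. A sketch, deliberately strict about column count and header names; the rule itself is an assumption about your layouts:

```python
def can_stitch(prev_header: list[str], next_header: list[str] | None,
               prev_width: int, next_width: int) -> bool:
    """Decide whether a table fragment continues the previous page.
    Continuation pages often repeat the header or omit it entirely."""
    if next_width != prev_width:
        return False  # column count changed: treat the fragment as questionable
    if next_header is None:
        return True   # headerless continuation with a matching column count
    return ([h.strip().lower() for h in next_header]
            == [h.strip().lower() for h in prev_header])
```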

OCR and LLMs are especially helpful at page joins, where one row is split in half. But this is where you need control: the model may "add" something that is not in the document. That is why it is better to send only the problem spots for reprocessing, not the entire PDF package.

You should also build a manual review queue right away. It usually includes:

  • rows where the total does not match the quantity and price;
  • numbers where OCR produced low confidence;
  • cells that split into two lines;
  • rows that start on one page and end on another;
  • descriptions that contain a footnote, seal, signature, or handwritten note.

That way you do not review the whole document, only the risk points.
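
In code, that queue can start as one small function that collects reasons instead of silently dropping rows; the field names and the 0.8 confidence cut-off are assumptions, not recommendations:

```python
from decimal import Decimal

def needs_review(row: dict) -> list[str]:
    """Collect the reasons a parsed row should go to the manual queue."""
    reasons = []
    qty, rate, total = row.get("qty"), row.get("rate"), row.get("total")
    if None not in (qty, rate, total):
        if Decimal(qty) * Decimal(rate) != Decimal(total):
            reasons.append("total != qty * rate")
    if row.get("ocr_confidence", 1.0) < 0.8:
        reasons.append("low OCR confidence on numbers")
    if row.get("split_across_pages"):
        reasons.append("row crosses a page break")
    if row.get("has_footnote_or_stamp"):
        reasons.append("footnote, seal, or signature in the row")
    return reasons
```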

The workflow from file to table

Reliable results do not appear when the parser "finds something", but when you have a repeatable workflow. One PDF may contain a neat table on one page, a cut-off tail on the next, and a footnote that accidentally sticks to the last row. If you do not break the work into stages, errors pile up very quickly.

First split the document into pages and identify, for each page, the area where the table actually is. That matters more than it seems. Headers, appendices, signatures, and small notes often sit nearby. If you capture the whole page, they will mix with the data and ruin the row parsing.

Then build the table by coordinates, not just by text. The parser needs to understand where column boundaries run, which words sit on the same line, and where a row continues after a wrap. If a cell has two lines of text, it is best to merge them immediately using a clear rule. Otherwise, a tariff description suddenly becomes a separate record.
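
With pdfplumber, for example, the column boundaries can be passed in explicitly instead of letting the library guess them from whitespace. The coordinates below are placeholders for values measured once from a reference document:

```python
import pdfplumber

# x-positions of the column edges, measured from a reference file;
# placeholders, not universal values.
COLUMN_EDGES = [30, 330, 390, 460, 540]

with pdfplumber.open("appendix.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table({
        # Force the known column boundaries rather than trusting
        # whatever whitespace this particular page contains.
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": COLUMN_EDGES,
        "horizontal_strategy": "text",
    })
```

On recurring reports and pricing sheets with a fixed layout, pinning the edges like this removes most of the column-shift errors described above.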

It helps to save not only the final table but also supporting data (one possible record shape is sketched after the list):

  • page number;
  • table area coordinates;
  • original cell text before cleanup;
  • parser confidence or a questionable-reading flag.
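
A sketch of that record, as one possible shape rather than a required schema:

```python
from dataclasses import dataclass

@dataclass
class CellRecord:
    """One extracted cell plus the provenance needed for review."""
    page: int                                 # page number in the source PDF
    bbox: tuple[float, float, float, float]   # table-area coordinates
    raw_text: str                             # original cell text before cleanup
    value: str                                # normalized value
    questionable: bool = False                # low confidence or odd reading
```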

Next, bring the values into a single format. Numbers often come in as "1 250,00", "1250.00", and "1.250,00" in neighboring files. Dates may be in the format "01.02.2025" or "2025-02-01". Currency is best stored separately from the amount so that "12 000 KZT" does not live in the same field as the number. It is a boring step, but without it the data is hard to filter, calculate, and reconcile.

A good habit is to keep the source fragment nearby. It can be the cell text before normalization or a small crop of the row from the PDF. When an accountant or lawyer sees a mismatch, they should be able to open the disputed spot quickly and check it without parsing the whole document again.

Do not hide questionable cells. Mark them right away: weak OCR reading, empty amount with currency present, date outside the expected range, a row without a tariff code, or a total that does not match VAT. In a contract appendix, that is especially important: one badly read digit can change the price by a factor of ten.

This workflow is slower at the start, but it saves hours later. You are not just getting a table; you also know where each row came from and what exactly needs manual checking.

How to make the data consistent


After extracting tables from a PDF, raw rows almost always look similar, but they do not match exactly. The same price may arrive as "10 000,50", "10.000,50", or "10000.50", and a service name may appear as "24/7 support", "24 7 support", or as two lines inside one cell.

First merge line breaks where the text clearly belongs to one cell. If the tariff line "Subscription\nfee" split into two pieces, it should be put back together before any comparison. Otherwise the table looks complete, but duplicate checks and grouping produce junk.

Then remove extra spaces and hidden characters. PDFs often contain non-breaking spaces, tabs, and invisible line breaks. The eye does not notice them, but code treats the strings as different. The simplest rule works best: keep normal spaces between words, remove double spaces, and bring characters into one consistent set.

Numbers and labels

Numbers require strict discipline. Choose one format and convert all values to it: one decimal separator, no spaces inside numbers, no currency symbols in the same field. If one document contains "15,5%", "0.155", and "15.50 %", decide in advance whether you are storing the percentage as text for display or as a number for calculations.

VAT, currencies, and ranges should be handled separately. "1000 KZT without VAT", "1 000 tg", "1000 tenge", and "1000 + 12%" are different cases. It is better to keep price, currency, and VAT flag in separate columns. Ranges like "from 100 to 500" or "1-10 users" should also not stay as one string if you later build reports.

Units of measure work the same way. "pcs.", "pieces", "unit", "user/month", and "users/mo." are better reduced to one dictionary. The same applies to service names. If a contract appendix and a monthly invoice name the same service slightly differently, a normalized synonym list cuts manual reconciliation down a lot.
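
Much of this cleanup fits in two small helpers. The number parser below assumes that a separator followed by exactly two digits is the decimal mark; that matches the formats quoted above but may need adjusting for other conventions:

```python
import re
from decimal import Decimal

def clean_text(s: str) -> str:
    """Merge line breaks inside a cell and normalize whitespace,
    including non-breaking spaces and tabs."""
    s = s.replace("\u00a0", " ").replace("\t", " ").replace("\n", " ")
    return re.sub(r" {2,}", " ", s).strip()

def parse_amount(s: str) -> Decimal:
    """Parse '1 250,00', '1.250,00', and '1250.00' into one Decimal."""
    s = clean_text(s).replace(" ", "")
    m = re.fullmatch(r"(\d{1,3}(?:[.,]?\d{3})*)[.,](\d{2})", s)
    if m:
        digits = m.group(1).replace(".", "").replace(",", "")
        return Decimal(digits + "." + m.group(2))
    # Without a two-digit decimal tail, a string like '120.000' is
    # ambiguous (120 000 vs 120). This sketch treats clean groups of
    # three as thousands and otherwise reads the separator as decimal.
    if re.fullmatch(r"[1-9]\d{0,2}(?:[.,]\d{3})+", s):
        return Decimal(re.sub(r"[.,]", "", s))
    return Decimal(s.replace(",", "."))
```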

Usually five basic rules are enough:

  • merge line breaks inside a cell;
  • clean up spaces and hidden characters;
  • bring numbers to one format;
  • move VAT, currency, and ranges into separate fields;
  • match service names and units of measure against a single reference list.

For teams that regularly receive tariffs and appendices as PDFs, this is no longer cosmetic. Without tabular data normalization, you cannot honestly compare document versions, verify totals, or build one table for accounting.

How to check totals and catch mismatches

Even good PDF table parsing usually breaks not on columns, but on arithmetic. After extraction, it helps to recalculate everything as if the document had no totals at all.

Start by checking the sum of the rows against the section total. If a block contains ten telecom service lines, add them yourself and compare the result to the number at the bottom. This quickly catches a missing line, a shifted cell, or a number that moved into the next column.

It is better to handle VAT separately. Keep three fields: amount before tax, VAT, and amount after tax. Then it is immediately visible where the document mixed rates, where the parser took the total with VAT instead of the base amount, and where 12% was calculated from an already increased sum.

What usually causes errors

  • rounding in each row instead of in the final total;
  • a duplicate row after the table breaks across a new page;
  • a "Total" row inside the body of the table that the parser treated as a normal item;
  • different totals in the main table and the contract appendix.

Rounding often looks worse than it really is. If the rows add up to 125 430.01 and the section total is 125 430.00, that is usually acceptable. But if the difference reaches 1-2 tenge and repeats across several sections, there is almost always a duplicate or a misread digit somewhere.

It also helps to compare totals across pages and related documents. In a pricing appendix, the monthly totals may match within one table but still not line up with the overall cap in the contract or invoice. That kind of control is especially useful when the same tariff appears in the appendix, act, and report in slightly different form.

For questionable rows, set a simple manual review threshold. For example, send everything with a difference above 0.5% or above 100 tenge for review. Small deviations can be marked as rounding, and anything above the threshold can be checked by a person in a minute.

In practice, it looks simple. If a contract appendix has five tariff lines at 20 000 tenge each before VAT, the pre-tax total should be 100 000, VAT should be 12 000, and the grand total should be 112 000. Any other number is a reason to check the line, the rate, or a duplicate.
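
Written down once as code, the rule stops drifting between spreadsheets. The numbers are the ones from the example, and the threshold values repeat the 0.5% / 100 tenge suggestion above rather than a recommendation:

```python
from decimal import Decimal

VAT_RATE = Decimal("0.12")            # the 12% rate used in the example

rows = [Decimal("20000")] * 5         # five tariff lines before VAT
net = sum(rows)                       # 100 000
vat = net * VAT_RATE                  # 12 000
gross = net + vat                     # 112 000
assert gross == Decimal("112000")

def within_rounding(diff: Decimal, base: Decimal) -> bool:
    """Treat a difference as rounding only if it stays below both
    thresholds; anything larger goes to manual review."""
    return abs(diff) <= min(base * Decimal("0.005"), Decimal("100"))
```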

Example with tariffs from a contract appendix


In a contract appendix, a table often looks simple until you try to pull it into CSV. On paper everything is clear: service, volume, rate, total. After parsing, one row becomes two, the currency disappears, and the section total no longer adds up.

Let's take a small example. The document has a section with communication services, and the currency is placed in a note below the table: "All rates are in KZT without VAT." That means the currency field should come from the metadata of that block, not from the rows.

Section: Services

| Service                                    | Qty   | Rate   | Total   |
|--------------------------------------------|-------|--------|---------|
| Data transmission channel                  | 2     | 120000 | 240000  |
| for backup site                            |       |        |         |
| Technical support 24/7                     | 1     | 180000 | 180000  |
| Dedicated IP address                       | 12    | 30000  | 360000  |
| Section total                              |       |        | 780000  |

Note: All rates are in KZT without VAT.

If the parser takes this as-is, the line "for backup site" should not be treated as a new service. It should be attached to the previous line, because it has no quantity, rate, or total. After merging, the name becomes: "Data transmission channel for backup site".
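
A possible merge rule in code; the field names are illustrative:

```python
def merge_wrapped_rows(rows: list[dict]) -> list[dict]:
    """Attach continuation lines to the previous row. A row with no
    quantity, rate, or total is assumed to be a wrapped service name."""
    merged = []
    for row in rows:
        is_continuation = not any(
            (row.get(field) or "").strip()
            for field in ("qty", "rate", "total")
        )
        if is_continuation and merged:
            merged[-1]["service"] += " " + row["service"].strip()
        else:
            merged.append(dict(row))
    return merged
```

Note that the "Section total" row keeps its place: its quantity and rate are empty, but its total is filled in, so it is not treated as a continuation.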

Then normalization comes in. Numbers are brought into one format: spaces are removed, commas are changed to dots if needed, and the amount is stored as a number. The KZT currency is assigned to all rows in the section from the note, rather than waiting for it to appear in every cell.

The totals check here is very simple, which is exactly why it is useful. You recalculate each row and then compare the section total:

  • 2 x 120000 = 240000
  • 1 x 180000 = 180000
  • 12 x 30000 = 360000
  • Sum of rows = 780000

The total matches. So the table was probably assembled correctly. If the parser had left the wrapped line as a separate row, the row totals might still add up, but the export would contain an empty service. For reporting and loading into an accounting system, that is already a problem.

Not every row should be handled automatically. It is better to give a person cases where:

  • a row has no quantity but does have a number in the "Total" column;
  • the rate looks like "as per price list" or "by agreement";
  • a note inside the table looks like a normal row;
  • two services sit in one cell because of a wrap;
  • the section total does not equal the sum of the rows.

These are the places that usually require the most manual cleanup.

Where errors happen most often

The most expensive mistake in PDF table extraction is trusting the first OCR result. The text may look plausible, but the structure is already broken: the header moved into the next column, the amount landed in the description, and an empty cell became part of the next row.

That is why it is not enough to recognize characters. You need to check the table itself: how many columns it has, where each row begins, whether headers match, and whether totals have shifted. It is worth running the same file through at least a structural check instead of stopping at text recognition.

Errors often hide in the small things. A minus sign before a number disappears more often than you would think. Thousand separators are also lost regularly: 120 000 becomes 120000, and sometimes 120.000 if the parser mixed up the format, a value a downstream system may read as 120. For a report, that is a different number; for a tariff in a contract appendix, it is a different meaning.

Another common issue is merging adjacent rows. This happens when a description contains a line break and the system decides it is one long record. In the end, the service row and the discount row stick together, and instead of two items you get one. After that, totals match less often, and finding the cause takes longer.

Footnotes are especially annoying. At first glance the rate may look the same, but the footnote changes the calculation rule: the price is shown without VAT, the discount applies only to the first month, or the night rate is calculated separately. If you miss that note, the table will still look neat, but the error will already be inside the data.

A short list of typical failures is useful to keep nearby:

  • OCR recognized the text but broke the column boundaries;
  • the minus sign, thousands space, or decimal comma disappeared;
  • two rows merged after a wrapped description;
  • a footnote changed the rate, and the parser ignored it;
  • an editor manually fixed the table and overwrote the source.

That last point is often underestimated. Do not overwrite the source data after editing. Keep the original separately, and store corrections in a new version with a note about who changed what. Otherwise, a week later, it will no longer be clear whether the error came from the PDF, OCR, or a manual edit.

Problems rarely start with a major failure. Usually it is the small losses that cause trouble: one minus sign, one footnote, one merged row. Later, those are the things that break totals checks in reports.

Quick checks before exporting


Five minutes before export often saves an hour of manual cleanup. After parsing, the errors usually hide in small things: an empty cell shifts a column, a date changes format, or a number looks numeric but is stored as text.

Pre-export checklist

  • Review full columns. If empty cells appear in the middle of a column, check whether values shifted left or right.
  • Open a few numeric columns and try sorting them or calculating a sum. If "1 250,00" behaves like text, filters, formulas, and pivot tables will break later.
  • Reconcile the subtotals with the numbers in the document. First by section, then for the entire file. Even a 0.01 mismatch often points to a wrong separator, a lost minus sign, or a missing row.
  • Mark questionable rows right away. A "review manually" tag is better than a silent error in the final export.
  • Keep a trace back to the source: page number, file name, and a small crop of the original table.

Also check that dates and currencies use one consistent format. If one part of the table stores a date as "10.02.2024" and another as "2024-02-10", sorting and merging will behave badly. The same goes for currency: "tenge", "KZT", and the currency symbol should be normalized before export, not after.
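
Both checks are easy to automate before export. A sketch that accepts the two date formats quoted above and reduces currency spellings to one code; extend the tables if your documents use more variants:

```python
from datetime import date, datetime

def parse_date(s: str) -> date:
    """Accept '10.02.2024' and '2024-02-10' and return one type."""
    for fmt in ("%d.%m.%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(s.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {s!r}")

CURRENCY_ALIASES = {"tenge": "KZT", "tg": "KZT", "₸": "KZT", "kzt": "KZT"}

def normalize_currency(s: str) -> str:
    """Map common spellings and the currency symbol to one code."""
    cleaned = s.strip().lower()
    return CURRENCY_ALIASES.get(cleaned, cleaned.upper())
```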

Errors are especially visible in tariffs and contract appendices. One row may contain the price, the period, and a note such as "included in the package". The parser sometimes puts the note into the price column and leaves the price as text. If you check such spots early, you will not need to fix the report manually at the last moment.

If the table later goes into an accounting system, BI, or an LLM workflow, saving the page number and a source crop makes disputes much easier to resolve. For teams that need auditability and change history, that trace is often more useful than the export itself.

What to do next if you have lots of documents

When the number of PDFs grows from ten to two hundred a week, manual cleanup breaks the whole process. At that point, you do not need "one more script"; you need a simple flow where every file follows the same path and does not get lost between folders, emails, and Excel.

A practical setup usually looks like this:

  • the file enters an incoming queue with a document number and date;
  • the system extracts tables and immediately identifies the file type: text PDF or scan;
  • checks look for empty rows, broken columns, total errors, and odd units of measure;
  • the finished table goes to export, and questionable spots go to a separate review queue.

It is better to split documents into two flows right away. Process a simple PDF with a text layer using strict rules: table detection, column parsing, date normalization, and currency and number cleanup. Send scans through a different path: OCR, stricter totals checks, and mandatory marking of places where the system is not confident about the characters.

If you mix these types in one scenario, errors build up quickly. A clean PDF usually breaks because of merged cells, while a scan breaks because of one badly read digit that changes the total across the entire page.

Error log

Do not dump questionable cases into a shared folder. Keep a log with the file ID, page, table row, failure reason, and parsing status. Mark total mismatches, missing headers, and rows where the text has merged into one cell separately.
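
A flat CSV is enough to start with; the field set below mirrors the list above:

```python
import csv
from dataclasses import asdict, dataclass, fields

@dataclass
class ParseIssue:
    """One row of the error log."""
    file_id: str
    page: int
    row: int
    reason: str   # e.g. "total mismatch", "missing header", "merged cell"
    status: str   # e.g. "open", "reviewed", "fixed"

def append_issue(log_path: str, issue: ParseIssue) -> None:
    """Append one issue to the log, writing the header on first use."""
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        names = [x.name for x in fields(ParseIssue)]
        writer = csv.DictWriter(f, fieldnames=names)
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(asdict(issue))
```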

Such a log quickly shows what fails most often: OCR, a specific vendor template, or a normalization rule. Within a week, it can be more useful than a long manual review by eye.

If you connect an LLM, do not send it the entire document without sorting. It is much better to give it only the questionable fragments: the table header, one problematic row, or a block where columns need to be separated. For that layer, a single gateway such as AI Router is convenient: you can run checks through one OpenAI-compatible endpoint and keep your existing SDK and code unchanged. For teams in Kazakhstan, it can also be useful that airouter.kz offers in-country data storage and audit logs.
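
Because the endpoint is OpenAI-compatible, sending one problem fragment looks like any other chat call. The base URL and model name below are placeholders, not real values:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://example-gateway/v1",  # placeholder endpoint
    api_key="YOUR_KEY",
)

# One wrapped row from the example table, not the whole document.
fragment = "Data transmission channel\nfor backup site | 2 | 120000 | 240000"

resp = client.chat.completions.create(
    model="your-model",  # placeholder
    messages=[
        {"role": "system",
         "content": "Fix the row structure of this table fragment. "
                    "Return columns separated by '|'. Do not invent values."},
        {"role": "user", "content": fragment},
    ],
)
print(resp.choices[0].message.content)
```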

How to start a pilot

Start with 50-100 documents, not the whole archive. Pick 3-4 of the most common types: monthly reports, tariff appendices, and tabular contract appendices. That is enough to see what percentage of files go through without edits and where a person is still needed.

A good pilot answers three simple questions: how many files you process without manual cleanup, how many total mismatches you catch before export, and how many minutes one questionable document takes. Once those numbers are clear, document processing automation becomes much easier and calmer.

Frequently asked questions

How can I quickly tell whether a PDF is text-based or a scan?

Try selecting text with your mouse. If words are highlighted line by line, the file is usually text-based. If the page behaves like an image, you need OCR. Check 2–3 pages too, because one PDF often mixes both types.

Why do tables from PDFs often fall apart during extraction?

Because PDF files rarely store a table as a real table. The program sees separate text blocks with coordinates, and even a small shift can send a date, amount, or description into the wrong column.

When is regular parsing enough, and when do I need OCR?

Use regular parsing when text is already embedded in the PDF and the columns are laid out neatly. Turn on OCR for scans, photos, and pages where text cannot be selected. If the document is mixed, split the flow by page instead of processing the whole file one way.

Do I need to connect an LLM right away for table extraction?

No, that is often unnecessary and expensive for the whole document. It is better to extract text and coordinates first, then add OCR, and keep the LLM for tricky spots such as line breaks, footnotes inside cells, or page boundaries.

What should I do with service names that wrap onto a new line?

Look at the neighboring fields. If there is no volume, rate, or total in the row, and the text simply continues the previous service, merge it into the previous cell using a clear rule. That keeps the line from turning into a false new record.

How can I check that the parsed totals are correct?

Recalculate everything yourself, even if the PDF already shows totals. Check the row formula first, then the section total, then VAT and the grand total. If the difference is above your threshold, send the line for manual review.

What is the best way to normalize numbers, dates, and currencies?

Bring every field to one format. Store numbers without spaces or currency symbols, keep dates in one format, and put currency and VAT in separate columns. Otherwise sorting, filters, and reconciliation will quickly become unreliable.

What should I save together with the final table?

Store the page number, table area, original cell text, and a flag for questionable reading. Then you can open the problematic spot right away without reprocessing the whole PDF.

How should I set up processing if I receive a lot of PDFs?

Build a simple flow: incoming file, PDF type detection, extraction, normalization, totals check, and a separate queue for questionable items. If the documents are similar in layout, set rules by template and keep an error log by vendor and file type.

Which rows should I send for manual review right away?

Do not wait for the error to show up in the final export. Mark rows right away when OCR confidence is low, the amount is empty, columns have shifted, there is a break at a page boundary, a footnote sits inside the row, or the total does not match the calculation. A person can check those cases quickly, and the rest of the file can stay untouched.