Feb 23, 2025·8 min read

Anonymizing Contracts and Medical Records for LLMs Without Losing Meaning

Anonymizing contracts and medical records before sending them to an LLM requires precise rules: which fields to hide, what to keep, and how to avoid distorting legal or clinical meaning.

Where the risk appears

The risk shows up in two places right away. The first is data leakage. The text contains full names, ID numbers, policy numbers, addresses, phone numbers, bank details, diagnoses, and other information that can identify a person or company. The second is loss of meaning. If you replace too much, the model stops understanding who owes what, what happened to the patient, and why it matters.

These problems are often mixed up, even though they are different things. If you leave personal data in place, you break information-protection rules. If you scrub too much too roughly, the LLM will give vague or incorrect answers. In a contract, that breaks the logic of obligations, deadlines, and penalties. In a medical record, the link between complaints, diagnosis, treatment, and progress gets lost.

Blank gaps are especially harmful. When a document is full of "[REDACTED]" or has long empty stretches, the model loses the links between entities. It no longer understands where one contract party ends and another begins, which treatment episode belongs to which test, or what happened first. It is better to replace data with clear labels like "Patient_1", "Organization_2", "Date_3". That keeps the structure intact.
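
A minimal sketch of the stable-label idea in Python, assuming the sensitive strings have already been found by some other step. The function name `replace_entities` and the shape of the `entities` mapping are illustrative assumptions, not part of any specific tool.

```python
import re
from collections import defaultdict

def replace_entities(text, entities):
    """Replace known entity strings with stable labels like 'Organization_1'.

    `entities` maps the exact string found in the text to its kind,
    e.g. {"Alpha LLC": "Organization"}. Detecting the entities is out of scope.
    """
    if not entities:
        return text
    counters = defaultdict(int)
    labels = {}
    # Label numbering follows the order of the mapping.
    for value, kind in entities.items():
        counters[kind] += 1
        labels[value] = f"{kind}_{counters[kind]}"
    # Longer strings go first so a short name cannot split a longer one.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    # One pass over the text: the same entity always gets the same label.
    return pattern.sub(lambda match: labels[match.group(0)], text)

text = "Alpha LLC pays Beta JSC. Beta JSC ships goods to Alpha LLC."
masked = replace_entities(text, {"Alpha LLC": "Organization", "Beta JSC": "Organization"})
print(masked)
```

Because every occurrence of one entity maps to one label, the model can still follow who pays and who ships.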

The difference between document types is obvious. In a contract, legal meaning suffers most: the parties' roles, subject matter, amounts, deadlines, acceptance rules, and grounds for termination. In a medical record, the bigger risk is re-identification even after masking, because a rare diagnosis, an operation date, and a department together can point to one specific person.

Masking depth depends on the task. If the model is making a short internal summary for a lawyer, you can keep the parties' roles, relative dates, and amount ranges. If the task is clinical classification or ICD code lookup, you usually keep age, sex, complaints, lab results, and treatment course, but remove direct identifiers and overly precise time and place details.

Even if you work through a gateway with PII masking and audit logs, a mistake in the replacement template still ruins the result. Safe anonymization is not about removing everything possible; it is about precisely replacing the fragments that reveal identity but are not needed for the task.

Which fields to look for in contracts

In a contract, sensitive data is not only in the header. It is often repeated at the end, in appendices, and even in footers. Miss one such fragment, and the LLM will still see who signed the deal.

It is better to check a contract as a whole document template, not field by field. Otherwise it is easy to hide the ID number but leave the account number, branch address, or a signature with the printed name next to it.

First, look for the direct identifiers of the parties:

full names of individuals and signatories
ID numbers, company registration numbers, and other registration codes
legal and actual addresses
phone numbers, email addresses, and contact person names
bank details: IBAN, bank codes, account numbers, bank name

But that is not the full list. A contract also contains fields that do not look personal, yet still make it easy to tell who the document is about. These include the contract number, appendix number, specification number, and power of attorney number. On their own they may seem safe, but together with a date, an amount, and a product name they often point to a specific deal.

If a contract is sent to an LLM for risk review or term extraction, it is better to replace such fields with neutral labels. For example: "Contract No. [DOC_ID]", "Power of Attorney No. [POA_ID]", "Buyer [COMPANY_1]". That way the model keeps the structure and does not lose the document logic.

Where data often hides

Most misses happen in the signature block. That is where you find full names, job titles, the basis of authority, phone numbers, email addresses, and sometimes even a sample signature. Stamps may contain the registration number, full company name, and address.

The same issue repeats in appendices. Specifications, acts, delivery schedules, and powers of attorney often copy the full details even when you have already hidden them in the main text.

It helps to look at a contract as a set of repeating zones: header, party details, appendices, signatures, and stamps. If each zone has its own replacement pattern, the document stays readable. A lawyer still sees deadlines, liability, subject matter, and amounts, while personal and corporate identifiers do not leak out.
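
One way to check zones after masking is to scan each of them for identifiers that are still visible. The sketch below is an assumption-heavy illustration: the zone names, the `find_leaks` helper, and the three regular expressions are simplified placeholders, and real contracts need broader patterns.

```python
import re

# Deliberately simple patterns for illustration only.
LEAK_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+\d[\d\s()-]{8,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def find_leaks(zones):
    """Return (zone, kind, match) for every identifier still visible in a zone."""
    leaks = []
    for zone_name, text in zones.items():
        for kind, pattern in LEAK_PATTERNS.items():
            for match in pattern.findall(text):
                leaks.append((zone_name, kind, match))
    return leaks

zones = {
    "header": "Buyer [COMPANY_1], Contract No. [DOC_ID]",
    "signatures": "Signed by [SIGNATORY_1], tel. +7 701 123 45 67, manager@example.com",
    "details": "IBAN KZ86125KZT5004100100",
}
for leak in find_leaks(zones):
    print(leak)
```

Here the header is clean, while the signature block and the details section still expose a phone number, an email address, and an account number.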

Which fields to look for in medical records

In a medical record, personal data is hidden not only in the header. Even if you remove the patient's name, a person can often still be identified by the case number, exact dates, and a rare set of complaints. The mistake is usually the same: obvious fields are covered, but details remain that can easily be tied to one person.

First, remove the direct identifiers:

full names of the patient and relatives
ID number
home or registration address
phone number, email, trusted contact details
policy number or insurance details

That is still not enough. In many systems, the record is exposed by service fields that look harmless but actually link the document to the registration desk, an insurance database, or internal reporting. These include the medical card number, treatment or hospitalization case number, internal patient ID in the medical information system (MIS), referral number, test number, lab request number, and exact dates of admission, tests, surgery, and discharge.

Dates need care. For an LLM, the calendar point itself is often less important than the order of events and the intervals. So instead of "12.03.2025" and "19.03.2025", it is better to keep "day 1" and "day 8" or "7 days after admission". Then the model understands the treatment course without getting an unnecessary link to a specific person.
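
A short sketch of that conversion, assuming dates are written as DD.MM.YYYY and that birth dates and other unrelated dates were masked earlier (otherwise the earliest date would become day 1). The function name `to_relative_days` is hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def to_relative_days(text, date_format="%d.%m.%Y"):
    """Replace DD.MM.YYYY dates with 'day N', counting from the earliest date.

    Raises ValueError if a matched string is not a valid calendar date.
    """
    found = [datetime.strptime(d, date_format) for d in DATE_RE.findall(text)]
    if not found:
        return text
    start = min(found)

    def to_day(match):
        current = datetime.strptime(match.group(0), date_format)
        return f"day {(current - start).days + 1}"

    return DATE_RE.sub(to_day, text)

print(to_relative_days("Admitted 12.03.2025, discharged 19.03.2025."))
```

The order of events and the 7-day interval survive, while the calendar dates do not.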

Another problem is quasi-identifiers. One field by itself does not reveal identity, but together they make the patient recognizable. An age of 47, a rare diagnosis, a pregnancy at 31 weeks, a transfer from a specific department, and an unusual reaction to a drug together form a very recognizable set.

Be especially careful with rare diagnoses, orphan diseases, unusual injuries, and medical history details. Sometimes you cannot remove the diagnosis because the clinical meaning would be lost. In that case, it is better to generalize the surrounding fields: keep an age group instead of the exact age, shift the dates, remove the name of a small town, and hide the exact department and doctors' surnames.
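
Generalizing the surrounding fields can be sketched with two small helpers. Both names (`age_group`, `shift_date`) and the 10-year band width are assumptions for illustration; the offset for date shifting would need to stay constant per patient and be kept out of the text sent to the model.

```python
from datetime import date, timedelta

def age_group(age, width=10):
    """Map an exact age to a band such as '40-49'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def shift_date(value, offset_days):
    """Shift a date by a fixed offset so intervals between events stay the same."""
    return value + timedelta(days=offset_days)

admission = date(2025, 3, 12)
discharge = date(2025, 3, 19)
offset = -11  # hypothetical per-patient offset

print(age_group(47))
print(shift_date(admission, offset), shift_date(discharge, offset))
```

The exact age and the real calendar dates are gone, but the age band and the 7-day stay remain available to the model.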

The rule is simple: if a field helps with treatment, analyzing the patient's path, or understanding the doctor's decision, keep it in a generalized form. If a field helps find a specific person in a database or identify them outside the document, hide it.

What to keep so meaning stays intact

When you hide personal data, the goal is not to make the text empty. The model needs workable context: who is connected to whom, what happened, when it happened, and what conditions apply.

Names and titles are better replaced with roles. In a contract, instead of "Alpha LLC" and "Ivanov I.I.", keep "[customer_1]", "[supplier_1]", "[customer_representative]". In a medical record, labels like "[patient_1]", "[cardiologist_1]", and "[neurology_department]" work well instead of the patient's and doctors' names. That keeps the document logic intact.

It is important not to lose the links between participants. If one doctor ordered a test and another discontinued a medication, the model should see that chain. If one party pays, another delivers the goods, and a third accepts them under an act, those roles must not get mixed up. Otherwise the answer may sound polished but be factually wrong.

Do not over-mask dates either. If the model is checking deadlines, event order, delay, limitation periods, or treatment duration, it is better to keep dates in the form needed for the output. Sometimes it is enough to hide only the birth date or exact appointment time while keeping the signing date, admission date, prescription date, and discharge date.

Some data should almost always stay as is:

dosages and measurement units
amounts, rates, VAT, penalties
clause, appendix, and act numbers
lab results, if they are needed for the output
payment, delivery, treatment, and observation periods

Even a small change to such fragments changes the meaning. If you replace "5 mg" with "[dosage]", the model will not understand the overdose risk. If you hide "cl. 4.3" and "cl. 7.2", it will not connect the obligation, deadline, and liability.
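
A simple guard against this is to compare protected fragments before and after masking. The sketch below assumes a very small set of protected patterns (a few dosage units, percentages, and "cl. N.N" clause references); the pattern and the `lost_fragments` helper are illustrative and would need extending for real documents.

```python
import re
from collections import Counter

PROTECTED_RE = re.compile(
    r"\d+(?:[.,]\d+)?\s?(?:(?:mg|ml|g)\b|%)"  # dosages and percentages
    r"|cl\.\s?\d+(?:\.\d+)*"                  # clause references such as "cl. 4.3"
)

def lost_fragments(original, masked):
    """Return protected fragments present in the original but missing after masking."""
    before = Counter(PROTECTED_RE.findall(original))
    after = Counter(PROTECTED_RE.findall(masked))
    return sorted((before - after).elements())

original = "Amoxicillin 500 mg, penalty 0.1% under cl. 4.3 and cl. 7.2."
masked = "Amoxicillin [dosage], penalty 0.1% under cl. 4.3 and [clause]."
print(lost_fragments(original, masked))
```

An empty result means the masking step left the dosages, rates, and clause references alone.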

A good rule is this: hide the identity, but do not hide the logic. In a contract, it should still be clear who owes what to whom and by when. In a medical record, it should still be clear who treated the patient, at what stage, based on which data, and with what result. Then the LLM analyzes the text instead of guessing what you removed.

How to set up anonymization step by step

Anonymization works better when you start with the goal of processing, not with a list of fields. The same document can be sent to an LLM for different tasks: finding contract risk, making a short summary, or labeling the type of medical visit. The task determines what can be hidden and what needs to stay.

A workable flow usually looks like this:

First define the task. If the model is looking for a payment delay, it needs amounts, deadlines, party roles, and the order of events. If the model is writing a discharge summary, it needs the diagnosis, complaints, prescriptions, and date sequence.
Build two lists of fields: direct and indirect identifiers. The first group includes full names, ID numbers, phone numbers, addresses, policy numbers, and contract numbers. The second includes a rare job title, branch name, exact admission date, ward number, and an unusual combination of diagnosis and age.
Replace data by template, not with empty gaps. It is better to write [CLIENT_1], [DOCTOR_1], [CONTRACT_7], [DATE_1], [ORGANIZATION_2]. If the same entity appears five times, the same label should be used every time.
Test the rules on a small sample. Take 10-20 documents of different types: a short contract, a contract with an appendix, a discharge note, a consultation, and a lab result. After replacement, give the model a normal work query and compare the answer with the original.
Show the result to someone from the subject area. A lawyer will quickly see that after replacement it is no longer clear who pays the penalty and by when. A doctor will notice that anonymization erased the difference between the current diagnosis and the medical history.
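
The sample test from the steps above can be roughed out as follows. This is only a sketch: `ask_model` is a placeholder for whatever client the team uses, and comparing numeric facts is one crude signal among many, not a substitute for review by a lawyer or a doctor.

```python
import re

# Numbers that are not part of a label such as "Supplier_1".
NUMBER_RE = re.compile(r"(?<!\w)\d+(?:[.,]\d+)?%?")

def compare_answers(ask_model, prompt, original, anonymized):
    """Run one prompt on both versions and report numeric facts that differ.

    `ask_model(prompt, document)` must return the model's answer as a string.
    """
    facts_original = set(NUMBER_RE.findall(ask_model(prompt, original)))
    facts_anonymized = set(NUMBER_RE.findall(ask_model(prompt, anonymized)))
    return {
        "only_in_original": sorted(facts_original - facts_anonymized),
        "only_in_anonymized": sorted(facts_anonymized - facts_original),
    }

def fake_model(prompt, document):
    """Stand-in for a real model call: it simply echoes the document."""
    return document

report = compare_answers(
    fake_model,
    "When is payment due and what is the penalty?",
    "Alfa Supply pays within 7 banking days, penalty 0.1%.",
    "Supplier_1 pays within a week, penalty [rate].",
)
print(report)
```

If facts appear on only one side, the masking rules probably removed or altered something the task depends on.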

A simple guide works well: after anonymization, the document should still be readable to a person who has not seen the original. If the text falls apart into holes and fragments, the rules are too rough. If the person is still recognizable from the combination of details, the rules are too soft.

When the setup works, save template and rule versions. In production, this matters a lot: later it is easy to see which masking version produced a good result and which one broke the model's answer.
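
Versioning can be as light as attaching a version string to the rule set and writing it into every check record. The `MaskingRules` class, its fields, and the `audit_record` helper below are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class MaskingRules:
    version: str                  # e.g. "1.2.0"
    document_type: str            # e.g. "supply_contract"
    replacements: dict = field(default_factory=dict)  # field name -> label template

def audit_record(rules, document_id, answer_ok):
    """Build a log entry that ties a model result to the rule version used."""
    return {
        "rules_version": rules.version,
        "document_type": rules.document_type,
        "document_id": document_id,
        "answer_ok": answer_ok,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

rules = MaskingRules("1.2.0", "supply_contract", {"company": "[COMPANY_{n}]"})
record = audit_record(rules, "doc-001", answer_ok=True)
print(record)
```

With such records it becomes possible to filter results by `rules_version` and see which rule set started producing bad answers.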

Example with a supply contract

Take a simple supply contract between two legal entities. The original usually includes the full company names, registration numbers, addresses, bank details, signatories' full names, and the basis for their authority. If you send that text to an LLM as is, the model gets much more data than it needs to analyze the deal terms.

Company names are better replaced with roles: "Supplier" and "Buyer". That keeps the text readable and does not break the obligation logic. If more than two legal entities are involved, one role is not enough. Then you write "Supplier_1", "Supplier_2", "Buyer". Otherwise the model will confuse who ships the goods and who pays.

The original sentence might look like this: "TOO 'Alfa Supply' agrees to deliver goods to JSC 'CityBuild' within 15 calendar days from the request date". After replacement, it is better to keep it like this: "The Supplier agrees to deliver goods to the Buyer within 15 calendar days from the request date". The meaning is the same, and the extra details are gone.

At the same time, do not touch numbers that affect the output. If the contract says "payment within 7 banking days", do not replace it with "within a week". If the penalty is "0.1% for each day of delay, but no more than 10% of the overdue amount", keep the wording without rounding it off. For a lawyer, the difference between 7 and 10 days or between 0.1% and 1% changes the whole result.

Also check the internal references in the contract. After replacement, the text sometimes gets shorter, and the editor breaks numbering or line breaks. The model must clearly understand that clause 4.2 refers to clause 7.3, and appendix No. 2 is linked to the specification, not the delivery schedule.

Before sending such a contract, it helps to quickly check four things:

the parties' roles are distinct and not mixed up
amounts, payment deadlines, penalties, and limits are preserved exactly
clauses, subclauses, and appendices are readable without ambiguity
company details, addresses, account numbers, and full names are hidden if they are not needed for the task

If, after anonymization, the model can still answer who delivers the goods, when payment is due, and how the penalty is calculated, the document structure has been preserved.

Example with a medical record discharge note

If you remove too much from a discharge note, the model sees a set of fragments and starts filling in the blanks. In medical records, that is dangerous. A doctor or analyst needs not only the diagnosis and prescriptions, but also the order of events, doses, dates, and short negatives like "no allergy".

Take a short discharge note: the patient presented with cough and fever, had tests the next day, then received a diagnosis and treatment plan. In such a text, you need to hide the full name, phone number, address, card number, and other direct identifiers. But the clinical logic must stay untouched.

Patient: Ivanov Petr Sergeevich, 14.03.1986
IIN: 860314300123
Medical card number: 4519-22
Phone: +7 701 123 45 67
Complaints: temperature up to 38.4, dry cough for 3 days, weakness.
02.05.2026 examination by a general practitioner. No allergies.
CBC: leukocytes 12.4 x10^9/L, CRP 28 mg/L.
Diagnosis: community-acquired pneumonia, right-sided.
Prescribed: amoxicillin 500 mg 3 times a day for 7 days, paracetamol 500 mg if temperature is above 38.

After anonymization, the text can look like this:

Patient: [full name hidden], date of birth: [hidden]
IIN: [hidden]
Medical card number: [hidden]
Phone: [hidden]
Complaints: temperature up to 38.4, dry cough for 3 days, weakness.
02.05.2026 examination by a general practitioner. No allergies.
CBC: leukocytes 12.4 x10^9/L, CRP 28 mg/L.
Diagnosis: community-acquired pneumonia, right-sided.
Prescribed: amoxicillin 500 mg 3 times a day for 7 days, paracetamol 500 mg if temperature is above 38.

The meaning stayed intact because the complaints, examination date, lab results, diagnosis, and exact treatment remain. The model still understands what happened first, what confirmed the diagnosis, and which treatment was prescribed. That is enough if you are summarizing, checking the completeness of the note, or looking for mismatches between diagnosis and therapy.

A common mistake is to delete every number indiscriminately. Then the dosages, course length, and lab values disappear. Another mistake is to wipe out short phrases like "no allergy", "pregnancy denied", or "has not taken antibiotics". They may seem secondary, but they often change how the case is interpreted.

The rule is the same here: hide who it is, but do not hide what happened to the person. If the text later goes into an LLM through an internal gateway or API, first check that the cleaned version still preserves the clinical chain: complaints, examination, tests, diagnosis, prescriptions, and denials that affect the decision.

Mistakes that break the meaning

The most common problem is not that too little was hidden, but that too much was hidden too roughly. After such a change, the model no longer sees a document, but a distorted version of it. The answer sounds confident, but it is based on broken context.

The first mistake is to reduce different people to one marker like [person]. In a contract, that confuses the parties, signatories, beneficiaries, and power-of-attorney representatives. In a medical record, it can easily mix up the patient, the relative, the treating doctor, and the consultant. After that, the model no longer understands who said what and who made the decision.

It is just as dangerous to wipe out negatives. Phrases like "not present", "not detected", "does not take", and "no complaints" change the meaning completely. If the word "not" disappears during anonymization, the model may read the absence of a symptom as the symptom itself, or the absence of a diagnosis as a confirmed diagnosis.
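
A rough automated check for lost negatives, assuming English text and a short illustrative word list. The `negations_preserved` helper is a hypothetical name, and the pattern would also count "No." used as a number sign in contracts, so it is meant for clinical text only.

```python
import re

NEGATION_RE = re.compile(r"\b(?:no|not|denied|denies|without)\b", re.IGNORECASE)

def negations(text):
    """Return negation words in the order they appear, lowercased."""
    return [word.lower() for word in NEGATION_RE.findall(text)]

def negations_preserved(original, masked):
    """True if masking kept the same negation words in the same order."""
    return negations(original) == negations(masked)

original = "Ivanov P.S.: No allergy. Pregnancy denied. Has not taken antibiotics."
masked_ok = "[patient_1]: no allergy. Pregnancy denied. Has not taken antibiotics."
masked_bad = "[patient_1]: allergy. Pregnancy [status]. Has taken antibiotics."

print(negations_preserved(original, masked_ok))
print(negations_preserved(original, masked_bad))
```

A failed check is a prompt for manual review, not proof of an error, since legitimate rewording can also change the word list.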

Dates are the same story. Removing all of them is convenient, but the order of events is often more important than the personal data itself. In a contract, the difference between "notified before delivery" and "notified after delivery" changes how the dispute is interpreted. In a medical record, the chain matters: first complaints, then tests, then treatment, then a follow-up exam.

Numbers should also be handled very carefully. If the amount, deadline, dosage, temperature, glucose level, or size of a finding affects the conclusion, you cannot replace them with random values. A threshold of 10 days, 3 million tenge, or 38.5 degrees can change the entire model output.

A doctor's specialty should not be hidden without a reason. A cardiologist's, oncologist's, surgeon's, and general practitioner's conclusions are read differently even when the wording is similar. If only the marker [doctor] remains, the model loses important context and may confuse a specialist opinion with a general one.

A quick test is simple. If, after masking, the answer changes to any one of the questions below, the text is already damaged:

who exactly is acting or speaking
what is absent and what is confirmed
in what order the events happened
which numbers affect the threshold decision
whose opinion carries specialist weight

For an LLM, it is better not to erase the document, but to replace sensitive fields with stable and different labels: [patient_1], [cardiologist], [date_1], [amount_1]. That keeps privacy intact without breaking the meaning.

Quick check before sending to an LLM

Even careful replacement of personal data can ruin a document because of a few small details. Before sending text to the model, do a quick visual check. It takes only a few minutes and saves you from having to explain a strange answer from an LLM that confused one contract party with another or mixed up the treatment course.

You need to check not only hidden fields, but also the meaning after replacement. The text should still make sense to a person who has not seen the original.

A handy checklist looks like this:

roles should not drift through the text. If it started with "Customer_1", it should not become "Buyer_2" on the third page. The same rule applies in a medical record to the patient, doctor, relative, and insurer
the chronology should still be readable. If you removed exact dates, it should still be clear what came first: complaint, examination, test, prescription, follow-up visit. In a contract, the order of steps, payment, and delivery should still be preserved
numbers should be checked separately. Amounts, doses, deadlines, percentages, measurement units, and clause numbers are the things most often lost during replacement
it is worth assessing the risk of re-identification from a set of facts. Even without a full name, a person can sometimes be identified by a rare diagnosis, age, exact hospitalization date, city, job title, and company name in one paragraph
it helps to show the text to someone who understands the subject. A lawyer will immediately see that the contract subject disappeared or the obligation logic broke. A doctor will notice that the link between symptom, test, and treatment is gone
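
The re-identification item in the checklist can be approximated by counting how often each combination of quasi-identifiers occurs in a batch. This is a sketch in the spirit of a k-anonymity check; the `risky_records` helper, the field names, and the threshold `k=2` are assumptions chosen for illustration.

```python
from collections import Counter

def risky_records(records, quasi_fields, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    def combo(record):
        return tuple(record.get(name) for name in quasi_fields)

    counts = Counter(combo(record) for record in records)
    return [record for record in records if counts[combo(record)] < k]

records = [
    {"age_group": "40-49", "diagnosis": "pneumonia", "department": "therapy"},
    {"age_group": "40-49", "diagnosis": "pneumonia", "department": "therapy"},
    {"age_group": "40-49", "diagnosis": "rare_syndrome", "department": "therapy"},
]
print(risky_records(records, ["age_group", "diagnosis", "department"]))
```

Records that come back from this check are candidates for further generalization, for example a wider age band or a hidden department.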

An even simpler test works well: take the anonymized version and try to retell the document in two sentences. If you can say exactly who must do what, how much, when, and under what conditions, or what path the patient took from complaint to treatment, the meaning has been preserved.

If the document goes into production through an API, do not limit this check to the start of the project. A new contract template or a new discharge form will appear, and the old masking rules will start to fail. A quick manual review on a sample usually catches such mistakes earlier than any automatic metric.

What to do next

Do not try to anonymize the whole archive at once. It is much smarter to take one type of document and one simple task where a mistake is easy to spot. For example, start with supply contracts only for extracting deadlines and penalties, or with discharge notes only for short clinical summaries.

That kind of start quickly shows where masking preserves meaning and where it breaks it. On a small volume, it is easier to see that replacing a name with "Patient A" still works, while replacing dates, dosages, or party roles ruins the model's answer.

It helps to set up a short working cycle:

choose one document template and 20-30 real examples
describe the replacement rules clearly and assign them a version
keep a log of borderline cases where the team was not sure right away what to hide and what to keep
collect examples where the model made a mistake after anonymization
review the rules once a week based on those mistakes, not on gut feeling

A log of borderline replacements is almost always needed. In contracts, disputes usually center on company details, party roles, dates, and appendices. In medical records, age, rare diagnoses, hospitalization dates, test numbers, and links like "symptom - test - treatment" are what most often cause problems.

If the model starts confusing party obligations or losing clinical logic, do not rush to blame the model itself. First check whether you removed the anchor fields that held the context together. The most useful examples for rules are not the successful ones, but the broken ones: a wrong conclusion about deadlines, a mixed-up treatment episode, or a lost cause-and-effect link.

If the data must stay in Kazakhstan, it makes sense to evaluate your process not only on quality, but also on where and how the data is handled. For example, AI Router provides an OpenAI-compatible API gateway, data storage inside the country, PII masking, and audit logs. For a team, that gives a convenient checkpoint: is the required data-handling mode preserved, and does control over processing stay with you?

It is also worth checking the integration separately. If your stack already works through an OpenAI-format SDK, find out in advance whether you can connect a compatible gateway without rewriting code, just by switching base_url to api.airouter.kz. That saves time on the pilot and helps you test anonymization rules without extra approvals between lawyers, security teams, and developers.

Frequently asked questions

What should I hide in a contract first?

First, cover direct identifiers: full names, ID numbers, company registration numbers, addresses, phone numbers, email addresses, bank details, and signature blocks with printed names. Then check appendices, headers and footers, stamps, and the party details section, because that is where such data is often repeated.

For an LLM, it is better not to cut these fragments out, but to replace them with roles and labels like [Buyer_1], [Supplier_1], [DOC_ID]. That keeps the text understandable.

Which fields in a medical record most often give a person away?

Remove the direct identifiers: the patient's and relatives' full names, ID number, address, phone number, policy number, medical card number, and internal IDs. After that, look at exact dates, case numbers, test numbers, and other service fields that can be linked back to a database.

If the diagnosis and treatment course are needed for the task, keep them. It is better to hide the patient's identity than to erase the clinical picture.

Why can't I just delete all sensitive data?

Blank gaps break the links in the text. When the model sees a wall of [REDACTED], it loses the parties' roles, the order of events, and the meaning of the sentences.

It is much better to use stable labels: [Patient_1], [Organization_2], [Date_3]. One entity should get the same label everywhere in the document.

How should I handle dates without losing meaning?

Keep dates in the form the output needs. If the model checks payment deadlines, delays, or the course of treatment, preserve the order of events and the intervals.

Often, relative forms like day 1, day 8, or 7 days after admission are enough. That removes unnecessary precision without breaking the logic.

What is better to leave as is in a contract?

In contracts, you usually keep amounts, rates, VAT, penalties, deadlines, clause numbers, and appendix numbers. These are the details that help the model understand who owes what and when.

If you replace 0.1% with a generic label or hide cl. 4.3, the answer can easily become wrong even with a good prompt.

What should I not remove from a discharge note or medical history?

Keep complaints, diagnosis, test results, dosages, treatment duration, the sequence of events, and short negatives like no allergy. These fragments hold the clinical meaning together.

Do not hide numbers indiscriminately. If you remove the dose, the course length, or a lab value, the model will start guessing.

What about rare diagnoses and other indirect clues?

Look at the combination, not just one field. Age, a rare diagnosis, the department, the exact surgery date, and a small town together can make a person recognizable.

In such cases, it is better to generalize nearby details: keep an age group, shift dates, remove the exact location, and leave out doctors' surnames. Sometimes the diagnosis itself must stay, or the record loses its meaning.

How do I know anonymization has already damaged the document?

Check whether a person who has not seen the original can still retell the document in their own words. If, after replacement, it is no longer clear who is acting, what is confirmed, what is denied, and what order the events followed, the rules are too rough.

Another common sign is that the model mixes up contract parties, doctors, or treatment stages. That usually means you removed the anchor fields, not just the personal data.

How can I check anonymization rules before launch?

Take a small sample from different templates and run the same queries the team uses at work. Then compare the model's answer on the original and on the anonymized version.

After that, show the result to a lawyer or a doctor. They will quickly spot where a deadline disappeared, roles were mixed up, or the link between a symptom, a test, and treatment was lost.

Will a gateway with PII masking solve the whole problem on its own?

No. A gateway with PII masking and audit logs lowers the leak risk and helps you keep the process under control, but it does not replace your own replacement rules: a mistake in them will still break the model's answer.

Start with one task and one document type. If your stack already works with an OpenAI-compatible API, you can connect a compatible gateway, switch base_url to api.airouter.kz, and check separately how well your templates preserve meaning.