Evolving an Extracted Data Schema Without Analytics Chaos
How to change fields, dictionaries, and versions without breaking old reports or making the numbers diverge.

Where the mess starts
Chaos in analytics rarely starts on the day someone adds a new field. Usually it begins earlier: two people look at the same value and understand it differently. For one person, "customer status" is the stage of a request; for another, it is the result of a deal. The name is the same, but the meaning has already drifted.
The most unpleasant case looks harmless. The field keeps its old name, reports keep running, dashboards do not break. But inside, the field already means something else. Yesterday approved meant "check passed", and today it means "contract signed". The same column is visible on the chart, even though the numbers are now answering a different question.
This often happens after a small tweak in data extraction. The team refined the rule, the model started filling the field more accurately, and everyone thinks it is just a quality improvement. For analytics, that is already a change in meaning. If 15% of records used to fall into unknown and then started going into other, the trend breaks even if the system runs without errors.
Dictionaries follow the same pattern. While there are only a few values, the team keeps them in their heads. Then a new value appears, and by habit it gets placed into an old group without an explicit rule. For example, one analyst classifies retry, on_hold, and manual_review as in progress, while another treats them as errors. A week later, product has one conversion rate, sales has another, and finance has a third. Everyone is calculating honestly, but they are calculating different things.
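The divergence above is easy to reproduce. Here is a minimal sketch with invented status names and counts: two unwritten grouping rules applied to the same records yield two different conversion rates, and both are "honest".

```python
# Illustrative data: raw statuses as they come from extraction.
records = ["new"] * 30 + ["retry"] * 10 + ["on_hold"] * 5 + ["done"] * 40 + ["rejected"] * 15

# Analyst A treats retry/on_hold as still in progress...
group_a = {"new": "in_progress", "retry": "in_progress", "on_hold": "in_progress",
           "done": "done", "rejected": "error"}
# ...while analyst B treats them as errors.
group_b = {"new": "in_progress", "retry": "error", "on_hold": "error",
           "done": "done", "rejected": "error"}

def conversion(records, mapping):
    # Share of "done" among all terminal outcomes under a given grouping rule.
    done = sum(1 for r in records if mapping[r] == "done")
    terminal = sum(1 for r in records if mapping[r] in ("done", "error"))
    return done / terminal

print(round(conversion(records, group_a), 3))  # 0.727
print(round(conversion(records, group_b), 3))  # 0.571
```

Neither analyst made an arithmetic mistake; the only difference is an unrecorded grouping rule.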
The first signals are usually simple:
- the same metric differs in two reports
- a field has an old description in documentation and a new one in chat
- new dictionary values appear without a separate decision
- analysts are afraid to touch old queries
An argument about numbers almost always points not to bad BI, but to the fact that the team never fixed the data contract. If nobody recorded the schema version and the change date, any agreement lives only in people's memory. When those people calculate metrics differently, report backward compatibility disappears not with a system error, but with confident yet wrong numbers.
That is how the mess starts: the meaning has already changed, but the schema acts as if nothing happened.
What people usually change in the schema
Problems rarely start with a major redesign. More often, the team changes one small thing in extracted data, and a week later someone notices that the dashboard no longer calculates the way it used to. The most dangerous changes usually look almost harmless.
The first common case is renaming a field. In the source, everything became clearer: it was client_type, and now it is customer_segment. For a developer, that is cosmetic. For analytics, it is a broken contract. Old queries cannot find the column, data marts stop building, and some reports quietly show empty values.
Changing the date format causes just as many problems. While the field came in one format, filters and grouping worked predictably. As soon as the date starts coming in a different format, a report may group by the raw string instead of by day, or exclude part of the records entirely. The worst case is when the error does not break the pipeline completely, but simply shifts the numbers.
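A hypothetical example of that "quiet shift": the source switches from ISO dates to day-first strings, and an old pipeline that still parses month-first does not crash, it just puts records on the wrong day.

```python
from datetime import datetime

old_rows = ["2024-03-04", "2024-03-05"]   # ISO dates, as before
new_rows = ["04/03/2024", "05/03/2024"]   # now day/month/year

def day_bucket(value, fmt):
    # Normalize one raw date string into the daily bucket a report groups by.
    return datetime.strptime(value, fmt).date().isoformat()

print([day_bucket(v, "%Y-%m-%d") for v in old_rows])  # ['2024-03-04', '2024-03-05']
# The pipeline still assumes month-first, so nothing fails loudly --
# the records simply land on April 3 and May 3:
print([day_bucket(v, "%m/%d/%Y") for v in new_rows])  # ['2024-04-03', '2024-05-03']
```

No exception, no empty dashboard: the trend just moves.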
Another common step is splitting one field into two. For example, one column used to hold the full order status, and then the team decided to store the stage and the reason separately. For the data model, that makes sense. But old calculations that used status in a filter no longer produce the same result. Conversion rate, cancellation share, and average processing time start to drift from history.
Dictionaries change even more often than people expect. Yesterday there were values new, approved, and rejected, and today on_hold was added. Old slices usually do not expect that. In one report it will fall into "other", in another it will disappear, and in a third it will break the rule where an analyst manually combined statuses.
A separate source of confusion appears when blank values and zero get mixed together. They are not the same. Blank means the data is missing or the field was not filled in. Zero means the value exists and it equals zero. If a team merges both cases into one, reports start lying: somewhere the number of "zero" orders grows, and somewhere data completeness drops.
Before any change, it is enough to check a few things: what the field is called, what format the date comes in, whether the meaning of the calculation changes, whether the dictionary expands, and how blank differs from zero. If you skip this short pass, even a careful schema quickly becomes a source of arguments about whose numbers are correct.
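The blank-versus-zero distinction from the checklist can be enforced with a tiny classifier. The field name `amount` and the sample rows are illustrative; the point is that the three states never collapse into one.

```python
from collections import Counter

rows = [{"amount": None}, {"amount": ""}, {"amount": 0}, {"amount": 12}]

def classify(value):
    if value is None:
        return "missing"       # no data arrived at all
    if value == "":
        return "empty_string"  # the field arrived but was not filled
    if value == 0:
        return "zero"          # a real value that equals zero
    return "filled"

print(Counter(classify(r["amount"]) for r in rows))
# Counter({'missing': 1, 'empty_string': 1, 'zero': 1, 'filled': 1})
```

Running this count before and after a change makes a silent merge of "missing" and "zero" visible immediately.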
When it is better to add a new field
A new field should be added not when the old name feels awkward, but when the meaning of the data changes. If yesterday a field described one business fact, and today you want to store a different one in it, a simple rename almost always breaks reporting logic.
Suppose source first meant the source of a document, and then the team decided to use it for the customer acquisition channel. The name is similar, but the meaning is already different. In that situation, it is safer to introduce a new field and leave the old one as it is.
The old field should not be removed right away if reports, alerts, or exports to other systems still depend on it. It is better to support both fields in parallel for a while. Yes, the schema looks a little less tidy. But finance, sales, and operations will not see a sudden break in the numbers.
The problem often starts with dictionaries. The team adds a new value to a status or category and thinks nothing serious happened. But old groups already have business meaning. If a report has spent months splitting records into three clear categories, the quiet arrival of a fourth breaks comparisons across periods.
In that case, there are usually two safe paths:
- keep the old dictionary unchanged for the old field
- create a new field or a new field version for the expanded logic
This matters especially if the old groups are used in SLA, bonuses, limits, or regulatory reporting. Silent dictionary expansion tends to surface at the worst possible moment.
You should agree in advance on what counts as a new object and what counts as a new version of the same object. If the entity, the level of detail, or the business meaning changed, it is a new object and needs a separate field. If you only fixed normalization, the date format, or the extraction method without changing meaning, a new version is usually enough.
A simple check usually settles the debate. If the field meaning changed, the metric groups changed, old reports are still needed, and the new value cannot honestly be mapped one-to-one with the old one, it is better to add a new field.
How to make changes step by step
Any schema change is better handled like a small release, not as a side fix. If the team changes a field in ETL and the analysts learn about it from a broken dashboard, the problem has already happened.
A workable sequence is usually this:
- Make a list of everything the change will affect: fields, dictionaries, data marts, reports, and alerts.
- Write down briefly what exactly is changing: field name, type, date format, value meaning, or the set of statuses.
- Before release, assign a version to the schema and, if needed, a separate version to the dictionary.
- Prepare a mapping table between old and new values and run the calculations on the same sample.
- Set a date right away after which the old field will no longer be supported, and record it for all teams.
Many people skip the first step. That is a mistake. Even a small change like status -> order_status often affects BI exports, segmentation rules, manual Excel reports, and the API that another team pulls data from.
Next, you need a short change card. No long descriptions. It is enough to answer four questions: what was there, what will replace it, why is it changing, and which calculations may break. If the field changes not only in name but in meaning, it is no longer a rename. It is a new field, and it should be treated more strictly.
A schema version before release saves a lot of time in disputed areas. The same goes for the dictionary. If today an order status has five values and tomorrow you split one status into two, analysts need a clear marker of which dictionary version they are working with.
While the old and the new live side by side, there should be a transparent mapping table between them. For example:
- approved -> accepted
- pending_manual_review -> review
- done -> completed
After that, take the same sample, for example the last 30 days, and calculate the old and new metrics side by side. Look not only at the total, but also at slices by date, channel, and segment. Often the mismatch appears not in the sum, but in one narrow part of the data.
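This side-by-side calculation is a few lines of code. The sketch below uses the mapping values from the example above and an invented four-record sample; a row-count check alone would pass, so the per-group view is what actually catches a bad rule.

```python
# Mapping between old and new dictionary values (from the example above).
mapping = {"approved": "accepted", "pending_manual_review": "review", "done": "completed"}

# An illustrative sample; in practice this would be the last 30 days.
sample = ["approved", "approved", "pending_manual_review", "done"]

def counts(values):
    out = {}
    for v in values:
        out[v] = out.get(v, 0) + 1
    return out

old = counts(sample)
new = counts(mapping[v] for v in sample)

# Totals always match under a one-to-one mapping, so compare groups, not sums.
assert sum(old.values()) == sum(new.values())
print(old)  # {'approved': 2, 'pending_manual_review': 1, 'done': 1}
print(new)  # {'accepted': 2, 'review': 1, 'completed': 1}
```

The same comparison should then be repeated per date, channel, and segment, since a mismatch often hides in one narrow slice.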
In teams where LLMs extract fields from documents or conversations, this step is especially useful. A new version of the prompt or dictionary may perform better in manual review, but break familiar grouping in reports. On the test sample, you can see that immediately.
And one more thing: the old field cannot live forever. If you do not set a removal date, it will stay in the system for years, and the confusion will become normal. Usually a specific date, an owner, and a short list of reports that need to be moved before the deadline are enough.
Rules worth agreeing on in advance
Confusion usually appears not after a major migration, but after a dozen small changes. One person renamed a field, another replaced the value dictionary, a third did not warn the team that the old value no longer arrives. That is how a numbers dispute begins almost out of nowhere.
That is why the schema should have one source of truth. Keep the field registry in one place: in shared documentation, a data catalog, or a table the team can always access. The registry needs not only names, but also the field type, allowed values, change date, current status, and version. If a field is deprecated, that should be visible immediately, not discovered by accident in an old dashboard.
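What a registry entry holds can be sketched as a plain record. The keys mirror the list above; the names and values are illustrative, not a specific catalog's format.

```python
# One hypothetical field-registry entry.
field_entry = {
    "name": "document_status",
    "type": "string",
    "allowed_values": ["new", "in_progress", "in_review", "done"],
    "description": "Current processing stage of the document",
    "changed_at": "2024-06-01",
    "status": "active",   # or "deprecated", together with a removal date
    "version": 2,
}

def is_deprecated(entry):
    # Deprecation should be visible in the registry, not discovered in a dashboard.
    return entry["status"] == "deprecated"

print(is_deprecated(field_entry))  # False
```

The exact storage does not matter much; what matters is that status, version, and change date live next to the name.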
Do not get fancy with names. Short abbreviations seem convenient only on the day you invent them. A few months later, stat_cd or src_tp already need decoding. A clear name like document_status or source_type is a couple of characters longer, but it does not break SQL readability and does not make people guess.
Every field should have a description of its meaning, not just its format. The note string, nullable tells you almost nothing. You need a short explanation: what the field stores, where it comes from, when it is filled, and which values count as normal. It is useful to add two or three examples. If you extract data from LLM responses, examples are especially important: the model may return values that look similar but mean different things.
You also need a schema owner. Not a committee of ten people, but a specific person or a small group that approves disputed changes. Otherwise every change to fields and dictionaries goes through the "this is how I see it" principle, and analytics deals with the consequences. The schema owner does not slow the work down. They simply check whether the new version breaks the data contract and report compatibility.
Old fields should not be deleted on the same day the new one appears. Always mark a transition period: for example, the field is considered deprecated today, runs in parallel for another 30 days, and then the team removes it from new exports. Such a deadline saves hours of arguments. People have time to update their calculations, and reports do not break overnight without warning.
Example with a status and a dictionary
It looks like a small thing: yesterday the status field held three values - new, in progress, done. Today the team added one more - in review. For product, that is a normal step, and for analytics that is exactly how a silent break starts.
The old funnel report only knows the previous dictionary. If it strictly expects three values, records with in review may simply disappear from the selection. In the end, the sum across stages no longer matches the total number of requests, and the team argues about where the rows went.
Let us imagine a simple day: 100 requests arrived. Of them, 25 have status new, 40 are in progress, 20 are in review, and 15 are done. The new operational report shows all four statuses, and that is correct. But the old funnel report was built for the previous model and needs to preserve the old grouping logic.
Here, the solution is not to rewrite every report at once, but to use a mapping table. It stores not only the source value, but also the reporting group it should be assigned to.
| status value | group for old reports |
|---|---|
| new | new |
| in progress | in progress |
| in review | in progress |
| done | done |
Then the old report does not lose any rows. It still counts three groups, but the 20 records with status in review go into the in progress block. Nothing breaks for the user of that report: historical charts remain comparable, and the totals add up.
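The numbers from this day can be checked mechanically. A sketch of the rollup, using the mapping table above:

```python
# The example day: 100 requests across four raw statuses.
day = {"new": 25, "in progress": 40, "in review": 20, "done": 15}

# Mapping table: "in review" rolls into "in progress" for old reports.
old_report_group = {"new": "new", "in progress": "in progress",
                    "in review": "in progress", "done": "done"}

grouped = {}
for status, count in day.items():
    group = old_report_group[status]
    grouped[group] = grouped.get(group, 0) + count

print(grouped)                 # {'new': 25, 'in progress': 60, 'done': 15}
print(sum(grouped.values()))   # 100 -- no rows lost
```

The old report keeps its three groups and its total, while the raw values stay available for the new one.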
The new report can work differently. It already shows the full detail and lists in review separately because that status matters for the current process. Both reports can live side by side if they have a clear contract:
- the raw value is stored as is
- the reporting group is defined separately
- the mapping rules are versioned
That is what normal schema evolution looks like. You do not hide the new value, and you do not break old calculations. You separate business reality from the way old reports aggregate it.
If you do this every time the dictionary changes, the dispute about "lost" rows usually ends in five minutes: everyone can see both the source status and the group it was assigned to.
How to keep old reports working
Old reports do not break at release time, but a little later, when someone notices a strange number and can no longer quickly understand where it came from. That is why, whenever the schema changes, it is better to think not only about new fields, but also about how old logic will live alongside them through at least one transition period.
If a field changes meaning or name, do not remove the old name immediately. Leave it as an alias and map it to the new structure with an explicit rule. For example, if status_code and status_reason replace status, the old status can temporarily be assembled from the new dictionary so that status group reports keep calculating the same way as before.
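A transitional alias like that can be a small pure function. The code-to-status rules below are invented for illustration; the real rules would come from the team's own dictionary.

```python
# Hypothetical: reassemble the legacy `status` from the new field pair
# during the transition period, so old reports keep calculating as before.
def legacy_status(status_code, status_reason):
    # Old reports only distinguished approved / rejected / pending.
    if status_code == "closed" and status_reason == "accepted":
        return "approved"
    if status_code == "closed":
        return "rejected"
    return "pending"

row = {"status_code": "closed", "status_reason": "accepted"}
row["status"] = legacy_status(row["status_code"], row["status_reason"])
print(row["status"])  # approved
```

Once every consumer has moved to `status_code` and `status_reason`, the alias is deleted on the agreed date.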
It is better to version the schema explicitly too. Bump the dataset version instead of silently editing the schema in the same stream. Then the analyst sees that they are working with dataset_v1 or dataset_v2, instead of trying to guess why yesterday's query gives a different result. That is standard discipline, not extra bureaucracy.
What to freeze during the transition
During migration, leave the old calculation logic unchanged. If a report used to calculate the share of approved requests using the old dictionary, do not change the formula in the middle of the month just because new categories appeared in the schema. Otherwise you are comparing not periods, but two different methods.
Another sensible measure is to set the support period for the old version in advance. That is usually enough for teams to move dashboards, checks, and exports to the new schema without rushing.
What to verify after release
A row-count check guarantees almost nothing. After release, it is better to compare not only the total, but also the structure of the data: metric sums, shares by main segments, groupings by dictionaries and statuses, the number of empty and unknown values, and the difference between the old and new version over the same period.
Growth in unknown, other, or empty categories should be caught on the first day. That is often the first sign that the new dictionary did not cover some old values or that the extraction started returning an unexpected code.
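A first-day check for that growth fits in a few lines. The threshold, category names, and sample data here are illustrative.

```python
def problem_share(statuses):
    # Share of records that landed in unknown/other/empty buckets.
    flagged = {"unknown", "other", "", None}
    return sum(1 for s in statuses if s in flagged) / len(statuses)

before = ["done"] * 95 + ["unknown"] * 5
after  = ["done"] * 80 + ["unknown"] * 15 + [None] * 5

old_share, new_share = problem_share(before), problem_share(after)
if new_share > old_share + 0.02:  # alert if the share grew by more than 2 points
    print(f"unknown/empty share jumped: {old_share:.0%} -> {new_share:.0%}")
```

Such a check costs almost nothing and catches the typical failure mode: a new dictionary that did not cover some old values.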
If the team has a data contract, these rules are better written directly into it: when a new version appears, how long the old one lives, and what checks are required before disabling the alias. Then report compatibility depends not on individual memory, but on a clear process.
Mistakes that make the numbers diverge
When the schema changes, numbers most often break not because of complex logic, but because of small uncoordinated edits. One service is already writing the new way, the data mart is still waiting for the old field, and the dashboard is silently calculating the wrong thing.
The most common mistake is simple: a field was renamed and someone decided it was just a technical detail. For example, it was status, and now it is document_status. Developers updated the pipeline, but analysts did not update the queries, and part of the rows started coming through as null. On the chart, that looks like a sharp drop or spike, even though nothing changed in the business.
It is no less dangerous to change the meaning of a value and keep the old name. Yesterday approved meant "the document was checked by an operator", and today it means "the document passed automated verification". The name is the same, the report is the same, but the metric is already about a different stage of the process. These mistakes live for a long time because at first glance everything seems to add up.
Another pain point is mixing null, an empty string, and 0 in one column. For one query, these are three different states: no data, an empty value arrived, and there is a numeric result of zero. For another query, they may collapse into a single group. In the end, one report shows 12% missing values, and another shows 3%, even though both are looking at the same table.
Dictionaries also drift silently. The team updated the status list in the extraction service, added in_review, and nobody touched the reference data in DWH or BI. The new value either falls into "other" or does not get into the aggregation at all. That is a typical scenario for systems where several services process LLM extraction and each one keeps its own set of allowed values.
Before releasing changes, it helps to run through a short list:
- which fields were renamed, removed, or added
- whether the meaning of old values changed
- whether null, an empty string, and 0 are interpreted the same way
- whether dictionaries and reference lists were updated across all layers
- whether the aggregates match before and after release
The last item is often skipped. The team checks two or three records manually, sees that the JSON looks fine, and ships the release. But a few correct examples prove nothing. You need to compare sums and distributions: number of records by day, share of empty values, number of objects in each status, median amount or processing time. If a field looks better in manual review after the change, but the weekly total suddenly drops by 18%, the problem is almost always in the data contract, not in the business.
Quick check and next steps
Before any schema change, it is useful to pause for a short check. It takes less than an hour, but later you will not spend a week figuring out why the new report does not match the old one.
Check five things:
- the schema has an explicit version and a date from which it starts working in production
- every changed dictionary has a mapping table between old and new values
- the old reports have already been run on a control sample and the numbers before and after the change have been compared
- the team knows the exact date when the old fields will stop being supported
- if an LLM fills the structure, the data contract lives separately from the model choice
The last item is often missed. The model may change the style of the response, the order of the fields, or name a status a little differently. If the data contract is described separately, you change the parser and checks once, instead of rewriting half of analytics every time the model changes.
In practice, it is enough to keep three things close by: the description of the current schema, the mapping table for dictionaries, and a short validation protocol on a control sample. That is already enough for an analyst to understand what changed and for an engineer to quickly find the place where the calculation could have broken.
If you test extraction with different LLMs, do not mix the experiment with a format change. First lock the contract, then compare models within the same set of fields. In that sense, AI Router is useful for a team that wants to run the same scenario through different models via one OpenAI-compatible endpoint without changing the integration. But the result format still has to be defined by your data contract, not by the model itself.
The plan after that is simple: assign a schema owner, add a version to each dataset description, create mapping tables for dictionaries, and choose one control sample on which every change will be checked. After that, you can safely add new fields, move statuses, and remove old values without chaos in reports.
Frequently asked questions
When should you create a new field instead of renaming the old one?
Add a new field if the business meaning changes. If source used to mean the document source, and now you want to store the customer acquisition channel, do not rename the old field — create a new one and give the team time to move over.
Can you keep the old field name if its meaning has changed?
No, that is not a good idea. The old name creates the impression that nothing changed, and reports keep running even though they are answering a different question.
How do you add a new status without breaking old reports?
Keep the raw value as is and define the reporting group separately. Then the new status can be mapped into the old group for historical reports, while new dashboards can show the full detail.
What is the difference between `null`, an empty string, and `0`?
null means there is no data, an empty string means an empty value was received, and 0 means the value exists and equals zero. If you mix these cases, reports start confusing missing values with real zeros.
What should you check before changing a schema?
A short release check is enough: describe what changes, assign versions to the schema and the dictionary, prepare a mapping between old and new values, and compare metrics on one sample. This takes little time and quickly catches mismatches.
How long should you keep the old and new fields in parallel?
Keep both fields until the teams have moved their reports, alerts, and exports. A fixed period such as 30 days is usually enough, but it is better to set the removal date right away, otherwise the old field will stay forever.
Should the dictionary be versioned separately from the schema?
Yes, if the allowed values or their grouping change. A separate dictionary version helps explain why the same query returns different slices on different periods.
What should be checked right after release?
Look at more than the total number of rows. Compare sums, shares by segment, status distribution, the number of unknown values and empty values on the same period.
Who should approve changes to fields and dictionaries?
Assign one schema owner or a small group. These people do not slow the work down — they check whether the change breaks field meaning, dictionaries, or compatibility with old calculations.
How do you change the LLM or prompt without breaking analytics?
First lock the data contract, then change the model, prompt, or provider. If the LLM starts returning statuses in a different format, the parser and checks will catch it before the error reaches analytics.