Sep 04, 2025·8 min read

Retiring a Model Without Breaking the Product

Model retirement needs a plan: notify teams, check dependencies, keep a dual-support window, and move traffic in stages.


What breaks if you remove the model too early

When a model is retired, the failure rarely looks like one big incident. Usually the product breaks in pieces: one path fails right away, another changes behavior without a clear error, and a third only shows up hours later.

Most often, the team does not forget the main scenario but a quiet side one. The bot on the website has already been moved to the new model, but the old prompt is still in the nightly summary, moderation, ticket classification, or the internal assistant for operators. Until that path runs, everyone thinks the transition went fine.

Even if requests go through a single OpenAI-compatible gateway, the problem does not disappear. Code often still has a hard-coded model for a queue, a cron job, or an old fallback rule. Main traffic may already be on the new model, while background tasks keep sending requests to the old one.

Old prompts are another issue. The new model may answer correctly in meaning but in a different format. For a person, that is minor. For the product, it is a break. If a downstream service expects JSON with intent and priority but gets plain text, the ticket route breaks, the CRM card stays empty, and the automatic reply never goes out.
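
One way to make this failure loud instead of silent is to validate the model's reply before it reaches the downstream service. A minimal sketch, assuming the downstream service expects JSON with `intent` and `priority` fields (field names here are illustrative):

```python
import json

# Fields the downstream service expects; hypothetical names for illustration.
REQUIRED_FIELDS = {"intent", "priority"}

def validate_reply(raw: str) -> dict:
    """Reject replies that are not JSON with the expected fields,
    so a format change fails at the boundary instead of quietly
    leaving a CRM card empty three services later."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"model returned non-JSON output: {e}") from e
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model reply is missing fields: {sorted(missing)}")
    return data
```

With a check like this in place, a new model that answers "correctly in meaning but in a different format" shows up as an error rate you can alert on, not as a ticket that never got routed.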

It usually looks like this:

  • the chat replies, but the next-step buttons do not appear
  • batch processing piles up retries and grows the queue
  • nightly reports arrive empty or with junk data
  • support gets complaints before the alert fires

Support almost always learns about the problem first. Users do not write, “you have a 404 on model id.” They write, “the bot is answering strangely,” “the request disappeared,” or “the export result is different.” If monitoring only checks response status and average latency, engineers may not notice for a long time that the response schema is already broken.

The worst case is when the old model is still in asynchronous flows. Queues, batch jobs, and retries live longer than the interface. Requests can keep flowing for another day after the team decides the migration is done. The error spreads over time: quiet during the day, backlog building at night, complaints arriving in the morning.

The early sign is simple: users are already seeing strange behavior, but the dashboard is still green. That usually means the model was removed before all dependencies were found and the response format was checked against the old prompts.

Who to notify before work begins

The first message should go out ahead of time, not on the day of the switch. Teams need more than a general announcement; they need a clear schedule with three dates so nobody learns about the old model shutdown from a spike in errors.

State it clearly:

  • freeze date - from this day on, the team no longer adds new prompts, features, or tests for the old model
  • switch date - from this day on, main traffic goes to the new model
  • shutdown date - from this day on, requests to the old model are blocked or sent to a backup route

Add the time zone, owner, and rollback condition to each date. If the date exists but nobody can make the rollback decision, the plan is almost useless.
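
The schedule is easier to keep honest when it lives as data, not as a paragraph in a chat message. A minimal sketch, with hypothetical dates, owner names, and rollback rules (the fixed UTC+5 offset stands in for Asia/Almaty time):

```python
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta

ALMATY = timezone(timedelta(hours=5))  # fixed-offset stand-in for Asia/Almaty

@dataclass
class Milestone:
    name: str           # "freeze", "switch", or "shutdown"
    when: datetime      # timezone-aware, so nobody argues about which midnight
    owner: str          # the one person who can say: continue, roll back, extend
    rollback_rule: str  # the condition that triggers rollback, agreed in advance

plan = [
    Milestone("freeze", datetime(2025, 9, 15, 10, 0, tzinfo=ALMATY), "tech-lead",
              "n/a: freeze only stops new work on the old model"),
    Milestone("switch", datetime(2025, 9, 22, 10, 0, tzinfo=ALMATY), "tech-lead",
              "error rate above 2x baseline for 2 hours -> route back"),
    Milestone("shutdown", datetime(2025, 10, 6, 10, 0, tzinfo=ALMATY), "tech-lead",
              "any traffic to the old model id in the last 14 days -> postpone"),
]
```

Every milestone carries its own owner and rollback condition, so the "nobody can make the rollback decision" failure mode is visible the moment a field is left empty.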

Notify four groups separately. Product needs to understand which user flows change and where a temporary drop in quality is acceptable. Support needs a list of noticeable response changes, reply templates, and the date after which complaints count as an incident. Analytics should mark the switch window in advance so it does not confuse the effect of the new model with an ordinary metrics dip. Security checks logs, PII masking, AI-content labels, and access rules.

It also helps to build one list of affected areas. Show not only the model you are removing, but everything tied to it: product features, customers, channels, internal tools, and external integrations. That list removes half the arguments before the work even starts.

If traffic goes through a gateway like AI Router, this map is often easiest to build by API key, model ID, and the services using one OpenAI-compatible endpoint. That makes hidden dependencies easier to spot, especially ones the product team may not know about.

For questions, use one channel: chat or a queue in the tracker. For decisions, use one owner. Usually this is the engineering manager or tech lead for the service running the switch who can say: continue, roll back, or extend the dual-support window. When there are two channels and three owners, teams argue longer than the actual decommissioning takes.

How to build a dependency map

When retiring a model, teams often look only at the main service and miss everything around it. Then the old model disappears, but instead of the whole product breaking, a dozen small scenarios fail: a nightly export, an operator bot, a manual analyst script, a CRM response parser.

Start with call logs from the last 30–60 days. A month is not enough if you have weekly and monthly jobs. Look not only at request frequency, but also at the source, parameters, start time, and call path. One rare request at the end of the month may be required to close the reporting period.
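
A simple aggregation over those logs already surfaces most of the map. A sketch, assuming each call was logged as a record with a model id, a source, and a day (the record shape and names are illustrative):

```python
from collections import defaultdict

# Hypothetical log records: one dict per model call.
logs = [
    {"model": "old-model-v1", "source": "chat-api",            "day": "2025-08-01"},
    {"model": "old-model-v1", "source": "monthly-report-cron", "day": "2025-08-31"},
    {"model": "new-model-v2", "source": "chat-api",            "day": "2025-08-30"},
]

def dependency_summary(records, model_id):
    """Group calls to one model by source, keeping call counts and the last
    day each source was seen. A source seen once at month end can still be
    load-bearing: it may be required to close the reporting period."""
    counts = defaultdict(int)
    last_seen = {}
    for r in records:
        if r["model"] != model_id:
            continue
        counts[r["source"]] += 1
        last_seen[r["source"]] = max(last_seen.get(r["source"], ""), r["day"])
    return {src: {"calls": counts[src], "last_seen": last_seen[src]} for src in counts}
```

Sorting that summary by `last_seen` shows which callers are active daily and which only wake up for the monthly job, which is exactly the tail that breaks a week after the shutdown.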

Then collect the full list of places where the model is involved at all. Usually the service API and chatbot are found quickly, but less visible things are forgotten: scheduled batch jobs, internal tools for support and analysts, manual operations through an admin panel or notebooks, and partner or customer integrations that call the model directly.

Next, check not only the model call itself, but also the response shape. If the team once tuned the prompt for the old model, the new one may return different JSON, a different field order, or a slightly different writing style. For a person, that is minor. For a parser, it is enough to make everything fail. So next to each scenario, note the prompts, post-processing, regex, JSON schemas, and checks that expect the old format.

Also review operational dependencies. Alerts, dashboards, rate limits, and budgets are often tied to the old model name. After the switch, charts go blank, alerts stay silent, and the team thinks everything is calm. In reality, it has just gone blind.

The most unpleasant case is bypassing the main API. If part of the traffic goes through a single gateway while some clients still keep an old base_url, a separate key, or a direct provider call, you will not have the full picture. In a setup like AI Router, it is especially useful to check who uses the shared OpenAI-compatible endpoint and who is still on the old route.

A good dependency map answers three questions: who calls the model, what exactly they expect from it, and who will notice the failure first. If every scenario has a clear owner, the shutdown goes much more smoothly.

How to open a dual-support window

A dual-support window is not about being overly cautious. It is for an honest check. The new model is already taking part of the load, while the old one is still live. Usually 7–14 days is enough, but the duration should be set in advance. If you do not set an end date, the temporary setup will quickly become permanent.

The old and new models run in parallel on the same scenario. The team decides in advance which requests will go both ways. Usually they choose repeatable flows: support inquiries, short reply generation, classification, and extracting fields from documents. A random sample works too, if it covers peak hours, different user types, and rare but expensive mistakes.

If you work through a gateway like AI Router, this traffic is easier to split at the routing layer without changing the SDK, code, or base_url in the app. That reduces risk: the product keeps working the same way, while the comparison happens in infrastructure instead of in hastily rewritten business logic.

What to compare every day

Looking at text quality alone is not enough. Two models may sound similar but break the product in different ways.

  • does the response format match what the app expects
  • does latency grow on typical requests
  • does cost per thousand requests or per scenario change
  • does the share of errors, empty responses, and timeouts increase
  • do new failures appear on long or rare prompts

It is better to tag these requests in the same way and store both models’ results side by side. Then the team sees not one average number, but specific differences: where the new model is slower, where it breaks JSON more often, where it answers more expensively without clear benefit.
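
The side-by-side record can be very small and still answer the daily questions above. A minimal sketch, where `call_old` and `call_new` are hypothetical client functions that return the raw model reply:

```python
import json
import time

def shadow_compare(request, call_old, call_new):
    """Send one request to both models and record the differences that
    matter to the product: latency, format validity, and whether the
    parsed fields match. call_old / call_new are assumed client functions."""
    t0 = time.monotonic()
    old_raw = call_old(request)
    old_ms = (time.monotonic() - t0) * 1000

    t0 = time.monotonic()
    new_raw = call_new(request)
    new_ms = (time.monotonic() - t0) * 1000

    def parse(raw):
        try:
            return json.loads(raw), True
        except json.JSONDecodeError:
            return None, False

    old_data, old_ok = parse(old_raw)
    new_data, new_ok = parse(new_raw)
    return {
        "old_ms": old_ms, "new_ms": new_ms,
        "old_valid_json": old_ok, "new_valid_json": new_ok,
        "fields_match": old_ok and new_ok and set(old_data) == set(new_data),
    }
```

Stored per request and tagged by scenario, records like this show where the new model breaks JSON more often or is slower, rather than hiding everything behind one average.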

Even before the start, assign someone who can trigger rollback without long approvals. You also need a threshold in advance: for example, if the error rate rises above the agreed level or latency stays above normal for two hours in a row, traffic goes back to the old model.
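
That threshold rule can be written down as a few lines of code, which removes the argument during the incident. A sketch with example numbers (the threshold and window are assumptions; set your own before the window opens):

```python
def should_rollback(error_rates, threshold=0.02, window=4):
    """Return True when the error rate has stayed above the agreed
    threshold for `window` consecutive checks, e.g. 4 checks at
    30-minute intervals is roughly the "two hours in a row" rule."""
    recent = list(error_rates)[-window:]
    return len(recent) == window and all(r > threshold for r in recent)
```

A single spike does not trigger it; a sustained breach does, which matches the intent of "above normal for two hours in a row" rather than reacting to every noisy sample.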

During the window, do not change prompts, response templates, or business rules. Otherwise you will compare not two models, but two different product versions. For model decommissioning, that is a bad test: it creates noise instead of a clear signal and only delays the shutdown of the old model.

How to move traffic step by step


A sudden switch almost always hurts not the model itself, but the product around it. That is why traffic is moved in stages, not with one click.

First, give the new model to people who can handle small issues without harming the business. Usually these are internal users, test projects, service scenarios, or low-risk requests. If a bot helps employees find policies, that is a good first candidate. If it answers customers about payments, that flow is better left for later.

A common rhythm looks like this: 5% of traffic to the new model, then 20%, then 50%, and only after several check cycles - 100%.
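
Hash-based bucketing is one common way to implement those percentages so the split is stable between stages. A minimal sketch (not tied to any particular gateway):

```python
import hashlib

def route_to_new_model(user_id: str, rollout_percent: int) -> bool:
    """Deterministically assign a stable share of users to the new model.
    Hashing the user id into a 0-99 bucket keeps each user on the same
    model as the percentage grows from 5 to 20 to 50 to 100, instead of
    flipping users back and forth on every request."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent
```

Because a user routed at 5% stays routed at 20% and 50%, any regression you see at a later stage is new exposure, not the same users being re-randomized.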

Do not rush from one stage to the next on the same day if traffic is uneven. Morning and evening traffic on weekdays can behave very differently. It is better to wait for a period when you can see normal peaks, not just quiet hours.

Look at more than just the response text. Teams often check meaning and tone, but miss the things that break the user experience first: timeout, rising retry counts, empty responses, latency spikes, format errors, and unexpected tool failures. If the new model answers a little better but goes into retries twice as often, the product has already gotten worse.

Fast rollback is needed at every step. Not through a request and not through a night release, but through a simple routing rule. If the metrics drop by 20%, you should be able to return to the previous route within minutes. Through AI Router, this is easy to do by traffic share, key, or user group without changing client code.

It helps to agree on stop signals in advance: an increase in empty responses above the normal level, latency above a set threshold, or a surge in manual complaints from support. Then there is no need to argue. The team just rolls back the step and investigates the cause.

This approach may look slow, but in practice it saves days. One careful move to 5%, 20%, and 50% is almost always cheaper than an urgent cleanup after fully shutting down the old model.

Example: a support bot

An online store’s support bot answers common questions: where the order is, how to make a return, and when the delivery will arrive. The old model does this in a dry way, but it returns stable JSON that the CRM reads without errors. The new model writes better and sounds more lively, but sometimes breaks the response format.

The problem was not the text, but the structure. In one response, the model changed a field name, in another it forgot order_id, and sometimes it returned a string instead of an array. The customer saw a normal answer, but the CRM could no longer create a card or pull the order status. These failures are dangerous because they are easy to miss in manual checks.

So the team did not turn off the old model right away. They kept both versions running for two weeks and sent the same requests through both, comparing not only the quality of the wording but also the fields needed further down the chain: intent, order_id, delivery_status, refund_reason, confidence.

After a few days it became clear that the new model handled delivery status well, but made more mistakes in return scenarios, where the CRM expected a strict format. The team fixed the parser and added stricter schema validation. If requests go through a shared gateway like AI Router, it is easier to keep this on one API and switch models without rewriting the integration.
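
The "stricter schema validation" in a case like this can be as small as a field-and-type table. A sketch using the fields from the example (exact types are assumptions for illustration):

```python
import json

# Fields the CRM parser reads downstream, with expected types (illustrative).
CRM_SCHEMA = {
    "intent": str,
    "order_id": str,
    "delivery_status": str,
    "refund_reason": str,
    "confidence": (int, float),
}

def validate_crm_reply(raw: str) -> dict:
    """Strict check: every expected field present and of the expected type.
    A missing order_id, a renamed field, or a string where the CRM expects
    a number fails here, not silently in the card-creation step."""
    data = json.loads(raw)
    for field, expected in CRM_SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"wrong type for {field}: got {type(data[field]).__name__}")
    return data
```

Checks like this are exactly what catches the failures that "are easy to miss in manual checks": the customer still sees a normal answer, but the validator flags the broken structure.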

After that, traffic was moved by request type, not all at once. First, only delivery questions went to the new model, then order questions, and only at the end - returns. That way the team could quickly see where the error rate was growing and roll back one scenario instead of the whole bot.

The old model was turned off only when the tail disappeared from the logs. There were no background jobs left, no rare intents, and no repeated requests still going to the old branch. It is a boring final step, but it is what saves you from the situation where the model was supposedly already shut down, yet the product is quietly breaking on 2% of requests.

Where teams make mistakes most often


Most often, the problem is not the shutdown itself, but too narrow a view of traffic. The team looks at real-time requests from the product, sees that the new model answers normally, and decides the transition is done. Then the batch process fails at night: email parsing, report generation, ticket re-scoring, or any other background task.

Hidden dependencies live longer than people expect. The old model is removed from the main service, but queues, cron jobs, retries, and background workers are forgotten. In the end the code is clean, but requests still go to the old endpoint through a separate consumer or old configuration. Even if the team works through a gateway like AI Router, that does not protect you from a forgotten environment variable in a batch service.

Average metrics also often lull teams into false confidence. If you only look at total error rate, average latency, and average cost, it is easy to miss rare but expensive failures. They usually appear in scenarios that are barely visible on charts: long conversations, rare languages and mixed requests, large documents hitting the context limit, or strict JSON responses for downstream services.

One percent of those errors is easy to lose in the average. For support, that is already dozens of manual investigations, reopened tickets, and angry users.

Another common mistake is not warning anyone in advance. If support, account managers, and on-call engineers do not know about the model retirement, they spend hours looking for a “random” degradation. A short warning saves a lot of time: what is changing, when the switch starts, and which symptoms count as an incident.

Many teams close rollback too early. One calm day means almost nothing. Products have weekly cycles, month-end spikes, morning peaks, and heavy overnight windows. If you remove the old model right after the first quiet period, the risk does not disappear. It just shows up later, at a less convenient moment.

The old model should be turned off only after a full check cycle: online traffic, background processes, queues, cron jobs, and support. If even one of these layers was not checked, the transition is not finished yet.

Quick check before shutdown


The old model is turned off based on facts, not the calendar. If a quiet call to the old API is still left somewhere, the break will not happen on migration day, but a week later when someone runs a rare scenario.

Before the final shutdown, it helps to run through a short checklist. It takes little time but often saves you from a night rollback.

  • Check logs for a clear period, for example 7–14 days. There should be no new requests to the old API address, old model id, or old route. Look not only at production, but also at background jobs, cron, admin panels, test environments, and mobile clients on older versions.
  • Make sure all teams and external clients received the shutdown date. One email is not enough. You need explicit confirmation: a chat reply, a ticket marked done, or an entry in the change calendar.
  • Separate monitoring for the old and new models. Errors, timeouts, rising cost, and quality drops need to be seen separately, otherwise a new problem will hide in the overall picture.
  • Assign the person who triggers rollback and describe the rollback step itself without guesswork. Who changes the route, where the config lives, how many minutes it will take to restore, and who gives approval.
  • Close not only the technical side, but also the operational requirements. If you have budget limits, invoicing, log retention, data residency, or PII masking requirements, they must work in the new setup too.
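
The first item on that list, the log sweep, is easy to automate once you list every marker of the old route. A sketch, where the marker strings are hypothetical examples of an old model id, an old path, and an old provider host:

```python
# Hypothetical markers of the old route: model id, URL path, provider host.
OLD_MARKERS = ("old-model-v1", "/v1/legacy", "api.old-provider.example")

def remaining_old_traffic(log_lines, markers=OLD_MARKERS):
    """Scan recent request logs for any remaining calls to the retired
    route. Returns the offending lines per marker, so you can trace which
    service, cron job, or client is still sending them."""
    hits = {m: [] for m in markers}
    for line in log_lines:
        for m in markers:
            if m in line:
                hits[m].append(line)
    return {m: lines for m, lines in hits.items() if lines}
```

Run it over the full 7-14 day window, including logs from background workers and test environments: an empty result is the fact-based signal that the old path is actually quiet.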

At this stage, unpleasant little things usually come up. For example, production has long been moved to the new model, but the old endpoint is still being called by a nightly batch process for reports. During the day everything looks clean, but at night the system starts sending traffic to something you were already about to shut down.

For teams in Kazakhstan, this is especially important when the model changes are driven not only by quality, but also by data requirements. If traffic is moved to AI Router, it is worth checking audit logs, AI-content labels, PII masking, rate limit rules, and how billing works in tenge. These things do not break the model response immediately, but they do break the process around it.

If even one item is still unclear, it is better to push the shutdown back. Model retirement goes smoothly only when the old path is already empty, the new path is observable, and rollback can be turned on in a few minutes.

What to do next

After the model is retired, do not close the work on the same day. Two or three days later, the team needs a short review: why the model was removed, what happened during the switch, which decisions worked, and which did not. Without that, no one will remember a month later why this exact switching order was chosen.

It is better to record numbers, not general conclusions. Compare errors before and after the switch, latency at p50 and p95, fallback share, and cost per 1000 requests or per million tokens. Even a note like “errors dropped from 1.7% to 0.8%, p95 rose by 120 ms, spend fell by 11%” is more useful than a long report without metrics.
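
For the latency numbers, a simple nearest-rank percentile over the request log is enough for this kind of note; no stats library needed. A minimal sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: good enough for a post-switch note
    comparing p50 and p95 latency before and after the migration."""
    s = sorted(samples)
    k = max(math.ceil(p / 100 * len(s)) - 1, 0)
    return s[k]
```

Computing p50 and p95 on the same request sample before and after the switch turns "latency feels fine" into a number you can put in the review.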

Then clean up everything that still drags the product backward. Old prompts, tests, alerts, and temporary rules often live longer than the model itself. Because of that, a new team might accidentally restore the old route, while monitoring keeps making noise about a non-existent endpoint for another month.

After shutdown, four steps are usually enough:

  • delete old prompts and templates that are no longer used
  • remove tests tied to the old model response
  • disable alerts and dashboards that only tracked this model
  • update the runbook and rollback rule for the new setup

It is best to handle this cleanup in a separate PR. That makes it easier to check that there are no hidden dependencies or temporary workarounds left in the repo after migration.

Finances should also be recalculated. After the replacement, teams often look only at the model price and miss a longer prompt, more retries, or a lower cache hit rate. Sometimes the model is cheaper per token but more expensive on the final monthly bill. The same goes for latency: average response time may drop, but a long p95 tail can still hurt the user experience.

If you run multiple models through AI Router or api.airouter.kz, do not wait for the next incident. Set the replacement route for the next decommissioning now, enable audit logs, and set limits on keys. Then, during the next switch, you will quickly see which service is still calling the old model and can limit traffic precisely, without unnecessary risk to the whole product.

Frequently asked questions

Why can’t we just turn off the old model on migration day?

Because the failure rarely hits the whole product at once. More often, quiet scenarios break: queues, cron jobs, nightly reports, old fallback branches, and parsers that expect the old response format.

During the day everything may look fine, and at night retries and backlog build up. If users are already complaining while the dashboard is green, the model was removed too early.

Who should be notified before the work starts?

First, notify product, support, analytics, and security. Each group has a different job: product checks user flows, support prepares replies for complaints, analytics marks the switch window, and security reviews logs, PII, and access rules.

If these teams find out after the fact, you lose hours to back-and-forth and manual investigation.

Which dates should be announced in advance?

Usually three dates are enough: freeze, switch, and shutdown. Add the time zone, owner, and rollback condition to each date right away.

That way there are no gray areas. People know when they can no longer add new prompts for the old model, when main traffic moves to the new one, and when the old route is closed for good.

How do we find hidden dependencies on the old model?

Start with call logs from the last 30–60 days and look beyond frequent requests. A rare call at the end of the month may be required for reports or closing tickets.

Then check code, queues, cron jobs, admin panels, analyst notebooks, external integrations, and old configs with a hard-coded model id or base_url. If traffic goes through AI Router, it helps to look at API keys, model ID, and services on the same OpenAI-compatible endpoint.

Why open a dual-support window?

It gives you a real check, not a polished report. You keep the old and new models side by side, send the same requests to both, and see where the new version changes the format, becomes more expensive, or makes more mistakes.

Usually 7–14 days is enough. If you do not set an end date, the temporary setup will stay with you for too long.

What should we compare between the old and new model every day?

Do not look at the text alone. The app can break because of an empty field, a different JSON shape, a timeout, more retries, or a long latency tail, even if the answer sounds better.

It helps to store results from both models side by side and compare them on the same set of requests. Then you can see whether the problem is quality, format, or cost.

How do we move traffic without extra risk?

Move traffic in steps: first internal and low-risk scenarios, then 5%, 20%, 50%, and only after that the full flow. Between stages, let the system experience normal peaks, not just quiet hours.

If you use AI Router, change the route at the gateway level instead of rushing client code changes. That makes it much easier to roll back a bad step.

When should we roll back the migration?

The rollback decision should be defined in advance, not during the outage. Set simple thresholds: errors, empty responses, timeouts, cost, or latency above normal for an agreed amount of time.

Support also gives you a separate signal. If complaints start coming in like “the bot is answering oddly” or “the ticket disappeared,” do not wait for the dashboards to catch up.

What should we verify right before the final shutdown?

Before the final shutdown, check logs for 7–14 days and make sure new traffic is no longer going to the old endpoint, model id, or route. Review production, background jobs, cron, test environments, and mobile clients on older versions.

Also check that rollback really works in minutes, not through a long procedure. For teams in Kazakhstan, it is worth looking separately at audit logs, PII masking, AI-content labels, rate limits, and billing in tenge.

What should we do after turning off the old model?

Do not close the task on the same day. A few days later, review the numbers: errors, p50 and p95, fallback rate, cost per scenario, and support complaints.

After that, remove old prompts, tests, alerts, and temporary rules that still pull the product backward. If you leave that tail in place, someone may later bring back the old route or break monitoring.