Apr 06, 2026 · 8 min read

Internal Model Catalog: Statuses and Rules

An internal model catalog helps teams see model status, access, and retirement timelines so they do not choose models blindly.


Why teams choose a model blindly

Usually the reason is simple: the team takes whatever is already at hand. Someone has an old code sample with one model, someone saw it in a demo, someone copies a setup from a neighboring service. When deadlines are tight, few people compare price, latency, data restrictions, and quality on their own task.

The problem does not show up right away. One developer connects an expensive model for a simple support ticket classification task. Another sends sensitive data to a service that internal rules forbid. A third builds a feature on an experimental version, and after the next update gets a different answer style and a different error rate.

If there are no shared rules, every team builds its own set of integrations. Names look alike, versions get mixed up, and nobody records why a model was chosen. A couple of months later it becomes hard to answer basic questions: which model can be used in production, who owns it, what to do when quality drops, and what to replace it with.

Costs also rise quickly. Companies pay for a model that is too powerful where a simple one would have been enough, keep several almost identical integrations alive, repeat the same tests, and run into failures after a provider changes a version or limits.

A simple list of available models does not help here. It only answers the question "what do we even have?". For real work, different answers are needed: what the model is suitable for, what data can be sent to it, whether it is allowed in production, and who approves exceptions.

This becomes especially visible when dozens or hundreds of models are available through one gateway. For example, if a company works through AI Router, technically connecting a new model is easy. But without statuses, access rules, and a clear lifecycle, teams will still choose based on memory, habit, or someone else’s advice.

The problem is not the number of models. The problem is the lack of a shared rule: we use this model for this scenario, we test that one, and we do not touch this one.

What a model card should contain

If a model does not have a proper card, people start choosing based on rumors: "this one is smarter", "that one is cheaper", "someone has already tried this one". For an internal catalog, that is enough to create extra spending, questionable data decisions, and confusion in production.

A good card answers work questions in a minute. A person opens the record and immediately understands what the model is for, who to contact, how much it costs, and what data can be sent to it.

There are five essential blocks to keep in the card.

First, a clear name and one purpose. It is better to write "Qwen 3 for support draft replies" than to leave only a technical identifier. If there are multiple purposes, employees will start interpreting the card in their own way.

Second, an owner and a live contact. It is better to name a team or role rather than one person. A format like "ML Platform, #llm-help channel" survives vacations, resignations, and project changes.

Third, explicit boundaries of use. Allowed scenarios should be listed directly: call summarization, request classification, draft email generation. Prohibitions are also better written without hints: do not use for final legal text, do not send personal data, do not use for decisions without human review.

Fourth, simple numbers. Not everyone needs a detailed token-based price list. Often three lines are enough: average cost of a typical request, usual latency, and the limit for context or request volume.

Fifth, data and storage rules. Here, general wording is not enough. It should state clearly whether personal data may be sent, where logs are stored, whether fields are masked, and whether the model is suitable for data that must remain inside Kazakhstan.

It is useful to add a short example. For example: "Suitable for classifying customer requests. Not suitable for medical advice. Average latency 1.8 seconds. Input with IIN is prohibited without masking." This format removes half the questions before the pilot even starts.
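
To make this concrete, here is a minimal sketch of how such a card could be stored as structured data. The field names and example values are illustrative only, not a required schema; adapt them to whatever the catalog actually tracks.

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    """One catalog record; each field maps to one of the five blocks above."""
    name: str                     # clear name with one purpose
    owner: str                    # team or role plus a live contact, not one person
    allowed_scenarios: list[str]  # explicit "use for" list
    prohibited: list[str]         # explicit "do not use for" list
    avg_cost_per_request: str     # simple numbers, not a full token price list
    typical_latency_s: float
    context_limit: str
    data_rules: str               # PII, log storage, masking, data residency

# Illustrative example matching the card described in the text
card = ModelCard(
    name="Qwen 3 for support draft replies",
    owner="ML Platform, #llm-help channel",
    allowed_scenarios=["customer request classification", "draft replies"],
    prohibited=["medical advice", "input with unmasked IIN"],
    avg_cost_per_request="a few tenths of a cent per typical request (example value)",
    typical_latency_s=1.8,
    context_limit="32k tokens",
    data_rules="No personal data without masking; logs stay inside Kazakhstan",
)
```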

If the company works through a single gateway, some fields can be pulled in automatically. In the case of AI Router, these may include audit logs, key-level rate limits, PII masking rules, and scenarios where data must remain within the country. But the card is not needed for reporting. It is needed so that people choose models without guessing.

Which statuses are needed

Statuses are not for bureaucracy. They show where and under what conditions a model can be used.

The same model may write text well but fail to extract fields from a contract or break down where strict JSON is required. So the status should answer not "is the model good or bad" but "can we use it for this job".

For most companies, five statuses are enough:

  • Draft - the model has just been connected and basic behavior, limits, and cost are being checked.
  • Pilot - the model has been given to one team or one scenario for a limited time.
  • Approved for production - the model can be used in live processes within clearly defined boundaries.
  • Under observation - there are complaints about quality, latency, or stability, and new connections should be paused.
  • Sunset - a shutdown date and a replacement have already been chosen for the model.

Each status should have simple transition rules. Who moves the model forward, which tests are checked, how long the pilot lasts, how many complaints are enough for the "under observation" status, and who approves retirement from service. Without this, statuses quickly become a formality.

Next to the status, it is useful to keep four more fields: owner, date of last review, date of next review, and the reason for the current state. Then an engineer does not have to guess why one model is already approved while another has been stuck in pilot for three months.
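
If the catalog lives in code or a small internal service, the statuses and their transition rules can be encoded directly, so a move that skips a step is rejected. A hedged sketch follows; the transition map is an assumption and should be adjusted to your own approval process.

```python
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    PILOT = "pilot"
    APPROVED = "approved for production"
    UNDER_OBSERVATION = "under observation"
    SUNSET = "sunset"

# Allowed moves between statuses; anything else requires an explicit exception.
ALLOWED_TRANSITIONS = {
    Status.DRAFT: {Status.PILOT, Status.SUNSET},
    Status.PILOT: {Status.APPROVED, Status.UNDER_OBSERVATION, Status.SUNSET},
    Status.APPROVED: {Status.UNDER_OBSERVATION, Status.SUNSET},
    Status.UNDER_OBSERVATION: {Status.PILOT, Status.APPROVED, Status.SUNSET},
    Status.SUNSET: set(),
}

def change_status(current: Status, new: Status, reason: str) -> Status:
    """Move a model along an allowed edge only, and always record the reason."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {new.value} is not allowed without an exception")
    if not reason:
        raise ValueError("every status change needs a recorded reason")
    return new
```

The owner, last review date, and next review date can sit alongside the status in the same record, as in the card sketch above.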

When a company uses one gateway to reach different providers, such statuses are especially useful. Connecting a new model is technically easy. It is much harder to clean up the consequences later if each team chose models in its own way.

How to separate access to models

If all employees see the same list of models and can use any of them without restrictions, the catalog quickly loses its purpose. One team takes an expensive model for a simple task, another accidentally sends sensitive data to the wrong service. Access restrictions are needed so that people choose from their safe set, not from the full list.

A full overview is usually needed by the platform team, architects, security, and those responsible for procurement and risk. Everyone else needs only a curated view of approved models for their class of tasks: chat, classification, search, summarization, code.

In practice, there are not many roles. The platform team sees the full catalog and changes statuses and limits. Product teams and ML engineers see approved models and can run pilots in their projects. Security and the data owner approve work with personal, financial, and medical data. The service owner or CTO opens production access after checking cost, quality, and risk.

A pilot without separate approval can be allowed only within clear boundaries. The data must be test data, anonymized, or pre-masked. The model must have a status that allows a pilot, and the team must have a budget and request-rate limit.

For production, the rules should be stricter. Access should be opened not by the person who found a model that looked good in a comparison, but by the person responsible for the service and the consequences of mistakes. Most often this means two approvals: the service owner confirms the business risk, and security confirms the data regime. If requests contain sensitive data, it is better to add approval from the data owner or compliance.
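
As a sketch of that two-approval rule (plus a third approval when requests contain sensitive data), a check like the one below could run before production access is opened. The role names are assumptions, not a fixed scheme.

```python
def can_open_production_access(approvals: set[str], handles_sensitive_data: bool) -> bool:
    """Production needs the service owner and security; sensitive data adds the data owner."""
    required = {"service_owner", "security"}
    if handles_sensitive_data:
        required.add("data_owner")
    return required.issubset(approvals)

# Example: a team requests production access for a model that sees CRM fields
print(can_open_production_access({"service_owner", "security"}, handles_sensitive_data=True))  # False
```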

Exceptions are also needed, but only with an end date. Otherwise a temporary solution lives for years, and nobody remembers why it was opened. An exception request usually only needs to state which model is needed, for what task, what data will go into the request, who owns the experiment, when access will be closed, and how the team will roll back if something goes wrong.

If access to models goes through a single API gateway, such rules are easier to enforce technically. In AI Router, different keys, rate limits, audit logs, PII masking, and separate policies for locally hosted models can be used when data must remain inside the country.

How to manage a model through its lifecycle


A lifecycle is needed so that a new model goes through the same path instead of landing in the catalog after a couple of good prompts. If the path is not defined, teams start arguing after launch.

Usually, the request for a new model is submitted by the owner of the scenario: a product team, an ML engineer, or a tech lead. They should record not only the model name, but also the task itself, example prompts, acceptable price, latency requirements, and data restrictions. For a bank, telecom, or healthcare service, that is already enough to quickly rule out half the options.

The path itself can be kept short: under review, validation, pilot, production, archive. This is not the same as the operational status in the catalog. Lifecycle stages show where the request is and what the team has already checked.

In the validation stage, the platform, security, and scenario owner look at specific things: where the data goes, whether PII is masked, whether audit can be kept, how the model handles the required language, format, and response length, and how much a typical request costs. For companies in Kazakhstan, requirements for keeping data inside the country and complying with local AI rules are often added here.

A model should move into pilot only after a short test on real examples. Often 30–50 requests for one scenario are enough to reveal repeated failures: the model breaks the format, responds too slowly, ignores instructions, or exceeds the budget. The pilot is best limited to one team and one access path. Otherwise a rough version spreads across the company quickly.
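
A minimal sketch of such a pilot check, assuming access goes through an OpenAI-compatible endpoint (the base URL, API key, model name, prompt, and latency threshold are placeholders): run the same 30–50 real examples, measure latency, and count answers that break the required format.

```python
import json
import time
from openai import OpenAI  # any OpenAI-compatible gateway works here

client = OpenAI(base_url="https://gateway.example.internal/v1", api_key="YOUR_KEY")  # placeholders

def run_pilot_check(model: str, examples: list[str], max_latency_s: float = 3.0) -> dict:
    """Send the same real examples to one model and count slow or malformed answers."""
    slow, broken = 0, 0
    for text in examples:
        start = time.monotonic()
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Classify the request and answer only with JSON: {\"category\": \"...\"}"},
                {"role": "user", "content": text},
            ],
        )
        if time.monotonic() - start > max_latency_s:
            slow += 1
        try:
            json.loads(resp.choices[0].message.content)
        except (TypeError, json.JSONDecodeError):
            broken += 1  # the model did not return strict JSON
    return {"total": len(examples), "too_slow": slow, "format_broken": broken}
```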

A model goes into production when it has passed tests, fits the cost target, and has behaved stably for at least a few weeks. It should have an owner - a person or team that monitors quality, provider updates, and user complaints. If there is no owner, it is too early to give production status.

Rollback should also be a normal part of the process. If after an update the model starts answering worse, costs go up, the required format breaks, or the provider becomes unstable, it can be moved back from production to pilot or immediately to under observation. This is not a failure; it is a normal working measure.

It is better to remove a model from the catalog when it has no owner, there is no demand for it, the provider has stopped supporting it, or the company already has a better replacement. The most common mistake is leaving such entries "just in case". After a few months, nobody understands whether they can still be used.

How to launch the catalog step by step

You do not need to build the catalog for months. The first working version can be put together in a few days if you start not with a list of trendy models, but with real team tasks.

First, collect 5–10 scenarios where a model is already needed or will appear in the next month. For example, support answers customer requests, lawyers review contract templates, analysts summarize long reports. When the tasks are named clearly, the catalog stops being an abstract table.

Then set 3–4 selection criteria. Usually questions like these are enough: does the model write well in Russian or Kazakh, does it handle the required text length, does it stay within the allowed price, and does it respond fast enough for the scenario. If there are too many criteria, teams will go back to choosing by eye.

After that, you need a short cycle:

  • take one task from each team,
  • compare 2–3 models for it,
  • run the same examples,
  • assign an owner and a review date.

The comparison does not need to be complicated either. For the first pass, three fields are enough: how useful the answer is, how much a typical request costs, and how many seconds the user waits. If one model is only 5% better but costs four times more, it should not be opened to broad use.
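
A hedged sketch of that first-pass comparison: the three fields and the rule of thumb (a small quality gain does not justify a multiple of the cost) come from the text, everything else here is an assumption.

```python
from dataclasses import dataclass

@dataclass
class PilotResult:
    model: str
    usefulness: float        # share of answers the team rated as useful, 0..1
    cost_per_request: float  # in your billing currency
    avg_latency_s: float

def pick_default(results: list[PilotResult], max_latency_s: float) -> PilotResult:
    """Prefer the cheapest model that is fast enough and close to the best quality.

    Assumes at least one candidate meets the latency target.
    """
    fast_enough = [r for r in results if r.avg_latency_s <= max_latency_s]
    best_quality = max(r.usefulness for r in fast_enough)
    # A model that is only marginally better does not justify several times the cost.
    good_enough = [r for r in fast_enough if r.usefulness >= best_quality - 0.05]
    return min(good_enough, key=lambda r: r.cost_per_request)
```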

A small example shows the logic well. Suppose support needs an LLM for answering common questions. The team takes three models, runs twenty real dialogues, and compares cost, errors, and average latency. One model is best overall, but sometimes exceeds the time limit. In the catalog, it can get the status "approved for offline tasks", while another option is left for chat.

If the company already has a single gateway like AI Router, the launch is faster: the model can be changed without rewriting code, and price and latency comparisons can be collected in one place. But even without that, the principle stays the same: one task, a short test, a clear status.

Example for one business task


Imagine a bank support team. The team wants to answer common questions faster: where to find a card limit, why a transfer has not arrived yet, how to change a phone number in the app. If the rules are not set in advance, operators will start using the first model they find. For customer service, that is a bad path.

For such a task, it is better to separate model roles from the start. A cheaper model prepares a draft answer based on the knowledge base, FAQ, and internal instructions. It helps save time, but it does not write to the customer directly.

For the external response, the bank keeps only models with a status like "approved for external replies". These models have already been checked for tone, wording accuracy, and compliance with banking restrictions. If a model writes quickly but often invents details, it should not get this status, even if it is noticeably cheaper.

Access is also better separated by level. The draft model works only with knowledge base articles and anonymized examples. The model for customer replies receives only the necessary fields from the CRM. Models without separate approval do not see full names, document numbers, or transaction history. For sensitive data, the bank may keep only locally hosted models or enable PII masking.

The process itself looks simple. The operator receives the customer’s question. The first model suggests a draft: a short answer and a list of steps. The second model from the approved set rewrites the text in the right tone, or the operator sends the answer after review. This way the team does not pay an expensive model for every intermediate step.
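
A rough sketch of that two-step flow through an OpenAI-compatible API; the model names, prompts, and knowledge-base snippets are placeholders, and in a real setup the operator still reviews the final text before it reaches the customer.

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.internal/v1", api_key="YOUR_KEY")  # placeholders

def draft_then_rewrite(question: str, kb_snippets: list[str]) -> str:
    """Cheap model drafts from the knowledge base; an approved model rewrites it for the customer."""
    draft = client.chat.completions.create(
        model="cheap-draft-model",  # status: approved for internal drafts only
        messages=[
            {"role": "system",
             "content": "Draft a short answer using only the provided knowledge-base snippets."},
            {"role": "user", "content": f"Question: {question}\n\nSnippets:\n" + "\n".join(kb_snippets)},
        ],
    ).choices[0].message.content

    reply = client.chat.completions.create(
        model="approved-external-model",  # status: approved for external replies
        messages=[
            {"role": "system",
             "content": "Rewrite the draft in the bank's tone. Do not add facts that are not in the draft."},
            {"role": "user", "content": draft},
        ],
    ).choices[0].message.content
    return reply  # the operator reviews this text before sending
```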

After a month, the rules are reviewed based on facts, not impressions. They look at cost per conversation, the share of manual edits, the number of errors and complaints, and the average response time. If the cheaper model saves money but the operator rewrites half the text, its status is lowered. If the approved model maintains quality and stays within policy, it gets more scenarios.

At that point, the catalog stops being a list of names. It starts telling people which model to use for a draft, which one for the customer, and which one should not be used for banking data at all.

Common mistakes

The catalog quickly loses its meaning if it is built like a display window instead of a working reference. Even with one gateway, confusion does not disappear: the team sees dozens of models, and people choose the one they have already seen somewhere.

The most common problem is unnecessary complexity. If a model has seven or eight statuses, employees stop telling them apart. In real life, clear states are usually enough: draft, pilot, approved, under observation, and sunset.

A shared access model for everyone does just as much harm. A model for internal knowledge-base search and a model for customer data processing cannot be opened in the same way. Support, analytics, and ML teams have different tasks and different risk levels. If this is ignored, one team will start using an unsuitable model for sensitive tasks, and another will get stuck in unnecessary approvals.

The catalog is most often broken by the same mistakes:

  • the model has no owner,
  • the card has no next review date,
  • the description only says "writes texts well" but does not explain the prohibitions,
  • the old version stays in the list after a new one is released,
  • almost identical models sit next to each other without explanation of cost, latency, and risk.

It is especially dangerous when the card has no "do not use for" section. If there is no direct prohibition, employees define the limits for themselves. In the end, the same model is used both for draft generation and for tasks involving personal data, even though it is not suitable for the second scenario.

Checking catalog quality is easy. Open any card and ask five questions: who is the owner, who can access it, when is the next review, what tasks is the model allowed for, and when will it be removed from the catalog. If the card does not answer at least two of these questions, the catalog is already steering teams by guesswork.

Short checklist


Before publishing a model to the catalog, it is enough to quickly check five points:

  • the model has an owner and a next review date;
  • the status is understandable without a call or extra explanation;
  • the data rules are visible right away;
  • budget and expected latency are stated clearly;
  • the exception path is clear and does not live only in messages.
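
If the catalog lives in code or config, the same five points can be enforced automatically before a card is published. A small sketch, assuming the card is a plain dictionary with these illustrative keys.

```python
REQUIRED_FIELDS = [
    "owner",             # the model has an owner
    "next_review",       # and a next review date
    "status",            # the status is understandable without a call
    "data_rules",        # the data rules are visible right away
    "budget",            # budget and expected latency are stated clearly
    "expected_latency",
    "exception_path",    # the exception path does not live only in messages
]

def ready_to_publish(card: dict) -> list[str]:
    """Return the missing fields; an empty list means the card can be published."""
    return [f for f in REQUIRED_FIELDS if not card.get(f)]
```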

A good sign is when a new employee can understand without help why one model is open to everyone and another is available only for a pilot. A bad sign is when the answers live in chats and the card is empty or outdated.

For companies with strict data requirements, this list is especially useful. If the platform already shows data-type restrictions, audit logs, and key-level limits, these fields are better placed in the model card rather than hidden in settings.

If even one point is missing, it is better not to publish the model as the default available option. Fixing the consequences almost always takes longer than filling out the card properly once.

What to do next

Do not try to build the catalog in one go. To start, 10–15 models are enough - the ones teams already use or request most often. That is enough to test the rules on live tasks without drowning in tables and manual work.

Lock in one card format and one set of statuses right away. If one team calls a status "can go to prod" and another calls it "checked", the catalog will quickly drift out of control.

A good first step looks like this: choose a small set of models for 2–3 common scenarios, assign a catalog owner and owners by domain, bring access, audit, and limits into one framework, and after 30 days run a short review. Within a month, you can usually see which cards nobody opens, which statuses people confuse, and where teams still bypass the catalog.

When model access lives in one place, it is easier to manage. An admin can immediately see who is calling the model, how many tokens the team is using, where limits are exceeded, and which requests require audit.

When there are many sources, a single gateway really helps. For example, AI Router on airouter.kz gives access to different models through one OpenAI-compatible endpoint, and it also allows audit logs, key-level rate limits, PII masking, and scenarios with data stored inside Kazakhstan. For the catalog, this is useful not just on its own, but because the rules can be enforced not only in a document but also in the infrastructure itself.

Even that is already enough to make the catalog a working tool instead of just another table with a list of models.

Frequently asked questions

Why does a company need a model catalog at all?

The catalog helps teams avoid choosing a model from memory or by copying someone else’s example. It immediately shows what the model is good for, what data can be sent to it, who owns it, and whether it can go into production.

What must be included in a model card?

Start with the basics: the model’s purpose, owner, allowed scenarios, clear prohibitions, cost per typical request, usual latency, and data handling rules. If someone still has to ask in chat after reading the card, it is not ready yet.

What statuses should we introduce in the catalog?

For a start, five statuses are enough: draft, pilot, approved for production, under observation, and sunset. That is enough for people to understand what they can use right now and what they should leave alone.

Who should move a model between statuses?

Status changes should not be handled by one random team. It is usually a clear set of roles: the scenario owner looks at business value, the platform checks cost and stability, security reviews the data, and the service owner decides on production use.

How can we launch a pilot without too much bureaucracy?

Run the pilot in a narrow scope: one team, one scenario, test or masked data, a budget limit, and a clear end date. That way you can spot issues quickly without spreading an unfinished setup across the company.

When can a model be moved into production?

Move a model to production only after a short test on real examples and a couple of weeks of stable behavior. Before that, check cost, latency, response format, behavior in Russian or Kazakh, and the data handling mode.

What should we do if a model suddenly gets worse?

Lower the status right away and switch to a fallback if the model starts breaking format, getting more expensive, or answering worse. Do not wait for complaints to pile up: record the reason, set a review date, and decide whether to roll back or remove it from the catalog.

How should access to models be separated properly?

The easiest way is to open the full catalog only to the platform team, architects, and security. Product teams should see only approved models for their tasks, and access to sensitive data should be granted separately with a clear owner.

How many models should be included in the first catalog?

On the first pass, do not include every available model. Start with 10–15 of the most frequently needed ones. That is enough to test the card format, statuses, and access rules on real tasks without getting buried in a spreadsheet.

How do we keep the catalog from going stale quickly?

Give each model a next review date and do not publish a record without an owner. If you have one gateway, such as AI Router, enforce the rules in the infrastructure too: split keys, limits, audit, and data handling modes.