SDK Compatibility After Changing base_url: Where It Breaks
SDK compatibility after changing base_url usually does not break at authentication; it breaks in streaming, tool calls, and JSON schemas. Here are the most common failures.

Why Changing base_url Doesn't Preserve Behavior
On paper, it sounds simple: change base_url, keep the same SDK, and the app should keep working as before. In practice, that rarely happens. The SDK hides a lot of small details from the team: how it builds the HTTP request, which fields it adds by default, how it reads the event stream, and what it does when the provider returns the answer in a slightly different shape.
Because of that, the same call in code does not guarantee the same result. A request may complete successfully, but the behavior is already different. One provider returns tool_calls in the expected format, another adds the field differently. One stream cuts the answer into convenient chunks, another does it its own way. From the outside it all looks "compatible," but inside the details diverge, and those details are what production depends on.
This becomes especially noticeable when a team moves to an OpenAI-compatible gateway. For example, a service is moved to AI Router, the old SDK and the same code are kept because the endpoint is compatible. The basic text generation request works, and the migration seems finished. Then chat stops streaming long answers reliably, the agent loses function arguments, and JSON sometimes fails validation. The problem is not one bug. Only the simplest scenario matched.
The first failure often shows up in production. A test request is usually short, with no streaming, no tools, and no strict response schema. Users do something different: they ask long questions, disconnect, retry, and send data with empty fields. That is where the difference between "the request succeeded" and "the system behaves the same way as before" becomes visible.
One successful request proves almost nothing. It is more useful to look not at whether there was a response, but at whether the behavior is repeatable. Do the error fields match? Does the event order in the stream match? Does the tool argument format match? What about schema strict mode, timeouts, and retries? If you do not check that in advance, the team will learn about the mismatch at the worst possible moment, when traffic is already live and rolling back costs more than checking properly before release.
What Breaks First in Streaming
After changing base_url, the code often connects on the first try, and the failure comes later when reading chunks. The SDK sends the request as before, but the stream handler often assumes fields will arrive in a very specific order. That is easy to miss in testing. In production, the interface suddenly shows an empty answer, duplicates tokens, or hangs until generation ends.
Most often the problem starts with the delta format. One provider puts the text in delta.content, another first sends delta.role, then separate pieces of content, and a third adds internal fields your code was not expecting. If the parser assumes every chunk contains text, it starts joining null, empty strings, or simply crashes.
The role field is another common source of confusion. The team writes simple code: receive the first chunk, take role, create the message, start printing text. In practice, the first chunk may be empty, the next may carry only role, and the text arrives only after that. Sometimes the text arrives before the interface has had time to create a container for the response, and the user thinks streaming is broken.
usage is similar. In many integrations, people want to see tokens during generation, but some OpenAI-compatible gateways and providers return usage only at the very end. If limits, billing, or internal stats depend on intermediate values, you see zeros until the stream finishes and draw the wrong conclusions.
Empty chunks break a lot of code. They are normal: the gateway may keep the connection alive, buffer the answer, or send a service event without text. A simple parser sees an empty event and decides the answer is over. After that, the service cuts the output off mid-sentence.
Another common mistake is expecting finish_reason too early. Some implementations send it only in the final event, when all the text has already been collected. If the app closes the stream after the first suspicious chunk or waits for finish_reason in every message, part of the answer is lost.
A reliable parser behaves more calmly. It skips empty chunks, collects text only from fields where it actually exists, does not depend on the order of role, content, and usage, and ends the stream on the final event or when the stream closes. In a test environment, it is useful to save the raw chunks too. That is what helps later when you need to see where the behavior diverges.
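Here is a minimal sketch of that kind of defensive reader, written against the OpenAI Python SDK and an OpenAI-compatible endpoint. The base_url, API key, and model name are placeholders, not a recommendation.

```python
# Minimal sketch of a defensive stream reader using the OpenAI Python SDK.
# The base_url, api_key, and model name are placeholders for your own values.
from openai import OpenAI

client = OpenAI(base_url="https://new-gateway.example/v1", api_key="...")

def read_stream(messages, model="gpt-4o-mini"):
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    parts = []
    raw_chunks = []        # keep raw chunks in test environments for later comparison
    finish_reason = None
    for chunk in stream:
        raw_chunks.append(chunk.model_dump())   # save before any parsing
        if not chunk.choices:                   # keep-alive or usage-only chunks are normal
            continue
        choice = chunk.choices[0]
        if choice.delta and choice.delta.content:   # content may be None or empty
            parts.append(choice.delta.content)
        if choice.finish_reason:                    # may appear only in the final event
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason, raw_chunks
```

The key point is what the code does not assume: it does not require text in every chunk, does not depend on the order of role, content, and usage, and only treats the stream as finished when the final event or the connection says so.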
If streaming "almost works" after changing base_url, the problem is usually not the SDK. Almost always, the hidden assumptions in your parser about how events should arrive are what break.
Where Tool Calls Get Mixed Up
After changing base_url, what usually breaks first is not the chat itself, but the logic around tools. The model may still return text that looks fine, but tool_calls quickly reveal the differences between providers and models.
The first trap is the call ID. One SDK will happily accept call_abc123, another expects a UUID, and a third stores the ID as an opaque string and does not care about the format. If your code trims the ID by length, looks for a familiar prefix, or maps it to its own pattern, the failure will appear even though the model answered correctly.
The second problem is tool arguments in the stream. Not every API sends ready-made JSON in one piece. Often the model sends the tool name right away, while arguments are assembled piece by piece: first {"order_id":, then the value, then the closing brace. If the service tries to parse the JSON too early, it either crashes or runs the tool with empty fields.
This is easy to see in a simple scenario. Suppose a support bot needs to call get_order_status. During streaming, the SDK has already seen the function name, but the order number has not arrived in full yet. One client waits for the final fragment, another calls the handler immediately. The result is different even though the request was the same.
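A sketch of the safer behavior, assuming the OpenAI-style streaming format where delta.tool_calls carries an index, an optional id and name, and argument fragments. The structure is illustrative; the point is that arguments are only parsed once the stream has ended.

```python
# Sketch: assemble streamed tool-call arguments before parsing them.
# Assumes OpenAI-style chunks where delta.tool_calls carries fragments keyed by index.
import json

def collect_tool_calls(stream):
    calls = {}  # index -> {"id": ..., "name": ..., "arguments": ""}
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        for tc in (delta.tool_calls or []):
            slot = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
            if tc.id:
                slot["id"] = tc.id                          # treat the id as an opaque string
            if tc.function and tc.function.name:
                slot["name"] = tc.function.name
            if tc.function and tc.function.arguments:
                slot["arguments"] += tc.function.arguments  # fragments, not valid JSON yet
    results = []
    for slot in calls.values():
        try:
            args = json.loads(slot["arguments"] or "{}")
        except json.JSONDecodeError:
            args = None   # incomplete arguments: do not run the tool
        results.append({"id": slot["id"], "name": slot["name"], "arguments": args})
    return results
```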
Parallel calls create even more confusion. Some models can return two tools in one response, while others return only one, even if you asked for parallel mode. The gateway may normalize the format, but it will not add behavior that does not exist upstream. So even through a single endpoint, teams still run into the real differences between models.
Retry deserves special attention. If the network drops after the tool has already run, the SDK may retry the request and trigger the same call again. For reads, that is annoying but tolerable. For sending an email, creating a ticket, or deducting bonuses, that becomes a double action.
In code, it is better to build in a few simple rules right away: accept the call ID as a plain string without extra format checks, collect arguments until the JSON is complete and only then run the tool, and keep side-effect actions idempotent. It also helps to have a fallback path without parallel calls and to check not only the call data but also finish_reason or its equivalent.
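For the idempotency rule, a minimal sketch looks like this. It keys on your own request_id plus the tool name and arguments; the in-memory dict stands in for Redis or a database, and the handler is whatever function actually performs the side effect.

```python
# Sketch: make side-effect tools idempotent so a retried request does not act twice.
# The in-memory dict is illustrative; in production this state lives in Redis or a database.
import hashlib
import json

_executed: dict[str, object] = {}

def run_tool_once(request_id: str, name: str, arguments: dict, handler):
    key = hashlib.sha256(
        f"{request_id}:{name}:{json.dumps(arguments, sort_keys=True)}".encode()
    ).hexdigest()
    if key in _executed:          # a retry maps to the same key
        return _executed[key]     # return the previous result instead of acting again
    result = handler(**arguments)
    _executed[key] = result
    return result
```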
The final status is also not standardized. One SDK waits for tool_calls, another treats stop as normal, and in streaming the status may appear only in the last chunk. Because of that, the agent loop sometimes ends too early or, on the contrary, keeps waiting for a response that will never come. In practice, teams do best when they test each tool calling step separately on a live model.
Why Structured Output Diverges from the Schema
After changing base_url, many people expect JSON Schema to work the same everywhere. Usually it does not. One provider treats the schema almost like a contract, while another treats it like a suggestion the model may ignore.
The problem is not obvious at first. The SDK returns a successful response, and your service fails on the next step when it tries to parse the fields by type. For a developer, it looks strange: the request succeeded, the model answered, but the business logic cannot use the result.
The most common failure is simple. The model adds text around the JSON. Instead of a clean object, you get something like "Done, here is the result:" followed by the JSON itself, sometimes inside a markdown block. A person can read that answer easily; a parser usually cannot.
Type problems are just as common. A number comes back as a string, a boolean field comes back as "true", and an array turns into a single comma-separated string. If the code expects price: number and approved: boolean, that answer is no longer correct, even if it looks fine at a glance.
Deep schemas break more often than simple ones. When the response contains nested objects, lists of objects, enums, and many optional fields, the model is more likely to skip part of the structure or change the type at one level. A flat schema with a few fields usually holds up better.
This becomes especially visible when the team sends the same request to different models through one OpenAI-compatible gateway. The endpoint is the same, the SDK is the same, but schema behavior differs, because the difference is not in the SDK but in the model and in how the provider implemented structured output.
The risk goes down with fairly practical steps: make the schema simpler, do not ask the model to add explanatory text, validate types after the response, and keep a fallback path if the JSON fails validation. Another useful approach is to test with real prompts and data instead of a perfect example. That is where schemas usually fall apart.
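A sketch of that validation layer, using the jsonschema package. The schema here is an illustrative example, and the regex deliberately tolerates chatter and markdown fences around the JSON.

```python
# Sketch: validate structured output after the response instead of trusting it.
# Uses the jsonschema package; the schema and field names are illustrative.
import json
import re
from jsonschema import validate, ValidationError

CARD_SCHEMA = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "approved": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["price", "approved"],
}

def parse_structured(text: str):
    # Strip surrounding chatter and markdown fences if the model added them.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None, "no JSON object found"
    try:
        data = json.loads(match.group(0))
        validate(instance=data, schema=CARD_SCHEMA)
        return data, None
    except (json.JSONDecodeError, ValidationError) as exc:
        return None, str(exc)   # caller keeps the raw text and falls back
```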
It helps to think of it this way: the SDK checks that a response arrived, while your parser checks that the response is fit for use. Those are two different layers of compatibility.
How to Check Compatibility Step by Step
When a team changes only base_url, the differences are usually hidden not in the SDK code, but in the server responses. The same request may return the same text, but a different chunk order, a different tool_calls format, or a different error code on retry.
It is better to test this on a small, repeatable setup. Do not change the model, prompt, temperature, or SDK version while testing. Otherwise you will be comparing several variables at once and quickly get lost.
A workable setup is simple. Take one real scenario: a plain text request, one streaming response, and one tool call. Send that set through the old and new base_url with the same headers, timeouts, and parameters. Save the raw HTTP responses: status, headers, body, event stream, and the time between chunks. For streaming, write each chunk to a file separately, then save the final assembled body. After that, sort the results into four groups: text, tool_calls, errors, and retry behavior.
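A sketch of the capture step, using httpx to talk to the raw chat completions endpoint. URLs, keys, the model name, and file paths are placeholders; the point is to write every raw SSE line with its timestamp before any parsing happens.

```python
# Sketch: capture raw SSE chunks from the old and new base_url for later comparison.
# httpx is assumed; URLs, keys, model, and file paths are placeholders.
import json
import time
import httpx

def capture_stream(base_url: str, api_key: str, out_path: str):
    payload = {
        "model": "gpt-4o-mini",   # keep model and parameters identical on both routes
        "messages": [{"role": "user", "content": "What are your opening hours?"}],
        "stream": True,
    }
    headers = {"Authorization": f"Bearer {api_key}"}
    started = time.monotonic()
    with httpx.Client(timeout=120) as client, open(out_path, "w") as out:
        with client.stream("POST", f"{base_url}/chat/completions",
                           json=payload, headers=headers) as resp:
            out.write(json.dumps({"status": resp.status_code,
                                  "headers": dict(resp.headers)}) + "\n")
            for line in resp.iter_lines():
                # One raw SSE line per record, with the time since the request started.
                out.write(json.dumps({"t": time.monotonic() - started, "line": line}) + "\n")

# capture_stream("https://old-provider.example/v1", "sk-old", "old.jsonl")
# capture_stream("https://new-gateway.example/v1", "sk-new", "new.jsonl")
```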
Application logs are usually too coarse for this. They often hide empty deltas, service fields, finish_reason, the difference between null and a missing field, and sometimes even the real error text. If the gateway is only "generally" compatible, that is where the differences will surface.
It helps to keep a simple comparison table. In one column, the old base_url; in the other, the new one. Compare time to first token, total time to last token, tool_calls structure, JSON in structured output, 4xx and 5xx codes, and behavior after 429 or a timeout. By that point, you can usually already see where client code will start to diverge from expectations.
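As one possible starting point, here is a sketch that turns two capture files from the snippet above into rows for that table. It assumes the JSONL format written there, keeps the SSE parsing deliberately simple, and uses placeholder file names.

```python
# Sketch: summarize two capture files into the comparison dimensions above.
# Assumes the JSONL format written by the capture snippet; SSE parsing is simplified.
import json

def summarize(path: str) -> dict:
    first_token_at = last_token_at = None
    finish_reasons, chunk_count = [], 0
    with open(path) as f:
        header = json.loads(f.readline())            # status and headers record
        for raw in f:
            rec = json.loads(raw)
            line = rec["line"]
            if not line.startswith("data:") or line.strip() == "data: [DONE]":
                continue
            chunk_count += 1
            event = json.loads(line[len("data:"):])
            if first_token_at is None:
                first_token_at = rec["t"]
            last_token_at = rec["t"]
            for choice in event.get("choices", []):
                if choice.get("finish_reason"):
                    finish_reasons.append(choice["finish_reason"])
    return {
        "status": header["status"],
        "chunks": chunk_count,
        "time_to_first_token": first_token_at,
        "time_to_last_token": last_token_at,
        "finish_reasons": finish_reasons,
    }

# old, new = summarize("old.jsonl"), summarize("new.jsonl")
# print({key: (old[key], new[key]) for key in old})
```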
If a team is moving a service to AI Router, it is better to do this test on a real working route rather than an abstract "hello world." Take a request from the production support chat or an internal copilot, remove personal data, and run it dozens of times. That way you will see not only whether a response came back, but whether the service can reliably survive streaming, tool_call, and request retries.
If you find differences, do not try to fix everything at once. First get identical behavior for non-streaming text, then for streaming, and only after that enable tools and schema-based responses. That order usually saves hours of debugging.
A Simple Migration Example in a Live Service
A support team moves its chat assistant to a new OpenAI-compatible gateway, for example AI Router. In the code, they change only base_url, keep the same SDK, the same prompts, and the same operator interface. In the demo, everything looks calm: ordinary text answers come back right away, with no surprises.
Problems do not start with a simple question like "what are your hours?" They start with real traffic. That is where it becomes clear what compatibility really means.
Streaming breaks first. In the interface, the operator sees the answer flow token by token, but after the migration the stream sometimes drops pieces of a sentence: the customer gets the beginning, then a sudden jump to the end. The reason is often not the model, but the way the gateway and SDK assemble stream events, delta fields, and end markers.
Next comes the CRM call. The assistant should look up the customer record once by phone number, but after a retry the same tool goes out again. As a result, the CRM creates a duplicate note or applies the same tag twice. If the team has not made the tool idempotent, the mistake becomes expensive quickly.
Then the most painful part breaks: the structured response. The chat should return JSON for the customer card: name, contract status, last contact, reason for the request. The model technically answers in JSON, but one time it returns the date as a string, another time it returns null where the validator expects an array. On the operator screen, that looks like "the card failed to load," even though the text response seems normal.
This is usually fixed not with a big refactor, but with a few precise checks: compare the raw stream events before and after the migration, prevent a tool from running again without a unique request_id, validate JSON before sending it to the interface, and log not only the answer text but also the tool_call body.
After that, the team leaves simple text responses on the shared route and adds separate tests for CRM and JSON using real support scenarios. One perfect question proves almost nothing.
A good migration result looks boring. The operator does not notice the gateway change, the CRM gets no duplicates, and the customer card passes validation every time. That is exactly what should be checked first, even if changing base_url took five minutes.
Mistakes Teams Make Most Often
The most common mistake is simple: the team changes base_url and expects the SDK to smooth out all differences on its own. On a demo, that often works. In a live service, protocol details, event formats, and model behavior show up that the SDK does not hide.
The second mistake appears in almost every project: tests are run only in normal, non-streaming mode. The full response arrived, the JSON parsed, so everything must be fine. But in production, the product lives on streaming, and there the order of chunks, finish_reason, partial tool_calls, and even empty events can differ.
The third mistake is quieter, but more expensive. The code still contains options understood by only one provider: a custom response_schema format, a nonstandard tool_choice parameter, a special flag for reasoning, or seed. Through one gateway, such a request may work for some models and break for others.
The fourth problem is parsing. Teams often write one universal parser and run every model through it. That is convenient until the first mismatch: one model puts tool arguments in a string, another in an object, and a third returns JSON that is technically valid but does not match the schema in field types.
A few simple measures help: run the same scenario with and without streaming, test tool_calls separately across multiple models, log raw events before parsing, keep an allowlist of supported parameters, and validate JSON with a schema instead of by eye.
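The allowlist in particular is cheap to add. A sketch of the idea, where the allowed set is an example rather than a definitive list:

```python
# Sketch: keep an allowlist of request parameters and drop provider-specific extras.
# The allowed set is an example; tailor it to what your route actually supports.
ALLOWED_PARAMS = {
    "model", "messages", "temperature", "max_tokens",
    "stream", "tools", "tool_choice", "response_format",
}

def build_request(**params) -> dict:
    dropped = set(params) - ALLOWED_PARAMS
    if dropped:
        # Log instead of silently swallowing: this is exactly where differences hide.
        print(f"dropping unsupported params: {sorted(dropped)}")
    return {k: v for k, v in params.items() if k in ALLOWED_PARAMS}
```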
Another common miss: timeouts, retries, and rate limits are left unchanged. After changing the route, not only the model response changes, but also the latency profile. If the service used to wait 15 seconds and make two attempts, that may no longer be enough for a long stream or a tool call.
A good rule is boring but effective: only call something compatible if it passed your test set on your model, in your mode, with your limits. Everything else should be treated as a hypothesis.
Quick Checks Before Launch
If the service already answers through the new gateway, that still does not mean the behavior matches. Compatibility usually breaks on small things: a different request parameter, an extra retry, or a parser that expects perfect JSON and fails on the first empty chunk.
Before launching in production, it is worth checking not just the code, but the actual traffic. Especially if the team wants to keep the same SDK and prompts without rewriting them.
Before release, it helps to check a few things manually. Compare the model, temperature, and max_tokens in the old and new calls. It is a basic mistake, but a common one: base_url was changed, but the model name or token limit was left over from the previous provider. Check headers, timeouts, and retry behavior. The same SDK may behave differently with 30 seconds versus 120 seconds of waiting, and a repeated request may produce a different answer even with the same prompt. Save raw requests and responses at least during testing. Without that, the team usually argues about symptoms instead of comparing JSON field by field. And always run the tool schema on real data, not on the example from the documentation. User input quickly exposes mismatches in types, enum, and required fields.
Identical settings matter more than people think. If the old provider used temperature 0.2 and the new SDK defaults to 1.0, you will get more than just a different writing style. The model may choose a different tool, fill fields differently, or even go outside the schema.
It is better to compare logs in pairs: the old answer next to the new one, the same input, the same request_id inside your service. Then you can see exactly where behavior diverges. Sometimes the problem is not the model at all, but a proxy stripping a header, the client closing the connection too early, or the library quietly enabling request retries.
In practice, a short set is often enough: one ordinary chat request, one streaming request, one tool call, and one structured output with a strict schema. If even one of those four scenarios is unstable, it is better to stop the release for a day than to fix it in production later.
What to Do Next
Build a small test suite and run it every time you change the gateway, the model, or the settings. Compatibility almost always breaks not on the "big" request, but on the small details the code has quietly become dependent on.
A minimal set usually looks like this: one regular request without streaming, one streaming request that checks all chunks and the final event, one tool call with real JSON in the arguments, one structured response with JSON Schema validation, and one retry after a timeout or connection drop. That is enough to catch most differences quickly.
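A runnable skeleton of the first three scenarios, written against the OpenAI Python SDK, might look like the sketch below. The base_url, key, model, and tool schema are placeholders; the structured-output and retry checks would follow the same pattern, reusing the validation and idempotency sketches above.

```python
# Sketch of a minimal compatibility suite against an OpenAI-compatible endpoint.
# base_url, api_key, MODEL, and the get_order_status tool schema are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://new-gateway.example/v1", api_key="...")
MODEL = "gpt-4o-mini"

def check_plain():
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "What are your opening hours?"}])
    assert r.choices[0].message.content
    assert r.choices[0].finish_reason == "stop"

def check_streaming():
    stream = client.chat.completions.create(
        model=MODEL, stream=True,
        messages=[{"role": "user", "content": "Explain the refund policy in detail."}])
    text, finish = "", None
    for chunk in stream:
        if not chunk.choices:
            continue
        text += chunk.choices[0].delta.content or ""
        finish = chunk.choices[0].finish_reason or finish
    assert text and finish is not None        # final event arrived and text was collected

def check_tool_call():
    tools = [{"type": "function", "function": {
        "name": "get_order_status",
        "parameters": {"type": "object",
                       "properties": {"order_id": {"type": "string"}},
                       "required": ["order_id"]}}}]
    r = client.chat.completions.create(
        model=MODEL, tools=tools,
        messages=[{"role": "user", "content": "What is the status of order 1045?"}])
    call = r.choices[0].message.tool_calls[0]
    json.loads(call.function.arguments)       # arguments must be complete, valid JSON

if __name__ == "__main__":
    for check in (check_plain, check_streaming, check_tool_call):
        check()
        print(f"{check.__name__}: ok")
```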
For critical functions, keep a fallback path. If the tool does not run, the service can temporarily switch to a normal answer. If the JSON does not pass the schema, the code can save the raw text and mark the task for reprocessing. It is not pretty, but it saves production when the new provider behaves slightly differently.
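A minimal sketch of the JSON fallback, assuming a validation helper like the parse_structured sketch earlier; the file-based reprocessing queue is illustrative.

```python
# Sketch: fall back when structured output fails validation.
# schema_check is a callable like the parse_structured sketch above;
# the file-based queue stands in for whatever reprocessing mechanism you use.
import json
import time

def card_with_fallback(raw_response: str, schema_check):
    data, error = schema_check(raw_response)
    if error is None:
        return data
    with open("reprocess_queue.jsonl", "a") as f:
        f.write(json.dumps({"ts": time.time(), "raw": raw_response, "error": error}) + "\n")
    return None   # caller shows a degraded card or a plain-text answer instead
```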
Also document the fields your code depends on. Teams often think they only use content, and then discover that finish_reason, tool_calls, role, usage, the response ID, or the field order in delta are actually part of the logic. Those dependencies are better described directly in the service contract and checked by test, not by team memory.
Before switching, check the gateway defaults. temperature, max_tokens, JSON mode, parallel tool_calls, retries, timeouts, and streaming format can differ even with the same SDK. One unnoticed default can change behavior more than the model itself.
If you are testing migration through AI Router, it is useful to run the same set of scenarios across several providers and across models hosted on your own GPU infrastructure. That makes it faster to see where the difference lies in the route itself and where it lies in the model. For teams in Kazakhstan, it is also a convenient way to test behavior through a single OpenAI-compatible endpoint, api.airouter.kz, without rewriting the SDK or client code.
The working order is simple: first fix the contract, then run the tests, then turn on the new route for a small portion of traffic. If one scenario fails, do not argue with the SDK. Either add a thin adapter for that difference, or remove the dependency on the questionable field before it breaks the service on a workday.