Your AI runs inside Kazakhstan
We don't rent a slice of someone else's cloud. AI Router operates its own GPU fleet inside Kazakhstani data centers: Blackwell for frontier models, Hopper for production, L40S for high-throughput 7–32B inference. Every enterprise customer gets an isolated GPU pool — your data and weights never mix with anyone else's.
GPU tiers
3 generations
NVIDIA architectures
Blackwell · Hopper · Ada
Data residency
Kazakhstan
In-country data centers
Production, DR, logs, backups, billing — every byte physically in Kazakhstan. No cross-border transit, no offshore replicas.
Almaty
Kazakhstan
- Power: 2N power · N+1 cooling
- Network: Dual-uplink 100 GbE · BGP multi-homed
- Compliance: Uptime Institute certified · AI Law RK
Astana
Kazakhstan
- Power: 2N+1 power · free-cooling chillers
- Network: Dark fiber · < 25 ms to Almaty
- Compliance: Real-time replication · daily backups
Three inference tiers. Same API.
From trillion-parameter frontier models to cost-efficient 8B fleets — we run the whole stack in-country. You pick the tier that matches your latency, volume, and SLA.
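Switching tiers never changes the client code: only the model ID differs. A minimal sketch of the idea, assuming an OpenAI-style request schema; the base URL and model IDs below are illustrative placeholders, not documented identifiers:

```python
# Hypothetical endpoint -- illustrative only, not a documented URL.
BASE_URL = "https://api.airouter.kz/v1/chat/completions"

def chat_request(model: str, prompt: str) -> dict:
    """Build the same OpenAI-style request body for any tier."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

# The tier is just the model field -- endpoint and schema stay identical.
frontier   = chat_request("deepseek-v3.2-685b", "Summarize this contract.")  # Blackwell
production = chat_request("llama-4-maverick", "Summarize this contract.")    # Hopper
fleet      = chat_request("qwen3-8b", "Summarize this contract.")            # Ada L40S
```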
NVIDIA Blackwell
B200 · GB200 NVL72
Trillion-parameter inference
The 2026 flagship. Dual-die architecture, 192 GB HBM3e, native FP4, and a second-gen Transformer Engine. Up to 4× faster LLM inference than H100, and 30× on trillion-parameter models in an NVL72 rack. Liquid-cooled, one 72-GPU NVLink domain.
Key specs
- 192 GB HBM3e · 8 TB/s
- 20 PFLOPS FP4 · 10 PFLOPS FP8
- NVLink 5 · 1.8 TB/s
- TEE-I/O · confidential compute
Typical workloads
- GPT-OSS 120B · Llama 4 Behemoth
- DeepSeek V3.2 685B · Qwen 3 235B
- Custom 400B+ models in FP4
Deployment
Dedicated 8-GPU node or NVL72 rack slice · liquid-cooled
NVIDIA Hopper
H200 NVL · H100
Production workhorse
The battle-tested Hopper platform: 141 GB HBM3e, 4.8 TB/s bandwidth, and up to 2× the inference throughput of H100 on Llama-class models. Air-cooled, so it deploys in any rack. The price/performance sweet spot for 30–120B models.
Key specs
- 141 GB HBM3e · 4.8 TB/s
- 3.96 PFLOPS FP8
- NVLink 4 · 900 GB/s
- Transformer Engine FP8
Typical workloads
- Llama 4 Maverick · Mistral Large 3
- Claude-class · GPT-class 30–120B
- Long-context RAG · agents
Deployment
Dedicated 4-GPU or 8-GPU node with NVLink · air-cooled
NVIDIA Ada Lovelace
L40S
High-throughput small models
The most cost-efficient tier per token for 7–32B models. 48 GB memory, 4th-gen Tensor Cores with FP8 via Transformer Engine. Ideal for high-QPS chat fleets, embedding pipelines, and multimodal pre-processors.
Key specs
- 48 GB GDDR6 · 864 GB/s
- 1.47 PFLOPS FP8
- Transformer Engine FP8
- Air-cooled · 350 W
Typical workloads
- Llama 4 Scout · Qwen 3 8B/32B
- Gemma 3 12B/27B · Phi-5
- Embeddings · reranking · chat
Deployment
2-GPU and 4-GPU nodes · PCIe Gen4 · standard rack
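A rough way to sanity-check the "best model size" guidance above: the weights alone at a given precision must fit a node's aggregate memory with headroom left for KV cache. A back-of-envelope sketch (model sizes taken from the tier cards; the sizing rule is a generic approximation, not our capacity planner):

```python
def weight_gib(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GiB: params x (bits/8) bytes.
    Ignores KV cache, activations, and framework overhead."""
    return params_b * 1e9 * bits / 8 / 2**30

# Aggregate memory per node, from the tier specs above
B200_NODE_GB = 8 * 192   # 1,536 GB HBM3e (8-GPU node)
H200_NODE_GB = 8 * 141   # 1,128 GB HBM3e (8-GPU node)
L40S_NODE_GB = 4 * 48    #   192 GB GDDR6 (4-GPU node)

deepseek_fp4 = weight_gib(685, 4)  # ~319 GiB -> Blackwell node, ample KV-cache headroom
llama70_fp8  = weight_gib(70, 8)   # ~65 GiB  -> comfortable on a Hopper node
qwen32_fp8   = weight_gib(32, 8)   # ~30 GiB  -> fits a single L40S pair
```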
Tier-by-tier comparison
Numbers below are steady-state figures on dedicated 8-GPU nodes with typical production batching. Your numbers depend on model, context length, and batch size — we always benchmark your exact workload before you commit.
| Specification | Blackwell B200 | Hopper H200 | Ada L40S |
|---|---|---|---|
| GPU memory | 192 GB HBM3e | 141 GB HBM3e | 48 GB GDDR6 |
| Memory bandwidth | 8.0 TB/s | 4.8 TB/s | 864 GB/s |
| Peak FP8 | 10 PFLOPS | 3.96 PFLOPS | 1.47 PFLOPS |
| Peak FP4 | 20 PFLOPS | — | — |
| Interconnect | NVLink 5 · 1.8 TB/s | NVLink 4 · 900 GB/s | PCIe Gen4 · 64 GB/s |
| TDP / cooling | 1000 W · liquid | 700 W · air | 350 W · air |
| Best model size | 70B–1T+ | 30B–120B | 7B–32B |
| Tokens/sec · 70B FP4/FP8 | ~8,000 (FP4) | ~2,000 (FP8) | — |
| Tokens/sec · 13B FP8 | ~24,000 | ~9,000 | ~3,200 |
| Concurrent streams · 70B | 64–128 | 32–48 | — |
| Chat RPS (p95 < 500 ms) | 40–80 | 20–30 | 30–60 |
| Time to first token (70B, p50) | ~180 ms | ~240 ms | — |
| Confidential compute (TEE-I/O) | Yes | — | — |
| Inference cost · 70B class | from $0.12 / $0.36 per 1M | from $0.20 / $0.60 per 1M | — |
| Inference cost · 8–13B class | — | from $0.10 / $0.30 per 1M | from $0.05 / $0.15 per 1M |
Prices shown as input/output per 1M tokens for reserved dedicated capacity. Proxied models from third-party providers are billed at their list price with zero markup from us — see the pricing page.
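To turn those rates into a budget, multiply monthly token volume by the input and output prices. A worked example using the 70B-class rates from the table (the traffic figures themselves are made up):

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 price_in: float, price_out: float) -> float:
    """USD per month, given token volumes in millions and prices per 1M tokens."""
    return input_mtok * price_in + output_mtok * price_out

# Hypothetical workload: 2,000M input + 500M output tokens/month, 70B class
blackwell = monthly_cost(2000, 500, 0.12, 0.36)  # 240 + 180 = 420 USD
hopper    = monthly_cost(2000, 500, 0.20, 0.60)  # 400 + 300 = 700 USD
```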
Dedicated hardware. Zero multi-tenancy.
Enterprise customers get a physically isolated GPU pool — not a slice of a shared inference API. Your weights, KV cache, logs, and metrics live only on hardware assigned to your tenant.
Physical GPU isolation
Named GPUs and nodes assigned to your tenant. No shared inference queues. No noisy-neighbor latency spikes.
TEE-I/O on Blackwell
Trusted Execution Environment I/O encrypts weights and prompts with near-zero throughput penalty. Built for regulated workloads — finance, healthcare, government.
Weights stay on your node
Your fine-tunes, LoRA adapters, and KV caches never leave the GPUs assigned to your tenant. No cross-tenant cache pooling.
Per-tenant VLAN · private endpoints
Optional per-tenant VLAN isolation, private endpoints, and IP allowlisting. Traffic never crosses tenant boundaries inside the rack.
Tenant KMS envelope
Disk encryption keys, session tokens, and API key material are envelope-encrypted per tenant in our HSM-backed KMS.
Per-tenant audit trail
Immutable per-tenant logs. SIEM export via webhook or S3. Retention policy configured to your regulator's requirements.
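If you consume the audit webhook, verify each delivery before forwarding it to your SIEM. Below is a generic HMAC-SHA256 check of the kind commonly used for webhook authentication; the secret handling and payload shape are illustrative assumptions, not a documented contract:

```python
import hashlib
import hmac
import json

def verify_and_parse(body: bytes, signature_hex: str, secret: bytes) -> dict:
    """Reject the event unless its HMAC-SHA256 signature matches.
    Payload shape below is illustrative, not a documented schema."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("signature mismatch: drop the event")
    return json.loads(body)
```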
Data sovereignty, end to end
Your data never leaves Kazakhstan. Not for training, not for logging, not for billing reconciliation.
All infrastructure in-country
Production, DR, logs, metrics, backups, API gateway — every byte physically in Kazakh data centers.
AI Law RK compliant
Per-request regulatory-context labels. Named data controller in Kazakhstan. DPA with every enterprise customer.
Billing in local currency
Microdollar accounting with KZT invoicing. Bank transfer, VAT-compliant invoices, no cross-border payment flows.
Support on your timezone
Named SRE on-call in Almaty. 15-minute P1 response. Russian, Kazakh, and English support.
Reserve dedicated GPUs for your workload
We benchmark your exact model and traffic pattern on each tier — then reserve the right mix for your SLA and budget.