
Blog

Practical writing on LLM application architecture, model routing, cost optimization, and operating AI systems in production.

Latest posts

Warm model pool: how to calculate reserve for peak hours
Jun 14, 2025 · 7 min read
A warm model pool helps you get through peak hours without unnecessary cost. We show how to estimate GPU reserve, watch the queue, and avoid paying for idle capacity.
warm model pool · LLM peak load
Extracting Tables from PDFs: How to Build Clean Data
Jun 07, 2025 · 8 min read
Extracting tables from PDFs takes more than parsing: you also need line normalization, total checks, and manual review of ambiguous cases.
PDF table extraction · PDF table parsing
When a Small Model Is Better Than a Large One for Work Tasks
Jun 05, 2025 · 10 min read
We look at when a small model is better than a large one: classification, field extraction, cost, latency, errors, and a simple way to choose.
when a small model is better than a large one · LLM text classification
Criteria for Evaluating a Support Assistant in Manual Review
Jun 04, 2025 · 11 min read
Learn how to set criteria for manually reviewing a support assistant: accuracy, tone, usefulness, and safety without unnecessary complexity.
support assistant evaluation criteria · manual review of AI answers
Migration to Multiple AI Providers Without Service Downtime
Jun 03, 2025 · 9 min read
Migrating to multiple AI providers without downtime: stages, SDK compatibility checks, shadow launch, and response comparison before switching.
migration to multiple AI providers · OpenAI SDK compatibility
Planning-Based Agent or Scenario-Based Agent: How to Choose
May 27, 2025 · 7 min read
We break down when a planning-based agent or a scenario-based one is the better fit for support, search, and internal automation, without unnecessary theory.
scenario-based agent or planning-based agent · LLM agent for support
Source citations in assistant answers: how to build them
May 24, 2025 · 11 min read
Source citations in assistant answers help verify conclusions. Here we explain how to gather quotes by document, not by random text snippets.
source citations in assistant answers · document-based citations
A Single API for LLMs: When It Is Better Than Separate Integrations
May 23, 2025 · 11 min read
We compare a single API gateway for LLMs with separate integrations on cost, launch speed, access control, and team growth.
single API for LLMs · centralized AI platform
LLM Stream Cancellation: How to Stop Paying for Extra Tokens
May 23, 2025 · 9 min read
LLM stream cancellation stops you from paying for extra tokens when a user leaves the page. We look at signals, timeouts, logs, and checks.
LLM stream cancellation · extra tokens
LLM Gateway Metrics in Production: A Short Daily Set
May 23, 2025 · 10 min read
LLM gateway metrics help you see quality, latency, errors, and costs every day. Here is a short set of numbers for making production decisions.
LLM gateway metrics · production LLM monitoring
Runbook for the On-Call Engineer on an LLM Service: First 15 Minutes
May 22, 2025 · 6 min read
A short runbook for the on-call engineer on an LLM service: how to check error spikes, cost, and latency in 15 minutes, prioritize the right issues, and avoid service disruption.
LLM service on-call runbook · LLM incident checklist
When You Don’t Need Fine-Tuning: Data, Prompt, or Routing
May 16, 2025 · 6 min read
When you don’t need fine-tuning: a practical guide to signs that clean data, a strong prompt, eval, and model routing will solve the task better.
when you don't need fine-tuning · prompt instead of fine-tuning
The Evolution of an Extracted Data Schema Without Analytics Chaos
May 14, 2025 · 11 min read
How to change fields, dictionaries, and versions without breaking old reports or making the numbers diverge.
evolution of extracted data schema · schema versioning
Human in the Loop: Confidence Thresholds Without Manual Hell
May 08, 2025 · 7 min read
Human-in-the-loop is not needed for every check: learn confidence thresholds, request types, and a simple escalation path to an operator.
human-in-the-loop · LLM confidence thresholds
Admission control for long prompts in an LLM service
May 07, 2025 · 8 min read
Admission control for long prompts helps keep an LLM service available under load. We will cover priorities, truncation, rejections, and quick checks.
admission control for long prompts · LLM request queues
AI Tasks Through a Queue: When to Move to an Async Pipeline
May 07, 2025 · 8 min read
We’ll look at when AI tasks through a queue work better than a web request, how to build an async pipeline, and where it lowers timeouts, cost, and failure risk.
AI tasks via queue · async pipeline for LLMs
Prompt Caching: When It Actually Lowers Your LLM Bill
May 05, 2025 · 9 min read
Prompt caching does not help in every case. We break down repeat-request thresholds, a savings formula, quality risks, and a quick way to check.
prompt caching · LLM request repetition
Peak load on LLM functions: how not to bring your product down
Apr 29, 2025 · 6 min read
Peak load on LLM functions should not take your product down. Learn when to use queues, simplify responses, and route traffic to lighter models.
peak load on LLM functions · LLM request queues
LLM Log Retention Periods: How to Separate Records by Class
Apr 27, 2025 · 6 min read
Let's break down LLM log retention periods: how to separate operational, debug, and audit records so you do not accumulate unnecessary data.
LLM log retention periods · LLM log audit
Enterprise LLM Pilot: Where to Start and How Not to Drag It Out
Apr 25, 2025 · 9 min read
An enterprise LLM pilot is easiest to start with one business pain point, a short four-week plan, basic data checks, and clear success metrics.
enterprise LLM pilot · launching LLM in a company
System Prompt or Short Rules: How to Reduce Drift
Apr 25, 2025 · 7 min read
System prompt or short rules: when to choose one long block and when to use a modular set to reduce drift and make review easier.
system prompt or short rules · instruction drift
When a Bot Should Hand the Conversation Over to an Operator Without Arguing
Apr 16, 2025 · 9 min read
We explain when a bot should hand a conversation over to an operator: risk signals, customer emotions, uncertainty in the answer, setup mistakes, and a quick check.
hand off a conversation to an operator · chatbot escalation
LLM API Retries: How Not to Double Your Bill When Failures Happen
Apr 13, 2025 · 10 min read
Retries for LLM APIs help you survive failures, but without limits and idempotency they can quickly drive costs up. We break down timeouts, delays, and checks.
LLM API retries · request idempotency
Prompt mistakes: 5 reasons your LLM bill is bloated
Apr 13, 2025 · 7 min read
Learn how extra instructions, repetitions, and long context increase token usage, and how to remove prompt mistakes without losing quality.
prompt mistakes · LLM request cost
SQL Agent Without Risk to the Production Database: Read-Only and Limits
Apr 12, 2025 · 11 min read
SQL agent without risk to the production database: how to set up read-only access, a SQL query allowlist, timeouts, and quick checks before launch.
safe SQL agent for production database · read-only database access
Model Quantization: Checks Before Moving to 8-bit and 4-bit
Apr 06, 2025 · 9 min read
Model quantization requires quality checks on your own dataset: choose the right metrics, find failures, and compare FP16, 8-bit, and 4-bit before release.
model quantization · FP16 vs 8-bit
Batch Inference or Online Calls for Nighttime Tasks
Apr 06, 2025 · 7 min read
Batch inference suits overnight processing, but it does not always beat online LLM calls. Let’s break down extraction, categorization, and draft generation.
batch inference · online LLM calls
Response validation before writing to CRM and ERP without failures
Apr 06, 2025 · 11 min read
Response validation helps catch a broken schema, wrong numbers, and bad links before writing to CRM or ERP and reduces manual corrections.
response validation · schema validation
RAG or Long Context: How to Choose a Search Setup
Apr 04, 2025 · 7 min read
RAG or long context: see how these approaches affect document search, cost, and latency so you can choose the right setup for your product.
RAG or long context · document search
LLM expense report for accounting and the CTO without manual reconciliations
Mar 31, 2025 · 10 min read
An LLM expense report brings tokens, models, and teams into one format so accounting and the CTO can reconcile the numbers without manual work.
LLM expense report · LLM token tracking
Migrating to an OpenAI-Compatible Endpoint Without Surprises
Mar 29, 2025 · 11 min read
Migrating to an OpenAI-compatible endpoint looks like a simple base_url swap, but it often breaks on SDKs, timeouts, streaming, and JSON responses.
migrating to an OpenAI-compatible endpoint · replacing OpenAI base_url
PDF Review by Page or Whole: What to Choose
Mar 24, 2025 · 6 min read
Page-by-page PDF checking works well for long files with mixed templates, while full parsing is better for stable documents and summary fields.
page-by-page PDF parsing · extracting requisites from PDF
Prompt Versioning for Releases Without Surprises
Mar 23, 2025 · 10 min read
Prompt versioning helps ship changes without breakage: we’ll cover repo structure, testing, rollback, and a team workflow.
prompt versioning · prompt repository
Shadow Traffic for a Model Migration Without Breaks or Surprises
Mar 15, 2025 · 10 min read
Shadow traffic for model migration helps compare answers, latency, and cost before launch. Learn how to measure differences and switch calmly.
shadow traffic for model migration · parallel LLM requests
Golden Set for LLMs: How to Keep It Without the Clutter
Mar 11, 2025 · 9 min read
A golden set for LLMs helps you check quality without chaos: how to choose cases, archive old examples, and keep rare complex requests.
golden set for LLMs · LLM quality evaluation
Model Routing: Why the First Setup Doesn’t Pay Off
Mar 10, 2025 · 6 min read
Model routing often does not pay off on the first try: teams introduce complex rules too early. Here is how to start with a small set of signals.
model routing · LLM request routing
Checking Links and Details After Email Generation
Mar 09, 2025 · 9 min read
Checking links and details after email generation helps catch broken URLs, IIN mistakes, and old contract numbers before the client sees them.
checking links and details after email generation · broken URLs in emails
JSON Schema Fallback: How to Switch Models Without Breaking Tool Mode
Mar 05, 2025 · 7 min read
JSON schema fallback matters when a backup model changes fields, types, or response format. We break down how to choose backup models, validators, and checks.
JSON schema fallback · backup LLM models
Synthetic Examples for Testing LLMs Before Production
Feb 28, 2025 · 6 min read
Synthetic examples help test LLMs when real data is scarce. Learn how to build test cases, write expected results, and catch failures before launch.
synthetic examples for LLM testing · LLM test cases
Online and Offline Quality Evaluation: When to Trust Which
Feb 26, 2025 · 7 min read
Online and offline quality evaluations answer different questions: clicks and conversions catch the effect in production, while labels and expert review surface mistakes earlier.
online and offline quality evaluation · clicks and conversions
Anonymizing Contracts and Medical Records for LLMs Without Losing Meaning
Feb 23, 2025 · 6 min read
Anonymizing contracts and medical records before sending them to an LLM requires precise rules: which fields to hide, what to keep, and how to avoid distorting legal or clinical meaning.
anonymizing contracts and medical records · sensitive fields in documents
LLM Routing for Production: How to Choose a Strategy
Feb 16, 2025 · 7 min read
For production LLM routing, choose based on one task set and on cost, latency, and quality metrics, not on broad benchmarks.
LLM routing for production · model routing
PII masking before calling the model: where and how to do it
Feb 05, 2025 · 10 min read
PII masking helps hide personal data before sending a request to an LLM. We show where to place redaction, how to measure meaning loss, and how to safely return fields.
PII masking · personal data redaction
When to Stop an AI Agent in Finance, Healthcare, and Law
Feb 05, 2025 · 7 min read
When to stop an AI agent: we look at risk signals in finance, healthcare, and law, and show where the agent should hand the task to a person.
when to stop an AI agent · handoff to a human
Assistant Personalization Without Extra Profile or Risk
Jan 27, 2025 · 10 min read
Assistant personalization works better when you store only the signals that change the answer: role, language, request goal, and fresh context.
assistant personalization · data minimization
How to Calculate an LLM Budget for Multiple Teams
Jan 27, 2025 · 7 min read
We’ll show how to break an LLM budget down by team, limits, and use cases so costs don’t grow after the pilot and the move to production.
LLM budget · LLM costs
Knowledge Base Search: Embeddings or a Generative Model?
Jan 27, 2025 · 11 min read
Knowledge base search can be built with embeddings or a generative model. Here we cover indexing, reranking, and answers with citations.
knowledge base search · embeddings
What to Store for Prompt Debugging Without Privacy Risk
Jan 27, 2025 · 10 min read
What to store for prompt debugging: how to separate raw requests, masked copies, and metrics without exposing personal data.
what to store for prompt debugging · data masking in LLMs
Token Usage Forecasting: How to Spot Overspending in Time
Jan 26, 2025 · 7 min read
Token usage forecasting helps you spot overspending early, set thresholds, catch model spikes, and avoid waiting for the invoice at month-end.
token usage forecast · LLM usage anomalies
SLOs for LLM Applications: How to Measure Against Business Goals
Jan 25, 2025 · 7 min read
SLOs for LLM applications help connect latency, valid response share, and cost to business expectations, not to charts made for reporting.
SLOs for LLM applications · LLM latency and quality
Contact center call summarization without noise
Jan 23, 2025 · 6 min read
Call summarization helps only when the call card shows the topic, outcome, risk, and next step without extra fields.
contact center call summarization · call card for supervisors
Source-based fact checking: how to build a test suite
Jan 18, 2025 · 10 min read
Source-based fact checking helps you build tests where the answer is compared against a document, table, or database. We’ll cover the test suite structure, common mistakes, and a checklist.
source-based fact checking · automated test suite
Model Evaluation on Your Own Data for Product Use Cases
Jan 16, 2025 · 9 min read
Evaluating models on your own data helps you choose the right LLM for product tasks: how to collect scenarios, gold answers, and metrics, and compare responses fairly.
model evaluation on your own data · user task scenarios
Red Teaming a Corporate Bot Before Launch
Jan 13, 2025 · 8 min read
Red teaming a corporate bot helps uncover data leaks, instruction bypasses, and toxic replies before release so you can fix them step by step.
red teaming a corporate bot · LLM data leak attacks
Deleting Data at a Provider: What to Ask Before Buying
Dec 30, 2024 · 8 min read
Data deletion at a provider should not be taken on trust. Before buying, ask for contract clauses, logs, cleanup timelines, and the audit process.
data deletion at a provider · data storage review
When Fine-Tuning Pays for Itself, and Prompting No Longer Does
Dec 25, 2024 · 9 min read
When fine-tuning pays off: we look at signs that a prompt has reached its limit, which tasks benefit most, how to estimate ROI, and common mistakes before launch.
when fine-tuning pays for itself · when a prompt hits its limit
AI-Powered Review of Credit and Legal Documents
Dec 24, 2024 · 10 min read
AI-powered review of credit and legal documents helps you spot risky clauses faster, but the final decision on the case still belongs to a specialist.
credit and legal document review · AI for contract review
Tool-Output Injection: How to Protect an Agent
Dec 24, 2024 · 11 min read
Tool-output injection often hides in CRMs, emails, and HTML. Learn how to filter data, isolate tools, and add checks.
tool-output injection · LLM agent protection
Different tokenizers across providers: why the numbers don't match
Dec 15, 2024 · 8 min read
Different tokenizers across providers change price, limits, and the real context length. Let's look at where the calculations diverge and how to check them in advance.
different tokenizers across providers · LLM token counting
LLM regressions: how to catch hidden drift before complaints
Dec 09, 2024 · 10 min read
LLM regressions are not always obvious right away. Here we break down daily runs, alerts, control cases, and the checks to do before users complain.
LLM regressions · LLM control cases
Scheduled switching between hosted and self-hosted models
Dec 08, 2024 · 6 min read
Switching between hosted and self-hosted models can reduce cost and latency if you separate use cases by time of day, data sensitivity, and load spikes.
switching between hosted and self-hosted models · external LLM API
Patient consent for LLMs in a clinic: what to record
Dec 04, 2024 · 6 min read
We explain how to document patient consent for LLM use in a clinic: what to record before summarization, triage, and chart-based answers.
patient consent for LLMs in a clinic · clinic triage
Normalizing LLM API Error Codes for Product and Support
Dec 03, 2024 · 11 min read
Error code normalization for LLM APIs helps collapse timeouts, limits, and bad requests into one dictionary for product, logs, and support.
LLM API error code normalization · unified error dictionary
A chain of models or one strong model: which works better where
Dec 02, 2024 · 9 min read
We break down when a chain of models or one strong model gives the better result: comparing price, latency, quality, and the risk of unnecessary complexity.
chain of models or one strong model · LLM pipeline
Two Answers to One Request: When Choice Beats a Single Answer
Nov 28, 2024 · 10 min read
We break down when two answers to one prompt help users choose tone, format, or action faster, and when that approach only creates confusion.
two answers to one query · AI answer alternatives
AI Agent State Storage: Redis, DB, or Event Log
Nov 26, 2024 · 6 min read
How AI agent state is stored affects pauses, approvals, and restarts. We look at when to choose Redis, a database, or an event log.
AI agent state storage · Redis for paused workflows
How to avoid overpaying for long context: what to cut and what to keep in memory
Nov 22, 2024 · 10 min read
How to avoid overpaying for long context: we break down chat history trimming, compression, and dialog memory choices to preserve meaning and reduce tokens.
how to avoid overpaying for long context · context compression
Sandbox for AI Tools: Write Access Without Extra Permissions
Nov 15, 2024 · 6 min read
A sandbox for AI tools helps isolate writes in CRM, databases, and documents so the agent changes only the needed fields and does not get extra access.
sandbox for AI tools · AI agent write permissions
Token Spike: How to Find the Cause Before the Bill After Release
Nov 08, 2024 · 9 min read
A token spike after a release is easy to miss. Learn how to check prompt length, call frequency, retries, and strange post-release behavior before the bill arrives.
token spike · prompt length
Query Cache Payback: Formula and Calculation Examples
Nov 02, 2024 · 9 min read
Query cache payback is easy to calculate with a simple formula. We show the repeat threshold for search, support, and email generation.
query cache payback · query caching formula
What to Log in an LLM App Without Unnecessary Risk
Nov 02, 2024 · 10 min read
Learn what to log in an LLM app to debug failures, track incidents, and pass audits without storing prompts, PII, or extra data.
what to log in an LLM app · minimal LLM log set
When a reranker pays off: recall, latency, and cost
Oct 31, 2024 · 7 min read
Let’s look at when a reranker pays off in search: how to measure recall gains, the impact on latency, request cost, and when the extra step is not worth it.
when a reranker pays off · reranker in search
AI Content Labeling in the Interface: Editor, CRM, Chat
Oct 30, 2024 · 9 min read
We show how AI content labeling works in the interface and where to place the label in an editor, CRM, and chat so it helps instead of getting in the way.
AI content labeling in the interface · AI label in the editor
Model access policies for single requests without unnecessary risks
Oct 11, 2024 · 7 min read
Model access policies help set rules by role, data, and environment so you can control costs and keep sensitive data from leaving your systems.
model access policies · restricting expensive models
Deduplicating Repeat Requests in Chats and Forms Without Hurting UX
Oct 07, 2024 · 10 min read
Deduplicating repeat requests helps remove double form submissions and duplicate chat messages, preserve UX, and avoid losing data during network and queue failures.
duplicate request deduplication · double form submissions
Customer Complaint Classification: How to Combine Rules and LLMs
Oct 06, 2024 · 7 min read
Customer complaint classification helps assign queues and SLAs faster when you combine simple rules, LLMs, confidence checks, and manual review.
customer complaint classification · request routing
LLM Service Load Testing: Peak, Queues, Bottlenecks
Sep 26, 2024 · 11 min read
Load testing an LLM service helps you find where queues grow, what breaks under peak load, and where the bottleneck sits in the API, network, and retries.
LLM service load testing · queues in LLM API
Tenant-based feature flags for AI features: launch plan
Sep 23, 2024 · 10 min read
Feature flags for AI features let you enable new models by tenant without a global release: launch plan, checks, failures, and an example.
feature flags for AI features · tenant-based model rollout
Prompt Library for the Team: Cards, Tags, Owners
Sep 17, 2024 · 7 min read
A prompt library helps a team keep working templates in one place: cards, tags, owners, examples, and an update routine.
prompt library · prompt card
Protecting RAG from Prompt Injection Through Documents in Practice
Sep 04, 2024 · 9 min read
Protect RAG from prompt injections: clean documents, limit tools, verify sources, and reduce the risk of false answers.
RAG prompt injection defense · RAG security
Enriching Product Listings with Small Models Without Extra Cost
Aug 30, 2024 · 6 min read
Product listing enrichment can be handled by a small local model when you need attributes, tags, and short descriptions without complex generation.
product listing enrichment · local model for attributes
Open-Weight Model: Choosing for the Internal Stack
Aug 24, 2024 · 6 min read
How to choose an internal LLM: compare open-weight models by size, languages, response format, and GPU needs on real-world tasks.
open-weight model · internal LLM
Choosing the Right Model Type for a Task on a Single Domain Dataset
Aug 16, 2024 · 9 min read
Choosing the right model type is easier when you run one domain dataset through summarization, extraction, classification, and chat, then compare the metrics.
choosing the right model type for a task · summarization vs extraction comparison
Moderating Outgoing Replies: Where to Place Filters and a Second Model
Aug 16, 2024 · 9 min read
Outgoing response moderation helps prevent risky text from slipping into chat, email, and CRM. We will look at where to place rules, filters, and a second model call.
outgoing response moderation · LLM filters
Normalizing Dates, Currencies, and Numbers After LLMs Without Confusion
Aug 14, 2024 · 9 min read
Normalizing dates, currencies, and numbers helps bring LLM outputs into one format by removing inconsistency in dates, amounts, separators, and currency codes.
date, currency, and number normalization · date formatting after LLMs
Session Context and User Profile: How to Separate Them
Aug 13, 2024 · 11 min read
Session context and user profile should be stored separately so the assistant does not mix one-time details, preferences, history, and personal data.
session context and user profile · assistant memory
Vendor lock-in: leaving without refactoring
Aug 07, 2024 · 11 min read
Learn how to reduce dependence on a single vendor with an abstraction layer, compatibility tests, and step-by-step migration without a major refactor.
single-vendor lock-in · LLM abstraction layer
LLM Context Trimming Without Losing Meaning: Windows and Summaries
Aug 06, 2024 · 11 min read
LLM context trimming helps keep a conversation within the token limit. We’ll cover context windows, priorities, conversation summaries, and quick checks.
LLM context trimming · context window
Embedding Dimensionality: Where Search Breaks and Code Breaks
Aug 04, 2024 · 7 min read
Embedding dimensionality affects search, indexes, and storage schemas. We show where code breaks, where quality drops, and how to migrate safely.
embedding dimensionality · vector search
Domain Search Glossary: Often More Useful Than the Model
Aug 03, 2024 · 8 min read
A domain search glossary helps the system understand company terms, synonyms, and codes. Often it brings more accuracy than switching models.
domain search glossary · corporate terminology dictionary
Timeouts in an LLM Chain: How to Split the Time Budget
Jul 29, 2024 · 7 min read
Timeouts in an LLM chain affect the answer just as much as model choice. We’ll show how to split a shared SLA between the gateway, search, tools, and the model.
LLM chain timeouts · LLM latency budget
LLM Cost in Tenge: How to Build an Annual Budget
Jul 25, 2024 · 10 min read
We show how to calculate the cost of LLMs in tenge for a year: tokens, exchange rates, traffic spikes, a test buffer, and a clear budget for the team.
LLM cost in tenge · annual LLM budget
Hedged Requests to Two Models: When p95 Drops
Jul 23, 2024 · 7 min read
Hedged requests to two models can remove rare slow responses, but sometimes they only double costs. Let’s break down thresholds, metrics, and mistakes.
hedged requests to two models · reducing p95
Draft and Action in the AI Workflow: How to Set a Barrier
Jul 17, 2024 · 6 min read
Separating draft from action in the AI workflow prevents a model from immediately changing a ticket status, limit, or record. Let’s break down the rule, steps, and checks.
draft and action in the AI workflow · separating draft from action
External LLM provider outage: a day-of action plan
Jul 16, 2024 · 10 min read
External LLM provider outage: a step-by-step day-of guide for switching routing, adding limits, simplifying features, and coordinating teams.
external LLM provider outage · model routing
Who Can Change Prompts in Production: A Practical Framework
Jul 15, 2024 · 9 min read
Who should be allowed to change prompts in production? Let’s break down roles, review, change logs, and rollback so the team does not rely on private agreements.
who can change prompts in production · prompt ownership
Judge Model for Auto-Evaluation: Where to Trust and Where to Check
Jul 06, 2024 · 11 min read
Judge models for auto-evaluation help you check answers quickly, but not everywhere. Here is how to use a rubric, manual sampling, and signs of systematic errors.
judge model for auto-evaluation · LLM evaluation rubric
Pre-release evaluation pipeline: from golden set to regressions
Jul 03, 2024 · 9 min read
A pre-release evaluation pipeline helps catch regressions before launch: how to build a golden set, choose metrics, and create a report people can read in 10 minutes.
pre-release evaluation pipeline · golden set for LLM