Audit Logs for LLMs: What Banks and the Public Sector Should Store
LLM audit logs help banks and public agencies investigate incidents: what to put in each event, how long to keep records, and who should have access.

Why chaos starts without logs
The problem rarely starts with the model itself. Chaos begins the moment a bad answer has already broken a process and the team cannot quickly reconstruct the chain of events. For a bank or public agency, that is especially painful: one strange answer affects the customer, the operator, internal review, and reporting to security.
Without proper audit logs for LLMs, people quickly fall back on memory and guesses. One engineer says the failure was noticed at 10:05, an analyst is sure the complaint came later, and the product owner looks at a different dashboard and gives a third time. At this point an hour is already lost, even though what you need first is one simple fact: who saw the problem first and when it happened.
Then the argument usually gets worse. Someone blames the model, someone else blames the prompt, a third person blames the limit, and sometimes the real cause is the call rule itself. If the log does not contain the exact request that triggered the problem, the team checks the wrong scenario and gets a false 'everything works.'
If requests go through a gateway like AI Router, the same call can reach different models or different providers depending on routing rules. That is why the event needs not only the request text, but the whole path of the response as well. Otherwise it is unclear whether this is a failure in a specific model, a routing error, or a settings change.
One event log should answer several questions right away:
- who noticed the failure first;
- the exact time and request identifier;
- which model and which route produced the response;
- who changed the prompt, limit, or call rule before the incident;
- what response the system returned externally.
Imagine a typical case. In a banking chat, the bot suddenly started answering too long and exposed an internal template. If there is no change history, the team argues whether the new system prompt or the higher token limit is to blame. If there is no route, everyone looks at model A, even though the answer came from model B through another provider. If there is no first-signal time, it is hard to connect the incident to a specific release.
People argue not because they are bad at their jobs. They simply do not have shared facts. A good log removes guesswork and quickly shows what happened, in what order, and who changed the system behavior before the failure.
What to include in one event
One event should answer a simple question: who sent the request, when, from where, and through which model; what came back; and how much it cost. If that is missing, the incident turns into a search through chats, screenshots, and other people's memory.
Start with time. Record the exact timestamp down to milliseconds and keep the time zone. Better yet, store two values: UTC and the system's local time. For banks and the public sector, that is not a minor detail. When logs from the API, queue, and application live in different zones, records are almost impossible to order without an explicit time zone.
Then you need identifiers. An event usually has a request_id, and often also a session_id and a user or service ID. These fields connect the LLM call to the operator's screen, a batch job, or an internal microservice. If user_id cannot be stored in plain form, keep a stable pseudonym or hash. That lets you see the chain of actions without revealing the person.
What the record should contain
A minimal record usually includes:
- call time, time zone, environment, and region;
- who initiated the request: user, service, channel, and access key;
- where the request went: model, provider, route, version of the system prompt or template;
- how it ended: status, error code, duration, number of input and output tokens, cost;
- how to find neighboring events: trace_id, correlation_id, or another shared chain identifier.
A separate block is needed for the processing route. If the team uses a gateway like AI Router, the model name alone is not enough. The same OpenAI-compatible call can go to different providers under different rules. So the log should store not only model_name, but also provider, route_id, and the version of the routing policy. Otherwise you will see gpt-4.1 but not understand why latency was 900 ms yesterday and 4 seconds today.
The end of the event matters too. Success without metrics says very little. The log should show the HTTP status or internal status, error code, step-by-step duration, tokens, and price. If the request failed after a retry, that should be noted too. Then, during the review, it becomes clear whether the problem was in the model, the provider, the access limit, or the application itself.
A good record can be read in 20 seconds and answers three questions: who called the model, which route the request took, and why the result was exactly what it was.
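As a rough sketch, such a record can be modeled as a typed event. The field names below (request_id, route_id, prompt_version, and so on) are illustrative, not a fixed standard; adapt them to your own schema and storage.

```python
# A rough sketch of one audit event as a typed record. Field names are
# illustrative, not a fixed standard.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LLMAuditEvent:
    # identifiers: connect the call to the screen, job, or service that made it
    request_id: str
    timestamp_utc: datetime          # exact time in UTC, with milliseconds
    local_tz: str                    # e.g. "Asia/Almaty", so records can be ordered
    environment: str                 # "prod", "pilot", "test" -- never mixed in one log
    region: str

    # who initiated the request
    caller_id: str                   # stable pseudonym or hash, not the raw identity
    channel: str                     # "mobile_chat", "operator_ui", "batch_job"
    access_key_id: str

    # where the request went
    model_name: str                  # e.g. "gpt-4.1"
    provider: str
    route_id: str                    # which routing rule the gateway applied
    routing_policy_version: str
    prompt_version: str

    # how it ended
    status: str                      # "ok", "error", "timeout"
    latency_ms: int
    input_tokens: int
    output_tokens: int
    cost: float

    # optional links and flags
    error_code: Optional[str] = None
    session_id: Optional[str] = None
    trace_id: Optional[str] = None   # shared chain identifier for neighboring events
    retried: bool = False            # whether the result came after a retry
```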
What to write and what to mask immediately
Do not put everything into LLM audit logs. If a full card number, IIN, or phone number ends up in the log, the log itself becomes a source of risk. Then the incident has to be investigated not only through the model, but also through a leak from the log itself.
The working rule is simple: keep everything needed for search, comparison, and investigation; mask anything that directly identifies a person at the moment of recording. Not a day later, and not in a separate task, but immediately in the same flow.
What to keep in the record
A good record separates the request text from service metadata. Metadata is needed almost always: time, request ID, service, model, prompt version, route, response status, token count, error code, who called the API, and which policy was triggered.
For sensitive fields, it is better to store a safe form instead of the original value:
- a visual mask, for example 4403****1298;
- a hash for exact matching;
- an internal customer ID instead of full name and contacts;
- a flag showing that PII was found in the request;
- the data class of the record: public, internal, personal, banking secrecy.
This setup makes investigations much easier. The team can see that the request came from the mobile channel, went to a specific model, timed out, and contained personal data, but it does not read someone else's phone number in full.
What to hide right away
The full prompt and response are better treated as a separate storage layer. Most roles do not need them. Developers usually only need metadata, errors, tokens, and a masked version of the text.
If a user wrote: 'My IIN is 123456789012, phone 87771234567, reissue card 4403...', the log is better off keeping the meaning and context while hiding direct identifiers. For example: 'Request to reissue card; IIN, phone number, and card number found.' That is already enough to understand the type of action and the course of the incident.
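A minimal sketch of masking at the moment of recording might look like the code below. The regex patterns and the sample card number are illustrative only; a real deployment would use stricter detectors tuned to its own formats.

```python
import hashlib
import re

# Illustrative patterns only: real detectors should be tuned to local formats
# (IIN length, phone prefixes, card BIN ranges, and so on).
CARD_RE = re.compile(r"\b(\d{4})\d{8,11}(\d{4})\b")
IIN_RE = re.compile(r"\b\d{12}\b")
PHONE_RE = re.compile(r"\b8\d{10}\b")

def pseudonymize(value: str, salt: str) -> str:
    """Stable pseudonym for exact matching without storing the raw value."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def mask_for_log(text: str) -> tuple[str, dict]:
    """Mask direct identifiers in the same flow that writes the event."""
    flags = {"card_found": False, "iin_found": False, "phone_found": False}

    def mask_card(m: re.Match) -> str:
        flags["card_found"] = True
        return f"{m.group(1)}****{m.group(2)}"   # visual mask, e.g. 4403****1298

    masked = CARD_RE.sub(mask_card, text)
    if IIN_RE.search(masked):
        flags["iin_found"] = True
        masked = IIN_RE.sub("[IIN]", masked)
    if PHONE_RE.search(masked):
        flags["phone_found"] = True
        masked = PHONE_RE.sub("[PHONE]", masked)
    return masked, flags

# Example with made-up values:
masked, flags = mask_for_log(
    "My IIN is 123456789012, phone 87771234567, reissue card 4403000011112222"
)
# masked -> "My IIN is [IIN], phone [PHONE], reissue card 4403****2222"
# flags  -> {"card_found": True, "iin_found": True, "phone_found": True}
```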
For each record, set a data class label. Without it, the log quickly turns into a pile where a harmless technical call and a sensitive customer conversation are exposed in the same way. Give full-text access only to a narrow group: security, compliance, the service owner, and the on-call team during the review.
If the platform supports PII masking, audit logs, and key-level restrictions, it is best to turn that on before production. Fixing logging habits later is much harder. This is where a single gateway also helps: with AI Router, such mechanisms can be built into one common layer instead of being assembled piece by piece in every service.
How to choose retention periods
One retention period for all logs almost always creates extra risk. For LLM audit logs, it is better to separate two groups from the start: full requests and responses, and call metadata. Full text is needed less often, but it carries more risk. Metadata is usually safer and often more useful for review.
Banks and the public sector rarely need a permanent archive of raw payloads. If the prompt contains personal data, internal information, or a piece of a customer contract, long retention becomes a problem on its own. The retention period should be tied to three things: incident risk, complaint handling time, and the length of internal review.
In practice, it is convenient to split retention like this:
- keep full requests and responses for a short period, for example 30-90 days;
- keep masked copies longer if the team reviews errors without access to sensitive data;
- keep event metadata much longer, often 1-3 years, if internal control requires it;
- keep anonymized summaries of load and limits even longer, if they contain no request text.
This approach reduces risk without breaking incident reviews. When a complaint about a model response comes in two months later, the team often only needs the call time, model, prompt version, user, review status, triggered rules, and request identifier. The full payload is not needed nearly as often.
If you have opened an incident, deletion should be stopped immediately. Otherwise some traces will disappear under the normal TTL, and the investigation will again depend on memory. Freezing should work by request_id, user, time period, or the service involved in the incident.
The deletion rule should live next to the event schema. Each field should have its own status: do not write, write masked, keep for 30 days, keep for a year. When that is collected in one place, the team does not argue at every review about why the response text was deleted but the request route was kept.
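One way to keep the deletion rule next to the schema is a per-field policy plus a freeze check, as in the sketch below. The statuses, periods, and field names are illustrative; align them with your own incident-risk, complaint-handling, and review timelines.

```python
from datetime import datetime, timedelta, timezone

# Per-field retention policy living next to the event schema (illustrative).
FIELD_POLICY = {
    "raw_prompt":    {"write": "never"},                   # do not write at all
    "raw_response":  {"write": "full",   "keep_days": 30},
    "masked_prompt": {"write": "masked", "keep_days": 90},
    "caller_id":     {"write": "hashed", "keep_days": 365},
    "request_id":    {"write": "full",   "keep_days": 3 * 365},
    "model_name":    {"write": "full",   "keep_days": 3 * 365},
    "route_id":      {"write": "full",   "keep_days": 3 * 365},
    "cost":          {"write": "full",   "keep_days": 3 * 365},
}

# Freezes opened for incidents: scheduled deletion must skip anything they cover.
LEGAL_HOLDS = [
    {"request_ids": {"req-123"},
     "from": datetime(2025, 1, 10, tzinfo=timezone.utc),
     "to": datetime(2025, 2, 10, tzinfo=timezone.utc)},
]

def is_frozen(event: dict) -> bool:
    """True if the event is covered by an open incident freeze."""
    for hold in LEGAL_HOLDS:
        if event["request_id"] in hold["request_ids"]:
            return True
        if hold["from"] <= event["timestamp_utc"] <= hold["to"]:
            return True
    return False

def expired_fields(event: dict, now: datetime) -> list[str]:
    """Fields whose retention period has passed, unless the event is frozen."""
    if is_frozen(event):
        return []
    gone = []
    for name, rule in FIELD_POLICY.items():
        keep = rule.get("keep_days")
        if keep is not None and name in event:
            if now - event["timestamp_utc"] > timedelta(days=keep):
                gone.append(name)
    return gone
```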
Test the archive in practice. Pull up an old log set and time the search. If an engineer has to hunt for the needed event manually across several systems, retention is not the main problem anymore. If you have an OpenAI-compatible gateway like AI Router, it is useful to make sure the archive is equally easy to assemble for your own services and for requests sent to an external provider.
Who should get access
Access to logs should not be handed out on the basis of 'who might need it.' In a bank or the public sector, that usually ends with unnecessary viewing, disputes about who saw what, and a weak incident review.
It is better to split rights by task, not by job title. The on-call team usually only needs metadata: request time, session ID, model, route, error code, masked fields, and technical status. That is enough to understand where the flow broke and whom to call next.
Security needs broader access, but still not everything without a reason. The security team checks anomalies, mass exports, rate-limit bypasses, and attempts to send PII in prompts. For this, full event chains, IP, service account, content labels, and access-change history are useful.
Compliance usually looks not at the conversations themselves, but at whether the rules were followed: retention period, masking, who requested disclosure, and who approved viewing. Development rarely needs raw text. It usually only needs anonymized snippets, traces, and service fields.
In practice, the split looks like this:
- the on-call team sees metadata and masked parts;
- security sees events, links between them, and access history;
- compliance sees the basis, timing, and log action history;
- development gets anonymized data or an approved sample for an incident.
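The split above can be expressed as a plain role-to-fields mapping. The role names and field sets in this sketch are illustrative; the point is that filtering happens by task, not by job title.

```python
# Role-to-fields mapping, split by task rather than job title (illustrative).
ROLE_FIELDS = {
    "on_call": {
        "timestamp_utc", "request_id", "session_id", "model_name",
        "route_id", "status", "error_code", "masked_prompt",
    },
    "security": {
        "timestamp_utc", "request_id", "trace_id", "caller_id",
        "source_ip", "service_account", "content_labels",
        "access_change_history",
    },
    "compliance": {
        "timestamp_utc", "request_id", "retention_status",
        "masking_status", "disclosure_requests", "log_action_history",
    },
    "development": {
        "timestamp_utc", "request_id", "trace_id", "status",
        "error_code", "latency_ms", "input_tokens", "output_tokens",
    },
}

def view_for_role(event: dict, role: str) -> dict:
    """Return only the fields this role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())
    return {k: v for k, v in event.items() if k in allowed}
```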
It is better to open the full request and response text only by request. You need a clear path: who created the request, who approved it, for how long access was granted, and why it is needed. If the incident involves customer data, a second approval level is usually justified.
Every view and every export should be written to a separate log. Not the same stream where LLM events live, but a log of actions on logs. Then the team can quickly see who opened the record, who exported the file, and who tried to search by sensitive fields.
Temporary access should not be left 'until the end of the week.' The incident is reviewed, then access is closed immediately. Auto-expiry after a few hours is more reliable than manual promises.
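A minimal sketch of how a time-limited disclosure grant and the separate log of actions on logs can fit together is shown below; all names and the four-hour default are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Separate stream: a log of actions on logs, not mixed with LLM events.
ACCESS_LOG: list[dict] = []

def grant_full_text_access(requester: str, approver: str, reason: str,
                           request_id: str, hours: int = 4) -> dict:
    """Time-limited grant to read the full prompt/response of one event."""
    now = datetime.now(timezone.utc)
    grant = {
        "requester": requester,
        "approver": approver,
        "reason": reason,
        "request_id": request_id,
        "granted_at": now,
        "expires_at": now + timedelta(hours=hours),  # auto-expiry, not manual promises
    }
    ACCESS_LOG.append({"action": "grant", **grant})
    return grant

def record_view(account: str, request_id: str, exported: bool) -> None:
    """Every view and every export goes into the same separate stream."""
    ACCESS_LOG.append({
        "action": "export" if exported else "view",
        "account": account,
        "request_id": request_id,
        "at": datetime.now(timezone.utc),
    })
```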
And one more boring but useful check: match roles against the real organizational structure. People move between teams, contractors leave, responsibilities change. If the access list lives separately from the company's living structure, audit logs themselves become a source of risk.
How to set up the log step by step
Start not with fields, but with a map of the flow. The team should see every place where a request can enter an LLM at all: the in-app chat, the operator interface, the backend service, batch processing, an agent that calls external tools. If you miss even one entry point, the log will be leaky, and the incident review will again depend on guesswork.
Next, it helps to trace one request from start to finish and mark where it changes. For example, the service adds a system prompt, the gateway chooses a model, and then an external tool returns data. In a bank, that is a normal situation: the customer writes in chat, the backend adds CRM context, and the LLM goes through a single API gateway to another model if the first one is unavailable.
Minimum sequence
- Record all entry points and all points where the request changes. This includes retries, fallback to another model, and tool calls.
- Agree on the mandatory event fields. Usually request_id, time, user or service, channel, model, provider, prompt version, response status, latency, tokens, and whether a tool was called are enough.
- Turn on masking before the log is written. Do not store raw data with the hope of cleaning it up later.
- Separate retention periods. The full event trail can be kept for less time, while anonymized metadata can be kept longer if it is needed for investigations and reporting.
- Assign access roles. The developer sees technical fields, the security team sees the incident log, and raw fragments are available only to a narrow group by request.
At the field-setting stage, do not try to record everything. Too much noise is no less harmful than empty logs. If the team uses an OpenAI-compatible gateway, it is also useful to log the route: which model was selected, whether traffic was shifted to another provider, where rate limits were triggered, and whether masking was enabled.
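One way to capture the route is a thin wrapper around the OpenAI-compatible call that writes the technical result and routing metadata to the audit sink. The response headers read below (X-Route-Id, X-Selected-Provider) are hypothetical: check which route fields your gateway actually exposes and map them accordingly.

```python
import time
import uuid
import requests

GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"  # hypothetical

def call_with_audit(messages: list[dict], model: str, api_key: str, audit_sink) -> dict:
    """Call an OpenAI-compatible gateway and log the route alongside the result."""
    request_id = str(uuid.uuid4())
    started = time.monotonic()
    resp = requests.post(
        GATEWAY_URL,
        json={"model": model, "messages": messages},
        headers={"Authorization": f"Bearer {api_key}",
                 "X-Request-Id": request_id},          # propagate our own ID
        timeout=30,
    )
    latency_ms = int((time.monotonic() - started) * 1000)
    body = resp.json() if resp.ok else {}

    audit_sink({
        "request_id": request_id,
        "model_requested": model,
        "model_used": body.get("model"),                      # what actually answered
        "route_id": resp.headers.get("X-Route-Id"),           # hypothetical header
        "provider": resp.headers.get("X-Selected-Provider"),  # hypothetical header
        "rate_limited": resp.status_code == 429,
        "status": resp.status_code,
        "latency_ms": latency_ms,
        "input_tokens": body.get("usage", {}).get("prompt_tokens"),
        "output_tokens": body.get("usage", {}).get("completion_tokens"),
    })
    return body
```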
After setup, you need a training incident. Simulate an alert: for example, a user received a response with a fragment of personal data. The team should go through the entire path without manual guesswork: find the event, see the request route, check where masking failed, and put together a short report. If that takes half a day, the log is still too raw.
A real-world incident without theory
A bank customer writes in chat and asks whether they can take out a new loan to close the current one. The assistant answers too confidently: it recommends taking the product right away and barely mentions the risks, income verification, or speaking to a bank employee. An hour later, a complaint goes to support.
If the team has LLM audit logs, the review does not turn into a memory contest. The on-call person takes the response time from the complaint, finds the request_id, and opens the exact record for the right minute.
In the event card, the team immediately sees:
- which model was called;
- through which provider or route the request passed;
- which version of the system prompt the service used;
- which context reached the model after masking;
- which response went to the customer and how long the call took.
That is already enough to remove half the guesswork. It turns out the chat should not have given an almost ready-made loan recommendation without an explicit disclaimer and escalation to an operator. But in the new version of the response template, the escalation phrase disappeared, and the model on that route answered too directly.
Then security gets involved. The specialist does not read the entire flow line by line. They open the access log for the record and check who accessed the event after the complaint, who exported response fragments, and whether anyone pulled fields they did not need. If the access log is assembled properly, it shows the account, time, reason for viewing, and the fact of export.
After that, the team does not fix everything at once. It fixes the two things that caused the failure. First, it adjusts the response template: the chat no longer suggests a loan product as a direct action until it gathers the required data and shows the disclaimer. Then it enables the escalation rule: if the user asks about approval, rates, or refinancing, the dialog goes to a human or to a separate scenario with stricter limits.
A good incident review looks boring. That is the advantage. The team does not argue about who remembers what; it sees facts: the request, the route, the prompt version, the masked context, and the employees' actions after the complaint. The next similar dialog will not repeat the same mistake.
Where teams most often go wrong
When the incident has already happened, it is too late to discover that the log only contains the request text and no service trail. For a bank or public agency, this is a common problem: you can see what the user sent, but not which service accepted the request, which route was used, which model answered, or who later read the result. Such audit logs are almost useless.
The first mistake is simple: the team keeps only the prompt and the response. Without request_id, time, environment, template version, model name, call parameters, and the status of PII masking, the review quickly runs into guesswork. If a customer complained about a strange response at 14:07, the log should let you rebuild the chain in a minute, not piece it together from five systems.
The second mistake is even more basic: test and production write to the same log. After that, investigators, security, and developers see noise instead of facts. Test traffic is easy to confuse with real traffic, especially if the team runs realistic scenarios. As a result, the production incident gets buried under draft checks and training requests.
Teams often open read access too widely. A contractor, intern, or external integrator gets almost the same access as the internal team. That is a bad idea even where the data is already masked. The log can still reveal who worked with whom, what topics were discussed, and when the failure happened.
Another common gap is not recording a change in model, system prompt, or template. Yesterday the request went to one model with one instruction, today it goes to another, but the log does not show that. Then everyone argues about why the answer changed, even though the reason is obvious. If the company uses a gateway like AI Router, the route and selected model should be written to the event automatically.
Another pain point is deleting records too early. The team sets a short retention period to avoid storing too much, but the incident review takes longer. Two weeks later part of the trail is gone, and the picture cannot be restored.
Usually five rules are enough:
- store not only the text, but also the technical context of the call;
- separate test, pilot, and production into different logs;
- grant access by role and task, not to everyone;
- record every change in model, template, and settings;
- freeze deletion of records if a review is in progress.
If these rules are missing, even a small failure turns into a long dispute between security, development, and the business. A good log ends the dispute quickly: it shows who sent what, through which route it went, and why the system answered the way it did.
Quick check before launch
Before launch, do not check the schema on paper. Walk one live request from the API entry point to the log record and back to searching for that record by ID. If the team cannot walk this path manually in a few minutes, the log will only add noise during an incident.
For banks and the public sector, this minimum is usually enough:
- each request has its own ID, exact time, and clear link to a service, user, or system role;
- the record shows which model answered, which provider handled the call, and which prompt version was used;
- the system masks PII before the log is written, not after;
- each role sees only its own fields;
- the on-call team can quickly pull up the needed record by request_id, time, customer key, or operation number.
Check the deletion rule separately. It should clean up old data on schedule, but not break an incident review that started at the edge of the retention period. Usually this comes down to a short window in which deletion can be frozen by ticket or incident.
A good test is very simple. Take one failed request, one successful request, and one request with personal data. Then ask three roles - developer, security, and audit - to find them in the log and answer two questions: what happened and who has the right to see the details. If the answers come together quickly and without manual decoding, the system is ready to go live more calmly.
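That test can also run as a small automated check, as in the sketch below. The in-memory store, the three seeded events, and the role split are illustrative stand-ins for your own log and access model.

```python
# Pre-launch smoke test sketch: seed three events, then confirm each one
# can be found by request_id and that each role sees only its own fields.
EVENTS = {
    "req-ok":   {"request_id": "req-ok",   "status": "ok",    "error_code": None,
                 "masked_prompt": "balance query",  "raw_prompt": "what is my balance"},
    "req-fail": {"request_id": "req-fail", "status": "error", "error_code": "504",
                 "masked_prompt": "timeout case",   "raw_prompt": "generate report"},
    "req-pii":  {"request_id": "req-pii",  "status": "ok",    "error_code": None,
                 "masked_prompt": "card reissue; card number found", "raw_prompt": None},
}

ROLE_FIELDS = {
    "developer": {"request_id", "status", "error_code", "masked_prompt"},
    "security":  {"request_id", "status", "error_code", "masked_prompt", "raw_prompt"},
    "audit":     {"request_id", "status"},
}

def test_lookup_and_visibility():
    for request_id, event in EVENTS.items():
        # 1. The event can be found by its ID.
        assert EVENTS.get(request_id) is not None
        for role, allowed in ROLE_FIELDS.items():
            visible = {k: v for k, v in event.items() if k in allowed}
            # 2. The role can answer "what happened" from its own view...
            assert "status" in visible
            # ...and raw text stays hidden from everyone outside the narrow group.
            if role != "security":
                assert "raw_prompt" not in visible

test_lookup_and_visibility()
print("smoke test passed")
```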
What to do next
Do not try to describe all LLM traffic across all services right away. Take one process where a mistake is expensive: replying to a customer, searching internal documents, or generating text for an operator. To start, the minimal event schema is enough: who sent the request, when, from which service it came, which model it went to, which prompt template was used, how many tokens were consumed, what the model returned, and how the operation ended.
Then run one training incident. Not on production data, but on a test set. Simulate a simple failure: the model revealed an extra fragment, the service started using three times more tokens, or the operator saw data they should not have seen. If the log cannot restore the chain of actions in 15-20 minutes, it is still poorly assembled.
It is worth agreeing on a few rules in advance:
- who reads full logs;
- who sees only metadata;
- how long to keep raw events and how long to keep aggregates;
- who approves exports for investigations;
- who is responsible for deletion under the retention policy.
These things are better agreed before launch with security, compliance, and the product owner. Otherwise the argument will start on the day of the incident, when everyone needs an answer, not a discussion of roles and timelines.
If the audit logs go through a single gateway, check it manually. For AI Router, it makes sense to see whether the gateway writes events in a single format, masks PII before writing, supports data storage inside the country, and applies rate limits at the access-key level. These are not details. They later show who made the request, what exactly went into the model, and why the incident was not stopped earlier.
A normal plan for the next few days is simple: fix the event schema on one page, turn on logging for one critical process, and run a training incident. After such a run, it is almost always clear what is missing: one field, a separate access role, or a clearer retention period.
Frequently asked questions
What should be in one LLM audit-log event?
Store request_id, the exact time with time zone, the service or user, the channel, the model, the provider, the route, the system prompt version, the status, the error, latency, tokens, and price. That is enough to quickly understand who sent the request, where it went, and why the response turned out the way it did.
Should the full prompt and full response be stored?
No, full text is not needed for every role or every case. For day-to-day work, the team usually only needs metadata and a masked fragment, while the full prompt and response are better kept separately and opened only for an incident or review.
How should PII be masked in logs?
Mask sensitive data at the moment of recording, not later. Card numbers, IIN, phone numbers, and full names are better replaced immediately with a mask, hash, or internal ID so the team can search for the event without access to someone else's data.
How long should LLM logs be kept?
Usually the full text of requests and responses is kept for a short period, for example 30-90 days. Event metadata can be stored longer, often 1-3 years, if internal control requires it and if there is no extra personal data in it.
Who needs access to full audit logs?
Give access by task, not to everyone. The on-call team usually only needs metadata, security reviews the event chain and access history, and the full text should be opened only to a narrow group by request and for a short time.
Why record the model, provider, and route separately?
Because one model name does not explain what really happened. If the request goes through a gateway like AI Router, the call may go to another provider or follow another rule, and without the route you will not understand where latency grew or where the answer broke.
What should be done with log deletion during an incident?
Freeze deletion immediately by request_id, time period, user, or service. Otherwise the normal TTL will erase part of the trail, and the team will start arguing from memory again instead of reviewing the facts.
Why can't test and production events be written to the same log?
Do not mix them. When test and production write to the same log, noise hides real failures, and people waste time on other people's checks instead of the production incident.
How can a log be checked quickly before launch?
Take one live request and walk the whole path by hand: the call, the log entry, a search by request_id, and the route, status, and masking. If the developer, security, and audit teams can answer about that event without a long search, the setup is already working.
Where should you start when implementing LLM audit logs?
Start with one process where mistakes are expensive: a customer chat, document search, or guidance for an operator. Fix the base event schema on one page, turn on logging, run a training incident, and then close the gaps in fields, access, and retention.