Latency expectations

This page sets honest expectations for the latency of each governance endpoint, so you can plan integrations correctly. Numbers below are wall-clock end-to-end (including network, auth, retrieval/detection, and metering) measured against the production InPolicy backend in us-central1.

For record_turn (when not using checkOutput: true) and the legacy single-shot pre-inference endpoint:

| Scenario | p50 | p95 |
| --- | --- | --- |
| Repeated session (cache hit) | ~100 ms | ~140 ms |
| Cold session, prompt of any length | ~100 ms | ~460 ms |

Pre-inference latency is flat across input size. Retrieval is bounded by top-K policies, not the length of the prompt. The cache TTL is 24 hours, so steady-state is essentially always the warm path.

Verdict: fast enough for the hot path of any agent. Expect roughly 100 ms per turn at p50, rising to ~460 ms at p95 only on a cold session.

For check_output and record_turn with checkOutput: true:

| Output length | Mode | p50 | p95 |
| --- | --- | --- | --- |
| Up to ~120 chars | short | ~1.6 s | ~2.1 s |
| ~500 chars | short | ~2.2 s | ~2.6 s |
| >1200 chars | long | ~6.1 s | ~8.0 s |

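The detector escalates to a more capable model on outputs longer than ~1200 characters. Crossing that threshold can move the p50 from ~2.2 s to ~6.1 s, so an output only slightly over the boundary costs roughly three times as much as one slightly under it.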
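As a rough illustration of how the table and the ~1200-character boundary could feed a client-side timeout, here is a minimal sketch. The function names, the `margin` parameter, and the bucketing of mid-length outputs onto the ~500-character row are assumptions made for the example, not part of the InPolicy API.

```python
from dataclasses import dataclass

# Approximate short/long boundary quoted on this page.
LONG_MODE_THRESHOLD_CHARS = 1200


@dataclass(frozen=True)
class Expectation:
    mode: str
    p50_s: float
    p95_s: float


def expected_check_latency(output: str) -> Expectation:
    """Map an output to the nearest published row of the latency table."""
    n = len(output)
    if n > LONG_MODE_THRESHOLD_CHARS:
        return Expectation("long", 6.1, 8.0)
    if n <= 120:
        return Expectation("short", 1.6, 2.1)
    # Assumption: outputs between ~120 and ~1200 chars behave like the ~500-char row.
    return Expectation("short", 2.2, 2.6)


def client_timeout_s(output: str, margin: float = 1.5) -> float:
    """Hypothetical client timeout: published p95 plus some headroom."""
    return round(expected_check_latency(output).p95_s * margin, 2)
```

With these assumptions, a 2,000-character output maps to long mode and a 12-second timeout. The margin is arbitrary; tune it against your own measurements.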

Verdict: good for asynchronous review, human-in-the-loop drafting, and compliance audit. Not yet good for sub-second real-time chat; for that, wait for streaming check_output (planned for v1.1).

| Tool call payload | p50 | p95 |
| --- | --- | --- |
| Typical (small structured args) | ~1.6 s | ~3 s |
| Large argument bodies (>1200 chars) | ~6 s | ~8 s |

Tool-call checks use the same short/long mode boundary as check_output (~1200 characters).

Verdict: good for gating high-stakes actions (send_email, execute_payment). Not for fast-fire actions like search or autocomplete.
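One way to apply this verdict is to gate only a named set of high-stakes tools and let everything else run ungated. The sketch below assumes a caller-supplied `gate` callable standing in for whatever governance check you use; `run_tool`, `should_gate`, and the set contents beyond `send_email` and `execute_payment` are hypothetical.

```python
# Tools worth a 1-3 s governance check; the two names come from this page.
HIGH_STAKES_TOOLS = frozenset({"send_email", "execute_payment"})


def should_gate(tool_name: str) -> bool:
    """Return True only for tools where the added latency is acceptable."""
    return tool_name in HIGH_STAKES_TOOLS


def run_tool(tool_name, args, execute, gate):
    """Run `execute`, consulting `gate` first only for high-stakes tools.

    `execute(tool_name, args)` performs the action.
    `gate(tool_name, args)` is a placeholder for the governance check and
    returns True to allow the call.
    """
    if should_gate(tool_name) and not gate(tool_name, args):
        raise PermissionError(f"{tool_name} blocked by governance check")
    return execute(tool_name, args)
```

Fast-fire tools such as search never reach the gate, so they pay no extra latency.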

| Failure | Latency |
| --- | --- |
| Bogus API key | ~700 ms (bcrypt comparison) |
| Revoked API key | ~5 ms (rejected at guard) |
| Rate-limit exceeded | ~5 ms |
| Feature gate disabled (tool_call_governance: "disabled") | ~5 ms |

The bogus-key path takes longer than valid-key paths because bcrypt is intentionally slow. This is useful to know if you’re tuning rate limits: bogus-key floods are bcrypt-bound, not DB-bound.
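To make "bcrypt-bound" concrete, the back-of-the-envelope arithmetic can be written out as follows. It assumes each bogus-key comparison occupies one worker for the full ~700 ms; the worker count is a hypothetical input, not a published figure.

```python
# ~700 ms per bogus-key comparison, taken from the failure table on this page.
BCRYPT_SECONDS = 0.7


def max_bogus_key_rps(workers: int, bcrypt_seconds: float = BCRYPT_SECONDS) -> float:
    """Upper bound on bogus-key requests per second.

    Assumes every worker is fully occupied by bcrypt comparisons and that
    nothing else (network, database) is the bottleneck.
    """
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return workers / bcrypt_seconds
```

Under that assumption, four workers top out at about 5.7 bogus-key requests per second, which is the ceiling a rate limit on unauthenticated traffic would need to sit below.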

| Use case | Verdict |
| --- | --- |
| Pre-inference policy injection on every turn | ✅ Always do this. <200 ms on the warm path. |
| Post-inference verification on customer-service drafts (human reviews before sending) | ✅ 2-3 s is invisible to a human. |
| Post-inference verification on real-time chat outputs | ⚠️ Acceptable on short outputs, painful on long. Use streaming when available. |
| Post-inference verification on voice agents | ❌ Wait for streaming. |
| Tool-call gating before any external action | ✅ 1-3 s is fine for action-bound calls. |
| Compliance audit pipeline running over yesterday’s outputs | ✅ Long mode latency is irrelevant. |

Why these numbers, and what we’re doing about it

Pre-inference is fast because it’s deterministic: hybrid search over a finite policy set, cached, no LLM hop. We’re not actively optimizing here.

Post-inference is slow because it’s an LLM evaluation pass, fundamentally bound by Gemini latency. We have two levers:

  1. Mode selection (already shipped). Short content goes through a flash-lite model with no thinking budget. Long content gets a fuller model with reasoning. The 1200-char boundary is calibrated to balance accuracy and speed.
  2. Streaming detection (v1.1, planned). For long outputs, instead of waiting for one big call to complete, we run detection on rolling windows as the model emits text. Time-to-first-violation drops from ~6 s to ~1.5 s. The total work is similar; the perceived latency is dramatically better for real-time UX.
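The rolling-window idea in the second lever can be sketched generically. This is not the planned v1.1 implementation: the window size, the overlap, and the `detect` callable are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, Optional


def rolling_windows(chunks: Iterable[str], size: int = 400, overlap: int = 100) -> Iterator[str]:
    """Yield overlapping text windows as streamed chunks arrive."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    buffer = ""
    unseen = 0  # characters in buffer not yet covered by a yielded window
    for chunk in chunks:
        buffer += chunk
        unseen += len(chunk)
        while len(buffer) >= size:
            yield buffer[:size]
            # Keep the tail so a violation spanning a window edge is still seen.
            buffer = buffer[size - overlap:]
            unseen = len(buffer) - overlap
    if unseen:
        yield buffer  # final partial window


def first_violation(
    chunks: Iterable[str],
    detect: Callable[[str], bool],
    size: int = 400,
    overlap: int = 100,
) -> Optional[str]:
    """Return the first window flagged by `detect`, without waiting for the full output."""
    for window in rolling_windows(chunks, size, overlap):
        if detect(window):
            return window
    return None
```

Because `first_violation` returns as soon as one window is flagged, the time to the first verdict depends on the window size rather than on the total output length, which is the effect the streaming lever is after.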

If your agent generates long outputs and your UX can’t absorb the long-mode latency, switch to record_turn without checkOutput and rely on pre-inference policy injection alone. The model sees the policies in its system prompt and complies in most cases. Use check_output only on outputs your UX can hold for verification.
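The fallback described above amounts to a small decision at call time. In this sketch only `checkOutput` is a name taken from this page; the `output` key, the function name, and the choice to compare the UX budget against the published p95 are assumptions.

```python
# Figures quoted on this page.
LONG_MODE_THRESHOLD_CHARS = 1200
SHORT_MODE_P95_S = 2.6
LONG_MODE_P95_S = 8.0


def record_turn_args(output: str, ux_budget_s: float) -> dict:
    """Build hypothetical record_turn arguments for a given latency budget.

    checkOutput is enabled only when the UX can hold the output for the
    published p95 of the mode the output would fall into.
    """
    is_long = len(output) > LONG_MODE_THRESHOLD_CHARS
    p95_s = LONG_MODE_P95_S if is_long else SHORT_MODE_P95_S
    return {"output": output, "checkOutput": ux_budget_s >= p95_s}
```

A chat UI with a 3-second budget would verify short replies and skip verification on long ones, relying on pre-inference policy injection for those.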

The numbers above will improve as we tune retrieval and the AI service. We’ll surface live percentiles on the [status page] (TODO: link when public). For now, treat the p95 as a soft SLO. If you see consistent regressions, open an issue and we’ll investigate.
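Until live percentiles are published, you can compare your own measurements against the soft SLO. The sketch below uses Python's standard `statistics.quantiles`; the function names and the 10% tolerance are assumptions, and the latency samples are whatever your client records.

```python
import statistics
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile of observed latencies (needs at least two samples)."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def regressed(samples_ms: Sequence[float], slo_p95_ms: float, tolerance: float = 1.1) -> bool:
    """True if the observed p95 exceeds the soft SLO by more than the tolerance."""
    return p95(samples_ms) > slo_p95_ms * tolerance
```

For example, you might call `regressed(samples, 460)` for cold pre-inference calls and only open an issue if it stays true across several measurement windows.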