Route, cache, guardrail, and audit every LLM call — across 15+ providers
Most gateways route. G8KEPR also auto-injects Anthropic prompt caching (38% cost cut at 2 calls, 88% at 10), adapts circuit breakers with Z-score baselines instead of static thresholds, and stamps an EU AI Act risk class on every completion. BYOK with AES-256-GCM — your keys, your provider rates, your audit trail.
15+ providers, one API. Add a custom endpoint and we route through that too:
The missing infrastructure layer for production LLM applications
Production AI apps need multiple LLM providers for reliability, cost optimization, and feature coverage. But managing them means juggling separate keys, billing accounts, rate limits, and error handling scattered across every service that calls an LLM.
Most teams hard-code provider-specific logic across services. Every switch is a code change, a redeploy, and a test pass. Cost tracking is manual. When a provider degrades, you find out from your users — not your monitoring.
One unified API that intelligently routes to the best LLM provider
client.chat.completions.create(model="auto")
Single API call with model="auto" — no provider-specific code.
✓ Routed to Claude • Cost-optimized • 145ms response • Failover to GPT-4 if unavailable
Ordered provider list — first healthy provider wins. Predictable routing with deterministic fallback order.
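The "first healthy provider wins" behavior can be sketched in a few lines. The `Provider` record and `route()` helper below are illustrative, not G8KEPR's actual internals:

```python
# Minimal sketch of priority routing with deterministic fallback,
# assuming each provider exposes a health flag.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    healthy: bool

def route(providers: list[Provider]) -> str:
    """Return the first healthy provider in the ordered chain."""
    for p in providers:
        if p.healthy:
            return p.name
    raise RuntimeError("all providers unhealthy")

chain = [Provider("claude", False), Provider("gpt-4", True), Provider("gemini", True)]
print(route(chain))  # claude is down, so the call falls through to "gpt-4"
```

Because the chain is ordered and the scan is deterministic, the same health state always produces the same routing decision.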
✓ Claude → GPT-4 → Gemini
Cheapest provider that meets the requested model capability. Gemini Flash ($0.50/M) for simple tasks, Claude ($3/M) for reasoning.
✓ Save up to 92%
Lowest p95 latency from historical tracking. Real-time metrics, not stale benchmarks. Critical for user-facing chat.
✓ Sub-200ms p95
Distribute load evenly across healthy providers. Sidesteps single-provider rate-limit ceilings during traffic bursts.
✓ No 429s under burst
Classify the prompt (CODING, CREATIVE, ANALYSIS, GENERAL) and route to the model that excels at that intent.
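Intent routing can be pictured as a classify-then-lookup step. The keyword classifier and model table below are invented for illustration — the real classifier and model assignments are not public:

```python
# Toy sketch of intent-based routing: classify the prompt, then pick the
# model mapped to that intent. Model names here are placeholders.
INTENT_MODEL = {
    "CODING": "claude-3-5-sonnet",
    "CREATIVE": "gpt-4-turbo",
    "ANALYSIS": "claude-3-5-sonnet",
    "GENERAL": "gemini-flash",
}

def classify(prompt: str) -> str:
    text = prompt.lower()
    if any(k in text for k in ("def ", "function", "bug", "stack trace")):
        return "CODING"
    if any(k in text for k in ("poem", "story", "tagline")):
        return "CREATIVE"
    if any(k in text for k in ("analyze", "compare", "trend")):
        return "ANALYSIS"
    return "GENERAL"

def route_by_intent(prompt: str) -> str:
    return INTENT_MODEL[classify(prompt)]

print(route_by_intent("Fix this bug in my parser"))  # claude-3-5-sonnet
```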
✓ Best-fit model per call
Five capabilities that live in the platform — not in OpenAI's SDK, not in LangChain, not in a one-file proxy you wrote on a Friday afternoon.
Tracks system-prompt hashes per org. After two identical prompts, auto-injects Anthropic cache_control: ephemeral. Zero user-code changes.
gateway/cache_optimizer.py
Statistical baselines per provider per hour-of-day. Trips when failure rate > mean + 3σ. Progressive recovery at 10/25/50/100%.
gateway/router.py
Toxicity, bias, topic-block, PII, prompt-injection, regex, rate-limit. Block / redact / warn / log per policy. Every violation logged.
ai_guardrail_policies
Every completion stamped with the EU AI Act risk class for the model used (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) — wired at the gateway, not bolted on.
X-AI-Risk-Class
Keys encrypted at rest, decrypted only into process memory. Never written to Redis or logs. Per-key monthly cost limits enforced.
EncryptionService
System prompts are usually 80%+ of the token cost on every call — and they're identical every time. G8KEPR fingerprints them with SHA-256 and, after observing two identical prompts, automatically injects Anthropic's cache_control: ephemeral directive. No SDK changes, no flag flipping — it just happens.
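The fingerprint-then-inject behavior can be sketched as a counter over prompt hashes. The in-memory counter below stands in for whatever shared store the gateway actually uses; names and the threshold constant are illustrative:

```python
# Sketch: hash the system prompt per org, and once the same prompt has been
# seen twice, mark it cacheable with Anthropic's cache_control directive.
import hashlib
from collections import Counter

seen: Counter = Counter()
CACHE_AFTER = 2  # inject once the same system prompt has appeared twice

def maybe_inject_cache_control(org_id: str, messages: list[dict]) -> list[dict]:
    system = next((m for m in messages if m["role"] == "system"), None)
    if system is None:
        return messages
    fp = (org_id, hashlib.sha256(system["content"].encode()).hexdigest())
    seen[fp] += 1
    if seen[fp] >= CACHE_AFTER:
        # Anthropic prompt caching: flag the stable prefix as cacheable
        system["cache_control"] = {"type": "ephemeral"}
    return messages

msgs = [{"role": "system", "content": "You are a support bot."},
        {"role": "user", "content": "hi"}]
maybe_inject_cache_control("acme", msgs)   # first sighting: no change
maybe_inject_cache_control("acme", msgs)   # second sighting: directive added
print(msgs[0].get("cache_control"))        # {'type': 'ephemeral'}
```

The caller's message schema is untouched until the threshold is crossed, which is why no application-side changes are needed.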
Static thresholds break when traffic patterns shift. G8KEPR's breaker uses statistical baselines per provider, per hour — it knows that a 2% failure rate is normal at 3 a.m. and abnormal at 3 p.m.
Static threshold breaker: trips at a fixed 50% failure rate. It doesn't know that this provider runs hot at 3 a.m. anyway — false trips during normal off-peak, and gradual degradation that stays under the threshold goes unnoticed.
Adaptive breaker: hour-specific baseline. 2% failures at 3 a.m. is normal — won't trip. 2% failures at peak is a spike — trips immediately. Catches what threshold breakers miss.
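The hour-of-day baseline idea fits in a short class. The sample store, class name, and minimum-history rule below are assumptions; only the "trip when current rate > mean + 3σ" rule comes from the description above:

```python
# Sketch of an adaptive breaker keyed by (provider, hour-of-day).
import statistics

class AdaptiveBreaker:
    def __init__(self) -> None:
        self.samples: dict[tuple[str, int], list[float]] = {}

    def record(self, provider: str, hour: int, failure_rate: float) -> None:
        self.samples.setdefault((provider, hour), []).append(failure_rate)

    def should_trip(self, provider: str, hour: int, current_rate: float) -> bool:
        history = self.samples.get((provider, hour), [])
        if len(history) < 2:
            return False  # not enough data to form a baseline yet
        mean = statistics.mean(history)
        sigma = statistics.stdev(history)
        return current_rate > mean + 3 * sigma

breaker = AdaptiveBreaker()
for rate in (0.020, 0.022, 0.018, 0.021):   # 3 a.m. normally runs ~2% failures
    breaker.record("provider-x", 3, rate)

print(breaker.should_trip("provider-x", 3, 0.021))  # False: within the 3 a.m. baseline
print(breaker.should_trip("provider-x", 3, 0.15))   # True: far above baseline
```

A static 50% threshold would pass both of these checks silently; the baseline version flags the 15% spike immediately while leaving normal off-peak noise alone.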
How G8KEPR customers save 60-90% on LLM costs with intelligent routing
See exactly what you're spending across all LLM providers in one dashboard
| Provider | Model | Requests | Tokens | Rate | Cost |
|---|---|---|---|---|---|
| Claude | 3.5 Sonnet | 12,456 | 2.3M | $3/M | $6.90 |
| OpenAI | GPT-4 Turbo | 1,234 | 0.8M | $30/M | $24.00 |
| Google | Gemini Flash | 45,123 | 8.2M | $0.50/M | $4.10 |
| Total | | 58,813 | 11.3M | Avg $3.10/M | $35.00 |
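The blended rate in the Total row is just total spend over total tokens — a quick check against the rows above:

```python
# Verify the dashboard's blended $/M rate from the per-provider rows.
rows = [
    (2.3, 6.90),   # Claude 3.5 Sonnet: tokens (M), cost ($)
    (0.8, 24.00),  # GPT-4 Turbo
    (8.2, 4.10),   # Gemini Flash
]
tokens = sum(t for t, _ in rows)      # 11.3M tokens
cost = sum(c for _, c in rows)        # $35.00
print(f"${cost / tokens:.2f}/M")      # $3.10/M blended
```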
Everything you need to manage multi-LLM applications in production
Bring your own keys for any provider. Encrypted at rest, decrypted only into process memory — never written to Redis or logs. Per-key monthly cost limits enforced.
✓ EncryptionService · ai_gateway_keys
Configure ordered fallback chains: Claude → GPT-4 → Gemini. Automatic re-route on 5xx, 429, latency-SLO miss, or health-check failure. Health tracked per provider per hour.
✓ Per-provider exponential backoff
Tag requests with user_id, team_id, project_id. Per-org / per-user / per-provider / per-model rollup. Z-score anomaly detection fires Slack and email when spend spikes.
✓ gateway_usage_logs · cost_budgets
Toxicity, bias, topic-block, PII, prompt-injection, regex, and rate-limit policies — evaluated on the prompt before it leaves the gateway. Block / redact / warn / log per policy.
✓ ai_guardrail_policies · ai_guardrail_violations
Outbound prompts scanned for emails, financial IDs, identity docs, contact info, network IDs, location, credentials, and health data. Type-safe placeholder redaction or hard block.
✓ pii_filters · per-org rules
LLM_HARD_MAX_TOKENS=16384 enforced on every provider, regardless of user input. A caller can't pass max_tokens=999999 to drain the budget in one shot.
✓ Cost-amplification attack defense
Every completion stamped with X-AI-Risk-Class (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) for the model used. eu_risk_class field on model version records. Wired at the infrastructure level.
✓ X-AI-Risk-Class · eu_risk_class
All outbound calls (especially BYOI custom endpoints) routed through SSRFProtectedTransport. IPv4-mapped IPv6 normalized, the 169.254.169.254 metadata endpoint blocked, HTTP/2 keep-alive pooling.
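The two SSRF checks named here — normalizing IPv4-mapped IPv6 and refusing link-local metadata addresses — can be sketched with the standard library alone. The function name and the exact set of refused ranges are assumptions, not SSRFProtectedTransport's actual code:

```python
# Sketch: resolve the target, unwrap IPv4-mapped IPv6 (so ::ffff:169.254.169.254
# is judged as 169.254.169.254), and refuse private / link-local / loopback hosts.
import ipaddress
import socket

def assert_safe_target(host: str) -> None:
    addr = ipaddress.ip_address(socket.getaddrinfo(host, None)[0][4][0])
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped:
        addr = addr.ipv4_mapped  # judge the embedded IPv4 address instead
    if addr.is_private or addr.is_link_local or addr.is_loopback:
        raise PermissionError(f"blocked outbound target: {addr}")

assert_safe_target("93.184.216.34")  # a public address: allowed
# assert_safe_target("169.254.169.254") would raise PermissionError
```

Without the unwrap step, `::ffff:169.254.169.254` is not in the IPv6 link-local range and would slip past a naive check — which is exactly why the normalization matters.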
✓ gateway/http_client.py
Drop-in replacement — change the base URL to route to any of 15+ providers. No code changes to application logic. Same chat-completions schema, same streaming, same function-call format.
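"Drop-in" means only the transport target changes: the request body stays the standard chat-completions schema. The sketch below builds that payload; the gateway URL is a placeholder, not a real endpoint:

```python
# The same JSON body you would POST to any chat-completions endpoint —
# only the URL (and the API key) change when pointing at the gateway.
import json

GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder

payload = {
    "model": "auto",  # let the gateway choose the provider
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
body = json.dumps(payload)  # POST this to GATEWAY_URL with your gateway key
```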
✓ One-line integration
Everything you need to know about multi-LLM routing
Need help setting up multi-LLM routing?
Talk to our AI Gateway experts →
Every completion carries the EU AI Act risk class and a full audit-log entry. Mappings are pre-built — auditors get exports, not spreadsheets.
Subject to independent audit and attestation. G8KEPR provides the technical controls and evidence — your auditor issues the certification.
Auto prompt caching, adaptive circuit breakers, seven guardrail policies, EU AI Act headers, and BYOK with AES-256-GCM — across 15+ providers, with no per-token markup.
No credit card required • BYOK - no markup • Cancel anytime