15+ Providers · 5 Routing Strategies · Adaptive Circuit Breakers

One API for
Every LLM Provider

Route, cache, guardrail, and audit every LLM call — across 15+ providers

Most gateways route. G8KEPR also auto-injects Anthropic prompt caching (38% cost cut at 2 calls, 88% at 10), adapts circuit breakers with Z-score baselines instead of static thresholds, and stamps an EU AI Act risk class on every completion. BYOK with AES-256-GCM — your keys, your provider rates, your audit trail.

Auto Prompt Caching
Z-Score Circuit Breaker
7 Guardrail Policies
EU AI Act Headers
BYOK · No Markup

15+ providers, one API. Add a custom endpoint and we route through that too:

Anthropic · OpenAI · Google · Azure OpenAI · AWS Bedrock · Groq · Cohere · Mistral · DeepSeek · Fireworks · Together · xAI · Cerebras · Ollama · BYOI Custom
15+ Providers (incl. BYOI)
5 Routing modes: priority · cost · latency · round-robin · semantic
7 Guardrail policies, evaluated pre-prompt
8 PII categories: redact · block · warn
88 API routes in the gateway sub-router
88% auto-cache savings on system tokens at 10 calls
16,384 hard token cap (cost-amplification block)
Z>3σ adaptive breaker (beats static Hystrix thresholds)

What is an AI Gateway?

The missing infrastructure layer for production LLM applications

The Multi-LLM Challenge

Production AI apps need multiple LLM providers for reliability, cost optimization, and feature coverage. But managing them means juggling separate keys, billing accounts, rate limits, and error handling scattered across every service that calls an LLM.

API Key Management
Separate keys for OpenAI, Anthropic, Google, Azure, Bedrock...
Cost Tracking
Different pricing per model — Claude $3/M, GPT-4 $30/M, Gemini $0.50/M
Rate Limit Handling
Each provider has different limits — OpenAI 10k RPM, Anthropic 50 RPM
Failover Logic
Manual retry logic when OpenAI is down — switch to Claude or Gemini

The Traditional Approach

Most teams hard-code provider-specific logic across services. Every switch is a code change, a redeploy, and a test pass. Cost tracking is manual. When a provider degrades, you find out from your users — not your monitoring.

Vendor Lock-In
Switching from OpenAI to Claude requires changing every API call
No Failover
If OpenAI goes down, your entire app is down
Hidden Costs
No visibility into which models/users cost the most
Scattered Logic
Provider-specific code duplicated across services

How G8KEPR AI Gateway Works

One unified API that intelligently routes to the best LLM provider

Intelligent LLM Routing
1. Your Application
client.chat.completions.create(model="auto")

Single API call with model="auto" - no provider-specific code

2. G8KEPR Routing Engine
Selects optimal provider
Check Costs: Claude $3/M, GPT-4 $30/M, Gemini $0.50/M
Measure Latency: Claude 145ms, GPT-4 180ms
Health Check: Is provider available?
Rate Limits: Has quota remaining?
Claude 3.5 · ✓ Selected · $3/M • 145ms
GPT-4 · Standby · $30/M • 180ms
Gemini · Standby · $0.50/M • 200ms

✓ Routed to Claude • Cost-optimized • 145ms response • Failover to GPT-4 if unavailable

Priority

Ordered provider list — first healthy provider wins. Predictable routing with deterministic fallback order.

Claude → GPT-4 → Gemini

Cost-Optimized

Cheapest provider that meets the requested model capability. Gemini Flash ($0.50/M) for simple tasks, Claude ($3/M) for reasoning.

Save up to 92%

Latency-Optimized

Lowest p95 latency from historical tracking. Real-time metrics, not stale benchmarks. Critical for user-facing chat.

Sub-200ms p95

Round-Robin

Distribute load evenly across healthy providers. Sidesteps single-provider rate-limit ceilings during traffic bursts.

No 429s under burst

Semantic Intent

Classify the prompt (CODING, CREATIVE, ANALYSIS, GENERAL) and route to the model that excels at that intent.

Best-fit model per call
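
To make the strategies concrete, here is a minimal sketch of the idea behind semantic-intent routing. The keyword heuristic and model names are illustrative placeholders only; a production classifier would be embedding- or model-based:

```python
# Illustrative sketch of semantic-intent routing; not G8KEPR's classifier.
INTENT_KEYWORDS = {
    "CODING":   ("function", "bug", "refactor", "stack trace", "compile"),
    "CREATIVE": ("story", "poem", "slogan", "brainstorm"),
    "ANALYSIS": ("compare", "summarize", "evaluate", "trade-off"),
}

INTENT_MODEL = {                     # hypothetical best-fit mapping
    "CODING":   "claude-3-5-sonnet",
    "CREATIVE": "gpt-4-turbo",
    "ANALYSIS": "gemini-1.5-pro",
    "GENERAL":  "gemini-1.5-flash",  # cheap default for everything else
}

def classify_intent(prompt: str) -> str:
    text = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "GENERAL"

def route(prompt: str) -> str:
    return INTENT_MODEL[classify_intent(prompt)]

print(route("Refactor this function to fix the N+1 query bug"))  # claude-3-5-sonnet
```
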
Generic Multi-LLM Proxies Don't Do These

What G8KEPR Adds That LangChain Routers Don't

Five capabilities that exist in the platform — not in OpenAI's SDK, not in LangChain, not in a one-file proxy you wrote on a Friday afternoon.

Auto Prompt Caching

Tracks system-prompt hashes per org. After two identical prompts, auto-injects Anthropic's cache_control: ephemeral. Zero user code changes.

SHA-256 fingerprint · gateway/cache_optimizer.py
OpenAI SDK / LangChain do not auto-cache.

Adaptive Z-Score Breaker

Statistical baselines per provider per hour-of-day. Trips when failure rate > mean + 3σ. Progressive recovery 10/25/50/100%.

4 windows · 3σ · gateway/router.py
Hystrix and Resilience4j use static thresholds.

7 Guardrail Policies

Toxicity, bias, topic-block, PII, prompt-injection, regex, rate-limit. Block / redact / warn / log per policy. Every violation logged.

pre-prompt · ai_guardrail_policies
Generic gateways forward unfiltered.

EU AI Act Headers

Every completion stamped with the EU AI Act risk class for the model used (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) — wired at the gateway, not bolted on.

every response · X-AI-Risk-Class
No SDK ships this header.

BYOK with AES-256-GCM

Keys encrypted at rest, decrypted only into process memory. Never written to Redis or logs. Per-key monthly cost limits enforced.

process-local only · EncryptionService
Most BYOK is plaintext-in-config.
Anthropic Prompt Caching · Automatic

The Cheapest Token
Is The One You Don't Resend

System prompts are usually 80%+ of the token cost on every call — and they're identical every time. G8KEPR fingerprints them with SHA-256, and after observing two identical prompts, automatically injects Anthropic's cache_control: ephemeral directive. No SDK changes, no flag flipping — it just happens.

Tracks up to 100 unique system-prompt fingerprints per org
Cache write: 1.25× input price · Cache read: 0.10× input price
Hit rate and cumulative token savings tracked per org in real time
Works with any Anthropic-routed call — no app code changes
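
In sketch form, the fingerprint-then-inject pattern looks like this. The names and the in-memory counter are simplifications for illustration; the production logic lives in gateway/cache_optimizer.py:

```python
import hashlib
from collections import Counter

# Hypothetical in-memory store; the real tracker keeps up to 100
# fingerprints per org with hit-rate and savings accounting.
seen: Counter = Counter()

def prepare_anthropic_payload(org_id: str, system_prompt: str, messages: list) -> dict:
    fp = hashlib.sha256(system_prompt.encode()).hexdigest()
    seen[(org_id, fp)] += 1

    system_block = {"type": "text", "text": system_prompt}
    if seen[(org_id, fp)] >= 2:
        # Prompt observed twice: mark the block cacheable from here on.
        # Cache writes bill at 1.25x input price, reads at 0.10x.
        system_block["cache_control"] = {"type": "ephemeral"}

    return {"system": [system_block], "messages": messages, "max_tokens": 1024}
```
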
Auto-Cache Savings Curve (System Tokens)
1 call: baseline · 2 calls: -38% · 5 calls: -70% · 10 calls: -88% · 100 calls: -90%
Break-even at 2 calls · Steady-state: 88% off
Beats Netflix Hystrix & Resilience4j

Adaptive Circuit Breaker

Static thresholds break when traffic patterns shift. G8KEPR's breaker uses statistical baselines per provider, per hour — it knows that a 2% failure rate is normal at 3 a.m. and abnormal at 3 p.m.

4 Time Windows
1m · 5m · 15m · 1hr
Rolling windows track success/failure rates across four scales — short bursts and long-running degradation both surface.
Z-Score > 3σ
mean + 3 × stddev
Trips when observed failure rate exceeds three standard deviations from the hour-specific baseline. Anomalies classified as spike, degradation, or sustained.
Progressive Recovery
10 → 25 → 50 → 100%
Doesn't flip back to 100% on first success. Ramps traffic gradually so a flapping provider can't re-cause an outage.
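
The trip condition itself is compact. A simplified sketch, assuming a history of failure rates for the current hour-of-day (the production breaker in gateway/router.py adds the four rolling windows and progressive recovery):

```python
from statistics import mean, stdev

def should_trip(history: list[float], observed_rate: float, z: float = 3.0) -> bool:
    """history: past failure rates for this provider at this hour-of-day."""
    if len(history) < 2:
        return False  # not enough data for a baseline; stay closed
    mu, sigma = mean(history), stdev(history)
    # Trip when the observed failure rate exceeds mean + 3 standard deviations.
    return observed_rate > mu + z * sigma

night = [0.018, 0.022, 0.020, 0.019]  # 3 a.m. baseline: ~2% failures is normal
peak  = [0.002, 0.003, 0.002, 0.004]  # 3 p.m. baseline: near-zero failures

print(should_trip(night, 0.02))  # False: normal for this hour
print(should_trip(peak, 0.02))   # True: a spike relative to the peak baseline
```
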
Why Static Thresholds Fail
Hystrix / Resilience4j

Trip at 50% failure rate. Doesn't know that this provider runs hot at 3 a.m. anyway. False trips during normal off-peak. Misses gradual degradation that stays under the threshold.

G8KEPR Adaptive

Hour-specific baseline. 2% failures at 3 a.m. is normal — won't trip. 2% failures at peak is a spike — trips immediately. Catches what threshold breakers miss.

Real Cost Savings Examples

How G8KEPR customers save 60-90% on LLM costs with intelligent routing

Customer Support Chatbot

-92%
BEFORE (Single Provider)
Provider: GPT-4 Turbo (only)
Cost: $2,400/month
Volume: 100k/month
AFTER (G8KEPR Routing)
Strategy: Gemini Flash + Claude fallback
Cost: $180/month
Saved: $2,220/mo

Code Generation Tool

-75%
BEFORE (Single Provider)
Provider: GPT-4 (only)
Cost: $1,800/month
Volume: 50k/month
AFTER (G8KEPR Routing)
Strategy: Claude 3.5 Sonnet + Gemini fallback
Cost: $450/month
Saved: $1,350/mo

Document Analysis Pipeline

-80%
BEFORE (Single Provider)
Provider: Claude Opus (only)
Cost: $3,000/month
Volume: 200k docs
AFTER (G8KEPR Routing)
Strategy: Gemini Pro + Claude for complex docs
Cost: $600/month
Saved: $2,400/mo

AI Assistant SaaS

-76%
BEFORE (Single Provider)
Provider: Mix of providers (manual switching)
Cost: $5,000/month
Volume: 500k/month
AFTER (G8KEPR Routing)
Strategy: G8KEPR auto-routing (cost priority)
Cost: $1,200/month
Saved: $3,800/mo

Unified Cost Tracking

See exactly what you're spending across all LLM providers in one dashboard

Monthly Cost Breakdown

Last 30 days
Provider    Model          Requests   Tokens   Rate          Cost
Claude      3.5 Sonnet     12,456     2.3M     $3/M          $6.90
OpenAI      GPT-4 Turbo    1,234      0.8M     $30/M         $24.00
Google      Gemini Flash   45,123     8.2M     $0.50/M       $4.10
Total                      58,813     11.3M    $3.10/M avg   $35.00
92% savings vs GPT-4 only ($450/month)
Track your costs →

AI Gateway Features

Everything you need to manage multi-LLM applications in production

BYOK · AES-256-GCM

Bring your own keys for any provider. Encrypted at rest, decrypted only into process memory — never written to Redis or logs. Per-key monthly cost limits enforced.

EncryptionService · ai_gateway_keys
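
For reference, encrypt-at-rest with AES-256-GCM looks roughly like this. This is a sketch built on the cryptography package, not G8KEPR's EncryptionService; in production the master key comes from a KMS, never from code:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

master_key = AESGCM.generate_key(bit_length=256)  # sketch only; load from a KMS

def encrypt_provider_key(plaintext_key: str, org_id: str) -> bytes:
    aes = AESGCM(master_key)
    nonce = os.urandom(12)  # 96-bit nonce, unique per encryption
    # org_id as associated data: the blob only decrypts for the right org
    ct = aes.encrypt(nonce, plaintext_key.encode(), org_id.encode())
    return nonce + ct       # store the nonce alongside the ciphertext

def decrypt_provider_key(blob: bytes, org_id: str) -> str:
    aes = AESGCM(master_key)
    # The decrypted value lives only in process memory; never log or cache it.
    return aes.decrypt(blob[:12], blob[12:], org_id.encode()).decode()

blob = encrypt_provider_key("sk-ant-...", "org_123")
assert decrypt_provider_key(blob, "org_123") == "sk-ant-..."
```
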

Provider Failover Chain

Configure ordered fallback chains: Claude → GPT-4 → Gemini. Automatic re-route on 5xx, 429, latency-SLO miss, or health-check failure. Health tracked per provider per hour.

Per-provider exponential backoff
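
A minimal sketch of the chain logic, assuming a hypothetical call_provider helper (the real chain also honors latency SLOs and per-provider health state):

```python
import time

PROVIDERS = ["claude", "gpt-4", "gemini"]  # ordered fallback chain

class ProviderError(Exception):
    def __init__(self, status: int):
        self.status = status

def complete_with_failover(prompt: str, call_provider) -> str:
    for provider in PROVIDERS:
        for attempt in range(3):
            try:
                return call_provider(provider, prompt)
            except ProviderError as e:
                if e.status != 429 and e.status < 500:
                    raise  # other 4xx: caller error, don't fail over
                time.sleep(2 ** attempt)  # per-provider exponential backoff
        # retries exhausted: fall through to the next provider in the chain
    raise RuntimeError("all providers failed")
```
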

Cost Tracking + Anomaly Alerts

Tag requests with user_id, team_id, project_id. Per-org / per-user / per-provider / per-model rollup. Z-score anomaly detection fires Slack and email when spend spikes.

gateway_usage_logs · cost_budgets

7 AI Guardrail Policies

Toxicity, bias, topic-block, PII, prompt-injection, regex, and rate-limit policies — evaluated on the prompt before it leaves the gateway. Block / redact / warn / log per policy.

ai_guardrail_policies · ai_guardrail_violations
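
Conceptually, each policy returns a verdict and the gateway applies the configured action before the prompt leaves. A simplified sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    policy: str
    triggered: bool
    action: str                 # "block" | "redact" | "warn" | "log"
    redacted: str | None = None

def log_violation(v: Verdict) -> None:
    print(f"guardrail violation: {v.policy} -> {v.action}")  # stub logger

def apply_guardrails(prompt: str, policies) -> str:
    for policy in policies:      # toxicity, bias, topic-block, PII, ...
        v: Verdict = policy(prompt)
        if not v.triggered:
            continue
        log_violation(v)                       # every violation is logged
        if v.action == "block":
            raise PermissionError(f"blocked by {v.policy}")
        if v.action == "redact" and v.redacted is not None:
            prompt = v.redacted                # continue with the scrubbed prompt
        # "warn" and "log" let the prompt through after recording the hit
    return prompt
```
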

PII Filter — 8 Categories

Outbound prompts scanned for emails, financial IDs, identity docs, contact info, network IDs, location, credentials, and health data. Type-safe placeholder redaction or hard block.

pii_filters · per-org rules
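
The redaction step is typed pattern matching. A sketch covering two of the eight categories, with deliberately simplified patterns:

```python
import re

PII_PATTERNS = {
    "EMAIL":       re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_pii(prompt: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        # Type-safe placeholder: the model still sees what kind of value was there
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(redact_pii("Refund jane@example.com, card 4111 1111 1111 1111"))
# -> "Refund [EMAIL], card [CREDIT_CARD]"
```
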

Hard max_tokens Cap

LLM_HARD_MAX_TOKENS=16384 is enforced on every provider, regardless of user input. A caller can't pass max_tokens=999999 to drain the budget in one shot.

Cost-amplification attack defense
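
The enforcement reduces to a clamp applied before any provider call (a sketch; variable names assumed):

```python
import os

HARD_MAX_TOKENS = int(os.getenv("LLM_HARD_MAX_TOKENS", "16384"))

def clamp_max_tokens(requested: int | None, default: int = 1024) -> int:
    # Whatever the caller sends (even max_tokens=999999), the cap wins.
    return min(requested or default, HARD_MAX_TOKENS)

assert clamp_max_tokens(999_999) == 16_384
```
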

EU AI Act Risk Headers

Every completion stamped with X-AI-Risk-Class (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) for the model used. eu_risk_class field on model version records. Wired at infra level.

X-AI-Risk-Class · eu_risk_class
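
In sketch form, stamping is a lookup against the model registry. The mapping values below are illustrative placeholders, not real classifications; the real class comes from eu_risk_class on the model version record:

```python
EU_RISK_CLASS = {                      # illustrative values only
    "gemini-1.5-flash":  "MINIMAL",
    "claude-3-5-sonnet": "LIMITED",
}

def risk_headers(model: str) -> dict[str, str]:
    # Assumed fallback for unregistered models; actual default may differ.
    return {"X-AI-Risk-Class": EU_RISK_CLASS.get(model, "LIMITED")}
```
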

SSRF-Protected Transport

All outbound calls (especially BYOI custom endpoints) routed through SSRFProtectedTransport. IPv4-mapped IPv6 normalized, 169.254.169.254 metadata blocked, HTTP/2 keep-alive pool.

gateway/http_client.py
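
The core check is resolve-then-validate. A sketch of the idea, not the SSRFProtectedTransport class itself:

```python
import ipaddress
import socket

def assert_safe_host(host: str) -> None:
    """Reject hosts that resolve to private, loopback, or metadata ranges."""
    for info in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(info[4][0])
        # Normalize IPv4-mapped IPv6 (::ffff:169.254.169.254) to its IPv4 form
        if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
            ip = ip.ipv4_mapped
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError(f"blocked SSRF target: {host} -> {ip}")

assert_safe_host("example.com")        # public: passes
# assert_safe_host("169.254.169.254")  # cloud metadata endpoint: raises
```
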

OpenAI SDK Compatible

Drop-in replacement — change base URL, route to any of 15+ providers. No code changes to application logic. Same chat-completions schema, same streaming, same function-call format.

One-line integration
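
Integration really is the base URL. A sketch with placeholder endpoint and key:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder: your G8KEPR endpoint
    api_key="g8k_...",                          # gateway key, not a provider key
)

resp = client.chat.completions.create(
    model="auto",  # let the routing engine pick the provider
    messages=[{"role": "user", "content": "Summarize our Q3 incident report."}],
)
print(resp.choices[0].message.content)
```
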

AI Gateway FAQs

Everything you need to know about multi-LLM routing

Calling LLM APIs directly means juggling separate keys, billing accounts, rate limits, error handling, and provider-specific code in every service. An AI Gateway is the central control plane: one OpenAI-compatible API across 15+ providers, five routing strategies (priority, cost-optimized, latency-optimized, round-robin, semantic intent), an adaptive Z-score circuit breaker, automatic Anthropic prompt caching, seven pre-prompt guardrail policies, outbound PII filtering, EU AI Act risk headers, BYOK with AES-256-GCM, and a hash-chain audit trail. None of that exists when you call provider APIs directly.

Need help setting up multi-LLM routing?

Talk to our AI Gateway experts →

Gateway Logs Generate Control Evidence For

Every completion carries the EU AI Act risk class and a full audit-log entry. Mappings are pre-built — auditors get exports, not spreadsheets.

EU AI Act: Art. 12 (Record Keeping)
EU AI Act: Art. 13 (Transparency)
SOC 2 Type II: CC7.2 (System Monitoring)
GDPR: Art. 30 (Records of Processing)
NIST AI RMF: Govern · Map · Measure · Manage

Subject to independent audit and attestation. G8KEPR provides the technical controls and evidence — your auditor issues the certification.

Start Routing in 5 Minutes

One API for Every LLM
Zero Markup on Costs

Auto prompt caching, adaptive circuit breakers, seven guardrail policies, EU AI Act headers, and BYOK with AES-256-GCM — across 15+ providers, with no per-token markup.

30-day free trial
No per-token markup
BYOK supported
Targets sub-5ms routing

No credit card required • BYOK, no markup • Cancel anytime