15+ Providers · 5 Routing Strategies · Adaptive Circuit Breakers

One API for
Every LLM Provider

Route, cache, guardrail, and audit every LLM call — across 15+ providers

Most gateways route. G8KEPR also auto-injects Anthropic prompt caching (38% cost cut at 2 calls, 88% at 10), adapts circuit breakers with Z-score baselines instead of static thresholds, and stamps an EU AI Act risk class on every completion. BYOK with AES-256-GCM — your keys, your provider rates, your audit trail.

Auto Prompt Caching
Z-Score Circuit Breaker
7 Guardrail Policies
EU AI Act Headers
BYOK · No Markup

15+ providers, one API. Add a custom endpoint and we route through that too:

Anthropic · OpenAI · Google · Azure OpenAI · AWS Bedrock · Groq · Cohere · Mistral · DeepSeek · Fireworks · Together · xAI · Cerebras · Ollama · BYOI Custom
15+ Providers (incl. BYOI)
5 Routing modes: priority · cost · latency · round-robin · semantic
7 Guardrail policies, evaluated pre-prompt
8 PII categories: redact · block · warn
88 API routes in the gateway sub-router
88% auto-cache savings on system tokens at 10 calls
16,384 hard token cap (cost-amplification block)
Z>3σ adaptive breaker (beats static Hystrix thresholds)

What is an AI Gateway?

The missing infrastructure layer for production LLM applications

The Multi-LLM Challenge

Production AI apps need multiple LLM providers for reliability, cost optimization, and feature coverage. But managing them means juggling separate keys, billing accounts, rate limits, and error handling scattered across every service that calls an LLM.

API Key Management
Separate keys for OpenAI, Anthropic, Google, Azure, Bedrock...
Cost Tracking
Different pricing per model — Claude $3/M, GPT-4 $30/M, Gemini $0.50/M
Rate Limit Handling
Each provider has different limits — OpenAI 10k RPM, Anthropic 50 RPM
Failover Logic
Manual retry logic when OpenAI is down — switch to Claude or Gemini

The Traditional Approach

Most teams hard-code provider-specific logic across services. Every switch is a code change, a redeploy, and a test pass. Cost tracking is manual. When a provider degrades, you find out from your users — not your monitoring.

Vendor Lock-In
Switching from OpenAI to Claude requires changing every API call
No Failover
If OpenAI goes down, your entire app is down
Hidden Costs
No visibility into which models/users cost the most
Scattered Logic
Provider-specific code duplicated across services

How G8KEPR AI Gateway Works

One unified API that intelligently routes to the best LLM provider

Intelligent LLM Routing
1. Your Application
client.chat.completions.create(model="auto")

Single API call with model="auto" - no provider-specific code

2. G8KEPR Routing Engine
Selects optimal provider
Check Costs: Claude $3/M, GPT-4 $30/M, Gemini $0.50/M
Measure Latency: Claude 145ms, GPT-4 180ms
Health Check: Is provider available?
Rate Limits: Has quota remaining?
Claude 3.5 · ✓ Selected · $3/M • 145ms
GPT-4 · Standby · $30/M • 180ms
Gemini · Standby · $0.50/M • 200ms

✓ Routed to Claude • Cost-optimized • 145ms response • Failover to GPT-4 if unavailable

Priority

Ordered provider list — first healthy provider wins. Predictable routing with deterministic fallback order.

Claude → GPT-4 → Gemini

Cost-Optimized

Cheapest provider that meets the requested model capability. Gemini Flash ($0.50/M) for simple tasks, Claude ($3/M) for reasoning.

Save up to 92%

Latency-Optimized

Lowest p95 latency from historical tracking. Real-time metrics, not stale benchmarks. Critical for user-facing chat.

Sub-200ms p95

Round-Robin

Distribute load evenly across healthy providers. Sidesteps single-provider rate-limit ceilings during traffic bursts.

No 429s under burst

Semantic Intent

Classify the prompt (CODING, CREATIVE, ANALYSIS, GENERAL) and route to the model that excels at that intent.

Best-fit model per call
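
To make the strategies concrete, here is a minimal sketch of the idea behind semantic-intent routing. The keyword heuristic and model names are illustrative placeholders only; a production classifier would be embedding- or model-based:

```python
# Illustrative sketch of semantic-intent routing; not G8KEPR's classifier.
INTENT_KEYWORDS = {
    "CODING":   ("function", "bug", "refactor", "stack trace", "compile"),
    "CREATIVE": ("story", "poem", "slogan", "brainstorm"),
    "ANALYSIS": ("compare", "summarize", "evaluate", "trade-off"),
}

INTENT_MODEL = {                     # hypothetical best-fit mapping
    "CODING":   "claude-3-5-sonnet",
    "CREATIVE": "gpt-4-turbo",
    "ANALYSIS": "gemini-1.5-pro",
    "GENERAL":  "gemini-1.5-flash",  # cheap default for everything else
}

def classify_intent(prompt: str) -> str:
    text = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "GENERAL"

def route(prompt: str) -> str:
    return INTENT_MODEL[classify_intent(prompt)]

print(route("Refactor this function to fix the N+1 query bug"))  # claude-3-5-sonnet
```
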
Generic Multi-LLM Proxies Don't Do These

What G8KEPR Adds That LangChain Routers Don't

Five capabilities that exist in the platform — not in OpenAI's SDK, not in LangChain, not in a one-file proxy you wrote on a Friday afternoon.

Auto Prompt Caching

Tracks system-prompt hashes per org. After two identical prompts, auto-injects Anthropic's cache_control: ephemeral. Zero user code changes.

SHA-256 fingerprint · gateway/cache_optimizer.py
OpenAI SDK / LangChain do not auto-cache.

Adaptive Z-Score Breaker

Statistical baselines per provider per hour-of-day. Trips when failure rate > mean + 3σ. Progressive recovery 10/25/50/100%.

4 windows · 3σ · gateway/router.py
Hystrix and Resilience4j use static thresholds.

7 Guardrail Policies

Toxicity, bias, topic-block, PII, prompt-injection, regex, rate-limit. Block / redact / warn / log per policy. Every violation logged.

pre-prompt · ai_guardrail_policies
Generic gateways forward unfiltered.

EU AI Act Headers

Every completion stamped with the EU AI Act risk class for the model used (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) — wired at the gateway, not bolted on.

every response · X-AI-Risk-Class
No SDK ships this header.

BYOK with AES-256-GCM

Keys encrypted at rest, decrypted only into process memory. Never written to Redis or logs. Per-key monthly cost limits enforced.

process-local only · EncryptionService
Most BYOK is plaintext-in-config.
Anthropic Prompt Caching · Automatic

The Cheapest Token
Is The One You Don't Resend

System prompts are usually 80%+ of the token cost on every call — and they're identical every time. G8KEPR fingerprints them with SHA-256, and after observing two identical prompts, automatically injects Anthropic's cache_control: ephemeral directive. No SDK changes, no flag flipping — it just happens.

Tracks up to 100 unique system-prompt fingerprints per org
Cache write: 1.25× input price · Cache read: 0.10× input price
Hit rate and cumulative token savings tracked per org in real time
Works with any Anthropic-routed call — no app code changes
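
In sketch form, the fingerprint-then-inject pattern looks like this. The names and the in-memory counter are simplifications for illustration; the production logic lives in gateway/cache_optimizer.py:

```python
import hashlib
from collections import Counter

# Hypothetical in-memory store; the real tracker keeps up to 100
# fingerprints per org with hit-rate and savings accounting.
seen: Counter = Counter()

def prepare_anthropic_payload(org_id: str, system_prompt: str, messages: list) -> dict:
    fp = hashlib.sha256(system_prompt.encode()).hexdigest()
    seen[(org_id, fp)] += 1

    system_block = {"type": "text", "text": system_prompt}
    if seen[(org_id, fp)] >= 2:
        # Prompt observed twice: mark the block cacheable from here on.
        # Cache writes bill at 1.25x input price, reads at 0.10x.
        system_block["cache_control"] = {"type": "ephemeral"}

    return {"system": [system_block], "messages": messages, "max_tokens": 1024}
```
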
Auto-Cache Savings Curve (System Tokens)
1 call: baseline · 2 calls: -38% · 5 calls: -70% · 10 calls: -88% · 100 calls: -90%
Break-even at 2 calls · Steady-state: 88% off
Beats Netflix Hystrix & Resilience4j

Adaptive Circuit Breaker

Static thresholds break when traffic patterns shift. G8KEPR's breaker uses statistical baselines per provider, per hour — it knows that a 2% failure rate is normal at 3 a.m. and abnormal at 3 p.m.

4 Time Windows
1m · 5m · 15m · 1hr
Rolling windows track success/failure rates across four scales — short bursts and long-running degradation both surface.
Z-Score > 3σ
mean + 3 × stddev
Trips when observed failure rate exceeds three standard deviations from the hour-specific baseline. Anomalies classified as spike, degradation, or sustained.
Progressive Recovery
10 → 25 → 50 → 100%
Doesn't flip back to 100% on first success. Ramps traffic gradually so a flapping provider can't re-cause an outage.
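
The trip condition itself is compact. A simplified sketch, assuming a history of failure rates for the current hour-of-day (the production breaker in gateway/router.py adds the four rolling windows and progressive recovery):

```python
from statistics import mean, stdev

def should_trip(history: list[float], observed_rate: float, z: float = 3.0) -> bool:
    """history: past failure rates for this provider at this hour-of-day."""
    if len(history) < 2:
        return False  # not enough data for a baseline; stay closed
    mu, sigma = mean(history), stdev(history)
    # Trip when the observed failure rate exceeds mean + 3 standard deviations.
    return observed_rate > mu + z * sigma

night = [0.018, 0.022, 0.020, 0.019]  # 3 a.m. baseline: ~2% failures is normal
peak  = [0.002, 0.003, 0.002, 0.004]  # 3 p.m. baseline: near-zero failures

print(should_trip(night, 0.02))  # False: normal for this hour
print(should_trip(peak, 0.02))   # True: a spike relative to the peak baseline
```
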
Why Static Thresholds Fail
Hystrix / Resilience4j

Trip at 50% failure rate. Doesn't know that this provider runs hot at 3 a.m. anyway. False trips during normal off-peak. Misses gradual degradation that stays under the threshold.

G8KEPR Adaptive

Hour-specific baseline. 2% failures at 3 a.m. is normal — won't trip. 2% failures at peak is a spike — trips immediately. Catches what threshold breakers miss.

Real Cost Savings Examples

How G8KEPR customers save 60-90% on LLM costs with intelligent routing

Customer Support Chatbot

-92%
BEFORE (Single Provider)
Provider: GPT-4 Turbo (only)
Cost: $2,400/month
Volume: 100k/month
AFTER (G8KEPR Routing)
Strategy: Gemini Flash + Claude fallback
Cost: $180/month
Saved: $2,220/mo

Code Generation Tool

-75%
BEFORE (Single Provider)
Provider: GPT-4 (only)
Cost: $1,800/month
Volume: 50k/month
AFTER (G8KEPR Routing)
Strategy: Claude 3.5 Sonnet + Gemini fallback
Cost: $450/month
Saved: $1,350/mo

Document Analysis Pipeline

-80%
BEFORE (Single Provider)
Provider: Claude Opus (only)
Cost: $3,000/month
Volume: 200k docs
AFTER (G8KEPR Routing)
Strategy: Gemini Pro + Claude for complex docs
Cost: $600/month
Saved: $2,400/mo

AI Assistant SaaS

-76%
BEFORE (Single Provider)
Provider: Mix of providers (manual switching)
Cost: $5,000/month
Volume: 500k/month
AFTER (G8KEPR Routing)
Strategy: G8KEPR auto-routing (cost priority)
Cost: $1,200/month
Saved: $3,800/mo

Unified Cost Tracking

See exactly what you're spending across all LLM providers in one dashboard

Monthly Cost Breakdown

Last 30 days
Provider    Model          Requests   Tokens   Rate          Cost
Claude      3.5 Sonnet     12,456     2.3M     $3/M          $6.90
OpenAI      GPT-4 Turbo    1,234      0.8M     $30/M         $24.00
Google      Gemini Flash   45,123     8.2M     $0.50/M       $4.10
Total                      58,813     11.3M    $3.10/M avg   $35.00
92% savings vs GPT-4 only ($450/month)
Track your costs →

AI Gateway Features

Everything you need to manage multi-LLM applications in production

BYOK · AES-256-GCM

Bring your own keys for any provider. Encrypted at rest, decrypted only into process memory — never written to Redis or logs. Per-key monthly cost limits enforced.

EncryptionService · ai_gateway_keys
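
For reference, encrypt-at-rest with AES-256-GCM looks roughly like this. This is a sketch built on the cryptography package, not G8KEPR's EncryptionService; in production the master key comes from a KMS, never from code:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

master_key = AESGCM.generate_key(bit_length=256)  # sketch only; load from a KMS

def encrypt_provider_key(plaintext_key: str, org_id: str) -> bytes:
    aes = AESGCM(master_key)
    nonce = os.urandom(12)  # 96-bit nonce, unique per encryption
    # org_id as associated data: the blob only decrypts for the right org
    ct = aes.encrypt(nonce, plaintext_key.encode(), org_id.encode())
    return nonce + ct       # store the nonce alongside the ciphertext

def decrypt_provider_key(blob: bytes, org_id: str) -> str:
    aes = AESGCM(master_key)
    # The decrypted value lives only in process memory; never log or cache it.
    return aes.decrypt(blob[:12], blob[12:], org_id.encode()).decode()

blob = encrypt_provider_key("sk-ant-...", "org_123")
assert decrypt_provider_key(blob, "org_123") == "sk-ant-..."
```
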

Provider Failover Chain

Configure ordered fallback chains: Claude → GPT-4 → Gemini. Automatic re-route on 5xx, 429, latency-SLO miss, or health-check failure. Health tracked per provider per hour.

Per-provider exponential backoff
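
A minimal sketch of the chain logic, assuming a hypothetical call_provider helper (the real chain also honors latency SLOs and per-provider health state):

```python
import time

PROVIDERS = ["claude", "gpt-4", "gemini"]  # ordered fallback chain

class ProviderError(Exception):
    def __init__(self, status: int):
        self.status = status

def complete_with_failover(prompt: str, call_provider) -> str:
    for provider in PROVIDERS:
        for attempt in range(3):
            try:
                return call_provider(provider, prompt)
            except ProviderError as e:
                if e.status != 429 and e.status < 500:
                    raise  # other 4xx: caller error, don't fail over
                time.sleep(2 ** attempt)  # per-provider exponential backoff
        # retries exhausted: fall through to the next provider in the chain
    raise RuntimeError("all providers failed")
```
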

Cost Tracking + Anomaly Alerts

Tag requests with user_id, team_id, project_id. Per-org / per-user / per-provider / per-model rollup. Z-score anomaly detection fires Slack and email when spend spikes.

gateway_usage_logs · cost_budgets

7 AI Guardrail Policies

Toxicity, bias, topic-block, PII, prompt-injection, regex, and rate-limit policies — evaluated on the prompt before it leaves the gateway. Block / redact / warn / log per policy.

ai_guardrail_policies · ai_guardrail_violations
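
Conceptually, each policy returns a verdict and the gateway applies the configured action before the prompt leaves. A simplified sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    policy: str
    triggered: bool
    action: str                 # "block" | "redact" | "warn" | "log"
    redacted: str | None = None

def log_violation(v: Verdict) -> None:
    print(f"guardrail violation: {v.policy} -> {v.action}")  # stub logger

def apply_guardrails(prompt: str, policies) -> str:
    for policy in policies:      # toxicity, bias, topic-block, PII, ...
        v: Verdict = policy(prompt)
        if not v.triggered:
            continue
        log_violation(v)                       # every violation is logged
        if v.action == "block":
            raise PermissionError(f"blocked by {v.policy}")
        if v.action == "redact" and v.redacted is not None:
            prompt = v.redacted                # continue with the scrubbed prompt
        # "warn" and "log" let the prompt through after recording the hit
    return prompt
```
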

PII Filter — 8 Categories

Outbound prompts scanned for emails, financial IDs, identity docs, contact info, network IDs, location, credentials, and health data. Type-safe placeholder redaction or hard block.

pii_filters · per-org rules
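
The redaction step is typed pattern matching. A sketch covering two of the eight categories, with deliberately simplified patterns:

```python
import re

PII_PATTERNS = {
    "EMAIL":       re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_pii(prompt: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        # Type-safe placeholder: the model still sees what kind of value was there
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(redact_pii("Refund jane@example.com, card 4111 1111 1111 1111"))
# -> "Refund [EMAIL], card [CREDIT_CARD]"
```
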

Hard max_tokens Cap

LLM_HARD_MAX_TOKENS=16384 is enforced on every provider, regardless of user input. A caller can't pass max_tokens=999999 to drain the budget in one shot.

Cost-amplification attack defense
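
The enforcement reduces to a clamp applied before any provider call (a sketch; variable names assumed):

```python
import os

HARD_MAX_TOKENS = int(os.getenv("LLM_HARD_MAX_TOKENS", "16384"))

def clamp_max_tokens(requested: int | None, default: int = 1024) -> int:
    # Whatever the caller sends (even max_tokens=999999), the cap wins.
    return min(requested or default, HARD_MAX_TOKENS)

assert clamp_max_tokens(999_999) == 16_384
```
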

EU AI Act Risk Headers

Every completion stamped with X-AI-Risk-Class (MINIMAL / LIMITED / HIGH / UNACCEPTABLE) for the model used. eu_risk_class field on model version records. Wired at infra level.

X-AI-Risk-Class · eu_risk_class
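
In sketch form, stamping is a lookup against the model registry. The mapping values below are illustrative placeholders, not real classifications; the real class comes from eu_risk_class on the model version record:

```python
EU_RISK_CLASS = {                      # illustrative values only
    "gemini-1.5-flash":  "MINIMAL",
    "claude-3-5-sonnet": "LIMITED",
}

def risk_headers(model: str) -> dict[str, str]:
    # Assumed fallback for unregistered models; actual default may differ.
    return {"X-AI-Risk-Class": EU_RISK_CLASS.get(model, "LIMITED")}
```
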

SSRF-Protected Transport

All outbound calls (especially BYOI custom endpoints) routed through SSRFProtectedTransport. IPv4-mapped IPv6 normalized, 169.254.169.254 metadata blocked, HTTP/2 keep-alive pool.

gateway/http_client.py
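
The core check is resolve-then-validate. A sketch of the idea, not the SSRFProtectedTransport class itself:

```python
import ipaddress
import socket

def assert_safe_host(host: str) -> None:
    """Reject hosts that resolve to private, loopback, or metadata ranges."""
    for info in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(info[4][0])
        # Normalize IPv4-mapped IPv6 (::ffff:169.254.169.254) to its IPv4 form
        if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
            ip = ip.ipv4_mapped
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError(f"blocked SSRF target: {host} -> {ip}")

assert_safe_host("example.com")        # public: passes
# assert_safe_host("169.254.169.254")  # cloud metadata endpoint: raises
```
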

OpenAI SDK Compatible

Drop-in replacement — change base URL, route to any of 15+ providers. No code changes to application logic. Same chat-completions schema, same streaming, same function-call format.

One-line integration
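
Integration really is the base URL. A sketch with placeholder endpoint and key:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder: your G8KEPR endpoint
    api_key="g8k_...",                          # gateway key, not a provider key
)

resp = client.chat.completions.create(
    model="auto",  # let the routing engine pick the provider
    messages=[{"role": "user", "content": "Summarize our Q3 incident report."}],
)
print(resp.choices[0].message.content)
```
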

AI Gateway FAQs

Everything you need to know about multi-LLM routing

Calling LLM APIs directly means juggling separate keys, billing accounts, rate limits, error handling, and provider-specific code in every service. An AI Gateway is the central control plane: one OpenAI-compatible API across 15+ providers, five routing strategies (priority, cost-optimized, latency-optimized, round-robin, semantic intent), an adaptive Z-score circuit breaker, automatic Anthropic prompt caching, seven pre-prompt guardrail policies, outbound PII filtering, EU AI Act risk headers, BYOK with AES-256-GCM, and a hash-chain audit trail. None of that exists when you call provider APIs directly.

Need help setting up multi-LLM routing?

Talk to our AI Gateway experts →

Gateway Logs Generate Control Evidence For

Every completion carries the EU AI Act risk class and a full audit-log entry. Mappings are pre-built — auditors get exports, not spreadsheets.

EU AI Act: Art. 12 (Record Keeping)
EU AI Act: Art. 13 (Transparency)
SOC 2 Type II: CC7.2 (System Monitoring)
GDPR: Art. 30 (Records of Processing)
NIST AI RMF: Govern · Map · Measure · Manage

Subject to independent audit and attestation. G8KEPR provides the technical controls and evidence — your auditor issues the certification.

Start Routing in 5 Minutes

One API for Every LLM
Zero Markup on Costs

Auto prompt caching, adaptive circuit breakers, seven guardrail policies, EU AI Act headers, and BYOK with AES-256-GCM — across 15+ providers, with no per-token markup.

30-day free trial
No per-token markup
BYOK supported
Targets sub-5ms routing

No credit card required • BYOK, no markup • Cancel anytime