
Building an AI Gateway from Scratch in Rust

There’s a tax you pay every time you start a new AI-powered project.

You pull in the provider SDK, wire up the API key, write a thin wrapper around chat/completions, and move on. The project ships. Then comes the second project. And the third. After a few months, you have a half-dozen repositories all talking directly to OpenAI or Anthropic, each with its own key, its own retry logic, and its own way of not telling you what anything costs.

That was the state of my personal project portfolio. Every project — a scheduling platform, a bookmark manager, a research assistant, a speech analytics pipeline — had its own direct line to the model providers. No shared cost visibility. No unified rate limiting. No way to swap a model without touching multiple codebases.

The solution was obvious: a gateway. A single HTTP proxy that sits between all my projects and the upstream providers, speaking OpenAI’s wire format so clients don’t need to change a line of code.

What was less obvious was building it in Rust.


Why Rust?

The honest answer: I wanted to.

The slightly better answer: a gateway is a latency-sensitive, always-on proxy. Every millisecond the gateway adds is a millisecond your users wait. Rust’s zero-cost abstractions and predictable memory behavior — no garbage collection pauses, no JVM warmup — make it a natural fit for infrastructure software that needs to stay out of the hot path.

Beyond performance, the Rust type system enforces constraints that matter for billing infrastructure. There’s no float for currency anywhere in this codebase. The rust_decimal crate gives us exact decimal arithmetic, meaning $0.000127 stays $0.000127 and doesn’t silently become $0.00012699999999. That distinction matters when you’re aggregating thousands of requests into monthly invoices.

I also wanted an excuse to build something non-trivial with Axum, Tokio’s ergonomic web framework. After a few weeks of production use, I have no regrets.


Architecture Overview

The gateway is a single Axum binary that handles three concerns:

  1. Proxy — forward POST /v1/chat/completions to the right upstream provider, with fallback
  2. Cache — avoid redundant API calls with a two-tier semantic cache
  3. Billing — record every call with precise token counts and USD costs

The full request flow looks like this:

Client
  │
  ▼
POST /v1/chat/completions
  ├─ auth_middleware        Bearer token check (or open in dev mode)
  ├─ SemanticCache lookup
  │     ├─ Tier 1: exact SHA-256 match → return cached response
  │     └─ Tier 2: cosine similarity on embeddings → return if above threshold
  ├─ Provider selection     try providers in configured order, fallback on 5xx
  ├─ Forward upstream       stream or buffer response
  ├─ Cache store            fire-and-forget, never blocks the response
  └─ Return response

The billing layer is entirely optional. Without a DATABASE_URL, the gateway runs in proxy-only mode — it still forwards requests, still caches, still rate-limits, but doesn’t record costs. This lets the binary start and serve traffic in a minimal Docker container without a database dependency.

AppState

Everything the handlers need flows through a single AppState struct:

pub struct AppState {
    pub providers:          Vec<Arc<dyn LlmProvider>>,
    pub embedding_provider: Option<Arc<dyn LlmProvider>>,
    pub cache:              Arc<SemanticCache>,
    pub gateway_api_key:    Option<String>,
    pub pool:               Option<PgPool>,
    pub pricing:            Arc<PricingCache>,
}

providers is an ordered fallback chain — the gateway tries each in sequence and stops on the first success (or fast-fails on 4xx client errors). pool being Option<PgPool> is the graceful degradation story made explicit in the type system. If you call state.billing(), you get Option<(&PgPool, &PricingCache)> — None means no database, and the handler decides what to do from there.
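The post doesn't show the billing() accessor itself, but the described behavior is a small Option-zipping method. A sketch, with PgPool and PricingCache as placeholder types standing in for sqlx::PgPool and the real pricing cache:

```rust
use std::sync::Arc;

// Placeholder types for illustration; the real AppState holds
// sqlx::PgPool and the gateway's PricingCache.
pub struct PgPool;
pub struct PricingCache;

pub struct AppState {
    pub pool: Option<PgPool>,
    pub pricing: Arc<PricingCache>,
}

impl AppState {
    /// Some(..) only when a database is configured; handlers branch on this.
    pub fn billing(&self) -> Option<(&PgPool, &PricingCache)> {
        self.pool.as_ref().map(|pool| (pool, self.pricing.as_ref()))
    }
}
```

A handler can then write `if let Some((pool, pricing)) = state.billing() { ... }` and fall through to proxy-only behavior otherwise.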


Provider Abstraction

Every upstream provider implements the same trait:

#[async_trait]
pub trait LlmProvider: Send + Sync {
    fn name(&self) -> &str;
    async fn chat_completion(&self, request: &ChatRequest) -> Result<ChatResponse, GatewayError>;
    async fn stream_completion(&self, request: &ChatRequest) -> Result<BoxStream<...>, GatewayError>;
    async fn embed(&self, text: &str) -> Result<Vec<f32>, GatewayError>;
}

OpenAI and Anthropic are implemented today. The provider list is built at startup from environment variables — if OPENAI_API_KEY is set, an OpenAI provider is registered; same for ANTHROPIC_API_KEY. The order in the Vec determines the fallback priority.

The Anthropic adapter translates the OpenAI-compatible request format into Anthropic’s API format and back. Clients never know which provider they hit. This is the whole point: adding Google Gemini or AWS Bedrock later means writing a new struct that implements LlmProvider, not touching any client code.
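The fallback loop itself isn't shown in the post, but the described behavior — try each provider in order, stop on the first success, fast-fail on client errors — reduces to a few lines. A synchronous sketch with invented Provider and CallError types (the real trait is async and returns GatewayError):

```rust
// Invented minimal types for illustration.
#[derive(Debug)]
pub enum CallError {
    Client(u16),   // 4xx: the request itself is bad, retrying won't help
    Upstream(u16), // 5xx: this provider is unhealthy, try the next one
}

pub trait Provider {
    fn name(&self) -> &str;
    fn call(&self, prompt: &str) -> Result<String, CallError>;
}

/// Try providers in configured order: stop on first success,
/// fast-fail on client errors, fall through to the next on upstream errors.
pub fn call_with_fallback(
    providers: &[Box<dyn Provider>],
    prompt: &str,
) -> Result<String, CallError> {
    let mut last = CallError::Upstream(503);
    for p in providers {
        match p.call(prompt) {
            Ok(resp) => return Ok(resp),
            Err(CallError::Client(code)) => return Err(CallError::Client(code)),
            Err(e) => last = e, // 5xx: continue down the chain
        }
    }
    Err(last)
}
```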


The Two-Tier Semantic Cache

This is the part I’m most pleased with.

The cache sits in front of every provider call. It operates in two tiers:

Tier 1: Exact Match

A fingerprint is computed from the request’s key parameters: model, messages, max_tokens, temperature, top_p, and stop sequences. The fingerprint is the SHA-256 hash of their stable JSON serialization, stored as a hex string.

pub fn cache_fingerprint(&self) -> String {
    let repr = serde_json::json!({
        "model": self.model,
        "messages": self.messages,
        "max_tokens": self.max_tokens,
        "temperature": self.temperature,
        "top_p": self.top_p,
        "stop": self.stop,
    });
    let hash = Sha256::digest(repr.to_string().as_bytes());
    hex::encode(hash)
}

Lookups hit a DashMap — a sharded concurrent hash map — giving fast O(1) reads with minimal contention under concurrent requests.

Tier 2: Semantic Match

When there’s no exact match, the cache computes an embedding of the request’s “semantic fingerprint” — a prose representation of the messages — using the configured embedding provider. It then walks the cache looking for a stored entry whose embedding cosine-similarity exceeds the configured threshold (default: 0.95).

similarity = (A · B) / (|A| × |B|)

If "What is the capital of France?" and "Tell me the capital city of France" both resolve to near-identical embeddings, the second request never touches the upstream API. At 0.95 similarity, false positives are rare. The threshold is configurable — tighten it to 0.99 for factual precision, loosen it to 0.90 for creative tasks where variation in semantically-equivalent prompts is fine.
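The similarity computation is a few lines of Rust. A sketch of the formula above over f32 embedding vectors:

```rust
/// Cosine similarity: (A · B) / (|A| × |B|).
/// Returns 0.0 for a zero-length vector to avoid division by zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have equal dimensions");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

The Tier 2 check is then just `cosine_similarity(&query_emb, &stored_emb) >= threshold` over the stored entries.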

Why this matters at scale: LLM API calls are expensive and slow. For any system with a predictable query distribution (a customer support bot, a code assistant with common patterns, a summarization pipeline), a meaningful fraction of calls are semantically redundant. The cache pays for itself quickly.

LRU Eviction

The cache enforces both capacity limits (default: 10,000 entries) and TTL (default: 1 hour). On capacity overflow, it evicts the least recently used entry. Cache stores are fire-and-forget — they run on a spawned task so they never add to response latency.
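A minimal illustration of the eviction policy — a hashmap plus a logical clock, std-only. The real cache wraps this in DashMap-style concurrency and adds the TTL check:

```rust
use std::collections::HashMap;

/// Toy LRU: each access bumps a logical clock; on overflow, the entry
/// with the smallest last-access tick is evicted.
pub struct LruCache<V> {
    capacity: usize,
    tick: u64,
    entries: HashMap<String, (V, u64)>, // value, last-access tick
}

impl<V> LruCache<V> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    pub fn get(&mut self, key: &str) -> Option<&V> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(key).map(|(v, t)| { *t = tick; &*v })
    }

    pub fn insert(&mut self, key: String, value: V) {
        self.tick += 1;
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&key) {
            // Evict the least recently used entry (a real LRU keeps an
            // ordered list instead of scanning).
            let lru = self.entries.iter().min_by_key(|(_, (_, t))| *t).map(|(k, _)| k.clone());
            if let Some(k) = lru {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (value, self.tick));
    }
}
```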


Billing and Multi-Tenancy

The billing layer is built for a SaaS topology: multiple organizations, each with their own users and virtual API keys, all sharing the same gateway infrastructure.

The schema has six tables:

Table            Purpose
organizations    Top-level tenants with a slug identifier
users            Per-org users with an external_id for mapping to your auth system
api_keys         Virtual keys with optional budget caps and expiry
model_pricing    $/1M token rates for 18+ models, hot-reloaded every 5 minutes
llm_requests     Immutable audit log — one row per completed API call
usage_daily      Pre-aggregated daily rollups for dashboard queries

Virtual API keys are hashed with SHA-256 before storage — the raw key is only returned once at creation time, never stored. A client presents their virtual key to the gateway; the gateway hashes it, looks up the corresponding organization, checks the budget, then forwards the request under the gateway’s own upstream API key. The client never needs direct provider credentials.

The llm_requests table is the core of cost accounting:

CREATE TABLE llm_requests (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id           UUID NOT NULL REFERENCES organizations(id),
    user_id          UUID REFERENCES users(id),
    model            TEXT NOT NULL,
    provider         TEXT NOT NULL,
    prompt_tokens    INTEGER NOT NULL DEFAULT 0,
    completion_tokens INTEGER NOT NULL DEFAULT 0,
    total_tokens     INTEGER GENERATED ALWAYS AS (prompt_tokens + completion_tokens) STORED,
    input_cost_usd   NUMERIC(12,6) NOT NULL DEFAULT 0,
    output_cost_usd  NUMERIC(12,6) NOT NULL DEFAULT 0,
    total_cost_usd   NUMERIC(12,6) GENERATED ALWAYS AS (input_cost_usd + output_cost_usd) STORED,
    latency_ms       INTEGER,
    cache_hit        BOOLEAN NOT NULL DEFAULT FALSE,
    is_streaming     BOOLEAN NOT NULL DEFAULT FALSE,
    error            TEXT,
    tags             JSONB NOT NULL DEFAULT '{}',
    created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

total_cost_usd is a generated column — the database computes it from the input and output costs, so there’s no risk of the application writing an inconsistent total. NUMERIC(12,6) gives six decimal places, enough precision for sub-cent billing at high request volumes.

Pricing Hot-Reload

Model prices change. The gateway handles this with a PricingCache — a HashMap behind an Arc<RwLock<>> — seeded with hardcoded defaults at startup and refreshed from the database every five minutes in a background task. Handlers never wait on a lock more than a few microseconds. If the database is unavailable during a refresh, the old prices remain in memory and the gateway keeps running.

Hardcoded fallbacks ensure the gateway can compute costs even before the first database connection succeeds:

fn default_prices() -> HashMap<String, (Decimal, Decimal)> {
    let entries = [
        ("openai/gpt-4o",            "2.500000", "10.000000"),
        ("openai/gpt-4o-mini",       "0.150000",  "0.600000"),
        ("anthropic/claude-opus-4-5", "15.000000", "75.000000"),
        // ... 15 more entries
    ];
    // build HashMap<"provider/model", (input_per_mtok, output_per_mtok)>
}

The Admin Console

The gateway ships with a React admin console served as a static SPA at /admin. It covers the full operational surface:

  • Dashboard — daily spend charts, per-model cost breakdown, token volume, top spenders
  • Organizations — create and manage tenants
  • Users — per-org user management
  • API Keys — issue virtual keys with optional budget caps and expiry dates
  • Pricing — view the current model pricing table loaded from the database

The UI is built with Vite, talks to the same Axum backend via the /v1/admin/* routes, and is bundled into the Docker image at build time. No separate frontend deployment needed.

Authentication is straightforward: a GATEWAY_API_KEY environment variable, passed as a Bearer token. In open/dev mode (no key configured), all requests pass through. The /health and /cache/stats endpoints bypass auth entirely for load balancer health checks.
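The decision logic boils down to a pure function. A sketch — header extraction and the exempt-path list live in the real Axum middleware:

```rust
/// Returns true when the request may proceed.
/// - No configured key (open/dev mode): everything passes.
/// - Configured key: the Authorization header must be "Bearer <key>".
pub fn authorize(configured_key: Option<&str>, auth_header: Option<&str>) -> bool {
    match configured_key {
        None => true, // open/dev mode
        Some(expected) => auth_header
            .and_then(|h| h.strip_prefix("Bearer "))
            .map(|token| token == expected)
            .unwrap_or(false),
    }
}
```

A production version would compare tokens in constant time rather than with `==`, to avoid timing side channels.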


Key Design Decisions

No compile-time query macros

sqlx offers query! macros that verify SQL at compile time against a live database. That’s a great feature in a codebase where you want to catch schema mismatches during CI. I chose query_as runtime functions instead.

Why: the gateway needs to compile without a DATABASE_URL in the environment — CI, local dev without Docker, and the Docker build stage itself all run without a database. The compile-time macros require DATABASE_URL at build time or cached query metadata checked into the repo (sqlx offline mode). Both add friction. Runtime queries eliminate the dependency entirely, at the cost of pushing schema errors from compile time to runtime. For a small, well-tested schema, that tradeoff is worth it.

rust_decimal everywhere money appears

Floating-point arithmetic is not appropriate for currency. This is not a controversial claim. But it’s surprisingly easy to forget when f64 is the path of least resistance.

Every monetary value in this codebase — request costs, budget limits, aggregated totals — is Decimal from the rust_decimal crate, stored as NUMERIC in PostgreSQL. The type system makes it impossible to accidentally mix f64 and currency.
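The failure mode is easy to demonstrate. Summing a per-request cost like $0.000127 in f64 drifts away from the true total, while exact arithmetic does not — sketched here with integer micro-dollars, the role rust_decimal::Decimal plays in the real codebase:

```rust
/// Sum `n` copies of $0.000127 in f64. The value is not exactly
/// representable in binary floating point, so the total drifts.
pub fn float_total(n: u64) -> f64 {
    (0..n).map(|_| 0.000127f64).sum()
}

/// The same total in integer micro-dollars (1e-6 USD): exact.
pub fn micro_total(n: u64) -> u64 {
    127 * n
}
```

A million requests at $0.000127 should total exactly $127.000000; the f64 sum lands near, but not on, that value.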

Streaming without buffering

LLM streaming responses are server-sent events (SSE): a long-lived HTTP connection over which the model streams tokens as it generates them. Buffering the full response before sending would defeat the purpose of streaming — users would wait for the complete generation before seeing anything.

The gateway pipes SSE chunks directly from the upstream response to the client using axum::response::sse::Sse and async streams. No stage of the chain — upstream stream → transform → axum SSE — holds the full response in memory. Provider fallback for streaming requests works differently from non-streaming: since you can’t retry a partial response, the gateway tries providers in order before opening the stream, not mid-stream.

Multi-stage Docker build

The Dockerfile has three stages:

  1. Node.js builder — builds the React admin console (npm run build)
  2. Rust builder — builds the release binary with LTO and stripped symbols
  3. Debian slim runtime — copies binary + static files, runs as a non-root user

The dependency cache layer is a standard Rust Docker optimization: copy Cargo.toml and Cargo.lock, build a dummy binary to cache all dependencies, then copy the real source and build only what changed. On a cold build, the release binary takes a few minutes. On a warm cache, incremental rebuilds are fast.


What’s Running in Production

The gateway is deployed on a single EC2 t3.micro instance at ai-gateway.ai-assisted.dev. PostgreSQL runs on the same instance — a deliberate choice to keep the operational footprint minimal at this stage. The previous architecture used RDS, but a managed database for a personal project is difficult to justify at $15-30/month when a local PostgreSQL process on an already-running instance costs nothing.

All projects that make LLM calls — the scheduling platform, the bookmark manager, the research assistant — route through this gateway. Costs are tracked per-organization. Rate limiting protects the upstream API keys from runaway processes. The semantic cache reduces repeated calls in development workflows where similar prompts recur frequently.


Trade-offs and Limitations

No authentication for individual LLM requests in proxy-only mode. Without DATABASE_URL, the gateway has no concept of virtual API keys or per-org tracking. Anyone who knows the GATEWAY_API_KEY can make unlimited requests. This is acceptable for a personal deployment where the only clients are your own projects.

Semantic cache accuracy depends on embedding quality. A 0.95 cosine similarity threshold is high, but “high similarity” in embedding space doesn’t always mean “semantically equivalent for your use case.” Factual prompts with subtle but meaningful differences can theoretically share a cache entry. The threshold is configurable; safety-critical workloads should use exact matching only.

Single-binary architecture doesn’t scale horizontally as-is. The semantic cache is in-process memory. Multiple instances would maintain independent caches with no shared state. Replacing DashMap with Redis would solve this, at the cost of an additional infrastructure dependency.

Streaming responses are not cached. The cache only operates on buffered (non-streaming) responses. Streaming calls always hit the upstream provider.


What’s Next

The current implementation is a solid foundation. The interesting directions from here:

Budget enforcement. The schema already has budget_usd on API keys and organizations. The missing piece is the enforcement middleware that rejects requests when a key has exceeded its budget. The data is there; the check isn’t wired in yet.

More providers. Adding Google Gemini, AWS Bedrock, and Mistral means implementing LlmProvider for each. The pricing table already has entries for their models. The abstraction is ready.

Redis cache backend. Swapping the in-process DashMap for a Redis backend would make the cache durable across restarts and shareable across multiple gateway instances.

Policy-as-code. Rate limiting currently applies uniformly. Per-org or per-key rate limits — expressed as rules evaluated per-request — would give finer control for multi-tenant deployments.

Request routing rules. Route gpt-4o requests to Claude Sonnet when cost exceeds a threshold, or route specific models to specific providers regardless of the fallback chain. This turns the gateway into a full traffic manager.

Webhook notifications. Alert when an org crosses a spend threshold, when a key expires, or when the error rate spikes. The audit log has all the data; it just needs something listening to it.


The Code

The full source is at ai-gateway.ai-assisted.dev. The stack is Rust + Axum + PostgreSQL + React, deployed on a single EC2 instance, built and shipped as a Docker image.

If you’re running multiple AI-powered projects and find yourself copy-pasting SDKs and losing track of costs, building a gateway — or running one — is worth the investment. The OpenAI wire format compatibility means your existing clients don’t change at all. You just update the base URL.


Further Reading