API Reference

gateway.fast routes your requests to the best available frontier model — automatically, based on workload type, capability tier, and live utilisation. One endpoint for the active gateway.fast model catalog.

Base URL: https://api.gateway.fast/v1

Overview

gateway.fast exposes inference endpoints under https://api.gateway.fast/v1. Requests are routed in real time to the best available model. You can influence routing via request headers — or let gateway.fast decide automatically.

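Two routing modes are available: auto (the default), in which gateway.fast selects the model, and direct, in which you name the target model with the x-model header.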

Some models are TEE-enabled. The model catalog shows which models support TEE.

Authentication

All requests require a Bearer token in the Authorization header. API keys are issued after purchase at gateway.fast/pricing and follow the format sk-gw-….

http
Authorization: Bearer sk-gw-<your-api-key>

Keep your key safe. API keys grant full access to your credit balance. If a key is compromised, rotate it from your dashboard.

Quick start

Send your first request in under a minute:

curl
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Summarise the key steps of a RAG pipeline." }
    ],
    "max_tokens": 1024
  }'
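
The same request can be put together in Python. The sketch below uses only the standard library; build_request and send are illustrative helper names, not part of any gateway.fast SDK. Because the response body schema is not described here, send returns the decoded JSON as-is.

```python
import json
import urllib.request

BASE_URL = "https://api.gateway.fast/v1"


def build_request(prompt, api_key, max_tokens=1024):
    """Build (but do not send) a POST /v1/messages request."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request; return (response headers, decoded JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return dict(resp.headers), json.load(resp)
```

Calling send(build_request("Summarise the key steps of a RAG pipeline.", "sk-gw-<your-key>")) performs the same call as the curl command above. Keeping construction separate from sending makes it easy to inspect the request first.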

POST /v1/messages

The primary inference endpoint. Accepts a messages payload and returns a completion.

POST https://api.gateway.fast/v1/messages

Routes to the best available model. Model selection is controlled by routing headers (see below).

Request body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Required | Array of message objects with role and content fields. |
| max_tokens | integer | Optional | Maximum tokens to generate. Defaults to 4096. |
| stream | boolean | Optional | Set to true to stream the response as SSE. |
| temperature | number | Optional | Sampling temperature. Passed through to the model. |
| top_p | number | Optional | Nucleus sampling threshold. |
| tools | array | Optional | Tool definitions. Presence influences routing toward tool-capable models. |
| tool_choice | object | Optional | Tool choice control. Passed through to the model. |
| response_format | object | Optional | Structured output format. Presence biases routing toward structured-output models. |

POST /v1/chat/completions

An OpenAI-compatible alias that delegates to /v1/messages. Use this as a drop-in replacement if your SDK or framework targets the OpenAI API format.

POST https://api.gateway.fast/v1/chat/completions

Identical behaviour to /v1/messages. All routing headers apply.

Routing headers

These request headers control how gateway.fast routes your request. All are optional — omitting them triggers fully automatic routing.

| Header | Values | Description |
| --- | --- | --- |
| x-mode | auto · direct | Routing mode. auto (default) lets gateway.fast decide. direct routes to a specific model via x-model. |
| x-model | model slug | Target model slug for direct mode, e.g. kimi-k2.6-tee. Ignored in auto mode. |
| x-latency | low · normal | Latency preference. low biases routing toward faster Tier 1–2 models. |
| x-privacy | tee | Set to tee to restrict routing to TEE-only models. |
| x-tier-max | 1 · 2 · 3 | Cap the maximum model tier used. Useful for cost control. |
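
Since every routing header has a small set of accepted values, it can help to validate them client-side before sending. The following is a minimal sketch; routing_headers is a hypothetical helper, and the accepted values are copied from the table above. It assumes you would rather fail early than send a header the gateway might reject.

```python
# Accepted values per header, as listed in the routing headers table.
ALLOWED_VALUES = {
    "x-mode": {"auto", "direct"},
    "x-latency": {"low", "normal"},
    "x-privacy": {"tee"},
    "x-tier-max": {"1", "2", "3"},
}


def routing_headers(mode=None, model=None, latency=None, privacy=None, tier_max=None):
    """Return a dict of routing headers; omitted arguments produce no header."""
    candidates = {
        "x-mode": mode,
        "x-latency": latency,
        "x-privacy": privacy,
        "x-tier-max": None if tier_max is None else str(tier_max),
    }
    headers = {}
    for name, value in candidates.items():
        if value is None:
            continue
        if value not in ALLOWED_VALUES[name]:
            allowed = sorted(ALLOWED_VALUES[name])
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
        headers[name] = value

    if mode == "direct":
        if not model:
            raise ValueError("x-mode: direct requires a model slug in x-model")
        headers["x-model"] = model
    # x-model is ignored in auto mode, so it is not sent unless mode is direct.
    return headers
```

For example, routing_headers(latency="low", privacy="tee") yields the two hint headers used in auto mode, and routing_headers(mode="direct", model="kimi-k2.6-tee") yields the pair needed for direct mode. Merge the result into your Authorization and Content-Type headers.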

Auto mode

In auto mode, gateway.fast uses a two-stage routing pipeline:

  1. LLM classifier — A lightweight Claude call analyses your messages for signals: agentic intent, tool use, code, reasoning depth, latency sensitivity. Completes in under 2 seconds.
  2. Availability-weighted scorer — Combines capability tier score with live availability data to pick the model least likely to be rate-limited right now.

If the classifier times out, routing falls back to the heuristic scorer alone — no request is dropped.

curl — auto mode with hints
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-latency: low" \
  -H "x-privacy: tee" \
  -d '{
    "messages": [{ "role": "user", "content": "Fix this bug: ..." }],
    "tools": [{ "name": "bash", "description": "Run shell commands" }],
    "max_tokens": 2048
  }'

Direct mode

Use x-mode: direct with x-model to target a specific model. The availability scorer is bypassed — your request goes to exactly the model you specify.

curl — direct mode
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-mode: direct" \
  -H "x-model: kimi-k2.6-tee" \
  -d '{
    "messages": [{ "role": "user", "content": "..." }],
    "max_tokens": 4096
  }'

Response headers

Every response includes metadata headers describing what happened:

| Header | Description |
| --- | --- |
| x-model-used | Slug of the model that handled the request, e.g. kimi-k2.6-tee. |
| x-tier | Tier of the selected model (1–4). |
| x-score | Routing score that won (0–100). |
| x-cost-micro | Cost of this request in µ$ (micro-dollars). $1 = 1,000,000 µ$. |
| x-cost-usd | Cost in USD as a decimal string, e.g. 0.000312. |
| x-balance-remaining-micro | Remaining balance in µ$ after this request. |
| x-balance-remaining-usd | Remaining balance in USD, e.g. 18.4231. |
| x-classifier-source | llm: the LLM classifier ran. heuristic: routing fell back to the scorer. direct: direct mode was used. |
| x-classifier-confidence | Classifier confidence score (0–1), if available. |
| x-request-id | UUID for this request. Include it in support queries. |
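
A small parser can turn these headers into typed values for logging or budget checks. This is a sketch under two assumptions: that the µ$ headers carry plain integer strings, and that any header may be absent (as cost headers are for streaming requests). RoutingInfo and parse_routing_info are illustrative names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MICRO_PER_USD = 1_000_000  # $1 = 1,000,000 µ$


@dataclass
class RoutingInfo:
    model_used: Optional[str]
    tier: Optional[str]
    classifier_source: Optional[str]
    cost_usd: Optional[float]
    balance_usd: Optional[float]
    request_id: Optional[str]


def parse_routing_info(headers: Mapping[str, str]) -> RoutingInfo:
    """Extract routing metadata from a response's headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}

    def micro_to_usd(name: str) -> Optional[float]:
        raw = h.get(name)
        # Assumption: the value is an integer count of micro-dollars.
        return None if raw is None else int(raw) / MICRO_PER_USD

    return RoutingInfo(
        model_used=h.get("x-model-used"),
        tier=h.get("x-tier"),
        classifier_source=h.get("x-classifier-source"),
        cost_usd=micro_to_usd("x-cost-micro"),
        balance_usd=micro_to_usd("x-balance-remaining-micro"),
        request_id=h.get("x-request-id"),
    )
```

Checking classifier_source for the value heuristic is one way to spot requests where the classifier timed out and routing fell back to the scorer.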

Streaming

Set "stream": true in your request body to receive a Server-Sent Events stream. Each event follows the standard SSE format with a data: prefix. The stream terminates with data: [DONE].

Note: Cost and token counts are not included in response headers for streaming requests. They are logged server-side and visible in your dashboard.

python — streaming
import json

import requests

resp = requests.post(
    "https://api.gateway.fast/v1/messages",
    headers={
        "Authorization": "Bearer sk-gw-<your-key>",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Write a sorting algorithm."}],
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,  # let requests yield the body incrementally
)
resp.raise_for_status()  # surface 4xx/5xx errors before reading the stream

for line in resp.iter_lines():
    # Skip blank separator lines and any non-data SSE fields.
    if line and line.startswith(b"data: "):
        data = line[6:]  # strip the "data: " prefix
        if data == b"[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        print(chunk, flush=True)

Errors

All errors return JSON with an error field describing the issue.

| Status | Meaning |
| --- | --- |
| 400 | Bad request: invalid JSON body or missing messages array. |
| 401 | Unauthorised: missing, malformed, or inactive API key. |
| 402 | Payment required: insufficient balance. Top up at gateway.fast/pricing. |
| 403 | Forbidden: restricted models require an enterprise key. |
| 500 | Internal server error: routing or provider failure. |
| 502/503 | Provider error: upstream model returned an error. Retry after a short delay. |
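
Only 502 and 503 are described as worth retrying, so a client can retry those and surface everything else immediately. The sketch below assumes exponential backoff with a 0.5 s base and an 8 s cap; those numbers, and the helper names, are illustrative choices rather than gateway.fast recommendations. The send argument is any zero-argument callable you supply that performs the request and returns a (status, body) pair.

```python
import time

RETRYABLE_STATUSES = {502, 503}  # upstream provider errors


def backoff_delays(max_attempts=4, base=0.5, cap=8.0):
    """Delays (seconds) between attempts: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def call_with_retry(send, max_attempts=4, base=0.5, cap=8.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or last_attempt:
            return status, body
        sleep(delays[attempt])
```

Statuses such as 401 or 402 are returned after a single attempt, because retrying cannot fix a bad key or an empty balance.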

Model catalog

The active model catalog is exposed through GET /v1/models. Provider names are shown as gateway.fast in customer-facing surfaces.

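| Model | Slug | Tier | Context | Input / 1M tokens | Output / 1M tokens | TEE |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek V4 Flash | deepseek-v4-flash | T1 | 1 million | $0.154 | $0.308 | |
| DeepSeek V4 Pro | deepseek-v4-pro | T2 | 1 million | $0.4785 | $0.957 | |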
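
The per-million prices make it straightforward to estimate a request's cost up front. The sketch below assumes the listed prices are USD per 1,000,000 tokens and that cost scales linearly with input and output token counts; actual billing may differ, so treat x-cost-usd or your dashboard as authoritative. PRICES and estimate_cost_usd are illustrative names.

```python
from decimal import Decimal

# (input, output) price per 1M tokens, copied from the catalog table.
PRICES = {
    "deepseek-v4-flash": (Decimal("0.154"), Decimal("0.308")),
    "deepseek-v4-pro": (Decimal("0.4785"), Decimal("0.957")),
}

TOKENS_PER_PRICE_UNIT = Decimal(1_000_000)


def estimate_cost_usd(slug, input_tokens, output_tokens):
    """Estimate cost in USD, assuming simple linear per-token pricing."""
    price_in, price_out = PRICES[slug]
    total = price_in * input_tokens + price_out * output_tokens
    return total / TOKENS_PER_PRICE_UNIT
```

For instance, 2,000 input tokens and 500 output tokens on deepseek-v4-flash come to (0.154 × 2000 + 0.308 × 500) / 1,000,000 = $0.000462, or 462 µ$. Decimal is used so the arithmetic does not pick up binary floating-point rounding.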

Tier system

Models are grouped into four capability tiers. Auto routing uses tiers as a primary signal alongside live utilisation.

| Tier | Characteristics | Best for |
| --- | --- | --- |
| T1 | Balanced: fast, cost-efficient | Summarisation, classification, high-throughput pipelines |
| T2 | Frontier agentic: multi-step, tool-capable | Agentic workflows, tool use, reasoning chains |
| T3 | Cutting edge: SWE-bench leaders | Complex coding, deep reasoning, long-context tasks |
| T4 | Enterprise: restricted premium models | Enterprise accounts only |

Questions? Email hello@gateway.fast or check your dashboard at gateway.fast/dashboard.