API Reference

gateway.fast routes your requests to the best available frontier model — automatically, based on workload type, capability tier, and live utilisation. One endpoint for the active gateway.fast model catalog.

Base URL: https://api.gateway.fast/v1

Overview

gateway.fast exposes inference endpoints under https://api.gateway.fast/v1. Requests are routed in real time to the best available model. You can influence routing via request headers — or let gateway.fast decide automatically.

Two routing modes are available:

Auto mode (default) — gateway.fast classifies your request and picks the optimal model by tier, live load, and workload type.
Direct mode — You specify exactly which model to use via x-model.

Some models are TEE-enabled. The model catalog shows which models support TEE.

Authentication

All requests require a Bearer token in the Authorization header. API keys are issued after purchase at gateway.fast/pricing and follow the format sk-gw-….

http

Authorization: Bearer sk-gw-<your-api-key>

Keep your key safe. API keys grant full access to your credit balance. If a key is compromised, rotate it from your dashboard.
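One way to keep the key out of source code is to read it from the environment when building headers. The sketch below is illustrative only: the environment variable name GATEWAY_FAST_API_KEY and the helper auth_headers are assumptions, not part of the gateway.fast API. Only the Bearer scheme and the sk-gw- prefix come from this section.

```python
import os
from typing import Optional

# Hypothetical variable name chosen for this example; gateway.fast does not define it.
API_KEY_ENV = "GATEWAY_FAST_API_KEY"


def auth_headers(api_key: Optional[str] = None) -> dict:
    """Return the headers every request needs, reading the key from the environment by default."""
    key = api_key if api_key is not None else os.environ.get(API_KEY_ENV, "")
    # Cheap sanity check against the documented key format (sk-gw-...).
    if not key.startswith("sk-gw-"):
        raise ValueError("expected an API key starting with 'sk-gw-'")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Calling auth_headers() with no argument picks up the key from the environment, so the same script can run unchanged after a key rotation.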

Quick start

Send your first request in under a minute:

curl

curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Summarise the key steps of a RAG pipeline." }
    ],
    "max_tokens": 1024
  }'
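The same request can be sketched in Python using only the standard library. The helper names build_message_request and send are invented for this example, and send assumes the response body is JSON, which this reference does not spell out for successful responses.

```python
import json
import urllib.request

BASE_URL = "https://api.gateway.fast/v1"


def build_message_request(api_key: str, prompt: str, max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) the same POST /v1/messages request as the curl command."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the response body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like send(build_message_request("sk-gw-<your-key>", "Summarise the key steps of a RAG pipeline.")). Splitting building from sending makes the request easy to inspect or log before it goes out.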

POST /v1/messages

The primary inference endpoint. Accepts a messages payload and returns a completion.

POST https://api.gateway.fast/v1/messages

Routes to the best available model. Model selection is controlled by routing headers (see below).

Request body

Parameter	Type	Required	Description
messages	array	Required	Array of message objects with `role` and `content` fields.
max_tokens	integer	Optional	Maximum tokens to generate. Defaults to `4096`.
stream	boolean	Optional	Set to `true` to stream the response as SSE.
temperature	number	Optional	Sampling temperature. Passed through to the model.
top_p	number	Optional	Nucleus sampling threshold.
tools	array	Optional	Tool definitions. Presence influences routing toward tool-capable models.
tool_choice	object	Optional	Tool choice control. Passed through to the model.
response_format	object	Optional	Structured output format. Presence scores toward structured-output models.
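A small client-side helper can catch malformed bodies before they cost a round trip. This is a sketch, not an official validator: the function name build_body is made up, and rejecting unknown parameters is a local guard against typos, since this reference does not say how the server treats undocumented fields.

```python
from typing import Any

# Optional parameters listed in the request body table.
OPTIONAL_FIELDS = {
    "max_tokens", "stream", "temperature", "top_p",
    "tools", "tool_choice", "response_format",
}


def build_body(messages: list, **options: Any) -> dict:
    """Assemble a /v1/messages request body from the documented parameters."""
    if not isinstance(messages, list):
        raise TypeError("messages must be a list of message objects")
    for message in messages:
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError("each message needs 'role' and 'content' fields")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        # Local guard only; server behaviour for unknown fields is not documented.
        raise ValueError(f"undocumented parameters: {sorted(unknown)}")
    body = {"messages": messages}
    # Leave unset options out so server-side defaults (e.g. max_tokens 4096) apply.
    body.update({k: v for k, v in options.items() if v is not None})
    return body
```

For example, build_body([{"role": "user", "content": "hi"}], max_tokens=256) yields a body with only messages and max_tokens set.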

POST /v1/chat/completions

An OpenAI-compatible alias that delegates to /v1/messages. Use this as a drop-in replacement if your SDK or framework targets the OpenAI API format.

POST https://api.gateway.fast/v1/chat/completions

Identical behaviour to /v1/messages. All routing headers apply.

Routing headers

These request headers control how gateway.fast routes your request. All are optional — omitting them triggers fully automatic routing.

Header	Values	Description
x-mode	`auto` · `direct`	Routing mode. `auto` (default) lets gateway.fast decide. `direct` routes to a specific model via `x-model`.
x-model	model slug	Target model slug for direct mode, e.g. `kimi-k2.6-tee`. Ignored in auto mode.
x-latency	`low` · `normal`	Latency preference. `low` biases routing toward faster Tier 1–2 models.
x-privacy	`tee`	Set to `tee` to restrict routing to TEE-only models.
x-tier-max	`1` · `2` · `3`	Cap the maximum model tier used. Useful for cost control.
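The header table translates naturally into a small builder that validates values before sending. The function name routing_headers and its keyword arguments are assumptions for illustration; the header names and allowed values are taken from the table above.

```python
from typing import Optional


def routing_headers(
    mode: str = "auto",
    model: Optional[str] = None,
    latency: Optional[str] = None,
    privacy: Optional[str] = None,
    tier_max: Optional[int] = None,
) -> dict:
    """Build the optional routing headers, rejecting values the table does not list."""
    if mode not in ("auto", "direct"):
        raise ValueError("x-mode must be 'auto' or 'direct'")
    headers = {}
    if mode == "direct":
        if not model:
            raise ValueError("direct mode requires a model slug for x-model")
        headers["x-mode"] = "direct"
        headers["x-model"] = model
    # In auto mode x-mode is the default and x-model is ignored, so both are omitted.
    if latency is not None:
        if latency not in ("low", "normal"):
            raise ValueError("x-latency must be 'low' or 'normal'")
        headers["x-latency"] = latency
    if privacy is not None:
        if privacy != "tee":
            raise ValueError("x-privacy only accepts 'tee'")
        headers["x-privacy"] = "tee"
    if tier_max is not None:
        if tier_max not in (1, 2, 3):
            raise ValueError("x-tier-max must be 1, 2 or 3")
        headers["x-tier-max"] = str(tier_max)
    return headers
```

Calling routing_headers() with no arguments returns an empty dict, which matches the documented behaviour that omitting all headers triggers fully automatic routing.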

Auto mode

In auto mode, gateway.fast uses a two-stage routing pipeline:

LLM classifier — A lightweight Claude call analyses your messages for signals: agentic intent, tool use, code, reasoning depth, latency sensitivity. Completes in under 2 seconds.
Availability-weighted scorer — Combines capability tier score with live availability data to pick the model least likely to be rate-limited right now.

If the classifier times out, routing falls back to the availability-weighted scorer alone (reported as heuristic in the x-classifier-source response header), and no request is dropped.
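To make the classify-then-score flow with a timeout fallback concrete, here is a toy sketch. It is not gateway.fast's implementation: the Candidate type, the scoring formula, the tier preference, and the heuristic_signals fallback are all invented for illustration. Only the overall shape (a classifier with a 2-second budget, then a score combining tier and availability) follows the description above.

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    slug: str
    tier: int
    availability: float  # hypothetical live availability in the range 0..1


def heuristic_signals(body: dict) -> dict:
    """Cheap fallback signals derived from the request body alone (invented for this sketch)."""
    return {"tool_use": bool(body.get("tools"))}


def score(candidate: Candidate, signals: dict) -> float:
    """Toy score in 0..100: how well the tier fits, weighted by availability."""
    wanted_tier = 2 if signals.get("tool_use") else 1  # made-up preference
    tier_fit = 1.0 / (1 + abs(candidate.tier - wanted_tier))
    return round(100 * tier_fit * candidate.availability, 1)


def route(
    body: dict,
    candidates: List[Candidate],
    classify: Callable[[dict], dict],
    timeout_s: float = 2.0,
) -> Tuple[str, str]:
    """Return (model slug, classifier source) for a request."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(classify, body)
    try:
        signals = future.result(timeout=timeout_s)
        source = "llm"
    except concurrent.futures.TimeoutError:
        # Classifier too slow: score with heuristic signals instead of dropping the request.
        signals = heuristic_signals(body)
        source = "heuristic"
    finally:
        pool.shutdown(wait=False)  # do not block on a slow classifier
    best = max(candidates, key=lambda c: score(c, signals))
    return best.slug, source
```

The point of the structure is that the classifier can only improve the signals; when it misses its budget the scorer still has enough to pick a model.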

curl — auto mode with hints

curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-latency: low" \
  -H "x-privacy: tee" \
  -d '{
    "messages": [{ "role": "user", "content": "Fix this bug: ..." }],
    "tools": [{ "name": "bash", "description": "Run shell commands" }],
    "max_tokens": 2048
  }'

Direct mode

Use x-mode: direct with x-model to target a specific model. The availability scorer is bypassed — your request goes to exactly the model you specify.

curl — direct mode

curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-mode: direct" \
  -H "x-model: kimi-k2.6-tee" \
  -d '{
    "messages": [{ "role": "user", "content": "..." }],
    "max_tokens": 4096
  }'

Response headers

Every response includes metadata headers describing what happened:

Header	Description
x-model-used	Slug of the model that handled the request, e.g. `kimi-k2.6-tee`.
x-tier	Tier of the selected model (`1`–`4`).
x-score	Routing score that won (0–100).
x-cost-micro	Cost of this request in µ$ (micro-dollars). $1 = 1,000,000 µ$.
x-cost-usd	Cost in USD as a decimal string, e.g. `0.000312`.
x-balance-remaining-micro	Remaining balance in µ$ after this request.
x-balance-remaining-usd	Remaining balance in USD, e.g. `18.4231`.
x-classifier-source	`llm` — LLM classifier ran. `heuristic` — fell back to scorer. `direct` — direct mode.
x-classifier-confidence	Classifier confidence score (0–1), if available.
x-request-id	UUID for this request — include in support queries.
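These headers are easy to collect into a typed record for logging or budget checks. The sketch below assumes that the micro-dollar headers carry integer strings and that x-tier is a bare integer. The names RoutingInfo and parse_routing_info are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping, Optional

MICRO_PER_USD = 1_000_000  # $1 = 1,000,000 micro-dollars


@dataclass
class RoutingInfo:
    model_used: Optional[str]
    tier: Optional[int]
    cost_usd: Optional[Decimal]
    balance_usd: Optional[Decimal]
    classifier_source: Optional[str]
    request_id: Optional[str]


def parse_routing_info(headers: Mapping[str, str]) -> RoutingInfo:
    """Extract routing metadata from response headers; absent headers become None."""
    h = {name.lower(): value for name, value in headers.items()}

    def usd_from_micro(name: str) -> Optional[Decimal]:
        # Assumes the header value is an integer number of micro-dollars.
        return Decimal(int(h[name])) / MICRO_PER_USD if name in h else None

    return RoutingInfo(
        model_used=h.get("x-model-used"),
        tier=int(h["x-tier"]) if "x-tier" in h else None,
        cost_usd=usd_from_micro("x-cost-micro"),
        balance_usd=usd_from_micro("x-balance-remaining-micro"),
        classifier_source=h.get("x-classifier-source"),
        request_id=h.get("x-request-id"),
    )
```

Decimal is used instead of float so that small per-request costs add up without rounding drift. With the requests library, parse_routing_info(resp.headers) would work directly because the helper lowercases header names itself.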

Streaming

Set "stream": true in your request body to receive a Server-Sent Events stream. Each event follows the standard SSE format with a data: prefix. The stream terminates with data: [DONE].

Note: Cost and token counts are not included in response headers for streaming requests — they are logged server-side and visible in your dashboard.

python — streaming

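import json

import requests

resp = requests.post(
    "https://api.gateway.fast/v1/messages",
    headers={
        "Authorization": "Bearer sk-gw-<your-key>",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Write a sorting algorithm."}],
        "max_tokens": 2048,
        "stream": True,  # ask the API for an SSE stream
    },
    stream=True,  # tell requests not to buffer the whole response
)
resp.raise_for_status()  # errors arrive as plain JSON, not as SSE events

for line in resp.iter_lines():
    # Skip blank separator lines and anything that is not a data event.
    if line and line.startswith(b"data: "):
        data = line[len(b"data: "):]
        if data == b"[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        print(chunk, flush=True)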

Errors

All errors return JSON with an error field describing the issue.

Status	Meaning
400	Bad request — invalid JSON body or missing `messages` array.
401	Unauthorised — missing, malformed, or inactive API key.
402	Payment required — insufficient balance. Top up at gateway.fast/pricing.
403	Forbidden — restricted models require an enterprise key.
500	Internal server error — routing or provider failure.
502/503	Provider error — upstream model returned an error. Retry after a short delay.
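The table suggests a simple client policy: retry only on 502 and 503, and surface everything else immediately. The following is a minimal sketch under that reading. The delays, the retry count, and the helper names are assumptions, since the reference only says to retry after a short delay.

```python
import time
from typing import Callable, List, Tuple

RETRYABLE_STATUSES = {502, 503}  # the only statuses the table marks as worth retrying


def backoff_delays(max_retries: int = 3, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Exponential backoff schedule in seconds (values chosen arbitrarily for this example)."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]


def call_with_retries(
    send: Callable[[], Tuple[int, dict]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or retries run out.

    send is any zero-argument callable returning (status_code, decoded_json_body).
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_retries:
            return status, body
        sleep(delays[attempt])
    raise AssertionError("unreachable")  # the loop always returns
```

Statuses such as 401 and 402 are returned on the first attempt, because retrying cannot fix a bad key or an empty balance.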

Model catalog

The active model catalog is exposed through GET /v1/models. Provider names are shown as gateway.fast in customer-facing surfaces.

Model	Slug	Tier	Context (tokens)	Input (USD / 1M tokens)	Output (USD / 1M tokens)	TEE
DeepSeek V4 Flash	`deepseek-v4-flash`	T1	1 million	$0.154	$0.308	—
DeepSeek V4 Pro	`deepseek-v4-pro`	T2	1 million	$0.4785	$0.957	—
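The per-million prices make a rough cost estimate a one-line calculation. The sketch below hard-codes the two rows from the table. The function name estimate_cost_usd is hypothetical, and the result is only an estimate, since the authoritative figure for a request is the one reported in x-cost-usd.

```python
from decimal import Decimal

# (input price, output price) in USD per 1M tokens, copied from the catalog table.
PRICES_PER_MILLION = {
    "deepseek-v4-flash": (Decimal("0.154"), Decimal("0.308")),
    "deepseek-v4-pro": (Decimal("0.4785"), Decimal("0.957")),
}


def estimate_cost_usd(slug: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate request cost as tokens times the per-million price for each direction."""
    input_price, output_price = PRICES_PER_MILLION[slug]
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000
```

For instance, 1,000 input tokens and 500 output tokens on deepseek-v4-flash come to (0.154 × 1000 + 0.308 × 500) / 1,000,000 = 0.000308 USD.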

Tier system

Models are grouped into four capability tiers. Auto routing uses tiers as a primary signal alongside live utilisation.

Tier	Characteristics	Best for
T1	Balanced — fast, cost-efficient	Summarisation, classification, high-throughput pipelines
T2	Frontier agentic — multi-step, tool-capable	Agentic workflows, tool use, reasoning chains
T3	Cutting edge — SWE-bench leaders	Complex coding, deep reasoning, long-context tasks
T4	Enterprise — restricted premium models	Enterprise accounts only

Questions? Email hello@gateway.fast or check your dashboard at gateway.fast/dashboard.