API Reference
gateway.fast routes your requests to the best available frontier model — automatically, based on workload type, capability tier, and live utilisation. One endpoint for the active gateway.fast model catalog.
Overview
gateway.fast exposes inference endpoints under https://api.gateway.fast/v1. Requests are routed in real time to the best available model. You can influence routing via request headers — or let gateway.fast decide automatically.
Two routing modes are available:
- Auto mode (default): gateway.fast classifies your request and picks the optimal model by tier, live load, and workload type.
- Direct mode: you specify exactly which model to use via the x-model header.
Some models run inside a trusted execution environment (TEE). The model catalog shows which models support TEE.
Authentication
All requests require a Bearer token in the Authorization header. API keys are issued after purchase at gateway.fast/pricing and follow the format sk-gw-….
Authorization: Bearer sk-gw-<your-api-key>
Quick start
Send your first request in under a minute:
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Summarise the key steps of a RAG pipeline." }
    ],
    "max_tokens": 1024
  }'
POST /v1/messages
The primary inference endpoint. It accepts a messages payload and returns a completion.
Requests are routed to the best available model. Model selection is controlled by the routing headers described in the Routing headers section.
Request body
| Parameter | Type | Required | Description |
|---|---|---|---|
| messages | array | Required | Array of message objects with role and content fields. |
| max_tokens | integer | Optional | Maximum tokens to generate. Defaults to 4096. |
| stream | boolean | Optional | Set to true to stream the response as SSE. |
| temperature | number | Optional | Sampling temperature. Passed through to the model. |
| top_p | number | Optional | Nucleus sampling threshold. |
| tools | array | Optional | Tool definitions. Presence influences routing toward tool-capable models. |
| tool_choice | object | Optional | Tool choice control. Passed through to the model. |
| response_format | object | Optional | Structured output format. Presence scores toward structured-output models. |
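
As a rough Python counterpart to the table above, the sketch below assembles a request body and wraps it in a `urllib.request.Request` without sending it. The helper names `build_payload` and `build_request` are hypothetical, and the client-side checks (non-empty `messages`, rejecting parameters not listed in the table) are assumptions made for illustration; the API itself may accept more than this.

```python
import json
import urllib.request

API_URL = "https://api.gateway.fast/v1/messages"  # endpoint from this page

# Optional parameters listed in the request body table.
OPTIONAL_PARAMS = {
    "max_tokens", "stream", "temperature", "top_p",
    "tools", "tool_choice", "response_format",
}


def build_payload(messages, **options):
    """Assemble a request body; `messages` is required, the rest is optional."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    unknown = set(options) - OPTIONAL_PARAMS
    if unknown:
        # Client-side precaution only (an assumption, not documented behaviour).
        raise ValueError(f"parameters not in the table: {sorted(unknown)}")
    return {"messages": messages, **options}


def build_request(api_key, payload):
    """Wrap the payload in a POST request object. Nothing is sent here."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = build_payload(
    [{"role": "user", "content": "Summarise the key steps of a RAG pipeline."}],
    max_tokens=1024,
)
request = build_request("sk-gw-<your-key>", payload)
```

Passing `request` to `urllib.request.urlopen` would send it; building and sending are kept separate here so the payload can be inspected or logged first.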
POST /v1/chat/completions
An OpenAI-compatible alias that delegates to /v1/messages. Use it as a drop-in replacement if your SDK or framework targets the OpenAI API format.
It behaves identically to /v1/messages, and all routing headers apply.
Routing headers
These request headers control how gateway.fast routes your request. All are optional — omitting them triggers fully automatic routing.
| Header | Values | Description |
|---|---|---|
| x-mode | auto · direct | Routing mode. auto (default) lets gateway.fast decide. direct routes to a specific model via x-model. |
| x-model | model slug | Target model slug for direct mode, e.g. kimi-k2.6-tee. Ignored in auto mode. |
| x-latency | low · normal | Latency preference. low biases routing toward faster Tier 1–2 models. |
| x-privacy | tee | Set to tee to restrict routing to TEE-only models. |
| x-tier-max | 1 · 2 · 3 | Cap the maximum model tier used. Useful for cost control. |
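
One way to keep these headers consistent in client code is a small builder that checks each value against the table above. This is a minimal sketch: the function name `routing_headers` and its keyword arguments are hypothetical, and only the header names and allowed values come from this page.

```python
def routing_headers(mode="auto", model=None, latency=None, privacy=None, tier_max=None):
    """Build the optional routing headers, validating against the documented values."""
    if mode not in ("auto", "direct"):
        raise ValueError("x-mode must be 'auto' or 'direct'")
    if mode == "direct" and not model:
        raise ValueError("direct mode needs a model slug for x-model")

    headers = {}
    if mode == "direct":
        # x-model is ignored in auto mode, so it is only sent in direct mode.
        headers["x-mode"] = "direct"
        headers["x-model"] = model
    if latency is not None:
        if latency not in ("low", "normal"):
            raise ValueError("x-latency must be 'low' or 'normal'")
        headers["x-latency"] = latency
    if privacy is not None:
        if privacy != "tee":
            raise ValueError("x-privacy only accepts 'tee'")
        headers["x-privacy"] = "tee"
    if tier_max is not None:
        if tier_max not in (1, 2, 3):
            raise ValueError("x-tier-max must be 1, 2, or 3")
        headers["x-tier-max"] = str(tier_max)
    return headers


auto_headers = routing_headers(latency="low", privacy="tee", tier_max=2)
direct_headers = routing_headers(mode="direct", model="kimi-k2.6-tee")
```

Calling `routing_headers()` with no arguments returns an empty dict, which corresponds to fully automatic routing.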
Auto mode
In auto mode, gateway.fast uses a two-stage routing pipeline:
- LLM classifier — A lightweight Claude call analyses your messages for signals: agentic intent, tool use, code, reasoning depth, latency sensitivity. Completes in under 2 seconds.
- Availability-weighted scorer — Combines capability tier score with live availability data to pick the model least likely to be rate-limited right now.
If the classifier times out, routing falls back to the heuristic scorer alone — no request is dropped.
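The two stages and the timeout fallback can be pictured with a toy model. This is not gateway.fast's implementation: the candidate list, the scoring formula, the stand-in heuristic, and every function name below are invented for illustration. Only the overall shape (classifier with a time limit, then an availability-weighted pick, with a fallback when the classifier is too slow) follows the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical candidates: (slug, tier, availability between 0 and 1).
CANDIDATES = [("model-a", 1, 0.9), ("model-b", 2, 0.4), ("model-c", 2, 0.8)]


def score(tier, availability, wanted_tier):
    """Toy score: tier fit weighted by availability (not the real formula)."""
    tier_fit = 1.0 / (1 + abs(tier - wanted_tier))
    return tier_fit * availability


def heuristic_tier(messages):
    """Stand-in heuristic: long prompts ask for tier 2, everything else tier 1."""
    text = " ".join(m["content"] for m in messages)
    return 2 if len(text) > 200 else 1


def route(messages, classify, timeout_s=2.0):
    """Stage 1: classifier under a time limit. Stage 2: availability-weighted pick."""
    source = "llm"
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        wanted_tier = pool.submit(classify, messages).result(timeout=timeout_s)
    except FutureTimeout:
        # Classifier too slow: fall back so the request is still routed.
        wanted_tier, source = heuristic_tier(messages), "heuristic"
    finally:
        pool.shutdown(wait=False)
    best = max(CANDIDATES, key=lambda c: score(c[1], c[2], wanted_tier))
    return best[0], source
```

In this toy, `source` plays the role of the x-classifier-source response header: it records whether the classifier result or the fallback drove the choice.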
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-latency: low" \
  -H "x-privacy: tee" \
  -d '{
    "messages": [{ "role": "user", "content": "Fix this bug: ..." }],
    "tools": [{ "name": "bash", "description": "Run shell commands" }],
    "max_tokens": 2048
  }'
Direct mode
Use x-mode: direct with x-model to target a specific model. The availability scorer is bypassed — your request goes to exactly the model you specify.
curl -X POST https://api.gateway.fast/v1/messages \
  -H "Authorization: Bearer sk-gw-<your-key>" \
  -H "Content-Type: application/json" \
  -H "x-mode: direct" \
  -H "x-model: kimi-k2.6-tee" \
  -d '{
    "messages": [{ "role": "user", "content": "..." }],
    "max_tokens": 4096
  }'
Response headers
Every response includes metadata headers describing what happened:
| Header | Description |
|---|---|
| x-model-used | Slug of the model that handled the request, e.g. kimi-k2.6-tee. |
| x-tier | Tier of the selected model (1–4). |
| x-score | Routing score that won (0–100). |
| x-cost-micro | Cost of this request in µ$ (micro-dollars). $1 = 1,000,000 µ$. |
| x-cost-usd | Cost in USD as a decimal string, e.g. 0.000312. |
| x-balance-remaining-micro | Remaining balance in µ$ after this request. |
| x-balance-remaining-usd | Remaining balance in USD, e.g. 18.4231. |
| x-classifier-source | llm — LLM classifier ran. heuristic — fell back to scorer. direct — direct mode. |
| x-classifier-confidence | Classifier confidence score (0–1), if available. |
| x-request-id | UUID for this request — include in support queries. |
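
A client can collect these headers into a typed record for logging or cost tracking. The sketch below assumes that x-tier and x-cost-micro arrive as plain integer strings, which this page does not state explicitly. `RoutingInfo`, `parse_routing_info`, and `micro_to_usd` are hypothetical names, and the sample values (including the tier and the request ID) are placeholders.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class RoutingInfo:
    model_used: str
    tier: int
    cost_micro: int
    cost_usd: Decimal
    classifier_source: str
    request_id: str


def parse_routing_info(headers):
    """Read the metadata headers from a header mapping (keys are lower-cased first)."""
    h = {k.lower(): v for k, v in headers.items()}
    return RoutingInfo(
        model_used=h["x-model-used"],
        tier=int(h["x-tier"]),
        cost_micro=int(h["x-cost-micro"]),
        cost_usd=Decimal(h["x-cost-usd"]),
        classifier_source=h["x-classifier-source"],
        request_id=h["x-request-id"],
    )


def micro_to_usd(micro):
    """Convert micro-dollars to USD, using $1 = 1,000,000 micro-dollars."""
    return Decimal(micro) / Decimal(1_000_000)


# Placeholder header values for illustration only.
sample = {
    "X-Model-Used": "kimi-k2.6-tee",
    "X-Tier": "2",
    "X-Cost-Micro": "312",
    "X-Cost-Usd": "0.000312",
    "X-Classifier-Source": "llm",
    "X-Request-Id": "00000000-0000-0000-0000-000000000000",
}
info = parse_routing_info(sample)
```

`Decimal` is used instead of `float` so that small per-request costs add up without rounding drift.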
Streaming
Set "stream": true in your request body to receive a Server-Sent Events stream. Each event follows the standard SSE format with a data: prefix. The stream terminates with data: [DONE].
import json

import requests

resp = requests.post(
    "https://api.gateway.fast/v1/messages",
    headers={
        "Authorization": "Bearer sk-gw-<your-key>",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Write a sorting algorithm."}],
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,  # read the body incrementally instead of buffering it
)
resp.raise_for_status()  # surface 4xx/5xx errors before reading the stream

for line in resp.iter_lines():
    # Each SSE event line carries its payload after the "data: " prefix.
    if line and line.startswith(b"data: "):
        data = line[6:]
        if data == b"[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        print(chunk, flush=True)
Errors
All errors return JSON with an error field describing the issue.
| Status | Meaning |
|---|---|
| 400 | Bad request — invalid JSON body or missing messages array. |
| 401 | Unauthorised — missing, malformed, or inactive API key. |
| 402 | Payment required — insufficient balance. Top up at gateway.fast/pricing. |
| 403 | Forbidden — restricted models require an enterprise key. |
| 500 | Internal server error — routing or provider failure. |
| 502/503 | Provider error — upstream model returned an error. Retry after a short delay. |
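
The retry advice for 502 and 503 can be wrapped in a small helper. This is a sketch under assumptions: the table only says "a short delay", so the exponential backoff, the attempt limit, and the `send` callable (a hypothetical function returning a status code and body) are illustrative choices. Other statuses, including 500, are returned to the caller without retrying.

```python
import time

RETRYABLE = {502, 503}  # statuses the table above suggests retrying


def send_with_retry(send, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning a (status_code, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            return status, body
        # Exponential backoff: 0.5 s, 1 s, 2 s, ... (an assumed policy).
        sleep(base_delay_s * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting.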
Model catalog
The active model catalog is exposed through GET /v1/models. Provider names are shown as gateway.fast in customer-facing surfaces.
| Model | Slug | Tier | Context (tokens) | Input (USD per 1M tokens) | Output (USD per 1M tokens) | TEE |
|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | deepseek-v4-flash | T1 | 1 million | $0.154 | $0.308 | — |
| DeepSeek V4 Pro | deepseek-v4-pro | T2 | 1 million | $0.4785 | $0.957 | — |
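
For budgeting, the list prices above can be turned into a rough per-request estimate. The sketch assumes the prices are per one million tokens and that cost is linear in token counts; the x-cost-micro response header remains the authoritative figure. `estimate_cost_usd` and `usd_to_micro` are hypothetical helper names.

```python
from decimal import Decimal

# USD per 1M tokens (input, output), copied from the catalog table above.
PRICES = {
    "deepseek-v4-flash": (Decimal("0.154"), Decimal("0.308")),
    "deepseek-v4-pro": (Decimal("0.4785"), Decimal("0.957")),
}


def estimate_cost_usd(slug, input_tokens, output_tokens):
    """Estimate cost in USD, assuming linear per-token pricing at list price."""
    price_in, price_out = PRICES[slug]
    return (input_tokens * price_in + output_tokens * price_out) / Decimal(1_000_000)


def usd_to_micro(usd):
    """Convert USD to micro-dollars, using $1 = 1,000,000 micro-dollars."""
    return usd * 1_000_000


# 1,000 input tokens and 500 output tokens on deepseek-v4-flash.
estimate = estimate_cost_usd("deepseek-v4-flash", 1000, 500)
```

Under these assumptions, `estimate` is 0.000308 USD, or 308 micro-dollars.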
Tier system
Models are grouped into four capability tiers. Auto routing uses tiers as a primary signal alongside live utilisation.
| Tier | Characteristics | Best for |
|---|---|---|
| T1 | Balanced — fast, cost-efficient | Summarisation, classification, high-throughput pipelines |
| T2 | Frontier agentic — multi-step, tool-capable | Agentic workflows, tool use, reasoning chains |
| T3 | Cutting edge — SWE-bench leaders | Complex coding, deep reasoning, long-context tasks |
| T4 | Enterprise — restricted premium models | Enterprise accounts only |