Deploy API
Promote a fine-tuned adapter to a live endpoint in one click.
Every fine-tuned adapter on Soramai can be promoted to an autoscaling inference endpoint. Call it from the dashboard playground, the CLI, or directly over HTTPS. Endpoints are billed per request and cost nothing while idle.
What you get
Soramai endpoints are managed: scaling, queueing, retries, observability, and key management are all built in.
Autoscaling endpoints
Each deployed adapter gets its own HTTPS endpoint that scales from zero to as many workers as you need. Idle endpoints cost nothing.
Streaming responses
Text completions stream tokens over HTTP server-sent events. The schema mirrors common OpenAI-compatible clients.
Per-request billing
You pay for the GPU time each request actually consumes, with no per-minute reservations and no warm-pool minimums.
Scoped API keys
Create, rotate, and revoke keys from the dashboard. Keys are scoped to a single account and shown once at creation time.
Region pinning
Pin a deployment to a region for data-residency requirements. Multi-region active-active is available on enterprise plans.
Logs and metrics
Per-request latency, queue depth, and error rate are exposed in the dashboard. Logs are retained for 90 days.
Text completions — request
Send a prompt to a fine-tuned adapter. The deployment is identified by the API key, so the URL is the same for every deployment under your account.
curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this changelog entry.",
    "system_prompt": "You are a concise technical writer.",
    "temperature": 0.7,
    "max_new_tokens": 256
  }'
Streaming responses
Add stream: true to get tokens back as Server-Sent Events. Compatible with the Vercel AI SDK, openai-node, and any standard SSE client.
curl https://soramai.com/api/v1/inference \
  -H "Authorization: Bearer $SORAMAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "prompt": "Write a haiku about graphics cards.",
    "stream": true
  }'
# Stream output (each block is one SSE event):
data: {"type":"chunk","text":"Fans spin at midnight—"}
data: {"type":"chunk","text":" silicon dreams of compute,"}
data: {"type":"chunk","text":" oceans of warm watts."}
data: {"type":"done","input_tokens":12,"output_tokens":22,"usage":{"coins_used":18,"coins_per_minute":200,"latency_ms":1843},"model":{"base_model":"Qwen/Qwen2.5-7B-Instruct","name":"haiku-bot"}}
Frame types: chunk (incremental text the client appends), done (final usage and model info; the stream ends), and error (a fatal error mid-stream; the stream ends with no done frame). The same auth, rate limit, and billing semantics apply as for the non-streaming endpoint: streaming is purely a transport choice.
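The three frame types suggest a small dispatch loop on the client. The sketch below is a minimal illustration under stated assumptions, not Soramai's own client: `StreamResult` and `collect_stream` are hypothetical names, and the `text` field on error frames is an assumption borrowed from the Python example on this page rather than a documented schema. It works on already-decoded lines, so it can be tried without a network call.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamResult:
    """Accumulated state of one streamed completion (hypothetical helper type)."""
    text: str = ""
    usage: Optional[dict] = None
    finished: bool = False


def collect_stream(lines: Iterable[str]) -> StreamResult:
    """Fold decoded SSE lines into a StreamResult, dispatching on frame type."""
    result = StreamResult()
    for line in lines:
        # Skip blank separators and anything that is not a data line.
        if not line.startswith("data:"):
            continue
        frame = json.loads(line[len("data:"):].strip())
        kind = frame.get("type")
        if kind == "chunk":
            # Incremental text: append to what has arrived so far.
            result.text += frame["text"]
        elif kind == "done":
            # Final frame: record usage and stop reading.
            result.usage = frame.get("usage")
            result.finished = True
            break
        elif kind == "error":
            # Assumption: error frames carry a human-readable "text" field.
            raise RuntimeError(frame.get("text", "stream failed"))
    return result


# Hand-written sample frames in the shape shown above.
sample = [
    'data: {"type":"chunk","text":"Hello"}',
    "",
    'data: {"type":"chunk","text":", world"}',
    "",
    'data: {"type":"done","usage":{"coins_used":18}}',
]
result = collect_stream(sample)
print(result.text, result.usage)
```

If `finished` is still False when the function returns, the connection closed before a done frame arrived, which a caller may want to treat as a truncated response.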
JavaScript / TypeScript example
A complete streaming consumer in ~20 lines. Pairs cleanly with React.useState updates for a typewriter UI.
const r = await fetch("https://soramai.com/api/v1/inference", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SORAMAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: "Hello", stream: true }),
});
if (!r.ok || !r.body) throw new Error(`request failed: HTTP ${r.status}`);

const reader = r.body.getReader();
const decoder = new TextDecoder();
let buf = "";
let output = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // SSE events are separated by a blank line; keep any partial event in buf.
  const frames = buf.split("\n\n");
  buf = frames.pop() ?? "";
  for (const frame of frames) {
    const data = frame.replace(/^data:\s*/, "").trim();
    if (!data) continue;
    const evt = JSON.parse(data);
    if (evt.type === "chunk") output += evt.text;
    if (evt.type === "done") console.log("usage:", evt.usage);
    // An error frame ends the stream with no done frame (text field as in the Python example).
    if (evt.type === "error") throw new Error(evt.text);
  }
}
console.log("final:", output);
Python example
Streaming consumer using the standard requests library. Use httpx if you need asyncio.
import json
import os

import requests

with requests.post(
    "https://soramai.com/api/v1/inference",
    headers={"Authorization": f"Bearer {os.environ['SORAMAI_API_KEY']}"},
    json={"prompt": "Hello", "stream": True},
    stream=True,
) as r:
    r.raise_for_status()
    # requests falls back to ISO-8859-1 for text/* responses that declare no charset.
    r.encoding = "utf-8"
    output = ""
    for raw in r.iter_lines(decode_unicode=True):
        # Skip blank separator lines and anything that is not a data line.
        if not raw or not raw.startswith("data:"):
            continue
        evt = json.loads(raw[len("data:"):].strip())
        if evt["type"] == "chunk":
            output += evt["text"]
            print(evt["text"], end="", flush=True)
        elif evt["type"] == "done":
            print(f"\n\n[usage] {evt['usage']}")
        elif evt["type"] == "error":
            raise RuntimeError(evt["text"])
print(output)
Non-streaming response shape
When stream is omitted or false, the API returns a single JSON document with the full response. This is the lowest-effort integration if you don't need a typewriter-style UI.
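As an illustration of consuming that document, the sketch below unpacks it into a typed record. The field names come from the sample response in this section; `Completion` and `parse_completion` are hypothetical names, and since the page does not say which fields are guaranteed, falling back to zero or an empty string for missing ones is an assumption of this sketch rather than documented behavior.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Completion:
    """Flattened view of one non-streaming response (hypothetical type)."""
    text: str
    input_tokens: int
    output_tokens: int
    coins_used: int
    latency_ms: int
    base_model: str
    adapter_name: str


def parse_completion(body: str) -> Completion:
    """Decode the JSON document and pull out the fields shown in the sample."""
    doc = json.loads(body)
    # Assumption: nested objects may be absent, so read them defensively.
    usage = doc.get("usage") or {}
    model = doc.get("model") or {}
    return Completion(
        text=doc["response"],
        input_tokens=int(doc.get("input_tokens", 0)),
        output_tokens=int(doc.get("output_tokens", 0)),
        coins_used=int(usage.get("coins_used", 0)),
        latency_ms=int(usage.get("latency_ms", 0)),
        base_model=str(model.get("base_model", "")),
        adapter_name=str(model.get("name", "")),
    )


# Same values as the sample response in this section.
body = json.dumps({
    "response": "Soramai now supports per-second training billing...",
    "input_tokens": 124,
    "output_tokens": 87,
    "usage": {"coins_used": 31, "coins_per_minute": 200, "latency_ms": 1843},
    "model": {"base_model": "Qwen/Qwen2.5-7B-Instruct", "name": "supportbot"},
})
completion = parse_completion(body)
print(completion.text)
print(f"{completion.coins_used} coins, {completion.latency_ms} ms")
```

The raw document that such a parser would receive looks like this: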
{
  "response": "Soramai now supports per-second training billing...",
  "input_tokens": 124,
  "output_tokens": 87,
  "usage": {
    "coins_used": 31,
    "coins_per_minute": 200,
    "latency_ms": 1843
  },
  "model": {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "name": "supportbot"
  }
}
Image generation
Image LoRA inference is currently available through the in-app playground only. A dedicated image inference API is on the roadmap.