Soramai · Docs
Datasets
Every supervised fine-tune starts with a dataset. Soramai accepts JSONL for text fine-tuning and ZIP archives for image LoRA fine-tuning, and offers three different ways to produce one: bring your own, type by hand, or generate with AI.
Text dataset format (JSONL)
One JSON object per line. Each object represents a single fine-tuning example.
{"prompt": "Summarise this changelog: ...", "response": "1.4 adds streaming inference ..."}
{"prompt": "Classify ticket as bug/feature/question: ...", "response": "feature"}
{"prompt": "Translate to French: 'It is raining'", "response": "Il pleut"}
Required fields
- · prompt: the user input the model will see at inference time
- · response: the target output you want the model to learn
Optional fields
- · system_prompt: per-example system message
- · metadata: arbitrary object, preserved on the row but not used for fine-tuning
Limits
- · Maximum file size: 20 MB per dataset.
- · Maximum rows: 50,000 per single dataset.
- · Maximum sequence length: 2,048 tokens per row (longer rows are truncated to fit the base model context).
- · Encoding: UTF-8. No BOM. LF line endings (Windows CRLF is auto-normalised).
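A minimal sketch of writing a dataset file that fits the format and limits above, using only the Python standard library. The helper names `write_jsonl` and `check_file_limits` are hypothetical and not part of any Soramai SDK, and the byte limit assumes 1 MB = 1024 × 1024 bytes. Token length is not checked here because it depends on the base model's tokenizer.

```python
import json
import os

MAX_BYTES = 20 * 1024 * 1024   # 20 MB per dataset (assumed binary megabytes)
MAX_ROWS = 50_000              # rows per single dataset


def write_jsonl(rows, path):
    """Write one JSON object per line as UTF-8, no BOM, LF line endings."""
    # newline="\n" stops Python from translating "\n" to CRLF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def check_file_limits(path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")
        return problems
    # Split on "\n" only, so unusual separators inside strings are not counted.
    row_count = sum(1 for line in text.split("\n") if line.strip())
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows, limit is {MAX_ROWS}")
    return problems
```

Running `check_file_limits` locally before uploading catches size, row-count, and encoding problems early.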
Image dataset format (ZIP archive)
Use this format to fine-tune image LoRAs on FLUX or SDXL base checkpoints. The dataset is a ZIP archive of image + caption pairs.
my-dataset.zip
├── 001.jpg
├── 001.txt   ← caption for 001.jpg
├── 002.png
├── 002.txt
├── 003.webp
├── 003.txt
└── ...       up to 500 image + caption pairs
Image rules
- · Supported formats: JPG, PNG, WebP
- · Recommended size: 512×512 to 2048×2048
- · Aspect ratio: any (auto-cropped/padded during fine-tuning)
- · Maximum total ZIP size: 500 MB
- · Minimum images: 5 · Maximum: 500
Caption rules
- · One .txt per image, identical basename
- · Plain UTF-8 text
- · Booru-style tags work well: 1girl, blue eyes, smiling
- · Use a unique trigger word in every caption to call the LoRA at inference time
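The image and caption rules above can be checked locally before upload. The following is a sketch under stated assumptions: `check_image_zip` is a hypothetical helper, only the `.jpg`, `.png`, and `.webp` extensions are treated as images (whether `.jpeg` is accepted is not stated), and the size limit assumes 1 MB = 1024 × 1024 bytes. Image dimensions are not inspected.

```python
import os
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = {".jpg", ".png", ".webp"}
MIN_IMAGES, MAX_IMAGES = 5, 500
MAX_ZIP_BYTES = 500 * 1024 * 1024  # assumed binary megabytes


def check_image_zip(path, trigger_word=None):
    """Return a list of problems found in an image + caption ZIP archive."""
    problems = []
    if os.path.getsize(path) > MAX_ZIP_BYTES:
        problems.append("archive is larger than 500 MB")
    with zipfile.ZipFile(path) as zf:
        images, captions = {}, {}  # basename -> archive member name
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            member = PurePosixPath(name)
            ext = member.suffix.lower()
            if ext in IMAGE_EXTS:
                images[member.stem] = name
            elif ext == ".txt":
                captions[member.stem] = name
        if not MIN_IMAGES <= len(images) <= MAX_IMAGES:
            problems.append(
                f"{len(images)} images, expected {MIN_IMAGES} to {MAX_IMAGES}")
        for stem in sorted(images.keys() - captions.keys()):
            problems.append(f"{images[stem]} has no caption file")
        for stem in sorted(captions.keys() - images.keys()):
            problems.append(f"{captions[stem]} has no matching image")
        for stem in sorted(images.keys() & captions.keys()):
            try:
                text = zf.read(captions[stem]).decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"{captions[stem]} is not valid UTF-8")
                continue
            # Optional check: every caption should carry the trigger word.
            if trigger_word and trigger_word not in text:
                problems.append(f"{captions[stem]} is missing the trigger word")
    return problems
```

Passing a `trigger_word` is optional; when given, every caption that lacks it is reported.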
AI-assisted dataset generation
Describe what you want and Soramai produces the JSONL for you. Useful for prototypes, synthetic data augmentation, or when you don't have examples on hand.
- 1. Open Dataset Studio, click Generate with AI.
- 2. Describe the task in plain English: “Customer support replies to billing questions, polite tone, 2–3 sentences”.
- 3. Pick a row count: 100, 500, or 1000. Larger counts cost more but yield more stable fine-tuning.
- 4. Click Generate. Rows stream into the editor live; you can stop, edit, or restart at any point.
- 5. When done, click Save dataset. The full JSONL is stored in your account and immediately usable by the fine-tuning page.
Costs
- · 100 rows: ~0.5 coins (about $0.005)
- · 500 rows: ~2.5 coins
- · 1000 rows: ~5 coins
Generation runs against a high-quality teacher model. Per-row cost scales with prompt complexity — Soramai shows the live estimate in the dashboard before you commit.
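The listed prices work out to roughly 0.005 coins per row, and the "0.5 coins (about $0.005)" figure implies about $0.01 per coin. A small sketch of that baseline arithmetic follows; `estimate_generation_cost` is a hypothetical helper, both constants are derived from the list above rather than from a published rate card, and the live dashboard estimate remains the authoritative number because per-row cost varies with prompt complexity.

```python
COINS_PER_ROW = 0.005  # derived from "100 rows: ~0.5 coins"
USD_PER_COIN = 0.01    # derived from "0.5 coins (about $0.005)"


def estimate_generation_cost(rows):
    """Return (coins, usd) as a rough baseline for a generation job."""
    coins = rows * COINS_PER_ROW
    return coins, coins * USD_PER_COIN


for rows in (100, 500, 1000):
    coins, usd = estimate_generation_cost(rows)
    print(f"{rows} rows: ~{coins:.1f} coins (about ${usd:.3f})")
```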
Crash recovery
Generation jobs are persisted server-side. If you close the tab or your machine crashes mid-generation, opening the same dataset again shows a “Resume in-progress generation?” banner. The recovery window is 30 minutes, counted from the time the job started.
Merging multiple datasets
Combine several smaller datasets into one fine-tuning run without writing a single line of code.
You may have generated five 1000-row datasets for the same task at different times. Rather than re-uploading them as one big file, select all of them in the fine-tuning page and Soramai concatenates them server-side into a single merged dataset before launching the pod.
- 1. Open the fine-tuning page and click Pick from My Datasets.
- 2. Multi-select up to 10 datasets. The footer shows the combined row count and total byte size.
- 3. Click Use N selected. Soramai computes a SHA-256 of the dataset-id list, deduplicates against any previous merge of the same selection, and serves a single signed URL to the pod.
- 4. The fine-tuning run proceeds normally: the worker sees one file, with all rows shuffled by the data loader.
Merge limits
- · Maximum datasets per merge: 10
- · Maximum merged file size: 200 MB
- · Merged files are temporary (7-day TTL); recreated on next fine-tuning launch if needed
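The merge described above can be pictured with a short sketch. This is an illustration, not Soramai's server code: `merge_cache_key` and `merge_datasets` are hypothetical names, and sorting the ids before hashing (so that selection order does not change the key) is an assumption the documentation does not confirm. The size limit assumes 1 MB = 1024 × 1024 bytes.

```python
import hashlib

MAX_DATASETS_PER_MERGE = 10
MAX_MERGED_BYTES = 200 * 1024 * 1024  # assumed binary megabytes


def merge_cache_key(dataset_ids):
    """SHA-256 hex digest of the selected dataset ids.

    Assumption: ids are sorted and newline-joined before hashing.
    """
    joined = "\n".join(sorted(dataset_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def merge_datasets(datasets):
    """datasets maps dataset id -> JSONL bytes. Returns (key, merged bytes)."""
    if not 1 <= len(datasets) <= MAX_DATASETS_PER_MERGE:
        raise ValueError("select between 1 and 10 datasets")
    parts = []
    for ds_id in sorted(datasets):
        data = datasets[ds_id]
        if not data:
            continue  # nothing to add from an empty dataset
        # Guarantee a trailing newline so rows from adjacent files do not fuse.
        parts.append(data if data.endswith(b"\n") else data + b"\n")
    merged = b"".join(parts)
    if len(merged) > MAX_MERGED_BYTES:
        raise ValueError("merged file exceeds 200 MB")
    return merge_cache_key(datasets), merged
```

A key derived only from the id list is what makes deduplication cheap: the same selection maps to the same key, so an earlier merged file can be reused while it is still within its TTL.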
Validation and errors
Soramai validates every dataset before queuing the pod. If validation fails, the run is rejected with a clear error and you are not charged.
- Empty rows (missing prompt or response) are rejected with the exact line number that failed.
- Invalid JSON on any line aborts the upload — the error names the offending line and the parser’s diagnosis.
- Oversize rows (after tokenisation) are silently truncated to fit the context window. The fine-tuning log records the count of affected rows.
- Duplicate detection is not enforced — duplicate rows are accepted and weighted equally during fine-tuning.
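A local pre-check along the lines of the rules above might look like the sketch below. `validate_rows` is a hypothetical helper, its error messages are illustrative rather than Soramai's actual wording, and treating a whitespace-only prompt or response as empty is an assumption. Since duplicates are accepted by the platform, the function only counts exact duplicate rows so you can decide whether to keep them.

```python
import json


def validate_rows(lines):
    """Return (errors, duplicate_count) for an iterable of JSONL lines.

    errors is a list of (line_number, message) tuples, 1-indexed.
    """
    errors = []
    seen = set()
    duplicates = 0
    for number, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((number, "row is not a JSON object"))
            continue
        for field in ("prompt", "response"):
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append((number, f"missing or empty {field}"))
        # Canonical JSON text is used as the key for exact-duplicate counting.
        key = json.dumps(row, sort_keys=True)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return errors, duplicates
```

For a file on disk, pass the open file handle (opened with `encoding="utf-8"`) as `lines`.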
Best practices
Things that materially affect fine-tuning quality.
- Smaller, cleaner datasets beat larger noisy ones. 500 carefully written examples typically outperform 5,000 scraped ones for any single task.
- Diversity within the task. If you want a customer support bot, include examples covering billing, technical support, complaints, compliments, edge cases. Don’t fine-tune on only one ticket category.
- Match inference-time format. Your fine-tuning prompts should look like what real users will type. If users send single questions, fine-tune on single questions — not multi-turn conversations.
- Consistent style in responses. Tone, length, and structure of the response field set the model’s output distribution. Variance here is what the model will replicate.
- Start small. A 100-step run with 200 rows costs pennies and tells you whether your dataset is in the right shape. Scale up once that run is clean.
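Two of the practices above lend themselves to a quick local check: response consistency and starting with a small pilot set. The sketch below is one possible approach, with hypothetical helper names; word count is used as a rough stand-in for response length, and the default of 200 rows simply mirrors the pilot size mentioned above.

```python
import random
import statistics


def response_length_stats(rows):
    """Summarise response lengths (in whitespace-separated words)."""
    lengths = [len(row["response"].split()) for row in rows]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


def pilot_subset(rows, size=200, seed=0):
    """Pick a reproducible random subset for a small trial run."""
    rows = list(rows)
    if len(rows) <= size:
        return rows
    # A seeded Random instance keeps the subset stable between runs.
    return random.Random(seed).sample(rows, size)
```

A wide gap between the minimum and maximum lengths is a hint that the responses vary in structure more than intended.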