Datasets and formats

The exact shape Soramai expects for training data.

Soramai validates every dataset before queueing a job. This page documents accepted formats, validation rules, and how to convert common dataset layouts into a Soramai-ready format.

Text · JSONL

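Text training accepts a single .jsonl file. Each line is one training example, encoded as a JSON object that fits entirely on that line. Multi-line examples on this page are expanded for readability only.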

Prompt / response (simplest)

{"prompt": "Summarize the changelog.", "response": "Soramai now supports..."}
{"prompt": "Triage this support ticket.", "response": "This appears to be..."}

Chat (multi-turn)

{
  "messages": [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My deploy is stuck on 'building'."},
    {"role": "assistant", "content": "That usually means..."}
  ]
}
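The chat example above is spread over several lines for readability, but each line of the file has to hold one complete example. A minimal sketch of serialising a chat record onto a single line with Python's standard json module follows; the helper name to_jsonl_line is hypothetical and not part of Soramai.

```python
import json

# The chat example from this page, as a Python dict.
example = {
    "messages": [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "My deploy is stuck on 'building'."},
        {"role": "assistant", "content": "That usually means..."},
    ]
}


def to_jsonl_line(record: dict) -> str:
    """Serialise one example as a single line (hypothetical helper).

    json.dumps without `indent` emits no raw newlines; newlines inside
    string values are escaped as \\n. ensure_ascii=False keeps non-ASCII
    characters as-is so the file can be written as UTF-8.
    """
    return json.dumps(record, ensure_ascii=False)


line = to_jsonl_line(example)
```

Writing line followed by "\n" to a file opened with encoding="utf-8" produces one JSONL row.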

Validation rules

  • UTF-8. Files in any other encoding are rejected.
  • Maximum 8,192 tokens per example by default (configurable up to 32,768).
  • At least 32 examples. Soramai warns under 200.
  • Fields outside the documented schema are ignored rather than treated as errors.
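Some of the rules above can be checked locally before uploading. The following is a rough pre-flight sketch, not Soramai's validator: check_jsonl is a hypothetical name, blank lines are skipped as an assumption, and the token limit is not checked because it depends on the tokenizer used for the run.

```python
import json
from pathlib import Path

MIN_EXAMPLES = 32   # fewer than this is rejected
WARN_BELOW = 200    # fewer than this triggers a warning


def check_jsonl(path):
    """Return (errors, warnings) for a local pre-flight check (hypothetical helper)."""
    errors, warnings = [], []
    try:
        text = Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"], warnings

    count = 0
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped, not counted
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        has_pair = "prompt" in record and "response" in record
        has_chat = isinstance(record.get("messages"), list)
        if not (has_pair or has_chat):
            errors.append(f"line {lineno}: no prompt/response or messages")
            continue
        count += 1  # extra fields are ignored, as documented

    if count < MIN_EXAMPLES:
        errors.append(f"{count} examples found, at least {MIN_EXAMPLES} required")
    elif count < WARN_BELOW:
        warnings.append(f"{count} examples found, fewer than {WARN_BELOW}")
    return errors, warnings
```

A call such as check_jsonl("data.jsonl") returns two lists; an empty errors list means the file passed these local checks only.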

Image · ZIP

Image training accepts a single .zip file. Each image is paired (by base filename) with an optional .txt caption.

dataset.zip
├── 01.png
├── 01.txt        # caption (optional)
├── 02.png
├── 02.txt
├── 03.jpg
└── 03.txt

Validation rules

  • PNG, JPG, JPEG, WEBP. Other formats are rejected.
  • Minimum dimension 512 px. Images are bucketed to common aspect ratios at training time.
  • At least 8 images. Soramai warns under 15 for style LoRAs.
  • Captions are optional. If absent, Soramai can auto-caption with a vision model before training. Auto-captions are editable before the run starts.
  • Use a unique trigger token in captions if you want a specific token to invoke the LoRA at inference time.
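The file-level rules above can be sketched as a local check using only the standard library. This is an illustration under stated assumptions, not Soramai's validator: check_image_zip is a hypothetical name, any file that is neither a supported image nor a .txt caption is treated as an error, and the 512 px dimension rule is not checked because it requires decoding the images.

```python
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
MIN_IMAGES = 8    # fewer than this is rejected
WARN_BELOW = 15   # warning threshold documented for style LoRAs


def check_image_zip(path):
    """Return (errors, warnings) for an image dataset archive (hypothetical helper)."""
    errors, warnings = [], []
    with zipfile.ZipFile(path) as zf:
        # Skip directory entries; only files matter here.
        names = [n for n in zf.namelist() if not n.endswith("/")]

    images, captions = set(), set()
    for name in names:
        p = PurePosixPath(name)
        ext = p.suffix.lower()
        if ext in IMAGE_EXTS:
            images.add(p.stem)
        elif ext == ".txt":
            captions.add(p.stem)
        else:
            # Assumption: anything else counts as an unsupported format.
            errors.append(f"{name}: unsupported format")

    # Captions are optional, but a caption without an image is likely a typo.
    for stem in sorted(captions - images):
        warnings.append(f"{stem}.txt: caption has no matching image")

    if len(images) < MIN_IMAGES:
        errors.append(f"{len(images)} images found, at least {MIN_IMAGES} required")
    elif len(images) < WARN_BELOW:
        warnings.append(f"{len(images)} images found, fewer than {WARN_BELOW}")
    return errors, warnings
```

Images without captions are not reported, since captions are optional and can be generated before training.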

Common conversions

If you already have data in a different shape, here is how to bring it over.

Alpaca / instruction-tuning

Concatenate instruction and input (when input is present) into prompt, and map output to response.
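A minimal sketch of this mapping for one record, assuming the usual Alpaca field names (instruction, input, output). The function name and the blank-line separator between instruction and input are illustrative choices, not something Soramai prescribes.

```python
def alpaca_to_prompt_response(record: dict) -> dict:
    """Map one Alpaca-style record to the prompt/response schema (hypothetical helper)."""
    instruction = record["instruction"].strip()
    # `input` is often empty or missing in Alpaca-style data.
    extra = (record.get("input") or "").strip()
    # Assumption: join the two parts with a blank line.
    prompt = f"{instruction}\n\n{extra}" if extra else instruction
    return {"prompt": prompt, "response": record["output"]}


converted = alpaca_to_prompt_response(
    {"instruction": "Summarize the changelog.", "input": "", "output": "Soramai now supports..."}
)
```

Each converted record can then be written as one JSON object per line.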

ShareGPT / conversations

Map each conversation to the messages schema. Replace human with user and gpt with assistant.
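The role mapping might look like the sketch below. It assumes the common ShareGPT export layout, where each record has a conversations list whose turns carry from and value keys; check your own export before relying on it. The function name is hypothetical, and roles other than human and gpt are passed through unchanged.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Map one ShareGPT-style conversation to the messages schema (hypothetical helper)."""
    messages = []
    for turn in record["conversations"]:
        # Assumption: unknown roles (e.g. "system") are kept as-is.
        role = ROLE_MAP.get(turn["from"], turn["from"])
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


converted = sharegpt_to_messages(
    {
        "conversations": [
            {"from": "human", "value": "My deploy is stuck on 'building'."},
            {"from": "gpt", "value": "That usually means..."},
        ]
    }
)
```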

CSV exports

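Convert with a one-liner, assuming the CSV already has prompt and response columns: csvtojson data.csv | jq -c '.[]' > data.jsonl. The .[] filter splits the JSON array that csvtojson emits into one object per line.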
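If csvtojson or jq is not available, the same conversion can be sketched with Python's standard library. The function name is hypothetical, and the sketch assumes the CSV has a header row with columns named prompt and response; any other columns are dropped.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path):
    """Write one prompt/response object per CSV row; return the row count (hypothetical helper)."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Assumption: the header row names the columns `prompt` and `response`.
            example = {"prompt": row["prompt"], "response": row["response"]}
            dst.write(json.dumps(example, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, csv_to_jsonl("data.csv", "data.jsonl") writes the converted file and returns how many examples it contains.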

Folders of captioned images

Zip the folder directly. Soramai accepts any flat layout where each caption is a .txt file that shares its image's base filename.
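Any zip tool works for this. As one possible sketch in Python, the hypothetical helper below archives only the supported image types and .txt captions from the top level of a folder, so that the entries sit at the root of the archive with no subdirectories.

```python
import zipfile
from pathlib import Path

KEEP_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".txt"}


def zip_flat_folder(folder, zip_path):
    """Zip top-level images and captions with no directory prefix (hypothetical helper)."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(Path(folder).iterdir()):
            if f.is_file() and f.suffix.lower() in KEEP_EXTS:
                # arcname=f.name stores the file at the archive root.
                zf.write(f, arcname=f.name)
                added.append(f.name)
    return added
```

The returned list names the files that were archived, which makes it easy to spot an image or caption that was skipped.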