Public Developer Docs
Integration docs for KaleidoVid speech, video, and local LLM APIs.
This page documents the current customer-facing Thai ASR streaming API, Reading Score upload API, shared OpenAI-compatible LLM API, and LTX23 image-to-video API exposed by this SaaS site. The examples below are built from the actual routes and backend contracts in this repository.
Thai ASR base URL
https://kaleidovid.com/api/thai-asr
Reading Score URL
https://kaleidovid.com/api/reading-score
https://kaleidovid.com/api/reading-score/task/{taskId}
Shared LLM base URL
https://kaleidovid.com/api/llm/v1
LTX23 Video base URL
https://kaleidovid.com/api/ltx23/v1
Quick start
1. Create or copy a team API key from Dashboard → API Keys.
2. Use `x-api-key` for HTTP calls. Browser WebSocket clients should append `api_key` to the returned `ws_url`.
3. Use Thai ASR for live streaming transcripts, Reading Score for Thai reading evaluation, Shared LLM for OpenAI-compatible local model inference, and LTX23 Video for API-key authenticated image-to-video generation.
Authentication
Use team-scoped API keys.
These APIs are meant to be called with a KaleidoVid team API key. The SaaS dashboard manages key lifecycle, while your integration sends the raw key on every request.
Where to create keys
Create, revoke, restore, and inspect quotas in Dashboard → API Keys. Team owners and admins can manage keys; team members can still inspect existing key policies and usage.
Raw keys are only shown once at creation time. Store them in your own secret manager immediately.
Header and WebSocket auth
x-api-key: kvid_your_prefix_your_secret
Reading Score, Shared LLM, and LTX23 Video also accept:
Authorization: Bearer kvid_your_prefix_your_secret
Browser WebSocket clients should append `api_key=<YOUR_API_KEY>` to the `ws_url` returned by POST /api/thai-asr/sessions.
Thai ASR API
Streaming Thai speech-to-text over HTTP + WebSocket.
The Thai ASR product is a three-step flow: inspect defaults, create a session, then stream PCM audio to the returned WebSocket URL. The service is Thai-only and currently reports the `typhoon` engine in responses.
/api/thai-asr/config
Inspect defaults and limits
Returns the current backend name, decode defaults, tenant information, and request/session/audio quota limits for the calling key.
/api/thai-asr/sessions
Create a streaming session
Returns a fresh `session_id`, the resolved `ws_url`, the accepted config, and current SLA target hints for the low-latency benchmark surface.
/api/thai-asr/stream
Stream audio and receive transcripts
After the server emits `ready`, send a `start` message, then raw PCM16LE mono audio chunks. Watch `partial`, `stabilized_partial`, and `final` events.
GET /api/thai-asr/config
curl -sS \
-H "x-api-key: YOUR_API_KEY" \
"https://kaleidovid.com/api/thai-asr/config"
Config response example
{
"language": "th",
"mode": "benchmark_spike",
"backend": "persistent_decoder",
"engine": "typhoon",
"defaults": {
"sample_rate": 16000,
"frame_ms": 20,
"stream_chunk_ms": 80,
"partial_interval_ms": 40,
"min_decode_audio_ms": 240,
"decode_window_ms": 1600,
"vad": true
},
"notes": [
"This spike measures websocket, VAD, and partial transcript latency for Thai ASR on the 8x4090 host."
],
"tenant": {
"team_id": "team_cuid",
"team_name": "Acme Team",
"key_prefix": "abcd1234"
},
"limits": {
"requests_per_minute": 120,
"concurrent_sessions": 3,
"daily_audio_seconds": 3600,
"today_audio_seconds_used": 420
}
}
Create a session
The session creation payload is JSON. You can omit fields to use the server defaults, but most clients should send the same values they expect to use on the WebSocket start message so there is no mismatch between planning and runtime.
| Field | Type | Required | Notes |
|---|---|---|---|
sample_rate | integer | No | 8,000 to 48,000. Use 16,000 for the current Thai ASR defaults. |
frame_ms | integer | No | 10 to 200. The built-in smoke test uses 20 ms. |
partial_interval_ms | integer | No | 20 to 1,000. Controls how often the service tries to emit new partials. |
min_decode_audio_ms | integer | No | 100 to 5,000. Minimum buffered audio before partial decoding starts. |
decode_window_ms | integer | No | 200 to 15,000. Rolling audio window used for decode requests. |
vad | boolean | No | Defaults to true. Emits `speech_start` when voice activity is detected. |
benchmark_label | string | No | Optional label up to 200 characters. |
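The ranges in the table above can be checked on the client before calling the session route. The following is a minimal sketch, not the server's own validation; the helper name `validate_session_config` and the wording of the messages are hypothetical.

```python
# Ranges copied from the field table above; the helper itself is hypothetical.
SESSION_RANGES = {
    "sample_rate": (8000, 48000),
    "frame_ms": (10, 200),
    "partial_interval_ms": (20, 1000),
    "min_decode_audio_ms": (100, 5000),
    "decode_window_ms": (200, 15000),
}


def validate_session_config(config):
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    for field, (low, high) in SESSION_RANGES.items():
        if field not in config:
            continue  # omitted fields fall back to the server defaults
        value = config[field]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field} must be an integer")
        elif not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")
    if "vad" in config and not isinstance(config["vad"], bool):
        problems.append("vad must be a boolean")
    label = config.get("benchmark_label")
    if label is not None and (not isinstance(label, str) or len(label) > 200):
        problems.append("benchmark_label must be a string of at most 200 characters")
    return problems
```

Running the same check on the WebSocket `start` message helps keep the session payload and the runtime payload aligned.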
POST /api/thai-asr/sessions
curl -sS \
-X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"sample_rate": 16000,
"frame_ms": 20,
"partial_interval_ms": 40,
"min_decode_audio_ms": 240,
"decode_window_ms": 1600,
"vad": true
}' \
"https://kaleidovid.com/api/thai-asr/sessions"
Session response example
{
"session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
"mode": "benchmark_spike",
"language": "th",
"engine": "typhoon",
"backend": "persistent_decoder",
"ws_url": "wss://kaleidovid.com/api/thai-asr/stream?session_id=e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
"config": {
"sample_rate": 16000,
"frame_ms": 20,
"partial_interval_ms": 40,
"min_decode_audio_ms": 240,
"decode_window_ms": 1600,
"vad": true,
"benchmark_label": null
},
"sla_targets": {
"speech_detected_p95_ms": 60,
"first_partial_p95_ms": 250,
"final_after_eou_p95_ms": 800
},
"tenant": {
"team_id": "team_cuid",
"team_name": "Acme Team",
"key_prefix": "abcd1234"
}
}
WebSocket flow
Use the exact `ws_url` returned by session creation. Browser clients typically append `api_key` to that URL because the browser WebSocket API does not let scripts set custom request headers. Binary audio frames are the preferred transport.
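Because the returned URL already carries a `session_id` query parameter, the key is best appended with a URL parser instead of plain string concatenation. A minimal sketch using only the standard library; the helper name `with_api_key` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(ws_url, api_key):
    """Append api_key to ws_url while keeping the existing query parameters."""
    parts = urlsplit(ws_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("api_key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result can be passed to any WebSocket client. Treat the final URL as a secret, since it contains the raw key.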
For live captions, use `stabilized_partial` when available, otherwise fall back to `partial`. Only persist the transcript after receiving `final`.
Client messages
| Message | Type | Required | Notes |
|---|---|---|---|
start | JSON message | Yes | Send once after the server emits `ready`. Carries the same config fields used in session creation. |
binary audio frame | raw bytes | Yes | Recommended path. Send PCM16LE mono audio bytes only, without WAV headers. |
audio | JSON message | No | Fallback JSON shape: `{ "type": "audio", "audio_b64": "..." }` where the payload is base64-encoded PCM16LE audio. |
flush | JSON message | No | Forces the service to emit a best-effort partial from current buffered audio. |
end / end_utterance | JSON message | Yes | Finalizes the utterance and triggers the `final` event. |
reset | JSON message | No | Clears buffered state so another utterance can run on the same socket. |
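For the binary path, each frame is a slice of raw PCM16LE mono bytes. With the default 16,000 Hz sample rate and 20 ms frames, one frame is 16,000 × 0.020 × 2 = 640 bytes. The sketch below shows one way to slice a buffer into frames; the function names are hypothetical and the input is assumed to be header-free PCM already.

```python
def frame_size_bytes(sample_rate=16000, frame_ms=20):
    """Bytes per frame for PCM16LE mono: 2 bytes per sample, 1 channel."""
    return sample_rate * frame_ms // 1000 * 2


def split_pcm(pcm, sample_rate=16000, frame_ms=20):
    """Yield consecutive frames of raw PCM bytes; the last one may be shorter."""
    size = frame_size_bytes(sample_rate, frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

If the audio comes from a WAV file, read the sample data with a WAV parser first so the header bytes never reach the socket.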
Server events
| Event | Type | Emitted | Notes |
|---|---|---|---|
ready | server event | Always | First event on a successful connection. Includes team info, backend, limits, and sample rate. |
started | server event | Always | Confirms the stream is configured and ready for audio frames. |
speech_start | server event | Optional | Emitted when VAD first detects speech. Includes `offset_ms` and `latency_ms`. |
partial | server event | Optional | Best-effort rolling transcript. Includes `text`, `segments`, `audio_ms`, `latency_ms`, `queue_ms`, and `decode_ms`. |
stabilized_partial | server event | Optional | A partial that repeated enough times to look stable. Use this for live captions when available. |
final | server event | Always on success | Final transcript with segments and latency metrics. Persist only this event in production workflows. |
error | server event | On failure | Carries `code` and `error` for issues such as timeout, decode failures, or stream failures. |
reset_ok | server event | Optional | Acknowledges a successful `reset` command. |
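The caption guidance above (prefer `stabilized_partial`, fall back to `partial`, persist only `final`) can be sketched as a small state holder. This is one possible interpretation, not the service's reference client: the class name is hypothetical, and ignoring raw partials once a stabilized one has arrived is a design assumption.

```python
class CaptionState:
    """Tracks live caption text and the transcript that is safe to persist."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.live_text = ""
        self.final_text = None
        self._has_stable = False

    def handle(self, event):
        kind = event.get("type")
        if kind == "stabilized_partial":
            self.live_text = event.get("text", "")
            self._has_stable = True
        elif kind == "partial" and not self._has_stable:
            # Fall back to raw partials only until a stabilized one arrives.
            self.live_text = event.get("text", "")
        elif kind == "final":
            self.final_text = event.get("text", "")
            self.live_text = self.final_text
        elif kind == "reset_ok":
            self.reset()
        elif kind == "error":
            raise RuntimeError(f"{event.get('code')}: {event.get('error')}")
```

Feed every decoded server event into `handle` and store `final_text` only once it is no longer `None`.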
JavaScript integration example
const API_KEY = "YOUR_API_KEY";
const HTTP_BASE = "https://kaleidovid.com";
const config = await fetch(`${HTTP_BASE}/api/thai-asr/config`, {
headers: { "x-api-key": API_KEY },
}).then(async (response) => {
if (!response.ok) throw new Error(await response.text());
return response.json();
});
const session = await fetch(`${HTTP_BASE}/api/thai-asr/sessions`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": API_KEY,
},
body: JSON.stringify({
sample_rate: 16000,
frame_ms: 20,
partial_interval_ms: 40,
min_decode_audio_ms: 240,
decode_window_ms: 1600,
vad: true,
}),
}).then(async (response) => {
if (!response.ok) throw new Error(await response.text());
return response.json();
});
const wsUrl = `${session.ws_url}&api_key=${encodeURIComponent(API_KEY)}`;
const socket = new WebSocket(wsUrl);
socket.onmessage = (event) => {
const payload = JSON.parse(event.data);
switch (payload.type) {
case "ready":
socket.send(JSON.stringify({
type: "start",
sample_rate: 16000,
frame_ms: 20,
partial_interval_ms: 40,
min_decode_audio_ms: 240,
decode_window_ms: 1600,
vad: true,
}));
break;
case "partial":
case "stabilized_partial":
console.log("live", payload.text);
break;
case "final":
console.log("final", payload.text, payload.segments);
break;
case "error":
console.error(payload.code, payload.error);
break;
}
};
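// `pcmChunks` must contain raw PCM16LE mono audio frames without WAV headers.
// Wait for the `started` event before sending audio: calling send() while the
// socket is still connecting throws an InvalidStateError.
socket.addEventListener("message", (event) => {
  if (JSON.parse(event.data).type !== "started") return;
  for (const chunk of pcmChunks) {
    socket.send(chunk);
  }
  socket.send(JSON.stringify({ type: "flush" }));
  socket.send(JSON.stringify({ type: "end" }));
});
`final` event example
{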
"type": "final",
"session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
"engine": "typhoon",
"backend": "persistent_decoder",
"text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ",
"segments": [
{
"start": 0.0,
"end": 2.84,
"text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ"
}
],
"audio_ms": 2840,
"latency_ms": 134,
"queue_ms": 0,
"decode_ms": 134,
"processing_time_ms": 133,
"audio_duration_ms": 2840
}
Reading Score API
Upload a student recording and get Thai best-attempt scoring with structured feedback data.
The Reading Score API is a canonical HTTP upload endpoint on this Next.js app. Send multipart form data with a student recording and either reference text or reference audio. The service selects the best-matching reading attempt from the recording and exposes structured comparison fields that callers can render however they want. Most requests return the final score JSON directly with HTTP 200. Longer jobs may return HTTP 202 with `taskId` and `statusPath`; poll that status route with the same API key until `ready=true`. The default ASR engine is `typhoon`, and callers can override it with `asr_engine=whisper` or the equivalent `scoring_options` field. The current service only supports Thai.
Migration note for existing callers
- Use `/api/reading-score` as the canonical public route for new integrations. The older `/api/low-latency-asr/reading-score` path is still accepted for compatibility.
- Handle two success modes: HTTP 200 returns the final reading-score payload; HTTP 202 returns a queued job with `taskId` and `statusPath`.
- When you receive HTTP 202, call `GET /api/reading-score/task/{taskId}` with the same `x-api-key` until `ready=true`. If `successful=true`, read the final score from `result`.
- Update your response parsing to read `student.selectedAttempt`, `feedback.*`, and `comparison.wordResults[]` instead of treating the API as transcript-only.
- Displayed `scores.*` values are usually scaled into the `80-100` band, but VAD-confirmed no-speech detections now return `0.0` with `assessment.status=no_speech`. Use `diagnostics.rawScores.*` if you need the underlying raw metrics.
- If you stay on Typhoon, optionally handle `student.meta.silenceRemovedOnTyphoon`, `student.meta.speechRegionCount`, `student.meta.windowedRecoveryOnTyphoon`, and `student.meta.windowedRecoveryWindowCount` when Typhoon recovery paths activate.
/api/reading-score
Canonical public route
Use this route for new integrations. It validates the API key, normalizes form fields, forwards the upload into the reading-score backend stack, and records usage under the calling team.
/api/low-latency-asr/reading-score
Legacy compatibility alias
Older clients may still call this path. New integrations should prefer `/api/reading-score`.
/api/reading-score/task/{taskId}
Canonical queued-job status route
Use this only after the upload route returns HTTP 202. It validates the same API key and returns task state, readiness, success/failure, progress when available, and the final score under `result` when complete.
/api/low-latency-asr/reading-score/task/{taskId}
Legacy status alias
Older clients using the legacy upload path may poll this matching status path. New integrations should prefer `/api/reading-score/task/{taskId}`.
Multipart request fields
| Field | Type | Required | Notes |
|---|---|---|---|
student_audio | file | Yes | Student recording to score. This is the canonical field name. |
reference_text | string | One of two | Reference sentence or passage. Required when `reference_audio` is not sent. |
reference_audio | file | One of two | Reference recording. If provided without `reference_text`, the service first transcribes this audio. |
language | string | No | Defaults to `th`. The current service only accepts Thai. |
asr_engine | string | No | Optional ASR override. Supported values are `typhoon` and `whisper`. Defaults to `typhoon`. You can also send this inside `scoring_options`. |
student_id | string | No | Optional caller-supplied student identifier returned in `request.studentId`. |
lesson_id | string | No | Optional caller-supplied lesson identifier returned in `request.lessonId`. |
scoring_options | JSON string | No | Supported keys today include `{ "include_pronunciation": true }` and `{ "asr_engine": "whisper" }`. |
include_pronunciation | boolean-like string | No | Alternative top-level form for pronunciation scoring. `true` is the default behavior. |
Canonical field names and accepted aliases
The canonical request fields are `student_audio`, `reference_text`, `reference_audio`, `student_id`, `lesson_id`, `asr_engine`, and `scoring_options`. The backend also accepts a few compatibility aliases such as `studentAudio`, `audio_file`, `referenceText`, `referenceAudio`, `studentId`, `lessonId`, and `asrEngine`.
curl upload example
curl -sS \
-H "x-api-key: YOUR_API_KEY" \
-F "student_audio=@student.wav" \
-F "reference_text=สวัสดีครับ วันนี้เราจะเรียนภาษาไทย" \
-F "student_id=student_001" \
-F "lesson_id=lesson_01" \
-F "language=th" \
-F "include_pronunciation=true" \
"https://kaleidovid.com/api/reading-score"
Python upload example
# pip install requests
import json
import time

import requests

ENDPOINT_URL = "https://kaleidovid.com/api/reading-score"
# str.removesuffix requires Python 3.9 or newer.
BASE_URL = ENDPOINT_URL.removesuffix("/api/reading-score")
API_KEY = "YOUR_API_KEY"


def read_result_or_poll(response):
    # Anything other than HTTP 202 is either the final score or an error.
    if response.status_code != 202:
        response.raise_for_status()
        return response.json()

    # HTTP 202 means the job was queued; poll statusPath until ready=true.
    queued = response.json()
    status_path = queued.get("statusPath")
    if not status_path:
        return queued

    while True:
        status_response = requests.get(
            f"{BASE_URL}{status_path}",
            headers={"x-api-key": API_KEY},
            timeout=30,
        )
        status_response.raise_for_status()
        status_payload = status_response.json()
        if not status_payload.get("ready"):
            time.sleep(2)
            continue
        if status_payload.get("successful"):
            return status_payload.get("result")
        raise RuntimeError(status_payload.get("error", "Reading score failed"))


with open("student.wav", "rb") as student_audio:
    response = requests.post(
        ENDPOINT_URL,
        headers={"x-api-key": API_KEY},
        data={
            "reference_text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
            "student_id": "student_001",
            "lesson_id": "lesson_01",
            "language": "th",
            "include_pronunciation": "true",
        },
        files={
            "student_audio": ("student.wav", student_audio, "audio/wav"),
        },
        timeout=300,
    )

payload = read_result_or_poll(response)
print(json.dumps(payload, ensure_ascii=False, indent=2))
Response shape
HTTP 200 responses return the normalized request summary, including the resolved ASR engine, the full student transcript plus the best selected attempt, displayed student-facing scores, token-by-token comparison, structured feedback fields, raw diagnostics, and optional forced-alignment metadata. HTTP 202 responses mean the request is still running in the reading-score worker queue; poll the returned `statusPath` with the same auth until `ready=true`, then read `result` when `successful=true`.
Display-score formula: after bounding the raw score to `0-100`, the public `scores.*` value is usually scaled as `80 + raw * 0.2`. VAD-confirmed no-speech detections are the exception: `assessment.status` becomes `no_speech` and `scores.*` are forced to `0.0`.
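The display-score formula can be written out as a short function. This is a sketch of the stated rule only: the function name is hypothetical, and rounding to two decimals and passing `None` through unchanged are assumptions chosen to match the response examples on this page.

```python
def display_score(raw, no_speech=False):
    """Map a raw metric to the displayed score described above."""
    if no_speech:
        return 0.0  # VAD-confirmed no-speech forces every score to 0.0
    if raw is None:
        return None  # assumption: an unavailable metric stays unavailable
    bounded = min(max(raw, 0.0), 100.0)  # bound the raw score to 0-100
    return round(80.0 + bounded * 0.2, 2)
```

For instance, the raw `overall` value of 49.17 shown under `diagnostics.rawScores` in the successful response example maps to the displayed 89.83.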
| Field | Type | Present | Notes |
|---|---|---|---|
assessment.passed | boolean | Always | Customer-facing assessment flag. Typical reads return `true`, while VAD-confirmed no-speech detections return `false`. |
assessment.status | string | Always | Customer-facing status string. Typical successful reads return `passed`; no-speech is only returned after the silence detector and VAD both find no usable voice activity. |
request.asrEngine | string | Always | The resolved ASR engine that actually handled the request. Use this instead of assuming the default from your client code. |
student.selectedAttempt | object or null | Maybe | Describes the best-matching reading attempt selected from the full recording before scoring. |
scores.overall | number or null | Always | Displayed student-facing score. The service usually scales raw scores into an encouragement band of `80-100`, but VAD-confirmed no-speech detections return `0.0`. Raw unclamped values live under `diagnostics.rawScores`. |
scores.textAccuracy | number | Always | Displayed text-accuracy score for the selected best attempt. |
scores.pronunciation | number or null | Maybe | Displayed pronunciation score derived from alignment confidence. `null` means pronunciation scoring was unavailable. |
scores.levenshteinSimilarity | number | Always | Displayed edit-distance similarity for the selected attempt. |
feedback.pronunciationFocusWords[] | array | Always | Reference words whose pronunciation confidence was weak enough that you may want to flag them in your own UI. |
feedback.missingWords[] / feedback.extraWords[] | array | Always | Structured token lists for omitted reference words and extra spoken words. |
feedback.substitutions[] | array | Always | Structured expected/actual pairs for substitution mismatches. |
diagnostics.rawScores | object | Always | Underlying unclamped metrics for internal review, analytics, or debugging. |
student.meta.noSpeechDetected / student.meta.audioDuration | boolean / number | Maybe | Optional no-speech metadata. When `noSpeechDetected` is `true`, both the silence pass and VAD found no usable student voice, `assessment.status` becomes `no_speech`, and `scores.*` are forced to `0.0`. `audioDuration` reports the analyzed clip length. |
student.meta.voiceActivityChecked / voiceActivityDetected | boolean / boolean | Maybe | Optional VAD metadata for no-speech decisions. `voiceActivityChecked=true` means the service ran the follow-up VAD pass; `voiceActivityDetected=false` is required before the service returns `assessment.status=no_speech`. |
student.meta.voiceActivityDuration / voiceActivityDetector | number / string | Maybe | Optional VAD detail. `voiceActivityDuration` reports the estimated voiced duration in seconds, and `voiceActivityDetector` currently reports `torchaudio_vad`. |
student.meta.silenceRemovedOnTyphoon / student.meta.speechRegionCount | boolean / number | Maybe | Optional Typhoon-only metadata for long-gap recovery. `silenceRemovedOnTyphoon` means the service retried on a silence-removed copy, and `speechRegionCount` reports how many speech regions were detected. |
student.meta.windowedRecoveryOnTyphoon / student.meta.windowedRecoveryWindowCount | boolean / number | Maybe | Optional Typhoon-only metadata for reference-aware boundary recovery. `windowedRecoveryOnTyphoon` means the service retried short overlapping windows after the first transcript looked like a strict boundary-truncated slice of the expected text, and `windowedRecoveryWindowCount` reports how many windows were tested. |
comparison.wordResults[] | array | Always | Token-by-token breakdown with `status` = `correct`, `missing`, `extra`, or `substitution`. |
alignment.used | boolean | Always | Indicates whether pronunciation alignment was strong enough to contribute to the score. |
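A caller that renders feedback usually needs only a handful of these fields. The sketch below pulls them out of a parsed response; the function name and the shape of the returned summary are hypothetical, and falling back to `studentToken` when `referenceToken` is empty is an assumption about how `extra` entries are populated.

```python
def summarize_result(result):
    """Collect the fields a UI typically needs from a reading-score payload."""
    by_status = {"correct": [], "missing": [], "extra": [], "substitution": []}
    for word in result.get("comparison", {}).get("wordResults", []):
        # Assumption: extra words carry no reference token, so use the spoken one.
        token = word.get("referenceToken") or word.get("studentToken")
        by_status.setdefault(word.get("status"), []).append(token)
    return {
        "no_speech": result.get("assessment", {}).get("status") == "no_speech",
        "overall": result.get("scores", {}).get("overall"),
        "engine": result.get("request", {}).get("asrEngine"),
        "words_by_status": by_status,
    }
```

Reading `engine` from the response, instead of from client configuration, follows the note on `request.asrEngine` in the table above.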
Successful response example
{
"success": true,
"assessment": {
"passed": true,
"status": "passed",
"displayScoreMin": 80.0,
"usedBestAttemptSelection": true
},
"request": {
"language": "th",
"asrEngine": "typhoon",
"studentId": "student_001",
"lessonId": "lesson_01",
"includePronunciation": true,
"referenceSource": "referenceText"
},
"reference": {
"text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
"normalizedText": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
"tokens": ["สวัสดีครับ", "วันนี้", "เรา", "จะ", "เรียน", "ภาษาไทย"],
"tokenCount": 6,
"meta": {}
},
"student": {
"transcript": "สวัสดีครับ วันนี้ เรียน ภาษาใจ",
"fullTranscript": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย สวัสดีครับ วันนี้ เรียน ภาษาใจ",
"normalizedText": "สวัสดีครับ วันนี้ เรียน ภาษาใจ",
"tokens": ["สวัสดีครับ", "วันนี้", "เรียน", "ภาษาใจ"],
"tokenCount": 4,
"selectedAttempt": {
"mode": "best_token_window",
"applied": true,
"candidateCount": 32,
"multipleAttemptsDetected": true,
"start": 2.84,
"end": 4.96,
"durationSeconds": 2.12,
"tokenStartIndex": 6,
"tokenEndIndex": 9,
"hasTiming": true
},
"meta": {
"model": "scb10x/typhoon-asr-realtime",
"device": "cuda",
"speechRegionCount": 2,
"silenceRemovedOnTyphoon": true
}
},
"scores": {
"overall": 89.83,
"textAccuracy": 90.0,
"pronunciation": 89.45,
"levenshteinSimilarity": 90.0
},
"comparison": {
"correctTokenCount": 3,
"referenceTokenCount": 6,
"studentTokenCount": 4,
"missingTokenCount": 2,
"extraTokenCount": 0,
"substitutionCount": 1,
"wordResults": [
{
"referenceIndex": 0,
"studentIndex": 0,
"status": "correct",
"referenceToken": "สวัสดีครับ",
"studentToken": "สวัสดีครับ",
"pronunciationConfidence": 0.82
},
{
"referenceIndex": 2,
"studentIndex": null,
"status": "missing",
"referenceToken": "เรา",
"studentToken": null,
"pronunciationConfidence": 0.33
},
{
"referenceIndex": 5,
"studentIndex": 3,
"status": "substitution",
"referenceToken": "ภาษาไทย",
"studentToken": "ภาษาใจ",
"pronunciationConfidence": 0.41
}
]
},
"feedback": {
"pronunciationFocusWords": ["เรา", "จะ", "ภาษาไทย"],
"missingWords": ["เรา", "จะ"],
"extraWords": [],
"substitutions": [
{
"expected": "ภาษาไทย",
"actual": "ภาษาใจ"
}
]
},
"diagnostics": {
"rawScores": {
"overall": 49.17,
"textAccuracy": 50.0,
"pronunciation": 47.25,
"levenshteinSimilarity": 50.0
}
},
"alignment": {
"used": true,
"averageConfidence": 0.4725,
"words": [
{
"index": 0,
"word": "สวัสดีครับ",
"start": 0.0,
"end": 0.34,
"confidence": 0.82
},
{
"index": 2,
"word": "เรา",
"start": 0.62,
"end": 0.84,
"confidence": 0.33
},
{
"index": 5,
"word": "ภาษาไทย",
"start": 1.55,
"end": 1.93,
"confidence": 0.41
}
]
}
}
No-speech response example
{
"success": true,
"assessment": {
"passed": false,
"status": "no_speech",
"displayScoreMin": 0.0,
"usedBestAttemptSelection": false
},
"request": {
"language": "th",
"asrEngine": "typhoon",
"studentId": null,
"lessonId": null,
"includePronunciation": true,
"referenceSource": "referenceText"
},
"reference": {
"text": "สวัสดีครับ",
"normalizedText": "สวัสดีครับ",
"tokens": ["สวัสดี", "ครับ"],
"tokenCount": 2,
"meta": {}
},
"student": {
"transcript": "",
"fullTranscript": "",
"normalizedText": "",
"tokens": [],
"tokenCount": 0,
"selectedAttempt": {
"mode": "no_speech_detected",
"applied": false,
"candidateCount": 0,
"multipleAttemptsDetected": false,
"start": null,
"end": null,
"durationSeconds": null,
"tokenStartIndex": null,
"tokenEndIndex": null,
"hasTiming": false
},
"meta": {
"noSpeechDetected": true,
"speechRegionCount": 0,
"audioDuration": 1.0,
"voiceActivityChecked": true,
"voiceActivityDetected": false,
"voiceActivityDuration": 0.0,
"voiceActivityDetector": "torchaudio_vad"
}
},
"scores": {
"overall": 0.0,
"textAccuracy": 0.0,
"pronunciation": 0.0,
"levenshteinSimilarity": 0.0
},
"comparison": {
"correctTokenCount": 0,
"referenceTokenCount": 2,
"studentTokenCount": 0,
"missingTokenCount": 2,
"extraTokenCount": 0,
"substitutionCount": 0,
"wordResults": [
{
"referenceIndex": 0,
"studentIndex": null,
"status": "missing",
"referenceToken": "สวัสดี",
"studentToken": null,
"pronunciationConfidence": null
},
{
"referenceIndex": 1,
"studentIndex": null,
"status": "missing",
"referenceToken": "ครับ",
"studentToken": null,
"pronunciationConfidence": null
}
]
},
"feedback": {
"pronunciationFocusWords": [],
"missingWords": ["สวัสดี", "ครับ"],
"extraWords": [],
"substitutions": []
},
"diagnostics": {
"rawScores": {
"overall": 0.0,
"textAccuracy": 0.0,
"pronunciation": 0.0,
"levenshteinSimilarity": 0.0
}
},
"alignment": {
"used": false,
"averageConfidence": null,
"words": [],
"meta": {
"reason": "no_speech_detected"
}
}
}
Queued upload response example
{
"success": true,
"queued": true,
"taskId": "a1f4387a-09fd-4ac9-8cd5-4eac34f6f6cc",
"state": "PENDING",
"ready": false,
"statusPath": "/api/reading-score/task/a1f4387a-09fd-4ac9-8cd5-4eac34f6f6cc"
}
Completed status response example
{
"taskId": "a1f4387a-09fd-4ac9-8cd5-4eac34f6f6cc",
"state": "SUCCESS",
"ready": true,
"successful": true,
"result": {
"success": true,
"assessment": {
"passed": true,
"status": "passed",
"displayScoreMin": 80.0,
"usedBestAttemptSelection": true
},
"scores": {
"overall": 89.83,
"textAccuracy": 90.0,
"pronunciation": 89.45,
"levenshteinSimilarity": 90.0
}
}
}
When `alignment.used` is false, pronunciation scoring was unavailable or too weak to trust. On non-silent reads, `scores.pronunciation` may become `null`, while the customer-facing `scores.*` fields usually still stay in the encouragement range. If the service confirms no speech with VAD, `assessment.status` becomes `no_speech`, `assessment.passed` becomes `false`, and `scores.*` return `0.0`. Check `student.meta.voiceActivity*` for the VAD decision and `diagnostics.rawScores` when you need the underlying unclamped metrics or the raw text-only calculation.
Shared LLM API
Use one OpenAI-compatible base URL for pooled local models.
The shared LLM API exposes a stable OpenAI-compatible surface under this SaaS domain. Today it is backed by the broker-managed Qwen3 pool, and future local models can be added behind the same `/api/llm/v1` base URL so callers do not need to re-integrate.
/api/llm/v1/models
List live models
Returns the models currently exposed by the shared local LLM stack. Use this when you want runtime discovery instead of hard-coding one model forever.
/api/llm/v1/chat/completions
Create a chat completion
Accepts OpenAI-style `model`, `messages`, `temperature`, and `max_tokens` fields and forwards the request into the pooled local model router.
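Runtime discovery can be reduced to a small selection step over the parsed `/models` response. The sketch below is illustrative only: the helper name `pick_model` and the idea of a caller-supplied preference list are assumptions, not part of the API.

```python
def pick_model(models_payload, preferred=("qwen3-4b",)):
    """Choose a model id from a parsed /models response."""
    ids = [
        item["id"]
        for item in models_payload.get("data", [])
        if item.get("object") == "model" and "id" in item
    ]
    if not ids:
        raise LookupError("no models are currently exposed")
    for name in preferred:
        if name in ids:
            return name
    return ids[0]  # fall back to whatever the service lists first
```

The returned id can then be passed as `model` in the chat completion request.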
GET /api/llm/v1/models
curl -sS \
-H "x-api-key: YOUR_API_KEY" \
"https://kaleidovid.com/api/llm/v1/models"
Models response example
{
"object": "list",
"data": [
{
"id": "qwen3-4b",
"object": "model",
"owned_by": "kaleidovid-local"
}
]
}
POST /api/llm/v1/chat/completions
curl -sS \
-X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"model": "qwen3-4b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain why a shared local LLM endpoint is useful." }
],
"temperature": 0.4,
"max_tokens": 256
}' \
"https://kaleidovid.com/api/llm/v1/chat/completions"
Python OpenAI SDK example
from openai import OpenAI
client = OpenAI(
api_key="YOUR_API_KEY",
base_url="https://kaleidovid.com/api/llm/v1",
)
response = client.chat.completions.create(
model="qwen3-4b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the API in one short paragraph."},
],
temperature=0.4,
max_tokens=256,
)
print(response.choices[0].message.content)
The examples above use `qwen3-4b`. The important contract is the base URL, not the model family: as more local models are added, callers should discover them through `/api/llm/v1/models` and then choose a model id dynamically.
LTX23 Video API
Create Standard AV image-to-video jobs with the LTX23 video API.
The LTX23 Video API exposes a small asynchronous image-to-video surface under `/api/ltx23/v1`. Submit a prompt and input image, receive a `request_id`, then poll the status endpoint until the job returns a generated MP4 URL or a failure reason.
/api/ltx23/v1/videos/generations
Start image-to-video generation
Creates a Standard AV job. The public route validates the API key, records usage, and forwards the request to the Django task queue backed by the broker-managed LTX23 worker pool.
/api/ltx23/v1/videos/{request_id}
Check generation status
Returns `queued`, `processing`, `done`, `failed`, or `canceled`. When the job is done, the response includes `video.url`.
JSON request fields
The endpoint accepts JSON only. For direct file uploads from your own server or browser, convert the image to a base64 data URL and send it as image.url. Remote image URLs must be publicly reachable.
| Field | Type | Required | Notes |
|---|---|---|---|
model | string | No | Defaults to `ltx23-standard`. Accepted aliases include `ltx23-standard`, `ltx23-standard-fast`, `ltx23-standard-quality`, `ltx23-standard-full`, `ltx23-standard-distilled`, `ltx23`, and `ltx-2.3`. |
prompt | string | Yes | Text instruction for how the input image should animate. |
image.url | string | Yes | A base64 data URL, a same-site `/uploads/...` path, or a public `http(s)` image URL. JPG, PNG, and WEBP are accepted up to 25 MB. Private network hosts and redirects are rejected for remote URLs. |
duration | integer | No | Defaults to 5. Must be between 2 and 10 seconds. |
aspect_ratio | string | No | Defaults to `16:9`. Supported values are `16:9`, `9:16`, and `1:1`. The camelCase alias `aspectRatio` is also accepted. |
resolution | string | No | Defaults to `720p`. Supported values are `480p` and `720p`. |
seed | integer | No | Use `-1` or omit the field for a random seed. Non-negative seeds are clamped to the service-safe range. |
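The constraints in the table can be enforced while the request body is built, so that obvious mistakes fail before the request is sent. A minimal sketch with a hypothetical helper name; it checks only the documented ranges and does not inspect the image itself.

```python
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
RESOLUTIONS = {"480p", "720p"}


def build_generation_payload(prompt, image_url, duration=5, aspect_ratio="16:9",
                             resolution="720p", seed=-1, model="ltx23-standard"):
    """Return a JSON-serializable body for POST /videos/generations."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if not image_url:
        raise ValueError("image.url is required")
    if not 2 <= duration <= 10:
        raise ValueError("duration must be between 2 and 10 seconds")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    return {
        "model": model,
        "prompt": prompt,
        "image": {"url": image_url},
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "seed": seed,
    }
```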
POST /api/ltx23/v1/videos/generations
curl -sS \
-X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"model": "ltx23-standard",
"prompt": "A cinematic product shot with gentle camera motion.",
"image": {
"url": "https://example.com/input.png"
},
"duration": 5,
"aspect_ratio": "16:9",
"resolution": "720p",
"seed": -1
}' \
"https://kaleidovid.com/api/ltx23/v1/videos/generations"
Queued response example
{
"request_id": "6bcd2f29-4b74-4479-a3b3-f71a50a77dab",
"status": "queued",
"model": "ltx23-standard"
}
GET /api/ltx23/v1/videos/{request_id}
curl -sS \
-H "x-api-key: YOUR_API_KEY" \
"https://kaleidovid.com/api/ltx23/v1/videos/6bcd2f29-4b74-4479-a3b3-f71a50a77dab"
Done response example
{
"request_id": "6bcd2f29-4b74-4479-a3b3-f71a50a77dab",
"status": "done",
"model": "ltx23-standard",
"video": {
"url": "/uploads/generations/ltx23/ltx23_av_20260426_abcdef.mp4"
}
}
Status response fields
| Field | Type | Present | Notes |
|---|---|---|---|
request_id | string | Always | Celery task id used to poll the status endpoint. |
status | string | Always | One of `queued`, `processing`, `done`, `failed`, or `canceled`. |
model | string | Always | The public model alias associated with the request. |
video.url | string | When done | Relative or absolute URL of the generated MP4. Present only when `status=done`. |
progress | object | Maybe | Best-effort task progress metadata while the job is processing. |
error | string | On failure | Failure reason returned when the job fails. |
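Since `video.url` may be relative, a client usually needs to resolve it before downloading. The sketch below assumes that relative paths resolve against the `https://kaleidovid.com` origin, which this page implies but does not state; the helper names are hypothetical.

```python
from urllib.parse import urljoin

TERMINAL_STATUSES = {"done", "failed", "canceled"}


def is_terminal(status_payload):
    """True once polling can stop."""
    return status_payload.get("status") in TERMINAL_STATUSES


def resolve_video_url(status_payload, site="https://kaleidovid.com"):
    """Return an absolute MP4 URL, or None unless the job is done."""
    if status_payload.get("status") != "done":
        return None
    url = (status_payload.get("video") or {}).get("url")
    # urljoin leaves absolute URLs untouched and resolves relative paths.
    return urljoin(site, url) if url else None
```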
Python polling example
# pip install requests
import base64
import mimetypes
import time
from pathlib import Path

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://kaleidovid.com/api/ltx23/v1"

# Encode the local image as a base64 data URL for image.url.
image_path = Path("input.png")
mime_type = mimetypes.guess_type(image_path.name)[0] or "image/png"
image_data_url = (
    f"data:{mime_type};base64,"
    + base64.b64encode(image_path.read_bytes()).decode("ascii")
)

start = requests.post(
    f"{BASE_URL}/videos/generations",
    headers={"x-api-key": API_KEY},
    json={
        "model": "ltx23-standard",
        "prompt": "A cinematic product shot with gentle camera motion.",
        "image": {"url": image_data_url},
        "duration": 5,
        "aspect_ratio": "16:9",
        "resolution": "720p",
        "seed": -1,
    },
    timeout=180,
)
start.raise_for_status()
request_id = start.json()["request_id"]

# Poll until the job reaches a terminal status.
while True:
    status = requests.get(
        f"{BASE_URL}/videos/{request_id}",
        headers={"x-api-key": API_KEY},
        timeout=60,
    )
    status.raise_for_status()
    payload = status.json()
    if payload["status"] in {"done", "failed", "canceled"}:
        break
    time.sleep(5)

print(payload)
Failed response example
{
"request_id": "6bcd2f29-4b74-4479-a3b3-f71a50a77dab",
"status": "failed",
"model": "ltx23-standard",
"error": "Remote LTX image-to-video failed: upstream message"
}
Errors And Limits
Expect explicit auth, quota, and upstream failures.
These APIs share the same team API key system, but their runtime error surfaces differ. Thai ASR enforces request and concurrent-session limits directly; Reading Score mainly returns validation or upstream-transcription failures; Shared LLM can surface model saturation; LTX23 Video can return validation, upstream timeout, or async job failure responses.
| Code | Transport | APIs | Notes |
|---|---|---|---|
401 | HTTP | Thai ASR + Reading Score + Shared LLM + LTX23 Video | Missing or invalid API key. |
400 | HTTP | LTX23 Video | Invalid JSON body, missing `prompt`, missing `image.url`, unsupported model alias, invalid duration, aspect ratio, resolution, or image source. |
429 | HTTP / WS close | Thai ASR | Request-per-minute limit or concurrent-session limit exceeded. WebSocket clients can see close code `4429`. |
4401 | WS close | Thai ASR | WebSocket authentication failed. |
503 | HTTP | Thai ASR + Reading Score + Shared LLM | Policy lookup or the upstream ASR / transcription / LLM runtime was unavailable. |
502 | HTTP | LTX23 Video | The Next.js public route could not reach the Django/LTX23 upstream, or the upstream timed out. |
429 | HTTP | Shared LLM | The shared LLM runtime or upstream worker was saturated. |
1011 | WS close | Thai ASR | Unexpected server-side streaming failure. |
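The codes in the table fall into a few groups that a client can branch on. The sketch below groups them by meaning; the category names and the retry rule are assumptions made for illustration, not behavior mandated by the service.

```python
def classify_failure(code):
    """Map an HTTP status or WebSocket close code to a coarse category."""
    if code in (401, 4401):
        return "auth"             # missing or invalid API key
    if code == 400:
        return "invalid_request"  # fix the payload before retrying
    if code in (429, 4429):
        return "rate_limited"     # quota, session limit, or LLM saturation
    if code in (502, 503, 1011):
        return "upstream"         # backend unreachable or failed mid-stream
    return "unknown"


def is_retryable(code):
    # Assumption: only rate-limit and upstream failures are worth retrying.
    return classify_failure(code) in {"rate_limited", "upstream"}
```

Pair `is_retryable` with a backoff delay so that retries do not add to the same rate-limit bucket.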
Thai ASR quota notes
- The HTTP config and session routes can reject requests when the team request bucket is exhausted.
- The WebSocket service can reject or close sessions when concurrent-session limits are exceeded.
- Daily audio quota is checked during streaming as audio accumulates.
Reading Score validation notes
- `student_audio` is required.
- You must send at least one of `reference_text` or `reference_audio`.
- `language` must resolve to Thai. Other values are rejected.
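These rules can be mirrored on the client before uploading. The sketch below is stricter than the service in one respect, which is an assumption: it accepts only the literal value `th` for `language`, whereas the backend may resolve other spellings to Thai. The function name is hypothetical.

```python
def validate_upload(fields, files):
    """Return a list of problems for a Reading Score upload; empty means OK."""
    problems = []
    if "student_audio" not in files:
        problems.append("student_audio is required")
    if not fields.get("reference_text") and "reference_audio" not in files:
        problems.append("send reference_text or reference_audio")
    # Assumption: only the literal "th" is treated as Thai here.
    if fields.get("language", "th") != "th":
        problems.append("language must be th")
    if fields.get("asr_engine", "typhoon") not in {"typhoon", "whisper"}:
        problems.append("asr_engine must be typhoon or whisper")
    return problems
```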
Legacy Routes
Prefer the canonical routes for new integrations.
Compatibility paths still exist for older callers, but new integrations should standardize on the canonical endpoints below.
| Canonical | Legacy Alias | Notes |
|---|---|---|
https://kaleidovid.com/api/thai-asr | https://kaleidovid.com/v2/asr/low-latency/th | Thai ASR streaming surface. Prefer `/api/thai-asr/*`. |
https://kaleidovid.com/api/reading-score | https://kaleidovid.com/api/low-latency-asr/reading-score | Reading Score upload API. Prefer `/api/reading-score`. |
https://kaleidovid.com/api/reading-score/task/{taskId} | https://kaleidovid.com/api/low-latency-asr/reading-score/task/{taskId} | Reading Score queued-job status API. Use after an upload returns HTTP 202. |
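When migrating an older client, the legacy prefixes in the table can be rewritten mechanically. The sketch below assumes that sub-paths under a legacy prefix map one-to-one onto the canonical prefix, which the table confirms for the Reading Score task route but only implies for Thai ASR; the helper name is hypothetical.

```python
# Legacy prefix -> canonical prefix, taken from the table above.
LEGACY_PREFIXES = [
    ("/api/low-latency-asr/reading-score", "/api/reading-score"),
    ("/v2/asr/low-latency/th", "/api/thai-asr"),
]


def to_canonical(path):
    """Rewrite a legacy request path to its canonical equivalent."""
    for legacy, canonical in LEGACY_PREFIXES:
        if path == legacy or path.startswith(legacy + "/"):
            return canonical + path[len(legacy):]
    return path  # already canonical or not a known legacy path
```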
