
Public Developer Docs

Integration docs for KaleidoVid Thai speech products.

This page documents the current customer-facing Thai ASR streaming API and the Reading Score upload API exposed by this SaaS site. The examples below are built from the actual routes and backend contracts in this repository.

Thai ASR base URL

https://kaleidovid.com/api/thai-asr

Reading Score base URL

https://kaleidovid.com/api/reading-score

Quick start

  1. Create or copy a team API key from Dashboard → API Keys.
  2. Use x-api-key for HTTP calls. Browser WebSocket clients should append api_key to the returned ws_url.
  3. Use Thai ASR for live streaming transcripts, or Reading Score when you need token-level Thai reading evaluation.

Authentication

Use team-scoped API keys.

Both APIs are meant to be called with a KaleidoVid team API key. The SaaS dashboard manages key lifecycle, while your integration sends the raw key on every request.

Where to create keys

Create, revoke, restore, and inspect quotas in Dashboard → API Keys. Team owners and admins can manage keys; team members can still inspect existing key policies and usage.

Raw keys are only shown once at creation time. Store them in your own secret manager immediately.

Header and WebSocket auth

Sample
x-api-key: kvid_your_prefix_your_secret

Reading Score also accepts:
Authorization: Bearer kvid_your_prefix_your_secret

Browser WebSocket clients should append `api_key=<YOUR_API_KEY>`
to the `ws_url` returned by POST /api/thai-asr/sessions.
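Building the authenticated URL is a one-liner; this sketch uses a placeholder `ws_url` rather than a real session, and joins with `&` because the returned URL already carries a `session_id` query parameter:

```python
from urllib.parse import urlencode

# Placeholder values; use the real ws_url from POST /api/thai-asr/sessions.
ws_url = "wss://kaleidovid.com/api/thai-asr/stream?session_id=abc123"
api_key = "kvid_your_prefix_your_secret"

# ws_url already has a query string, so join with "&"; fall back to "?" otherwise.
sep = "&" if "?" in ws_url else "?"
authed_ws_url = f"{ws_url}{sep}{urlencode({'api_key': api_key})}"
```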

Thai ASR API

Streaming Thai speech-to-text over HTTP + WebSocket.

The Thai ASR product is a three-step flow: inspect defaults, create a session, then stream PCM audio to the returned WebSocket URL. The service is Thai-only and currently reports the `typhoon` engine in responses.

GET /api/thai-asr/config

Inspect defaults and limits

Returns the current backend name, decode defaults, tenant information, and request/session/audio quota limits for the calling key.

POST /api/thai-asr/sessions

Create a streaming session

Returns a fresh `session_id`, the resolved `ws_url`, the accepted config, and current SLA target hints for the low-latency benchmark surface.

WS /api/thai-asr/stream

Stream audio and receive transcripts

After the server emits `ready`, send a `start` message, then raw PCM16LE mono audio chunks. Watch `partial`, `stabilized_partial`, and `final` events.

GET /api/thai-asr/config

Sample
curl -sS \
  -H "x-api-key: YOUR_API_KEY" \
  "https://kaleidovid.com/api/thai-asr/config"

Config response example

Sample
{
  "language": "th",
  "mode": "benchmark_spike",
  "backend": "persistent_decoder",
  "engine": "typhoon",
  "defaults": {
    "sample_rate": 16000,
    "frame_ms": 20,
    "stream_chunk_ms": 80,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true
  },
  "notes": [
    "This spike measures websocket, VAD, and partial transcript latency for Thai ASR on the 8x4090 host."
  ],
  "tenant": {
    "team_id": "team_cuid",
    "team_name": "Acme Team",
    "key_prefix": "abcd1234"
  },
  "limits": {
    "requests_per_minute": 120,
    "concurrent_sessions": 3,
    "daily_audio_seconds": 3600,
    "today_audio_seconds_used": 420
  }
}
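A client can derive its remaining daily audio budget from the `limits` block before opening a stream. A minimal sketch, using the values from the example above:

```python
# limits block as returned by GET /api/thai-asr/config (values from the example above).
limits = {
    "requests_per_minute": 120,
    "concurrent_sessions": 3,
    "daily_audio_seconds": 3600,
    "today_audio_seconds_used": 420,
}

# Seconds of audio still available to stream today.
remaining_audio_s = limits["daily_audio_seconds"] - limits["today_audio_seconds_used"]
```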

Create a session

The session creation payload is JSON. You can omit fields to use the server defaults, but most clients should send the same values they expect to use on the WebSocket start message so there is no mismatch between planning and runtime.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| sample_rate | integer | No | 8,000 to 48,000. Use 16,000 for the current Thai ASR defaults. |
| frame_ms | integer | No | 10 to 200. The built-in smoke test uses 20 ms. |
| partial_interval_ms | integer | No | 20 to 1,000. Controls how often the service tries to emit new partials. |
| min_decode_audio_ms | integer | No | 100 to 5,000. Minimum buffered audio before partial decoding starts. |
| decode_window_ms | integer | No | 200 to 15,000. Rolling audio window used for decode requests. |
| vad | boolean | No | Defaults to true. Emits `speech_start` when voice activity is detected. |
| benchmark_label | string | No | Optional label up to 200 characters. |

POST /api/thai-asr/sessions

Sample
curl -sS \
  -X POST \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "sample_rate": 16000,
    "frame_ms": 20,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true
  }' \
  "https://kaleidovid.com/api/thai-asr/sessions"

Session response example

Sample
{
  "session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "mode": "benchmark_spike",
  "language": "th",
  "engine": "typhoon",
  "backend": "persistent_decoder",
  "ws_url": "wss://kaleidovid.com/api/thai-asr/stream?session_id=e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "config": {
    "sample_rate": 16000,
    "frame_ms": 20,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true,
    "benchmark_label": null
  },
  "sla_targets": {
    "speech_detected_p95_ms": 60,
    "first_partial_p95_ms": 250,
    "final_after_eou_p95_ms": 800
  },
  "tenant": {
    "team_id": "team_cuid",
    "team_name": "Acme Team",
    "key_prefix": "abcd1234"
  }
}

WebSocket flow

Use the exact ws_url returned by session creation. Browser clients typically append api_key to that URL because custom WebSocket headers are harder to set in the browser. Binary audio frames are the preferred transport.

Production caption rule: render one live caption from stabilized_partial when available, otherwise fall back to partial. Only persist the transcript after receiving final.
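The caption rule above can be sketched as a small event-handling state machine. This is an illustrative sketch, not a prescribed client implementation; event names match the server events documented below:

```python
# Caption state: latest stabilized text, latest raw partial, persisted finals.
state = {"stabilized": "", "partial": "", "finals": []}

def live_caption() -> str:
    # Prefer stabilized text; fall back to the latest raw partial.
    return state["stabilized"] or state["partial"]

def handle_event(event: dict) -> None:
    kind = event.get("type")
    if kind == "partial":
        state["partial"] = event["text"]
    elif kind == "stabilized_partial":
        state["stabilized"] = event["text"]
    elif kind == "final":
        state["finals"].append(event["text"])  # only finals are persisted
        state["stabilized"] = state["partial"] = ""
```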

Client messages

| Message | Type | Required | Notes |
| --- | --- | --- | --- |
| start | JSON message | Yes | Send once after the server emits `ready`. Carries the same config fields used in session creation. |
| binary audio frame | raw bytes | Yes | Recommended path. Send PCM16LE mono audio bytes only, without WAV headers. |
| audio | JSON message | No | Fallback JSON shape: `{ "type": "audio", "audio_b64": "..." }` where the payload is base64-encoded PCM16LE audio. |
| flush | JSON message | No | Forces the service to emit a best-effort partial from current buffered audio. |
| end / end_utterance | JSON message | Yes | Finalizes the utterance and triggers the `final` event. |
| reset | JSON message | No | Clears buffered state so another utterance can run on the same socket. |
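For transports that cannot send binary frames, the JSON `audio` fallback can be built like this. A minimal sketch; the silent chunk here is a stand-in for real capture output:

```python
import base64
import json

# Stand-in for a real capture chunk: 20 ms of PCM16LE mono silence at 16 kHz
# (320 samples × 2 bytes). Never include WAV headers.
pcm_chunk = b"\x00\x00" * 320

# Fallback JSON shape documented in the client-messages table above.
message = json.dumps({
    "type": "audio",
    "audio_b64": base64.b64encode(pcm_chunk).decode("ascii"),
})
```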

Server events

| Event | Type | Emitted | Notes |
| --- | --- | --- | --- |
| ready | server event | Always | First event on a successful connection. Includes team info, backend, limits, and sample rate. |
| started | server event | Always | Confirms the stream is configured and ready for audio frames. |
| speech_start | server event | Optional | Emitted when VAD first detects speech. Includes `offset_ms` and `latency_ms`. |
| partial | server event | Optional | Best-effort rolling transcript. Includes `text`, `segments`, `audio_ms`, `latency_ms`, `queue_ms`, and `decode_ms`. |
| stabilized_partial | server event | Optional | A partial that repeated enough times to look stable. Use this for live captions when available. |
| final | server event | Always on success | Final transcript with segments and latency metrics. Persist only this event in production workflows. |
| error | server event | On failure | Carries `code` and `error` for issues such as timeout, decode failures, or stream failures. |
| reset_ok | server event | Optional | Acknowledges a successful `reset` command. |

JavaScript integration example

Sample
const API_KEY = "YOUR_API_KEY";
const HTTP_BASE = "https://kaleidovid.com";

const config = await fetch(`${HTTP_BASE}/api/thai-asr/config`, {
  headers: { "x-api-key": API_KEY },
}).then(async (response) => {
  if (!response.ok) throw new Error(await response.text());
  return response.json();
});

const session = await fetch(`${HTTP_BASE}/api/thai-asr/sessions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": API_KEY,
  },
  body: JSON.stringify({
    sample_rate: 16000,
    frame_ms: 20,
    partial_interval_ms: 40,
    min_decode_audio_ms: 240,
    decode_window_ms: 1600,
    vad: true,
  }),
}).then(async (response) => {
  if (!response.ok) throw new Error(await response.text());
  return response.json();
});

const wsUrl = `${session.ws_url}&api_key=${encodeURIComponent(API_KEY)}`;
const socket = new WebSocket(wsUrl);

socket.onmessage = (event) => {
  const payload = JSON.parse(event.data);
  switch (payload.type) {
    case "ready":
      socket.send(JSON.stringify({
        type: "start",
        sample_rate: 16000,
        frame_ms: 20,
        partial_interval_ms: 40,
        min_decode_audio_ms: 240,
        decode_window_ms: 1600,
        vad: true,
      }));
      break;
    case "partial":
    case "stabilized_partial":
      console.log("live", payload.text);
      break;
    case "final":
      console.log("final", payload.text, payload.segments);
      break;
    case "error":
      console.error(payload.code, payload.error);
      break;
  }
};

// `pcmChunks` must contain raw PCM16LE mono audio frames.
// Do not send WAV headers after the socket is started.
for (const chunk of pcmChunks) {
  socket.send(chunk);
}

socket.send(JSON.stringify({ type: "flush" }));
socket.send(JSON.stringify({ type: "end" }));

`final` event example

Sample
{
  "type": "final",
  "session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "engine": "typhoon",
  "backend": "persistent_decoder",
  "text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ",
  "segments": [
    {
      "start": 0.0,
      "end": 2.84,
      "text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ"
    }
  ],
  "audio_ms": 2840,
  "latency_ms": 134,
  "queue_ms": 0,
  "decode_ms": 134,
  "processing_time_ms": 133,
  "audio_duration_ms": 2840
}

Reading Score API

Upload a student recording and get Thai token-level scoring.

The Reading Score API is a canonical HTTP upload endpoint on this Next.js app. Send multipart form data with a student recording and either reference text or reference audio. The current service only supports Thai.

POST /api/reading-score

Canonical public route

Use this route for new integrations. It validates the API key, normalizes form fields, forwards the upload to the Django scoring backend, and records usage under the calling team.

POST /api/low-latency-asr/reading-score

Legacy compatibility alias

Older clients may still call this path. New integrations should prefer `/api/reading-score`.

Multipart request fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| student_audio | file | Yes | Student recording to score. This is the canonical field name. |
| reference_text | string | One of two | Reference sentence or passage. Required when `reference_audio` is not sent. |
| reference_audio | file | One of two | Reference recording. If provided without `reference_text`, the service first transcribes this audio. |
| language | string | No | Defaults to `th`. The current service only accepts Thai. |
| student_id | string | No | Optional caller-supplied student identifier returned in `request.studentId`. |
| lesson_id | string | No | Optional caller-supplied lesson identifier returned in `request.lessonId`. |
| scoring_options | JSON string | No | Supported option today: `{ "include_pronunciation": true }`. |
| include_pronunciation | boolean-like string | No | Alternative top-level form for pronunciation scoring. `true` is the default behavior. |

Canonical field names and accepted aliases

The canonical request fields are student_audio, reference_text, reference_audio, student_id, lesson_id, and scoring_options. The backend also accepts a few compatibility aliases such as studentAudio, audio_file, referenceText, referenceAudio, studentId, and lessonId.

curl upload example

Sample
curl -sS \
  -H "x-api-key: YOUR_API_KEY" \
  -F "student_audio=@student.wav" \
  -F "reference_text=สวัสดีครับ วันนี้เราจะเรียนภาษาไทย" \
  -F "student_id=student_001" \
  -F "lesson_id=lesson_01" \
  -F 'scoring_options={"include_pronunciation":true}' \
  "https://kaleidovid.com/api/reading-score"

Python upload example

Sample
# pip install requests

import json
import requests

ENDPOINT_URL = "https://kaleidovid.com/api/reading-score"
API_KEY = "YOUR_API_KEY"

with open("student.wav", "rb") as student_audio:
    response = requests.post(
        ENDPOINT_URL,
        headers={"x-api-key": API_KEY},
        data={
            "reference_text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
            "student_id": "student_001",
            "lesson_id": "lesson_01",
            "language": "th",
            "scoring_options": json.dumps({"include_pronunciation": True}),
        },
        files={
            "student_audio": ("student.wav", student_audio, "audio/wav"),
        },
        timeout=120,
    )

response.raise_for_status()
print(response.json())

Response shape

A successful response returns the normalized request summary, reference transcript, student transcript, aggregate scores, token-by-token comparison, user-facing feedback, and optional forced-alignment metadata.

| Field | Type | Presence | Notes |
| --- | --- | --- | --- |
| scores.overall | number \| null | Always | Final headline score. If pronunciation data exists, it is weighted 70% text accuracy and 30% pronunciation. |
| scores.textAccuracy | number | Always | Percentage of reference tokens correctly matched against the student transcript. |
| scores.pronunciation | number \| null | Maybe | Alignment-based pronunciation score. `null` means pronunciation scoring was unavailable. |
| scores.levenshteinSimilarity | number | Always | Edit-distance similarity between reference and student token sequences. |
| comparison.wordResults[] | array | Always | Token-by-token breakdown with `status` = `correct`, `missing`, `extra`, or `substitution`. |
| alignment.used | boolean | Always | Indicates whether pronunciation alignment was strong enough to contribute to the score. |
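A sketch of the headline-score weighting described above, assuming the stated 70/30 split and the documented fallback when pronunciation is `null`:

```python
from typing import Optional

def overall_score(text_accuracy: float, pronunciation: Optional[float]) -> float:
    # When pronunciation scoring is unavailable (null), the overall score
    # falls back to text accuracy alone.
    if pronunciation is None:
        return round(text_accuracy, 2)
    # Documented weighting: 70% text accuracy, 30% pronunciation.
    return round(0.7 * text_accuracy + 0.3 * pronunciation, 2)

overall_score(100.0, 87.33)  # matches the 96.2 headline score in the example below
```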

Successful response example

Sample
{
  "success": true,
  "request": {
    "language": "th",
    "studentId": "student_001",
    "lessonId": "lesson_01",
    "includePronunciation": true,
    "referenceSource": "referenceText"
  },
  "reference": {
    "text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "normalizedText": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "tokens": ["สวัสดีครับ", "วันนี้", "เรา", "จะ", "เรียน", "ภาษาไทย"],
    "tokenCount": 6,
    "meta": {}
  },
  "student": {
    "transcript": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "normalizedText": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "tokens": ["สวัสดีครับ", "วันนี้", "เรา", "จะ", "เรียน", "ภาษาไทย"],
    "tokenCount": 6,
    "meta": {
      "engine": "whisper_local_fallback"
    }
  },
  "scores": {
    "overall": 96.2,
    "textAccuracy": 100.0,
    "pronunciation": 87.33,
    "levenshteinSimilarity": 100.0
  },
  "comparison": {
    "correctTokenCount": 6,
    "referenceTokenCount": 6,
    "studentTokenCount": 6,
    "missingTokenCount": 0,
    "extraTokenCount": 0,
    "substitutionCount": 0,
    "wordResults": [
      {
        "referenceIndex": 0,
        "studentIndex": 0,
        "status": "correct",
        "referenceToken": "สวัสดีครับ",
        "studentToken": "สวัสดีครับ",
        "pronunciationConfidence": 0.91
      }
    ]
  },
  "feedback": {
    "missingWords": [],
    "extraWords": [],
    "substitutions": [],
    "notes": [
      "Pronunciation score was estimated from forced alignment confidence."
    ]
  },
  "alignment": {
    "used": true,
    "averageConfidence": 0.8733,
    "words": [
      {
        "index": 0,
        "word": "สวัสดีครับ",
        "start": 0.0,
        "end": 0.41,
        "confidence": 0.91
      }
    ]
  }
}

If alignment.used is false, pronunciation scoring was unavailable or too weak to trust. In that case scores.pronunciation becomes null, and scores.overall falls back to text accuracy only.

Errors and limits

Expect explicit auth, quota, and upstream failures.

The two APIs share the same team API key system, but their runtime error surfaces differ. Thai ASR enforces request and concurrent-session limits directly; Reading Score mainly returns validation or upstream-transcription failures.

| Code | Channel | Applies to | Notes |
| --- | --- | --- | --- |
| 401 | HTTP | Thai ASR + Reading Score | Missing or invalid API key. |
| 429 | HTTP / WS close | Thai ASR | Request-per-minute limit or concurrent-session limit exceeded. WebSocket clients can see close code `4429`. |
| 4401 | WS close | Thai ASR | WebSocket authentication failed. |
| 503 | HTTP | Thai ASR + Reading Score | Policy lookup or upstream ASR/transcription service was unavailable. |
| 1011 | WS close | Thai ASR | Unexpected server-side streaming failure. |

Thai ASR quota notes

  • The HTTP config and session routes can reject requests when the team request bucket is exhausted.
  • The WebSocket service can reject or close sessions when concurrent-session limits are exceeded.
  • Daily audio quota is checked during streaming as audio accumulates.
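When the request bucket is exhausted, clients should back off rather than hammer the session route. One common pattern is a capped exponential schedule; the base and cap values below are illustrative choices, not service requirements:

```python
from typing import List

def backoff_schedule(max_attempts: int = 5, base_s: float = 0.5, cap_s: float = 8.0) -> List[float]:
    # Delays (in seconds) to wait between retries after an HTTP 429
    # or WebSocket close code 4429. Doubles each attempt, capped at cap_s.
    return [min(cap_s, base_s * (2 ** i)) for i in range(max_attempts)]
```

Pair this with a check for status 429 on `POST /api/thai-asr/sessions`, sleeping for the next delay in the schedule before retrying.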

Reading Score validation notes

  • student_audio is required.
  • You must send at least one of reference_text or reference_audio.
  • language must resolve to Thai. Other values are rejected.
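The same checks can be run client-side before uploading, so obviously invalid requests never hit the network. A pre-flight sketch mirroring the rules above (the error strings are illustrative, not the server's actual messages):

```python
from typing import Dict, List

def validate_reading_score_form(fields: Dict) -> List[str]:
    # Client-side mirror of the server-side validation rules listed above.
    errors = []
    if not fields.get("student_audio"):
        errors.append("student_audio is required")
    if not (fields.get("reference_text") or fields.get("reference_audio")):
        errors.append("send reference_text or reference_audio")
    if fields.get("language", "th") != "th":
        errors.append("language must resolve to Thai (th)")
    return errors
```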

Legacy routes

Prefer the canonical routes for new integrations.

Compatibility paths still exist for older callers, but new integrations should standardize on the canonical endpoints below.

| Canonical | Legacy alias | Notes |
| --- | --- | --- |
| https://kaleidovid.com/api/thai-asr | https://kaleidovid.com/v2/asr/low-latency/th | Thai ASR streaming surface. Prefer `/api/thai-asr/*`. |
| https://kaleidovid.com/api/reading-score | https://kaleidovid.com/api/low-latency-asr/reading-score | Reading Score upload API. Prefer `/api/reading-score`. |