
Public Developer Docs

Integration docs for KaleidoVid Thai speech products.

This page documents the current customer-facing Thai ASR streaming API and the Reading Score upload API exposed by this SaaS site. The examples below are built from the actual routes and backend contracts in this repository.

Thai ASR base URL

https://kaleidovid.com/api/thai-asr

Reading Score base URL

https://kaleidovid.com/api/reading-score

Quick start

  1. Create or copy a team API key from Dashboard → API Keys.
  2. Use x-api-key for HTTP calls. Browser WebSocket clients should append api_key to the returned ws_url.
  3. Use Thai ASR for live streaming transcripts, or Reading Score when you need token-level Thai reading evaluation.

Authentication

Use team-scoped API keys.

Both APIs are meant to be called with a KaleidoVid team API key. The SaaS dashboard manages key lifecycle, while your integration sends the raw key on every request.

Where to create keys

Create, revoke, restore, and inspect quotas in Dashboard → API Keys. Team owners and admins can manage keys; team members can still inspect existing key policies and usage.

Raw keys are only shown once at creation time. Store them in your own secret manager immediately.

Header and WebSocket auth

Sample
x-api-key: kvid_your_prefix_your_secret

Reading Score also accepts:
Authorization: Bearer kvid_your_prefix_your_secret

Browser WebSocket clients should append `api_key=<YOUR_API_KEY>`
to the `ws_url` returned by POST /api/thai-asr/sessions.
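Building the authenticated URL is a one-liner; this sketch uses a placeholder `ws_url` rather than a real session, and joins with `&` because the returned URL already carries a `session_id` query parameter:

```python
from urllib.parse import urlencode

# Placeholder values; use the real ws_url from POST /api/thai-asr/sessions.
ws_url = "wss://kaleidovid.com/api/thai-asr/stream?session_id=abc123"
api_key = "kvid_your_prefix_your_secret"

# ws_url already has a query string, so join with "&"; fall back to "?" otherwise.
sep = "&" if "?" in ws_url else "?"
authed_ws_url = f"{ws_url}{sep}{urlencode({'api_key': api_key})}"
```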

Thai ASR API

Streaming Thai speech-to-text over HTTP + WebSocket.

The Thai ASR product is a three-step flow: inspect defaults, create a session, then stream PCM audio to the returned WebSocket URL. The service is Thai-only and currently reports the `typhoon` engine in responses.

GET /api/thai-asr/config

Inspect defaults and limits

Returns the current backend name, decode defaults, tenant information, and request/session/audio quota limits for the calling key.

POST /api/thai-asr/sessions

Create a streaming session

Returns a fresh `session_id`, the resolved `ws_url`, the accepted config, and current SLA target hints for the low-latency benchmark surface.

WS /api/thai-asr/stream

Stream audio and receive transcripts

After the server emits `ready`, send a `start` message, then raw PCM16LE mono audio chunks. Watch `partial`, `stabilized_partial`, and `final` events.

GET /api/thai-asr/config

Sample
curl -sS \
  -H "x-api-key: YOUR_API_KEY" \
  "https://kaleidovid.com/api/thai-asr/config"

Config response example

Sample
{
  "language": "th",
  "mode": "benchmark_spike",
  "backend": "persistent_decoder",
  "engine": "typhoon",
  "defaults": {
    "sample_rate": 16000,
    "frame_ms": 20,
    "stream_chunk_ms": 80,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true
  },
  "notes": [
    "This spike measures websocket, VAD, and partial transcript latency for Thai ASR on the 8x4090 host."
  ],
  "tenant": {
    "team_id": "team_cuid",
    "team_name": "Acme Team",
    "key_prefix": "abcd1234"
  },
  "limits": {
    "requests_per_minute": 120,
    "concurrent_sessions": 3,
    "daily_audio_seconds": 3600,
    "today_audio_seconds_used": 420
  }
}
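A client can derive its remaining daily audio budget from the `limits` block before opening a stream. A minimal sketch, using the values from the example above:

```python
# limits block as returned by GET /api/thai-asr/config (values from the example above).
limits = {
    "requests_per_minute": 120,
    "concurrent_sessions": 3,
    "daily_audio_seconds": 3600,
    "today_audio_seconds_used": 420,
}

# Seconds of audio still available to stream today.
remaining_audio_s = limits["daily_audio_seconds"] - limits["today_audio_seconds_used"]
```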

Create a session

The session creation payload is JSON. You can omit fields to use the server defaults, but most clients should send the same values they expect to use on the WebSocket start message so there is no mismatch between planning and runtime.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| sample_rate | integer | No | 8,000 to 48,000. Use 16,000 for the current Thai ASR defaults. |
| frame_ms | integer | No | 10 to 200. The built-in smoke test uses 20 ms. |
| partial_interval_ms | integer | No | 20 to 1,000. Controls how often the service tries to emit new partials. |
| min_decode_audio_ms | integer | No | 100 to 5,000. Minimum buffered audio before partial decoding starts. |
| decode_window_ms | integer | No | 200 to 15,000. Rolling audio window used for decode requests. |
| vad | boolean | No | Defaults to true. Emits `speech_start` when voice activity is detected. |
| benchmark_label | string | No | Optional label up to 200 characters. |

POST /api/thai-asr/sessions

Sample
curl -sS \
  -X POST \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "sample_rate": 16000,
    "frame_ms": 20,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true
  }' \
  "https://kaleidovid.com/api/thai-asr/sessions"

Session response example

Sample
{
  "session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "mode": "benchmark_spike",
  "language": "th",
  "engine": "typhoon",
  "backend": "persistent_decoder",
  "ws_url": "wss://kaleidovid.com/api/thai-asr/stream?session_id=e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "config": {
    "sample_rate": 16000,
    "frame_ms": 20,
    "partial_interval_ms": 40,
    "min_decode_audio_ms": 240,
    "decode_window_ms": 1600,
    "vad": true,
    "benchmark_label": null
  },
  "sla_targets": {
    "speech_detected_p95_ms": 60,
    "first_partial_p95_ms": 250,
    "final_after_eou_p95_ms": 800
  },
  "tenant": {
    "team_id": "team_cuid",
    "team_name": "Acme Team",
    "key_prefix": "abcd1234"
  }
}

WebSocket flow

Use the exact ws_url returned by session creation. Browser clients typically append api_key to that URL because custom WebSocket headers are harder to set in the browser. Binary audio frames are the preferred transport.

Production caption rule: render one live caption from stabilized_partial when available, otherwise fall back to partial. Only persist the transcript after receiving final.
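The caption rule above can be sketched as a small event-handling state machine. This is an illustrative sketch, not a prescribed client implementation; event names match the server events documented below:

```python
# Caption state: latest stabilized text, latest raw partial, persisted finals.
state = {"stabilized": "", "partial": "", "finals": []}

def live_caption() -> str:
    # Prefer stabilized text; fall back to the latest raw partial.
    return state["stabilized"] or state["partial"]

def handle_event(event: dict) -> None:
    kind = event.get("type")
    if kind == "partial":
        state["partial"] = event["text"]
    elif kind == "stabilized_partial":
        state["stabilized"] = event["text"]
    elif kind == "final":
        state["finals"].append(event["text"])  # only finals are persisted
        state["stabilized"] = state["partial"] = ""
```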

Client messages

| Message | Type | Required | Notes |
| --- | --- | --- | --- |
| start | JSON message | Yes | Send once after the server emits `ready`. Carries the same config fields used in session creation. |
| binary audio frame | raw bytes | Yes | Recommended path. Send PCM16LE mono audio bytes only, without WAV headers. |
| audio | JSON message | No | Fallback JSON shape: `{ "type": "audio", "audio_b64": "..." }` where the payload is base64-encoded PCM16LE audio. |
| flush | JSON message | No | Forces the service to emit a best-effort partial from current buffered audio. |
| end / end_utterance | JSON message | Yes | Finalizes the utterance and triggers the `final` event. |
| reset | JSON message | No | Clears buffered state so another utterance can run on the same socket. |
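For transports that cannot send binary frames, the JSON `audio` fallback can be built like this. A minimal sketch; the silent chunk here is a stand-in for real capture output:

```python
import base64
import json

# Stand-in for a real capture chunk: 20 ms of PCM16LE mono silence at 16 kHz
# (320 samples × 2 bytes). Never include WAV headers.
pcm_chunk = b"\x00\x00" * 320

# Fallback JSON shape documented in the client-messages table above.
message = json.dumps({
    "type": "audio",
    "audio_b64": base64.b64encode(pcm_chunk).decode("ascii"),
})
```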

Server events

| Event | Type | Emitted | Notes |
| --- | --- | --- | --- |
| ready | server event | Always | First event on a successful connection. Includes team info, backend, limits, and sample rate. |
| started | server event | Always | Confirms the stream is configured and ready for audio frames. |
| speech_start | server event | Optional | Emitted when VAD first detects speech. Includes `offset_ms` and `latency_ms`. |
| partial | server event | Optional | Best-effort rolling transcript. Includes `text`, `segments`, `audio_ms`, `latency_ms`, `queue_ms`, and `decode_ms`. |
| stabilized_partial | server event | Optional | A partial that repeated enough times to look stable. Use this for live captions when available. |
| final | server event | Always on success | Final transcript with segments and latency metrics. Persist only this event in production workflows. |
| error | server event | On failure | Carries `code` and `error` for issues such as timeout, decode failures, or stream failures. |
| reset_ok | server event | Optional | Acknowledges a successful `reset` command. |

JavaScript integration example

Sample
const API_KEY = "YOUR_API_KEY";
const HTTP_BASE = "https://kaleidovid.com";

const config = await fetch(`${HTTP_BASE}/api/thai-asr/config`, {
  headers: { "x-api-key": API_KEY },
}).then(async (response) => {
  if (!response.ok) throw new Error(await response.text());
  return response.json();
});

const session = await fetch(`${HTTP_BASE}/api/thai-asr/sessions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": API_KEY,
  },
  body: JSON.stringify({
    sample_rate: 16000,
    frame_ms: 20,
    partial_interval_ms: 40,
    min_decode_audio_ms: 240,
    decode_window_ms: 1600,
    vad: true,
  }),
}).then(async (response) => {
  if (!response.ok) throw new Error(await response.text());
  return response.json();
});

const wsUrl = `${session.ws_url}&api_key=${encodeURIComponent(API_KEY)}`;
const socket = new WebSocket(wsUrl);

socket.onmessage = (event) => {
  const payload = JSON.parse(event.data);
  switch (payload.type) {
    case "ready":
      socket.send(JSON.stringify({
        type: "start",
        sample_rate: 16000,
        frame_ms: 20,
        partial_interval_ms: 40,
        min_decode_audio_ms: 240,
        decode_window_ms: 1600,
        vad: true,
      }));
      break;
    case "partial":
    case "stabilized_partial":
      console.log("live", payload.text);
      break;
    case "final":
      console.log("final", payload.text, payload.segments);
      break;
    case "error":
      console.error(payload.code, payload.error);
      break;
  }
};

// `pcmChunks` must contain raw PCM16LE mono audio frames.
// Do not send WAV headers after the socket is started.
for (const chunk of pcmChunks) {
  socket.send(chunk);
}

socket.send(JSON.stringify({ type: "flush" }));
socket.send(JSON.stringify({ type: "end" }));

`final` event example

Sample
{
  "type": "final",
  "session_id": "e6a4c4f6c69f4ca1aa8f9b3ad8d515f0",
  "engine": "typhoon",
  "backend": "persistent_decoder",
  "text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ",
  "segments": [
    {
      "start": 0.0,
      "end": 2.84,
      "text": "สวัสดีครับ นี่คือการทดสอบระบบถอดเสียงภาษาไทยแบบหน่วงต่ำ"
    }
  ],
  "audio_ms": 2840,
  "latency_ms": 134,
  "queue_ms": 0,
  "decode_ms": 134,
  "processing_time_ms": 133,
  "audio_duration_ms": 2840
}

Reading Score API

Upload a student recording and get Thai token-level scoring.

The Reading Score API is a canonical HTTP upload endpoint on this Next.js app. Send multipart form data with a student recording and either reference text or reference audio. The current service only supports Thai.

POST /api/reading-score

Canonical public route

Use this route for new integrations. It validates the API key, normalizes form fields, forwards the upload to the Django scoring backend, and records usage under the calling team.

POST /api/low-latency-asr/reading-score

Legacy compatibility alias

Older clients may still call this path. New integrations should prefer `/api/reading-score`.

Multipart request fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| student_audio | file | Yes | Student recording to score. This is the canonical field name. |
| reference_text | string | One of two | Reference sentence or passage. Required when `reference_audio` is not sent. |
| reference_audio | file | One of two | Reference recording. If provided without `reference_text`, the service first transcribes this audio. |
| language | string | No | Defaults to `th`. The current service only accepts Thai. |
| student_id | string | No | Optional caller-supplied student identifier returned in `request.studentId`. |
| lesson_id | string | No | Optional caller-supplied lesson identifier returned in `request.lessonId`. |
| scoring_options | JSON string | No | Supported option today: `{ "include_pronunciation": true }`. |
| include_pronunciation | boolean-like string | No | Alternative top-level form for pronunciation scoring. `true` is the default behavior. |

Canonical field names and accepted aliases

The canonical request fields are student_audio, reference_text, reference_audio, student_id, lesson_id, and scoring_options. The backend also accepts a few compatibility aliases such as studentAudio, audio_file, referenceText, referenceAudio, studentId, and lessonId.

curl upload example

Sample
curl -sS \
  -H "x-api-key: YOUR_API_KEY" \
  -F "student_audio=@student.wav" \
  -F "reference_text=สวัสดีครับ วันนี้เราจะเรียนภาษาไทย" \
  -F "student_id=student_001" \
  -F "lesson_id=lesson_01" \
  -F 'scoring_options={"include_pronunciation":true}' \
  "https://kaleidovid.com/api/reading-score"

Python upload example

Sample
# pip install requests

import json
import requests

ENDPOINT_URL = "https://kaleidovid.com/api/reading-score"
API_KEY = "YOUR_API_KEY"

with open("student.wav", "rb") as student_audio:
    response = requests.post(
        ENDPOINT_URL,
        headers={"x-api-key": API_KEY},
        data={
            "reference_text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
            "student_id": "student_001",
            "lesson_id": "lesson_01",
            "language": "th",
            "scoring_options": json.dumps({"include_pronunciation": True}),
        },
        files={
            "student_audio": ("student.wav", student_audio, "audio/wav"),
        },
        timeout=120,
    )

response.raise_for_status()
print(response.json())

Response shape

A successful response returns the normalized request summary, reference transcript, student transcript, aggregate scores, token-by-token comparison, user-facing feedback, and optional forced-alignment metadata.

| Field | Type | Presence | Notes |
| --- | --- | --- | --- |
| scores.overall | number \| null | Always | Final headline score. If pronunciation data exists, it is weighted 70% text accuracy and 30% pronunciation. |
| scores.textAccuracy | number | Always | Percentage of reference tokens correctly matched against the student transcript. |
| scores.pronunciation | number \| null | Maybe | Alignment-based pronunciation score. `null` means pronunciation scoring was unavailable. |
| scores.levenshteinSimilarity | number | Always | Edit-distance similarity between reference and student token sequences. |
| comparison.wordResults[] | array | Always | Token-by-token breakdown with `status` = `correct`, `missing`, `extra`, or `substitution`. |
| alignment.used | boolean | Always | Indicates whether pronunciation alignment was strong enough to contribute to the score. |
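A sketch of the headline-score weighting described above, assuming the stated 70/30 split and the documented fallback when pronunciation is `null`:

```python
from typing import Optional

def overall_score(text_accuracy: float, pronunciation: Optional[float]) -> float:
    # When pronunciation scoring is unavailable (null), the overall score
    # falls back to text accuracy alone.
    if pronunciation is None:
        return round(text_accuracy, 2)
    # Documented weighting: 70% text accuracy, 30% pronunciation.
    return round(0.7 * text_accuracy + 0.3 * pronunciation, 2)

overall_score(100.0, 87.33)  # matches the 96.2 headline score in the example below
```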

Successful response example

Sample
{
  "success": true,
  "request": {
    "language": "th",
    "studentId": "student_001",
    "lessonId": "lesson_01",
    "includePronunciation": true,
    "referenceSource": "referenceText"
  },
  "reference": {
    "text": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "normalizedText": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "tokens": ["สวัสดีครับ", "วันนี้", "เรา", "จะ", "เรียน", "ภาษาไทย"],
    "tokenCount": 6,
    "meta": {}
  },
  "student": {
    "transcript": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "normalizedText": "สวัสดีครับ วันนี้เราจะเรียนภาษาไทย",
    "tokens": ["สวัสดีครับ", "วันนี้", "เรา", "จะ", "เรียน", "ภาษาไทย"],
    "tokenCount": 6,
    "meta": {
      "engine": "whisper_local_fallback"
    }
  },
  "scores": {
    "overall": 96.2,
    "textAccuracy": 100.0,
    "pronunciation": 87.33,
    "levenshteinSimilarity": 100.0
  },
  "comparison": {
    "correctTokenCount": 6,
    "referenceTokenCount": 6,
    "studentTokenCount": 6,
    "missingTokenCount": 0,
    "extraTokenCount": 0,
    "substitutionCount": 0,
    "wordResults": [
      {
        "referenceIndex": 0,
        "studentIndex": 0,
        "status": "correct",
        "referenceToken": "สวัสดีครับ",
        "studentToken": "สวัสดีครับ",
        "pronunciationConfidence": 0.91
      }
    ]
  },
  "feedback": {
    "missingWords": [],
    "extraWords": [],
    "substitutions": [],
    "notes": [
      "Pronunciation score was estimated from forced alignment confidence."
    ]
  },
  "alignment": {
    "used": true,
    "averageConfidence": 0.8733,
    "words": [
      {
        "index": 0,
        "word": "สวัสดีครับ",
        "start": 0.0,
        "end": 0.41,
        "confidence": 0.91
      }
    ]
  }
}

If alignment.used is false, pronunciation scoring was unavailable or too weak to trust. In that case scores.pronunciation becomes null, and scores.overall falls back to text accuracy only.

Errors and limits

Expect explicit auth, quota, and upstream failures.

The two APIs share the same team API key system, but their runtime error surfaces differ. Thai ASR enforces request and concurrent-session limits directly; Reading Score mainly returns validation or upstream-transcription failures.

| Code | Channel | Applies to | Notes |
| --- | --- | --- | --- |
| 401 | HTTP | Thai ASR + Reading Score | Missing or invalid API key. |
| 429 | HTTP / WS close | Thai ASR | Request-per-minute limit or concurrent-session limit exceeded. WebSocket clients can see close code `4429`. |
| 4401 | WS close | Thai ASR | WebSocket authentication failed. |
| 503 | HTTP | Thai ASR + Reading Score | Policy lookup or upstream ASR/transcription service was unavailable. |
| 1011 | WS close | Thai ASR | Unexpected server-side streaming failure. |

Thai ASR quota notes

  • The HTTP config and session routes can reject requests when the team request bucket is exhausted.
  • The WebSocket service can reject or close sessions when concurrent-session limits are exceeded.
  • Daily audio quota is checked during streaming as audio accumulates.
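When the request bucket is exhausted, clients should back off rather than hammer the session route. One common pattern is a capped exponential schedule; the base and cap values below are illustrative choices, not service requirements:

```python
from typing import List

def backoff_schedule(max_attempts: int = 5, base_s: float = 0.5, cap_s: float = 8.0) -> List[float]:
    # Delays (in seconds) to wait between retries after an HTTP 429
    # or WebSocket close code 4429. Doubles each attempt, capped at cap_s.
    return [min(cap_s, base_s * (2 ** i)) for i in range(max_attempts)]
```

Pair this with a check for status 429 on `POST /api/thai-asr/sessions`, sleeping for the next delay in the schedule before retrying.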

Reading Score validation notes

  • student_audio is required.
  • You must send at least one of reference_text or reference_audio.
  • language must resolve to Thai. Other values are rejected.
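The same checks can be run client-side before uploading, so obviously invalid requests never hit the network. A pre-flight sketch mirroring the rules above (the error strings are illustrative, not the server's actual messages):

```python
from typing import Dict, List

def validate_reading_score_form(fields: Dict) -> List[str]:
    # Client-side mirror of the server-side validation rules listed above.
    errors = []
    if not fields.get("student_audio"):
        errors.append("student_audio is required")
    if not (fields.get("reference_text") or fields.get("reference_audio")):
        errors.append("send reference_text or reference_audio")
    if fields.get("language", "th") != "th":
        errors.append("language must resolve to Thai (th)")
    return errors
```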

Legacy routes

Prefer the canonical routes for new integrations.

Compatibility paths still exist for older callers, but new integrations should standardize on the canonical endpoints below.

| Canonical | Legacy alias | Notes |
| --- | --- | --- |
| https://kaleidovid.com/api/thai-asr | https://kaleidovid.com/v2/asr/low-latency/th | Thai ASR streaming surface. Prefer `/api/thai-asr/*`. |
| https://kaleidovid.com/api/reading-score | https://kaleidovid.com/api/low-latency-asr/reading-score | Reading Score upload API. Prefer `/api/reading-score`. |