
TTS / STT API

Synthesize Speech

Convert text to audio.

POST /v1/synthesize

Request Body

json
{
  "text": "Hello world, this is a test.",
  "voiceId": "default",
  "format": "wav",
  "speed": 1.0,
  "pitch": 1.0,
  "provider": "cartesia"
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | required | Text to synthesize (1-10,000 chars) |
| voiceId | string | "default" | Voice identifier |
| format | string | "wav" | Output format: mp3, wav, ogg |
| speed | number | 1.0 | Playback speed (0.5-2.0) |
| pitch | number | 1.0 | Pitch adjustment (0.5-2.0) |
| provider | string | "default" | TTS provider. See TTS Providers. |

Response 200

Binary audio stream. The Content-Type header reflects the requested format.
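
As a rough illustration of the request and response described above, the following Python sketch validates the documented field ranges, builds the POST request, and writes the binary response to disk. It assumes this endpoint accepts the same Bearer token header and base URL as the curl example under Transcribe Audio; the function names are hypothetical.

```python
import json
import urllib.request

# Base URL copied from the curl example in this reference.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"

ALLOWED_FORMATS = {"mp3", "wav", "ogg"}


def build_synthesize_request(text, api_key, voice_id="default", fmt="wav",
                             speed=1.0, pitch=1.0, provider="default"):
    """Check the documented limits locally, then build the POST request."""
    if not 1 <= len(text) <= 10_000:
        raise ValueError("text must be 1-10,000 characters")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    for name, value in (("speed", speed), ("pitch", pitch)):
        if not 0.5 <= value <= 2.0:
            raise ValueError(f"{name} must be between 0.5 and 2.0")
    payload = {
        "text": text,
        "voiceId": voice_id,
        "format": fmt,
        "speed": speed,
        "pitch": pitch,
        "provider": provider,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: same Bearer scheme as the transcribe curl example.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request, out_path):
    """Send the request and write the binary audio body to out_path."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as `synthesize_to_file(build_synthesize_request("Hello world", api_key, provider="cartesia"), "hello.wav")` would then save the audio, provided the key is valid.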

TTS Providers

| Provider | ID | Notes |
| --- | --- | --- |
| Cartesia | cartesia | ~40ms TTFB. Recommended for production. |
| ElevenLabs | elevenlabs | High quality, wide language support |
| Deepgram | deepgram | Fast, cost-effective |
| Fish Audio | fish-audio | Open-source friendly |
| Resemble AI | resemble-ai | Enterprise voice cloning |
| Inworld | inworld | Gaming/interactive focus |
| Chatterbox | chatterbox | Open-source, GPU-accelerated |
| Pocket TTS | pocket-tts | Free tier fallback (higher latency) |
| Kokoro | kokoro | Lightweight |
| F5 TTS | f5-tts | Open-source |
| CosyVoice | cosyvoice | Multilingual |
| Qwen TTS | qwen-tts | HuggingFace-hosted |

Transcribe Audio

Synchronous transcription. The request blocks until transcription is complete.

POST /v1/transcribe

JSON Body (URL reference)

json
{
  "audioUrl": "https://example.com/audio.mp3",
  "language": "en",
  "mode": "accurate",
  "diarize": true,
  "timestamps": "segment"
}
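
A minimal Python sketch of the URL-reference variant, assuming the Bearer header and base URL from the curl example in this section. The function names are hypothetical, and the allowed values for `mode` and `timestamps` come from this section's field table.

```python
import json
import urllib.request

# Base URL copied from the curl example in this reference.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"


def build_transcribe_request(audio_url, api_key, language="", mode="accurate",
                             diarize=False, timestamps="segment"):
    """Build a JSON-body POST to /v1/transcribe for a remote audio file."""
    if mode not in ("accurate", "fast"):
        raise ValueError("mode must be 'accurate' or 'fast'")
    if timestamps not in ("segment", "word"):
        raise ValueError("timestamps must be 'segment' or 'word'")
    payload = {
        "audioUrl": audio_url,
        "language": language,  # an empty string requests auto-detection
        "mode": mode,
        "diarize": diarize,
        "timestamps": timestamps,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/transcribe",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(request):
    """Send the request and decode the JSON response (blocks until done)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```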

Multipart Form (file upload)

bash
curl -X POST https://persona-labsvoice-api-production.up.railway.app/v1/transcribe \
  -H "Authorization: Bearer $PH0NY_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "mode=accurate" \
  -F "diarize=true"

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| audioUrl | string | - | URL to audio file |
| file | binary | - | Audio file upload (multipart) |
| language | string | "" | Language code (auto-detect if empty) |
| mode | string | "accurate" | accurate or fast |
| diarize | boolean | false | Enable speaker diarization |
| timestamps | string | "segment" | segment or word |

Max file size: 25MB.
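
For clients without a multipart helper, the file upload can be approximated with the standard library alone. The sketch below checks the size limit before encoding and mirrors the form fields of the curl example. The helper name is hypothetical, and reading "25MB" as 25,000,000 bytes is a conservative assumption, since the exact server-side limit is not stated.

```python
import uuid

# Assumption: "25MB" is read conservatively as 25 * 10**6 bytes.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one "file" part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    if len(file_bytes) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the documented 25MB limit")
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # curl sends booleans as lowercase strings (diarize=true); do the same.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{text}\r\n").encode("utf-8")
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body can be passed as `data` to `urllib.request.Request`, with the returned value as the `Content-Type` header alongside the `Authorization: Bearer` header from the curl example.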

Response 200

json
{
  "text": "Hello, welcome to the meeting. Today we'll discuss...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "Speaker 1"
    }
  ],
  "words": [
    { "start": 0.0, "end": 0.4, "word": "Hello" }
  ],
  "language": "en",
  "duration": 45.2
}
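
To show how the response fields fit together, here is a small Python sketch that renders the `segments` array as a readable transcript. The helper name is hypothetical, and it assumes `speaker` may be missing from a segment (for example when diarization is off), which this reference does not confirm.

```python
def format_transcript(response):
    """Render each segment as "[start-end] Speaker: text", one per line."""
    lines = []
    for segment in response.get("segments", []):
        # Assumption: "speaker" may be absent when diarize is false.
        speaker = segment.get("speaker")
        prefix = f"{speaker}: " if speaker else ""
        lines.append(
            f"[{segment['start']:.1f}-{segment['end']:.1f}] {prefix}{segment['text']}"
        )
    return "\n".join(lines)
```

Applied to the sample response above, this yields the single line `[0.0-2.5] Speaker 1: Hello, welcome to the meeting.`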

Async Transcribe

Start a transcription job and poll for results. Useful for large files.

POST /v1/transcribe/async

Same request body as /v1/transcribe.

Response 202

json
{
  "id": "job_abc123",
  "status": "processing",
  "createdAt": "2026-03-21T00:00:00.000Z"
}

Get Job

GET /v1/transcribe/:jobId

Response 200

json
{
  "id": "job_abc123",
  "status": "completed",
  "result": {
    "text": "...",
    "segments": [...],
    "language": "en",
    "duration": 45.2
  },
  "createdAt": "2026-03-21T00:00:00.000Z",
  "completedAt": "2026-03-21T00:00:01.500Z"
}

Status values: processing, completed, failed.
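
The start-then-poll flow can be sketched in Python as follows. The polling interval, timeout, and function names are illustrative choices, not documented values, and the Bearer header is assumed to match the curl example under Transcribe Audio. The fetch function is injected so the loop can be exercised without a network.

```python
import json
import time
import urllib.request

# Base URL copied from the curl example in this reference.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"


def http_fetch_job(job_id, api_key):
    """GET /v1/transcribe/:jobId and return the decoded job object."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/transcribe/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the job status is "completed" or "failed".

    Returns the job's "result" on success, raises RuntimeError on failure,
    and raises TimeoutError if the job is still processing after `timeout`
    seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job["status"]
        if status == "completed":
            return job["result"]
        if status == "failed":
            raise RuntimeError(f"transcription job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        sleep(interval)
```

In practice this would be called as `wait_for_job(lambda job_id: http_fetch_job(job_id, api_key), "job_abc123")`, using the `id` returned by `POST /v1/transcribe/async`.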


OpenAI-Compatible Endpoints

Drop-in replacements for OpenAI's audio API:

TTS

POST /v1/audio/speech

Same interface as OpenAI's /v1/audio/speech.

STT

POST /v1/audio/transcriptions

Same interface as OpenAI's /v1/audio/transcriptions. Max 25MB.

List Models

GET /v1/audio/models

Returns available audio models for TTS and STT.
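
Since these endpoints mirror OpenAI's interface, a request should carry OpenAI's field names (`model`, `input`, `voice`, `response_format`) in place of the native `text`/`voiceId`/`format` fields. The Python sketch below builds such requests under that assumption. The function names are hypothetical, the Bearer header is assumed to match the curl example under Transcribe Audio, and valid `model` and `voice` values are not listed in this reference, so they should be taken from `GET /v1/audio/models`.

```python
import json
import urllib.request

# Base URL copied from the curl example in this reference.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"


def build_models_request(api_key):
    """Build GET /v1/audio/models to discover usable model identifiers."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def build_speech_request(api_key, model, voice, text, response_format="mp3"):
    """Build POST /v1/audio/speech using OpenAI's request field names."""
    payload = {
        "model": model,    # caller-supplied; look it up via /v1/audio/models
        "input": text,     # OpenAI uses "input" where /v1/synthesize uses "text"
        "voice": voice,    # caller-supplied; valid names are not documented here
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```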

Built by Persona Labs.