TTS / STT API
Synthesize Speech
Convert text to audio.
POST /v1/synthesize
Request Body
```json
{
  "text": "Hello world, this is a test.",
  "voiceId": "default",
  "format": "wav",
  "speed": 1.0,
  "pitch": 1.0,
  "provider": "cartesia"
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| text | string | required | Text to synthesize (1-10,000 chars) |
| voiceId | string | "default" | Voice identifier |
| format | string | "wav" | Output format: mp3, wav, ogg |
| speed | number | 1.0 | Playback speed (0.5-2.0) |
| pitch | number | 1.0 | Pitch adjustment (0.5-2.0) |
| provider | string | "default" | TTS provider. See providers. |
Response 200
Binary audio stream with Content-Type set to the requested format.
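
As a rough illustration of the request described above, the sketch below validates the documented limits on the client side and builds the POST request with Python's standard library. It assumes the same host and bearer-token header as the transcription curl example in this document; the function names are hypothetical and not part of the API.

```python
import json
import os
import urllib.request

# Assumption: same host as the transcription curl example in this document.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"
ALLOWED_FORMATS = {"mp3", "wav", "ogg"}


def build_synthesize_request(text, api_key, voice_id="default", fmt="wav",
                             speed=1.0, pitch=1.0, provider="default"):
    """Check the documented field limits and build (but do not send) the request."""
    if not 1 <= len(text) <= 10_000:
        raise ValueError("text must be 1-10,000 characters")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    for name, value in (("speed", speed), ("pitch", pitch)):
        if not 0.5 <= value <= 2.0:
            raise ValueError(f"{name} must be between 0.5 and 2.0")

    payload = {
        "text": text, "voiceId": voice_id, "format": fmt,
        "speed": speed, "pitch": pitch, "provider": provider,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer auth, as in the transcription curl example.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize_to_file(text, out_path, **options):
    """Send the request and write the binary audio response to out_path."""
    request = build_synthesize_request(text, os.environ["PH0NY_API_KEY"], **options)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as `synthesize_to_file("Hello world", "hello.mp3", fmt="mp3", provider="cartesia")` would then save the returned audio stream to disk.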
TTS Providers
| Provider | ID | Notes |
|---|---|---|
| Cartesia | cartesia | ~40 ms time to first byte (TTFB). Recommended for production. |
| ElevenLabs | elevenlabs | High quality, wide language support |
| Deepgram | deepgram | Fast, cost-effective |
| Fish Audio | fish-audio | Open-source friendly |
| Resemble AI | resemble-ai | Enterprise voice cloning |
| Inworld | inworld | Gaming/interactive focus |
| Chatterbox | chatterbox | Open-source, GPU-accelerated |
| Pocket TTS | pocket-tts | Free tier fallback (higher latency) |
| Kokoro | kokoro | Lightweight |
| F5 TTS | f5-tts | Open-source |
| CosyVoice | cosyvoice | Multilingual |
| Qwen TTS | qwen-tts | HuggingFace-hosted |
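
Since the table lists a free-tier fallback provider, a client may want to try providers in a preferred order. The sketch below shows one possible client-side approach; whether the API itself falls back between providers is not stated here. The preference order and the `synthesize` callable are hypothetical.

```python
from typing import Callable, Sequence

# Provider IDs come from the table above; this order is only an example.
PREFERRED_PROVIDERS = ("cartesia", "elevenlabs", "pocket-tts")


def synthesize_with_fallback(
    text: str,
    synthesize: Callable[[str, str], bytes],
    providers: Sequence[str] = PREFERRED_PROVIDERS,
) -> tuple[str, bytes]:
    """Try each provider in order; return (provider_id, audio) for the first success.

    `synthesize(text, provider_id)` is a caller-supplied (hypothetical) function
    that calls POST /v1/synthesize and raises an exception on failure.
    """
    errors = []
    for provider_id in providers:
        try:
            return provider_id, synthesize(text, provider_id)
        except Exception as exc:  # broad on purpose: any failure moves on to the next provider
            errors.append(f"{provider_id}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider ID alongside the audio lets the caller log which provider actually served the request.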
Transcribe Audio
Synchronous transcription. The request blocks until the transcript is complete.
POST /v1/transcribe
JSON Body (URL reference)
```json
{
  "audioUrl": "https://example.com/audio.mp3",
  "language": "en",
  "mode": "accurate",
  "diarize": true,
  "timestamps": "segment"
}
```
Multipart Form (file upload)
```bash
curl -X POST https://persona-labsvoice-api-production.up.railway.app/v1/transcribe \
  -H "Authorization: Bearer $PH0NY_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "mode=accurate" \
  -F "diarize=true"
```
| Field | Type | Default | Description |
|---|---|---|---|
| audioUrl | string | - | URL to audio file |
| file | binary | - | Audio file upload (multipart) |
| language | string | "" | Language code (auto-detect if empty) |
| mode | string | "accurate" | accurate or fast |
| diarize | boolean | false | Enable speaker diarization |
| timestamps | string | "segment" | segment or word |

Max file size: 25 MB.
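
The JSON (URL reference) variant could be called from Python roughly as follows. This is a minimal sketch using only the standard library; the helper names are hypothetical, the host and bearer header are taken from the curl example, and the size check assumes "25 MB" means 25,000,000 bytes, which the document does not specify.

```python
import json
import os
import urllib.request

# Host taken from the curl example above.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"
# Assumption: "25 MB" is read conservatively as 25,000,000 bytes.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def build_transcribe_request(audio_url, api_key, language="", mode="accurate",
                             diarize=False, timestamps="segment"):
    """Check the documented enum fields and build (but do not send) the request."""
    if mode not in ("accurate", "fast"):
        raise ValueError("mode must be 'accurate' or 'fast'")
    if timestamps not in ("segment", "word"):
        raise ValueError("timestamps must be 'segment' or 'word'")
    payload = {
        "audioUrl": audio_url, "language": language, "mode": mode,
        "diarize": diarize, "timestamps": timestamps,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/transcribe",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def check_upload_size(path):
    """Raise before uploading a local file that exceeds the size limit."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{path} is {size} bytes; limit is {MAX_UPLOAD_BYTES}")
    return size


def transcribe_url(audio_url, **options):
    """Send the request and return the decoded JSON response (blocks until done)."""
    request = build_transcribe_request(audio_url, os.environ["PH0NY_API_KEY"], **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```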
Response 200
```json
{
  "text": "Hello, welcome to the meeting. Today we'll discuss...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "Speaker 1"
    }
  ],
  "words": [
    { "start": 0.0, "end": 0.4, "word": "Hello" }
  ],
  "language": "en",
  "duration": 45.2
}
```
Async Transcribe
Start a transcription job and poll for results. Useful for large files.
POST /v1/transcribe/async
Same request body as /v1/transcribe.
Response 202
```json
{
  "id": "job_abc123",
  "status": "processing",
  "createdAt": "2026-03-21T00:00:00.000Z"
}
```
Get Job
GET /v1/transcribe/:jobId
Response 200
```json
{
  "id": "job_abc123",
  "status": "completed",
  "result": {
    "text": "...",
    "segments": [...],
    "language": "en",
    "duration": 45.2
  },
  "createdAt": "2026-03-21T00:00:00.000Z",
  "completedAt": "2026-03-21T00:00:01.500Z"
}
```
Status values: processing, completed, failed.
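
The start-then-poll flow could look roughly like the sketch below. It assumes the host and bearer header from the curl example; the polling interval and timeout are arbitrary illustrative values, and the function names are hypothetical. The shape of a failed job's body is not documented here, so the sketch only reads `status` and `result`.

```python
import json
import time
import urllib.request

# Assumption: same host as the curl example in this document.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"


def fetch_job(job_id, api_key):
    """GET /v1/transcribe/:jobId and return the decoded JSON job object."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/transcribe/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(job_id, fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the job leaves "processing"; return the result of a completed job.

    `fetch(job_id)` returns the job dict. It is passed in so the loop can be
    exercised without network access.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        status = job["status"]
        if status == "completed":
            return job["result"]
        if status == "failed":
            raise RuntimeError(f"transcription job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        sleep(interval)
```

With the `id` from the 202 response, a caller could run `wait_for_job(job_id, lambda j: fetch_job(j, api_key))` to block until the transcript is available.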
OpenAI-Compatible Endpoints
Drop-in replacements for OpenAI's audio API:
TTS
POST /v1/audio/speech
Same interface as OpenAI's /v1/audio/speech.
STT
POST /v1/audio/transcriptions
Same interface as OpenAI's /v1/audio/transcriptions. Max file size: 25 MB.
List Models
GET /v1/audio/models
Returns available audio models for TTS and STT.
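
Because these endpoints mirror OpenAI's audio API, a request body uses OpenAI's field names (`model`, `input`, `voice`, `response_format`) rather than the ones used by /v1/synthesize. The sketch below is an illustration under assumptions: the host and bearer header come from the curl example in this document, the helper names are hypothetical, and valid model and voice values are not listed here, so they would need to be looked up first (for example via GET /v1/audio/models, whose response shape is also not documented).

```python
import json
import urllib.request

# Assumption: same host as the curl example in this document.
BASE_URL = "https://persona-labsvoice-api-production.up.railway.app"


def build_speech_request(text, model, voice, api_key, response_format="mp3"):
    """Build an OpenAI-style POST /v1/audio/speech request (not sent here)."""
    payload = {
        "model": model,    # caller-supplied; valid names are not documented here
        "input": text,     # OpenAI's name for the text to synthesize
        "voice": voice,    # caller-supplied; valid voices are not documented here
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def list_audio_models(api_key):
    """GET /v1/audio/models and return the decoded JSON as-is."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/audio/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Clients that already use an OpenAI SDK can usually achieve the same result by pointing the SDK's base URL at this service instead of building requests by hand.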