TTS / STT API

Synthesize Speech

Convert text to audio.

POST /v1/synthesize

Request Body

```json
{
  "text": "Hello world, this is a test.",
  "voiceId": "default",
  "format": "wav",
  "speed": 1.0,
  "pitch": 1.0,
  "provider": "cartesia"
}
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | string | required | Text to synthesize (1-10,000 chars) |
| `voiceId` | string | `"default"` | Voice identifier |
| `format` | string | `"wav"` | Output format: `mp3`, `wav`, `ogg` |
| `speed` | number | `1.0` | Playback speed (0.5-2.0) |
| `pitch` | number | `1.0` | Pitch adjustment (0.5-2.0) |
| `provider` | string | `"default"` | TTS provider ID. See TTS Providers. |
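The documented limits can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_synthesize_payload` is a hypothetical helper, and it assumes the ranges above are inclusive and that the character limit matches Python's `len()`.

```python
def build_synthesize_payload(text, voice_id="default", fmt="wav",
                             speed=1.0, pitch=1.0, provider="default"):
    """Validate arguments against the documented limits and return the JSON body.

    Hypothetical helper: assumes the documented ranges are inclusive.
    """
    if not 1 <= len(text) <= 10_000:
        raise ValueError("text must be 1-10,000 characters")
    if fmt not in {"mp3", "wav", "ogg"}:
        raise ValueError(f"unsupported format: {fmt!r}")
    for name, value in (("speed", speed), ("pitch", pitch)):
        if not 0.5 <= value <= 2.0:
            raise ValueError(f"{name} must be between 0.5 and 2.0")
    return {
        "text": text,
        "voiceId": voice_id,
        "format": fmt,
        "speed": speed,
        "pitch": pitch,
        "provider": provider,
    }
```

Failing early on the client avoids a round trip for requests the server would presumably reject anyway.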

Response `200`

Binary audio stream. The `Content-Type` header matches the requested `format`.
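A possible way to call this endpoint from Python using only the standard library is sketched below. The host `https://api.ph0ny.com` and the Bearer token header are taken from the curl example for `/v1/transcribe`; that `/v1/synthesize` uses the same host and authentication is an assumption. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.ph0ny.com"  # host from the /v1/transcribe curl example


def make_synthesize_request(payload, api_key):
    """Build (but do not send) a POST /v1/synthesize request."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: same Bearer auth as the /v1/transcribe curl example.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize_to_file(payload, api_key, out_path):
    """Send the request, write the binary audio body to out_path,
    and return the Content-Type reported by the server."""
    with urllib.request.urlopen(make_synthesize_request(payload, api_key)) as resp:
        content_type = resp.headers.get("Content-Type")
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return content_type
```

A call might look like `synthesize_to_file({"text": "Hello world, this is a test.", "format": "wav"}, api_key, "hello.wav")`, with `api_key` read from the `PH0NY_API_KEY` environment variable.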

TTS Providers

| Provider | ID | Notes |
| --- | --- | --- |
| Cartesia | `cartesia` | ~40ms TTFB. Recommended for production. |
| ElevenLabs | `elevenlabs` | High quality, wide language support |
| Deepgram | `deepgram` | Fast, cost-effective |
| Fish Audio | `fish-audio` | Open-source friendly |
| Resemble AI | `resemble-ai` | Enterprise voice cloning |
| Inworld | `inworld` | Gaming/interactive focus |
| Chatterbox | `chatterbox` | Open-source, GPU-accelerated |
| Pocket TTS | `pocket-tts` | Free tier fallback (higher latency) |
| Kokoro | `kokoro` | Lightweight |
| F5 TTS | `f5-tts` | Open-source |
| CosyVoice | `cosyvoice` | Multilingual |
| Qwen TTS | `qwen-tts` | HuggingFace-hosted |
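Because the provider is selected per request, a client can retry with a different provider ID when one fails. The sketch below illustrates one such client-side strategy; it is not a documented feature of the API. `synthesize_with_fallback` and its `synthesize` callable are hypothetical, and the default order (`cartesia`, then `pocket-tts`) is only an example drawn from the notes in the table.

```python
def synthesize_with_fallback(synthesize, payload,
                             providers=("cartesia", "pocket-tts")):
    """Try each provider ID in order and return (provider, audio)
    from the first call that succeeds.

    `synthesize` is any callable that takes a request body dict and
    returns audio bytes, raising an exception on failure.
    """
    errors = []
    for provider in providers:
        try:
            return provider, synthesize({**payload, "provider": provider})
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((provider, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```

Passing the sending function in as an argument keeps the retry logic independent of the HTTP client in use.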

Transcribe Audio

Transcribes audio synchronously. The request blocks until transcription completes.

POST /v1/transcribe

JSON Body (URL reference)

```json
{
  "audioUrl": "https://example.com/audio.mp3",
  "language": "en",
  "mode": "accurate",
  "diarize": true,
  "timestamps": "segment"
}
```

Multipart Form (file upload)

```bash
curl -X POST https://api.ph0ny.com/v1/transcribe \
  -H "Authorization: Bearer $PH0NY_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "mode=accurate" \
  -F "diarize=true"
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `audioUrl` | string | - | URL to audio file |
| `file` | binary | - | Audio file upload (multipart) |
| `language` | string | `""` | Language code (auto-detect if empty) |
| `mode` | string | `"accurate"` | `accurate` or `fast` |
| `diarize` | boolean | `false` | Enable speaker diarization |
| `timestamps` | string | `"segment"` | `segment` or `word` |

Maximum file size: 25 MB.
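Checking the file size locally avoids uploading a file that would exceed the limit. The sketch below uses a hypothetical helper, `check_upload_size`. Whether the limit means 25,000,000 bytes or 25 MiB is not stated here, so the code assumes the smaller value to stay on the safe side.

```python
import os

# Assumption: "25MB" is read as 25,000,000 bytes, the more conservative
# interpretation; the server may actually allow 25 MiB.
MAX_UPLOAD_BYTES = 25_000_000


def check_upload_size(path, limit=MAX_UPLOAD_BYTES):
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; limit is {limit} bytes")
    return size
```

Files above the limit are candidates for the `audioUrl` form of the request or for the asynchronous endpoint, depending on what limits those paths enforce.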

Response `200`

```json
{
  "text": "Hello, welcome to the meeting. Today we'll discuss...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "Speaker 1"
    }
  ],
  "words": [
    { "start": 0.0, "end": 0.4, "word": "Hello" }
  ],
  "language": "en",
  "duration": 45.2
}
```
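As an illustration of how the response can be consumed, the sketch below turns the `segments` array into a speaker-labelled transcript. `format_transcript` is a hypothetical helper, and it assumes that `speaker` may be absent when diarization is disabled.

```python
def format_transcript(response):
    """Render each segment as '[start-end] speaker: text', one per line."""
    lines = []
    for seg in response.get("segments", []):
        # Assumption: "speaker" is only present when diarize was enabled.
        speaker = seg.get("speaker", "Unknown")
        lines.append(
            f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}"
        )
    return "\n".join(lines)
```

For the example response above, this yields the single line `[0.0-2.5] Speaker 1: Hello, welcome to the meeting.`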

Async Transcribe

Start a transcription job and poll for results. Useful for large files.

POST /v1/transcribe/async

Accepts the same request body as `/v1/transcribe`.

Response `202`

```json
{
  "id": "job_abc123",
  "status": "processing",
  "createdAt": "2026-03-21T00:00:00.000Z"
}
```

Get Job

GET /v1/transcribe/:jobId

Response `200`

```json
{
  "id": "job_abc123",
  "status": "completed",
  "result": {
    "text": "...",
    "segments": [...],
    "language": "en",
    "duration": 45.2
  },
  "createdAt": "2026-03-21T00:00:00.000Z",
  "completedAt": "2026-03-21T00:00:01.500Z"
}
```

Status values: `processing`, `completed`, `failed`.
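The start-then-poll flow can be sketched as a small loop over the three status values. `wait_for_job` is a hypothetical helper: `fetch_job` stands for any callable that performs `GET /v1/transcribe/:jobId` and returns the parsed JSON. The polling interval and timeout are arbitrary choices, not documented recommendations, and the shape of a failed job's response is not specified here, so the sketch only reports the job ID.

```python
import time


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=300.0,
                 sleep=time.sleep):
    """Poll until the job completes and return its "result" object.

    Raises RuntimeError if the job fails and TimeoutError if it is still
    processing once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        if job["status"] == "completed":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(f"transcription job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still processing after {timeout}s")
        sleep(interval)
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.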

OpenAI-Compatible Endpoints

Drop-in replacements for OpenAI's audio API:

TTS

POST /v1/audio/speech

Same interface as OpenAI's `/v1/audio/speech`.

STT

POST /v1/audio/transcriptions

Same interface as OpenAI's `/v1/audio/transcriptions`. Maximum file size: 25 MB.
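Since the TTS endpoint mirrors OpenAI's interface, a request body uses OpenAI's field names (`model`, `input`, `voice`, and optionally `response_format`). The sketch below builds such a request with the standard library. The host and Bearer authentication are assumed from the `/v1/transcribe` curl example, `make_speech_request` is a hypothetical helper, and valid `model` and `voice` values are not listed in this document, so they are left as parameters.

```python
import json
import urllib.request

BASE_URL = "https://api.ph0ny.com"  # host from the /v1/transcribe curl example


def make_speech_request(model, voice, text, api_key, response_format="mp3"):
    """Build (but do not send) an OpenAI-style POST /v1/audio/speech request."""
    body = {
        "model": model,    # valid IDs are not listed here; see GET /v1/audio/models
        "voice": voice,    # placeholder; valid voice names are not documented here
        "input": text,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: same Bearer auth as the native endpoints.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Existing OpenAI client code should in principle only need its base URL and API key changed, provided the model and voice names it sends are accepted by this service.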

List Models

GET /v1/audio/models

Returns available audio models for TTS and STT.
