How to Use Text to Speech

Neuphonic's primary offering is its text-to-speech technology, which serves as the foundation for various features, including Agents. Visit our Playground to experiment with different models and voices, and then continue reading below to learn how to implement this in code.

Note: Don't forget to visit the Quickstart guide to obtain your API key and get your environment set up.

Speech Synthesis

You can generate speech using the API in two ways: Server-Sent Events (SSE) and WebSockets.

SSE is a streaming protocol in which you send a single request to our API to convert text into speech. Our API then streams the generated audio back to you in real time, providing the lowest possible latency. Below are some examples of how to use this endpoint.

The SDK examples demonstrate how to send a message to the API, receive the audio stream from the server, and play it through your device's speaker; a cURL request is shown first, followed by a Python SDK sketch.

# Replace <API_KEY> with your actual API key.
# To switch languages, replace the lang_code path parameter (e.g. /en) with the desired language code.
curl -N --request POST \
  --url https://api.neuphonic.com/sse/speak/en \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <API_KEY>' \
  --header 'Accept: text/event-stream' \
  --data '{
    "text": "Hello, world!"
  }'
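
For comparison, here is the same request made through the Python SDK. This is a minimal sketch based on the pyneuphonic SDK examples, using its SSEClient and AudioPlayer helpers to play the stream through your device's speaker:

import os

from pyneuphonic import Neuphonic, TTSConfig
from pyneuphonic.player import AudioPlayer

client = Neuphonic(api_key=os.environ.get("NEUPHONIC_API_KEY"))

sse = client.tts.SSEClient()
tts_config = TTSConfig(lang_code="en")

# Stream the generated audio back from the server and play it as it arrives.
with AudioPlayer() as player:
    response = sse.send("Hello, world!", tts_config=tts_config)
    player.play(response)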
Warning: The chosen voice must be available for the chosen language.
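
To check which voices exist and which languages they cover, you can list them before choosing a voice_id. The sketch below assumes the SDK exposes a voices listing method; the exact method name may differ between SDK versions:

import os

from pyneuphonic import Neuphonic

client = Neuphonic(api_key=os.environ.get("NEUPHONIC_API_KEY"))

# List the available voices and inspect each entry's language
# before selecting a voice_id.
# Note: client.voices.list() is an assumption; check your SDK version.
response = client.voices.list()
print(response.data)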

Text-to-Speech Configuration

Text-to-speech generation can be configured with the following parameters.

| Name | Type | Description |
| --- | --- | --- |
| lang_code | string | Language code for the desired language. Examples: 'en', 'es', 'de', 'nl', 'hi' |
| voice_id | string | The ID of the desired voice. Different models are used depending on the voice_id you choose. Example: '8e9c4bc8-3979-48ab-8626-df53befc2090' |
| speed | float | Playback speed of the audio. Must be in the range [0.7, 2.0]. Examples: 0.7, 1.0, 1.5 |
| sampling_rate | int | Sampling rate of the audio returned from the server. Options: 8000, 16000, 22050 |
| encoding | string | Encoding of the audio returned from the server. Options: 'pcm_linear', 'pcm_mulaw' |
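
With the Python SDK, these parameters are passed via the TTSConfig object. A minimal sketch, assuming TTSConfig accepts the fields listed in the table above:

from pyneuphonic import TTSConfig

# Example configuration; the voice_id below is the sample ID from the table.
tts_config = TTSConfig(
    lang_code="en",
    voice_id="8e9c4bc8-3979-48ab-8626-df53befc2090",
    speed=1.0,              # must be within [0.7, 2.0]
    sampling_rate=22050,    # one of 8000, 16000, 22050
    encoding="pcm_linear",  # or "pcm_mulaw"
)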

Providing a context ID

You can provide a context ID to uniquely identify the audio chunks generated for a given text input. This ID is returned with the corresponding audio, allowing you to track, interrupt, or associate the audio with its original request.

import asyncio
import os

from pyneuphonic import Neuphonic, WebsocketEvents
from pyneuphonic.models import APIResponse, TTSResponse

async def main():
    client = Neuphonic(api_key=os.environ.get("NEUPHONIC_API_KEY"))

    ws = client.tts.AsyncWebsocketClient()
    audio_bytes = bytearray()

    # Collect each audio chunk as it arrives; every chunk carries the
    # context_id of the request that produced it.
    async def on_message(message: APIResponse[TTSResponse]):
        nonlocal audio_bytes
        audio_bytes += message.data.audio
        print(message.data.context_id)  # here the context_id will be "1"

    ws.on(WebsocketEvents.MESSAGE, on_message)

    await ws.open()

    await ws.send(
        {"text": "This is message one example.<STOP>", "context_id": "1"}
    )

    await asyncio.sleep(5)  # give the server time to stream the audio back
    await ws.close()

asyncio.run(main())

More Examples

To see more examples, check out our Python SDK examples and JavaScript SDK examples on GitHub.