How to Use Text to Speech
A guide on how to generate speech using the Neuphonic text-to-speech engine.
Neuphonic’s primary offering is its text-to-speech technology, which serves as the foundation for various features, including Agents. Visit our Playground to experiment with different models and voices, and then continue reading below to learn how to implement this in code.
Don’t forget to visit the Quickstart guide to obtain your API Key and get your environment set up.
Speech Synthesis
You can generate speech using the API in two ways: Server-Sent Events (SSE) and WebSockets.
SSE is a streaming protocol where you send a single request to our API to convert text into speech. Our API will then stream the generated audio back to you in real-time, providing the lowest possible latency. Below are some examples of how to use this endpoint.
The SDK examples demonstrate how to send a message to the API, receive the audio stream from the server, and play it through your device’s speaker.
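As a minimal sketch of the SSE flow, the snippet below assumes the Python SDK (pyneuphonic) and its SSEClient, TTSConfig, and AudioPlayer helpers; exact names and signatures may differ between SDK versions, so treat this as illustrative rather than authoritative.

```python
import os

from pyneuphonic import Neuphonic, TTSConfig   # Python SDK (interface assumed)
from pyneuphonic.player import AudioPlayer     # plays audio through your speakers

# Assumes your API key is exported as an environment variable.
client = Neuphonic(api_key=os.environ['NEUPHONIC_API_KEY'])

sse = client.tts.SSEClient()
tts_config = TTSConfig(lang_code='en', speed=1.0)

# Send a single request; the generated audio is streamed back and played as it arrives.
with AudioPlayer() as player:
    response = sse.send('Hello, this is Neuphonic text to speech.', tts_config=tts_config)
    player.play(response)
```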
WebSockets offer a powerful way to create a two-way communication channel between your application and the server. This means you can send messages to the server and receive responses back in real-time. WebSockets are ideal for situations where you need the lowest possible latency, as they allow you to send text to the server in small, easy-to-manage pieces.
This protocol is especially useful for applications that involve continuous text streams, such as when connecting Neuphonic TTS to a language model like GPT-4o, which generates responses word by word. This setup ensures minimal latency and makes it easy to integrate with other services, as audio is streamed back to you as soon as it is generated.
Because WebSockets work by streaming data, you need to use special <STOP> tokens to tell the server when a piece of text is complete and ready to be processed. A typical place to send the <STOP> token is at the end of a stream of messages output from a language model.
This is all demonstrated in the SDK examples below, which show how to send messages to the API and attach event handlers to play the received audio through your device’s speaker.
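As a rough sketch of the raw WebSocket flow (not the SDK itself), the snippet below uses the standalone websockets package. The endpoint URL, authentication scheme, and message field names are assumptions for illustration only; refer to the API reference or the SDK examples for the exact values.

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

# Hypothetical endpoint and auth scheme -- check the API reference for the real ones.
API_KEY = os.environ['NEUPHONIC_API_KEY']
URL = f'wss://api.neuphonic.com/speak/en?api_key={API_KEY}'


async def main():
    async with websockets.connect(URL) as ws:
        # Stream text in small pieces, e.g. tokens coming out of a language model.
        for chunk in ['Hello, ', 'this is a streamed ', 'sentence. ']:
            await ws.send(json.dumps({'text': chunk}))

        # The <STOP> token tells the server the passage is complete and ready to process.
        await ws.send(json.dumps({'text': '<STOP>'}))

        # Collect audio as it is generated (response field names here are assumptions).
        audio = bytearray()
        async for message in ws:
            response = json.loads(message)
            audio.extend(base64.b64decode(response['data']['audio']))
            if response['data'].get('stop'):
                break

        with open('output.pcm', 'wb') as f:
            f.write(audio)


asyncio.run(main())
```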
Return Configuration
If you call the WebSocket directly, the returned response is a JSON object containing the audio data, the sampling rate, the input text, and other details. Below is a breakdown of the format:
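The structure below is an illustrative sketch only; the field names and nesting are assumptions based on the description above, not an authoritative schema.

```python
# Illustrative structure of a single WebSocket message (field names assumed).
example_message = {
    'data': {
        'audio': '<base64-encoded audio bytes>',  # decode before playback or saving
        'text': 'Hello, world!',                  # the input text this chunk relates to
        'sampling_rate': 22050,                   # sampling rate of the returned audio
    },
}
```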
The text returned when using the default low-latency model (neu_fast) does not correspond to the generated audio.
The chosen voice needs to be available for the chosen model and language; see the Voices page for what is available.
Text-to-Speech Configuration
The settings for text-to-speech generation can include the following parameters; a configuration sketch follows the list.
Language code for the desired language. Examples: 'en', 'es', 'de', 'nl', 'hi'.
The voice ID for the desired voice. The model used depends on which voice_id you choose. Find all available voices here: Voices. Example: '8e9c4bc8-3979-48ab-8626-df53befc2090'.
Playback speed of the audio. Has to be in the range [0.7, 2.0]. Examples: 0.7, 1.0, 1.5.
Sampling rate of the audio returned from the server. Options: 8000, 16000, 22050.
Encoding of the audio returned from the server. Options: 'pcm_linear', 'pcm_mulaw'.
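Putting these together, a configuration sketch might look like the following. TTSConfig and its keyword names are taken from the Python SDK and treated here as assumptions; consult the SDK reference for the authoritative parameter names.

```python
from pyneuphonic import TTSConfig  # Python SDK; keyword names assumed

# Illustrative configuration -- parameter names may differ in your SDK version.
tts_config = TTSConfig(
    lang_code='en',          # language code for the desired language
    voice_id='8e9c4bc8-3979-48ab-8626-df53befc2090',  # a voice available for this model/language
    speed=1.0,               # playback speed, within [0.7, 2.0]
    sampling_rate=22050,     # sampling rate of the returned audio
    encoding='pcm_linear',   # audio encoding: 'pcm_linear' or 'pcm_mulaw'
)
```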
More Examples
To see more examples, check out our Python SDK examples and JavaScript SDK examples on GitHub.