How to Use Text to Speech
Neuphonic's primary offering is its text-to-speech technology, which serves as the foundation for various features, including Agents. Visit our Playground to experiment with different models and voices, and then continue reading below to learn how to implement this in code.
Don't forget to visit the Quickstart guide to obtain your API Key and get your environment set up.
Speech Synthesis
You can generate speech using the API in three ways: Server-Sent Events (SSE), WebSockets, and Longform Inference.
- Server-Sent Events (SSE)
- WebSockets
- Longform Inference
SSE is a streaming protocol where you send a single request to our API to convert text into speech. Our API then streams the generated audio back to you in real time with low latency. Below are some examples of how to use this endpoint.
The SDK examples demonstrate how to send a message to the API, receive the audio stream from the server, and play it through your device's speaker.
- cURL
- Python SDK
- JavaScript SDK
# Replace <API_KEY> with your actual API key.
# To switch languages, replace the lang_code in the path parameter (e.g., /en) with the desired language code.
curl -N --request POST \
  --url https://api.neuphonic.com/sse/speak/en \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <API_KEY>' \
  --header 'Accept: text/event-stream' \
  --data '{
    "text": "Hello, world!"
  }'
import os

from pyneuphonic import Neuphonic, TTSConfig
from pyneuphonic.player import AudioPlayer

# Ensure the API key is set in your environment
client = Neuphonic(api_key=os.environ.get('NEUPHONIC_API_KEY'))
sse = client.tts.SSEClient()

# TTSConfig is a pydantic model, so check out the source code for all valid options
tts_config = TTSConfig(
    lang_code='en',  # replace the lang_code with the desired language code
    sampling_rate=22050,
)

# Create an audio player with `pyaudio`.
# Make sure you use the same sampling rate as in the TTSConfig.
with AudioPlayer(sampling_rate=22050) as player:
    response = sse.send('Hello, world!', tts_config=tts_config)
    player.play(response)
import fs from 'fs';
import { createClient, toWav } from '@neuphonic/neuphonic-js';

const client = createClient({ apiKey: '<API_KEY>' });

const msg = `Hello how are you?<STOP>`;

const sse = await client.tts.sse({
  speed: 1.15,
  lang_code: 'en'
});

const res = await sse.send(msg);

// Save the audio data to a file
const wav = toWav(res.audio);
fs.writeFileSync(__dirname + '/sse.wav', wav);
WebSockets offer a powerful way to create a two-way communication channel between your application and the server. This means you can send messages to the server and receive responses back in real-time. WebSockets are ideal for situations where you need the lowest possible latency, as they allow you to send text to the server in small, easy-to-manage pieces.
This protocol is especially useful for applications that involve continuous text streams, such as when connecting Neuphonic TTS to a language model like GPT-4o, which generates responses word by word. This setup ensures minimal latency and makes it easy to integrate with other services, as audio is streamed back to you as soon as it is generated.
Because WebSockets work by streaming data, you need to send a special <STOP> token to tell the server when a piece of text is complete and ready to be processed. A natural place to send <STOP> is at the end of a stream of messages output from a language model, as sketched below.
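For instance, here is a minimal sketch of piping a word-by-word LLM stream into the TTS websocket using the Python SDK's async client shown further below; stream_llm_tokens is a hypothetical stand-in for your language model client's streaming API.

# Minimal sketch (illustrative): forward a streaming LLM response to the TTS
# websocket as it is generated. `stream_llm_tokens` is a hypothetical stand-in
# for your LLM client's streaming API.
async def speak_llm_response(ws, prompt: str):
    async for token in stream_llm_tokens(prompt):
        # Send each piece of text as soon as the model produces it; audio is
        # streamed back through the websocket's MESSAGE event handler.
        await ws.send(token)
    # Mark the text as complete so the server flushes the final audio.
    await ws.send('<STOP>')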
The SDK examples below demonstrate how to send messages to the API and attach event handlers to play the received audio through your device's speaker.
- wscat
- Python SDK
- JavaScript SDK
# First do `npm install -g wscat`
wscat -c "wss://api.neuphonic.com/speak/en?api_key=<API_KEY>&speed=1.0&lang_code=en&sampling_rate=22050"
# And then type "Hello, World! <STOP>" and hit "Enter"
import os
import asyncio

from pyneuphonic import Neuphonic, TTSConfig, WebsocketEvents
from pyneuphonic.models import APIResponse, TTSResponse
from pyneuphonic.player import AsyncAudioPlayer

async def main():
    client = Neuphonic(api_key=os.environ.get('NEUPHONIC_API_KEY'))

    # Create a websocket object from the client
    ws = client.tts.AsyncWebsocketClient()

    # Set the desired configuration; see the TTSConfig model for more configuration options
    tts_config = TTSConfig(
        lang_code='es',
        voice_id='<VOICE_ID>',
        sampling_rate=22050,
        speed=1.0,
        encoding='pcm_linear',
    )

    # Create and start the audio player, which will output audio to your device's speaker
    player = AsyncAudioPlayer()
    await player.open()

    # Attach event handlers. Check the WebsocketEvents enum for all valid events.
    async def on_message(message: APIResponse[TTSResponse]):
        """Play audio through the speaker as it is received."""
        await player.play(message.data.audio)

    async def on_close():
        """Close the audio player when the websocket closes."""
        await player.close()

    ws.on(WebsocketEvents.MESSAGE, on_message)
    ws.on(WebsocketEvents.CLOSE, on_close)

    await ws.open(tts_config=tts_config)  # create the connection with the server

    # A special ' <STOP>' token must be sent to the server, otherwise the server
    # will wait for more text before generating the last few snippets of audio.
    await ws.send('¡Hola, cómo estás!', autocomplete=True)
    await ws.send('¡Hola, cómo estás! <STOP>')  # both the above line and this line are equivalent

    await asyncio.sleep(15)  # let the audio play
    player.save_audio('output.wav')  # save the audio to a .wav file
    await ws.close()  # close the websocket and terminate the audio resources

asyncio.run(main())
import fs from 'fs';
import { createClient, toWav } from '@neuphonic/neuphonic-js';

const client = createClient({ apiKey: '<API_KEY>' });

const msg = `Hello how are you?<STOP>`;

const ws = await client.tts.websocket({
  speed: 1.15,
  lang_code: 'en'
});

let byteLen = 0;
const chunks = [];

// The websocket yields chunks of audio as soon as they are ready,
// which makes the API more responsive.
for await (const chunk of ws.send(msg)) {
  // Here you can forward the data to a client,
  // or collect it in an array and save it as a file.
  chunks.push(chunk.audio);
  byteLen += chunk.audio.byteLength;
}

// Merge all chunks into a single Uint8Array
let offset = 0;
const allAudio = new Uint8Array(byteLen);
chunks.forEach(chunk => {
  allAudio.set(chunk, offset);
  offset += chunk.byteLength;
});

// Save the audio data to a file
const wav = toWav(allAudio);
fs.writeFileSync(__dirname + '/ws.wav', wav);

await ws.close(); // close the socket if we have nothing else to send
Response Format
If you call the websocket directly, the response is a JSON object containing the audio data, the input text, the sampling rate, and other details. Below is a breakdown of the format:
{
  "data": {
    "audio": "<audio_data_string>",
    "text": "<input_text>",
    "sampling_rate": 22050 // default value
  }
}
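If you consume these messages without an SDK (for example over wscat or a raw websocket client), you will need to decode the audio string yourself. The sketch below assumes the string is base64-encoded PCM, which is an assumption to verify against your own responses; the SDKs handle this decoding for you.

import base64
import json

def decode_message(raw: str) -> bytes:
    """Decode one raw websocket message; assumes base64-encoded PCM audio."""
    data = json.loads(raw)['data']
    print(f"text={data['text']!r} sampling_rate={data['sampling_rate']}")
    return base64.b64decode(data['audio'])  # raw audio bytes, ready to buffer or play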
The websocket connection has a 60-second idle timeout: if you don't send any messages within that time, the connection closes automatically.
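If you hold a connection open between utterances, one illustrative way to stay ahead of that timeout is to reopen the socket once it has sat idle too long. The sketch below is not an SDK feature: the 55-second margin is an arbitrary choice, and any event handlers need re-attaching after a reconnect.

import time

class ReconnectingTTS:
    """Illustrative helper: reopens the websocket when it has sat idle
    near the server's 60-second timeout."""

    def __init__(self, client, tts_config, idle_limit: float = 55.0):
        self.client = client
        self.tts_config = tts_config
        self.idle_limit = idle_limit  # arbitrary safety margin below 60 seconds
        self.ws = None
        self.last_used = 0.0

    async def send(self, text: str):
        if self.ws is None or time.monotonic() - self.last_used > self.idle_limit:
            self.ws = self.client.tts.AsyncWebsocketClient()
            # Re-attach any event handlers here before opening.
            await self.ws.open(tts_config=self.tts_config)
        await self.ws.send(text)
        self.last_used = time.monotonic()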
Longform Inference is a special TTS mode that focuses on robustness rather than speed and is designed primarily to produce high-quality audio output.
You can post a request to the Longform Inference endpoint to generate audio from text.
import os
from pyneuphonic import Neuphonic, TTSConfig

# Ensure the API key is set in your environment
client = Neuphonic(api_key=os.environ.get('NEUPHONIC_API_KEY'))
tts = client.tts.LongformInference()

# TTSConfig is a pydantic model, so check out the source code for all valid options
tts_config = TTSConfig(
    lang_code='en',  # replace the lang_code with the desired language code
    sampling_rate=48000,  # Longform Inference supports a 48 kHz sampling rate
    voice_id=None,
)

post_response = tts.post(
    text="Testing the Longform Inference",
    tts_config=tts_config,
)
print(post_response)
Once the job is ready, you can retrieve the result using the get method, which returns a signed link to the audio file.
import os
from pyneuphonic import Neuphonic

# Ensure the API key is set in your environment
client = Neuphonic(api_key=os.environ.get('NEUPHONIC_API_KEY'))
tts = client.tts.LongformInference()

response = tts.get(
    job_id='<JOB_ID>',  # replace with your actual job ID
)
print(response)
For Longform Inference it is possible to use a 48 kHz sampling rate, which is not available for regular TTS.
The chosen voice must be available for the chosen language.
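If you are waiting on a long job, a simple pattern is to poll the get method until the signed link appears. The sketch below is illustrative only: the status and URL field names on the response are assumptions, so check them against the response you print above.

import time

def wait_for_longform(tts, job_id: str, poll_interval: float = 5.0):
    """Illustrative polling loop; the field names below are assumptions,
    not the documented response schema."""
    while True:
        response = tts.get(job_id=job_id)
        data = response.get('data', {})            # hypothetical structure
        if data.get('status') == 'completed':      # hypothetical field and value
            return data.get('audio_url')           # hypothetical: the signed link
        time.sleep(poll_interval)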
Text-to-Speech Configuration
The settings for Text-to-Speech generation can include the following parameters.
Name | Type | Description
---|---|---
lang_code | string | Language code for the desired language. Examples: 'en', 'es', 'de', 'nl', 'hi'
voice_id | string | The voice ID for the desired voice. The voice_id you choose determines which model is used. Example: '8e9c4bc8-3979-48ab-8626-df53befc2090'
speed | float | Playback speed of the audio. Must be within [0.7, 2.0]. Examples: 0.7, 1.0, 1.5
sampling_rate | int | Sampling rate of the audio returned from the server. Options: 8000, 16000, 22050
encoding | string | Encoding of the audio returned from the server. Options: 'pcm_linear', 'pcm_mulaw'
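Putting the table together, a complete configuration using the Python SDK might look like this; the voice_id is the example value from the table, not a recommendation.

from pyneuphonic import TTSConfig

tts_config = TTSConfig(
    lang_code='en',                                   # 'en', 'es', 'de', 'nl', 'hi', ...
    voice_id='8e9c4bc8-3979-48ab-8626-df53befc2090',  # example value from the table above
    speed=1.0,                                        # must be within [0.7, 2.0]
    sampling_rate=22050,                              # 8000, 16000, or 22050
    encoding='pcm_linear',                            # 'pcm_linear' or 'pcm_mulaw'
)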
Providing a context ID
You can provide a context ID to uniquely identify the audio chunks generated for a given text input. This ID is returned with the corresponding audio, allowing you to track, interrupt, or associate the audio with its original request.
- Python SDK
- JavaScript SDK
import asyncio
from pyneuphonic import Neuphonic, WebsocketEvents
from pyneuphonic.models import APIResponse, TTSResponse

async def main():
    client = Neuphonic()  # assumes the API key is set in your environment
    ws = client.tts.AsyncWebsocketClient()
    audio_bytes = bytearray()

    async def on_message(message: APIResponse[TTSResponse]):
        nonlocal audio_bytes
        audio_bytes += message.data.audio
        print(message.data.context_id)  # here the context_id will be "1"

    ws.on(WebsocketEvents.MESSAGE, on_message)
    await ws.open()
    await ws.send(
        {"text": "This is message one example.<STOP>", "context_id": "1"}
    )
    await asyncio.sleep(5)  # allow the audio to stream back before exiting
    await ws.close()

asyncio.run(main())
import { createClient } from '@neuphonic/neuphonic-js';

const client = createClient();

const ws = await client.tts.websocketCb({
  lang_code: 'en'
});

const msg = `Hello how are you?<STOP>`;

let byteLen = 0;
const chunks: Uint8Array[] = [];

ws.onMessage((chunk) => {
  byteLen += chunk.audio.byteLength;
  chunks.push(chunk.audio);
  console.log({ ...chunk, audio: chunk.audio.byteLength }); // here the context_id will be "1"
});

ws.send(msg, { context_id: '1' });
More Examples
To see more examples, check out our Python SDK examples and JavaScript SDK examples on GitHub.