How to Use Text to Speech
A guide on how to generate speech using the Neuphonic text-to-speech engine.
Neuphonic’s primary offering is its text-to-speech technology, which serves as the foundation for various features, including Agents. Visit our Playground to experiment with different models and voices, and then continue reading below to learn how to implement this in code.
Don’t forget to visit the Quickstart guide to obtain your API Key and get your environment set up.
Speech Synthesis
You can generate speech using the API in two ways: Server-Sent Events (SSE) and WebSockets.
SSE is a streaming protocol where you send a single request to our API to convert text into speech. Our API will then stream the generated audio back to you in real-time, providing the lowest possible latency. Below are some examples of how to use this endpoint.
The SDK examples demonstrate how to send a message to the API, receive the audio stream from the server, and play it through your device’s speaker.
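As a minimal sketch of the SSE flow, the snippet below assumes the Python SDK (pyneuphonic) and its SSEClient, TTSConfig, and AudioPlayer helpers; exact names and signatures may differ between SDK versions, so treat this as illustrative rather than authoritative.

```python
import os

from pyneuphonic import Neuphonic, TTSConfig   # Python SDK (interface assumed)
from pyneuphonic.player import AudioPlayer     # plays audio through your speakers

# Assumes your API key is exported as an environment variable.
client = Neuphonic(api_key=os.environ['NEUPHONIC_API_KEY'])

sse = client.tts.SSEClient()
tts_config = TTSConfig(lang_code='en', speed=1.0)

# Send a single request; the generated audio is streamed back and played as it arrives.
with AudioPlayer() as player:
    response = sse.send('Hello, this is Neuphonic text to speech.', tts_config=tts_config)
    player.play(response)
```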
WebSockets offer a powerful way to create a two-way communication channel between your application and the server. This means you can send messages to the server and receive responses back in real-time. WebSockets are ideal for situations where you need the lowest possible latency, as they allow you to send text to the server in small, easy-to-manage pieces.
This protocol is especially useful for applications that involve continuous text streams, such as when connecting Neuphonic TTS to a language model like GPT-4o, which generates responses word by word. This setup ensures minimal latency and makes it easy to integrate with other services, as audio is streamed back to you as soon as it is generated.
Because WebSockets work by streaming data, you need to use special <STOP> tokens to tell the server when a piece of text is complete and ready to be processed. A typical place to send the <STOP> token is at the end of a stream of messages output from a language model.
This is all demonstrated in the SDK examples below, which show how to send messages to the API and attach event handlers to play the received audio through your device’s speaker.
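As a rough sketch of the raw WebSocket flow (not the SDK itself), the snippet below uses the standalone websockets package. The endpoint URL, authentication scheme, and message field names are assumptions for illustration only; refer to the API reference or the SDK examples for the exact values.

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

# Hypothetical endpoint and auth scheme -- check the API reference for the real ones.
API_KEY = os.environ['NEUPHONIC_API_KEY']
URL = f'wss://api.neuphonic.com/speak/en?api_key={API_KEY}'


async def main():
    async with websockets.connect(URL) as ws:
        # Stream text in small pieces, e.g. tokens coming out of a language model.
        for chunk in ['Hello, ', 'this is a streamed ', 'sentence. ']:
            await ws.send(json.dumps({'text': chunk}))

        # The <STOP> token tells the server the passage is complete and ready to process.
        await ws.send(json.dumps({'text': '<STOP>'}))

        # Collect audio as it is generated (response field names here are assumptions).
        audio = bytearray()
        async for message in ws:
            response = json.loads(message)
            audio.extend(base64.b64decode(response['data']['audio']))
            if response['data'].get('stop'):
                break

        with open('output.pcm', 'wb') as f:
            f.write(audio)


asyncio.run(main())
```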
Return Configuration
If you call the WebSocket directly, the returned response is a JSON object containing the audio data, the sampling rate, the input text, and other details. Below is a breakdown of the format:
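The structure below is an illustrative sketch only; the field names and nesting are assumptions based on the description above, not an authoritative schema.

```python
# Illustrative structure of a single WebSocket message (field names assumed).
example_message = {
    'data': {
        'audio': '<base64-encoded audio bytes>',  # decode before playback or saving
        'text': 'Hello, world!',                  # the input text this chunk relates to
        'sampling_rate': 22050,                   # sampling rate of the returned audio
    },
}
```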
The text returned when using the default low-latency model (neu_fast) does not correspond to the generated audio.
The chosen voice needs to be available for the chosen model and language; see the Voices page for what is available.
Text-to-Speech Configuration
The settings for text-to-speech generation can include the following parameters; a configuration sketch follows the list.
Language code for the desired language. Examples: 'en', 'es', 'de', 'nl', 'hi'.
The voice ID for the desired voice. The model used depends on which voice_id you choose. Find all available voices here: Voices. Example: '8e9c4bc8-3979-48ab-8626-df53befc2090'.
Playback speed of the audio. Has to be in the range [0.7, 2.0]. Examples: 0.7, 1.0, 1.5.
Sampling rate of the audio returned from the server. Options: 8000, 16000, 22050.
Encoding of the audio returned from the server. Options: 'pcm_linear', 'pcm_mulaw'.
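Putting these together, a configuration sketch might look like the following. TTSConfig and its keyword names are taken from the Python SDK and treated here as assumptions; consult the SDK reference for the authoritative parameter names.

```python
from pyneuphonic import TTSConfig  # Python SDK; keyword names assumed

# Illustrative configuration -- parameter names may differ in your SDK version.
tts_config = TTSConfig(
    lang_code='en',          # language code for the desired language
    voice_id='8e9c4bc8-3979-48ab-8626-df53befc2090',  # a voice available for this model/language
    speed=1.0,               # playback speed, within [0.7, 2.0]
    sampling_rate=22050,     # sampling rate of the returned audio
    encoding='pcm_linear',   # audio encoding: 'pcm_linear' or 'pcm_mulaw'
)
```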
More Examples
To see more examples, check out our Python SDK examples and JavaScript SDK examples on GitHub.