
Overview

Text-to-speech using Zonos
  • Output Type: audio
  • Estimated Cost: 0.05 * text.length credits
  • Handler: replicate
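
As a rough illustration of the cost estimate above, the sketch below multiplies the character count by 0.05. The function name `estimate_cost` is hypothetical, and it assumes `text.length` is equivalent to Python's `len()`; for text outside the Basic Multilingual Plane (for example, emoji) the two counts can differ.

```python
def estimate_cost(text: str, credits_per_char: float = 0.05) -> float:
    """Estimate credits as 0.05 * number of characters (hypothetical helper)."""
    # Assumption: the documented `text.length` matches Python's len().
    return credits_per_char * len(text)


if __name__ == "__main__":
    # 13 characters * 0.05 credits per character
    print(round(estimate_cost("Hello, world!"), 2))  # 0.65
```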

Parameters

Required Parameters

text
string
required
Text to generate speech from
  • Label: Text

Optional Parameters

audio
audio
Optional reference audio to use for voice cloning
language
string
Language to generate voice in
  • Options: en-us, en-gb, ja, cmn, yue, fr-fr, de
emotion
string
Emotion vector for the generated speech
Encodes emotion as an 8-dimensional vector. The components are, in order: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral. This vector tends to be entangled with other conditioning inputs. Most notably, it is entangled with the sentiment of the text (e.g., angry text is more effectively conditioned to sound angry, whereas trying to make it sound sad is much less effective). It is also entangled with pitch standard deviation, since larger values there tend to correlate with more emotional utterances. Always surround the emotion vector with quotes to avoid list-parsing!
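
The component order described above can be kept straight with a small helper. This is only a sketch: the helper name `emotion_vector` is hypothetical, and the comma-separated string format is an assumption, since the page states only that the parameter is a string and that the vector must be quoted.

```python
# Order taken from the parameter description above.
EMOTIONS = (
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
)


def emotion_vector(**weights: float) -> str:
    """Build the 8 components as one string (hypothetical helper).

    Assumption: a comma-separated list of floats is an accepted format.
    """
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    # Unspecified emotions default to 0.0; the order always follows EMOTIONS.
    values = [float(weights.get(name, 0.0)) for name in EMOTIONS]
    return ",".join(f"{v:g}" for v in values)


if __name__ == "__main__":
    print(emotion_vector(happiness=0.8, neutral=0.2))  # 0.8,0,0,0,0,0,0,0.2
```

Passing the result as a single string value, rather than as a Python list, is one way to follow the advice about quoting the vector.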
speaking_rate
float
default:15
Speaking rate in phonemes per second.
Default is 15.0. 10-12 is slow and clear, 15-17 is natural conversational, and 20+ is fast. Values above 25 may produce artifacts.
  • Label: Speaking Rate
  • Minimum: 5
  • Maximum: 30
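
The ranges described for speaking_rate can be checked before submitting a request. The sketch below is illustrative only: `describe_speaking_rate` is a hypothetical helper, and rates that fall between the named bands (for example, 13) are simply reported as unlabeled.

```python
def describe_speaking_rate(rate: float) -> str:
    """Map a rate in phonemes per second to the documented guidance."""
    # Documented bounds: minimum 5, maximum 30.
    if not 5 <= rate <= 30:
        raise ValueError("speaking_rate must be between 5 and 30")
    if rate > 25:
        return "fast, may produce artifacts"
    if rate >= 20:
        return "fast"
    if 15 <= rate <= 17:
        return "natural conversational"
    if 10 <= rate <= 12:
        return "slow and clear"
    return "unlabeled"  # between the bands named in the description


if __name__ == "__main__":
    for rate in (11, 15.0, 22, 28):
        print(rate, "->", describe_speaking_rate(rate))
```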
seed
integer
default:"random"
Random seed for reproducibility. If left blank, a random value is used.
Set this only if you want to reuse the seed of a previous generation. Otherwise, leave it blank.
  • Label: Seed
  • Minimum: 0
  • Maximum: 2147483647
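
One way to follow the seed guidance is to omit the field entirely unless a previous seed is being reused. The sketch below only assembles a parameter dictionary using the parameter names from this page; `build_inputs` is a hypothetical helper and does not call any service.

```python
from typing import Any, Dict, Optional

MAX_SEED = 2147483647  # documented maximum


def build_inputs(text: str, seed: Optional[int] = None) -> Dict[str, Any]:
    """Assemble request parameters, leaving the seed out unless one is given."""
    if not text:
        raise ValueError("text is required")
    inputs: Dict[str, Any] = {"text": text}
    if seed is not None:
        # Only validated and included when reusing a known seed.
        if not 0 <= seed <= MAX_SEED:
            raise ValueError(f"seed must be between 0 and {MAX_SEED}")
        inputs["seed"] = seed
    return inputs


if __name__ == "__main__":
    print(build_inputs("Hello there"))           # {'text': 'Hello there'}
    print(build_inputs("Hello there", seed=42))  # {'text': 'Hello there', 'seed': 42}
```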