Gemini 2.5 Pro Text to Speech | Realistic Voice & TTS API

Gemini 2.5 Pro Text-to-Speech

Gemini 2.5 Pro Text-to-Speech is Google's advanced multi-speaker speech synthesis model that turns written dialogue into natural, expressive audio. It supports multiple speakers with distinct voices in a single generation, making it ideal for podcasts, conversations, audiobooks, and any content that needs realistic multi-voice narration.

Why Choose This?

Multi-speaker dialogue: Assign different voices to different speakers and generate a natural-sounding conversation in one pass, with no need to stitch separate audio clips together.
Expressive, natural voices: Powered by Gemini 2.5 Pro, the voices carry natural intonation, pacing, and emotional range for lifelike results.
Multi-language support: Supports a wide range of languages including Arabic (Egypt), Bangla (Bangladesh), Dutch (Netherlands), English (India), English (United States), French (France), German (Germany), Hindi (India), Indonesian (Indonesia), and more.
Flexible speaker setup: Add as many speakers as your script needs, each with their own named voice. Simply write dialogue with speaker labels and the model handles the rest.

Parameters

Parameter	Required	Description
text	Yes	The script or dialogue text. Use "Speaker: line" format for multi-speaker content.
language	Yes	Language and locale for synthesis (e.g., English (United States), French (France)).
speakers	Yes	A list of speaker entries, each with a speaker name and a voice selection.
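
As a minimal sketch of how the three parameters fit together, the snippet below assembles a request body in Python. The field names `speaker` and `voice` and the voice `Achernar` are taken from the API examples on this page; the helper `build_payload` and its checks are hypothetical and not part of any SDK.

```python
import json


def build_payload(text, language, speakers):
    """Assemble a request body from the three required parameters.

    `speakers` is a list of (speaker_name, voice_name) pairs.
    This helper is illustrative only; it is not part of the WaveSpeedAI SDK.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if not speakers:
        raise ValueError("at least one speaker entry is required")
    return {
        "text": text,
        "language": language,
        "speakers": [{"speaker": name, "voice": voice} for name, voice in speakers],
    }


payload = build_payload(
    text="Rose: Welcome back to Tech Talk!",
    language="English (United States)",
    speakers=[("Rose", "Achernar")],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the JSON body of the request or as the input argument of an SDK call.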

How to Use

Write your script in the text field using the "Speaker: dialogue" format (e.g., "Rose: Welcome back to Tech Talk!").
Select the language from the dropdown.
Add speakers — for each speaker in your script, add an entry with the speaker name and choose a voice.
Run — the model generates a single audio file with all speakers voiced naturally.
Download the output audio.

Pricing

$0.08 per 1,000 characters of input text.

Billing Rules

Billed by text length, rounded up to the nearest 1,000 characters
Minimum charge is $0.08 (for texts up to 1,000 characters)

Examples

Text Length	Cost
500 characters	$0.08
1,000 characters	$0.08
2,500 characters	$0.24
5,000 characters	$0.40
10,000 characters	$0.80
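
The billing rules above can be sketched as a small Python function. This is an estimate under stated assumptions: it treats the billed length as `len(text)` of the full script, speaker labels included, which this page does not confirm, and the name `estimate_cost` is hypothetical.

```python
import math

PRICE_PER_BLOCK_USD = 0.08  # price per 1,000 characters, from the pricing section
BLOCK_SIZE = 1000           # billing granularity in characters


def estimate_cost(num_chars: int) -> float:
    """Round the length up to whole 1,000-character blocks, with a one-block minimum."""
    blocks = max(1, math.ceil(num_chars / BLOCK_SIZE))
    return round(blocks * PRICE_PER_BLOCK_USD, 2)


for n in (500, 1000, 2500, 5000, 10000):
    print(f"{n:>6,} characters -> ${estimate_cost(n):.2f}")
```

For example, 2,500 characters round up to three blocks, so the estimate is 3 × $0.08 = $0.24, which matches the table.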

Best Use Cases

Podcasts & Talk Shows — Generate multi-host audio content with distinct voices for each speaker.
Audiobooks & Narration — Bring stories to life with different character voices in a single generation.
E-learning & Training — Create engaging instructional audio with conversational dialogue.
Content Localization — Produce voiceovers in multiple languages for global audiences.
Prototyping & Pre-production — Quickly audition dialogue and voice pairings before recording with real talent.

Pro Tips

Use the "Speaker: dialogue" format consistently throughout your script to ensure correct voice assignment.
Make sure each speaker name in the text exactly matches the speaker name in the speakers list.
Keep dialogue natural — the model handles pacing and intonation best with conversational writing.
For long scripts, break content into logical segments to review quality before generating the full piece.
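
The tips on matching speaker names and segmenting long scripts can be checked before submitting. The sketch below is a rough heuristic, not something the service provides: it assumes every labeled line starts with "Name:", so any unlabeled line that happens to contain a colon is also read as a label. The function names and the `max_chars` value are hypothetical.

```python
import re

# A label is everything before the first colon on a line, followed by some dialogue.
LABEL = re.compile(r"^\s*([^:\n]+?)\s*:\s*\S")


def script_speakers(text):
    """Return the speaker labels found in the script, in order of first appearance."""
    seen = []
    for line in text.splitlines():
        match = LABEL.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def check_labels(text, speakers):
    """Compare script labels with the speakers list using exact, case-sensitive matching."""
    declared = {entry["speaker"] for entry in speakers}
    used = set(script_speakers(text))
    return {"undeclared": sorted(used - declared), "unused": sorted(declared - used)}


def split_script(text, max_chars=1000):
    """Group whole lines into segments of at most max_chars characters.

    A single line longer than max_chars is kept intact as its own segment.
    """
    segments, current = [], ""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = line
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


script = "Rose: Welcome back to Tech Talk!\nSam: Glad to be here.\nrose: Let's begin."
speakers = [
    {"speaker": "Rose", "voice": "Achernar"},
    {"speaker": "Sam", "voice": "Achernar"},  # the voice choice does not affect this check
]
print(check_labels(script, speakers))
print(split_script(script, max_chars=60))
```

Here the lowercase "rose" label is reported as undeclared because it does not exactly match "Rose". Note that if each segment is submitted as a separate run, each one is rounded up to the nearest 1,000 characters on its own, so many small segments can cost more than one full run.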

Notes

The number of available voices may vary by language. Experiment with different voice options to find the best fit for your content.
Please ensure your content complies with Google's usage policies.

Gemini 2.5 Pro Text To Speech API — Quick start

Grab a WaveSpeedAI API key, then call POST https://api.wavespeed.ai/api/v3/google/gemini-2.5-pro/text-to-speech with your input as JSON. The endpoint returns a prediction id; poll the prediction endpoint until status flips to completed, then read the output URL from data.outputs[0]. Examples follow.

HTTP example

# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/google/gemini-2.5-pro/text-to-speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "text": "Rose: Welcome back to Tech Talk!",
    "language": "English (United States)",
    "speakers": [
        {
            "speaker": "Rose",
            "voice": "Achernar"
        }
    ]
}'

# Response includes a prediction id. Poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].

Node.js example

// npm install wavespeed
const WaveSpeed = require('wavespeed');

const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

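// CommonJS modules do not support top-level await, so wrap the call in an async function.
async function main() {
  const result = await client.run("google/gemini-2.5-pro/text-to-speech", {
    // "text" is required; the speaker name must match the label used in the script.
    text: "Rose: Welcome back to Tech Talk!",
    language: "English (United States)",
    speakers: [
      { speaker: "Rose", voice: "Achernar" }
    ]
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch(console.error);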

Python example

# pip install wavespeed
import wavespeed

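output = wavespeed.run(
    "google/gemini-2.5-pro/text-to-speech",
    {
        # "text" is required; the speaker name must match the label used in the script.
        "text": "Rose: Welcome back to Tech Talk!",
        "language": "English (United States)",
        "speakers": [
            {"speaker": "Rose", "voice": "Achernar"},
        ],
    },
)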

print(output["outputs"][0])  # → URL of the generated output
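
The SDK calls above hide the polling step described in the quick start. As a rough sketch of that loop using only the Python standard library, the code below polls the result endpoint from the HTTP example. Several details are assumptions not confirmed on this page: that `status` sits under `data` alongside `outputs`, that a failure is reported as `"failed"`, and the chosen interval and timeout. The names `fetch_result` and `wait_for_output` are hypothetical.

```python
import json
import os
import time
import urllib.request

RESULT_URL = "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result"


def fetch_result(request_id):
    """GET the prediction result and return the decoded JSON body."""
    req = urllib.request.Request(
        RESULT_URL.format(request_id=request_id),
        headers={"Authorization": f"Bearer {os.environ['WAVESPEED_API_KEY']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_output(request_id, fetch=fetch_result, interval=1.0, timeout=120.0):
    """Poll until the status is "completed", then return data.outputs[0]."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch(request_id)["data"]  # assumed envelope: {"data": {...}}
        if data["status"] == "completed":
            return data["outputs"][0]
        if data["status"] == "failed":  # assumed name of the failure status
            raise RuntimeError(f"prediction {request_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {request_id} not completed after {timeout}s")
        time.sleep(interval)


# Usage (requires WAVESPEED_API_KEY and a real prediction id):
# print(wait_for_output("<prediction id from the submit response>"))
```

The `fetch` parameter exists so the loop can be exercised with a stub instead of a live request; in normal use the default is enough.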

Gemini 2.5 Pro Text To Speech API — Frequently asked questions

What is the Gemini 2.5 Pro Text To Speech API?

Gemini 2.5 Pro Text To Speech is Google's multi-speaker speech synthesis model, exposed as a REST API on WaveSpeedAI. It offers 30+ voices across 24 languages and suits dialogues, conversations, and multilingual content. The hosted inference API is ready to use, with no cold starts. You can call it programmatically or try it from the playground above.

How do I call the Gemini 2.5 Pro Text To Speech API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/google/google-gemini-2.5-pro-text-to-speech.

How much does Gemini 2.5 Pro Text To Speech cost per run?

Gemini 2.5 Pro Text To Speech starts at $0.08 per run, which covers texts up to 1,000 characters. Beyond that, the charge scales with input text length at $0.08 per 1,000 characters, rounded up to the nearest 1,000. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Gemini 2.5 Pro Text To Speech accept?

Key inputs: `language`, `speakers`, `text`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/google/google-gemini-2.5-pro-text-to-speech.

How do I get started with the Gemini 2.5 Pro Text To Speech API?

Sign up for a free WaveSpeedAI account to claim starter credits, copy your API key from /accesskey, then call the endpoint shown in the API tab of the playground. The playground also auto-generates a code sample in Python, JavaScript, or cURL for the parameters you've set.

Can I use Gemini 2.5 Pro Text To Speech outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (Google). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.
