Speech Generation

Convert text into expressive spoken audio

Our selection

video-dubbing

wavespeed-ai/mmaudio-v2

MMaudio v2 produces synchronized audio from video or text inputs, ideal for adding soundtracks to videos when paired with video models. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

All models

39 models

text-to-audio

kwaivgi/kling-text-to-audio

Kling Text-to-Audio turns text prompts into custom sound effects for videos, games, and multimedia using KlingAI's audio model. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/turbo-v2

ElevenLabs Turbo V2 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1000 characters for API requests. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/ace-step/audio-outpaint

ACE-Step Audio Outpaint generates seamless start or end extensions that match the original audio, ideal for intros, outros, and longer tracks. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/voice-design

MiniMax Voice Design generates natural voices from text descriptions, with no cloning required, and lets you set tone, accent, and personality. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.5-hd-preview

MiniMax Speech 2.5 HD Preview offers HD TTS with enhanced multilingual expressiveness, accurate voice cloning, and 40-language support. Ready-to-use REST API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.5-turbo-preview

MiniMax Speech 2.5 Turbo Preview is an HD TTS model with multilingual support and accurate voice replication across 40 languages, priced at $0.04 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/ace-step/audio-inpaint

ACE-Step Audio Inpaint edits a specific audio segment to change lyrics or style while preserving the surrounding audio. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/multilingual-v2

ElevenLabs Multilingual V2 is a multilingual text-to-speech model, priced at $0.10 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-02-turbo

MiniMax Speech-02 Turbo is a high-definition text-to-speech model delivering natural voice output, priced at $0.03 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/flash-v2

ElevenLabs Flash V2 is a text-to-speech model that converts text into spoken audio. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/flash-v2.5

ElevenLabs Flash V2.5 is a text-to-speech model on WaveSpeedAI, billed at $0.05 per 1000 characters of generated speech. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/multilingual-v1

ElevenLabs Multilingual V1 provides natural-sounding text-to-speech across many languages. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

digital-human

wavespeed-ai/wan-2.2/speech-to-video

Wan-2.2-S2V turns images and speech into high-fidelity videos with realistic face and body motion; it supports clips of up to 10 minutes in 480p, from $0.15 per 5 seconds. Ready-to-use REST API, no cold starts, affordable pricing.

text-to-audio

kwaivgi/kling-v1-tts

Kling V1 TTS creates natural-sounding audio and supports KlingAI image, video, sound effect, virtual model, and custom AI workflows. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-v1.5

MiniMax Music v1.5 is a text-to-audio model that turns text prompts into high-quality, diverse music tracks. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

alibaba/qwen3-tts-flash

Qwen3 TTS Flash is a low-latency text-to-speech model for English and Chinese with multiple voices, ideal for real-time dialogue. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/eleven-v3

ElevenLabs eleven-v3 is a text-to-speech model available as a hosted endpoint; requests cost $0.10 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/ace-step

ACE-Step generates music with lyrics, up to 4 minutes long, from text prompts with high acoustic fidelity; it supports voice cloning, lyric edits, and remixing. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/ace-step/audio-to-audio

ACE-Step Audio-to-Audio turns existing tracks into remixes or vocal edits using remix and lyrics modes while preserving audio character. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

minimax/voice-clone

MiniMax Voice Clone creates high-quality voice clones from short reference clips, closely matching tone, accent, and speaking style. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/ace-step/prompt-to-audio

ACE-Step Prompt-to-Audio creates music from simple prompts, auto-generating genre tags and lyrics for quick song creation. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/turbo-v2.5

ElevenLabs Turbo V2.5 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1000 characters for TTS requests. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-01

MiniMax Music-01 synthesizes accompaniment and vocals simultaneously to produce complete songs across diverse styles. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.6-hd

MiniMax Speech 2.6 HD is an ultra-human, low-latency (under 250 ms) TTS model with voice cloning, text normalization, and support for 40+ languages. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.6-turbo

MiniMax Speech 2.6 Turbo is a text-to-speech model offering ultra-human voice cloning, industry-leading text normalization, sub-250 ms latency, and 40+ language support. Pricing: $0.06 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-02

MiniMax Music-02 is a compact, fast, cost-effective MoE music generator (230B parameters, 10B active) for high-quality music production. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-02-hd

MiniMax Speech 02 HD is MiniMax's high-definition text-to-speech model delivering clear HD voices, priced at $0.05 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/vibevoice

wavespeed-ai/vibevoice is an advanced voice generation model that produces high-fidelity, natural, and expressive speech from text, with optional speaker and region-style control for more precise results and easy integration into real-world applications. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.8-turbo

MiniMax Speech 2.8 Turbo is a high-definition text-to-speech model with natural and expressive voice synthesis. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.8-hd

MiniMax Speech 2.8 HD is a high-definition text-to-speech model with natural and expressive voice synthesis for premium audio quality. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/qwen3-tts/text-to-speech

Qwen3 TTS: Multi-language, multi-voice text-to-speech synthesis with style control. Supports 11 languages and 9 voice characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/qwen3-tts/voice-clone

Qwen3 TTS Voice Clone: Clone any voice from a reference audio and generate speech in that voice. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/qwen3-tts/voice-design

Qwen3 TTS Voice Design: Generate speech with custom voice characteristics described in natural language. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

microsoft/vibevoice

Microsoft VibeVoice is a text-to-speech model that generates long-form speech from text with multi-speaker dialogue support. Choose from 9 voice presets across English, Chinese, and Hindi. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

inworld/inworld-1.5-max/text-to-speech

Inworld 1.5 Max delivers premium text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and high-fidelity natural-sounding audio output. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

inworld/inworld-1.5-mini/text-to-speech

Inworld 1.5 Mini delivers high-quality text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and natural-sounding audio output. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

google/gemini-2.5-pro/text-to-speech

Google Gemini 2.5 Pro Text-to-Speech delivers natural multi-speaker voice synthesis with 30+ voices across 24 languages. Well suited to dialogues, conversations, and multilingual content. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

google/gemini-2.5-flash/text-to-speech

Google Gemini 2.5 Flash Text-to-Speech delivers fast, natural multi-speaker voice synthesis with 30+ voices across 24 languages at lower cost. Well suited to dialogues, conversations, and multilingual content. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

Speech Generation API — pricing & performance

Run any model in the Speech Generation collection through a single REST API. Pay per generation — no subscriptions, no minimums — with industry-leading latency on a 99.9% uptime infrastructure.

Why run Speech Generation on WaveSpeedAI

Transparent pricing

Per-call pricing for every Speech Generation model. The price is listed on each model page — no platform fees on top.
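Several model descriptions in this collection quote a price per 1000 characters. As a rough illustration, the sketch below estimates what a given text might cost at those quoted rates. It assumes billing scales linearly with character count; the actual rounding rules and what counts as a character are not stated on this page, and the names PRICE_PER_1000_CHARS and estimate_cost are hypothetical, not part of any WaveSpeedAI library.

```python
from decimal import Decimal

# Per-1000-character prices (USD) as quoted in the model descriptions on this page.
PRICE_PER_1000_CHARS = {
    "elevenlabs/turbo-v2": Decimal("0.05"),
    "elevenlabs/multilingual-v2": Decimal("0.1"),
    "minimax/speech-2.5-turbo-preview": Decimal("0.04"),
    "minimax/speech-02-turbo": Decimal("0.03"),
    "minimax/speech-2.6-turbo": Decimal("0.06"),
    "minimax/speech-02-hd": Decimal("0.05"),
}


def estimate_cost(model: str, text: str) -> Decimal:
    """Estimate the cost of synthesizing `text` with `model`.

    Assumption: cost is pro-rated linearly by character count.
    The real billing granularity may differ; check the model page.
    """
    price = PRICE_PER_1000_CHARS[model]  # raises KeyError for unlisted models
    return price * len(text) / Decimal(1000)


if __name__ == "__main__":
    script = "Hello, world! " * 100  # 1400 characters
    for name in ("minimax/speech-02-turbo", "elevenlabs/multilingual-v2"):
        print(name, estimate_cost(name, script))
```

Decimal is used instead of float so that small per-character amounts do not pick up binary rounding noise when summed over many requests.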

Optimized for low latency

Most Speech Generation image models complete in under 2 seconds. Video and 3D models run several times faster than self-hosted alternatives.

99.9% uptime

Multi-region failover and automatic retries keep your production traffic online — even during provider outages.

Frequently asked questions

How much does the Speech Generation API cost?

Each model has its own per-call price listed on the model page. We bill per successful generation, with no subscription fees or minimums.

How fast are Speech Generation models on WaveSpeedAI?

Image models in this collection typically complete in under 2 seconds. Video and 3D models depend on duration and resolution but are usually several times faster than self-hosted runs.

Can I try the API without a credit card?

Yes — every account gets $1 in free credits on signup, enough to try most Speech Generation models without a credit card.

Are there rate limits?

Standard accounts have generous concurrent-job limits. Enterprise plans offer custom RPM, higher concurrency, and dedicated capacity — contact sales for details.
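Because limits are expressed in concurrent jobs, a client may want to cap how many requests it has in flight at once. The sketch below shows one way to do that with an asyncio semaphore. The function name run_limited and the choice of limit are hypothetical; the actual limit for an account is not stated on this page.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def run_limited(
    jobs: Sequence[Callable[[], Awaitable[T]]],
    max_concurrent: int,
) -> List[T]:
    """Run zero-argument coroutine factories with a cap on in-flight jobs.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        # Wait for a free slot before starting the job.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


if __name__ == "__main__":
    async def fake_generation() -> str:
        # Stand-in for a real API call; sleeps instead of using the network.
        await asyncio.sleep(0.01)
        return "done"

    print(asyncio.run(run_limited([fake_generation] * 5, max_concurrent=2)))
```

Each job is passed as a factory rather than an already-created coroutine so that work does not start until a slot is available.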

Explore 1,000+ AI Models

Browse our full catalog of state-of-the-art AI models — image, video, 3D, audio, LLM, and more.

wavespeed.ai/models →

Build with the API

Integrate AI into your own apps. RESTful API with client libraries — no cold starts, pay per use.
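As a minimal sketch of what a REST integration could look like, the example below builds (but does not send) a POST request for a text-to-speech model using only the Python standard library. The base URL, the path layout, the bearer-token header, and the "text" field are all assumptions made for illustration; the documentation at wavespeed.ai/docs is the authority on the real endpoint and parameters. The helper name build_tts_request is hypothetical.

```python
import json
from urllib import request

# Assumed base URL and path layout; confirm against wavespeed.ai/docs.
API_BASE = "https://api.wavespeed.ai/api/v3"


def build_tts_request(model: str, text: str, api_key: str) -> request.Request:
    """Build a POST request for a model ID such as 'minimax/speech-02-hd'.

    Assumptions: the model ID is appended to the base URL, the API key is
    sent as a bearer token, and the input field is named 'text'.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        url=f"{API_BASE}/{model}",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("minimax/speech-02-hd", "Hello there.", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    # To actually send it (requires a valid key and network access):
    # with request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Separating request construction from sending keeps the example inspectable without credentials and makes it easy to swap in one of the client libraries mentioned above.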

wavespeed.ai/docs →