Introducing LTX-2.3 Text-to-Video on WaveSpeedAI

LTX-2.3 Text-to-Video: Generate Synchronized Video and Audio From a Single Prompt

LTX-2.3 is a DiT-based audio-video foundation model that generates fully synchronized video and audio from a single text prompt — eliminating the traditional two-step workflow of producing visuals and sound separately. Now available on WaveSpeedAI, this upgraded release delivers sharper visuals, richer audio, and noticeably better prompt adherence than its predecessor, making it a compelling choice for creators who want production-ready clips without piecing together multiple AI tools.

For studios, marketers, and indie creators, the headline is simple: type a scene, get a video that already sounds right.

How LTX-2.3 Text-to-Video Works

LTX-2.3 is built on a Diffusion Transformer (DiT) architecture trained jointly on video and audio data. Instead of generating silent footage and dubbing in sound afterwards, the model produces both modalities within the same diffusion process, so on-screen events and audio cues stay aligned: footsteps land on the beat, rain hisses when raindrops appear, and dialogue-like ambience matches the visual context.

Key technical specs developers care about:

Input: Text prompt describing scene, motion, and audio cues
Output: MP4 video with embedded synchronized audio
Resolutions: 480p, 720p (default), 1080p
Duration: 5 to 20 seconds in a single generation
Constraints: Width and height divisible by 32; frame count of the form 8n + 1 (for example, 121 or 241)
Seed control: Optional fixed seed for reproducible iteration
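
The divisibility constraints in the spec list can be checked before a request is built. The sketch below is a minimal illustration, assuming the frame rule means a frame count of the form 8n + 1. The helper names (snap_dimension, snap_frame_count, is_valid_shape) are hypothetical and are not part of any WaveSpeedAI SDK. The hosted endpoint takes a resolution and a duration rather than raw dimensions, so this mostly matters when you work with explicit width, height, and frame values.

```python
def snap_dimension(value: int, multiple: int = 32) -> int:
    """Round a width or height down to the nearest multiple (at least one multiple)."""
    return max(multiple, (value // multiple) * multiple)


def snap_frame_count(frames: int) -> int:
    """Round a frame count down to the nearest value of the form 8n + 1."""
    if frames < 1:
        raise ValueError("frame count must be at least 1")
    return ((frames - 1) // 8) * 8 + 1


def is_valid_shape(width: int, height: int, frames: int) -> bool:
    """Check the stated constraints: dimensions divisible by 32, frames = 8n + 1."""
    if width <= 0 or height <= 0 or frames <= 0:
        return False
    return width % 32 == 0 and height % 32 == 0 and frames % 8 == 1


print(snap_dimension(1080))             # 1056
print(snap_frame_count(120))            # 113
print(is_valid_shape(1280, 704, 121))   # True
print(is_valid_shape(1920, 1080, 121))  # False
```

Note that 1080 is not itself a multiple of 32, so a nominal 1920 x 1080 frame would have to be snapped (here to 1056) or padded. How the hosted 1080p tier handles this is not stated in this article.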

Compared with text-to-video models that output silent clips (Sora-style or earlier diffusion baselines), LTX-2.3 collapses two pipelines, visual synthesis and audio generation, into one foundation model. With one model call instead of two, that generally means lower latency, lower cost, and no manual sync work in post.

Ready to test it? Try LTX-2.3 Text-to-Video on WaveSpeedAI and generate your first clip in under a minute.

Key Features of LTX-2.3 Text-to-Video

Synchronized audio-video in one pass — No separate sound design step. The model generates matching ambience, effects, and atmospheric audio as part of the same diffusion process.
Improved prompt adherence over LTX-2 — The 2.3 update tightens alignment between detailed prompts and rendered scenes, so complex descriptions translate more reliably to screen.
Three resolution tiers (480p / 720p / 1080p) — Iterate cheaply at 480p, then scale up to 1080p for final delivery without changing your prompt or workflow.
Variable clip length up to 20 seconds — Long enough for ad reads, social hooks, and short narrative beats; short enough to keep generations fast.
DiT-based foundation model — Diffusion Transformer architecture delivers temporally consistent motion and high-fidelity textures, especially on dynamic scenes.
Production-ready REST API — Available on WaveSpeedAI with no cold starts, predictable latency, and pay-per-use pricing.
Reproducible outputs with seed control — Lock the seed to A/B test prompt variations without random variance interfering.

Best Use Cases for LTX-2.3 Text-to-Video

Short-Form Social Content

Short-form platforms reward velocity and audio. LTX-2.3 lets creators ship 10–15 second TikTok, Reels, and Shorts clips with built-in sound design, with no hunting for royalty-free music and no manual audio alignment in an editor such as Audacity. Type “neon-lit Tokyo street, rain hitting puddles, distant jazz, slow dolly forward,” and the model returns a clip that is ready to post.

Marketing and Performance Ads

Performance marketers need to test dozens of creative variants per week. With LTX-2.3, an agency can generate a full ad in 720p for $0.30 per 10-second spot, swap copy or scene descriptions, and iterate creative concepts faster than any traditional production pipeline. Synchronized audio means each variant is ad-network-ready out of the gate.

Storyboarding and Pre-Visualization

Film directors and animators can transform written scenes into living previz with matching atmosphere. Describe a scene from a screenplay — “wind howling across a desert ridge, rider gallops past camera, crow calls overhead” — and use the resulting clip to align cinematographers, editors, and clients before any real shoot day.

Product Demos and Explainers

SaaS and hardware teams can prototype video explainers without booking studios. Describe the product context, motion, and ambient setting, and use LTX-2.3 to generate background B-roll that already sounds polished — perfect for landing pages, onboarding flows, and pitch decks.

Game Trailers and Cinematic Concepts

Indie game studios can rapidly mock up trailer cuts and atmospheric concept videos. The synchronized audio is particularly valuable here: a 10-second forest ambush clip with leaf rustle, sword clash, and bird flutter conveys a game’s tone far better than silent footage.

Music and Mood Visualizers

Musicians and lo-fi creators can generate looping mood pieces — “rain on a window, soft piano, slow zoom on a coffee cup” — for streaming visualizers, livestream backgrounds, and social posts.

Educational and Narrative Content

Educators and storytellers can bring written content to life. A children’s book author can prototype animated reads; a history channel can illustrate scene-setting moments without licensing stock footage.

LTX-2.3 Pricing and API Access

LTX-2.3 uses transparent, pay-per-use pricing scaled by resolution and duration:

Resolution	5s	10s	15s	20s
480p	$0.10	$0.20	$0.30	$0.40
720p	$0.15	$0.30	$0.45	$0.60
1080p	$0.20	$0.40	$0.60	$0.80

That makes a finished, audio-included 1080p 20-second clip just $0.80 — a fraction of typical stock footage licensing or freelance video production costs.
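
The table is linear: every additional 5 seconds adds a fixed amount per resolution ($0.10, $0.15, or $0.20). The sketch below encodes that in integer cents to avoid floating-point rounding and estimates the cost of a batch. The function names are hypothetical, and the calculator only accepts the four durations listed in the table, since pricing for other lengths is not given here.

```python
# Price in cents for each 5-second block, taken from the pricing table.
CENTS_PER_5S = {"480p": 10, "720p": 15, "1080p": 20}
TABLE_DURATIONS = (5, 10, 15, 20)


def clip_price_cents(resolution: str, duration_s: int) -> int:
    """Return the table price of one clip, in cents."""
    if resolution not in CENTS_PER_5S:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if duration_s not in TABLE_DURATIONS:
        raise ValueError(f"duration not in pricing table: {duration_s}")
    return CENTS_PER_5S[resolution] * (duration_s // 5)


def batch_price_cents(resolution: str, duration_s: int, clips: int) -> int:
    """Return the table price of several identical clips, in cents."""
    return clip_price_cents(resolution, duration_s) * clips


print(clip_price_cents("1080p", 20))      # 80
print(batch_price_cents("720p", 10, 24))  # 720
```

For example, two dozen 10-second ad variants at 720p come to 720 cents, or $7.20, under the listed prices.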

Calling LTX-2.3 via the WaveSpeedAI API

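import wavespeed

# Run the LTX-2.3 text-to-video model with a prompt, resolution, and duration.
output = wavespeed.run(
    "wavespeed-ai/ltx-2.3/text-to-video",
    {
        # Describe the scene, the motion, and the sounds you want to hear.
        "prompt": "A golden retriever runs through a sunlit meadow, paws thumping the grass, birds chirping overhead, gentle wind",
        "resolution": "720p",  # "480p", "720p" (default), or "1080p"
        "duration": 10,  # clip length in seconds (5 to 20)
    },
)

# The first output is the URL of the generated video with embedded audio.
print(output["outputs"][0])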

WaveSpeedAI advantages developers care about:

No cold starts — first-call latency matches steady-state latency
REST API — language-agnostic, drop into any stack
Pay-per-use — no minimums, no idle GPU charges
Production-grade uptime — built for high-throughput inference workloads

Get an API key and start building with LTX-2.3.

Tips for Best Results With LTX-2.3 Text-to-Video

Be explicit about audio — The model auto-generates sound, but stating “rain”, “jazz piano”, “crowd cheering”, or “footsteps on gravel” gives you stronger control over the audio track.
Describe motion, not just scenery — Camera moves (“slow dolly in”, “handheld tracking shot”), subject motion, and pacing cues yield more cinematic outputs than static descriptions.
Iterate at 480p, render at 1080p — Use the cheapest tier to dial in your prompt, then regenerate at a higher resolution once the composition is locked. Keep the seed fixed while iterating so that differences between runs come from your prompt changes rather than random variation.
Constrain prompts to one beat — A 10-second clip can only carry one or two narrative moments. Avoid cramming multi-scene scripts into a single prompt.
Edit longer videos in post — For content over 20 seconds, generate multiple LTX-2.3 clips and stitch them together in your NLE.
Use seed locking for A/B testing — When comparing two prompt variants, set the same seed to isolate prompt changes from noise variance.
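
The seed-locking and draft-then-final tips can be sketched as plain request payloads. This is a minimal illustration, not the documented request schema: the "seed" field name is an assumption based on the seed control mentioned in this article, and build_ab_payloads and promote_to_final are hypothetical helpers.

```python
MODEL_ID = "wavespeed-ai/ltx-2.3/text-to-video"


def build_ab_payloads(prompt_a: str, prompt_b: str, seed: int,
                      resolution: str = "480p", duration: int = 5) -> list[dict]:
    """Build two payloads that differ only in their prompt."""
    # "seed" is an assumed parameter name; check the model's API reference.
    shared = {"resolution": resolution, "duration": duration, "seed": seed}
    return [{"prompt": prompt_a, **shared}, {"prompt": prompt_b, **shared}]


def promote_to_final(payload: dict, resolution: str = "1080p") -> dict:
    """Copy a draft payload, changing only the resolution."""
    return {**payload, "resolution": resolution}


drafts = build_ab_payloads(
    "rain on a window, soft piano, slow zoom on a coffee cup",
    "rain on a window, distant thunder, slow zoom on a coffee cup",
    seed=1234,
)
final = promote_to_final(drafts[0])

print(drafts[0]["seed"] == drafts[1]["seed"])  # True
print(final["resolution"], final["seed"])      # 1080p 1234

# To submit a payload, pass it to the SDK call used elsewhere in this article:
#   import wavespeed
#   output = wavespeed.run(MODEL_ID, payload)
```

Keeping the seed when moving from 480p to 1080p is a reasonable starting point, but there is no guarantee here that the same seed reproduces the same composition at a different resolution.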

For animated content from existing artwork, pair LTX-2.3 with LTX-2.3 Image-to-Video to keep style consistent across a campaign.

Frequently Asked Questions

What is LTX-2.3 Text-to-Video?

LTX-2.3 is a DiT-based audio-video foundation model that generates synchronized video and audio from a text prompt in a single pass, available via REST API on WaveSpeedAI.

How much does LTX-2.3 cost?

Pricing starts at $0.10 for a 5-second 480p clip and scales to $0.80 for a 20-second 1080p clip — billed per generation with no subscription required.

Can I use LTX-2.3 via API?

Yes. LTX-2.3 is available through the WaveSpeedAI REST API with no cold starts. Submit a prompt, resolution, and duration, and receive a video URL with embedded audio.

Does LTX-2.3 generate audio automatically?

Yes — audio is produced jointly with video in the same model pass. You can let the model infer audio from visual context or explicitly describe sounds in your prompt for tighter control.

How long can LTX-2.3 videos be?

Each generation supports 5 to 20 seconds. For longer videos, generate multiple clips and edit them together in post-production.
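
For a longer piece, the clip lengths can be planned up front. The sketch below splits a target running time into the fewest clips that each stay within the 5 to 20 second range, keeping them as even as possible. plan_clips is a hypothetical helper, and it assumes any whole-second duration in that range is accepted; the pricing table only lists 5, 10, 15, and 20 seconds, so you may need to round to those values.

```python
import math

MIN_CLIP_S = 5   # shortest single generation, in seconds
MAX_CLIP_S = 20  # longest single generation, in seconds


def plan_clips(total_s: int) -> list[int]:
    """Split a total duration into the fewest evenly sized clips of 5 to 20 seconds."""
    if total_s < MIN_CLIP_S:
        raise ValueError(f"total duration must be at least {MIN_CLIP_S} seconds")
    count = math.ceil(total_s / MAX_CLIP_S)
    base, extra = divmod(total_s, count)
    # The first `extra` clips carry one additional second each.
    return [base + 1] * extra + [base] * (count - extra)


print(plan_clips(20))  # [20]
print(plan_clips(45))  # [15, 15, 15]
print(plan_clips(50))  # [17, 17, 16]
```

Each planned clip still needs its own prompt, ideally covering a single beat, before the results are stitched together in an editor.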

Start Generating Video and Audio With LTX-2.3 Today

LTX-2.3 collapses video synthesis and audio production into one cost-effective, high-quality model — perfect for marketers, creators, and developers who need fast, finished clips without juggling separate tools.

Try LTX-2.3 Text-to-Video on WaveSpeedAI →