Wan2.2 S2V-14B Released: Open Speech-to-Video Model You Can Try

By Esha

Alibaba’s Wan team has released Wan2.2-S2V-14B, a speech-to-video model that turns audio plus a reference image into a finished clip. The drop includes inference code, model weights, and public demos on Hugging Face and ModelScope. The team describes S2V-14B as an “audio-driven cinematic video generation model.”

The S2V release arrives as part of a broader Wan2.2 push. Earlier this summer the project highlighted a Mixture-of-Experts architecture, a larger training mix, and emphasis on cinematic aesthetics. Wan2.2 also shipped models that handle text-to-video (T2V), image-to-video (I2V), and a high-compression text-image-to-video (TI2V) path. Those updates, along with public demos and integration notes, are detailed on the Wan2.2 model page.

S2V-14B is available now; you can grab the weights from the Wan2.2-S2V-14B model card on Hugging Face:
https://huggingface.co/Wan-AI/Wan2.2-S2V-14B/tree/main

Sample Generated Videos

Prompt: “In the video, a woman stands on the deck of a sailing boat, singing loudly. The background is a choppy sea under a thundering sky. Heavy rain falls, the ship sways, the camera sways, and waves splash everywhere, creating a heroic atmosphere. The woman has long dark hair, part of it wet from the rain. Her expression is serious and firm, her gaze sharp, as if she is staring into the distance or lost in thought.”

Prompt: “In the video, a man in a suit sits on a sofa. He leans forward as if trying to dissuade the person opposite him, speaking with a serious, concerned expression.”

Prompt: “The video shows a woman with long hair playing the piano by the seaside. She has long silver-white hair, with a crown of flame burning above her head. She sings with deep feeling, her facial expressions rich and expressive. She sits sideways in front of the piano, playing attentively.”

What’s in S2V-14B today. The Hugging Face materials show single-GPU and multi-GPU inference commands, plus options to run pose-guided generation (pose video + audio) alongside the usual audio + image inputs. The docs note 480p and 720p support, with an 80 GB GPU recommended for the single-GPU path; multi-GPU runs use PyTorch FSDP and DeepSpeed Ulysses.
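To make the single-GPU path concrete, here is a sketch of what the quickstart looks like. The flag names and the GitHub repo path are paraphrased from the Wan project's README conventions and may differ from the current model card, so treat this as illustrative rather than authoritative:

```shell
# Clone the inference repo and install dependencies
# (repo path assumed from the Wan project's naming; check the model card)
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
pip install -r requirements.txt

# Pull the S2V-14B checkpoints from Hugging Face
huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./Wan2.2-S2V-14B

# Single-GPU speech-to-video: reference image + driving audio
# (an ~80 GB card is recommended for this path, per the docs)
python generate.py --task s2v-14B \
  --ckpt_dir ./Wan2.2-S2V-14B \
  --image path/to/reference.jpg \
  --audio path/to/speech.wav \
  --prompt "a woman singing on a boat deck in a storm"
```

The pose-guided variant mentioned in the docs would add a pose-video input alongside the audio and image; the exact flag for that is on the model card.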

State of integrations. Wan2.2 models landed in ComfyUI and Diffusers in July. For S2V specifically, the project’s checklist marks inference code and checkpoints as shipped, with ComfyUI and Diffusers integrations listed as “to do.” In other words: you can run S2V-14B now from the repo; native node/package support is on the roadmap.

Why it matters. Audio-conditioned generation has been a missing piece in many open video stacks. With S2V-14B, Wan2.2 adds a path for dialogue, singing, and performance shots that track speech timing, while keeping the rest of the family’s T2V/I2V/TI2V options for broader storytelling. The combination of MoE capacity and curated aesthetic labels aims to keep motion, framing, and look consistent across shots.

Getting started. The model card links the repo, install steps, and CLI examples; it also lists downloadable checkpoints for T2V-A14B, I2V-A14B, TI2V-5B, and S2V-14B. If you’re testing on a single card and hit OOM, the docs suggest offloading and dtype conversion flags; multi-GPU runs are available out of the box.
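The memory-saving and multi-GPU options described above can be sketched as follows. The specific flags (`--offload_model`, `--convert_model_dtype`, `--t5_cpu`, `--dit_fsdp`, `--ulysses_size`) are the ones the Wan family's READMEs have used for offloading and FSDP/Ulysses parallelism; verify them against the current S2V docs before relying on them:

```shell
# Single card hitting OOM: offload modules and convert dtype
# (flag names assumed from the Wan repo conventions)
python generate.py --task s2v-14B --ckpt_dir ./Wan2.2-S2V-14B \
  --offload_model True --convert_model_dtype --t5_cpu \
  --image path/to/reference.jpg --audio path/to/speech.wav \
  --prompt "a man speaking earnestly on a sofa"

# Multi-GPU: PyTorch FSDP sharding plus DeepSpeed Ulysses
# sequence parallelism across 8 GPUs
torchrun --nproc_per_node=8 generate.py --task s2v-14B \
  --ckpt_dir ./Wan2.2-S2V-14B \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --image path/to/reference.jpg --audio path/to/speech.wav \
  --prompt "a man speaking earnestly on a sofa"
```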

Bottom line: Wan2.2’s S2V-14B is out, runnable, and focused on audio-driven video. ComfyUI/Diffusers hooks for S2V are pending, but the rest of the Wan2.2 line is already integrated, with public demos and guides available now.
