Wan 2.2 S2V-14B: Speech-to-Video ComfyUI Workflow (GGUF Ready)

By Esha

Wan 2.2 added speech-to-video this week. I ran it in ComfyUI. One still photo and one audio track turn into a talking clip. You will see lips match words, natural blinks, and small head motion. This is part one. Part two will compare full precision to GGUF for small GPUs.

I made a quick video tutorial walking through this Wan 2.2 S2V-14B speech-to-video workflow (GGUF ready) inside ComfyUI. You can watch it for the full walkthrough.

What I used

All Wan 2.1 model and Lightning LoRA files: https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/

Workflow in ComfyUI

  • Load your image.
  • Load your MP3.
  • Open Resolution Master and press Auto. It picks a Wan-safe size from your image.
  • In the Audio group, set start and end with Audio Crop.
  • A small math node reads your FPS and your audio length and fills the frame count for you.
  • In Wan Video I2V, set Motion Frames to match output FPS: 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
  • If color flickers, keep ColorMatch off.
  • Pick one audio path only: Sound (use the MP3) or Chat voice (reference clip + typed text).
  • Render.
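The frame-count step above is plain arithmetic: frames to render = output FPS × cropped audio length, plus the Motion Frames preset per FPS. A minimal sketch of my reading of that math node (`frame_count` and `MOTION_FRAMES` are my own names, not the node's):

```python
def frame_count(audio_seconds: float, fps: int) -> int:
    """Frames needed to cover the cropped audio at the output FPS."""
    return round(audio_seconds * fps)

# Motion Frames pairing from the Wan Video I2V step above.
MOTION_FRAMES = {30: 25, 25: 20, 16: 16}

print(frame_count(5, 30), MOTION_FRAMES[30])  # 5 s crop at 30 FPS -> 150 25
```

So a 5 s crop at 30 FPS means 150 frames and Motion Frames 25.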

Demo 1: singing photo to video

  • Photo: woman by the ocean with a piano.
  • Size: Resolution Master picked 720 × 960 automatically.
  • Audio: 18 s MP3, cropped to 5 s for a quick test.
  • Prompt: “woman with long hair at the seaside, playing piano, singing with feeling, rich facial expression.”
  • Lightning: steps 4, CFG 1.
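Resolution Master's Auto button landed on 720 × 960 here. I haven't read that node's source, but "Wan-safe" sizes are multiples of 16, so a plausible sketch of the snapping looks like this (`wan_safe_size` and the `max_side` cap are my assumptions, not the node's actual logic):

```python
def wan_safe_size(width: int, height: int,
                  multiple: int = 16, max_side: int = 960) -> tuple[int, int]:
    # Scale the longer side down to max_side, keep the aspect ratio,
    # then snap both sides to the nearest multiple of 16.
    scale = min(1.0, max_side / max(width, height))
    snap = lambda v: max(multiple, round(v * scale / multiple) * multiple)
    return snap(width), snap(height)

print(wan_safe_size(1536, 2048))  # a 3:4 portrait photo -> (720, 960)
```

A 3:4 portrait source snaps cleanly to the 720 × 960 the node picked for me.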

Result
Lips hit the syllables. Blinks look natural. Shoulders breathe a bit. If your last second looks soft, try a different scheduler or drop one size and run again.

Longer take
I changed the crop to 12 s and rendered again. On my side, a lower distilled rank looked sharper over time (rank 64 beat 128 on this clip). FlowMatch also helped in a repeat pass.

Demo 2: dialogue photo to video

  • Photo: man in a suit on a sofa.
  • Target size: 832 × 480 for speed.
  • Audio: 11 s, crop 0 → 11.
  • FPS: keep 30 FPS with get_fps ON. The math node filled 360 frames.
  • Prompt: “a man in a suit sits on a sofa, leans forward, speaks seriously to someone off-camera.”
  • Lightning: 4 / 1.
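Audio Crop takes seconds, but under the hood a crop is just slicing the sample array. A small sketch of that index math (names are illustrative):

```python
def crop_indices(num_samples: int, sample_rate: int,
                 start_s: float, end_s: float) -> tuple[int, int]:
    """Sample indices for a start -> end crop, clamped to the clip."""
    start = max(0, int(start_s * sample_rate))
    end = min(num_samples, int(end_s * sample_rate))
    return start, end

# Demo 2's 0 -> 11 s crop, assuming a 44.1 kHz sample rate:
print(crop_indices(11 * 44100, 44100, 0, 11))  # -> (0, 485100)
```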

Result
Vowels and consonants land on time. When the voice pauses, the head and mouth pause. The lean-in reads as a real reaction. If an end frame looks mushy, switch the scheduler or go one size down.

Quick quality check without Lightning

I bypassed the LoRA, set steps = 20 and CFG = 6, and rendered the sofa clip again. Motion felt a touch slower. Some details looked a bit sharper frame to frame. It took longer. I pick based on the card and the deadline: Lightning for fast turns; no LoRA with more steps if I have time and want extra crispness.
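The two setups I'm comparing boil down to a pair of sampler presets. As plain dicts (illustrative data, not a ComfyUI API):

```python
# The two presets compared above (plain data, not a ComfyUI API).
LIGHTNING = {"lora": "lightning", "steps": 4,  "cfg": 1.0}  # fast turnaround
FULL      = {"lora": None,        "steps": 20, "cfg": 6.0}  # slower, a bit crisper

def pick_preset(deadline_tight: bool) -> dict:
    # Lightning when the clock matters; full steps when it doesn't.
    return LIGHTNING if deadline_tight else FULL

print(pick_preset(True)["steps"])  # -> 4
```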

Low-VRAM note (GGUF Q4)

I tried S2V-14B Q4 (GGUF). LoRA off, steps 20, CFG 6.6.
Side by side with BF16 on my box: Q4 held up for a 5 s clip while BF16 stayed clean out to 11 s, at similar quality.
When I pushed Q4 to 11 s, it still looked close.

Longer than that, video softened. I’ll test more in part two.
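Why Q4 fits where BF16 struggles comes down to weight size: 14B parameters at 16 bits take roughly four times the memory of 4-bit weights. A back-of-envelope estimate (ignores activations, the VAE/text encoder, and quantization block overhead):

```python
def weights_gib(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB; runtime overhead not included."""
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

print(round(weights_gib(14, 16), 1))  # BF16 14B -> ~26.1 GiB
print(round(weights_gib(14, 4), 1))   # Q4 14B   -> ~6.5 GiB
```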

Studied Computer Science. Passionate about AI, ComfyUI workflows, and hands-on learning through trial and error. Creator of AIStudyNow — sharing tested workflows, tutorials, and real-world experiments.