Wan 2.2 S2V-14B: Speech-to-Video ComfyUI Workflow (GGUF Ready)

Wan 2.2 added speech-to-video this week. I ran it in ComfyUI. One still photo and one audio track turn into a talking clip. You will see lips match words, natural blinks, and small head motion. This is part one. Part two will compare full precision to GGUF for small GPUs.

Contents

What I used
Workflow in ComfyUI
- Demo 1: singing photo to video
- Demo 2: dialogue photo to video
Quick quality check without Lightning
- Low-VRAM note (GGUF Q4)

I also made a quick video tutorial that walks through this Wan 2.2 S2V-14B speech-to-video workflow inside ComfyUI.

What I used

Model: wan2.2_s2v_14B_bf16.safetensors (BF16) or Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors (FP8). Put it in ComfyUI/models/diffusion_models/.
GGUF: you can also grab the GGUF variants if you are on a low-VRAM card.
Audio encoder: wav2vec2_large_english_fp16.safetensors (FP16). Put it in ComfyUI/models/audio_encoders/.
Encoders: the same text encoder and VAE I used for Wan 2.1.
Speed preset: Lightning LoRA, steps = 4, CFG = 1.

All Wan 2.1 models and Lightning LoRA files: https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/
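
Before opening the workflow, it can help to confirm the files sit in the folders listed above. This is a minimal sketch, not part of the workflow itself: the function name check_models and the "ComfyUI" install path are assumptions for illustration, and the text encoder and VAE are skipped because the post does not give their file names.

```python
from pathlib import Path

# File names and folders are copied from the list above.
# Only one of the two diffusion model files is needed.
DIFFUSION_MODELS = [
    "wan2.2_s2v_14B_bf16.safetensors",
    "Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors",
]
AUDIO_ENCODER = "wav2vec2_large_english_fp16.safetensors"


def check_models(comfy_root: str) -> dict:
    """Report which S2V diffusion models exist and whether the audio encoder is present."""
    models_dir = Path(comfy_root) / "models"
    found = [
        name
        for name in DIFFUSION_MODELS
        if (models_dir / "diffusion_models" / name).is_file()
    ]
    has_encoder = (models_dir / "audio_encoders" / AUDIO_ENCODER).is_file()
    return {"diffusion_models": found, "audio_encoder": has_encoder}


if __name__ == "__main__":
    # "ComfyUI" is an assumed install path; point it at your own folder.
    report = check_models("ComfyUI")
    if not report["diffusion_models"]:
        print("No S2V diffusion model found in models/diffusion_models/")
    if not report["audio_encoder"]:
        print("Audio encoder missing from models/audio_encoders/")
```

If both messages stay silent, the two folders from this section are in place and the remaining loaders only need the text encoder and VAE you already use for Wan 2.1.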

Workflow in ComfyUI

Load your image.
Load your MP3.
Open Resolution Master and press Auto. It picks a Wan-safe size from your image.
In the Audio group, set start and end with Audio Crop.
A small math node reads your FPS and your audio length and fills the frame count for you.
In Wan Video I2V, set Motion Frames to match output FPS: 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
If the color flickers, leave ColorMatch off.
Pick one audio path only: Sound (use the MP3) or Chat voice (reference clip + typed text).
Render.
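
The two numeric steps above (the frame count from the math node and the Motion Frames setting) can be sketched in Python. This is only an illustration under stated assumptions: I assume the frame count is FPS times the cropped audio length, rounded, and the workflow's own math node may round or pad differently. The function names are hypothetical; the Motion Frames values are the three pairs listed in the steps.

```python
# Motion Frames values from the step list, keyed by output FPS.
MOTION_FRAMES_BY_FPS = {30: 25, 25: 20, 16: 16}


def frame_count(fps: float, start_s: float, end_s: float) -> int:
    """Frames needed to cover the cropped audio (assumed: fps * duration, rounded)."""
    if end_s <= start_s:
        raise ValueError("crop end must be after crop start")
    return round(fps * (end_s - start_s))


def motion_frames(fps: int) -> int:
    """Look up the Motion Frames setting for an output FPS listed in the post."""
    if fps not in MOTION_FRAMES_BY_FPS:
        raise ValueError(f"no Motion Frames value listed for {fps} FPS")
    return MOTION_FRAMES_BY_FPS[fps]


# A 5 s crop at 25 FPS:
print(frame_count(25, 0, 5))  # 125
print(motion_frames(25))      # 20
```

Any FPS outside the three listed values raises an error on purpose, since the post gives no rule for interpolating Motion Frames.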

Demo 1: singing photo to video

Photo: woman by the ocean with a piano.
Size: Resolution Master picked 720 × 960 automatically.
Audio: 18 s MP3, cropped to 5 s for a quick test.
Prompt: “woman with long hair at the seaside, playing piano, singing with feeling, rich facial expression.”
Lightning: steps 4, CFG 1.

Result
Lips hit the syllables. Blinks look natural. Shoulders breathe a bit. If your last second looks soft, try a different scheduler or drop one size and run again.

Longer take
I changed the crop to 12 s and rendered again. On my side, a lower distilled rank looked sharper over time (rank 64 beat 128 on this clip). FlowMatch also helped in a repeat pass.

Demo 2: dialogue photo to video

Photo: man in a suit on a sofa.
Target size: 832 × 480 for speed.
Audio: 11 s, crop 0 → 11.
FPS: keep 30 FPS with get_fps ON. The math node filled 360 frames.
Prompt: “a man in a suit sits on a sofa, leans forward, speaks seriously to someone off-camera.”
Lightning: steps 4, CFG 1.

Result
Vowels and consonants land on time. When the voice pauses, the head and mouth pause. The lean-in reads as a real reaction. If an end frame looks mushy, switch the scheduler or go one size down.

Quick quality check without Lightning

I bypassed the LoRA, set steps = 20 and CFG = 6, and rendered the sofa clip again. Motion felt a touch slower. Some details looked a bit sharper frame to frame. It took longer. I pick based on the card and the deadline: Lightning for fast turns; no LoRA with more steps if I have time and want extra crispness.
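
The two setups compared here can be written down as presets. This is a small sketch for keeping the numbers straight, not ComfyUI code: the class and function names are hypothetical, and only the steps and CFG values come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerPreset:
    use_lightning_lora: bool
    steps: int
    cfg: float


# Values taken from the post: Lightning LoRA at 4 steps / CFG 1,
# and the no-LoRA quality check at 20 steps / CFG 6.
LIGHTNING = SamplerPreset(use_lightning_lora=True, steps=4, cfg=1.0)
NO_LORA = SamplerPreset(use_lightning_lora=False, steps=20, cfg=6.0)


def pick_preset(need_fast_turnaround: bool) -> SamplerPreset:
    """Lightning for fast turns, no LoRA with more steps when there is time."""
    return LIGHTNING if need_fast_turnaround else NO_LORA


def step_ratio(slow: SamplerPreset, fast: SamplerPreset) -> float:
    """How many times more sampling steps the slow preset runs."""
    return slow.steps / fast.steps


print(pick_preset(True))
print(step_ratio(NO_LORA, LIGHTNING))  # 5.0
```

The step ratio is only a rough guide to render time. Actual timings depend on the card and on everything else in the graph.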

Low-VRAM note (GGUF Q4)

I tried S2V-14B Q4 (GGUF). LoRA off, steps 20, CFG 6.6.
Side by side with BF16 on my box: I rendered a 5 s clip with Q4 and an 11 s clip with BF16, and the quality was similar.
When I pushed Q4 to 11 s, it still looked close.