Wan 2.2 S2V-14B: Speech-to-Video ComfyUI Workflow (GGUF Ready)

Wan 2.2 added speech-to-video this week. I ran it in ComfyUI. One still photo and one audio track turn into a talking clip. You will see lips match words, natural blinks, and small head motion. This is part one. Part two will compare full precision to GGUF for small GPUs.

I also made a quick video tutorial that walks through this Wan 2.2 S2V-14B speech-to-video workflow inside ComfyUI.

What I used

Model: wan2.2_s2v_14B_bf16.safetensors, or the FP8 build Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors. Put either one in ComfyUI/models/diffusion_models/.
GGUF: quantized variants are also available for low-VRAM cards.
Audio encoder: wav2vec2_large_english_fp16.safetensors (FP16). Put it in ComfyUI/models/audio_encoders/.
Encoders: the same text encoder and VAE I used for Wan 2.1.
Speed preset: Lightning LoRA, steps = 4, CFG = 1.

All Wan 2.1 model and Lightning LoRA files: https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/
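
Before opening the workflow, it can save a failed load to confirm the files listed above sit where ComfyUI expects them. The check below is a minimal sketch, not part of the workflow: the helper name missing_files and the "ComfyUI" root path are assumptions, and the text encoder and VAE are left out because their file names are not given here. Swap in the FP8 file name if that is the build you downloaded.

```python
from pathlib import Path

# File names and folders come from the list above.
# The ComfyUI root passed in below is an assumption; point it at your install.
EXPECTED = {
    "diffusion_models": ["wan2.2_s2v_14B_bf16.safetensors"],
    "audio_encoders": ["wav2vec2_large_english_fp16.safetensors"],
}

def missing_files(comfy_root, expected=EXPECTED):
    """Return the expected model paths that do not exist under comfy_root."""
    models = Path(comfy_root) / "models"
    return [
        models / folder / name
        for folder, names in expected.items()
        for name in names
        if not (models / folder / name).is_file()
    ]

if __name__ == "__main__":
    for path in missing_files("ComfyUI"):
        print("missing:", path)
```

An empty result means both files were found.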

Workflow in ComfyUI

Load your image.
Load your MP3.
Open Resolution Master and press Auto. It picks a Wan-safe size from your image.
In the Audio group, set start and end with Audio Crop.
A small math node reads your FPS and your audio length and fills the frame count for you.
In Wan Video I2V, set Motion Frames to match output FPS: 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
If color flickers, keep ColorMatch off.
Pick one audio path only: Sound (use the MP3) or Chat voice (reference clip + typed text).
Render.
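
The two numeric steps above (the frame count and Motion Frames) can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the frame count is FPS times audio length rounded up; the actual math node in the workflow may round or pad differently, so its number can differ from this one. The function names are hypothetical.

```python
import math

# Motion Frames values per output FPS, copied from the step above.
MOTION_FRAMES = {30: 25, 25: 20, 16: 16}

def frame_count(fps, audio_seconds):
    """Frames needed to cover the audio: fps * duration, rounded up."""
    # round() first so float noise such as 3.0000000000000004 does not add a frame
    return math.ceil(round(fps * audio_seconds, 6))

def motion_frames(fps):
    """Look up the Motion Frames value for one of the listed FPS settings."""
    if fps not in MOTION_FRAMES:
        raise ValueError(f"no Motion Frames value listed for {fps} FPS")
    return MOTION_FRAMES[fps]

print(frame_count(16, 5))  # 80
print(motion_frames(30))   # 25
```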

Demo 1: singing photo to video

Photo: woman by the ocean with a piano.
Size: Resolution Master picked 720 × 960 automatically.
Audio: 18 s MP3, cropped to 5 s for a quick test.
Prompt: “woman with long hair at the seaside, playing piano, singing with feeling, rich facial expression.”
Lightning: steps 4, CFG 1.
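
I have not looked inside Resolution Master, so the following is only a guess at what a "Wan-safe" size means. Both sizes in this post (720 × 960 and 832 × 480) are multiples of 16 on each side, so the sketch assumes that rule plus a pixel budget. The function name snap_size and its defaults are hypothetical.

```python
def snap_size(width, height, multiple=16, max_pixels=720 * 960):
    """Scale (width, height) down to fit max_pixels, then round each side
    to the nearest multiple. Assumes 'Wan-safe' means multiples of 16."""
    scale = min(1.0, (max_pixels / (width * height)) ** 0.5)

    def snap(value):
        # never return less than one full multiple
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)

print(snap_size(1080, 1440))  # (720, 960)
```

Because each side is rounded to the nearest multiple, the result can land slightly above the pixel budget for some aspect ratios.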

Result
Lips hit the syllables. Blinks look natural. Shoulders breathe a bit. If your last second looks soft, try a different scheduler or drop one size and run again.

Longer take
I changed the crop to 12 s and rendered again. On my side, a lower distilled rank looked sharper over time (rank 64 beat 128 on this clip). FlowMatch also helped in a repeat pass.

Demo 2: dialogue photo to video

Photo: man in a suit on a sofa.
Target size: 832 × 480 for speed.
Audio: 11 s, crop 0 → 11.
FPS: keep 30 FPS with get_fps ON. The math node filled 360 frames.
Prompt: “a man in a suit sits on a sofa, leans forward, speaks seriously to someone off-camera.”
Lightning: steps 4, CFG 1.

Result
Vowels and consonants land on time. When the voice pauses, the head and mouth pause. The lean-in reads as a real reaction. If an end frame looks mushy, switch the scheduler or go one size down.

Quick quality check without Lightning

I bypassed the LoRA, set steps = 20 and CFG = 6, and rendered the sofa clip again. Motion felt a touch slower. Some details looked a bit sharper frame to frame. It took longer. I pick based on the card and the deadline: Lightning for fast turns; no LoRA with more steps if I have time and want extra crispness.
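
A rough way to see why the no-LoRA pass takes longer is to count model evaluations. The sketch assumes the sampler runs one pass per step at CFG 1 and two passes (conditional plus unconditional) at any other CFG, which is how classifier-free guidance is normally evaluated. It ignores the VAE, the audio encoder, and any LoRA overhead, so treat the ratio as a ballpark, not a timing.

```python
def model_calls(steps, cfg):
    """Model evaluations per render: one per step at CFG 1,
    two (conditional + unconditional) per step otherwise."""
    passes_per_step = 1 if cfg == 1 else 2
    return steps * passes_per_step

lightning = model_calls(steps=4, cfg=1)  # 4
full = model_calls(steps=20, cfg=6)      # 40
print(full / lightning)                  # 10.0
```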

Low-VRAM note (GGUF Q4)

I tried S2V-14B Q4 (GGUF). LoRA off, steps 20, CFG 6.6.
Side by side with BF16 on my box: I rendered a 5 s clip with Q4 and an 11 s clip with BF16, and quality looked similar.
When I pushed Q4 to 11 s, it still looked close.
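
For a sense of why Q4 matters on small cards, here is a back-of-the-envelope estimate of weight size per format. It assumes a nominal 14 billion parameters (taken from the "14B" in the model name) and a plain Q4_0 layout at 4.5 bits per weight. Real GGUF files use mixed layouts and keep some layers at higher precision, so actual file sizes will differ.

```python
PARAMS = 14e9  # nominal count from the "14B" in the model name (assumption)

BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "fp8": 8.0,
    "q4_0": 4.5,  # 32 four-bit weights plus one 16-bit scale per block
}

def weight_gib(fmt, params=PARAMS):
    """Approximate size of the weights alone, in GiB."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 2**30

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_gib(fmt):.1f} GiB")
```

This covers weights only. Activations, the text encoder, the VAE, and the audio encoder all need memory on top.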