Wan 2.2 added speech-to-video this week. I ran it in ComfyUI. One still photo and one audio track turn into a talking clip. You will see lips match words, natural blinks, and small head motion. This is part one. Part two will compare full precision to GGUF for small GPUs.
I also made a quick video tutorial that walks through the InfiniteTalk Wan 2.2 S2V-14B speech-to-video workflow (GGUF ready) inside ComfyUI.
What I used
- Model: wan2.2_s2v_14B_bf16.safetensors, or the FP8 build Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors. Put it in ComfyUI/models/diffusion_models/. GGUF variants are also available.
- Audio encoder: wav2vec2_large_english_fp16.safetensors (FP16). Put it in ComfyUI/models/audio_encoders/.
- Encoders: the same text encoder and VAE I used for Wan 2.1.
- Speed preset: Lightning LoRA, steps = 4, CFG = 1.
All Wan 2.1 model and Lightning LoRA files: https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/
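Before opening the workflow, it can help to confirm the files landed in the right folders. Here is a minimal sketch of that check. The `COMFY_ROOT` location and the `missing_files` helper are my own assumptions for illustration, not part of ComfyUI; the filenames and folders come from the list above.

```python
from pathlib import Path

# Assumption: ComfyUI lives in ./ComfyUI; change this to your install path.
COMFY_ROOT = Path("ComfyUI")

# Folder -> filenames, taken from the "What I used" list.
EXPECTED = {
    "models/diffusion_models": ["wan2.2_s2v_14B_bf16.safetensors"],
    "models/audio_encoders": ["wav2vec2_large_english_fp16.safetensors"],
}


def missing_files(root: Path, expected: dict[str, list[str]]) -> list[Path]:
    """Return the expected model paths that do not exist under root."""
    missing = []
    for folder, names in expected.items():
        for name in names:
            path = root / folder / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_files(COMFY_ROOT, EXPECTED):
        print(f"missing: {path}")
```

If you use the FP8 or a GGUF build instead of BF16, swap the filename in `EXPECTED` for the one you downloaded.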
Workflow in ComfyUI
- Load your image.
- Load your MP3.
- Open Resolution Master and press Auto. It picks a Wan-safe size from your image.
- In the Audio group, set start and end with Audio Crop.
- A small math node reads your FPS and your audio length and fills the frame count for you.
- In Wan Video I2V, set Motion Frames to match output FPS: 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
- If color flickers, keep ColorMatch off.
- Pick one audio path only: Sound (use the MP3) or Chat voice (reference clip + typed text).
- Render.
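If you want to sanity-check the numbers the workflow fills in, the frame math and the Motion Frames lookup can be sketched in a few lines of Python. This is only an illustration: I am assuming the math node does a plain FPS × duration with rounding, and the real node may round or pad differently. The function names are hypothetical.

```python
# Motion Frames values quoted in the steps above, keyed by output FPS.
MOTION_FRAMES = {30: 25, 25: 20, 16: 16}


def frame_count(fps: float, start_s: float, end_s: float) -> int:
    """Frames needed to cover the cropped audio at the given FPS.

    Assumption: plain fps * duration, rounded to the nearest frame.
    The real math node may round or pad differently.
    """
    if end_s <= start_s:
        raise ValueError("end must be after start")
    return round(fps * (end_s - start_s))


def motion_frames(fps: int) -> int:
    """Look up the Motion Frames setting for a listed output FPS."""
    try:
        return MOTION_FRAMES[fps]
    except KeyError:
        raise ValueError(f"no Motion Frames value listed for {fps} FPS") from None


if __name__ == "__main__":
    print(frame_count(16, 0, 5))  # 5 s crop at 16 FPS
    print(motion_frames(30))
```

The lookup raises on any FPS that is not in the list rather than guessing a value, since the post only gives those three pairs.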
Demo 1: singing photo to video

- Photo: woman by the ocean with a piano.
- Size: Resolution Master picked 720 × 960 automatically.
- Audio: 18 s MP3, cropped to 5 s for a quick test.
- Prompt: “woman with long hair at the seaside, playing piano, singing with feeling, rich facial expression.”
- Lightning: steps 4, CFG 1.
Result
Lips hit the syllables. Blinks look natural. Shoulders breathe a bit. If your last second looks soft, try a different scheduler or drop one size and run again.
Longer take
I changed the crop to 12 s and rendered again. On my side, a lower distilled rank looked sharper over time (rank 64 beat 128 on this clip). FlowMatch also helped in a repeat pass.
Demo 2: dialogue photo to video

- Photo: man in a suit on a sofa.
- Target size: 832 × 480 for speed.
- Audio: 11 s, crop 0 → 11.
- FPS: keep 30 FPS with get_fps ON. The math node filled 360 frames.
- Prompt: “a man in a suit sits on a sofa, leans forward, speaks seriously to someone off-camera.”
- Lightning: steps 4, CFG 1.
Result
Vowels and consonants land on time. When the voice pauses, the head and mouth pause. The lean-in reads as a real reaction. If an end frame looks mushy, switch the scheduler or go one size down.
Quick quality check without Lightning
I bypassed the LoRA, set steps = 20 and CFG = 6, and rendered the sofa clip again. Motion felt a touch slower. Some details looked a bit sharper frame to frame. It took longer. I pick based on the card and the deadline: Lightning for fast turns; no LoRA with more steps if I have time and want extra crispness.
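The two setups I switch between can be written down as a pair of presets. This is a sketch for keeping the numbers straight, not a ComfyUI API: `SamplerPreset` and `pick_preset` are hypothetical names, and the values are the ones from the runs in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerPreset:
    name: str
    use_lightning_lora: bool
    steps: int
    cfg: float


# Values from the two runs described in this post.
LIGHTNING = SamplerPreset("lightning", use_lightning_lora=True, steps=4, cfg=1.0)
FULL = SamplerPreset("full", use_lightning_lora=False, steps=20, cfg=6.0)


def pick_preset(in_a_hurry: bool) -> SamplerPreset:
    """Lightning for fast turns, the full-step run when there is time."""
    return LIGHTNING if in_a_hurry else FULL


if __name__ == "__main__":
    print(pick_preset(in_a_hurry=True))
    print(pick_preset(in_a_hurry=False))
```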
Low-VRAM note (GGUF Q4)
I tried S2V-14B Q4 (GGUF). LoRA off, steps 20, CFG 6.6.
Side by side with BF16 on my box, I rendered 5 s with Q4 and 11 s with BF16, and quality was similar.
When I pushed Q4 to 11 s, it still looked close.
Past that length, the video softened. I’ll test more in part two.
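To see why Q4 matters on small cards, here is a rough back-of-the-envelope for the weight size of each format. It rests on two assumptions: that the model has 14e9 parameters (taking the "14B" in the name at face value), and that Q4 stores about 4.5 bits per weight once block scales are counted (GGUF Q4 variants differ a little). It covers weights only, not activations, the text encoder, or the VAE.

```python
# Assumption: 14e9 parameters, read straight from the "14B" in the model name.
PARAMS = 14e9

# Approximate bits stored per weight. The Q4 figure is an assumption:
# GGUF Q4 variants land around 4.5 to 5 bits once block scales are counted.
BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "q4": 4.5}


def weight_size_gb(fmt: str, params: float = PARAMS) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: ~{weight_size_gb(fmt):.1f} GB")
```

Under those assumptions, BF16 comes out around 28 GB of weights, FP8 around 14 GB, and Q4 around 8 GB.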