Wan 2.2 added speech-to-video this week. I ran it in ComfyUI. One still photo and one audio track turn into a talking clip. You will see lips match words, natural blinks, and small head motion. This is part one. Part two will compare full precision to GGUF for small GPUs.
I made a quick video tutorial showing the InfiniteTalk Wan 2.2 S2V-14B: Speech-to-Video ComfyUI Workflow (GGUF Ready) inside ComfyUI. You can watch it below.
What I used
All WAN 2.1 model and Lightning LoRA files: https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/
Workflow in ComfyUI
Demo 1: singing photo to video

Result
Lips hit the syllables. Blinks look natural. Shoulders breathe a bit. If your last second looks soft, try a different scheduler or drop the output resolution one size and run again.
Longer take
I extended the audio crop to 12 s and rendered again. On my side, a lower distilled LoRA rank looked sharper over the longer run (rank 64 beat rank 128 on this clip). Switching to the FlowMatch scheduler also helped in a repeat pass.
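If you trim the audio outside ComfyUI instead of with the workflow's crop node, a minimal sketch with torchaudio looks like this (file names are placeholders; it assumes torchaudio is installed):

```python
import torchaudio

# Load the source track; waveform has shape [channels, samples].
waveform, sample_rate = torchaudio.load("vocals.wav")  # placeholder path

# Keep only the first 12 seconds to match the longer take.
clip_seconds = 12
waveform = waveform[:, : clip_seconds * sample_rate]

torchaudio.save("vocals_12s.wav", waveform, sample_rate)
```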
Demo 2: dialogue photo to video

Result
Vowels and consonants land on time. When the voice pauses, the head and mouth pause. The lean-in reads as a real reaction. If an end frame looks mushy, switch the scheduler or drop the resolution one size.
Quick quality check without Lightning
I bypassed the LoRA, set steps = 20 and CFG = 6, and rendered the sofa clip again. Motion felt a touch slower. Some details looked a bit sharper frame to frame. The render took longer. I pick based on the card and the deadline: Lightning for fast turnarounds; no LoRA with more steps when I have time and want extra crispness.
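For repeat passes like this, it can be quicker to script the sampler settings than to click through the graph. Below is a rough sketch, not the workflow's own tooling: it assumes you exported the graph with "Save (API Format)", that the file name is a placeholder, and that node "3" happens to be your sampler with the standard steps/cfg inputs; check the IDs in your own export. The LoRA bypass itself I still do in the graph by bypassing the loader node.

```python
import json
import urllib.request

# Load a workflow exported via "Save (API Format)" in ComfyUI.
with open("s2v_workflow_api.json") as f:  # placeholder file name
    workflow = json.load(f)

# Hypothetical node ID for the sampler in this export; yours will differ.
SAMPLER_NODE = "3"
workflow[SAMPLER_NODE]["inputs"]["steps"] = 20
workflow[SAMPLER_NODE]["inputs"]["cfg"] = 6.0

# Queue the prompt on a local ComfyUI instance (default port 8188).
payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```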
Low-VRAM note (GGUF Q4)
I tried S2V-14B Q4 (GGUF). LoRA off, steps 20, CFG 6.6.
Side by side with BF16 on my box: Q4 gave 5 s while BF16 gave 11 s at similar quality. When I pushed Q4 to 11 s, it still looked close; longer than that, the video softened. I'll test more in part two.
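For context on why Q4 is the low-VRAM pick here, a rough back-of-envelope on weight storage alone (the ~4.5 bits/weight figure is an approximation for Q4_K-style quants, and this ignores the text encoder, VAE, and activations):

```python
# Rough weight-storage estimate for a 14B-parameter model.
params = 14e9

bf16_gb = params * 16 / 8 / 1e9   # 16 bits per weight   -> ~28 GB
q4_gb   = params * 4.5 / 8 / 1e9  # ~4.5 bits per weight -> ~8 GB

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Q4 weights:   ~{q4_gb:.0f} GB")
```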

