InfiniteTalk ComfyUI Workflow (WAN 2.1): Img2Vid, Vid2Vid & Multi-Talk

InfiniteTalk is a talking-video system. You feed it an image or an existing video plus an audio track, and it produces a lip-synced clip. A ComfyUI workflow with ready-made nodes lets you run it inside ComfyUI. It’s built around the Wan 2.1 i2v pipeline and uses an audio encoder (Wav2Vec2) to drive mouth and face motion.

What I’m doing

I run InfiniteTalk inside ComfyUI to get three results:

Turn a still photo into a talking video
Put new audio on an existing video
Get two people talking in the same scene

Files you need

Wan2_1-InfiniTetalk-Single_fp16.safetensors (one speaker)
Wan2_1-InfiniteTalk-Multi_fp16.safetensors (two or more)

Put the file you use in: ComfyUI/models/diffusion_models/.
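
If ComfyUI does not list the weight in the node, a quick check that the file really sits in that folder can save some time. This is only a minimal sketch: the function name is mine, the filenames are the two listed above, and "ComfyUI" as the install path is an assumption you should change to your own location.

```python
from pathlib import Path

# Filenames as listed above; you only need the one you actually use.
INFINITETALK_FILES = [
    "Wan2_1-InfiniTetalk-Single_fp16.safetensors",
    "Wan2_1-InfiniteTalk-Multi_fp16.safetensors",
]


def find_infinitetalk_weights(comfy_root):
    """Return {filename: True/False} for each file under models/diffusion_models."""
    model_dir = Path(comfy_root) / "models" / "diffusion_models"
    return {name: (model_dir / name).is_file() for name in INFINITETALK_FILES}


if __name__ == "__main__":
    # "ComfyUI" is an assumed install path; point it at your own folder.
    for name, present in find_infinitetalk_weights("ComfyUI").items():
        print(("found   " if present else "missing ") + name)
```

A "missing" line for the file you do not use is fine; only the weight you select in the node has to be there.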

Base model and fast preset

Base: WAN 2.1 I2V 480p (also works with WAN 2.1 Fusion X and WAN 2.1 720p)
Encoders: Use the same WAN 2.1 text encoder and the same WAN 2.1 VAE you already use
Speed: Lightning LoRA, steps = 4, CFG = 1
Samplers that stayed stable on my card: DPM++ SDE, LCM, FlowMatch. I stayed on DPM++ SDE.

All WAN 2.1 model files: (https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/)

I made a quick video tutorial that shows the InfiniteTalk workflow running inside ComfyUI.

Workflow

Load an image or load a video.
Load the MP3.
Open Resolution Master and press Auto so it takes the size from the image. If you want a standard size, pick a preset. You don’t have to type width and height by hand.
In the audio group, set start and end with Audio Crop.
The small math node reads your FPS and the audio length and fills the frame count by itself.
Pick the right InfiniteTalk weight in its node (single or multi).
Render.
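
The frame count that the math node fills in is just FPS times the cropped audio length. Here is a minimal sketch of that arithmetic. The function name is hypothetical, and I am assuming plain rounding; the real node may round differently or add a frame.

```python
def frame_count(fps, start_s, end_s):
    """Frames needed to cover the cropped audio at the given frame rate."""
    if end_s <= start_s:
        raise ValueError("crop end must be after crop start")
    # Assumption: round to the nearest whole frame.
    return round(fps * (end_s - start_s))


# 12 s of audio at 30 FPS, as in the video-to-video example.
print(frame_count(30, 0, 12))  # 360
```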

Example 1 — image to talking video (single)

Photo size: 1792 × 2368
Resolution Master on Auto gave 720 × 960 (fine for WAN 2.1)
Audio: 42 s → Audio Crop from 0 to 42
InfiniteTalk file: infinite_talk_single.safetensors
Sampler: DPM++ SDE
Lightning: steps 4, CFG 1

What I saw: lips match words from start to end. Small blinks. Small head moves. On my GPU it took a bit over 20 min and used about 13–16 GB VRAM.
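
I don’t know exactly how Resolution Master picks the Auto size, so treat this as a sketch of one plausible approach, not the node’s real logic: choose the preset whose aspect ratio is closest to the source image. The preset list is an assumption made from the two sizes used in this post plus their rotated versions, and the function name is hypothetical.

```python
import math

# Assumed preset list: the two sizes used in this post and their rotations.
PRESETS = [(720, 960), (960, 720), (832, 480), (480, 832)]


def nearest_preset(width, height, presets=PRESETS):
    """Pick the preset whose aspect ratio is closest to width / height."""
    source = math.log(width / height)
    # Compare in log space so "too wide" and "too tall" are treated evenly.
    return min(presets, key=lambda p: abs(math.log(p[0] / p[1]) - source))


print(nearest_preset(1792, 2368))  # (720, 960)
print(nearest_preset(1920, 1080))  # (832, 480)
```

With this assumed list, the 1792 × 2368 photo lands on 720 × 960, the same size Auto gave me here.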

Example 2 — video to video (new audio, one speaker)

Source video: 1920 × 1080, 30 FPS, 998 frames
Target size: preset 832 × 480
New audio: 27 s, but I only need 12 s → Audio Crop 0 to 12
FPS: keep 30 FPS with get_fps ON. The math node fills in 360 frames (30 FPS × 12 s).
InfiniteTalk file: single
Lightning: 4 / 1
Sampler: DPM++ SDE
Prompt: “looking at the phone, natural review expression”

What I saw: when he says “pixel user,” the mouth shape lands on time. Pauses also look right. It reads like native speech, not a dub.
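
Before rendering a video-to-video job, it can help to confirm that the crop fits inside both the source video and the audio. The sketch below uses the numbers from this example. The function name is hypothetical, and it assumes the output should not run past the end of either input, which may not match how the nodes actually handle a short source.

```python
def crop_fits(source_frames, fps, audio_len_s, crop_start_s, crop_end_s):
    """Check a crop against the source video length and the audio length."""
    needed_frames = round(fps * (crop_end_s - crop_start_s))
    return {
        "source_seconds": source_frames / fps,
        "needed_frames": needed_frames,
        # Assumption: the crop must fit inside both the audio and the video.
        "fits": crop_end_s <= audio_len_s and needed_frames <= source_frames,
    }


# 998 source frames at 30 FPS, 27 s of audio, crop 0 to 12 s.
info = crop_fits(998, 30, 27, 0, 12)
print(info["needed_frames"], info["fits"])  # 360 True
```

Here the source covers roughly 33 s, so a 12 s crop leaves plenty of room on both sides.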

Example 3 — two people talking (multi)

InfiniteTalk file: infinite_talk_multi.safetensors
Base: WAN 2.1 I2V 480p (same encoder, same VAE)
One photo: a man and a woman in a car. I press Auto in Resolution Master so the size is set.

Two audio tracks:

Man: 0 to 9 s
Woman: 0 to 12 s

Each voice gets its own Load Audio and Audio Crop. The math node sets the frame counts from your FPS and each clip’s length.
Lightning: 4 / 1
Sampler: DPM++ SDE

What I saw: about 21 s total (9 s + 12 s). When the man talks, the woman looks at him. When she talks, he turns. Lip-sync stays steady.
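
The 21 s total suggests the two crops play back to back. Here is a small sketch of that timeline, assuming the voices run one after the other. The function name is hypothetical, and the 25 FPS value is an assumption because I did not note the FPS for this example.

```python
def sequential_timeline(tracks, fps):
    """tracks: list of (speaker, crop_start_s, crop_end_s), laid end to end."""
    timeline = []
    cursor = 0.0  # where the next speaker starts in the output, in seconds
    for speaker, start_s, end_s in tracks:
        length = end_s - start_s
        timeline.append({
            "speaker": speaker,
            "from_s": cursor,
            "to_s": cursor + length,
            "frames": round(fps * length),
        })
        cursor += length
    return timeline, cursor


# 25 FPS is an assumed value, not taken from the workflow.
timeline, total_s = sequential_timeline([("man", 0, 9), ("woman", 0, 12)], fps=25)
for segment in timeline:
    print(segment)
print(total_s)  # 21.0
```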

Small tips that helped

In WAN Video Long I2V, set Motion Frames based on your output FPS: use 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
If color shifts between frames, keep ColorMatch OFF.
Two or more speakers: add one more Load Audio + Audio Crop pair per voice with clear start and end times.
Keep Lightning LoRA at steps 4, CFG 1 for fast tests.
If your VRAM is small, start with 480p (WAN 2.1 I2V 480p) and upscale later.
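
The Motion Frames values from the first tip fit in a small lookup. This is only a sketch that holds the three pairs I tested; the names are hypothetical, and I have no values for other frame rates, so anything else raises an error instead of guessing.

```python
# Motion Frames values that worked for me, keyed by output FPS.
MOTION_FRAMES_BY_FPS = {30: 25, 25: 20, 16: 16}


def motion_frames_for(fps):
    """Return the Motion Frames value for a tested output FPS."""
    if fps not in MOTION_FRAMES_BY_FPS:
        raise ValueError(f"no tested Motion Frames value for {fps} FPS")
    return MOTION_FRAMES_BY_FPS[fps]


print(motion_frames_for(30))  # 25
```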
