InfiniteTalk ComfyUI Workflow (WAN 2.1): Img2Vid, Vid2Vid & Multi-Talk

By Esha

InfiniteTalk is a talking-video system. You feed it an image or an existing video plus an audio track, and it makes a lip-synced clip. There’s a ComfyUI workflow with ready-made nodes, so you can run it inside ComfyUI. It’s built around the Wan 2.1 i2v pipeline and uses an audio encoder (Wav2Vec2) to drive mouth and face motion.

What I’m doing

I run InfiniteTalk inside ComfyUI to get three results:

  • Turn a still photo into a talking video
  • Put new audio on an existing video
  • Get two people talking in the same scene

Files you need

Put the InfiniteTalk file you use (single or multi) in: ComfyUI/models/diffusion_models/.
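A quick way to confirm the file landed in the right folder is a small check like the one below. This is only a sketch: the two file names are the ones used in the examples later in this article, and the function name and the "ComfyUI" install path are placeholders I made up, so adjust them to your setup.

```python
from pathlib import Path

# File names as used in the examples of this article.
INFINITETALK_FILES = (
    "infinite_talk_single.safetensors",
    "infinite_talk_multi.safetensors",
)


def find_infinitetalk_files(comfy_root):
    """Return the InfiniteTalk files found in <comfy_root>/models/diffusion_models/."""
    model_dir = Path(comfy_root) / "models" / "diffusion_models"
    return [name for name in INFINITETALK_FILES if (model_dir / name).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is a hypothetical install path; point it at your own folder.
    found = find_infinitetalk_files("ComfyUI")
    print("Found:", found or "no InfiniteTalk file yet")
```

You only need the file for the mode you plan to run, so an output listing just one of the two names is fine.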

Base model and fast preset

  • Base: WAN 2.1 I2V 480p (also works with WAN 2.1 Fusion X and WAN 2.1 720p)
  • Encoders: Use the same WAN 2.1 text encoder and the same WAN 2.1 VAE you already use
  • Speed: Lightning LoRA, steps = 4, CFG = 1
  • Samplers that stayed stable on my card: DPM++ SDE, LCM, FlowMatch. I stayed on DPM++ SDE.

All WAN 2.1 model files: (https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/)

I also made a quick video tutorial that shows the InfiniteTalk workflow running inside ComfyUI.

Workflow

  • Load an image or load a video.
  • Load the MP3.
  • Open Resolution Master and press Auto so it copies the image size. If you want a standard size, pick a preset. Either way, you don’t have to type the width and height by hand.
  • In the audio group, set start and end with Audio Crop.
  • The small math node reads your FPS and the audio length and fills the frame count by itself.
  • Pick the right InfiniteTalk weight in its node (single or multi).
  • Render.
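The frame count the math node fills in can be reproduced by hand. The sketch below assumes the node simply multiplies FPS by the cropped audio length; that assumption matches the 360 frames reported in Example 2, but the node's real internals may differ. The function name is mine, not a ComfyUI API.

```python
def frame_count(fps, start_s, end_s):
    """Frames needed to cover the cropped audio, assuming frames = fps * seconds."""
    if end_s <= start_s:
        raise ValueError("Audio Crop end must be after start")
    return round(fps * (end_s - start_s))


# Example 2 in this article: 30 FPS, Audio Crop from 0 to 12 s.
print(frame_count(30, 0, 12))  # 360
```

This is handy for a sanity check before a long render: if the number in the workflow is far from this value, the crop times or the FPS are probably not what you expect.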

Example 1 — image to talking video (single)

  • Photo size: 1792 × 2368
  • Resolution Master on Auto picked 720 × 960 (fine for WAN 2.1)
  • Audio: 42 s → Audio Crop from 0 to 42
  • InfiniteTalk file: infinite_talk_single.safetensors
  • Sampler: DPM++ SDE
  • Lightning: steps 4, CFG 1
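To see where 720 × 960 can come from, here is one plausible calculation. I don't know how Resolution Master computes its Auto size internally, so treat this as a guess: it scales the photo so the long side is 960 and snaps both sides to a multiple of 16. The `long_side` and `multiple` values are my assumptions, chosen because they reproduce the numbers above.

```python
def fit_resolution(width, height, long_side=960, multiple=16):
    """Scale so the longer side equals long_side, then snap both sides to a multiple."""
    scale = long_side / max(width, height)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


# The 1792 x 2368 photo from this example.
print(fit_resolution(1792, 2368))  # (720, 960)
```

The aspect ratio shifts very slightly (about 0.757 to 0.75), which is normal when sizes are snapped to a grid.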

What I saw: lips match words from start to end. Small blinks. Small head moves. On my GPU it took a bit over 20 min and used about 13–16 GB VRAM.

Example 2 — video to video (new audio, one speaker)

  • Source video: 1920 × 1080, 30 FPS, 998 frames
  • Target size: preset 832 × 480
  • New audio: 27 s, but I only need 12 s → Audio Crop 0 to 12
  • FPS: keep 30 FPS with get_fps ON. The math node fills 360 frames.
  • InfiniteTalk file: single
  • Lightning: steps 4, CFG 1
  • Sampler: DPM++ SDE
  • Prompt: “looking at the phone, natural review expression”
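Before rendering, it helps to check that the source clip is long enough for the cropped audio. The sketch below uses the numbers from this example and assumes the workflow needs at least one source frame per output frame; that requirement is my assumption, and the function name is hypothetical.

```python
def source_covers_audio(source_frames, fps, audio_seconds):
    """Return (source duration in s, frames needed, whether the source is long enough)."""
    source_seconds = source_frames / fps
    needed_frames = round(fps * audio_seconds)
    return source_seconds, needed_frames, source_frames >= needed_frames


# 998 source frames at 30 FPS, 12 s of cropped audio.
seconds, needed, ok = source_covers_audio(998, 30, 12)
print(f"{seconds:.1f} s of video, {needed} frames needed, enough: {ok}")
```

Here the source runs about 33 s, so the 12 s crop fits with room to spare.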

What I saw: when he says “pixel user,” the mouth shape lands on time. Pauses also look right. It reads like native speech, not a dub.

Example 3 — two people talking (multi)

  • InfiniteTalk file: infinite_talk_multi.safetensors
  • Base: WAN 2.1 I2V 480p (same encoder, same VAE)
  • One photo: a man and a woman in a car. I press Auto in Resolution Master so size is set.

Two audio tracks:

  • Man: 0 to 9 s
  • Woman: 0 to 12 s

Each voice gets its own Load Audio and Audio Crop. The math node sets the frame counts from your FPS and length.

  • Lightning: steps 4, CFG 1
  • Sampler: DPM++ SDE
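The two crops add up to the clip length if the voices play one after the other, which is how 9 s plus 12 s gives the roughly 21 s total seen in the result. The sketch below lays the crops out that way. Sequential playback is my reading of the result, not something the node documents, and the function name is made up.

```python
def sequential_timeline(clips):
    """clips: list of (speaker, crop_start_s, crop_end_s).

    Returns where each voice sits on the output timeline, plus the total length,
    assuming the voices play back to back.
    """
    timeline, cursor = [], 0.0
    for speaker, start_s, end_s in clips:
        length = end_s - start_s
        timeline.append((speaker, cursor, cursor + length))
        cursor += length
    return timeline, cursor


timeline, total = sequential_timeline([("man", 0, 9), ("woman", 0, 12)])
print(timeline)  # [('man', 0.0, 9.0), ('woman', 9.0, 21.0)]
print(total)     # 21.0
```

Multiply the total by your FPS to estimate the overall frame count for the render.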

What I saw: about 21 s total. When the man talks, the woman looks at him. When she talks, he turns. Lip-sync stays steady.

Small tips that helped

  • In WAN Video Long I2V, pick Motion Frames based on your output FPS: 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
  • If color shifts between frames, keep ColorMatch OFF.
  • Two or more speakers: add one more Load Audio + Audio Crop pair per voice with clear start and end times.
  • Keep Lightning LoRA at steps 4, CFG 1 for fast tests.
  • Start with 480p (WAN 2.1 I2V 480p). Upscale later if your VRAM is small.
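If you switch FPS often, the Motion Frames values from the first tip fit in a small lookup. This sketch only holds the three pairs listed above; I have no values for other frame rates, so it raises an error instead of guessing. The names are mine, not part of any node.

```python
# Motion Frames values from the tips above, keyed by output FPS.
MOTION_FRAMES_BY_FPS = {30: 25, 25: 20, 16: 16}


def motion_frames_for(fps):
    """Return the Motion Frames value listed for this FPS, or fail loudly."""
    try:
        return MOTION_FRAMES_BY_FPS[fps]
    except KeyError:
        raise ValueError(f"no tested Motion Frames value for {fps} FPS") from None


print(motion_frames_for(30))  # 25
```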

Studied Computer Science. Passionate about AI, ComfyUI workflows, and hands-on learning through trial and error. Creator of AIStudyNow — sharing tested workflows, tutorials, and real-world experiments.