InfiniteTalk ComfyUI Workflow (WAN 2.1): Img2Vid, Vid2Vid & Multi-Talk

By Esha

InfiniteTalk is a talking-video system. You feed it an image or an existing video plus an audio track, and it produces a lip-synced clip. A ComfyUI workflow with ready-made nodes lets you run it inside ComfyUI. It’s built around the Wan 2.1 i2v pipeline and uses an audio encoder (Wav2Vec2) to drive mouth and face motion.

What I’m doing

I run InfiniteTalk inside ComfyUI to get three results:

  • Turn a still photo into a talking video
  • Put new audio on an existing video
  • Make two people talk in the same scene

Files you need

Put the InfiniteTalk file you use (single or multi) in ComfyUI/models/diffusion_models/.
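If the InfiniteTalk node does not list your file, a quick check of that folder can help. Here is a minimal Python sketch that lists the .safetensors files under ComfyUI/models/diffusion_models/. The function name `list_diffusion_models` and the "ComfyUI" root path are assumptions for illustration; point it at your own install.

```python
from pathlib import Path


def list_diffusion_models(comfy_root: str) -> list:
    """Return the names of .safetensors files in <comfy_root>/models/diffusion_models/."""
    model_dir = Path(comfy_root) / "models" / "diffusion_models"
    if not model_dir.is_dir():
        # Folder missing: wrong root path or a non-standard install.
        return []
    return sorted(p.name for p in model_dir.glob("*.safetensors"))


# "ComfyUI" is a placeholder root; replace it with your install path.
print(list_diffusion_models("ComfyUI"))
```

An empty list means either the root path is wrong or the file was saved somewhere else.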

Base model and fast preset

  • Base: WAN 2.1 I2V 480p (also works with WAN 2.1 Fusion X and WAN 2.1 720p)
  • Encoders: Use the same WAN 2.1 text encoder and the same WAN 2.1 VAE you already use
  • Speed: Lightning LoRA, steps = 4, CFG = 1
  • Samplers that stayed stable on my card: DPM++ SDE, LCM, FlowMatch. I stayed on DPM++ SDE.

All WAN 2.1 model files: (https://aistudynow.com/how-to-run-wan-2-1-fusionx-gguf-advanced-comfyui-workflow-on-low-vram/)

Workflow

  • Load an image or load a video.
  • Load the MP3.
  • Open Resolution Master and press Auto so it copies the image size. If you want a standard size, pick a preset. You don’t need to type width and height by hand.
  • In the audio group, set start and end with Audio Crop.
  • The small math node reads your FPS and the audio length and fills the frame count by itself.
  • Pick the right InfiniteTalk weight in its node (single or multi).
  • Render.
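The frame count that the math node fills in follows from FPS times the cropped audio length. Here is a small Python sketch of that arithmetic, not the node's actual code: the function name `frame_count` is made up for illustration, and the real node may round differently or add an extra frame.

```python
def frame_count(fps: float, start_s: float, end_s: float) -> int:
    """Frames needed to cover the cropped audio at the given frame rate."""
    if end_s <= start_s:
        raise ValueError("Audio Crop end must be after start")
    # Assumption: plain rounding of fps * duration.
    return round(fps * (end_s - start_s))


# Example 2 in this post: 30 FPS, Audio Crop 0 to 12 s.
print(frame_count(30, 0, 12))  # 360
```

If the frame count in your workflow is far from this number, check that the FPS input and the Audio Crop times are what you expect.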

Example 1 — image to talking video (single)

  • Photo size: 1792 × 2368
  • Resolution Master (Auto) set the size to 720 × 960 (fine for WAN 2.1)
  • Audio: 42 s → Audio Crop from 0 to 42
  • InfiniteTalk file: infinite_talk_single.safetensors
  • Sampler: DPM++ SDE
  • Lightning: steps 4, CFG 1

What I saw: lips match words from start to end. Small blinks. Small head moves. On my GPU it took a bit over 20 min and used about 13–16 GB VRAM.
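To see how a 1792 × 2368 photo can end up at 720 × 960, here is a rough Python sketch. It is not Resolution Master's real algorithm. It assumes the long side is scaled to 960 and both sides are snapped to a multiple of 16; the name `fit_resolution` and both default values are assumptions.

```python
def fit_resolution(width: int, height: int, long_side: int = 960, multiple: int = 16):
    """Scale so the long side equals long_side, then snap both sides to a multiple."""
    scale = long_side / max(width, height)

    def snap(value: int) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


# The photo from Example 1.
print(fit_resolution(1792, 2368))  # (720, 960)
```

Snapping changes the aspect ratio slightly (0.757 becomes 0.75 here), which is usually too small to notice.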

Example 2 — video to video (new audio, one speaker)

  • Source video: 1920 × 1080, 30 FPS, 998 frames
  • Target size: preset 832 × 480
  • New audio: 27 s, but I only need 12 s → Audio Crop 0 to 12
  • FPS: keep 30 FPS with get_fps ON. The math node fills 360 frames.
  • InfiniteTalk file: single
  • Lightning: 4 / 1
  • Sampler: DPM++ SDE
  • Prompt: “looking at the phone, natural review expression”

What I saw: when he says “pixel user,” the mouth shape lands on time. Pauses also look right. It reads like native speech, not a dub.

Example 3 — two people talking (multi)

  • InfiniteTalk file: infinite_talk_multi.safetensors
  • Base: WAN 2.1 I2V 480p (same encoder, same VAE)
  • One photo: a man and a woman in a car. I press Auto in Resolution Master so size is set.

Two audio tracks:

  • Man: 0 to 9 s
  • Woman: 0 to 12 s

Each voice gets its own Load Audio and Audio Crop. The math node sets the frame counts from your FPS and length.

  • Lightning: 4 / 1
  • Sampler: DPM++ SDE

What I saw: about 21 s total. When the man talks, the woman looks at him. When she talks, he turns. Lip-sync stays steady.
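The 21 s total matches the two crops added together (9 s + 12 s). Here is a minimal Python sketch of that timeline, assuming the voices play one after the other. The post does not confirm this is how the multi workflow schedules them, and the helper name `speaker_timeline` is hypothetical.

```python
def speaker_timeline(clips):
    """Lay clips end to end. clips is a list of (speaker, duration_s) pairs.

    Returns a list of (speaker, start_s, end_s) tuples.
    """
    timeline = []
    cursor = 0.0
    for speaker, duration in clips:
        timeline.append((speaker, cursor, cursor + duration))
        cursor += duration
    return timeline


# Example 3: man speaks for 9 s, then the woman for 12 s (assumed order).
timeline = speaker_timeline([("man", 9), ("woman", 12)])
for speaker, start, end in timeline:
    print(f"{speaker}: {start:.0f} s to {end:.0f} s")
print("total:", timeline[-1][2], "s")
```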

Small tips that helped

  • In WAN Video Long I2V, set Motion Frames according to your output FPS: use 25 for 30 FPS, 20 for 25 FPS, 16 for 16 FPS.
  • If color shifts between frames, keep ColorMatch OFF.
  • Two or more speakers: add one more Load Audio + Audio Crop pair per voice with clear start and end times.
  • Keep Lightning LoRA at steps 4, CFG 1 for fast tests.
  • Start with 480p (WAN 2.1 I2V 480p). Upscale later if your VRAM is small.
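The Motion Frames tip can be kept as a small lookup. This Python sketch only encodes the three pairs listed above. The function name `motion_frames_for` is made up, and it deliberately refuses to guess a value for any other FPS.

```python
# FPS -> Motion Frames pairs taken from the tip above.
MOTION_FRAMES = {30: 25, 25: 20, 16: 16}


def motion_frames_for(fps: int) -> int:
    """Return the Motion Frames value for a listed FPS, or raise for anything else."""
    if fps not in MOTION_FRAMES:
        raise KeyError(f"No tested Motion Frames value for {fps} FPS")
    return MOTION_FRAMES[fps]


print(motion_frames_for(30))  # 25
```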

Studied Computer Science. Passionate about AI, ComfyUI workflows, and hands-on learning through trial and error. Creator of AIStudyNow — sharing tested workflows, tutorials, and real-world experiments.
15 Comments
  • Can you post what setup you have for ComfyUI, Python version, pytorch version etc? I use Windows. I have not been able to get this to work with Python 3.12. It starts to run and then gets an error about missing Triton. When I try to install Triton, it always fails saying “no compatible version found”

  • Please, ma’am, help me fix this. I’m using infinitetalk_multi (Example 3, two people talking) and WanVideoSampler fails with: ‘NoneType’ object has no attribute ‘max’. How do I fix it?

  • Hello Dear Esha,

    First of all, I would like to sincerely thank you for sharing your workflow and valuable guidance. I am currently using your workflow (Unlimited Talk – Single AI, studynow.com) and have also downloaded all the recommended models to ensure proper setup.

    Here is my current process:

    I uploaded my character image.

    I added an audio file (9 seconds in length).

    I did not change or modify any settings.

    I simply pressed the RUN button to generate the output.

    The process completed successfully, however, the output video shows correct and clean results only for the first two seconds. After that, the video becomes heavily distorted with high noise. The audio, on the other hand, plays perfectly throughout.
    I have attached a screenshot for reference: (https://i.postimg.cc/VNRPNhYs/Screenshot-2025-08-28-170224.png).

    My system specifications:

    Operating System: Windows 11 Pro

    GPU: NVIDIA RTX 5070

    RAM: 64 GB DDR5

    ComfyUI Installation: Manually installed Step by Step (not using the portable version)

    Could you please guide me on how I can resolve this issue? I would greatly appreciate any troubleshooting steps or adjustments you might recommend.

    Thank you very much for your time and support.

    Best regards,
    Ch Nisar

  • Hi, I get an error in ComfyUI about the “ResolutionMaster” node; I don’t have it. Can you please share the download link for this node? I tried to install it from ComfyUI, but after restarting ComfyUI the problem is still there.

  • Hi! Does this model only fit square 5 characters well? I tried 16*9 and they don’t come out very well, unlike the square or vertical format.

  • Hey, many thanks for sharing the workflows and detailed instructions! (and thanks for the other resources you’re posting on youtube and on this website, very helpful learning how it works)
    I was able to run the workflow, however the video looks slo-mo and the audio isn’t synced (although lipsync was generated). Tried to use 48khz and 41khz. Any ideas why this is happening ?
    Thanks :)
