WAN 2.2 VACE: Object Swap with an Image in ComfyUI (GGUF Q5)

I made one simple workflow that does a proper swap. Pick your source video. Pick one image of your subject. Mark the object to replace. Run. Faces stay stable and the motion looks natural.

What this workflow does

  1. Swaps one object or person in the video with your subject image
  2. Keeps the rest of the scene as it is
  3. Uses point-based masking, a 3-sampler chain, and gentle LoRA weights
  4. Works with FP8 and Q5 GGUF builds on low VRAM
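
If you want to trigger runs outside the UI, ComfyUI's built-in HTTP API can queue a saved workflow. A minimal sketch, assuming ComfyUI runs on the default port and you exported the graph in API format as swap_workflow_api.json (the port and file name are placeholders for your setup):

    import json
    import urllib.request

    # Load a workflow exported via "Save (API Format)" in ComfyUI.
    with open("swap_workflow_api.json") as f:
        workflow = json.load(f)

    # Queue it on a local ComfyUI instance (default port 8188).
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # returns the prompt_id for the queued job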

The quick demo

Scene
A park. An alien is eating a burger. A bodyguard is there. An older man sits with a newspaper. He smiles and looks at the alien.

Goal
Replace the alien with a woman from my image.

Result
The alien is replaced by the woman. Motion matches. Timing stays the same. The face stays consistent across frames. On one run, the legs came out with mixed footwear; I added one small line to the prompt and ran again. Fixed.

Workflow Groups

Video section

  • Upload your clip.
  • In Resolution Master, pick a supported size, for example 832 x 480.
  • If your clip is vertical, click Swap to get 480 x 832. Good for low VRAM.
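
Resolution Master handles this inside the graph, but if you want to sanity-check sizes in code, here is a tiny helper of my own (not a node) that snaps to the usual multiple-of-16 constraint and picks the swapped size for vertical clips:

    def snap(dim, multiple=16):
        # WAN checkpoints expect dimensions that are multiples of 16.
        return max(multiple, round(dim / multiple) * multiple)

    def pick_size(clip_w, clip_h, base=(832, 480)):
        # Landscape clips keep 832 x 480; vertical clips get the swapped 480 x 832.
        w, h = base if clip_w >= clip_h else (base[1], base[0])
        return snap(w), snap(h)

    print(pick_size(1920, 1080))  # (832, 480)
    print(pick_size(1080, 1920))  # (480, 832)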

Image section

  • Upload your subject image.
  • Match the same resolution you used for the video.
  • Use Image Background Remover so the subject blends well.
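
If you prefer to prep subject images in batch outside ComfyUI, the same cut-and-resize step can be sketched with the rembg package. This is one common backend; the node in the workflow may use a different model, and the file names here are placeholders:

    from PIL import Image
    from rembg import remove  # pip install rembg

    subject = Image.open("subject.png")

    # Match the resolution picked for the video, e.g. 480 x 832 for a vertical clip.
    subject = subject.resize((480, 832))

    # Cut the subject from its background so it composites cleanly.
    cutout = remove(subject)  # RGBA image with a transparent background
    cutout.save("subject_clean.png")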

Masking

  • Green points on the target you will change.
  • Red points on the parts you will not touch.
  • Do one test run with only the mask visible to check coverage. Then enable the rest.
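
Point editors like this usually drive a SAM-style segmenter underneath: green points are positive labels, red points are negative. A standalone sketch with the segment-anything package, where the checkpoint path and the point coordinates are placeholders:

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
    predictor = SamPredictor(sam)

    frame = np.array(Image.open("frame_000.png").convert("RGB"))
    predictor.set_image(frame)

    # Green points (label 1) sit on the object to replace; red points
    # (label 0) sit on everything that must stay untouched.
    points = np.array([[420, 260], [430, 340], [120, 300]])
    labels = np.array([1, 1, 0])

    masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels)
    best = masks[np.argmax(scores)]  # boolean mask for the target object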

Model loaders

  • Two loaders are present: one High-Noise and one Low-Noise.
  • You can choose FP8 or Q5 GGUF builds here.
  • If you pick GGUF, keep any extra quant switch inside the node off.
  • There is a VACE Model Select node. Pick one High-Noise file and one Low-Noise file in that node.
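
In API format the two loaders are just two nodes pointing at the matching files. An illustrative fragment, assuming the UnetLoaderGGUF node from the ComfyUI-GGUF pack; the node IDs and file names here are placeholders:

    loaders = {
        # High-Noise expert for the early, coarse denoising steps.
        "1": {
            "class_type": "UnetLoaderGGUF",
            "inputs": {"unet_name": "Wan2.2-VACE-Fun-A14B-HighNoise-Q5_K_M.gguf"},
        },
        # Low-Noise expert for the late, detail-refining steps.
        "2": {
            "class_type": "UnetLoaderGGUF",
            "inputs": {"unet_name": "Wan2.2-VACE-Fun-A14B-LowNoise-Q5_K_M.gguf"},
        },
    }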

LoRA section

  • Load the Lightx 4-step LoRAs you already use.
  • Optional: add HPS v2 LoRA for human preference tuning. Start with small weights.
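
The stack chains model-only LoRA loaders: lightx first, then the optional HPS LoRA at a small weight. An illustrative fragment; the file names are placeholders and the strengths are starting points, not tuned values:

    lora_stack = {
        "10": {
            "class_type": "LoraLoaderModelOnly",
            "inputs": {
                "model": ["1", 0],  # model output from the loader above
                "lora_name": "lightx2v_4step.safetensors",  # placeholder name
                "strength_model": 1.0,
            },
        },
        "11": {
            "class_type": "LoraLoaderModelOnly",
            "inputs": {
                "model": ["10", 0],
                "lora_name": "hps_v2_reward.safetensors",  # optional, placeholder name
                "strength_model": 0.3,  # start small, per the note above
            },
        },
    }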

Text and VAE

  • Use the matching text encoder and VAE for your build.
  • Write your positive prompt in the text encoder node.

Sampling

  • Keep it simple.
    • 1 step base without LoRA.
    • Then 3 + 3 with LoRA.
  • Very low VRAM? Bypass the 1-step base. If you can keep it, it improves motion stability.
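
The 1 + 3 + 3 chain maps onto three KSamplerAdvanced nodes sharing one step schedule. A sketch of the staging, assuming 7 total steps with the High/Low switch at step 4; the exact split in your workflow may differ:

    TOTAL_STEPS = 7  # 1 base + 3 High-Noise LoRA + 3 Low-Noise LoRA (assumed split)

    stages = [
        # Stage 1: base High-Noise model, no LoRA. Adds the initial noise,
        # runs one step, and passes leftover-noise latents onward.
        dict(model="high_noise_base", add_noise="enable",
             start_at_step=0, end_at_step=1, return_with_leftover_noise="enable"),
        # Stage 2: High-Noise model + LoRA, steps 1-4. No new noise is added.
        dict(model="high_noise_lora", add_noise="disable",
             start_at_step=1, end_at_step=4, return_with_leftover_noise="enable"),
        # Stage 3: Low-Noise model + LoRA finishes steps 4-7 and fully denoises.
        dict(model="low_noise_lora", add_noise="disable",
             start_at_step=4, end_at_step=TOTAL_STEPS,
             return_with_leftover_noise="disable"),
    ]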

Color match

  • Turn on color match so the inserted subject blends with the plate.
  • If tones look off, set the strength to 0 to disable it.
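
If you are curious what the node is doing: color-match nodes commonly implement something like a channel-wise mean/std transfer from the plate to the inserted region. A NumPy sketch of that idea, an assumption about the mechanism rather than the node's actual code:

    import numpy as np

    def match_colors(subject: np.ndarray, plate: np.ndarray, strength: float = 1.0):
        """Shift subject pixel statistics toward the plate's, per channel.
        strength=0 returns the subject unchanged, same as disabling the node."""
        s, p = subject.astype(np.float32), plate.astype(np.float32)
        matched = (s - s.mean(axis=(0, 1))) / (s.std(axis=(0, 1)) + 1e-6)
        matched = matched * p.std(axis=(0, 1)) + p.mean(axis=(0, 1))
        out = (1 - strength) * s + strength * matched
        return np.clip(out, 0, 255).astype(np.uint8)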

Upscale

  • Optional block at the end. Pick an upscaler and run for a sharper final.
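
In the graph this is typically two core nodes: load an upscale model, then push the decoded frames through it. An illustrative fragment; the model file and the source node reference are placeholders:

    upscale = {
        "20": {
            "class_type": "UpscaleModelLoader",
            "inputs": {"model_name": "4x-UltraSharp.pth"},  # placeholder file
        },
        "21": {
            "class_type": "ImageUpscaleWithModel",
            # "9" stands in for whichever node outputs the decoded frames.
            "inputs": {"upscale_model": ["20", 0], "image": ["9", 0]},
        },
    }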

Files you need

Put these in the right ComfyUI folders so the nodes can find them fast. If a node still cannot see a file, restart ComfyUI and check the loader path once.

Base models

Text to Video (T2V)

  • GGUF builds: QuantStack Wan2.2 T2V A14B GGUF, on Hugging Face. Folder: ComfyUI/models/diffusion_models.
  • FP8 builds: Kijai WanVideo fp8 scaled (T2V), on Hugging Face. Folder: ComfyUI/models/diffusion_models (use Load Diffusion Model or the WanVideo loader).

VACE (video editing and masked swap)

  • GGUF builds: QuantStack Wan2.2 VACE Fun A14B GGUF, on Hugging Face. Folder: ComfyUI/models/diffusion_models.
  • FP8 builds: Kijai WanVideo fp8 scaled (VACE), on Hugging Face. Folder: ComfyUI/models/diffusion_models.

LoRAs

  • Lightx 4-step LoRAs for T2V, on Hugging Face. Folder: ComfyUI/models/loras.
  • Wan2.2 Fun Reward LoRAs from Alibaba PAI (optional quality tuning), on Hugging Face. Folder: ComfyUI/models/loras.

Wan 2.2 FP8 vs Q5 GGUF comparison

I ran the swap first with the FP8 build, then switched to Q5 GGUF with the same mask and the same resolution.

[Side-by-side result clips: FP8 model vs. GGUF Q5]

Timing and VRAM usage

On my PC the first pass, High Noise without LoRA, wraps up in around 31 seconds and takes roughly 23 to 24 GB of VRAM. After that, the High Noise pass with LoRA at three steps takes close to 38 seconds and sits near 12 to 13 GB. The final Low Noise pass with LoRA at three steps is about 51 seconds. Together, for 97 frames, the render lands around 1 minute 32 seconds start to finish.

Output notes

  1. The Q5 result looks good, very close to the FP8 output.
  2. If you want extra detail, turn on the upscale group. My 97-frame upscale took around 48 minutes. The full frame gets sharper. Eyes may still look soft if the source eyes are soft.

Second example with a T-shirt design

New video: a woman walks on the street.

New subject image: another woman with a T-shirt that says “excuse me”.

Goal
Swap the subject and keep the same T-shirt design.

Steps

  • Mask cleanly. Green circles on the subject. Red on the background.
  • Run the model groups first. Keep upscale off for the first pass.

Result
The swap is clean. The T-shirt text is close but not perfect.
I add a small line to the prompt about the shoes and remove a side cape I do not want.
Run again. Now “excuse me” is readable. Shoes are correct.
The graphic is about 98 percent the same. To reach 100 percent, try one or two tiny prompt tweaks and re-run.

FAQ

Do I need both High-Noise and Low-Noise models?
Yes. The early steps need High-Noise. The fine detail needs Low-Noise. That mix keeps motion clean and stable.

What sampler steps should I keep?
Use 1 step without LoRA for the base. Then 3 and 3 with LoRA. This is fast and stable on low VRAM.

When should I use color match?
Keep it on by default. If skin or clothes look strange, set it to zero and try again.

By Esha

Studied Computer Science. Passionate about AI, ComfyUI workflows, and hands-on learning through trial and error. Creator of AIStudyNow — sharing tested workflows, tutorials, and real-world experiments.