This article demonstrates how to run WAN 2.2 Fun Control on low VRAM. On an RTX 5090 each sampling pass takes about 20 to 30 seconds, and the workflow also runs on GPUs with 5 to 6 GB of VRAM. First, update ComfyUI and make sure it is optimized.
If the GPU has enough memory, FP8 also works. For low memory, Q4_K_M gives a quality close to FP8 in these tests.
A video tutorial demonstrating WAN 2.2 Fun Control on Low VRAM within ComfyUI is available for further guidance.
Files And Models
Use one of these in the Diffusion Models folder:
- FP8 model file for WAN 2.2
- GGUF Q4_K_M for low-memory runs
LoRA: the newest four-step LoRA for both Low Noise and High Noise.
- wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors
- wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise.safetensors
Text encoder options:
- WAN 2.2 text encoder in the WANVideo Text Encoder Cache node on the CPU
- Or CLIP Load GGUF with the text encoder umt5-xxl-encoder-Q4_K_M.gguf, with the name set to text_encoder_lowvram
Setup In ComfyUI
The process is illustrated in a graph divided into five parts for clarity.
Input Image
- Upload one reference image.
- In the Resolution Master node, the presets cover the supported WAN 2.2 sizes. If you get an out-of-memory error, pick a smaller size from the WAN 2.2 list. Use Swap for vertical (see the sketch below).
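Not part of the workflow itself, just a minimal sketch of the preset-plus-Swap idea. The preset values here are an assumption for illustration; use whatever sizes the Resolution Master node actually lists.

```python
# Minimal sketch: pick a low-memory size and swap it for vertical footage.
# The preset list below is an assumption; follow the Resolution Master node's own list.
PRESETS = [(832, 480), (1280, 720)]  # (width, height), landscape presets

def pick_size(vertical: bool, low_memory: bool = True) -> tuple[int, int]:
    w, h = PRESETS[0] if low_memory else PRESETS[-1]
    return (h, w) if vertical else (w, h)  # "Swap" simply flips width and height

print(pick_size(vertical=True))  # -> (480, 832), the size used in the demo run later
```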
ControlNet Video
Two choices:
- Bring a ready skeleton video or depth video from outside; this saves memory (a sketch of pre-generating one follows this list).
- Make the control video inside the graph. Load a video and choose DWPose Estimator for skeleton or Depth Anything v2 for depth, bypassing the one you do not need. Connect Image Hook to Set Control Signal Hook; it will write the skeleton or depth video for motion.
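If you want to prepare the control video outside ComfyUI, a rough sketch is below. It only handles reading frames and writing the result with OpenCV; `estimate_control_frame` is a hypothetical stand-in for whatever estimator you run per frame (DWPose for skeleton, Depth Anything v2 for depth), not an API from those projects.

```python
# Sketch: pre-generate a control video outside the graph to save memory during the run.
# Assumes OpenCV is installed; estimate_control_frame is a hypothetical per-frame callable.
import cv2

def make_control_video(src_path: str, dst_path: str, estimate_control_frame) -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 16.0
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        control = estimate_control_frame(frame)  # expected: BGR image, same size as frame
        if writer is None:
            h, w = control.shape[:2]
            writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        writer.write(control)
    cap.release()
    if writer is not None:
        writer.release()
```

Feed the resulting file to the graph as the ready-made skeleton or depth video from the first choice above.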
Build Tools Note
Many installations miss Microsoft Visual Studio Build Tools. Install it, click Modify, choose Desktop development with C++, and tick all items under Optional. This avoids compile errors when Torch builds kernels.
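A quick way to confirm the compiler is picked up, assuming a recent PyTorch build: run a tiny torch.compile call. If Build Tools are missing, the first call typically fails with a compiler error instead of returning a tensor.

```python
# Minimal smoke test (not part of the workflow): with Build Tools installed,
# torch.compile should finish the first call without a compiler error.
import torch

@torch.compile
def double(x):
    return x * 2

print(double(torch.randn(4)))
```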
Models And Loaders
There are two model loaders in the graph, one for Low Noise and one for High Noise.
FP8 users: set Quantization to fp8_e4m3fn_scaled
GGUF users: keep quantization disabled in the loader
Attention: SDPA or SageAttention. SageAttention can be faster and use less memory. If a sampler errors with SageAttention, switch back to SDPA.
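The same fallback idea in code, as a sketch: it assumes the sageattention package exposes a sageattn function (which is how some ComfyUI nodes call it) and drops back to PyTorch's built-in SDPA when the package is missing.

```python
# Sketch of the SageAttention-or-SDPA fallback described above.
import torch.nn.functional as F

try:
    from sageattention import sageattn

    def attention(q, k, v):
        return sageattn(q, k, v)  # SageAttention: can be faster and use less memory
except ImportError:
    def attention(q, k, v):
        return F.scaled_dot_product_attention(q, k, v)  # plain SDPA fallback
```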
Video Block To Swap: raise this number if memory runs out. It lowers peak use with a small hit to time.
Cache folder: point the Torch compile cache to one custom folder (one way to set it is sketched below). The first run warms the cache; repeat runs are then much faster.
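One way to redirect the compile cache is the TORCHINDUCTOR_CACHE_DIR environment variable, set before ComfyUI starts (for example in its launcher script). This is an assumption about your setup; the WAN Video Torch Compile and Cache node may also expose a path field directly.

```python
# Point the Torch Inductor compile cache at a custom folder before torch is imported.
import os

os.environ["TORCHINDUCTOR_CACHE_DIR"] = r"D:\ComfyUI\torch_compile_cache"  # hypothetical path
```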
Sampler And Speed
DPM++ SDE gives the best look in these runs. For very low memory cards, LCM is a good alternative and looks almost the same in this workflow. Pick one and keep it the same on both sampling stages.
Demo Run On Low VRAM
Image
The start image is 1792 x 2368, which will give an out-of-memory error on a small card. Pick the 480 x 832 preset and switch to vertical. This size is safe for low memory.
FP8 Test
Low Noise and High Noise are both set to FP8 with fp8_e4m3fn_scaled. Attention is set to SageAttention; if it throws an error, move to SDPA. Raise Video Block To Swap if memory is low. The four-step LoRA is on both loaders. The text encoder runs on the CPU in the WANVideo Text Encoder Cache node.
Control Video
Load a ready skeleton video. If you only have a normal video, wire DWPose Estimator (or Depth Anything v2) through Resize to get a clean skeleton or depth video.
Numbers Seen
Size: 480 x 832. Frames: 81. During sampling the card stays around 5 to 6 GB. The first sampling takes about 16 seconds and the second about 55 seconds. Motion follows the skeleton well, and hands came out correctly in this test.
A 9-second clip demonstrates the result of this run.
GGUF Q4_K_M Test
Switch the model to the Q4_K_M GGUF and turn off quantization in the loader. Keep the other settings the same. The first sampling at 480 x 832 took about 21 seconds, and the second about 33 seconds. In these runs, FP8, GGUF Q4_K_M, and safetensors showed the same visual quality. Q6 and Q8 also work, but Q4_K_M matched FP8 and BF16 in look while using less memory.
Samplers
Switching from DPM++ SDE to LCM gave almost the same look and only a small change in time in some runs: the first sampling went from 21 seconds to 22 seconds, and the second from 33 to 34 seconds. Load time can be shorter with LCM on low memory.
Torch Compile Cache
The WAN Video Torch Compile and Cache block points to one folder for all compile files. With build tools installed, kernels compile once and then runs are quicker. This is useful when testing many clips in one session.
Motion From A New Video
Upload a dance video. Unbypass DWPose Estimator, wire the hooks as shown, and keep the same settings. With Q4_K_M at 480 x 832, memory rises slightly while the control video is written and rendered but stays around 5 to 6 GB. The output keeps the face mostly stable and fills in clothes and body from the prompt.
Quick Fix List
- Memory error during load: lower the preset size or raise Video Block To Swap
- Sampler error on SageAttention: change to SDPA
- Timing too slow on repeats: set a custom torch compile cache folder and keep the folder for the whole session
- Text encoding on GPU uses too much memory: use WANVideo TextEncoder Cache on CPU, or CLIP Load GGUF with text encoder Q4_K_M and name text_encoder_lowvram
- Control video does not match: check the DWPose Estimator or Depth Anything v2 wiring and confirm Set Control Signal Hook is connected