I finally got around to updating the Wan 2.1 FusionX workflow to support GGUF for text-to-video and image-to-video generation.
Let me explain why this matters.
A while back, I released a Fusion X workflow that used SafeTensors — which worked great but required more VRAM. The feedback was pretty consistent: “Can you make this work with GGUF instead?”
The short answer: yes.
And not only does it work now, but it can run on as little as 2 GB of VRAM — which is huge for folks with older hardware or limited resources.
That’s right, even if you’re stuck with an older GPU, you can still use Wan 2.1 FusionX (GGUF version) for high-quality video generation without crashing your system.
Supported Models and Quantization Levels
This new workflow uses several GGUF quantized models that are available on Hugging Face. These include both text-to-video (T2V) and image-to-video (I2V) models.
Here’s what you need to know about the quantization levels:
- Q2_K to Q8_0: choose based on your available VRAM.
- If you’re running on 2 GB of VRAM, stick with Q3 or Q4 models for smoother performance (see the sketch below).
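If you want a quick programmatic sanity check, here is a minimal Python sketch that reads free VRAM with PyTorch and suggests a quant level. The thresholds and picks above roughly 4 GB are my own rough guidance, not official recommendations, and exact quant file names vary between repos:

```python
# Minimal sketch: query free VRAM with PyTorch and suggest a GGUF quant level.
# Thresholds are rough, illustrative guidance only; exact file names vary by repo.
import torch

def suggest_quant() -> str:
    if not torch.cuda.is_available():
        return "Q3 or Q4 (no CUDA device detected)"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    if free_gb <= 4:
        return "Q3 or Q4"   # the low-VRAM advice above
    if free_gb <= 12:
        return "Q5 or Q6"   # my own middle-ground suggestion
    return "Q8_0"

print("Suggested quant level:", suggest_quant())
```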
In this guide, I’ll also be comparing Q4 vs Q8 versions in real-world tests so you can see how much quality difference there is between them.
One of the nice features in this workflow is optional LoRA support. Here are the models you can optionally use:
- CausVid 14B v2: CausVid T2V LoRA (Hugging Face)
- LightX2V – helps reduce step count: Wan21_T2V_14B_lightx2v
- MPS – improves face detail: MPS Rewards LoRA (Hugging Face)
- AccVid: AccVid I2V LoRA (Hugging Face)
- Realism Boost: Realism Boost LoRA (Hugging Face)
- DetailEnhancerV1 – adds sharpness and clarity: Detail Enhancer V1 (Hugging Face)
You can bypass any of these by simply disabling their group in the workflow. No need to delete nodes or mess with connections — just toggle off the group and move on.
If you’re using the FusionX model, I noticed during testing that it actually performs better without LoRAs enabled. So unless you have a specific reason to use them, feel free to skip them when working with FusionX.
Installation Notes: SageAttention & Triton
Many users might not have SageAttention or Triton installed yet — and that’s totally fine. Both of these tools improve performance, especially when dealing with large models or complex workflows.
But here’s the good news:
You can safely bypass those nodes if you don’t have them. Just disable the group in the workflow, and you won’t hit any errors.
I highly recommend installing both if you plan to keep using this workflow long-term — but no pressure. It works either way.
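If you’re not sure whether they’re already installed, a quick check from the Python environment that runs ComfyUI will tell you. This assumes the packages are importable as sageattention and triton, which is how they’re normally published:

```python
# Quick availability check for the optional speed-up packages.
# If either one is missing, just leave its group bypassed in the workflow.
import importlib.util

for pkg in ("sageattention", "triton"):
    installed = importlib.util.find_spec(pkg) is not None
    status = "installed" if installed else "not installed (bypass its group)"
    print(f"{pkg}: {status}")
```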
For Low VRAM Users: MultiGPU Node Setup
Now, if you’re working with limited VRAM (under 6 GB), there’s a special setup you need to follow to avoid out-of-memory errors.
Here’s how to configure the MultiGPU node properly:
- Unbypass the “UNET Loader GGUF MultiGPU” node.
- Bypass all other UNET loader nodes.
- Connect the MultiGPU node directly to the LoraLoader model input.
- In the UNET name selector, choose your model and set the device to cuda:0.
- Make sure “use other VRAM” is set to true.
- Under Virtual VRAM GB, start with 0.1 and test a generation.
- If you get memory allocation errors, gradually increase the value — try 0.4 next.
This setup lets you run the workflow even on low-end GPUs without crashing.
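To make those steps concrete, here’s roughly how the node’s settings end up looking, written as an illustrative Python dict. The keys mirror the widget names from the steps above, not ComfyUI’s internal JSON schema, and the model file name is a placeholder:

```python
# Illustrative only: the MultiGPU UNET loader settings described above,
# written out as a plain dict. Keys mirror the workflow UI, not the exported JSON.
multigpu_unet_loader = {
    "unet_name": "Wan2.1_T2V_14B_FusionX-Q4.gguf",  # hypothetical file name; pick your quant
    "device": "cuda:0",
    "use_other_vram": True,
    "virtual_vram_gb": 0.1,  # start here; bump toward 0.4 if you hit allocation errors
}
```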
Important Bypass Groups for T2V vs I2V
Another thing to note: depending on whether you’re doing text-to-video (T2V) or image-to-video (I2V), you’ll want to enable/disable different sections of the workflow.
Here’s how to handle it:
- If you’re doing text-to-video, bypass the image-to-video group.
- If you’re doing image-to-video, bypass the text-to-video group.
It’s a simple toggle switch in the workflow — just make sure you’re not accidentally running both at the same time.
Results: Old Model vs New Fusion Model
You’ll remember earlier I mentioned that older Wan 2.1 models used to take around 30 steps to generate decent results.
Well, that changed.
With this updated workflow, the same model now produces great results in just 5–10 steps — which is a massive improvement in efficiency.
And when compared directly against the FusionX model without LoRAs, the non-Fusion model with LoRAs actually looked slightly better in some areas, like background clarity and motion blur effects.
So if you’re aiming for maximum visual fidelity, and you have the VRAM to spare, using the non-Fusion model with LoRAs enabled could be your best bet.
Final Thoughts Before Moving On
At this point, we’ve tested both text-to-video and image-to-video workflows using different models and configurations.
We’ve seen that:
- Q4 works great on low-VRAM systems, and sometimes performs better than Q8 depending on your needs.
- Fusion X doesn’t always benefit from LoRAs — sometimes it’s better to leave them off.
- Older Wan 2.1 models can still deliver strong results, especially when paired with the right LoRAs and updated workflow settings.
- Both T2V and I2V modes produce impressive outputs with the right configuration.
Model Download Links
Main WAN Model
- Model Name: Choose any GGUF file from the list
- WAN 2.1 I2V 14B 720p GGUF Models (Hugging Face)
- Wan2.1_T2V_14B_FusionX-GGUF
- Wan2.1_I2V_14B_FusionX-GGUF
- Place In: ComfyUI/models/diffusion_models
WAN VAE
- Model Name: wan_2.1_vae.safetensors
- WAN 2.1 VAE (Hugging Face)
- Place In: ComfyUI/models/vae
WAN Text Encoder
- Model Name: GGUF text encoders
- umt5-xxl-encoder-gguf (Hugging Face)
- Place In: ComfyUI/models/text_encoders
WAN CLIP
- Model Name: clip_vision_h.safetensors
- WAN CLIP Vision (Hugging Face)
- Place In: ComfyUI/models/clip_vision
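If you’d rather script the downloads than click through Hugging Face, here’s a minimal sketch using the huggingface_hub package. The repo IDs and GGUF file names are placeholders; substitute the actual repos and files from the links above, and adjust the ComfyUI path for your install:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# Repo IDs and file names marked <...> are placeholders; use the links above.
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFY = Path("ComfyUI")  # adjust to your ComfyUI install location

downloads = [
    # (repo_id, filename, destination folder)
    ("<user>/Wan2.1_T2V_14B_FusionX-GGUF", "<model>.gguf", COMFY / "models/diffusion_models"),
    ("<user>/<wan-vae-repo>", "wan_2.1_vae.safetensors", COMFY / "models/vae"),
    ("<user>/umt5-xxl-encoder-gguf", "<encoder>.gguf", COMFY / "models/text_encoders"),
    ("<user>/<clip-vision-repo>", "clip_vision_h.safetensors", COMFY / "models/clip_vision"),
]

for repo_id, filename, dest in downloads:
    dest.mkdir(parents=True, exist_ok=True)
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=dest)
    print("Saved:", path)
```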