You probably think the Wan 2.1 LongCat model is just for extending video clips or making avatars talk. Most people use it to make an avatar speak for ten or twenty seconds, and it works great for that. But I found a hidden switch inside this workflow that changes everything.
If you change just one number, this exact same model stops looking at your image and turns into a full text to video generator. It creates a completely new character from scratch just from your words.
In this post, I am going to show you how to use this Overlap 0 hack. I will also show you the exact settings you need to fix the big eyes bug and why the new default is 93 frames.
The Files You Need to Download
First, let’s make sure you have the right files. Since this is a special LongCat workflow, you cannot just use the normal Wan model. I checked the requirements and here is the list of files you must have.
- LongCat TI2V Comfy FP8 is the main checkpoint you need. I use the FP8 version because it fits on my GPU easily.
- LongCat Distill LoRA is required to make the render fast. Do not skip this file or it will be very slow.
- Wan 2.1 VAE is needed for the correct colors.
- UMT5 XXL FP8 is the text encoder that understands your prompt.
- Clip Vision H is required since we are doing Image to Video initially.
I recommend you double check your folder to ensure you have all of these before you start ComfyUI.
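If you want to automate that check, here is a minimal sketch in Python. The folder names follow the standard ComfyUI models layout, but the exact file names are my assumptions; rename them to match whatever you actually downloaded.

```python
# Minimal pre-flight check for the LongCat downloads.
# The file names below are assumptions -- edit them to match your downloads,
# and point COMFY at your actual ComfyUI install.
from pathlib import Path

COMFY = Path("ComfyUI/models")  # assumed install location

required = [
    COMFY / "diffusion_models" / "longcat_ti2v_comfy_fp8.safetensors",  # main checkpoint
    COMFY / "loras" / "longcat_distill_lora.safetensors",               # distill LoRA (speed)
    COMFY / "vae" / "wan_2.1_vae.safetensors",                          # VAE (correct colors)
    COMFY / "text_encoders" / "umt5_xxl_fp8.safetensors",               # text encoder
    COMFY / "clip_vision" / "clip_vision_h.safetensors",                # CLIP Vision (I2V)
]

missing = [p for p in required if not p.exists()]
if missing:
    for p in missing:
        print(f"MISSING: {p}")
else:
    print("All five files found -- safe to start ComfyUI.")
```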
Why I Use 93 Frames Now
Now, you may notice something different in this new workflow. In the old version, we always used 81 frames; it was the standard for a 5 second clip. But Kijai updated the workflow, and now the default is 93 frames.
Why did he change it?
It comes down to how the model sees time. Wan 2.1 does not see individual frames. It processes video as one starting frame plus blocks of 4 frames, which is why valid lengths follow a 4n+1 pattern. If you use 81 frames, that is exactly 20 blocks. But the community found a sweet spot.
The model can actually handle 23 blocks, which is 93 frames, without using much more VRAM or breaking the video. You are essentially squeezing out almost an extra second of footage for free: at Wan's default 16 fps, that is 5.8 seconds total instead of about 5. It is the maximum safe limit before the quality starts to drop. So I just leave it at 93 because it is better value for your compute.
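Here is that arithmetic as a tiny sketch, assuming the 4n+1 frame rule and Wan's default 16 fps output:

```python
# Frame math sketch: valid Wan 2.1 lengths are one starting frame plus 4-frame blocks.
FPS = 16  # Wan 2.1's default output frame rate

def frames_for(blocks: int) -> int:
    """Total frame count for a given number of 4-frame blocks (the 4n+1 rule)."""
    return 4 * blocks + 1

for blocks in (20, 23):
    frames = frames_for(blocks)
    print(f"{blocks} blocks -> {frames} frames -> {frames / FPS:.1f} seconds")

# 20 blocks -> 81 frames -> 5.1 seconds
# 23 blocks -> 93 frames -> 5.8 seconds
```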
The Overlap 0 Hack for Text to Video
Now, here is the trick I mentioned at the start. A user on Reddit discovered this, and I tested it myself.
Go to the node called WanVideo LongCat Avatar Extend Embeds. Usually, you set the overlap to 14 or 16. This tells the model to look at your avatar image and extend it.
But if you change this Overlap to 0, you break that connection. The model stops looking at your image entirely. It switches to Text to Video mode.
You do not need to disconnect the image nodes or change the workflow. Just set the overlap to 0, type a prompt like “A cinematic shot of a cyberpunk girl in neon rain”, and hit Queue. It will generate a brand new person from scratch.
So remember this simple rule. Use Overlap 14 for your Avatar. Use Overlap 0 for a New Character.
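If you queue renders through ComfyUI's HTTP API instead of the browser, the same switch is one line of JSON. This is a hedged sketch: the node id and the exact input name ("overlap") depend on your graph, so export your workflow with Save (API Format) and check what the Extend Embeds node is actually called there.

```python
# Hedged sketch: flip the same workflow between avatar mode and text-to-video
# by editing one input, then queue it on ComfyUI's /prompt endpoint.
import json
import urllib.request

with open("longcat_workflow_api.json") as f:  # exported via "Save (API Format)"
    workflow = json.load(f)

EXTEND_NODE_ID = "37"  # assumed id of the LongCat Avatar Extend Embeds node in this graph
workflow[EXTEND_NODE_ID]["inputs"]["overlap"] = 0  # 0 = new character, 14 = extend your avatar

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default ComfyUI address and port
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```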
Resolution and Speed Settings
Before we generate, we need to set your resolution, and the most important setting for speed is the attention mode.
For this model, do not use 1080p. I know you want 1080p, but the model was not trained on it. If you force it to 1920 by 1080, the face will distort, and it can even crash ComfyUI.
For the best quality, stick to 1280 by 720 for landscape, or use 720 by 1280 for portrait. If you have a smaller GPU like a 12 GB one, use 832 by 480. It is much faster and gives you a better quality result than forcing a resolution the model cannot handle.
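To keep those choices straight, here is a small helper sketch; the 16 GB cutoff is my own rough assumption, not an official requirement, so tune it to your card.

```python
# Resolution helper sketch -- picks from the resolutions discussed above.
# The VRAM cutoff is an assumption, not an official spec.
def pick_resolution(vram_gb: int, portrait: bool = False) -> tuple[int, int]:
    if vram_gb >= 16:
        width, height = 1280, 720  # close to what the model was trained on
    else:
        width, height = 832, 480   # faster and safer on ~12 GB cards
    return (height, width) if portrait else (width, height)

print(pick_resolution(24))                 # (1280, 720)
print(pick_resolution(12, portrait=True))  # (480, 832)
```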
Also, check your attention mode. Make sure you are using Flash Attention and not Sage Attention. In my tests, Flash Attention is significantly faster on Windows. If you see error messages about Sage Attention in your console, switch to Flash or SDPA immediately.
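To see which backends are even available in the Python environment ComfyUI runs in, you can probe for the packages. "flash_attn" and "sageattention" are the usual pip package names, and SDPA ships with PyTorch itself, so it is always a safe fallback.

```python
# Probe which attention backends this Python environment can actually load.
# "flash_attn" and "sageattention" are the usual pip package names.
import importlib.util

import torch

for label, package in [("Flash Attention", "flash_attn"),
                       ("Sage Attention", "sageattention")]:
    found = importlib.util.find_spec(package) is not None
    print(f"{label}: {'installed' if found else 'not installed'}")

# SDPA is built into PyTorch, so it needs no extra install.
print("SDPA:", hasattr(torch.nn.functional, "scaled_dot_product_attention"))
```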


