In this guide, I am simply walking through my Ovi 1.1 10 sec ComfyUI workflow from start to finish.
You can follow the same steps inside your own ComfyUI Ovi workflow, even if your GPU has low VRAM. The main goal is simple. You start from nothing, you end with a clean ten second video with sound, and you understand why each setting is there.
I am using the Ovi AI model that comes from Character AI, and I am loading it in ComfyUI through Kijai's ComfyUI-WanVideoWrapper custom nodes, which host Ovi alongside the other Wan based models.
What this workflow actually does
Before we start clicking things, let me tell you what this thing gives you.
- It takes your text prompt and turns it into a ten second ComfyUI video with motion and audio.
- You can also feed one image and use it as the starting frame, so it becomes image to video.
- You write the spoken line inside simple tags, and Ovi will try to move the lips in sync with that speech.
So one Ovi AI model gives you both video and sound. That is why this workflow is interesting. You do not have to run separate tools for sound and picture.
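To make the speech tag idea concrete, here is a small hypothetical example. The scene text and the spoken line are my own placeholders, and the <S>...<E> and <AUDCAP>...<ENDAUDCAP> tags follow the format documented by the original Ovi release; check your node pack's notes if it expects a different syntax.

```python
# Hypothetical prompt; scene text and spoken line are placeholders.
speech = "Welcome back, today we build a ten second clip with sound."
prompt = (
    "A young woman stands in a bright studio, talking to the camera. "
    f"<S>{speech}<E> "  # Ovi reads this span as the spoken line
    "<AUDCAP>Clear female voice, quiet room tone.<ENDAUDCAP>"
)
print(prompt)
```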
Files You Need To Download
I know you want this part very clear, so I am writing it like a checklist.
If one of these files is missing or sitting in the wrong folder, your Ovi ComfyUI workflow will either show red nodes or fail during generation.
1. Main Ovi model file
If you use the Kijai build for ComfyUI, you will see an Ovi 11B fp8 safetensors file for ten second clips at 960 by 960. It lives on the Kijai WanVideo_comfy page on HuggingFace (the same repo linked in the MMAudio section below).
There are usually two main precision options. You will see something like:
- Ovi 11B bf16 safetensors
- Ovi 11B fp8 safetensors
The idea is very simple.
If your GPU has more VRAM, you can use the BF16 build.
If your VRAM is limited, you pick the FP8 build.
I keep these files in ComfyUI/models/diffusion_models.
2. Wan 2.2 VAE
Ovi uses the Wan 2.2 VAE for video decoding. You can take it from the Wan 2.2 ComfyUI repack on HuggingFace and put it in the VAE folder of ComfyUI.
I save it as:
- wan2.2_vae.safetensors inside ComfyUI/models/vae
This VAE is shared between Wan and Ovi, so you do not need to download a separate one just for Ovi.
3. Text encoder, UMT5
For prompts, Ovi reuses the Wan text encoder.
You have two main options: a UMT5 FP8 build and a UMT5 BF16 build.
Again, FP8 is friendly for 16 to 24 GB cards, and BF16 is more suitable if you have more than 32 GB of VRAM.
I keep one of them in:
- ComfyUI/models/text_encoders
You do not need both at the same time. Choose the one that fits your GPU.
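To make the VRAM rule of thumb concrete, here is a tiny Python sketch. The thresholds are the ones stated in this guide; the function itself is just illustration, not part of any workflow.

```python
def pick_precision(vram_gb: float) -> str:
    """VRAM rule of thumb from this guide:
    FP8 for 16 to 24 GB cards, BF16 once you are past 32 GB."""
    return "bf16" if vram_gb > 32 else "fp8"

print(pick_precision(24))  # a 24 GB card -> "fp8"
print(pick_precision(48))  # a 48 GB card -> "bf16"
```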
4. MMAudio VAE and coder
Ovi uses MMAudio behind the scenes for sound. The Ovi integration in the Wan wrapper is built to download the MMAudio weights automatically when the model loader runs for the first time.
If you want manual control instead, you can download the files yourself from https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Ovi and place the audio VAE and audio coder files inside the same models/vae folder.
5. Optional LoRA for faster, low step runs
For low VRAM workflows, a LoRA helps when you want to drop the steps down to ten or twenty.
Kijai hosts a Wan 2.2 Turbo LoRA that works nicely with TI2V setups.
I put this in:
- ComfyUI/models/lora
Later, we will use it when we test ten step runs.
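Before moving on, you can sanity-check the folder layout with a small Python sketch like this. The file names below are the ones used in this guide; swap in the BF16 variants or a different LoRA if that is what you downloaded.

```python
from pathlib import Path

comfy = Path("ComfyUI")  # adjust to your install location

# File names as used in this guide; adjust to your actual downloads.
required = [
    comfy / "models/diffusion_models/Wan2_2-5B-Ovi_960x960_10s_fp8_e4m3fn_scaled_KJ.safetensors",
    comfy / "models/vae/wan2.2_vae.safetensors",
]

for f in required:
    print("ok" if f.exists() else "MISSING", f)
```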
How I installed Wan wrapper for Ovi
I am using ComfyUI-WanVideoWrapper by Kijai. It gives ComfyUI wrapper nodes for WanVideo and related models, and Ovi is now plugged into that system.
Here is what I did.
- I updated ComfyUI from ComfyUI Manager, just to avoid any old bugs.
- I cloned or downloaded the ComfyUI-WanVideoWrapper repo into ComfyUI/custom_nodes/ (see the sketch after this list).
- I restarted ComfyUI so it can scan the new node pack.
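If you prefer doing the clone step outside the ComfyUI Manager UI, here is a minimal Python sketch of the same operation. The GitHub URL is Kijai's public repo; the destination path assumes a default ComfyUI folder layout.

```python
import subprocess
from pathlib import Path

# Default ComfyUI custom nodes folder; adjust if yours lives elsewhere.
dest = Path("ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper")

if not dest.exists():
    subprocess.run(
        ["git", "clone",
         "https://github.com/kijai/ComfyUI-WanVideoWrapper", str(dest)],
        check=True,
    )
# Restart ComfyUI afterwards so it scans the new node pack.
```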
After this, inside the node search, I can see Wan wrapper nodes like:
- Wan Video Model Loader
- Wan Video VAE Loader
- Wan text encoder node
- other helper nodes that wrap Wan based models
Ovi plugs into these nodes. So there is no separate Ovi engine loader at all.
The Wan wrapper is doing the heavy lifting for this Ovi AI model.
Model configuration with Wan wrapper
I will now tell you exactly what I did at the top of the graph.
Wan Video Model Loader
First I open the node named something like Wan Video Model Loader from Wan wrapper.
Inside this node, I do three things:
- In the model slot, I select the Ovi ten second model file Wan2_2-5B-Ovi_960x960_10s_fp8_e4m3fn_scaled_KJ.safetensors.
- I set the precision that matches my text encoder choice:
  - if I picked the BF16 UMT5, I keep the precision here in BF16
  - if I picked the FP8 encoder, I stay on FP8, or the closest option the wrapper gives
- I leave the rest of the options on default for the first test.
This node tells the Wan wrapper which Ovi model to use.
Because Ovi is built as a Wan 2.2 plus MMAudio hybrid, Wan wrapper can host it just like other Wan based models.
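As a compact recap, here are the loader settings from this run written out as a plain Python dict. This is only documentation of the choices above, not a real wrapper API.

```python
# Settings used for the first test run (FP8 path).
model_loader_settings = {
    "model": "Wan2_2-5B-Ovi_960x960_10s_fp8_e4m3fn_scaled_KJ.safetensors",
    "precision": "fp8",  # keep this matched to your UMT5 text encoder
    # everything else stays on the node's defaults for the first test
}
```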
Wan Video VAE Loader
Next I scroll down to the Wan Video VAE Loader node.
Here I point it to the Wan 2.2 VAE file I downloaded earlier.
Once that is set, the video branch has a decoder that matches the Ovi backbone.
Attention mode with Wan wrapper
The attention mode is managed inside the Wan wrapper settings.
There is a place where I can choose the attention type.
I set it to Flash Attention 2 on my system.
This makes generation faster and uses less VRAM compared to the standard attention method, as long as your GPU supports it.
If ComfyUI starts to crash or if your card does not support this, you can switch back to the default mode and run slower but safe.
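If you are unsure whether your environment even has Flash Attention installed, a quick probe like this, run in the Python environment ComfyUI uses, tells you whether the flash_attn package is importable; actual support still depends on your GPU architecture.

```python
# Probe: is the flash_attn package installed in this environment?
try:
    import flash_attn
    print("flash_attn", flash_attn.__version__, "is available")
except ImportError:
    print("flash_attn not installed; use the default attention mode instead")
```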
Audio VAE loader for Ovi
Then I look for the node that loads the audio VAE and the audio coder.
In this workflow, this is the Ovi audio VAE loader.
Here I select:
- the MMAudio VAE FP32 file
- the MMAudio VAE coder FP32 file
With this, the audio branch is ready.
Now the model can read your speech tags and generate actual sound.
Setting resolution, frames and audio length
Now we go to the middle of the graph.
This is where the ten second timing is decided.
Video latent node
Inside the empty latent node for video, or inside the Wan wrapper settings for the video branch, I see the width, height, and frame count.
For this Ovi 1.1 10 sec ComfyUI workflow, the model is trained mainly on three safe video sizes:
- 960 by 960
- 720 by 720
- 832 by 480
If you have a strong GPU, you can stay on 960 by 960 for the best detail.
If your VRAM is medium, then 720 by 720 is a good compromise.
If your card is tight and you want more room, you can go down to 832 by 480.
For a ten second clip at around 24 frames per second, I set the frame count near:
- 241 frames
So for my first test I used:
- width 960
- height 960
- frames 241
This gives you a smooth ten second sequence.
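The 241 is not random: it is the usual fps times seconds plus one starting frame, and it also satisfies the 4n + 1 frame count that Wan based models generally expect. A quick sanity check (the 4n + 1 detail is my reading of Wan's latent layout, not something this workflow states):

```python
FPS = 24
SECONDS = 10

frames = FPS * SECONDS + 1  # 241
# Wan based models typically want frame counts of the form 4n + 1.
assert frames == 241 and (frames - 1) % 4 == 0
print(frames)
```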
Audio latent node
Right below the video branch there is an empty latent node for audio, or a parameter for audio length.
For a ten second clip, the correct audio length value in this workflow is around:
- 314
So the pair I remember is:
- 241 frames for video
- 314 units for audio length
These two numbers work together so that your audio and your video stay in sync.
If you change them randomly, you may get a clip where the lips and the voice do not match, or the track cuts early.
So I leave these two values as they are and play with other settings instead.
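If you ever adapt the graph to a different clip length, here is a hypothetical helper that scales both values from the ten second pair above. Only the 241/314 pair is confirmed by this guide; the linear scaling (about 31.4 audio units per second) is my extrapolation, so test the output sync before trusting it.

```python
def ovi_lengths(seconds: float, fps: int = 24) -> tuple[int, int]:
    """Scale the 241-frame / 314-unit ten second pair from this guide.
    Linear scaling is an assumption; only the 10 s pair is confirmed."""
    frames = int(fps * seconds) + 1       # 24 * 10 + 1 = 241
    audio_units = round(31.4 * seconds)   # 314 units / 10 s = 31.4 per second
    return frames, audio_units

print(ovi_lengths(10))  # (241, 314)
```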
Workflow download: Ovi 10s 2 (aistudynow.com)

