WAN 2.2 FP8 Text Encoders
Optimized FP8 text encoder models for the WAN 2.2 video generation pipeline. These quantized encoders significantly reduce the memory footprint while maintaining high-quality text understanding for video generation tasks.
Model Description
This repository contains FP8-quantized text encoder models specifically optimized for the WAN 2.2 video generation system. The models enable text-to-video and image-to-video generation with substantially lower VRAM requirements compared to FP16 variants.
Key Features:
- FP8 Quantization: Reduces model size by ~50% compared to FP16 with minimal quality loss (a quick dtype check follows this list)
- Dual Encoder Support: Includes both T5-XXL and UMT5-XXL encoders for flexible text understanding
- Memory Efficient: Enables video generation on GPUs with 16GB+ VRAM
- Drop-in Replacement: Compatible with WAN 2.2 diffusers pipeline
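To confirm that a downloaded checkpoint really stores FP8 weights, you can inspect tensor dtypes straight from the safetensors file. A minimal sketch, assuming a local copy of this repository and a PyTorch build with float8 support (2.1+):

```python
from safetensors import safe_open

# Path is relative to a local download of this repository.
with safe_open("text_encoders/t5-xxl-fp8.safetensors", framework="pt") as f:
    name = next(iter(f.keys()))
    tensor = f.get_tensor(name)
    print(name, tensor.dtype)  # expect a float8 dtype, e.g. torch.float8_e4m3fn
```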
Capabilities:
- Text-to-video generation with natural language prompts
- Enhanced multilingual support (UMT5-XXL)
- High-quality semantic understanding for video synthesis
- Optimized for batch processing and long video generation
Repository Contents
```
wan22-fp8-encoders/
└── text_encoders/
    ├── t5-xxl-fp8.safetensors     # 4.6 GB - T5-XXL FP8 text encoder
    └── umt5-xxl-fp8.safetensors   # 6.3 GB - UMT5-XXL FP8 multilingual encoder
```
Total Repository Size: 11 GB
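To fetch the encoders, `huggingface_hub`'s `snapshot_download` works as usual. A sketch; the repo id below is a placeholder for wherever this repository is hosted:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the actual Hub repository name.
snapshot_download(
    repo_id="your-namespace/wan22-fp8-encoders",
    local_dir="wan22-fp8-encoders",
)
```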
Model Files
| File | Size | Description | Use Case |
|---|---|---|---|
| `t5-xxl-fp8.safetensors` | 4.6 GB | T5-XXL FP8 encoder | English text understanding |
| `umt5-xxl-fp8.safetensors` | 6.3 GB | UMT5-XXL FP8 encoder | Multilingual text support |
Hardware Requirements
Minimum Requirements
- VRAM: 16 GB (with FP8 encoders + base model)
- System RAM: 32 GB recommended
- Disk Space: 11 GB for encoders + additional space for base models
- GPU: NVIDIA RTX 3090, RTX 4090, or better
Recommended Requirements
- VRAM: 24 GB+ (for higher resolution and longer videos)
- System RAM: 64 GB
- Disk Space: 50 GB+ (including all WAN 2.2 components)
- GPU: NVIDIA RTX 4090, A6000, or better
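A quick way to check which tier your GPU falls into, using standard PyTorch APIs (assumes a CUDA device is present):

```python
import torch

# Report the detected GPU and its total VRAM in GB.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.0f} GB VRAM")
```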
Performance Notes
- FP8 encoders reduce VRAM usage by ~4-6 GB compared to FP16 (a back-of-envelope check follows this list)
- UMT5-XXL provides better multilingual support but uses more VRAM
- T5-XXL is recommended for English-only workflows
- Larger batch sizes and longer videos increase VRAM requirements roughly in proportion, so budget extra headroom
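The savings figure follows directly from bytes per parameter: going from FP16 (2 bytes) to FP8 (1 byte) saves roughly one byte per weight. A back-of-envelope check, using approximate encoder-only parameter counts inferred from the checkpoint sizes:

```python
# Rough weight-memory estimate: params * bytes_per_param.
# Parameter counts are approximations based on the checkpoint file sizes.
for name, params in [("t5-xxl encoder", 4.7e9), ("umt5-xxl encoder", 6.3e9)]:
    fp16_gb = params * 2 / 1e9
    fp8_gb = params * 1 / 1e9
    print(f"{name}: FP16 ~{fp16_gb:.1f} GB, FP8 ~{fp8_gb:.1f} GB, "
          f"saved ~{fp16_gb - fp8_gb:.1f} GB")
```

The resulting ~4.7 and ~6.3 GB deltas are consistent with the "~4-6 GB" savings quoted above.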
Usage Examples
Basic Usage with Diffusers
```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video
from safetensors.torch import load_file

# Load the WAN 2.2 base pipeline (replace the path with your local checkout)
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Load the FP8 encoder weights directly into the pipeline's text encoder;
# this assumes the checkpoint keys match the text encoder architecture
# and a PyTorch build with float8 support (2.1+), which upcasts on copy.
state_dict = load_file(
    "E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors"
)
pipe.text_encoder.load_state_dict(state_dict)

pipe.to("cuda")

# Generate video from a text prompt
prompt = "A serene mountain landscape at sunset with clouds moving gently"
video = pipe(
    prompt=prompt,
    num_frames=48,
    height=512,
    width=512,
    num_inference_steps=30,
    guidance_scale=7.5,
).frames[0]

# Save the video
export_to_video(video, "output.mp4", fps=24)
```
Using UMT5-XXL for Multilingual Support
```python
import torch
from diffusers import WanPipeline
from safetensors.torch import load_file

# Load the base pipeline with the multilingual FP8 encoder weights
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    torch_dtype=torch.float16,
    variant="fp16",
)
state_dict = load_file(
    "E:/huggingface/wan22-fp8-encoders/text_encoders/umt5-xxl-fp8.safetensors"
)
pipe.text_encoder.load_state_dict(state_dict)
pipe.to("cuda")

# Generate with a non-English prompt
# ("Beautiful cherry trees in full spring bloom, petals drifting on the wind")
prompt = "美丽的樱花树在春天盛开，花瓣随风飘落"
video = pipe(prompt=prompt, num_frames=48).frames[0]
```
Memory-Efficient Configuration
```python
import torch
from diffusers import WanPipeline
from safetensors.torch import load_file

pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    torch_dtype=torch.float16,
    variant="fp16",
)
state_dict = load_file(
    "E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors"
)
pipe.text_encoder.load_state_dict(state_dict)
pipe.to("cuda")

# Enable memory-efficient attention and VAE slicing
# (availability of these switches depends on the pipeline and diffusers version)
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Generate with lower memory usage
video = pipe(
    prompt="Your prompt here",
    num_frames=24,  # reduced frame count
    height=512,
    width=512,
).frames[0]
```
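If the pipeline still does not fit, model CPU offload (a standard diffusers feature that requires `accelerate`) trades generation speed for a much smaller GPU footprint. A sketch, reusing the `pipe` object constructed above:

```python
# Instead of pipe.to("cuda"): keep only the active sub-model on the GPU,
# moving the others to system RAM between steps.
pipe.enable_model_cpu_offload()
video = pipe(prompt="Your prompt here", num_frames=24).frames[0]
```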
Model Specifications
T5-XXL FP8 Encoder
- Architecture: T5 (Text-to-Text Transfer Transformer)
- Size: XXL variant (~4.7B-parameter encoder stack; the full T5-XXL model is 11 billion parameters)
- Precision: FP8 (8-bit floating point)
- Format: SafeTensors
- Language: English-optimized
- Context Length: 512 tokens
- Embedding Dimension: 4096
UMT5-XXL FP8 Encoder
- Architecture: UMT5 (Unified Multilingual T5)
- Size: XXL variant (~6.3B-parameter encoder stack, inferred from the checkpoint size)
- Precision: FP8 (8-bit floating point)
- Format: SafeTensors
- Languages: 100+ languages supported
- Context Length: 512 tokens
- Embedding Dimension: 4096
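Both encoders truncate prompts at 512 tokens, so very long prompts are silently cut off. One way to check token counts ahead of time is with the upstream tokenizers (assumed here to be `google/t5-v1_1-xxl` and `google/umt5-xxl`; this repository ships weights only):

```python
from transformers import AutoTokenizer

prompt = "A serene mountain landscape at sunset with clouds moving gently"
# Assumed upstream tokenizer repos; requires the sentencepiece package.
for repo in ("google/t5-v1_1-xxl", "google/umt5-xxl"):
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok(prompt).input_ids)
    print(f"{repo}: {n_tokens} tokens (limit: 512)")
```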
Performance Tips and Optimization
Memory Optimization
- Use T5-XXL for English: Save 1.7 GB VRAM with T5-XXL vs UMT5-XXL
- Enable Attention Slicing: Reduces peak memory usage by 20-30% (measure the effect with the sketch after this list)
- Enable VAE Slicing: Further reduces memory for longer videos
- Reduce Frame Count: Start with 24-48 frames for testing
- Lower Resolution: Use 512x512 instead of 1024x1024 for testing
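To see how much a given combination of these switches actually helps, PyTorch's peak-memory counters give a direct measurement. A sketch, reusing `pipe` from the usage examples above:

```python
import torch

# Reset counters, run one generation, then report the high-water mark.
torch.cuda.reset_peak_memory_stats()
video = pipe(prompt="Your prompt here", num_frames=24, height=512, width=512).frames[0]
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```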
Quality Optimization
- Increase Inference Steps: 30-50 steps for higher quality (default: 30)
- Adjust Guidance Scale: 7.0-9.0 range for better prompt adherence
- Use UMT5 for Complex Prompts: Better semantic understanding
- Longer Prompts: Detailed descriptions produce better results
- Seed Control: Use fixed seeds for reproducible results, as shown in the sketch below
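Seed control in diffusers goes through the standard `generator` argument; a sketch combining it with the step and guidance ranges above, reusing `pipe` from the usage examples:

```python
import torch

# Fixed seed for reproducible output; step count and guidance per the tips above.
generator = torch.Generator(device="cuda").manual_seed(42)
video = pipe(
    prompt="A detailed prompt describing subject, motion, lighting, and style",
    num_inference_steps=40,
    guidance_scale=8.0,
    generator=generator,
).frames[0]
```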
Performance Benchmarks
| Configuration | VRAM Usage | Generation Time (48 frames) |
|---|---|---|
| T5-XXL FP8 + Base Model | ~16 GB | ~120 seconds (RTX 4090) |
| UMT5-XXL FP8 + Base Model | ~18 GB | ~130 seconds (RTX 4090) |
| With Attention Slicing | -20% | +10% time |
License
These FP8-quantized text encoders are derived from the original T5 and UMT5 models:
- T5-XXL: Apache 2.0 License
- UMT5-XXL: Apache 2.0 License
- Quantization: Community contribution under Apache 2.0
License Terms: These models may be used for research and commercial purposes. Attribution to the original T5/UMT5 authors and the WAN project is appreciated but not required under Apache 2.0 terms.
Disclaimer: These are quantized versions optimized for memory efficiency. For critical applications, validate output quality against FP16 versions.
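One lightweight way to do that validation is to compare the two encoders' embeddings for the same prompts. A minimal helper, assuming you have obtained the `(seq_len, dim)` embedding tensors from the FP8 and FP16 encoders:

```python
import torch
import torch.nn.functional as F

def embedding_similarity(e_fp8: torch.Tensor, e_fp16: torch.Tensor) -> float:
    """Mean per-token cosine similarity between two (seq_len, dim) embeddings."""
    return F.cosine_similarity(e_fp8.float(), e_fp16.float(), dim=-1).mean().item()
```

Values very close to 1.0 indicate the quantized encoder tracks the FP16 reference closely on that prompt.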
Citation
If you use these FP8 encoders in your research or applications, please cite:
```bibtex
@misc{wan22-fp8-encoders,
  title={WAN 2.2 FP8 Text Encoders},
  author={WAN Community Contributors},
  year={2024},
  publisher={Hugging Face},
  note={FP8-quantized T5-XXL and UMT5-XXL encoders for video generation}
}

@article{t5,
  title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal={Journal of Machine Learning Research},
  volume={21},
  number={140},
  pages={1--67},
  year={2020}
}

@article{umt5,
  title={UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining},
  author={Chung, Hyung Won and Garrette, Dan and Tan, Kiat Chuan and Riesa, Jason},
  journal={arXiv preprint arXiv:2304.09151},
  year={2023}
}
```
Contact and Resources
Official Resources
- WAN Project: Official Repository
- Hugging Face Hub: WAN Models
- Documentation: WAN Docs
Community
- Discord: WAN Community Discord
- GitHub Issues: Report Issues
- Discussions: Hugging Face Discussions
Related Models
- WAN 2.2 Base Model: Full video generation pipeline
- WAN 2.2 VAE: Video autoencoder
- WAN Enhancement LoRAs: Camera control, lighting, quality improvements
Support
For questions, issues, or feature requests:
- Check the official documentation
- Search existing issues
- Join the community Discord
- Open a new issue with detailed information
Note: These FP8 encoders are part of the WAN 2.2 ecosystem. Ensure you have the complete WAN 2.2 pipeline installed for full functionality. Visit the official repository for installation instructions and additional components.