# RAPO++ Gradio App Documentation
## Overview
This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.
## What It Does
The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
## How It Works
### Architecture
1. **Knowledge Graph Construction**
- Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
- Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
- Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
2. **Retrieval Process**
- Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
- Finds top-K most similar places via cosine similarity
- Samples connected actions and atmosphere descriptors from graph neighbors
- Filters modifiers by relevance to the input prompt
3. **Prompt Augmentation**
- Combines original prompt with retrieved modifiers
- Structures the output to maintain coherence
- Returns optimized prompt suitable for T2V generation
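The three steps above can be sketched in a few lines of plain Python. The dict-based graph, the `overlap` similarity stand-in, and every place/modifier entry here are illustrative only; the actual app stores the graph in networkx and ranks places by cosine similarity of all-MiniLM-L6-v2 embeddings:

```python
import random

# Toy knowledge graph: place -> (actions, atmosphere descriptors).
# The real app uses a networkx graph; a dict keeps the sketch minimal.
GRAPH = {
    "forest": {
        "actions": ["walking through", "exploring", "hiking in"],
        "atmosphere": ["dense trees", "dappled sunlight", "peaceful atmosphere"],
    },
    "city street": {
        "actions": ["strolling down", "crossing", "driving along"],
        "atmosphere": ["neon lights", "busy crowds", "wet pavement"],
    },
}

def retrieve_and_augment(prompt, similarity_fn, top_k=2, per_place=2, seed=0):
    """Rank places by similarity to the prompt, sample modifiers, augment."""
    rng = random.Random(seed)
    ranked = sorted(GRAPH, key=lambda p: similarity_fn(prompt, p), reverse=True)
    modifiers = []
    for place in ranked[:top_k]:
        pool = GRAPH[place]["actions"] + GRAPH[place]["atmosphere"]
        modifiers.extend(rng.sample(pool, min(per_place, len(pool))))
    # Deduplicate while preserving retrieval order.
    seen, unique = set(), []
    for m in modifiers:
        if m not in seen:
            seen.add(m)
            unique.append(m)
    return f"{prompt}, {', '.join(unique)}"

# Word-overlap similarity stands in for embedding cosine similarity.
def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

print(retrieve_and_augment("A person walking through a forest", overlap))
```

The augmented prompt keeps the original text first and appends the retrieved modifiers, mirroring the coherence-preserving structure described above.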
### Key Components
**app.py** (main application):
- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function decorated with `@spaces.GPU`
- Gradio interface with examples and detailed documentation
**requirements.txt**:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads
## Model Downloads
The app automatically downloads the required model on first run:
- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
## Usage
### Basic Usage
1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details
### Advanced Settings
- **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)
### Example Prompts
Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"
## Technical Details
### Graph Structure
**Places (central nodes):**
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake
**Edge Types:**
- Place → Verb/Action edges (e.g., "forest" → "walking through")
- Place → Atmosphere edges (e.g., "forest" → "dense trees")
**Retrieval Algorithm:**
1. Encode input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
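The retrieval steps above reduce to a top-K ranking by cosine similarity. A minimal sketch follows, with toy 4-dimensional vectors standing in for the model's 384-dimensional embeddings (the vectors and scores are made up for illustration):

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between a query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

places = ["forest", "beach", "city street"]

# Toy 4-d "embeddings"; the real app encodes place names with all-MiniLM-L6-v2.
place_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],   # forest
    [0.1, 0.8, 0.2, 0.0],   # beach
    [0.0, 0.1, 0.9, 0.2],   # city street
])

prompt_emb = np.array([0.8, 0.2, 0.1, 0.1])  # stand-in for model.encode(prompt)

scores = cosine_similarity(prompt_emb, place_embeddings)
top_k = 2
top_places = [places[i] for i in np.argsort(scores)[::-1][:top_k]]
print(top_places)  # highest-similarity places first
```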
### ZeroGPU Integration
The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing
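The decorator pattern can be sketched as below. The try/except fallback exists only so the snippet also runs outside a ZeroGPU Space, and the `embed` stub is hypothetical (the real function encodes text with SentenceTransformer on the allocated GPU):

```python
# `spaces` must be imported before torch so ZeroGPU can patch CUDA init.
try:
    import spaces
    gpu = spaces.GPU
except ImportError:
    # Fallback so this sketch also runs outside a ZeroGPU Space.
    gpu = lambda fn: fn

@gpu
def embed(texts):
    # Stub: the real function encodes `texts` with all-MiniLM-L6-v2 on the
    # GPU that ZeroGPU allocates for the duration of this call.
    return [[float(len(t))] for t in texts]

print(embed(["A person walking"]))
```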
### Differences from Full RAPO
This demo implements a **simplified version** of Stage 1 RAPO:
**Included:**
- ✅ Knowledge graph with place-verb-scene relations
- ✅ Embedding-based retrieval via SentenceTransformer
- ✅ Cosine similarity ranking
- ✅ Basic prompt augmentation
**Not Included (requires additional models/data):**
- ❌ Full relation graph from the paper (requires ~GB of graph data)
- ❌ LLM-based sentence refactoring (Mistral-7B)
- ❌ Iterative merging with similarity thresholds
- ❌ Instruction-based rewriting (Llama3.1)
**Why This Approach:**
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads
## Running the Full RAPO Pipeline
To run the complete Stage 1 RAPO from the paper:
```bash
cd examples/Stage1_RAPO
# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh
# 2. Word augmentation
sh word_augment.sh
# 3. Sentence refactoring
sh refactoring.sh
# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```
**Requirements:**
- Download full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`
See README.md for full installation instructions.
## Integration with RAPO++ Stages
This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:
**Stage 1 (RAPO)** - *Demonstrated Here*
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data
**Stage 2 (SSPO)**
- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring
**Stage 3 (Fine-tuning)**
- LLM fine-tuning on collected feedback from Stage 2
- Model-specific prompt refiners
## Performance Notes
- First run: ~1-2 minutes (downloads model)
- Subsequent runs: <1 second per prompt
- GPU allocation: Automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)
## Troubleshooting
**"No module named 'sentencepiece'"**
- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization
**"CUDA has been initialized before importing spaces"**
- The app correctly imports `spaces` FIRST before torch
- If you modify the code, maintain this import order
**Model download fails**
- Check internet connection
- HuggingFace Hub may be temporarily unavailable
- Model will retry on next run (cached after successful download)
## References
**Papers:**
- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization
**Project Pages:**
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/
**Code:**
- GitHub: https://github.com/Vchitect/RAPO
## License
Please refer to the original repository for licensing information.
---
**Created for HuggingFace Spaces deployment**