# RAPO++ Gradio App Documentation

## Overview

This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

## What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.

## How It Works

### Architecture

1. **Knowledge Graph Construction**
   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")

2. **Retrieval Process**
   - Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
   - Finds top-K most similar places via cosine similarity
   - Samples connected actions and atmosphere descriptors from graph neighbors
   - Filters modifiers by relevance to the input prompt

3. **Prompt Augmentation**
   - Combines original prompt with retrieved modifiers
   - Structures the output to maintain coherence
   - Returns optimized prompt suitable for T2V generation
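
The graph in step 1 can be sketched with networkx. This is a minimal, illustrative stand-in for `create_demo_graph()`; the node names and edge attribute keys are assumptions of this sketch, not the app's actual schema:

```python
import networkx as nx

def create_demo_graph():
    """Toy version of the place-centered knowledge graph described above."""
    G = nx.Graph()
    # Places are the central nodes.
    for place in ["forest", "beach", "city street"]:
        G.add_node(place, kind="place")
    # Place -> action/verb edges.
    G.add_edge("forest", "walking through", relation="verb")
    G.add_edge("forest", "exploring", relation="verb")
    G.add_edge("beach", "strolling along", relation="verb")
    # Place -> atmospheric-descriptor edges.
    G.add_edge("forest", "dense trees", relation="atmosphere")
    G.add_edge("beach", "peaceful atmosphere", relation="atmosphere")
    return G

G = create_demo_graph()
print(sorted(G.neighbors("forest")))
# ['dense trees', 'exploring', 'walking through']
```

Because modifiers hang off place nodes as neighbors, retrieval reduces to finding the right places and then walking one hop in the graph.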

### Key Components

**app.py** (main application):
- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function decorated with @spaces.GPU
- Gradio interface with examples and detailed documentation
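
The augmentation step inside `retrieve_and_augment_prompt()` can be sketched as follows; the template and argument names are illustrative, and the app's actual output formatting may differ:

```python
def augment_prompt(prompt, place, actions, atmosphere):
    # Combine the original prompt with retrieved modifiers while
    # keeping the sentence readable (illustrative template).
    parts = [prompt.rstrip(".")]
    if actions:
        parts.append(f"{actions[0]} a {place}")
    parts.extend(atmosphere)
    return ", ".join(parts) + "."

print(augment_prompt("A person walking", "forest",
                     ["walking through"], ["dense trees", "peaceful atmosphere"]))
# A person walking, walking through a forest, dense trees, peaceful atmosphere.
```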

**requirements.txt**:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads

## Model Downloads

The app automatically downloads the required model on first run:
- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)

Downloaded to: `./ckpt/all-MiniLM-L6-v2/`

## Usage

### Basic Usage

1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details

### Advanced Settings

- **Number of Places to Retrieve**: How many of the most similar places to retrieve from the graph (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)

### Example Prompts

Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"

## Technical Details

### Graph Structure

**Places (central nodes):**
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

**Edge Types:**
- Place β†’ Verb/Action edges (e.g., "forest" β†’ "walking through")
- Place β†’ Atmosphere edges (e.g., "forest" β†’ "dense trees")

**Retrieval Algorithm:**
1. Encode input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
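
Steps 1-3 above can be sketched as a runnable snippet. To keep it self-contained, a toy bag-of-words embedding over a tiny fixed vocabulary stands in for `SentenceTransformer("all-MiniLM-L6-v2").encode()`; the vocabulary and scoring here are assumptions of this sketch only:

```python
import numpy as np

VOCAB = ["walk", "drive", "cook", "forest", "beach", "street", "kitchen", "night"]

def embed(text):
    # Toy stand-in for SentenceTransformer.encode(): one dimension per
    # vocabulary word, set to 1.0 if the word appears in the text.
    t = text.lower()
    return np.array([1.0 if w in t else 0.0 for w in VOCAB])

def top_k_places(prompt, places, k=2):
    p = embed(prompt)
    sims = []
    for place in places:
        e = embed(place)
        denom = np.linalg.norm(p) * np.linalg.norm(e)
        sims.append(p @ e / denom if denom else 0.0)  # cosine similarity
    ranked = sorted(zip(places, sims), key=lambda x: -x[1])
    return [place for place, _ in ranked[:k]]

places = ["forest", "beach", "city street", "kitchen"]
print(top_k_places("A person walking on the beach", places, k=1))
# ['beach']
```

With the real sentence embeddings, the same ranking also works for prompts that share no surface words with the place names, which is the point of using a learned encoder.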

### ZeroGPU Integration

The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing

### Differences from Full RAPO

This demo implements a **simplified version** of Stage 1 RAPO:

**Included:**
βœ… Knowledge graph with place-verb-scene relations
βœ… Embedding-based retrieval via SentenceTransformer
βœ… Cosine similarity ranking
βœ… Basic prompt augmentation

**Not Included (requires additional models/data):**
❌ Full relation graph from the paper (requires gigabyte-scale preprocessed graph data)
❌ LLM-based sentence refactoring (Mistral-7B)
❌ Iterative merging with similarity thresholds
❌ Instruction-based rewriting (Llama3.1)

**Why This Approach:**
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads

## Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

```bash
cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```

**Requirements:**
- Download full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`

See README.md for full installation instructions.

## Integration with RAPO++ Stages

This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:

**Stage 1 (RAPO)** - *Demonstrated Here*
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data

**Stage 2 (SSPO)**
- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring

**Stage 3 (Fine-tuning)**
- LLM fine-tuning on collected feedback from Stage 2
- Model-specific prompt refiners

## Performance Notes

- First run: ~1-2 minutes (downloads model)
- Subsequent runs: <1 second per prompt
- GPU allocation: Automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)

## Troubleshooting

**"No module named 'sentencepiece'"**
- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization

**"CUDA has been initialized before importing spaces"**
- The app imports `spaces` before `torch`, as ZeroGPU requires
- If you modify the code, preserve this import order

**Model download fails**
- Check internet connection
- HuggingFace Hub may be temporarily unavailable
- Model will retry on next run (cached after successful download)

## References

**Papers:**
- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization

**Project Pages:**
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/

**Code:**
- GitHub: https://github.com/Vchitect/RAPO

## License

Please refer to the original repository for licensing information.

---

**Created for HuggingFace Spaces deployment**