# RAPO++ Gradio App Documentation

## Overview

This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

## What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.

## How It Works

### Architecture

1. **Knowledge Graph Construction**
   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")

2. **Retrieval Process**
   - Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
   - Finds top-K most similar places via cosine similarity
   - Samples connected actions and atmosphere descriptors from graph neighbors
   - Filters modifiers by relevance to the input prompt

3. **Prompt Augmentation**
   - Combines original prompt with retrieved modifiers
   - Structures the output to maintain coherence
   - Returns optimized prompt suitable for T2V generation
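
The graph in step 1 can be sketched with networkx. This is a minimal, illustrative stand-in for `create_demo_graph()`; the node names and edge attribute keys are assumptions of this sketch, not the app's actual schema:

```python
import networkx as nx

def create_demo_graph():
    """Toy version of the place-centered knowledge graph described above."""
    G = nx.Graph()
    # Places are the central nodes.
    for place in ["forest", "beach", "city street"]:
        G.add_node(place, kind="place")
    # Place -> action/verb edges.
    G.add_edge("forest", "walking through", relation="verb")
    G.add_edge("forest", "exploring", relation="verb")
    G.add_edge("beach", "strolling along", relation="verb")
    # Place -> atmospheric-descriptor edges.
    G.add_edge("forest", "dense trees", relation="atmosphere")
    G.add_edge("beach", "peaceful atmosphere", relation="atmosphere")
    return G

G = create_demo_graph()
print(sorted(G.neighbors("forest")))
# ['dense trees', 'exploring', 'walking through']
```

Because modifiers hang off place nodes as neighbors, retrieval reduces to finding the right places and then walking one hop in the graph.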

### Key Components

**app.py** (main application):
- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function decorated with @spaces.GPU
- Gradio interface with examples and detailed documentation
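
The augmentation step inside `retrieve_and_augment_prompt()` can be sketched as follows; the template and argument names are illustrative, and the app's actual output formatting may differ:

```python
def augment_prompt(prompt, place, actions, atmosphere):
    # Combine the original prompt with retrieved modifiers while
    # keeping the sentence readable (illustrative template).
    parts = [prompt.rstrip(".")]
    if actions:
        parts.append(f"{actions[0]} a {place}")
    parts.extend(atmosphere)
    return ", ".join(parts) + "."

print(augment_prompt("A person walking", "forest",
                     ["walking through"], ["dense trees", "peaceful atmosphere"]))
# A person walking, walking through a forest, dense trees, peaceful atmosphere.
```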

**requirements.txt**:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads

## Model Downloads

The app automatically downloads the required model on first run:
- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)

Downloaded to: `./ckpt/all-MiniLM-L6-v2/`

## Usage

### Basic Usage

1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details

### Advanced Settings

- **Number of Places to Retrieve**: How many of the most similar places to retrieve from the graph (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)

### Example Prompts

Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"

## Technical Details

### Graph Structure

**Places (central nodes):**
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

**Edge Types:**
- Place β†’ Verb/Action edges (e.g., "forest" β†’ "walking through")
- Place β†’ Atmosphere edges (e.g., "forest" β†’ "dense trees")

**Retrieval Algorithm:**
1. Encode input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
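
Steps 1-3 above can be sketched as a runnable snippet. To keep it self-contained, a toy bag-of-words embedding over a tiny fixed vocabulary stands in for `SentenceTransformer("all-MiniLM-L6-v2").encode()`; the vocabulary and scoring here are assumptions of this sketch only:

```python
import numpy as np

VOCAB = ["walk", "drive", "cook", "forest", "beach", "street", "kitchen", "night"]

def embed(text):
    # Toy stand-in for SentenceTransformer.encode(): one dimension per
    # vocabulary word, set to 1.0 if the word appears in the text.
    t = text.lower()
    return np.array([1.0 if w in t else 0.0 for w in VOCAB])

def top_k_places(prompt, places, k=2):
    p = embed(prompt)
    sims = []
    for place in places:
        e = embed(place)
        denom = np.linalg.norm(p) * np.linalg.norm(e)
        sims.append(p @ e / denom if denom else 0.0)  # cosine similarity
    ranked = sorted(zip(places, sims), key=lambda x: -x[1])
    return [place for place, _ in ranked[:k]]

places = ["forest", "beach", "city street", "kitchen"]
print(top_k_places("A person walking on the beach", places, k=1))
# ['beach']
```

With the real sentence embeddings, the same ranking also works for prompts that share no surface words with the place names, which is the point of using a learned encoder.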

### ZeroGPU Integration

The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing

### Differences from Full RAPO

This demo implements a **simplified version** of Stage 1 RAPO:

**Included:**
βœ… Knowledge graph with place-verb-scene relations
βœ… Embedding-based retrieval via SentenceTransformer
βœ… Cosine similarity ranking
βœ… Basic prompt augmentation

**Not Included (requires additional models/data):**
❌ Full relation graph from the paper (requires gigabyte-scale preprocessed graph data)
❌ LLM-based sentence refactoring (Mistral-7B)
❌ Iterative merging with similarity thresholds
❌ Instruction-based rewriting (Llama3.1)

**Why This Approach:**
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads

## Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

```bash
cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```

**Requirements:**
- Download full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`

See README.md for full installation instructions.

## Integration with RAPO++ Stages

This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:

**Stage 1 (RAPO)** - *Demonstrated Here*
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data

**Stage 2 (SSPO)**
- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring

**Stage 3 (Fine-tuning)**
- LLM fine-tuning on collected feedback from Stage 2
- Model-specific prompt refiners

## Performance Notes

- First run: ~1-2 minutes (downloads model)
- Subsequent runs: <1 second per prompt
- GPU allocation: Automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)

## Troubleshooting

**"No module named 'sentencepiece'"**
- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization

**"CUDA has been initialized before importing spaces"**
- The app imports `spaces` before `torch`, as ZeroGPU requires
- If you modify the code, preserve this import order

**Model download fails**
- Check internet connection
- HuggingFace Hub may be temporarily unavailable
- Model will retry on next run (cached after successful download)

## References

**Papers:**
- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization

**Project Pages:**
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/

**Code:**
- GitHub: https://github.com/Vchitect/RAPO

## License

Please refer to the original repository for licensing information.

---

**Created for HuggingFace Spaces deployment**