# Paper2Video

<p align="right">
  <b>English</b> | <a href="./README-CN.md">简体中文</a>
</p>


<p align="center">
  <b>Paper2Video: Automatic Video Generation from Scientific Papers</b>
</p>

<p align="center">
  <a href="https://zeyu-zhu.github.io/webpage/">Zeyu Zhu*</a>,
  <a href="https://qhlin.me/">Kevin Qinghong Lin*</a>,
  <a href="https://scholar.google.com/citations?user=h1-3lSoAAAAJ&hl=en">Mike Zheng Shou</a> <br>
  Show Lab, National University of Singapore
</p>


<p align="center">
  <a href="https://arxiv.org/abs/2510.05096">📄 Paper</a> &nbsp; | &nbsp;
  <a href="https://huggingface.co/papers/2510.05096">🤗 Daily Paper</a> &nbsp; | &nbsp;
  <a href="https://huggingface.co/datasets/ZaynZhu/Paper2Video">📊 Dataset</a> &nbsp; | &nbsp;
  <a href="https://showlab.github.io/Paper2Video/">🌐 Project Website</a> &nbsp; | &nbsp;
  <a href="https://x.com/KevinQHLin/status/1976105129146257542">💬 X (Twitter)</a>
</p>

- **Input:** a paper ➕ an image ➕ an audio clip
  
| Paper | Image | Audio |
|--------|--------|--------|
| <img src="https://github.com/showlab/Paper2Video/blob/page/assets/hinton/paper.png" width="180"/><br>[🔗 Paper link](https://arxiv.org/pdf/1509.01626) | <img src="https://github.com/showlab/Paper2Video/blob/page/assets/hinton/hinton_head.jpeg" width="180"/><br>Hinton's photo | <img src="assets/sound.png" width="180"/><br>[🔗 Audio sample](https://github.com/showlab/Paper2Video/blob/page/assets/hinton/ref_audio_10.wav) |


- **Output:** a presentation video



https://github.com/user-attachments/assets/39221a9a-48cb-4e20-9d1c-080a5d8379c4




Check out more examples on the [🌐 project page](https://showlab.github.io/Paper2Video/).

## 🔥 Update
- [x] [2025.10.11] Our work received attention on [YC Hacker News](https://news.ycombinator.com/item?id=45553701).
- [x] [2025.10.9] Thanks to AK for sharing our work on [Twitter](https://x.com/_akhaliq/status/1976099830004072849)!
- [x] [2025.10.9] Our work is covered by [Medium](https://medium.com/@dataism/how-ai-learned-to-make-scientific-videos-from-slides-to-a-talking-head-0d807e491b27).
- [x] [2025.10.8] Check out our demo video below!
- [x] [2025.10.7] We release the [arXiv paper](https://arxiv.org/abs/2510.05096).
- [x] [2025.10.6] We release the [code](https://github.com/showlab/Paper2Video) and [dataset](https://huggingface.co/datasets/ZaynZhu/Paper2Video).
- [x] [2025.9.28] Paper2Video has been accepted to the **Scaling Environments for Agents Workshop ([SEA](https://sea-workshop.github.io/)) at NeurIPS 2025**.


https://github.com/user-attachments/assets/a655e3c7-9d76-4c48-b946-1068fdb6cdd9




---

### Table of Contents
- [🌟 Overview](#-overview)
- [🚀 Quick Start: PaperTalker](#-try-papertalker-for-your-paper)
  - [1. Requirements](#1-requirements)
  - [2. Configure LLMs](#2-configure-llms)
  - [3. Inference](#3-inference)
- [📊 Evaluation: Paper2Video](#-evaluation-paper2video)
- [😼 Fun: Paper2Video for Paper2Video](#-fun-paper2video-for-paper2video)
- [🙏 Acknowledgements](#-acknowledgements)
- [📌 Citation](#-citation)

---

## 🌟 Overview
<p align="center">
  <img src="assets/teaser.png" alt="Overview" width="100%">
</p>

This work solves two core problems for academic presentations:

- **Left: How to create a presentation video from a paper?**  
  *PaperTalker* is an agent that integrates **slide generation**, **subtitling**, **cursor grounding**, **speech synthesis**, and **talking-head video rendering**.

- **Right: How to evaluate a presentation video?**  
  *Paper2Video* is a benchmark with well-designed metrics for evaluating presentation quality.


---

## 🚀 Try PaperTalker for your Paper!
<p align="center">
  <img src="assets/method.png" alt="Approach" width="100%">
</p>

### 1. Requirements
Prepare the environment:
```bash
cd src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
conda install -c conda-forge tectonic
```
Download the dependent code and follow the instructions in **[Hallo2](https://github.com/fudan-generative-vision/hallo2)** to download the model weights.
```bash
git clone https://github.com/fudan-generative-vision/hallo2.git
```
To avoid potential package conflicts, you need to **prepare a separate environment for talking-head generation**; please refer to <a href="https://github.com/fudan-generative-vision/hallo2">Hallo2</a> for details. After installing, run `which python` to get the Python environment path.
```bash
cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install -r requirements.txt
```
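
As a small convenience (the shell variable name below is just an illustration), you can record the interpreter path of the `hallo` environment right after installation and reuse it later as the value of `--talking_head_env`:
```bash
# While the hallo environment is active, capture its Python interpreter path;
# this is the value later passed to pipeline.py via --talking_head_env.
conda activate hallo
HALLO2_PYTHON=$(which python)
echo "$HALLO2_PYTHON"   # e.g., ~/miniconda3/envs/hallo/bin/python
```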

### 2. Configure LLMs
Export your **API credentials**:
```bash
export GEMINI_API_KEY="your_gemini_key_here"
export OPENAI_API_KEY="your_openai_key_here"
```
The best practice is to use **GPT-4.1** or **Gemini 2.5 Pro** for both the LLM and the VLM. We also support locally deployed open-source models (e.g., Qwen); for details, please refer to <a href="https://github.com/Paper2Poster/Paper2Poster.git">Paper2Poster</a>.
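
Before launching a long run, a quick shell check (purely illustrative) can confirm that the keys are actually visible in the current environment:
```bash
# Sanity check: warn if either API key is missing from the current shell.
for key in GEMINI_API_KEY OPENAI_API_KEY; do
  if [ -z "${!key}" ]; then
    echo "WARNING: $key is not set"
  else
    echo "$key is set"
  fi
done
```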

### 3. Inference
The script `pipeline.py` provides an automated pipeline for generating academic presentation videos. It takes **LaTeX paper sources** together with a **reference image and audio** as input and runs multiple sub-modules (Slides → Subtitles → Speech → Cursor → Talking Head) to produce a complete presentation video. ⚡ The minimum recommended GPU for running this pipeline is an **NVIDIA A6000** with 48 GB of memory.

#### Example Usage

Run the following command to launch a full generation:

```bash
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir /path/to/output \
    --paper_latex_root /path/to/latex_proj \
    --ref_img /path/to/ref_img.png \
    --ref_audio /path/to/ref_audio.wav \
    --talking_head_env /path/to/hallo2_env \
    --gpu_list [0,1,2,3,4,5,6,7]
```

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--model_name_t` | `str` | `gpt-4.1` | LLM |
| `--model_name_v` | `str` | `gpt-4.1` | VLM |
| `--model_name_talking` | `str` | `hallo2` | Talking Head model. Currently only **hallo2** is supported |
| `--result_dir` | `str` | `/path/to/output` | Output directory (slides, subtitles, videos, etc.) |
| `--paper_latex_root` | `str` | `/path/to/latex_proj` | Root directory of the LaTeX paper project |
| `--ref_img` | `str` | `/path/to/ref_img.png` | Reference image (must be a **square** portrait) |
| `--ref_audio` | `str` | `/path/to/ref_audio.wav` | Reference audio (recommended: ~10s) |
| `--ref_text` | `str` | `None` | Optional reference text (style guidance for the subtitles) |
| `--beamer_templete_prompt` | `str` | `None` | Optional prompt (style guidance for the slide template) |
| `--gpu_list` | `list[int]` | `""` | GPU list for parallel execution (used in **cursor generation** and **Talking Head rendering**) |
| `--if_tree_search` | `bool` | `True` | Whether to enable tree search for slide layout refinement |
| `--stage` | `str` | `"[0]"` | Pipeline stages to run (e.g., `[0]` for the full pipeline, `[1,2,3]` for partial stages); see the example after this table |
| `--talking_head_env` | `str` | `/path/to/hallo2_env` | Python environment path for talking-head generation (obtained via `which python` in the Hallo2 environment) |
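
For example, a partial run that reuses earlier outputs might look like the sketch below; the stage indices here are placeholders, and the exact index-to-module mapping is defined in `pipeline.py`:

```bash
# Hypothetical partial run: keep earlier intermediate results in --result_dir
# and execute only a subset of stages (see pipeline.py for the exact mapping).
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir /path/to/output \
    --paper_latex_root /path/to/latex_proj \
    --ref_img /path/to/ref_img.png \
    --ref_audio /path/to/ref_audio.wav \
    --talking_head_env /path/to/hallo2_env \
    --gpu_list [0,1] \
    --stage "[1,2,3]"
```
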
---

## 📊 Evaluation: Paper2Video
<p align="center">
  <img src="assets/metrics.png" alt="Metrics" width="100%">
</p>

Unlike natural video generation, academic presentation videos serve a highly specialized role: they are not merely about visual fidelity but about **communicating scholarship**. This makes it difficult to directly apply conventional metrics from video synthesis (e.g., FVD, IS, or CLIP-based similarity). Instead, their value lies in how well they **disseminate research** and **amplify scholarly visibility**. From this perspective, we argue that a high-quality academic presentation video should be judged along two complementary dimensions:
#### For the Audience
- The video is expected to **faithfully convey the paper's core ideas**.  
- It should remain **accessible to diverse audiences**.  

#### For the Author
- The video should **foreground the authors' intellectual contribution and identity**.  
- It should **enhance the workโ€™s visibility and impact**.  

To capture these goals, we introduce evaluation metrics specifically designed for academic presentation videos: **Meta Similarity**, **PresentArena**, **PresentQuiz**, and **IP Memory**.

### Run Eval
- Prepare the environment:
```bash
cd src/evaluation
conda create -n p2v_e python=3.10
conda activate p2v_e
pip install -r requirements.txt
```
- For MetaSimilarity and PresentArena:
```bash
python MetaSim_audio.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
python MetaSim_content.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```
```bash
python PresentArena.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```
- For **PresentQuiz**, first generate questions from the paper, then evaluate using Gemini:
```bash
cd PresentQuiz
python create_paper_questions.py --paper_folder /path/to/data
python PresentQuiz.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```

- For **IP Memory**, first construct question pairs from the generated videos, then evaluate using Gemini:
```bash
cd IPMemory
python construct.py
python ip_qa.py
```
See the code for more details!
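
The Gemini-based evaluations (e.g., **PresentQuiz** and **IP Memory**) presumably expect the same `GEMINI_API_KEY` described in the *Configure LLMs* step to be exported in this environment as well:
```bash
# Export the judge model's API key in the evaluation environment, too.
export GEMINI_API_KEY="your_gemini_key_here"
```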

👉 The Paper2Video benchmark is available on [Hugging Face](https://huggingface.co/datasets/ZaynZhu/Paper2Video).

---

## 😼 Fun: Paper2Video for Paper2Video
Check out how Paper2Video presents Paper2Video itself:

https://github.com/user-attachments/assets/ff58f4d8-8376-4e12-b967-711118adf3c4

## 🙏 Acknowledgements

* The presentation videos are sourced from SlidesLive and YouTube.
* We thank all the authors who put great effort into creating their presentation videos!
* We thank [CAMEL](https://github.com/camel-ai/camel) for its well-organized open-source multi-agent framework codebase.
* We thank the authors of [Hallo2](https://github.com/fudan-generative-vision/hallo2.git) and [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git) for their open-sourced code.
* We thank [Wei Jia](https://github.com/weeadd) for his effort in collecting the data and implementing the baselines. We also thank all the participants involved in the human studies.
* We thank all the **Show Lab @ NUS** members for their support!



---

## 📌 Citation


If you find our work useful, please cite:

```bibtex
@misc{paper2video,
      title={Paper2Video: Automatic Video Generation from Scientific Papers}, 
      author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
      year={2025},
      eprint={2510.05096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.05096}, 
}
```
[![Star History](https://api.star-history.com/svg?repos=showlab/Paper2Video&type=Date)](https://star-history.com/#showlab/Paper2Video&Date)