# Paper2Video

<p align="right">
  <b>English</b> | <a href="./README-CN.md">简体中文</a>
</p>

<p align="center">
  <b>Paper2Video: Automatic Video Generation from Scientific Papers</b>
  <br>
  从学术论文自动生成演讲视频
</p>

<p align="center">
  <a href="https://zeyu-zhu.github.io/webpage/">Zeyu Zhu*</a>,
  <a href="https://qhlin.me/">Kevin Qinghong Lin*</a>,
  <a href="https://scholar.google.com/citations?user=h1-3lSoAAAAJ&hl=en">Mike Zheng Shou</a> <br>
  Show Lab, National University of Singapore
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2510.05096">📄 Paper</a> |
  <a href="https://huggingface.co/papers/2510.05096">🤗 Daily Paper</a> |
  <a href="https://huggingface.co/datasets/ZaynZhu/Paper2Video">📊 Dataset</a> |
  <a href="https://showlab.github.io/Paper2Video/">🌐 Project Website</a> |
  <a href="https://x.com/KevinQHLin/status/1976105129146257542">💬 X (Twitter)</a>
</p>

- **Input:** a paper ➕ an image ➕ an audio clip

| Paper | Image | Audio |
|--------|--------|--------|
| <img src="https://github.com/showlab/Paper2Video/blob/page/assets/hinton/paper.png" width="180"/><br>[🔗 Paper link](https://arxiv.org/pdf/1509.01626) | <img src="https://github.com/showlab/Paper2Video/blob/page/assets/hinton/hinton_head.jpeg" width="180"/> <br>Hinton's photo | <img src="assets/sound.png" width="180"/><br>[🔗 Audio sample](https://github.com/showlab/Paper2Video/blob/page/assets/hinton/ref_audio_10.wav) |

- **Output:** a presentation video

https://github.com/user-attachments/assets/39221a9a-48cb-4e20-9d1c-080a5d8379c4

Check out more examples on the [🌐 project page](https://showlab.github.io/Paper2Video/).
## 🔥 Updates

- [x] [2025.10.11] Our work received attention on [YC Hacker News](https://news.ycombinator.com/item?id=45553701).
- [x] [2025.10.9] Thanks to AK for sharing our work on [Twitter](https://x.com/_akhaliq/status/1976099830004072849)!
- [x] [2025.10.9] Our work was covered by [Medium](https://medium.com/@dataism/how-ai-learned-to-make-scientific-videos-from-slides-to-a-talking-head-0d807e491b27).
- [x] [2025.10.8] Check out our demo video below!
- [x] [2025.10.7] We released the [arXiv paper](https://arxiv.org/abs/2510.05096).
- [x] [2025.10.6] We released the [code](https://github.com/showlab/Paper2Video) and [dataset](https://huggingface.co/datasets/ZaynZhu/Paper2Video).
- [x] [2025.9.28] Paper2Video was accepted to the **Scaling Environments for Agents Workshop ([SEA](https://sea-workshop.github.io/)) at NeurIPS 2025**.

https://github.com/user-attachments/assets/a655e3c7-9d76-4c48-b946-1068fdb6cdd9

---
### Table of Contents
- [🌟 Overview](#-overview)
- [🚀 Quick Start: PaperTalker](#-try-papertalker-for-your-paper-)
  - [1. Requirements](#1-requirements)
  - [2. Configure LLMs](#2-configure-llms)
  - [3. Inference](#3-inference)
- [📊 Evaluation: Paper2Video](#-evaluation-paper2video)
- [😼 Fun: Paper2Video for Paper2Video](#-fun-paper2video-for-paper2video)
- [🙏 Acknowledgements](#-acknowledgements)
- [📌 Citation](#-citation)

---
## 🌟 Overview

<p align="center">
  <img src="assets/teaser.png" alt="Overview" width="100%">
</p>

This work addresses two core problems of academic presentation:

- **Left: How to create a presentation video from a paper?**
  *PaperTalker* — an agent that integrates **slide generation**, **subtitling**, **cursor grounding**, **speech synthesis**, and **talking-head video rendering**.
- **Right: How to evaluate a presentation video?**
  *Paper2Video* — a benchmark with well-designed metrics for assessing presentation quality.

---
## 🚀 Try PaperTalker for your Paper!

<p align="center">
  <img src="assets/method.png" alt="Approach" width="100%">
</p>

### 1. Requirements

Prepare the environment:

```bash
cd src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
conda install -c conda-forge tectonic
```

Download the dependent code and follow the instructions in **[Hallo2](https://github.com/fudan-generative-vision/hallo2)** to download the model weights.

```bash
git clone https://github.com/fudan-generative-vision/hallo2.git
```

To avoid potential package conflicts, you need to **prepare a separate environment for talking-head generation**; please refer to [Hallo2](https://github.com/fudan-generative-vision/hallo2) for details. After installing, run `which python` in that environment to get its path (see the example below).

```bash
cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install -r requirements.txt
```
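
The path printed by `which python` is the value you will later pass to `pipeline.py` as `--talking_head_env`; the path shown in the comment below is only illustrative and will differ on your machine.

```bash
# With the hallo environment active, print its Python interpreter path;
# pass this path to pipeline.py via --talking_head_env.
conda activate hallo
which python   # e.g., /home/<user>/miniconda3/envs/hallo/bin/python
```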
### 2. Configure LLMs

Export your **API credentials**:

```bash
export GEMINI_API_KEY="your_gemini_key_here"
export OPENAI_API_KEY="your_openai_key_here"
```

For best results, use **GPT-4.1** or **Gemini 2.5 Pro** for both the LLM and the VLM. Locally deployed open-source models (e.g., Qwen) are also supported; for details, please refer to [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git).
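
For a local model, a common pattern is to serve it behind an OpenAI-compatible endpoint and point the client at it. The sketch below is only a starting point under assumptions: it assumes vLLM is installed, uses an illustrative Qwen checkpoint, and assumes the pipeline's OpenAI client reads the standard `OPENAI_BASE_URL` / `OPENAI_API_KEY` variables; check the Paper2Poster documentation for the exact variables and model names this codebase expects.

```bash
# Assumption: serve a local Qwen model through vLLM's OpenAI-compatible API
# (model name and port are illustrative).
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000

# Assumption: the pipeline's client honors the standard OpenAI SDK variables.
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
```

If this setup applies, pass the served model name to `pipeline.py` via `--model_name_t` / `--model_name_v`.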
### 3. Inference

The script `pipeline.py` provides an automated pipeline for generating academic presentation videos. It takes the **LaTeX paper source** together with a **reference image and audio** as input, and runs multiple sub-modules (Slides → Subtitles → Speech → Cursor → Talking Head) to produce a complete presentation video. ⚡ The minimum recommended GPU for running this pipeline is an **NVIDIA A6000** with 48 GB of memory.

#### Example Usage

Run the following command to launch a full generation:

```bash
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir /path/to/output \
    --paper_latex_root /path/to/latex_proj \
    --ref_img /path/to/ref_img.png \
    --ref_audio /path/to/ref_audio.wav \
    --talking_head_env /path/to/hallo2_env \
    --gpu_list [0,1,2,3,4,5,6,7]
```

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--model_name_t` | `str` | `gpt-4.1` | LLM |
| `--model_name_v` | `str` | `gpt-4.1` | VLM |
| `--model_name_talking` | `str` | `hallo2` | Talking-head model; currently only **hallo2** is supported |
| `--result_dir` | `str` | `/path/to/output` | Output directory (slides, subtitles, videos, etc.) |
| `--paper_latex_root` | `str` | `/path/to/latex_proj` | Root directory of the LaTeX paper project |
| `--ref_img` | `str` | `/path/to/ref_img.png` | Reference image (must be a **square** portrait) |
| `--ref_audio` | `str` | `/path/to/ref_audio.wav` | Reference audio (recommended length: ~10 s) |
| `--ref_text` | `str` | `None` | Optional reference text (style guidance for subtitles) |
| `--beamer_templete_prompt` | `str` | `None` | Optional reference text (style guidance for slides) |
| `--gpu_list` | `list[int]` | `""` | GPU list for parallel execution (used in **cursor generation** and **talking-head rendering**) |
| `--if_tree_search` | `bool` | `True` | Whether to enable tree search for slide layout refinement |
| `--stage` | `str` | `"[0]"` | Pipeline stages to run (e.g., `[0]` for the full pipeline, `[1,2,3]` for partial stages; see the example below) |
| `--talking_head_env` | `str` | `/path/to/hallo2_env` | Python environment path for talking-head generation |
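
As a worked example of `--stage` and `--gpu_list`, a partial re-run on fewer GPUs might look like the sketch below; the stage indices are illustrative, so check `pipeline.py` for the exact stage-to-module mapping.

```bash
# Sketch: re-run only selected stages of an existing result on two GPUs.
# The stage indices here are illustrative; see pipeline.py for their meaning.
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir /path/to/output \
    --paper_latex_root /path/to/latex_proj \
    --ref_img /path/to/ref_img.png \
    --ref_audio /path/to/ref_audio.wav \
    --talking_head_env /path/to/hallo2_env \
    --gpu_list [0,1] \
    --stage [1,2,3]
```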
---
## 📊 Evaluation: Paper2Video

<p align="center">
  <img src="assets/metrics.png" alt="Metrics" width="100%">
</p>

Unlike natural video generation, academic presentation videos serve a highly specialized role: they are not merely about visual fidelity but about **communicating scholarship**. This makes it difficult to directly apply conventional metrics from video synthesis (e.g., FVD, IS, or CLIP-based similarity). Instead, their value lies in how well they **disseminate research** and **amplify scholarly visibility**. From this perspective, we argue that a high-quality academic presentation video should be judged along two complementary dimensions:

#### For the Audience
- The video is expected to **faithfully convey the paper’s core ideas**.
- It should remain **accessible to diverse audiences**.

#### For the Author
- The video should **foreground the authors’ intellectual contribution and identity**.
- It should **enhance the work’s visibility and impact**.

To capture these goals, we introduce evaluation metrics designed specifically for academic presentation videos: **Meta Similarity**, **PresentArena**, **PresentQuiz**, and **IP Memory**.
### Run Eval

- Prepare the environment:
```bash
cd src/evaluation
conda create -n p2v_e python=3.10
conda activate p2v_e
pip install -r requirements.txt
```
- For **Meta Similarity** and **PresentArena**:
```bash
python MetaSim_audio.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
python MetaSim_content.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```
```bash
python PresentArena.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```
- For **PresentQuiz**, first generate questions from the paper, then evaluate using Gemini:
```bash
cd PresentQuiz
python create_paper_questions.py --paper_folder /path/to/data
python PresentQuiz.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
```
- For **IP Memory**, first construct question pairs from the generated videos, then evaluate using Gemini:
```bash
cd IPMemory
python construct.py
python ip_qa.py
```

See the code for more details!

👉 The Paper2Video benchmark is available on [Hugging Face](https://huggingface.co/datasets/ZaynZhu/Paper2Video).

---
## 😼 Fun: Paper2Video for Paper2Video

Check out **Paper2Video for Paper2Video**, a presentation video for this paper generated by our own pipeline:

https://github.com/user-attachments/assets/ff58f4d8-8376-4e12-b967-711118adf3c4

## 🙏 Acknowledgements

* The sources of the presentation videos are SlidesLive and YouTube.
* We thank all the authors who put great effort into creating their presentation videos!
* We thank [CAMEL](https://github.com/camel-ai/camel) for their well-organized, open-source multi-agent framework codebase.
* We thank the authors of [Hallo2](https://github.com/fudan-generative-vision/hallo2.git) and [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git) for their open-sourced code.
* We thank [Wei Jia](https://github.com/weeadd) for his effort in collecting the data and implementing the baselines. We also thank all the participants in the human studies.
* We thank all the **Show Lab @ NUS** members for their support!

---
## 📌 Citation

If you find our work useful, please cite:

```bibtex
@misc{paper2video,
      title={Paper2Video: Automatic Video Generation from Scientific Papers},
      author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
      year={2025},
      eprint={2510.05096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.05096},
}
```
[![Star History Chart](https://api.star-history.com/svg?repos=showlab/Paper2Video&type=Date)](https://star-history.com/#showlab/Paper2Video&Date)