Spaces: Paused
Commit 18f9df3 · root committed · new README.md
1 parent: 424a94c

README.md CHANGED
@@ -1,249 +1,9 @@

Previous README.md (removed):
This is the repo for the Video-LLaMA project, which aims to empower large language models with video and audio understanding capabilities.

<div style='display:flex; gap: 0.25rem; '>
<a href='https://modelscope.cn/studios/damo/video-llama/summary'><img src='https://img.shields.io/badge/ModelScope-Demo-blueviolet'></a>
<a href='https://www.modelscope.cn/models/damo/videollama_7b_llama2_finetuned/summary'><img src='https://img.shields.io/badge/ModelScope-Checkpoint-blueviolet'></a>
<a href='https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>
<a href='https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-blue'></a>
<a href='https://arxiv.org/abs/2306.02858'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>
</div>

## News
- [08.03] Release **Video-LLaMA-2** with [Llama-2-7B/13B-Chat](https://huggingface.co/meta-llama) as the language decoder
  - **NO** delta weights and separate Q-Former weights anymore; the full weights for running Video-LLaMA are all here :point_right: [[7B](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned)][[13B](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Finetuned)]
  - Allow further customization starting from our pre-trained checkpoints [[7B-Pretrained](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Pretrained)] [[13B-Pretrained](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Pretrained)]
- [06.14] **NOTE**: The current online interactive demo is primarily for English chatting, and it may **NOT** be a good option for asking questions in Chinese since Vicuna/LLaMA does not represent Chinese text very well.
- [06.13] **NOTE**: Audio support is currently **ONLY** available for Vicuna-7B, although we have several VL checkpoints available for other decoders.
- [06.10] **NOTE**: We have NOT updated the HF demo yet because the whole framework (with the audio branch) cannot run normally on A10-24G. The currently running demo is still the previous version of Video-LLaMA. We will fix this issue soon.
- [06.08] Release the checkpoints of the audio-supported Video-LLaMA. Documentation and example outputs are also updated.
- [05.22] Interactive demo online: try our Video-LLaMA (with **Vicuna-7B** as the language decoder) at [Hugging Face](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) and [ModelScope](https://pre.modelscope.cn/studios/damo/video-llama/summary)!
- [05.22] Release **Video-LLaMA v2** built with Vicuna-7B
- [05.18] Support video-grounded chat in Chinese
  - [**Video-LLaMA-BiLLA**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth): we introduce [BiLLa-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT) as the language decoder and fine-tune the video-language aligned model (i.e., the stage-1 model) with machine-translated [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data) instructions.
  - [**Video-LLaMA-Ziya**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth): same as Video-LLaMA-BiLLA, but with the language decoder changed to [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1).
- [05.18] Create a Hugging Face [repo](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series) to store the model weights of all the variants of our Video-LLaMA.
- [05.15] Release [**Video-LLaMA v2**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth): we use the training data provided by [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data) to further enhance the instruction-following capability of Video-LLaMA.
- [05.07] Release the initial version of **Video-LLaMA**, including its pre-trained and instruction-tuned checkpoints.

<p align="center" width="100%">
<a target="_blank"><img src="figs/architecture_v2.png" alt="Video-LLaMA" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a>
</p>

## Introduction
- Video-LLaMA is built on top of [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) and [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4). It is composed of two core components: (1) the Vision-Language (VL) Branch and (2) the Audio-Language (AL) Branch.
  - **VL Branch** (Visual encoder: ViT-G/14 + BLIP-2 Q-Former)
    - A two-layer video Q-Former and a frame embedding layer (applied to the embeddings of each frame) are introduced to compute video representations.
    - We train the VL Branch on the Webvid-2M video caption dataset with a video-to-text generation task. We also add image-text pairs (~595K image captions from [LLaVA](https://github.com/haotian-liu/LLaVA)) to the pre-training dataset to enhance the understanding of static visual concepts.
    - After pre-training, we further fine-tune the VL Branch using the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything).
  - **AL Branch** (Audio encoder: ImageBind-Huge)
    - A two-layer audio Q-Former and an audio segment embedding layer (applied to the embedding of each audio segment) are introduced to compute audio representations.
    - Since the audio encoder we use (i.e., ImageBind) is already aligned across multiple modalities, we train the AL Branch on video/image instruction data only, just to connect the output of ImageBind to the language decoder.
- Only the Video/Audio Q-Formers, positional embedding layers, and linear layers are trainable during cross-modal training (a minimal sketch of the VL-branch dataflow follows this list).

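To make the dataflow above concrete, here is a minimal, self-contained PyTorch sketch of the VL-branch pipeline (frame position embedding → video Q-Former → linear projection into the LLM embedding space). The class name, dimensions, and the use of plain multi-head attention are illustrative assumptions, not the repo's actual Q-Former implementation.

```python
# Illustrative sketch only: approximates the VL-branch dataflow described above.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    def __init__(self, frame_dim=768, num_queries=32, llm_dim=4096, max_frames=32):
        super().__init__()
        # learnable temporal (frame) position embedding
        self.frame_pos = nn.Embedding(max_frames, frame_dim)
        # learnable query tokens, as in BLIP-2-style Q-Formers
        self.queries = nn.Parameter(torch.randn(1, num_queries, frame_dim) * 0.02)
        # "two-layer video Q-Former" approximated with two cross-attention blocks
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(frame_dim, num_heads=8, batch_first=True) for _ in range(2)]
        )
        # linear projection into the language decoder's embedding space
        self.proj = nn.Linear(frame_dim, llm_dim)

    def forward(self, frame_feats):  # (batch, num_frames, tokens_per_frame, frame_dim)
        b, t, n, d = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        frame_feats = frame_feats + pos[None, :, None, :]  # add frame-level positions
        kv = frame_feats.reshape(b, t * n, d)               # flatten time and space
        q = self.queries.expand(b, -1, -1)
        for attn in self.layers:
            q, _ = attn(q, kv, kv)                          # queries attend to all frame tokens
        return self.proj(q)                                 # "video tokens" fed to the LLM

video_tokens = VideoQFormerSketch()(torch.randn(1, 8, 257, 768))
print(video_tokens.shape)  # torch.Size([1, 32, 4096])
```

Only the query tokens, position embeddings, attention layers, and projection in such a module would be trained; the visual encoder and the LLM stay frozen.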
## Example Outputs
- **Video with background sound**

<p float="left">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/7f7bddb2-5cf1-4cf4-bce3-3fa67974cbb3" style="width: 45%; margin: auto;">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/ec76be04-4aa9-4dde-bff2-0a232b8315e0" style="width: 45%; margin: auto;">
</p>

- **Video without sound effects**

<p float="left">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/539ea3cc-360d-4b2c-bf86-5505096df2f7" style="width: 45%; margin: auto;">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/7304ad6f-1009-46f1-aca4-7f861b636363" style="width: 45%; margin: auto;">
</p>

- **Static image**

<p float="left">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/a146c169-8693-4627-96e6-f885ca22791f" style="width: 45%; margin: auto;">
    <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/66fc112d-e47e-4b66-b9bc-407f8d418b17" style="width: 45%; margin: auto;">
</p>

## Pre-trained & Fine-tuned Checkpoints
The following checkpoints store the learnable parameters only (positional embedding layers, Video/Audio Q-Formers, and linear projection layers); a quick way to verify this is sketched after the tables.

#### Vision-Language Branch
| Checkpoint | Link | Note |
|:------------|-------------|-------------|
| pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b-v2.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595K image-caption pairs) |
| finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
| pretrain-vicuna13b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-vicuna13b.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595K image-caption pairs) |
| finetune-vicuna13b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
| pretrain-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-ziya13b-zh.pth) | Pre-trained with the Chinese LLM [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
| finetune-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-billa7b-zh.pth) | Pre-trained with the Chinese LLM [BiLLa-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT) |
| finetune-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |

#### Audio-Language Branch
| Checkpoint | Link | Note |
|:------------|-------------|-------------|
| pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b_audiobranch.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595K image-caption pairs) |
| finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |

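As an optional sanity check (not part of the official instructions), you can load one of the downloaded `.pth` files and list its top-level parameter groups to confirm that only the learnable modules are stored. The `"model"` key below is an assumption about the checkpoint layout; adjust it if your file differs.

```python
# Optional, illustrative check of what a downloaded checkpoint contains.
import torch

ckpt = torch.load("finetune-vicuna7b-v2.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under a "model" key
print(sorted({name.split(".")[0] for name in state}))  # top-level sub-modules stored
total = sum(p.numel() for p in state.values() if hasattr(p, "numel"))
print(f"{total / 1e6:.1f}M parameters stored")
```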
## Usage
#### Environment Preparation

First, install ffmpeg.
```
apt update
apt install ffmpeg
```
Then, create a conda environment:
```
conda env create -f environment.yml
conda activate videollama
```

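If you want to verify the setup before moving on, a quick check such as the following (illustrative, not from the repo) confirms that ffmpeg is visible on PATH:

```python
# Optional: confirm ffmpeg is installed and reachable before preprocessing videos.
import shutil
import subprocess

assert shutil.which("ffmpeg"), "ffmpeg not found on PATH; install it first (apt install ffmpeg)"
out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
print(out.stdout.splitlines()[0])  # prints the installed ffmpeg version line
```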
## Prerequisites
Before using the repository, make sure you have obtained the following checkpoints:

#### Pre-trained Language Decoder

- Get the original LLaMA weights in the Hugging Face format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
- Download the Vicuna delta weights :point_right: [[7B](https://huggingface.co/lmsys/vicuna-7b-delta-v0)][[13B](https://huggingface.co/lmsys/vicuna-13b-delta-v0)] (Note: we use **v0 weights** instead of v1.1 weights).
- Use the following command to add the delta weights to the original LLaMA weights and obtain the Vicuna weights:

```
python apply_delta.py \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta /path/to/vicuna-13b-delta
```

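Optionally, you can confirm that the merged Vicuna weights load cleanly with Hugging Face Transformers before wiring them into Video-LLaMA. This snippet is illustrative only and assumes the `--target` directory produced above:

```python
# Optional, illustrative check that the merged Vicuna weights are loadable.
from transformers import AutoTokenizer, AutoModelForCausalLM

path = "/output/path/to/vicuna-13b"          # the --target directory from apply_delta.py
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(path)  # loads on CPU; add dtype/device_map as needed
print(model.config.model_type, f"{model.num_parameters() / 1e9:.1f}B parameters")
```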
#### Pre-trained Visual Encoder in Vision-Language Branch
- Download the MiniGPT-4 model (trained linear layer) from this [link](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view).

#### Pre-trained Audio Encoder in Audio-Language Branch
- Download the weights of ImageBind from this [link](https://github.com/facebookresearch/ImageBind).

## Download Learnable Weights
Use `git-lfs` to download the learnable weights of Video-LLaMA (i.e., positional embedding layers + Q-Formers + linear projection layers):
```bash
git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series
```
The above commands download the model weights of all the Video-LLaMA variants. Of course, you can also download only the weights you need. For example, to run Video-LLaMA with Vicuna-7B as the language decoder locally, the following two checkpoints are sufficient:
```bash
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth
```

## How to Run Demo Locally
First, set `llama_model`, `imagebind_ckpt_path`, `ckpt`, and `ckpt_2` in [eval_configs/video_llama_eval_withaudio.yaml](./eval_configs/video_llama_eval_withaudio.yaml).
Then run the script:
```
python demo_audiovideo.py \
    --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
    --model_type llama_v2 \
    --gpu-id 0
# use --model_type vicuna for the Vicuna-based checkpoints
```

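Before launching, it can help to confirm that the four paths above are actually set and point to existing files. The helper below is an illustrative sketch (not shipped with the repo) that reads the YAML and reports each path:

```python
# Optional helper: check the checkpoint paths referenced in the eval config.
import os
import yaml

CFG = "eval_configs/video_llama_eval_withaudio.yaml"
KEYS = {"llama_model", "imagebind_ckpt_path", "ckpt", "ckpt_2"}

def find_paths(node, found=None):
    """Recursively collect the values of the keys we care about, wherever they sit."""
    found = {} if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            if key in KEYS:
                found[key] = value
            find_paths(value, found)
    elif isinstance(node, list):
        for value in node:
            find_paths(value, found)
    return found

with open(CFG) as f:
    cfg = yaml.safe_load(f)

for key, path in find_paths(cfg).items():
    status = "OK" if isinstance(path, str) and os.path.exists(path) else "MISSING"
    print(f"{key:>22}: {path}  [{status}]")
```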
## Training
The training of each cross-modal branch (i.e., the VL branch or the AL branch) in Video-LLaMA consists of two stages:

1. Pre-training on the [Webvid-2.5M](https://github.com/m-bain/webvid) video caption dataset and the [LLaVA-CC3M](https://github.com/haotian-liu/LLaVA) image caption dataset.
2. Fine-tuning using the image-based instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)/[LLaVA](https://github.com/haotian-liu/LLaVA) and the video-based instruction-tuning data from [VideoChat](https://github.com/OpenGVLab/Ask-Anything).

### 1. Pre-training
#### Data Preparation
Download the metadata and videos by following the instructions in the official GitHub repo of [WebVid](https://github.com/m-bain/webvid).
The folder structure of the dataset is shown below (a small consistency check follows the listings):
```
|webvid_train_data
|──filter_annotation
|────0.tsv
|──videos
|────000001_000050
|──────1066674784.mp4
```
```
|cc3m
|──filter_cap.json
|──image
|────GCC_train_000000000.jpg
|────...
```
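As an optional, illustrative check (not part of the repo), the snippet below verifies that the annotation file and the downloaded videos match the layout above without assuming any particular column schema:

```python
# Optional: sanity-check the WebVid layout shown above.
import csv
from pathlib import Path

root = Path("webvid_train_data")

with open(root / "filter_annotation" / "0.tsv", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(f"{len(rows)} annotation rows; first row: {rows[0] if rows else 'n/a'}")

videos = list((root / "videos").rglob("*.mp4"))
print(f"{len(videos)} downloaded .mp4 files, e.g. {videos[0] if videos else 'none yet'}")
```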
#### Script
Configure the checkpoint and dataset paths in [video_llama_stage1_pretrain.yaml](./train_configs/video_llama_stage1_pretrain.yaml).
Run the script:
```
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage1_pretrain.yaml
```

### 2. Instruction Fine-tuning
#### Data
For now, the fine-tuning dataset consists of the following (a small script to peek at these files follows the list):
* 150K image-based instructions from LLaVA [[link](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/raw/main/llava_instruct_150k.json)]
* 3K image-based instructions from MiniGPT-4 [[link](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_2_STAGE.md)]
* 11K video-based instructions from VideoChat [[link](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data)]

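If you want to inspect these files before training, the small illustrative script below loads the LLaVA instruction JSON and reports its size and record keys without assuming a fixed schema:

```python
# Optional peek at the instruction data; the exact schema differs between sources,
# so we only report the number of samples and the keys of the first record.
import json

with open("llava_instruct_150k.json") as f:
    data = json.load(f)
print(f"{len(data)} instruction samples")
print("keys of first sample:", sorted(data[0].keys()) if data else "empty file")
```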
#### Script
Configure the checkpoint and dataset paths in [video_llama_stage2_finetune.yaml](./train_configs/video_llama_stage2_finetune.yaml).
Then run the script:
```
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage2_finetune.yaml
```

## Recommended GPUs
* Pre-training: 8xA100 (80G)
* Instruction-tuning: 8xA100 (80G)
* Inference: 1xA100 (40G/80G) or 1xA6000

## Acknowledgement
We are grateful for the following awesome projects that Video-LLaMA builds upon:
* [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4): Enhancing Vision-language Understanding with Advanced Large Language Models
* [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [ImageBind](https://github.com/facebookresearch/ImageBind): One Embedding Space To Bind Them All
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [VideoChat](https://github.com/OpenGVLab/Ask-Anything): Chat-Centric Video Understanding
* [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant
* [WebVid](https://github.com/m-bain/webvid): A Large-scale Video-Text Dataset
* [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl/tree/main): Modularization Empowers Large Language Models with Multimodality

The logo of Video-LLaMA is generated by [Midjourney](https://www.midjourney.com/).

## Terms of Use
Our Video-LLaMA is a research preview intended for non-commercial use only. You must **NOT** use Video-LLaMA for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that could potentially violate these guidelines.

## Citation
If you find our project useful, we hope you can star our repo and cite our paper as follows:
```
@article{damonlpsg2023videollama,
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  year = 2023,
  journal = {arXiv preprint arXiv:2306.02858},
  url = {https://arxiv.org/abs/2306.02858}
}
```

New README.md (added):

title: Video LLaMA2 IBM
emoji: π
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 3.29.0
app_file: app.py
pinned: false
license: other
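The configuration above declares `sdk: gradio` and `app_file: app.py`. The actual `app.py` of this Space is not shown in this commit; the snippet below is only a hypothetical, minimal Gradio skeleton of what such an entry point typically looks like:

```python
# Hypothetical, minimal Gradio entry point; NOT the actual app.py of this Space.
import gradio as gr

def answer(video, question):
    # placeholder: a real app would run Video-LLaMA inference on the uploaded video here
    return f"Received {video!r} and question: {question}"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Video(label="Input video"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Video LLaMA2 IBM",
)

if __name__ == "__main__":
    demo.launch()
```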