DisTime: Distribution-based Time Representation for Video Large Language Models
This repository contains the official implementation and checkpoints for the paper: DisTime: Distribution-based Time Representation for Video Large Language Models (ICCV 2025).
For more details, including installation, training, and evaluation scripts, please refer to the official GitHub repository.
Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, 55 times larger than ActivityNet-Captions. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released in the official GitHub repository.
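For intuition, here is a minimal PyTorch sketch of the idea behind the Distribution-based Time Decoder described above: a time-token embedding is projected to a probability distribution over discretized time bins, and a continuous timestamp is recovered as the distribution's expectation. All module names, dimensions, and the binning scheme below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    # Illustrative sketch only (not the official implementation): maps a
    # time-token embedding to a probability distribution over time bins and
    # returns the expected timestamp, normalized to [0, 1].
    def __init__(self, hidden_dim: int = 1024, num_bins: int = 100):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # Bin centers evenly spaced over the normalized video duration [0, 1]
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, time_token_embedding: torch.Tensor) -> torch.Tensor:
        # A soft distribution over bins keeps boundaries smooth and continuous
        probs = self.to_logits(time_token_embedding).softmax(dim=-1)
        # Continuous timestamp = expectation over the bin centers
        return (probs * self.bin_centers).sum(dim=-1)

decoder = DistributionTimeDecoder()
timestamps = decoder(torch.randn(2, 1024))  # two time tokens -> two normalized timestamps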
Dataset
The InternVid-TG dataset proposed in the paper is released on the Hugging Face Hub at yingsen/internvid-tg.
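The annotation files can be fetched with the huggingface_hub client; the snippet below only downloads the repository contents, since the exact data schema is documented on the dataset card:

from huggingface_hub import snapshot_download

# Download the InternVid-TG dataset files to a local cache directory
local_dir = snapshot_download(repo_id="yingsen/internvid-tg", repo_type="dataset")
print(local_dir)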
Usage
You can load the model using the transformers library and use it for video understanding tasks.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoProcessor
from decord import cpu, VideoReader

# Load model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-1B")
model = AutoModelForCausalLM.from_pretrained(
    "UserJoseph/DisTime-1B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-1B")
model.eval()

# Example video input
video_path = "./examples/video1.mp4"  # Replace with your video path
qs = "Describe this video in detail."

# Sample frames at roughly one frame per second
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
step = max(1, round(fps))
frame_indices = np.arange(0, len(vr), step)
video_frames = np.stack([vr[int(i)].asnumpy() for i in frame_indices])

# Prepare inputs: pass both the chat-formatted prompt and the video frames
# through the processor (argument names follow the common Video-LLM processor
# interface; check the official repo if your processor version differs)
messages = [{"role": "user", "content": [{"type": "video", "video": video_frames}, {"type": "text", "text": qs}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], videos=[video_frames], return_tensors="pt").to(model.device)

# Generate response (greedy decoding; temperature has no effect when do_sample=False)
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=128,
        use_cache=True,
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
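Because DisTime targets time-sensitive tasks, you can replace the captioning prompt above with a temporal-grounding question. The wording below is only an illustration; the exact prompt and timestamp output format are documented in the official repository:

qs = "When does the main action start and end in this video?"
# Re-run the input preparation and generation steps above with this prompt;
# the model emits start/end timestamps produced by its time decoder.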
Citation
If you find this work useful, please cite the paper:
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
Acknowledgement
DisTime is built on the codebases of the following projects: InternVL and LLaVA-NeXT. We sincerely thank these open-source projects, which have greatly facilitated our research into time representation for video large language models.