Referring Any Pixel from Image and Video
Bringing MLLMs into Embodied World
VideoRefer x VideoLLaMA3
Frontier Foundation Models for Video Understanding
VideoLLaMA2-AV