title: Short Video Descriptions
emoji: π
colorFrom: red
colorTo: red
sdk: docker
pinned: false
ShortScribe Pipeline
This repository provides code for the paper Making Short-Form Videos Accessible With Hierarchical Video Summaries. It introduces the pipeline for ShortScribe, a tool that makes short-form videos accessible to blind and low-vision users by generating summaries at varying levels of depth. This source code specifically provides an API for generating the summaries. The source code for ShortScribe's interface is available in a separate GitHub repository.
Installing ShortScribe
IMPORTANT NOTE: This repository uses the gpt-4 model because it was built before the release of gpt-4-turbo and gpt-4o.
Before building the environment to run the ShortScribe pipeline, make sure you have sufficient resources and credentials. You will need Google Cloud service account credentials in the form of a JSON file and an OpenAI API key. We deployed our system on an NVIDIA A100 GPU with about 50 GB of memory using PyTorch 2.0.1+cu118.
To get a local copy of the pipeline, run git lfs install and then git clone git@hf.co:spaces/akhil03/ShortScribe-Pipeline
Running ShortScribe
The Dockerfile builds the environment with the necessary packages and then runs a development server that accepts API requests. To set up the Docker container, perform the following steps:
docker build -t <image-name> .
docker run -p 80:80 <image-name>
docker exec -it <container-name> /bin/bash
If you are pushing this Dockerfile to your own Hugging Face repository, the Docker container will build and run automatically, so you can skip the commands above in that case.
API Calls
The API calls are designed such that the video to be summarized must first be uploaded to the temporary_uploads/ folder before sending an API request. Make sure the video is named <number>.mp4 (e.g., 01.mp4) and placed in temporary_uploads/ before making any API calls.
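For instance, a minimal Python sketch for staging a video (the local file name my_video.mp4 is hypothetical; the folder and <number>.mp4 naming scheme come from above):

import shutil
from pathlib import Path

# Hypothetical local video to be summarized
source_video = Path("my_video.mp4")
upload_dir = Path("temporary_uploads")
upload_dir.mkdir(exist_ok=True)

# The pipeline expects videos named <number>.mp4, e.g. 01.mp4
shutil.copy(source_video, upload_dir / "01.mp4")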
getVideoData/
Given the video ID (specified in the file name), returns all of the data extracted from the video before being summarized by GPT-4. Returns a JSON object shown as follows:
{
"start": "Sample text", // Summary of the extracted data of the entire video without a word limit (float)
"end": "Sample text", // The end of the shot in terms of seconds (float)
"text_on_screen": "Sample text", // On-screen text in the shot (string)
"transcript_text": "Sample text", // Audio transcript of the shot (string)
"image_captions": ["Sample text", "Sample text", "Sample text", "Sample text", "Sample text"], // Five candidate image captions generated by BLIP-2 (not sorted in any particular order)
"image_captions_clip": [
{
"text": "Sample text", // Image caption generated by BLIP-2
"score": 1.0, // Image caption similarity score generated by CLIP
},
... 4 more ...
]
}
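As a rough illustration, the request below assumes the development server is reachable on localhost port 80 (matching the docker run mapping above) and that the endpoint accepts a GET request with the video ID appended to the path; check the server code for the exact request format:

import requests

# Assumed base URL; adjust host and port to match your deployment
BASE_URL = "http://localhost:80"

# Video ID 01 corresponds to temporary_uploads/01.mp4
response = requests.get(f"{BASE_URL}/getVideoData/01")
response.raise_for_status()
data = response.json()
print(data["transcript_text"])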
getShotSummaries/
Given the video ID (specified by the file name), returns a list of JSON objects for each shot of the video. The format for each JSON object in the list is shown below:
{
"start": 0.0, // The start of the shot in terms of seconds (float)
"end": 5.75, // The end of the shot in terms of seconds (float)
"text_on_screen": "Sample text", // On-screen text in the shot (string)
"per_shot_summaries": "Summary of the shot generated by GPT-4" // Summary of the shot (string)
}
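Under the same assumptions about the request format, iterating over the returned shot list might look like this:

import requests

response = requests.get("http://localhost:80/getShotSummaries/01")
response.raise_for_status()

# Each element describes one shot of video 01.mp4
for shot in response.json():
    print(f'{shot["start"]:.1f}-{shot["end"]:.1f}s: {shot["per_shot_summaries"]}')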
getVideoSummary/
Given the video ID (specified in the file name), returns the overall summaries of the video (a full-length description plus 10-, 25-, and 50-word summaries) generated by GPT-4. Returns a JSON object shown as follows:
{
"video_description": "Sample text", // Summary of the extracted data of the entire video without a word limit
"summary_10": "Sample text", // Summary of the extracted data of the entire video in 10 words
"summary_25": "Sample text", // Summary of the extracted data of the entire video in 25 words
"summary_50": "Sample text" // Summary of the extracted data of the entire video in 50 words
}
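Again assuming the same request format, a caller can pick whichever summary length fits its context:

import requests

summaries = requests.get("http://localhost:80/getVideoSummary/01").json()

# Shorter summaries suit quick previews; the full description gives the most detail
print("10-word summary:", summaries["summary_10"])
print("Full description:", summaries["video_description"])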
Credits and Citation
If you have any questions or issues related to the source code, feel free to reach out to Akhil Iyer (akhil.iyer@utexas.edu).
If our work is useful to you, please cite it as follows:
@article{van2024making,
title={Making Short-Form Videos Accessible with Hierarchical Video Summaries},
author={Van Daele, Tess and Iyer, Akhil and Zhang, Yuning and Derry, Jalyn C and Huh, Mina and Pavel, Amy},
journal={arXiv preprint arXiv:2402.10382},
year={2024}
}