---
title: README
emoji: 🐢
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
==============================================================================================================

Wan Ju Kang, Eunki Kim, Na Min An, Sangryul Kim, Haemin Choi, Ki Hoon Kwak, and James Thorne

## 📄 [Paper](https://arxiv.org/abs/2503.13369)
Hello! We are a team of researchers at [KAIST AI](https://gsai.kaist.ac.kr) working on accessible visualization.
Specifically, we compiled a diagram description dataset for blind and low-vision (BLV) individuals.
We worked in close cooperation with two schools for the blind, as well as over 30 sighted annotators, and we are grateful for their contributions.
Check out our [preprint](https://arxiv.org/abs/2503.13369), and feel free to contact us at soarhigh@kaist.ac.kr.

---------------------------------------
## Abstract

> Often, the needs and visual abilities differ between the annotator group and the end user
> group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain.
> Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat
> lacking by BLV standards. In this study, we ask sighted individuals to assess—rather than produce—diagram descriptions generated by vision-language models (VLM) that have been
> guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators
> who are themselves BLV and teach visually impaired learners. We release SIGHTATION, a collection of diagram description datasets
> spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and
> demonstrate their fine-tuning potential in various downstream tasks.
## Sightation Collection

- SightationCompletions
- SightationPreference
- SightationRetrieval
- SightationVQA
- SightationReasoning
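Each subset can be pulled with the Hugging Face `datasets` library. Below is a minimal loading sketch; the repository ID is an assumption based on the subset names above, so please check the Hub for the exact identifiers and available splits.

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# NOTE: the repository ID below is an assumption based on the subset
# names listed above -- verify the exact ID and splits on the Hub.
from datasets import load_dataset

preference = load_dataset("Sightation/SightationPreference", split="train")
print(preference[0])  # inspect one annotated sample
```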
<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/cNshK4QAdiNMqk7x6J6j7.png" width="100%" height="100%" title="visual_abstract" alt="visual_abstract"></img>
The key benefit of utilizing sighted user feedback lies in their assessments, which are based on solid visual
grounding. The compiled assessments prove to be effective training material for steering VLMs toward more
accessible descriptions.
<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/8oYvtq7dtv_Ck-U6OlcAE.png" width="70%" height="70%" title="dimensions_assignment" alt="dimensions_assignment"></img>

The description quality dimensions, each assessed by its respective evaluator group.
## Results

<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/094e9Hw7lauvT1tshg1Wj.png" width="90%" height="90%" title="spider_chart" alt="spider_chart"></img>

Tuning VLMs on Sightation enhanced various qualities of the diagram descriptions, as evaluated by BLV educators and shown here as normalized ratings averaged within each aspect.
The benefit of the dataset is most pronounced with the Qwen2-VL-2B model, shown above.
## BibTeX

If you find our dataset helpful, please cite our work!

```bibtex
@misc{kang2025sightationcountsleveragingsighted,
      title={Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions},
      author={Wan Ju Kang and Eunki Kim and Na Min An and Sangryul Kim and Haemin Choi and Ki Hoon Kwak and James Thorne},
      year={2025},
      eprint={2503.13369},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.13369},
}
```
### What's in the name?

- Training on our dataset means using the sighted user feedback; in a distant way, you would be citing them.
- Suppose you refer to our dataset in a spoken conversation. The sightation/citation confusion is meant to mimic a small part of the inconvenience faced by BLV learners, who often must rely only on auditory cues for disambiguating homophones.