Universal-Multimodal-Agent (UMA)
New: Multimodal Datasets Catalog (Phase 1 Data Collection)
We’re kicking off data collection for UMA. Below is a curated, growing catalog of widely used public multimodal datasets, organized by category, with brief notes and links for immediate use; a minimal loading sketch follows the Text–Image list.
A. Text–Image
- LAION-5B — Massive web-scale image–text pairs; the earlier LAION-400M release and the English LAION-2B-en subset are also available. https://laion.ai/blog/laion-5b/
- COCO (2017 Captions) — Image captioning and detection; strong baselines. https://cocodataset.org/#home
- Visual Genome — Dense region descriptions, objects, attributes, relationships. https://visualgenome.org/
- Conceptual Captions (CC3M/CC12M) — Web image–alt-text pairs; image URL/caption lists are distributed via the GitHub repo. https://ai.google.com/research/ConceptualCaptions/ | https://github.com/google-research-datasets/conceptual-captions
- Flickr30k / Flickr8k — Classic captioning sets. https://hockenmaier.cs.illinois.edu/CS546-2014/data/flickr30k.html
- SBU Captions — Image–text pairs from Flickr. http://www.cs.virginia.edu/~vicente/sbucaptions/
- TextCaps — OCR-centric captioning with text in images. https://textvqa.org/textcaps/
- VizWiz — Images taken by blind users; accessibility focus. https://vizwiz.org/
- WebLI (if accessible) — Large-scale multilingual image–text pairs. https://ai.google/discover/papers/webli/
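To make “immediate use” concrete, here is a minimal sketch of streaming one of these image–text sets with the Hugging Face `datasets` library. The repo ID is a hypothetical placeholder for whichever Hub mirror your manifest records, and the printed schema will vary by dataset.

```python
# Minimal sketch: stream a catalog entry instead of downloading it in full.
# Assumes the dataset has a Hugging Face Hub mirror; the repo ID below is a
# hypothetical placeholder -- substitute the mirror recorded in your manifest.
from datasets import load_dataset

REPO_ID = "your-org/image-text-mirror"  # placeholder, not a real Hub repo

stream = load_dataset(REPO_ID, split="train", streaming=True)

for i, example in enumerate(stream):
    # Image-text records typically carry a caption plus raw image bytes or a
    # source URL; print the schema of the first few records to confirm.
    print({key: type(value).__name__ for key, value in example.items()})
    if i == 4:
        break
```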
B. Text–Image Reasoning / VQA / Document QA
- VQAv2 — Visual Question Answering benchmark. https://visualqa.org/
- GQA — Compositional reasoning over scenes. https://cs.stanford.edu/people/dorarad/gqa/
- OK-VQA / A-OKVQA — Requires external knowledge. https://okvqa.allenai.org/
- ScienceQA — Multimodal science questions with diagrams. https://scienceqa.github.io/
- DocVQA / TextVQA — Reading text in images. https://textvqa.org/
- InfographicVQA — VQA on charts/infographics. https://www.microsoft.com/en-us/research/project/infographicvqa/
- ChartQA / PlotQA / Chart-to-Text — Chart understanding, reasoning. https://github.com/vis-nlp/ChartQA
C. Text–Table (Structured Data)
- TabFact — Table fact verification from Wikipedia. https://tabfact.github.io/
- WikiTableQuestions — Semantic parsing over tables. https://ppasupat.github.io/WikiTableQuestions/
- ToTTo — Controlled table-to-text generation. https://github.com/google-research-datasets/ToTTo
- SQA (Sequential Question Answering) — Multi-turn QA over tables. https://allenai.org/data/sqa
- Spider — Text-to-SQL over multiple DBs (semi-structured). https://yale-lily.github.io/spider
- TURL — Table understanding pretraining. https://github.com/sunlab-osu/TURL
- OpenTabQA — Open-domain QA over tables. https://github.com/IBM/OpenTabQA
- MultiTab / TABBIE resources — Tabular reasoning. https://multitab-project.github.io/
D. Text–Audio / Speech
- LibriSpeech — ASR with read English speech. https://www.openslr.org/12
- Common Voice — Multilingual crowdsourced speech. https://commonvoice.mozilla.org/
- Libri-Light — Large-scale unlabeled speech for self-supervised learning. https://github.com/facebookresearch/libri-light
- TED-LIUM / How2 — Talks with transcripts and multimodal context. https://lium.univ-lemans.fr/ted-lium/
- AudioSet — Weakly labeled ontology of sounds (with YouTube links). https://research.google.com/audioset/
- ESC-50 / UrbanSound8K — Environmental sound classification. https://github.com/karoldvl/ESC-50
- VoxCeleb — Speaker identification/verification. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- SPGISpeech (if license allows) — Financial domain ASR. https://datasets.kensho.com/s/sgpispeech
E. Full Multimodal / Multi-domain (Text–Image–Table–Audio, and more)
- MMMU — Massive Multidiscipline Multimodal Understanding benchmark. https://mmmu-benchmark.github.io/
- MMBench / MME / LVLM-eHub — Comprehensive evaluation suites for large vision–language models. https://mmbench.opencompass.org.cn/
- EgoSchema / Ego4D (video + audio + text) — Egocentric multi-sensor datasets. https://ego4d-data.org/
- MultiModal C4 (MMC4) — Web-scale corpus of documents with interleaved images and text. https://github.com/allenai/mmc4
- WebQA / MultimodalQA — QA over web images and text. https://github.com/omni-us/research-multimodalqa
- Chart/Document suites: DocLayNet, PubLayNet, DocVQA series. https://github.com/ibm-aur-nlp/PubLayNet
- ArXivDoc / ChartX / SynthChart — Synthetic + real doc/chart sets. https://github.com/vis-nlp/ChartX
F. Safety, Bias, and Accessibility-focused Sets
- Hateful Memes — Multimodal bias/toxicity benchmark. https://github.com/facebookresearch/mmf/tree/main/projects/hateful_memes
- ImageNet-A/O/R — Robustness variants. https://github.com/hendrycks/imagenet-r
- VizWiz (also listed under Text–Image) — Accessibility-oriented images and questions. https://vizwiz.org/
- MS MARCO (passage/document ranking) plus OCR corpora — Retrieval grounding for text-heavy documents. https://microsoft.github.io/msmarco/
G. Licensing and Usage Notes
- Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
- Maintain separate manifests with source, license, checksum, and intended use. Prefer mirrored, deduplicated shards with exact provenance.
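As a concrete illustration of that guidance, below is a minimal sketch of appending one record to a JSONL manifest such as `datasets/manifest/text_image.jsonl` (see the contribution notes further down). The field names and example values are a suggested schema with illustrative paths, not a fixed standard.

```python
# Minimal sketch of one manifest record (source, license, checksum, intended
# use) appended to a JSONL file under datasets/manifest/. Field names and
# example values are a suggested schema with illustrative paths, not a standard.
import hashlib
import json
from pathlib import Path

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a local shard or archive."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

record = {
    "name": "coco_2017_captions",
    "source_url": "https://cocodataset.org/#home",
    "license": "CC BY 4.0 annotations (verify image terms separately)",
    "local_path": "shards/coco_2017_captions-00000.tar",   # illustrative path
    "checksum_sha256": sha256_of("shards/coco_2017_captions-00000.tar"),
    "intended_use": "captioning pretraining and evaluation",
    "modalities": ["text", "image"],
}

manifest = Path("datasets/manifest/text_image.jsonl")
manifest.parent.mkdir(parents=True, exist_ok=True)
with manifest.open("a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```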
Call for Collaboration: Build UMA with Us
We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.
Roles we’re seeking (volunteer or sponsored collaborations):
- Research Scientists: Multimodal learning, alignment, grounding, evaluation.
- Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
- Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
- Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
- Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
- MLOps/Infra: Dataset storage, versioning, and scalable training/eval infrastructure (HF Datasets, WebDataset, Parquet, Arrow); see the storage sketch after this list.
- Community & Documentation: Tutorials, examples, benchmark harnesses, governance.
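For the storage formats named in the MLOps/Infra role, here is a minimal sketch of writing curated records to an Arrow-backed Parquet shard with PyArrow; paths and field names are illustrative only.

```python
# Minimal sketch: persist curated records as an Arrow-backed Parquet shard.
# Paths and field names are illustrative, not fixed project conventions.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"caption": "a dog catching a frisbee", "image_url": "https://example.com/1.jpg", "source": "coco_2017_captions"},
    {"caption": "a chart of quarterly revenue", "image_url": "https://example.com/2.png", "source": "chartqa"},
]

Path("shards").mkdir(exist_ok=True)
table = pa.Table.from_pylist(records)                      # in-memory Arrow table
pq.write_table(table, "shards/text_image-00000.parquet")   # one Parquet shard

# Reading it back (e.g., inside a dataloader) is symmetric:
print(pq.read_table("shards/text_image-00000.parquet").num_rows)
```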
How to get involved now:
- Open a Discussion with your background and interests: https://huggingface.co/amalsp/Universal-Multimodal-Agent/discussions
- Propose datasets or contribute manifests via PRs (add to datasets/manifest/*.jsonl)
- Share domain-specific tasks and evaluation rubrics
- Star and watch the repo for updates
Initial roadmap for data:
- Phase 1: Curate public datasets and licenses; build manifests and downloaders
- Phase 2: Unified preprocessing (image, OCR, tables, audio), deduplication, and quality filters (see the sketch after this list)
- Phase 3: Balanced training mixtures + eval suites (MMMU/MMBench/DocVQA/ASR)
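For Phase 2, here is a minimal deduplication sketch that drops exact duplicates by SHA-256 and near-duplicates by perceptual hash. It assumes the Pillow and imagehash packages; the threshold, paths, and quadratic scan are illustrative rather than production choices.

```python
# Minimal Phase 2 sketch: drop exact duplicates by SHA-256 and near-duplicates
# by perceptual hash. Assumes the Pillow and imagehash packages; the threshold,
# paths, and quadratic scan are illustrative, not production choices.
import hashlib
from pathlib import Path

import imagehash
from PIL import Image

PHASH_MAX_DISTANCE = 4  # Hamming distance threshold; tune per source

def dedup_images(image_dir: str) -> list[Path]:
    seen_sha256: set[str] = set()
    kept_hashes: list[imagehash.ImageHash] = []
    kept: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_sha256:
            continue  # exact byte-level duplicate
        phash = imagehash.phash(Image.open(path))
        if any(phash - kept_hash <= PHASH_MAX_DISTANCE for kept_hash in kept_hashes):
            continue  # near-duplicate (resized or re-encoded copy)
        seen_sha256.add(digest)
        kept_hashes.append(phash)
        kept.append(path)
    return kept

if __name__ == "__main__":
    survivors = dedup_images("data/raw/images")  # illustrative input directory
    print(f"kept {len(survivors)} unique images")
```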
Ethics & Safety:
- Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets.
- Document known biases and limitations; enable opt-out mechanisms where applicable.
Contributors will be acknowledged in the README and future preprint.
Original Project Overview
[Existing content retained below]