---
license: apache-2.0
tags:
  - multimodal
  - agentic-ai
  - retrieval-augmented
  - explainable-ai
  - reasoning
  - automation
  - accessibility
  - vision-language
  - audio-processing
  - table-understanding
language:
  - en
  - multilingual
pipeline_tag: any-to-any
---

Universal-Multimodal-Agent (UMA)

New: Multimodal Datasets Catalog (Phase 1 Data Collection)

We’re kicking off data collection for UMA. Below is a curated, growing catalog of top public multimodal datasets by category with brief notes and links for immediate use.

A. Text–Image

B. Text–Image Reasoning / VQA / Document QA

C. Text–Table (Structured Data)

D. Text–Audio / Speech

E. Full Multimodal / Multi-domain (Text–Image–Table–Audio, and more)

F. Safety, Bias, and Accessibility-focused Sets

G. Licensing and Usage Notes

  • Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
  • Maintain separate manifests with source, license, checksum, and intended use. Prefer mirrored, deduplicated shards with exact provenance. A manifest sketch follows this list.
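
For concreteness, here is a minimal Python sketch of one manifest entry with a streamed SHA-256 checksum. The field names, the `manifests/` path, and the example shard are illustrative suggestions, not a fixed schema:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest_entry(path: Path, source_url: str, license_id: str, intended_use: str) -> dict:
    """One record per shard: source, license, checksum, and intended use."""
    return {
        "file": path.name,
        "source": source_url,
        "license": license_id,
        "sha256": sha256_of(path),
        "intended_use": intended_use,
    }

# Example: append one entry to a JSONL manifest (all paths and values are illustrative).
shard = Path("shards/textimage-00000.tar")
entry = manifest_entry(shard, "https://example.org/dataset", "CC-BY-4.0", "pretraining")
Path("manifests").mkdir(exist_ok=True)
with open("manifests/textimage.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```

One JSONL manifest per source keeps license and provenance auditable independently of the shards themselves.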

Call for Collaboration: Build UMA with Us

We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.

Roles we’re seeking (volunteer or sponsored collaborations):

  • Research Scientists: Multimodal learning, alignment, grounding, evaluation.
  • Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
  • Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
  • Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
  • Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
  • MLOps/Infra: Dataset storage, versioning, scalable training/eval infra (HF Datasets, WebDataset, Parquet, Arrow).
  • Community & Documentation: Tutorials, examples, benchmark harnesses, governance.

How to get involved now:

Initial roadmap for data:

  • Phase 1: Curate public datasets and licenses; build manifests and downloaders
  • Phase 2: Unified preprocessing (image, OCR, tables, audio), deduping, quality filters (a dedup sketch follows this list)
  • Phase 3: Balanced training mixtures + eval suites (MMMU/MMBench/DocVQA/ASR)
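
As a rough sketch of the Phase 2 dedup step, here is exact-duplicate removal by content hash in Python; near-duplicate detection (e.g. perceptual hashing or embedding similarity) would layer on top of this, and the directory path is illustrative:

```python
import hashlib
from pathlib import Path
from typing import Iterable

def dedupe_by_hash(paths: Iterable[Path]) -> list[Path]:
    """Keep the first occurrence of each distinct file; drop exact byte-level duplicates."""
    seen: set[str] = set()
    kept: list[Path] = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept

# Example over a directory of downloaded samples (path is illustrative).
unique = dedupe_by_hash(sorted(Path("raw/images").glob("*.jpg")))
print(f"kept {len(unique)} unique files")
```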

Ethics & Safety:

  • Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets (see the filter sketch after this list).
  • Document known biases and limitations; enable opt-out mechanisms where applicable.
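
A minimal sketch of a filter-list gate in Python; the blocked-term list, field names, and samples are placeholders, and a real pipeline would pair this with learned safety classifiers and human review:

```python
import re

# Placeholder blocklist, maintained per project policy; not a real term list.
BLOCKED_TERMS = ["example-blocked-term"]
BLOCKED_RE = re.compile("|".join(re.escape(t) for t in BLOCKED_TERMS), re.IGNORECASE)

def passes_filter(sample: dict) -> bool:
    """Return True if the sample's text field contains no blocked terms."""
    return not BLOCKED_RE.search(sample.get("text", ""))

# Illustrative usage on two toy samples.
samples = [{"text": "a clean caption"}, {"text": "contains example-blocked-term"}]
clean = [s for s in samples if passes_filter(s)]
print(f"kept {len(clean)} of {len(samples)} samples")
```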

Contributors will be acknowledged in the README and future preprint.

Original Project Overview

[Existing content retained below]