```python
import streamlit as st

st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")

# Hide Streamlit's default menu and footer.
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)

# st.beta_columns was removed in recent Streamlit releases; use st.columns instead.
col1, col2 = st.columns(2)
with col1:
    st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:")
    st.markdown("### **Key Aspects** :bulb:")
    st.markdown("""
1. **Interaction Protocol** 🤝
   - Define rules for communication and cooperation
2. **Decentralized Decision Making** 🎯
   - Autonomous agents make independent decisions
3. **Collaboration and Competition** 🤼
   - Agents work together or against each other
""")
with col2:
    st.markdown("### **Entities** :guards:")
    st.markdown("""
1. **Autonomous Agents** :robot_face:
   - Independent entities with decision-making capabilities
2. **Environment** 🌍
   - Shared space where agents interact
3. **Ruleset** 📜
   - Defines interaction protocol and decision-making processes
""")

st.markdown("---")
st.markdown("## **Interaction Protocol** 🤝 :bulb:")
st.markdown("### **Key Elements** :guards:")
st.markdown("""
1. **Communication** 🗣️
   - Agents exchange information
2. **Cooperation** 🤝
   - Agents coordinate toward shared goals
""")
```
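If the snippet above is saved as a file (say `app.py`; the filename is an assumption), it can be launched with `streamlit run app.py`. The page copy describes autonomous agents, a shared environment, and a ruleset defining the interaction protocol. As a minimal, framework-free illustration of those three entities, here is a sketch in plain Python; all class and function names are invented for the example.

```python
import random

class Agent:
    """Autonomous agent: makes independent decisions (decentralized decision making)."""
    def __init__(self, name):
        self.name = name

    def decide(self, observation):
        # A real agent would condition on the observation; here the policy is random.
        return random.choice(["cooperate", "compete"])

def interaction_protocol(action):
    """Ruleset: only actions permitted by the protocol are accepted."""
    return action in {"cooperate", "compete"}

class Environment:
    """Shared space in which the agents interact."""
    def __init__(self, agents):
        self.agents = agents
        self.history = []

    def step(self):
        observation = {"round": len(self.history)}
        proposed = {agent.name: agent.decide(observation) for agent in self.agents}
        # Apply the interaction protocol before accepting actions.
        accepted = {name: act for name, act in proposed.items() if interaction_protocol(act)}
        self.history.append(accepted)
        return accepted

env = Environment([Agent("A1"), Agent("A2")])
for _ in range(3):
    print(env.step())
```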
# 🩺🔍 Search Results
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
| *Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang* | |
| In this study, our goal is to create interactive avatar agents that can | |
| autonomously plan and animate nuanced facial movements realistically, from both | |
| visual and behavioral perspectives. Given high-level inputs about the | |
| environment and agent profile, our framework harnesses LLMs to produce a series | |
| of detailed text descriptions of the avatar agents' facial motions. These | |
| descriptions are then processed by our task-agnostic driving engine into motion | |
| token sequences, which are subsequently converted into continuous motion | |
| embeddings that are further consumed by our standalone neural-based renderer to | |
| generate the final photorealistic avatar animations. These streamlined | |
| processes allow our framework to adapt to a variety of non-verbal avatar | |
| interactions, both monadic and dyadic. Our extensive study, which includes | |
| experiments on both newly compiled and existing datasets featuring two types of | |
| agents -- one capable of monadic interaction with the environment, and the | |
| other designed for dyadic conversation -- validates the effectiveness and | |
| versatility of our approach. To our knowledge, we advanced a leap step by | |
| combining LLMs and neural rendering for generalized non-verbal prediction and | |
| photo-realistic rendering of avatar agents. | |
| --------------- | |
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
| *Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao* | |
| Controllable image captioning is an emerging multimodal topic that aims to | |
| describe the image with natural language following human purpose, | |
| $\textit{e.g.}$, looking at the specified regions or telling in a particular | |
| text style. State-of-the-art methods are trained on annotated pairs of input | |
| controls and output captions. However, the scarcity of such well-annotated | |
| multimodal data largely limits their usability and scalability for interactive | |
| AI systems. Leveraging unimodal instruction-following foundation models is a | |
| promising alternative that benefits from broader sources of data. In this | |
| paper, we present Caption AnyThing (CAT), a foundation model augmented image | |
| captioning framework supporting a wide range of multimodal controls: 1) visual | |
| controls, including points, boxes, and trajectories; 2) language controls, such | |
| as sentiment, length, language, and factuality. Powered by Segment Anything | |
| Model (SAM) and ChatGPT, we unify the visual and language prompts into a | |
| modularized framework, enabling the flexible combination between different | |
| controls. Extensive case studies demonstrate the user intention alignment | |
| capabilities of our framework, shedding light on effective user interaction | |
| modeling in vision-language applications. Our code is publicly available at | |
| https://github.com/ttengwang/Caption-Anything. | |
| --------------- | |
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
| *Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei* | |
| We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new | |
| capabilities of perceiving object descriptions (e.g., bounding boxes) and | |
| grounding text to the visual world. Specifically, we represent refer | |
| expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where | |
| object descriptions are sequences of location tokens. Together with multimodal | |
| corpora, we construct large-scale data of grounded image-text pairs (called | |
| GrIT) to train the model. In addition to the existing capabilities of MLLMs | |
| (e.g., perceiving general modalities, following instructions, and performing | |
| in-context learning), Kosmos-2 integrates the grounding capability into | |
| downstream applications. We evaluate Kosmos-2 on a wide range of tasks, | |
| including (i) multimodal grounding, such as referring expression comprehension, | |
| and phrase grounding, (ii) multimodal referring, such as referring expression | |
| generation, (iii) perception-language tasks, and (iv) language understanding | |
| and generation. This work lays out the foundation for the development of | |
| Embodiment AI and sheds light on the big convergence of language, multimodal | |
| perception, action, and world modeling, which is a key step toward artificial | |
| general intelligence. Code and pretrained models are available at | |
| https://aka.ms/kosmos-2. | |
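As a concrete illustration of the grounding format described above (a Markdown-style link whose target is a sequence of location tokens), here is a small sketch in Python. The `<loc_i>` token names and the 32-bin discretization grid are assumptions made for the example, not the paper's published specification.

```python
def box_to_location_tokens(box, image_size, bins=32):
    """Map a pixel-space box (x0, y0, x1, y1) to discrete location tokens.
    The <loc_i> naming and the bin count are illustrative assumptions."""
    w, h = image_size
    x0, y0, x1, y1 = box
    def to_bin(value, size):
        return min(bins - 1, int(value / size * bins))
    top_left = to_bin(y0, h) * bins + to_bin(x0, w)       # top-left grid cell index
    bottom_right = to_bin(y1, h) * bins + to_bin(x1, w)   # bottom-right grid cell index
    return f"<loc_{top_left}><loc_{bottom_right}>"

def ground_expression(text_span, box, image_size):
    """Render a referring expression as a Markdown-style grounded link."""
    return f"[{text_span}]({box_to_location_tokens(box, image_size)})"

print(ground_expression("a snowman", (10, 20, 180, 300), image_size=(640, 480)))
# -> [a snowman](<loc_...><loc_...>)
```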
| --------------- | |
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
| Screen user interfaces (UIs) and infographics, sharing similar visual | |
| language and design principles, play important roles in human communication and | |
| human-machine interaction. We introduce ScreenAI, a vision-language model that | |
| specializes in UI and infographics understanding. Our model improves upon the | |
| PaLI architecture with the flexible patching strategy of pix2struct and is | |
| trained on a unique mixture of datasets. At the heart of this mixture is a | |
| novel screen annotation task in which the model has to identify the type and | |
| location of UI elements. We use these text annotations to describe screens to | |
| Large Language Models and automatically generate question-answering (QA), UI | |
| navigation, and summarization training datasets at scale. We run ablation | |
| studies to demonstrate the impact of these design choices. At only 5B | |
| parameters, ScreenAI achieves new state-of-the-art results on UI- and | |
| infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget | |
| Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and | |
| InfographicVQA) compared to models of similar size. Finally, we release three | |
| new datasets: one focused on the screen annotation task and two others focused | |
| on question answering. | |
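The abstract describes turning screen annotations (UI element types and locations) into textual screen descriptions that an LLM then uses to generate QA, navigation, and summarization data. A minimal sketch of that data-generation step is below; the annotation fields and the prompt wording are assumptions, not the authors' actual schema.

```python
def describe_screen(annotations):
    """Turn (element_type, text, bounding_box) annotations into a textual screen description."""
    lines = []
    for elem_type, text, (x0, y0, x1, y1) in annotations:
        lines.append(f"{elem_type} '{text}' at ({x0}, {y0}, {x1}, {y1})")
    return "\n".join(lines)

def build_qa_generation_prompt(annotations):
    """Prompt an LLM to generate question-answer pairs about the described screen."""
    return (
        "Here is a description of a screen:\n"
        + describe_screen(annotations)
        + "\n\nGenerate question-answer pairs about this screen."
    )

example = [
    ("BUTTON", "Sign in", (20, 40, 120, 80)),
    ("TEXT", "Welcome back", (20, 100, 300, 140)),
]
print(build_qa_generation_prompt(example))
```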
| --------------- | |
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
| *Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu* | |
| Task-oriented conversational agents rely on semantic parsers to translate | |
| natural language to formal representations. In this paper, we propose the | |
| design and rationale of the ThingTalk formal representation, and how the design | |
| improves the development of transactional task-oriented agents. | |
| ThingTalk is built on four core principles: (1) representing user requests | |
| directly as executable statements, covering all the functionality of the agent, | |
| (2) representing dialogues formally and succinctly to support accurate | |
| contextual semantic parsing, (3) standardizing types and interfaces to maximize | |
| reuse between agents, and (4) allowing multiple, independently-developed agents | |
| to be composed in a single virtual assistant. ThingTalk is developed as part of | |
| the Genie Framework that allows developers to quickly build transactional | |
| agents given a database and APIs. | |
| We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. | |
| Compared to the others, the ThingTalk design is both more general and more | |
| cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and | |
| associated tools yields a new state of the art accuracy of 79% turn-by-turn. | |
| --------------- | |
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
| *Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould* | |
| In the pursuit of efficient automated content creation, procedural | |
| generation, leveraging modifiable parameters and rule-based systems, emerges as | |
| a promising approach. Nonetheless, it could be a demanding endeavor, given its | |
| intricate nature necessitating a deep understanding of rules, algorithms, and | |
| parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing | |
| large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT | |
| positions LLMs as proficient problem solvers, dissecting the procedural 3D | |
| modeling tasks into accessible segments and appointing the apt agent for each | |
| task. 3D-GPT integrates three core agents: the task dispatch agent, the | |
| conceptualization agent, and the modeling agent. They collaboratively achieve | |
| two objectives. First, it enhances concise initial scene descriptions, evolving | |
| them into detailed forms while dynamically adapting the text based on | |
| subsequent instructions. Second, it integrates procedural generation, | |
| extracting parameter values from enriched text to effortlessly interface with | |
| 3D software for asset creation. Our empirical investigations confirm that | |
| 3D-GPT not only interprets and executes instructions, delivering reliable | |
| results but also collaborates effectively with human designers. Furthermore, it | |
| seamlessly integrates with Blender, unlocking expanded manipulation | |
| possibilities. Our work highlights the potential of LLMs in 3D modeling, | |
| offering a basic framework for future advancements in scene generation and | |
| animation. | |
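3D-GPT is described as three cooperating agents: a task dispatch agent, a conceptualization agent that enriches the scene description, and a modeling agent that extracts procedural parameters for 3D software. The sketch below mirrors that division of labor with toy logic; the function names, the parameter format, and the stand-in for the LLM call are placeholders, not the paper's implementation.

```python
import re

def dispatch_agent(instruction):
    """Toy task dispatch: decide which agents a user instruction should visit."""
    return ["conceptualization", "modeling"]

def conceptualization_agent(instruction):
    """Enrich a terse scene description (stand-in for an LLM call)."""
    return instruction + " The scene has soft morning light and a dirt path."

def modeling_agent(description):
    """Extract numeric parameters from enriched text for a procedural 3D generator."""
    numbers = re.findall(r"\d+(?:\.\d+)?", description)
    return {"tree_count": int(float(numbers[0])) if numbers else 1}

instruction = "a meadow with 12 trees"
plan = dispatch_agent(instruction)
scene_text = conceptualization_agent(instruction) if "conceptualization" in plan else instruction
params = modeling_agent(scene_text) if "modeling" in plan else {}
print(params)  # {'tree_count': 12}
```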
| --------------- | |
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
| *Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan* | |
| Equipping embodied agents with commonsense is important for robots to | |
| successfully complete complex human instructions in general environments. | |
| Recent large language models (LLM) can embed rich semantic knowledge for agents | |
| in plan generation of complex tasks, while they lack the information about the | |
| realistic world and usually yield infeasible action sequences. In this paper, | |
| we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning | |
| with physical scene constraint, where the agent generates executable plans | |
| according to the existed objects in the scene by aligning LLMs with the visual | |
| perception models. Specifically, we first construct a multimodal dataset | |
| containing triplets of indoor scenes, instructions and action plans, where we | |
| provide the designed prompts and the list of existing objects in the scene for | |
| GPT-3.5 to generate a large number of instructions and corresponding planned | |
| actions. The generated data is leveraged for grounded plan tuning of | |
| pre-trained LLMs. During inference, we discover the objects in the scene by | |
| extending open-vocabulary object detectors to multi-view RGB images collected | |
| in different achievable locations. Experimental results show that the generated | |
| plan from our TaPA framework can achieve higher success rate than LLaVA and | |
| GPT-3.5 by a sizable margin, which indicates the practicality of embodied task | |
| planning in general and complex environments. | |
| --------------- | |
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
| *Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang* | |
| Recent advancements in vision-language pre-training (e.g. CLIP) have shown | |
| that vision models can benefit from language supervision. While many models | |
| using language modality have achieved great success on 2D vision tasks, the | |
| joint representation learning of 3D point cloud with text remains | |
| under-explored due to the difficulty of 3D-Text data pair acquisition and the | |
| irregularity of 3D data structure. In this paper, we propose a novel Text4Point | |
| framework to construct language-guided 3D point cloud models. The key idea is | |
| utilizing 2D images as a bridge to connect the point cloud and the language | |
| modalities. The proposed Text4Point follows the pre-training and fine-tuning | |
| paradigm. During the pre-training stage, we establish the correspondence of | |
| images and point clouds based on the readily available RGB-D data and use | |
| contrastive learning to align the image and point cloud representations. | |
| Together with the well-aligned image and text features achieved by CLIP, the | |
| point cloud features are implicitly aligned with the text embeddings. Further, | |
| we propose a Text Querying Module to integrate language information into 3D | |
| representation learning by querying text embeddings with point cloud features. | |
| For fine-tuning, the model learns task-specific 3D representations under | |
| informative language guidance from the label set without 2D images. Extensive | |
| experiments demonstrate that our model shows consistent improvement on various | |
| downstream tasks, such as point cloud semantic segmentation, instance | |
| segmentation, and object detection. The code will be available here: | |
| https://github.com/LeapLabTHU/Text4Point | |
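The pre-training stage described above aligns image and point-cloud representations with contrastive learning over RGB-D correspondences. A standard symmetric InfoNCE loss of the kind commonly used for such alignment is sketched below in PyTorch; the embedding dimensions and temperature are arbitrary, and this is a generic formulation rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, pc_emb, temperature=0.07):
    """Symmetric InfoNCE between paired image and point-cloud embeddings.
    img_emb, pc_emb: (batch, dim) tensors where row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    pc_emb = F.normalize(pc_emb, dim=-1)
    logits = img_emb @ pc_emb.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2p = F.cross_entropy(logits, targets)          # image -> point cloud
    loss_p2i = F.cross_entropy(logits.t(), targets)      # point cloud -> image
    return 0.5 * (loss_i2p + loss_p2i)

# Toy check with random embeddings
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```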
| --------------- | |
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
| *Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji* | |
| Large Language Model (LLM) agents, capable of performing a broad range of | |
| actions, such as invoking tools and controlling robots, show great potential in | |
| tackling real-world challenges. LLM agents are typically prompted to produce | |
| actions by generating JSON or text in a pre-defined format, which is usually | |
| limited by constrained action space (e.g., the scope of pre-defined tools) and | |
| restricted flexibility (e.g., inability to compose multiple tools). This work | |
| proposes to use executable Python code to consolidate LLM agents' actions into | |
| a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct | |
| can execute code actions and dynamically revise prior actions or emit new | |
| actions upon new observations through multi-turn interactions. Our extensive | |
| analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that | |
| CodeAct outperforms widely used alternatives (up to 20% higher success rate). | |
| The encouraging performance of CodeAct motivates us to build an open-source LLM | |
| agent that interacts with environments by executing interpretable code and | |
| collaborates with users using natural language. To this end, we collect an | |
| instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn | |
| interactions using CodeAct. We show that it can be used with existing data to | |
| improve models in agent-oriented tasks without compromising their general | |
| capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with | |
| Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., | |
| model training) using existing libraries and autonomously self-debug. | |
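CodeAct's core idea is that the agent's action is a piece of executable Python rather than a constrained JSON call, with execution results fed back over multiple turns. The minimal loop below illustrates that pattern; `call_llm` is a placeholder for whatever model backend is used, and a real system would sandbox execution rather than use bare `exec`.

```python
import io
import contextlib

def call_llm(conversation):
    """Placeholder for an LLM call that returns the next code action as a string."""
    return "result = sum(range(10))\nprint(result)"

def execute_code_action(code, state):
    """Run a code action and capture stdout so it can be returned as an observation."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, state)          # a real agent would use a sandboxed interpreter
    return buffer.getvalue()

def agent_loop(task, max_turns=3):
    conversation = [{"role": "user", "content": task}]
    state = {}                      # persistent interpreter state across turns
    for _ in range(max_turns):
        code = call_llm(conversation)
        observation = execute_code_action(code, state)
        conversation.append({"role": "assistant", "content": code})
        conversation.append({"role": "user", "content": f"Observation:\n{observation}"})
    return conversation

print(agent_loop("Add the numbers 0..9")[-1]["content"])
```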
| --------------- | |
### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
| *Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried* | |
| Autonomous agents capable of planning, reasoning, and executing actions on | |
| the web offer a promising avenue for automating computer tasks. However, the | |
| majority of existing benchmarks primarily focus on text-based agents, | |
| neglecting many natural tasks that require visual information to effectively | |
| solve. Given that most computer interfaces cater to human perception, visual | |
| information often augments textual data in ways that text-only models struggle | |
| to harness effectively. To bridge this gap, we introduce VisualWebArena, a | |
| benchmark designed to assess the performance of multimodal web agents on | |
| realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set | |
| of diverse and complex web-based tasks that evaluate various capabilities of | |
| autonomous multimodal agents. To perform on this benchmark, agents need to | |
| accurately process image-text inputs, interpret natural language instructions, | |
| and execute actions on websites to accomplish user-defined objectives. We | |
| conduct an extensive evaluation of state-of-the-art LLM-based autonomous | |
| agents, including several multimodal models. Through extensive quantitative and | |
| qualitative analysis, we identify several limitations of text-only LLM agents, | |
| and reveal gaps in the capabilities of state-of-the-art multimodal language | |
| agents. VisualWebArena provides a framework for evaluating multimodal | |
| autonomous language agents, and offers insights towards building stronger | |
| autonomous agents for the web. Our code, baseline models, and data is publicly | |
| available at https://jykoh.com/vwa. | |
| --------------- | |
### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
| *Seungwhan Moon, Leonardo Neves, Vitor Carvalho* | |
| We introduce a new task called Multimodal Named Entity Recognition (MNER) for | |
| noisy user-generated data such as tweets or Snapchat captions, which comprise | |
| short text with accompanying images. These social media posts often come in | |
| inconsistent or incomplete syntax and lexical notations with very limited | |
| surrounding textual contexts, bringing significant challenges for NER. To this | |
| end, we create a new dataset for MNER called SnapCaptions (Snapchat | |
| image-caption pairs submitted to public and crowd-sourced stories with fully | |
| annotated named entities). We then build upon the state-of-the-art Bi-LSTM | |
| word/character based NER models with 1) a deep image network which incorporates | |
| relevant visual context to augment textual information, and 2) a generic | |
| modality-attention module which learns to attenuate irrelevant modalities while | |
| amplifying the most informative ones to extract contexts from, adaptive to each | |
| sample and token. The proposed MNER model with modality attention significantly | |
| outperforms the state-of-the-art text-only NER models by successfully | |
| leveraging provided visual contexts, opening up potential applications of MNER | |
| on myriads of social media platforms. | |
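The "generic modality-attention module" described above learns to weight word, character, and visual features per token before fusing them. A compact PyTorch rendition of that idea follows; the dimensions and the exact gating form are assumptions rather than the paper's published architecture.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Weight word, character, and visual features per token, then fuse them."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, word_feat, char_feat, visual_feat):
        # Each input: (batch, seq, dim)
        stacked = torch.stack([word_feat, char_feat, visual_feat], dim=2)  # (b, s, 3, d)
        weights = torch.softmax(self.score(stacked).squeeze(-1), dim=-1)   # (b, s, 3)
        # Attenuate irrelevant modalities, amplify informative ones.
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)                # (b, s, d)

fusion = ModalityAttention(dim=64)
out = fusion(torch.randn(2, 5, 64), torch.randn(2, 5, 64), torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```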
| --------------- | |
### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
| *Zhuosheng Zhang, Aston Zhang* | |
| Autonomous user interface (UI) agents aim to facilitate task automation by | |
| interacting with the user interface without manual intervention. Recent studies | |
| have investigated eliciting the capabilities of large language models (LLMs) | |
| for effective engagement in diverse environments. To align with the | |
| input-output requirement of LLMs, existing approaches are developed under a | |
| sandbox setting where they rely on external tools and application-specific APIs | |
| to parse the environment into textual elements and interpret the predicted | |
| actions. Consequently, those approaches often grapple with inference | |
| inefficiency and error propagation risks. To mitigate the challenges, we | |
| introduce Auto-UI, a multimodal solution that directly interacts with the | |
| interface, bypassing the need for environment parsing or reliance on | |
| application-dependent APIs. Moreover, we propose a chain-of-action technique -- | |
| leveraging a series of intermediate previous action histories and future action | |
| plans -- to help the agent decide what action to execute. We evaluate our | |
| approach on a new device-control benchmark AITW with 30K unique instructions, | |
| spanning multi-step tasks such as application operation, web searching, and web | |
| shopping. Experimental results show that Auto-UI achieves state-of-the-art | |
| performance with an action type prediction accuracy of 90% and an overall | |
| action success rate of 74%. Code is publicly available at | |
| https://github.com/cooelf/Auto-UI. | |
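The chain-of-action technique is described as conditioning the next action on previous action histories plus a future action plan. A sketch of how such an input might be assembled for the model is below; the field names and formatting are illustrative assumptions, not Auto-UI's actual input schema.

```python
def build_chain_of_action_input(goal, previous_actions, future_plan):
    """Assemble a textual model input from the goal, executed actions, and planned actions."""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(previous_actions)) or "none"
    plan = "\n".join(f"- {a}" for a in future_plan) or "none"
    return (
        f"Goal: {goal}\n"
        f"Previous actions:\n{history}\n"
        f"Planned future actions:\n{plan}\n"
        "Next action:"
    )

print(build_chain_of_action_input(
    goal="search for running shoes",
    previous_actions=["open browser", "tap search bar"],
    future_plan=["type query", "press enter"],
))
```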
| --------------- | |
### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
| *Victor Dibia* | |
| Systems that support users in the automatic creation of visualizations must | |
| address several subtasks - understand the semantics of data, enumerate relevant | |
| visualization goals and generate visualization specifications. In this work, we | |
| pose visualization generation as a multi-stage generation problem and argue | |
| that well-orchestrated pipelines based on large language models (LLMs) such as | |
| ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing | |
| these tasks. We present LIDA, a novel tool for generating grammar-agnostic | |
| visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER | |
| that converts data into a rich but compact natural language summary, a GOAL | |
| EXPLORER that enumerates visualization goals given the data, a VISGENERATOR | |
| that generates, refines, executes and filters visualization code and an | |
| INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA | |
| provides a python api, and a hybrid user interface (direct manipulation and | |
| multilingual natural language) for interactive chart, infographics and data | |
| story generation. Learn more about the project here - | |
| https://microsoft.github.io/lida/ | |
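LIDA's four modules (SUMMARIZER, GOAL EXPLORER, VISGENERATOR, INFOGRAPHER) form a pipeline from raw data to charts. The sketch below mimics only the first two stages with plain pandas so the data flow is visible; it is not the LIDA Python API, whose actual entry points may differ.

```python
import pandas as pd

def summarize(df):
    """SUMMARIZER-style step: compress a dataframe into a compact natural-language summary."""
    parts = [f"{col} ({df[col].dtype}, {df[col].nunique()} unique values)" for col in df.columns]
    return f"Dataset with {len(df)} rows and columns: " + "; ".join(parts)

def explore_goals(summary):
    """GOAL EXPLORER-style step: propose visualization goals (an LLM does this in LIDA)."""
    return [
        "Show the distribution of each numeric column",
        "Compare categories against the numeric columns",
    ]

df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp_c": [3.5, 19.0, 28.2]})
summary = summarize(df)
for goal in explore_goals(summary):
    print(goal)   # a VISGENERATOR step would turn each goal into chart code
```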
| --------------- | |
### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
| *Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le* | |
| Video paragraph captioning aims to generate a multi-sentence description of | |
| an untrimmed video with several temporal event locations in coherent | |
| storytelling. Following the human perception process, where the scene is | |
| effectively understood by decomposing it into visual (e.g. human, animal) and | |
| non-visual components (e.g. action, relations) under the mutual influence of | |
| vision and language, we first propose a visual-linguistic (VL) feature. In the | |
| proposed VL feature, the scene is modeled by three modalities including (i) a | |
| global visual environment; (ii) local visual main agents; (iii) linguistic | |
| scene elements. We then introduce an autoregressive Transformer-in-Transformer | |
| (TinT) to simultaneously capture the semantic coherence of intra- and | |
| inter-event contents within a video. Finally, we present a new VL contrastive | |
| loss function to guarantee learnt embedding features are matched with the | |
| captions semantics. Comprehensive experiments and extensive ablation studies on | |
| ActivityNet Captions and YouCookII datasets show that the proposed | |
| Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior | |
| state-of-the-art methods on accuracy and diversity. Source code is made | |
| publicly available at: https://github.com/UARK-AICV/VLTinT. | |
| --------------- | |
### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
*Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
| More than a decade has passed since the development of FearNot!, an | |
| application designed to help children deal with bullying through role-playing | |
| with virtual characters. It was also the application that led to the creation | |
| of FAtiMA, an affective agent architecture for creating autonomous characters | |
| that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a | |
| collection of open-source tools that is designed to help researchers, game | |
| developers and roboticists incorporate a computational model of emotion and | |
| decision-making in their work. The toolkit was developed with the goal of | |
| making FAtiMA more accessible, easier to incorporate into different projects | |
| and more flexible in its capabilities for human-agent interaction, based upon | |
| the experience gathered over the years across different virtual environments | |
| and human-robot interaction scenarios. As a result, this work makes several | |
| different contributions to the field of Agent-Based Architectures. More | |
| precisely, FAtiMA Toolkit's library based design allows developers to easily | |
| integrate it with other frameworks, its meta-cognitive model affords different | |
| internal reasoners and affective components and its explicit dialogue structure | |
| gives control to the author even within highly complex scenarios. To | |
| demonstrate the use of FAtiMA Toolkit, several different use cases where the | |
| toolkit was successfully applied are described and discussed. | |
| --------------- | |
### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
| *Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri* | |
| In the absence of nonverbal cues during messaging communication, users | |
| express part of their emotions using emojis. Thus, having emojis in the | |
| vocabulary of text messaging language models can significantly improve many | |
| natural language processing (NLP) applications such as online communication | |
| analysis. On the other hand, word embedding models are usually trained on a | |
| very large corpus of text such as Wikipedia or Google News datasets that | |
| include very few samples with emojis. In this study, we create emojiSpace, | |
| which is a combined word-emoji embedding using the word2vec model from the | |
| Gensim library in Python. We trained emojiSpace on a corpus of more than 4 | |
| billion tweets and evaluated it by implementing sentiment analysis on a Twitter | |
| dataset containing more than 67 million tweets as an extrinsic task. For this | |
| task, we compared the performance of two different classifiers of random forest | |
| (RF) and linear support vector machine (SVM). For evaluation, we compared | |
| emojiSpace performance with two other pre-trained embeddings and demonstrated | |
| that emojiSpace outperforms both. | |
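Since the embedding is described as a word2vec model trained with Gensim on tweets that keep emojis in the token stream, a small reproduction sketch looks like the following; the corpus, hyperparameters, and tokenization here are placeholders rather than the paper's settings.

```python
from gensim.models import Word2Vec

# Toy stand-in for a tokenized tweet corpus in which emojis are kept as tokens.
tweets = [
    ["great", "game", "tonight", "⚽", "🔥"],
    ["so", "tired", "😴", "need", "coffee", "☕"],
    ["love", "this", "song", "🎶", "🔥"],
] * 100  # repeated so the toy vocabulary clears min_count

model = Word2Vec(
    sentences=tweets,
    vector_size=100,   # embedding dimensionality
    window=5,
    min_count=5,
    workers=2,
    sg=1,              # skip-gram
)

# Words and emojis now share one vector space.
print(model.wv.most_similar("🔥", topn=3))
```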
| --------------- | |
### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
| *Grigori Fursin, Herve Guillou and Nicolas Essayan* | |
| We present CodeReef - an open platform to share all the components necessary | |
| to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML | |
| models across diverse systems in the most efficient way. We also introduce the | |
| CodeReef solution - a way to package and share models as non-virtualized, | |
| portable, customizable and reproducible archive files. Such ML packages include | |
| JSON meta description of models with all dependencies, Python APIs, CLI actions | |
| and portable workflows necessary to automatically build, benchmark, test and | |
| customize models across diverse platforms, AI frameworks, libraries, compilers | |
| and datasets. We demonstrate several CodeReef solutions to automatically build, | |
| run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO | |
| dataset from the latest MLPerf inference benchmark across a wide range of | |
| platforms from Raspberry Pi, Android phones and IoT devices to data centers. | |
| Our long-term goal is to help researchers share their new techniques as | |
| production-ready packages along with research papers to participate in | |
| collaborative and reproducible benchmarking, compare the different | |
| ML/software/hardware stacks and select the most efficient ones on a Pareto | |
| frontier using online CodeReef dashboards. | |
| --------------- | |
### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
| *Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov* | |
| For decades, human-computer interaction has fundamentally been manual. Even | |
| today, almost all productive work done on the computer necessitates human input | |
| at every step. Autonomous virtual agents represent an exciting step in | |
| automating many of these menial tasks. Virtual agents would empower users with | |
| limited technical proficiency to harness the full possibilities of computer | |
| systems. They could also enable the efficient streamlining of numerous computer | |
| tasks, ranging from calendar management to complex travel bookings, with | |
| minimal human intervention. In this paper, we introduce OmniACT, the | |
| first-of-a-kind dataset and benchmark for assessing an agent's capability to | |
| generate executable programs to accomplish computer tasks. Our scope extends | |
| beyond traditional web automation, covering a diverse range of desktop | |
| applications. The dataset consists of fundamental tasks such as "Play the next | |
| song", as well as longer horizon tasks such as "Send an email to John Doe | |
| mentioning the time and place to meet". Specifically, given a pair of screen | |
| image and a visually-grounded natural language task, the goal is to generate a | |
| script capable of fully executing the task. We run several strong baseline | |
| language model agents on our benchmark. The strongest baseline, GPT-4, performs | |
| the best on our benchmark. However, its performance level still reaches only 15% | |
| of the human proficiency in generating executable scripts capable of completing | |
| the task, demonstrating the challenge of our task for conventional web agents. | |
| Our benchmark provides a platform to measure and evaluate the progress of | |
| language model agents in automating computer tasks and motivates future work | |
| towards building multimodal models that bridge large language models and the | |
| visual grounding of computer screens. | |
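OmniACT pairs a screen image and a natural-language task with an executable automation script as the target output. The snippet below shows what one such (task, script) pair could look like as a datapoint; the PyAutoGUI-style action calls, the field names, and the file path are assumptions for illustration, not the dataset's exact format.

```python
# One illustrative (task, script) datapoint of the kind the benchmark evaluates.
datapoint = {
    "task": "Send an email to John Doe mentioning the time and place to meet",
    "screenshot": "screens/mail_client_0421.png",          # hypothetical path
    "script": "\n".join([
        "import pyautogui",
        "pyautogui.click(102, 540)                    # 'Compose' button",
        "pyautogui.write('john.doe@example.com')      # recipient field",
        "pyautogui.press('tab')",
        "pyautogui.write('Meet at 3pm at the cafe on Main St.')",
        "pyautogui.click(480, 900)                    # 'Send' button",
    ]),
}
print(datapoint["script"])
```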
| --------------- | |
### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
| *Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong* | |
| Proactive human-robot interaction (HRI) allows the receptionist robots to | |
| actively greet people and offer services based on vision, which has been found | |
| to improve acceptability and customer satisfaction. Existing approaches are | |
| either based on multi-stage decision processes or based on end-to-end decision | |
| models. However, the rule-based approaches require sedulous expert efforts and | |
| only handle minimal pre-defined scenarios. On the other hand, existing works | |
| with end-to-end models are limited to very general greetings or few behavior | |
| patterns (typically less than 10). To address those challenges, we propose a | |
| new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot | |
| Interaction (TFVT-HRI). The proposed framework extracts visual tokens of | |
| relative objects from an RGB camera first. To ensure the correct interpretation | |
| of the scenario, a transformer decision model is then employed to process the | |
| visual tokens, which is augmented with the temporal and spatial information. It | |
| predicts the appropriate action to take in each scenario and identifies the | |
| right target. Our data is collected from an in-service receptionist robot in an | |
| office building, which is then annotated by experts for appropriate proactive | |
| behavior. The action set includes 1000+ diverse patterns by combining language, | |
| emoji expression, and body motions. We compare our model with other SOTA | |
| end-to-end models on both offline test sets and online user experiments in | |
| realistic office building environments to validate this framework. It is | |
| demonstrated that the decision model achieves SOTA performance in action | |
| triggering and selection, resulting in more humanness and intelligence when | |
| compared with the previous reactive reception policies. | |
| --------------- | |
### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
| *Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa* | |
| This article presents the design and the implementation of a cloud system for | |
| knowledge-based autonomous interaction devised for Social Robots and other | |
| conversational agents. The system is particularly convenient for low-cost | |
| robots and devices: it can be used as a stand-alone dialogue system or as an | |
| integration to provide "background" dialogue capabilities to any preexisting | |
| Natural Language Processing ability that the robot may already have as part of | |
| its basic skills. By connecting to the cloud, developers are provided with a | |
| sustainable solution to manage verbal interaction through a network connection, | |
| with about 3,000 topics of conversation ready for "chit-chatting" and a library | |
| of pre-cooked plans that only needs to be grounded into the robot's physical | |
| capabilities. The system is structured as a set of REST API endpoints so that | |
| it can be easily expanded by adding new APIs to improve the capabilities of the | |
| clients connected to the cloud. Another key feature of the system is that it | |
| has been designed to make the development of its clients straightforward: in | |
| this way, multiple robots and devices can be easily endowed with the capability | |
| of autonomously interacting with the user, understanding when to perform | |
| specific actions, and exploiting all the information provided by cloud | |
| services. The article outlines and discusses the results of the experiments | |
| performed to assess the system's performance in terms of response time, paving | |
| the way for its use both for research and market solutions. Links to | |
| repositories with clients for ROS and popular robots such as Pepper and NAO are | |
| available on request. | |
| --------------- | |