import streamlit as st

st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")

hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)

col1, col2 = st.columns(2)

with col1:
| st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:**") | |
| st.markdown("### **Key Aspects** :bulb:") | |
| st.markdown(""" | |
| 1. **Interaction Protocol** 🤝 \n | |
| - Define rules for communication and cooperation \n | |
| 2. **Decentralized Decision Making** 🎯 \n | |
| - Autonomous agents make independent decisions \n | |
| 3. **Collaboration and Competition** 🤼 \n | |
| - Agents work together or against each other \n | |
| """) | |
with col2:
    st.markdown("### **Entities** 💂")
    st.markdown("""
1. **Autonomous Agents** 🤖 \n
    - Independent entities with decision-making capabilities \n
2. **Environment** 🌐 \n
    - Shared space where agents interact \n
3. **Ruleset** 📜 \n
    - Defines interaction protocol and decision-making processes \n
""")

st.markdown("---")
| st.markdown("## **Interaction Protocol** 🤝 :bulb:**") | |
| st.markdown("### **Key Elements** :guards:") | |
| st.markdown(""" | |
| 1. **Communication** 🗣 \n | |
| - Agents exchange information \n | |
| 2. **Cooperation** 🤝 \n | |
| -# 🩺🔍 Search Results | |
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*

In this study, our goal is to create interactive avatar agents that can
autonomously plan and animate nuanced facial movements realistically, from both
visual and behavioral perspectives. Given high-level inputs about the
environment and agent profile, our framework harnesses LLMs to produce a series
of detailed text descriptions of the avatar agents' facial motions. These
descriptions are then processed by our task-agnostic driving engine into motion
token sequences, which are subsequently converted into continuous motion
embeddings that are further consumed by our standalone neural-based renderer to
generate the final photorealistic avatar animations. These streamlined
processes allow our framework to adapt to a variety of non-verbal avatar
interactions, both monadic and dyadic. Our extensive study, which includes
experiments on both newly compiled and existing datasets featuring two types of
agents -- one capable of monadic interaction with the environment, and the
other designed for dyadic conversation -- validates the effectiveness and
versatility of our approach. To our knowledge, we advanced a leap step by
combining LLMs and neural rendering for generalized non-verbal prediction and
photo-realistic rendering of avatar agents.
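
Below is a hedged, purely illustrative sketch of the staged pipeline the abstract outlines (LLM planner, driving engine, motion embeddings, renderer). Every function name and the token scheme are assumptions for demonstration; this is not the paper's code or API.

```python
def plan_facial_motion(profile: str, context: str) -> str:
    # Stand-in for the LLM planner: emits a detailed text description of facial motion.
    return f"{profile} raises eyebrows and smiles softly while listening to: {context}"

def drive(description: str) -> list:
    # Stand-in for the task-agnostic driving engine: text -> discrete motion tokens.
    return [hash(word) % 512 for word in description.split()]

def embed(tokens: list) -> list:
    # Discrete tokens -> continuous motion embeddings (here just a toy rescaling).
    return [t / 512.0 for t in tokens]

def render(embeddings: list) -> str:
    # Stand-in for the standalone neural renderer.
    return f"rendered {len(embeddings)} frames of avatar animation"

print(render(embed(drive(plan_facial_motion("Avatar-A", "a question about the weather")))))
```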

---------------
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*

Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
*e.g.*, looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling the flexible combination between different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything.
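
As a rough sketch only: the modular segment, caption, refine flow described above can be pictured as three swappable stages. The stubs below are invented stand-ins, not the real SAM or ChatGPT interfaces used by Caption-Anything.

```python
def segment(image: str, visual_control: dict) -> str:
    # Stand-in for a promptable segmenter: a point, box, or trajectory selects a region.
    return f"region of {image} selected by {visual_control}"

def caption(region: str) -> str:
    # Stand-in for a base image captioner describing the selected region.
    return f"an object visible in the {region}"

def refine(raw_caption: str, language_controls: dict) -> str:
    # Stand-in for an instruction-following LLM applying sentiment/length/style controls.
    style = ", ".join(f"{k}={v}" for k, v in language_controls.items())
    return f"({style}) {raw_caption}"

result = refine(
    caption(segment("street_photo.jpg", {"point": (120, 64)})),
    {"sentiment": "positive", "length": "short"},
)
print(result)
```

Because each stage exchanges only plain text and simple controls, any one of them could be swapped out independently, which is the modularity the abstract emphasizes.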

---------------
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent refer
expressions as links in Markdown, i.e., `[text span](bounding boxes)`, where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension,
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodiment AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Code and pretrained models are available at
https://aka.ms/kosmos-2.
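
To make the Markdown-style grounding format concrete, here is a toy encoder that discretizes a normalized bounding box into grid-cell tokens and attaches it to a text span as a link. The 32x32 grid and the `<loc_...>` token names are assumptions for illustration, not Kosmos-2's actual location vocabulary.

```python
GRID = 32  # assumed grid resolution for discretizing coordinates

def box_to_tokens(x0: float, y0: float, x1: float, y1: float) -> str:
    # Map normalized corner coordinates in [0, 1] to two discrete location tokens.
    def cell(x: float, y: float) -> int:
        return min(GRID - 1, int(y * GRID)) * GRID + min(GRID - 1, int(x * GRID))
    return f"<loc_{cell(x0, y0):04d}><loc_{cell(x1, y1):04d}>"

def ground(span: str, box: tuple) -> str:
    # Referring expression as a Markdown-style link: [text span](location tokens).
    return f"[{span}]({box_to_tokens(*box)})"

print("A " + ground("snowman", (0.12, 0.20, 0.48, 0.85)) + " next to a campfire.")
```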

---------------
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*

Screen user interfaces (UIs) and infographics, sharing similar visual
language and design principles, play important roles in human communication and
human-machine interaction. We introduce ScreenAI, a vision-language model that
specializes in UI and infographics understanding. Our model improves upon the
PaLI architecture with the flexible patching strategy of pix2struct and is
trained on a unique mixture of datasets. At the heart of this mixture is a
novel screen annotation task in which the model has to identify the type and
location of UI elements. We use these text annotations to describe screens to
Large Language Models and automatically generate question-answering (QA), UI
navigation, and summarization training datasets at scale. We run ablation
studies to demonstrate the impact of these design choices. At only 5B
parameters, ScreenAI achieves new state-of-the-art results on UI- and
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
InfographicVQA) compared to models of similar size. Finally, we release three
new datasets: one focused on the screen annotation task and two others focused
on question answering.
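
The data-generation step described above (screen annotations turned into a textual description, which an LLM then uses to write QA pairs) can be sketched roughly as below. The annotation schema and prompt wording are made up for illustration and are not the paper's.

```python
def describe_screen(elements: list) -> str:
    # Flatten element annotations (type, text, box) into a textual screen description.
    parts = []
    for e in elements:
        x0, y0, x1, y1 = e["box"]
        parts.append(f'{e["type"]} "{e.get("text", "")}" at ({x0}, {y0}, {x1}, {y1})')
    return "SCREEN: " + "; ".join(parts)

annotations = [
    {"type": "BUTTON", "text": "Sign in", "box": (10, 400, 120, 440)},
    {"type": "TEXT", "text": "Forgot password?", "box": (10, 450, 200, 470)},
]

# A description like this would be handed to an LLM to generate QA, navigation,
# and summarization training examples at scale.
prompt = describe_screen(annotations) + " Task: generate three QA pairs about this screen."
print(prompt)
```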

---------------
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*

Task-oriented conversational agents rely on semantic parsers to translate
natural language to formal representations. In this paper, we propose the
design and rationale of the ThingTalk formal representation, and how the design
improves the development of transactional task-oriented agents.
ThingTalk is built on four core principles: (1) representing user requests
directly as executable statements, covering all the functionality of the agent,
(2) representing dialogues formally and succinctly to support accurate
contextual semantic parsing, (3) standardizing types and interfaces to maximize
reuse between agents, and (4) allowing multiple, independently-developed agents
to be composed in a single virtual assistant. ThingTalk is developed as part of
the Genie Framework that allows developers to quickly build transactional
agents given a database and APIs.
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
Compared to the others, the ThingTalk design is both more general and more
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
associated tools yields a new state of the art accuracy of 79% turn-by-turn.
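
The first principle (user requests represented directly as executable statements) can be illustrated with a tiny stand-in representation. The `Statement` class and skill registry below are invented for the sketch; real ThingTalk has its own syntax, type system, and Genie toolchain.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    skill: str   # fully qualified function the user request maps to
    args: dict   # typed arguments filled in by the semantic parser

# Toy skill registry standing in for an agent's full API surface.
SKILLS = {
    "restaurant.search": lambda args: [{"name": "Luigi's", "cuisine": args.get("cuisine")}],
    "restaurant.book": lambda args: {"status": "confirmed", **args},
}

def execute(stmt: Statement):
    # Because the representation is executable, dispatch is direct and unambiguous.
    return SKILLS[stmt.skill](stmt.args)

print(execute(Statement("restaurant.search", {"cuisine": "italian"})))
print(execute(Statement("restaurant.book", {"name": "Luigi's", "people": 2})))
```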

---------------
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*

In the pursuit of efficient automated content creation, procedural
generation, leveraging modifiable parameters and rule-based systems, emerges as
a promising approach. Nonetheless, it could be a demanding endeavor, given its
intricate nature necessitating a deep understanding of rules, algorithms, and
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT
positions LLMs as proficient problem solvers, dissecting the procedural 3D
modeling tasks into accessible segments and appointing the apt agent for each
task. 3D-GPT integrates three core agents: the task dispatch agent, the
conceptualization agent, and the modeling agent. They collaboratively achieve
two objectives. First, it enhances concise initial scene descriptions, evolving
them into detailed forms while dynamically adapting the text based on
subsequent instructions. Second, it integrates procedural generation,
extracting parameter values from enriched text to effortlessly interface with
3D software for asset creation. Our empirical investigations confirm that
3D-GPT not only interprets and executes instructions, delivering reliable
results but also collaborates effectively with human designers. Furthermore, it
seamlessly integrates with Blender, unlocking expanded manipulation
possibilities. Our work highlights the potential of LLMs in 3D modeling,
offering a basic framework for future advancements in scene generation and
animation.
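
A rough, purely illustrative sketch of the three-agent split described above, with the LLM calls replaced by trivial stubs. The function names and parameter schema are assumptions, not 3D-GPT's actual interfaces.

```python
def conceptualization_agent(instruction: str) -> str:
    # Would call an LLM to grow a terse scene description into a detailed one.
    return instruction + ", at sunset, with scattered rocks and sparse dry grass"

def modeling_agent(detailed_description: str) -> dict:
    # Would call an LLM to extract procedural-generation parameters for a tool like Blender.
    return {"terrain": "desert", "lighting": "sunset", "rock_density": 0.2, "grass_density": 0.1}

def task_dispatch_agent(instruction: str) -> dict:
    # Routes sub-tasks to the other agents and chains their outputs.
    detailed = conceptualization_agent(instruction)
    return {"description": detailed, "parameters": modeling_agent(detailed)}

print(task_dispatch_agent("a small desert scene"))
```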

---------------
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*

Equipping embodied agents with commonsense is important for robots to
successfully complete complex human instructions in general environments.
Recent large language models (LLM) can embed rich semantic knowledge for agents
in plan generation of complex tasks, while they lack the information about the
realistic world and usually yield infeasible action sequences. In this paper,
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
with physical scene constraint, where the agent generates executable plans
according to the existed objects in the scene by aligning LLMs with the visual
perception models. Specifically, we first construct a multimodal dataset
containing triplets of indoor scenes, instructions and action plans, where we
provide the designed prompts and the list of existing objects in the scene for
GPT-3.5 to generate a large number of instructions and corresponding planned
actions. The generated data is leveraged for grounded plan tuning of
pre-trained LLMs. During inference, we discover the objects in the scene by
extending open-vocabulary object detectors to multi-view RGB images collected
in different achievable locations. Experimental results show that the generated
plan from our TaPA framework can achieve higher success rate than LLaVA and
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
planning in general and complex environments.
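
A simplified illustration of the grounding idea above: the planner is prompted with the objects actually detected in the scene, and steps that mention absent objects are rejected. The planner stub and the string-matching check are assumptions made for the sketch, not TaPA's implementation.

```python
def plan_with_llm(instruction: str, scene_objects: list) -> list:
    # Stand-in for a tuned LLM that is prompted with the detected object list.
    return [
        "walk to the table",
        "open the microwave",          # not present in this scene
        "pick up the apple",
        "put the apple in the fridge",
    ]

def grounded(steps: list, scene_objects: list) -> list:
    # Keep only steps whose mentioned objects exist in the scene.
    return [s for s in steps if any(obj in s for obj in scene_objects)]

scene = ["table", "apple", "fridge"]   # e.g., from an open-vocabulary detector
print(grounded(plan_with_llm("put the apple away", scene), scene))
```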

---------------
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*

Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
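
The image-to-point-cloud alignment step can be illustrated with a generic symmetric InfoNCE objective over paired embeddings (NumPy only, random features in place of real encoders). This shows the kind of contrastive loss the abstract refers to, not the paper's exact training setup.

```python
import numpy as np

def info_nce(img: np.ndarray, pts: np.ndarray, temperature: float = 0.07) -> float:
    # Normalize features, compare all image/point-cloud pairs, and pull matched pairs
    # (the diagonal) together in both directions.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
    logits = img @ pts.T / temperature
    idx = np.arange(len(img))
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-(log_sm_rows[idx, idx].mean() + log_sm_cols[idx, idx].mean()) / 2)

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```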

---------------
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*

Large Language Model (LLM) agents, capable of performing a broad range of
actions, such as invoking tools and controlling robots, show great potential in
tackling real-world challenges. LLM agents are typically prompted to produce
actions by generating JSON or text in a pre-defined format, which is usually
limited by constrained action space (e.g., the scope of pre-defined tools) and
restricted flexibility (e.g., inability to compose multiple tools). This work
proposes to use executable Python code to consolidate LLM agents' actions into
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
can execute code actions and dynamically revise prior actions or emit new
actions upon new observations through multi-turn interactions. Our extensive
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
CodeAct outperforms widely used alternatives (up to 20% higher success rate).
The encouraging performance of CodeAct motivates us to build an open-source LLM
agent that interacts with environments by executing interpretable code and
collaborates with users using natural language. To this end, we collect an
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
interactions using CodeAct. We show that it can be used with existing data to
improve models in agent-oriented tasks without compromising their general
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
model training) using existing libraries and autonomously self-debug.
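
A stripped-down illustration of the code-as-action loop described above: each "action" is a Python snippet executed in a persistent namespace, and the captured output becomes the next observation. The scripted fake_llm and the absence of sandboxing are simplifications for the sketch; this is not the released CodeAct agent.

```python
import contextlib
import io

def fake_llm(observation: str, turn: int) -> str:
    # Stand-in for the agent model; a real agent would condition on the full history.
    scripted = [
        "nums = [3, 1, 4, 1, 5, 9]; print(sorted(nums))",
        "print('largest gap:', max(b - a for a, b in zip(sorted(nums), sorted(nums)[1:])))",
    ]
    return scripted[turn]

namespace = {}
observation = "Task: sort the list and report the largest gap."
for turn in range(2):
    action = fake_llm(observation, turn)      # executable code is the action space
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(action, namespace)               # state persists across turns
    observation = buffer.getvalue().strip()   # feed the result back as the observation
    print(f"turn {turn}: {observation}")
```

Keeping one shared namespace across turns is what lets a later action reuse or revise the results of an earlier one, which is the flexibility the abstract contrasts with fixed JSON tool calls.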

---------------
""")