sharathmajjigi committed
Commit 7d18df7 · 1 Parent(s): e61b31a

Add UI-TARS grounding model implementation

Files changed (3)
  1. README.md +14 -0
  2. app.py +56 -0
  3. requirements.txt +4 -0
README.md CHANGED
@@ -11,4 +11,18 @@ license: mit
 short_description: A grounding model for CUA
 ---
 
+# UI-TARS Grounding Model
+
+A grounding model for Computer Use Agents (CUA) that can understand screen elements and generate action plans.
+
+## Usage
+
+1. Upload a screenshot of your desktop/browser
+2. Describe what you want to do
+3. Get grounding results with element locations and action plans
+
+## Model
+
+This space hosts the UI-TARS-1.5-7B model for visual grounding tasks.
+
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
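The Usage steps above describe the in-browser flow; the same interface can also be queried programmatically with the `gradio_client` library. This is a minimal sketch, not part of the commit: the Space id is a placeholder, and it assumes a recent `gradio_client` where file inputs are wrapped with `handle_file` and `gr.Interface` exposes its function at the default `/predict` endpoint.

```python
from gradio_client import Client, handle_file

# Placeholder Space id -- replace with the actual <user>/<space> handle.
client = Client("<user>/<space>")

result = client.predict(
    handle_file("screenshot.png"),              # screenshot to ground against
    "Click the search box and type 'weather'",  # prompt/goal
    api_name="/predict",                        # default endpoint for gr.Interface
)
print(result)  # JSON string with "elements" and "actions"
```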
app.py ADDED
@@ -0,0 +1,56 @@
+import gradio as gr
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+from PIL import Image
+import io
+import base64
+import json
+
+# Load the UI-TARS model (this will download ~7GB on first run)
+model_name = "ByteDance-Seed/UI-TARS-1.5-7B"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+
+def process_grounding(image, prompt):
+    """
+    Process image with UI-TARS grounding model
+    This is a simplified implementation - you'll need to adapt it
+    """
+    try:
+        # Convert image to PIL if needed
+        if isinstance(image, str):
+            # Handle base64 string
+            image_data = base64.b64decode(image)
+            image = Image.open(io.BytesIO(image_data))
+
+        # Here you would implement the actual UI-TARS grounding logic
+        # For now, returning a mock response
+        result = {
+            "elements": [
+                {"type": "button", "x": 100, "y": 200, "text": "Click me"},
+                {"type": "text_field", "x": 150, "y": 300, "text": "Input field"}
+            ],
+            "actions": [
+                {"action": "click", "x": 100, "y": 200, "description": "Click button"},
+                {"action": "type", "x": 150, "y": 300, "description": "Type in field"}
+            ]
+        }
+
+        return json.dumps(result, indent=2)
+
+    except Exception as e:
+        return f"Error processing image: {str(e)}"
+
+# Create Gradio interface
+iface = gr.Interface(
+    fn=process_grounding,
+    inputs=[
+        gr.Image(type="pil", label="Upload Screenshot"),
+        gr.Textbox(label="Prompt/Goal", placeholder="What do you want to do?")
+    ],
+    outputs=gr.Textbox(label="Grounding Results", lines=10),
+    title="UI-TARS Grounding Model",
+    description="Upload a screenshot and describe your goal to get grounding results"
+)
+
+iface.launch()
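As committed, `process_grounding` loads the checkpoint but never queries it and returns a hard-coded mock, as the in-code comments say. For orientation only, here is a rough sketch of what the real inference step might look like, assuming UI-TARS-1.5-7B exposes the Qwen2.5-VL-style multimodal processor and chat template it is derived from; the Auto classes and generation arguments below are assumptions to verify against the model card, not something this commit establishes.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Assumption: the checkpoint loads through a multimodal processor,
# not the plain AutoTokenizer/AutoModelForCausalLM pair used above.
model_name = "ByteDance-Seed/UI-TARS-1.5-7B"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

def ground(image: Image.Image, prompt: str) -> str:
    # Build a chat-style request holding the screenshot and the goal.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

The model's raw output is action text in UI-TARS's own format, so a parsing step would likely still be needed to produce the JSON schema used by the mock above.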
requirements.txt ADDED
@@ -0,0 +1,4 @@
+transformers
+torch
+Pillow
+gradio