browser-gym / README.md

burtenshaw HF Staff

Upload folder using huggingface_hub

f3f0fe2 verified 8 days ago

metadata

title: Browsergym_env Environment Server
emoji: 🐏
colorFrom: gray
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Browsergym_env Environment Server

FastAPI server for the browsergym_env environment, powered by Meta's OpenEnv.

About

This Space provides a containerized environment for browsergym_env interactions, built with FastAPI and the OpenEnv framework.

Web Interface

This deployment includes an interactive web interface for exploring the environment:

HumanAgent Interface: Interact with the environment using a web form
State Observer: Real-time view of environment state and action history
Live Updates: WebSocket-based real-time updates

Access the web interface at: /web

API Documentation

Visit /docs for interactive API documentation.

Health Check

The environment provides a health check endpoint at /health.
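
Polling this endpoint is a simple way to confirm the container is up before sending actions. The sketch below uses only the standard library. It assumes the server answers `GET /health` with HTTP 200 once it is ready, and it makes no assumption about the response body. The helper names `health_url`, `is_healthy`, and `wait_until_healthy` are hypothetical and not part of OpenEnv.

```python
import time
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL for a server root such as http://localhost:8000."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed readiness signal)."""
    try:
        with opener(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False


def wait_until_healthy(base_url: str, attempts: int = 30, delay: float = 1.0,
                       check=is_healthy) -> bool:
    """Poll the health endpoint until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if check(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8000")` after `docker run` would block for up to about 30 seconds with these defaults.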

BrowserGym Environment

BrowserGym is a unified framework for web-based agent tasks that provides access to multiple benchmarks under a single Gymnasium-compatible API. This integration brings the complete training-to-evaluation pipeline for web agents into OpenEnv.

Why BrowserGym?

BrowserGym provides a complete pipeline for developing web agents: train on simple tasks, then evaluate on realistic websites.

What are these benchmarks?

MiniWoB++ (Training): 100+ synthetic web tasks like "click this button", "fill out this form", "select from dropdown". Each task is a simple webpage with a clear objective. Fast resets, randomized variations, dense rewards. Perfect for learning basic web navigation skills. No external setup needed - tasks run in isolated browser sessions.
WebArena (Evaluation): 812 tasks on real websites (e-commerce, forums, GitLab, Wikipedia). Tasks like "find the cheapest laptop and add to cart" or "create a merge request for bug #123". Multi-step, requires reasoning, sparse rewards. Tests whether your agent can handle actual websites. Requires running 7 backend services (shopping site, GitLab instance, etc.).
VisualWebArena: Similar to WebArena but requires visual understanding - agents need to interpret images, identify UI elements visually, handle multimodal content.
WorkArena: Enterprise software tasks (CRM, project management, business workflows). Tests automation on corporate-style applications.

The training → evaluation pipeline:

Train on MiniWoB (simple, controlled, fast iterations)
Evaluate on WebArena (complex, realistic, measures real-world capability)

Key advantage: You can start training immediately with MiniWoB. No need to set up infrastructure just to test if your code works.

Quick Start - Training (MiniWoB)

No Setup Required! 🎉

from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Create environment for MiniWoB training task
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "miniwob",
        "BROWSERGYM_TASK_NAME": "click-test",  # or "click-button", "click-dialog", etc.
    }
)

# Train your agent! (`agent` is a placeholder for your own policy object)
for episode in range(1000):
    result = env.reset()
    print(f"Goal: {result.observation.goal}")

    done = False
    while not done:
        # Your agent decides what to do
        action_str = agent.get_action(result.observation.text)
        action = BrowserGymAction(action_str=action_str)

        result = env.step(action)
        done = result.done

        print(f"Reward: {result.reward}")

env.close()
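
The loop above calls `agent.get_action(...)` without defining `agent`. A minimal stand-in can look like the sketch below. It rests on two assumptions that are not confirmed by this README: that the text observation lists elements one per line as `[id] role 'name'`, and that `click` accepts that element id. `FirstButtonAgent` is a hypothetical name used only for illustration.

```python
import re


class FirstButtonAgent:
    """Hypothetical baseline agent: click the first button found in the observation text."""

    # Assumed line format in the text observation: [13] button 'Submit'
    _BUTTON = re.compile(r"\[(\w+)\]\s+button\b")

    def get_action(self, observation_text: str) -> str:
        match = self._BUTTON.search(observation_text or "")
        if match is None:
            # No button visible: fall back to the scroll action listed under "Action".
            return "scroll('down')"
        return f"click('{match.group(1)}')"


agent = FirstButtonAgent()
print(agent.get_action("[7] heading 'Task'\n[13] button 'Submit'"))  # click('13')
```

A baseline like this is only useful for checking that the reset/step loop runs end to end; a trained policy would replace `get_action`.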

Available Tasks by Benchmark

MiniWoB++ Tasks (Training - 100+ tasks)

MiniWoB tasks are organized by difficulty and type. Here are the main categories:

Click Tasks (Basic interaction)

Task Name	Description	Difficulty
`click-test`	Click a single button	⭐ Easy
`click-button`	Click button with specific text	⭐ Easy
`click-button-sequence`	Click buttons in order	⭐⭐ Medium
`click-checkboxes`	Select specific checkboxes	⭐⭐ Medium
`click-checkboxes-soft`	Select checkboxes (multiple valid)	⭐⭐ Medium
`click-checkboxes-large`	Many checkboxes to select from	⭐⭐ Medium
`click-checkboxes-transfer`	Transfer learning variation	⭐⭐ Medium
`click-dialog`	Click correct button in dialog	⭐ Easy
`click-dialog-2`	More complex dialog	⭐⭐ Medium
`click-link`	Click on a link	⭐ Easy
`click-option`	Select from dropdown	⭐⭐ Medium
`click-pie`	Click on pie chart slice	⭐⭐ Medium
`click-scroll-list`	Click item in scrollable list	⭐⭐⭐ Hard
`click-shades`	Click on specific color shade	⭐⭐ Medium
`click-shape`	Click on specific shape	⭐⭐ Medium
`click-tab`	Switch between tabs	⭐⭐ Medium
`click-tab-2`	More complex tab switching	⭐⭐⭐ Hard
`click-widget`	Click on UI widget	⭐⭐ Medium

Text Entry Tasks (Typing and forms)

Task Name	Description	Difficulty
`enter-text`	Type text into input field	⭐ Easy
`enter-text-dynamic`	Dynamic text entry	⭐⭐ Medium
`enter-text-2`	Multiple text fields	⭐⭐ Medium
`enter-password`	Fill password field	⭐ Easy
`enter-date`	Enter a date	⭐⭐ Medium
`enter-time`	Enter a time	⭐⭐ Medium
`login-user`	Complete login form	⭐⭐ Medium
`login-user-popup`	Login via popup	⭐⭐⭐ Hard

Navigation Tasks (Multi-step interaction)

Task Name	Description	Difficulty
`navigate-tree`	Navigate through tree structure	⭐⭐⭐ Hard
`search-engine`	Use search interface	⭐⭐ Medium
`use-autocomplete`	Interact with autocomplete	⭐⭐⭐ Hard
`book-flight`	Book a flight (complex form)	⭐⭐⭐⭐ Very Hard
`choose-date`	Pick date from calendar	⭐⭐⭐ Hard
`choose-date-easy`	Simplified date picker	⭐⭐ Medium
`choose-date-medium`	Medium difficulty date picker	⭐⭐⭐ Hard
`choose-list`	Select from long list	⭐⭐ Medium

Visual/Spatial Tasks (Requires visual understanding)

Task Name	Description	Difficulty
`count-sides`	Count sides of shape	⭐⭐ Medium
`count-shape`	Count specific shapes	⭐⭐ Medium
`find-word`	Find word in text	⭐⭐ Medium
`focus-text`	Focus on text element	⭐ Easy
`focus-text-2`	More complex focus task	⭐⭐ Medium
`grid-coordinate`	Click grid coordinate	⭐⭐ Medium
`guess-number`	Guess a number game	⭐⭐⭐ Hard
`identify-shape`	Identify shape type	⭐⭐ Medium
`read-table`	Extract info from table	⭐⭐⭐ Hard
`read-table-2`	More complex table reading	⭐⭐⭐ Hard

Email/Social Tasks (Realistic scenarios)

Task Name	Description	Difficulty
`email-inbox`	Manage email inbox	⭐⭐⭐⭐ Very Hard
`email-inbox-forward`	Forward emails	⭐⭐⭐⭐ Very Hard
`email-inbox-nl`	Natural language email task	⭐⭐⭐⭐ Very Hard
`email-inbox-star-reply`	Star and reply to emails	⭐⭐⭐⭐ Very Hard
`social-media`	Social media interaction	⭐⭐⭐⭐ Very Hard
`social-media-some`	Partial social media task	⭐⭐⭐ Hard

Total: 100+ tasks across all categories

Usage:

# Easy task for quick testing
env = BrowserGymEnv(environment={"BROWSERGYM_TASK_NAME": "click-test"})

# Medium difficulty for training
env = BrowserGymEnv(environment={"BROWSERGYM_TASK_NAME": "click-checkboxes"})

# Very hard task for evaluation
env = BrowserGymEnv(environment={"BROWSERGYM_TASK_NAME": "email-inbox"})
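
The difficulty ratings in the tables above can drive a simple easy-to-hard curriculum. The sketch below copies a handful of rows into a dictionary and orders them; `TASK_DIFFICULTY` and `curriculum` are hypothetical names, and the numeric scale (1 = Easy through 4 = Very Hard) is an assumption introduced here.

```python
# Difficulty ratings copied from a few rows of the tables above
# (assumed scale: 1 = Easy, 2 = Medium, 3 = Hard, 4 = Very Hard).
TASK_DIFFICULTY = {
    "click-test": 1,
    "click-button": 1,
    "enter-text": 1,
    "click-checkboxes": 2,
    "login-user": 2,
    "click-tab-2": 3,
    "navigate-tree": 3,
    "book-flight": 4,
    "email-inbox": 4,
}


def curriculum(max_difficulty: int = 4) -> list[str]:
    """Return task names ordered from easiest to hardest, up to max_difficulty."""
    eligible = [
        (level, name)
        for name, level in TASK_DIFFICULTY.items()
        if level <= max_difficulty
    ]
    # Sorting the (level, name) pairs orders by difficulty, then alphabetically.
    return [name for _level, name in sorted(eligible)]


for task_name in curriculum(max_difficulty=2):
    print(task_name)
```

Each name returned can be passed as `BROWSERGYM_TASK_NAME` when creating the next training environment.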

WebArena Tasks (Evaluation - 812 tasks)

WebArena tasks are organized by website and difficulty. Tasks are numbered 0-811.

By Website:

Website	Task Count	Description	Example Tasks
Shopping	~200	E-commerce site	Search products, add to cart, checkout
Shopping Admin	~150	Admin panel	Manage products, orders, customers
Reddit	~150	Forum/social	Post, comment, search discussions
GitLab	~200	Code repository	Create issues, merge requests, review code
Wikipedia	~100	Knowledge base	Search, read, extract information
Map	~12	Location service	Find places, get directions

By Difficulty:

Difficulty	Task Count	Steps Required	Example
Easy	~200	1-5 steps	"Find the price of product X"
Medium	~400	5-15 steps	"Add cheapest laptop to cart"
Hard	~212	15+ steps	"Create merge request for bug fix"

Usage:

# Task 0 (usually easy)
env = BrowserGymEnv(environment={
    "BROWSERGYM_BENCHMARK": "webarena",
    "BROWSERGYM_TASK_NAME": "0",
    "SHOPPING": "http://your-server:7770",
    # ... other URLs
})

# Task 156 (GitLab merge request)
env = BrowserGymEnv(environment={
    "BROWSERGYM_BENCHMARK": "webarena",
    "BROWSERGYM_TASK_NAME": "156",
    # ... URLs
})

Note: WebArena tasks require the full backend infrastructure. See WebArena setup guide.

VisualWebArena Tasks (910 tasks)

Similar to WebArena but requires visual understanding. Tasks involve:

Image-based reasoning
Visual element identification
Multimodal interaction (text + images)

WorkArena Tasks

Enterprise software automation tasks:

CRM operations
Project management
Business workflows

Full task lists:

Evaluation (WebArena)

Prerequisites

WebArena requires setting up backend infrastructure. See the WebArena documentation.

Usage

from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Create environment for WebArena evaluation
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "0",  # Task ID
        # WebArena backend URLs (required)
        "SHOPPING": "http://your-server:7770",
        "SHOPPING_ADMIN": "http://your-server:7780/admin",
        "REDDIT": "http://your-server:9999",
        "GITLAB": "http://your-server:8023",
        "MAP": "http://your-server:3000",
        "WIKIPEDIA": "http://your-server:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing",
        "HOMEPAGE": "http://your-server:4399",
    }
)

# Evaluate your trained agent
result = env.reset()
while not result.done:
    action_str = agent.get_action(result.observation.text)
    action = BrowserGymAction(action_str=action_str)
    result = env.step(action)

print(f"Success: {result.reward}")
env.close()
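
Typing seven backend URLs for every task is error-prone. The sketch below builds the `environment` dictionary from a single host name, using the ports and paths from the example above. It assumes all services run on one host over plain HTTP; `WEBARENA_SERVICES` and `webarena_environment` are hypothetical names.

```python
# Port and path suffixes taken from the example configuration above.
WEBARENA_SERVICES = {
    "SHOPPING": ":7770",
    "SHOPPING_ADMIN": ":7780/admin",
    "REDDIT": ":9999",
    "GITLAB": ":8023",
    "MAP": ":3000",
    "WIKIPEDIA": ":8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing",
    "HOMEPAGE": ":4399",
}


def webarena_environment(host: str, task_id: int) -> dict[str, str]:
    """Build the environment variables for one WebArena task (IDs 0-811)."""
    if not 0 <= task_id <= 811:
        raise ValueError(f"WebArena task IDs run from 0 to 811, got {task_id}")
    environment = {
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": str(task_id),
    }
    for name, suffix in WEBARENA_SERVICES.items():
        # Assumption: every backend service is reachable on the same host.
        environment[name] = f"http://{host}{suffix}"
    return environment


print(webarena_environment("your-server", 0)["GITLAB"])  # http://your-server:8023
```

The returned dictionary has the same shape as the `environment=` argument shown above.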

Building the Docker Image

Prerequisites

Base Image: Build the OpenEnv base image first:

# From the OpenEnv repository root
docker build -t openenv-base:latest -f src/core/containers/images/Dockerfile .

Build the BrowserGym Environment

# From the OpenEnv repository root
docker build -t browsergym-env:latest -f src/envs/browsergym_env/server/Dockerfile .

Run the Server

For MiniWoB (Training):

docker run -p 8000:8000 \
  -e BROWSERGYM_BENCHMARK="miniwob" \
  -e BROWSERGYM_TASK_NAME="click-test" \
  browsergym-env:latest

For WebArena (Evaluation):

docker run -p 8000:8000 \
  -e BROWSERGYM_BENCHMARK="webarena" \
  -e BROWSERGYM_TASK_NAME="0" \
  -e SHOPPING="http://your-server:7770" \
  -e SHOPPING_ADMIN="http://your-server:7780/admin" \
  -e REDDIT="http://your-server:9999" \
  -e GITLAB="http://your-server:8023" \
  -e MAP="http://your-server:3000" \
  -e WIKIPEDIA="http://your-server:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing" \
  -e HOMEPAGE="http://your-server:4399" \
  browsergym-env:latest
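
When launching containers from a script, the same `docker run` invocation can be assembled as an argument list. This is a sketch using only flags that appear in the commands above (`-p` and `-e`); `docker_run_command` is a hypothetical helper, and nothing here actually starts Docker.

```python
import shlex


def docker_run_command(image: str, environment: dict[str, str], port: int = 8000) -> list[str]:
    """Assemble a `docker run` argument list mirroring the commands above."""
    command = ["docker", "run", "-p", f"{port}:{port}"]
    for name, value in environment.items():
        command += ["-e", f"{name}={value}"]
    command.append(image)
    return command


command = docker_run_command(
    "browsergym-env:latest",
    {"BROWSERGYM_BENCHMARK": "miniwob", "BROWSERGYM_TASK_NAME": "click-test"},
)
# shlex.join quotes arguments where needed, so the output is safe to paste into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting issues with URLs that contain colons or slashes.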

Environment Details

Action

Actions in BrowserGym are strings written as function calls (for example `click(...)` or `fill(...)`) that describe browser operations:

from envs.browsergym_env import BrowserGymAction

# Click actions
action = BrowserGymAction(action_str="click('Submit button')")
action = BrowserGymAction(action_str="click('element_id_123')")

# Type actions
action = BrowserGymAction(action_str="fill('username', 'john@example.com')")
action = BrowserGymAction(action_str="fill('password', 'secret123')")

# Navigate actions
action = BrowserGymAction(action_str="goto('https://example.com')")

# Keyboard actions
action = BrowserGymAction(action_str="press('Enter')")
action = BrowserGymAction(action_str="press('Tab')")

# Scroll actions
action = BrowserGymAction(action_str="scroll('down')")
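
Building these strings by hand breaks as soon as an argument contains a quote. The sketch below formats arguments with `repr` so they are always valid Python literals, and parses an action string back without executing it. It assumes the server accepts Python-style call syntax as in the examples above; `format_action` and `parse_action` are hypothetical helpers.

```python
import ast


def format_action(name: str, *args) -> str:
    """Render an action such as fill('username', 'john@example.com')."""
    if not name.isidentifier():
        raise ValueError(f"invalid action name: {name!r}")
    # repr() escapes quotes and backslashes, producing valid Python literals.
    return f"{name}({', '.join(repr(arg) for arg in args)})"


def parse_action(action_str: str):
    """Split an action string into (name, args) without executing it."""
    call = ast.parse(action_str, mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {action_str!r}")
    return call.func.id, [ast.literal_eval(arg) for arg in call.args]


print(format_action("fill", "username", "john@example.com"))
print(parse_action("press('Enter')"))
```

The resulting string goes into `BrowserGymAction(action_str=...)` exactly as in the examples above.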

Observation

Observations contain multiple modalities:

result = env.step(action)
obs = result.observation

# Text observations
print(obs.text)          # Primary text representation (AXTree or DOM)
print(obs.axtree_txt)    # Accessibility tree
print(obs.pruned_html)   # Pruned HTML (interactive elements only)

# Page metadata
print(obs.url)           # Current URL
print(obs.goal)          # Task goal/instruction

# Visual (if enabled)
if obs.screenshot is not None:
    print(obs.screenshot.shape)  # [height, width, channels]

# Error handling
if obs.last_action_error:
    print(f"Action failed: {obs.error}")

# Episode status
print(obs.done)          # True if episode ended
print(obs.reward)        # Reward for the step

# Access full BrowserGym data (includes timestamps, etc.)
print(obs.metadata["browsergym_obs"])  # Full observation dict from BrowserGym
print(obs.metadata["browsergym_info"]) # Full info dict (timestamps, page state, etc.)
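
For a language-model agent, these fields usually need to be condensed into one prompt. The sketch below shows one possible layout. `ObservationStub` is a stand-in that copies a subset of the field names listed above so the example runs on its own; it is not the real observation class, and the prompt format and `max_chars` limit are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ObservationStub:
    """Stand-in with a subset of the fields listed above (not the real class)."""
    goal: str = ""
    url: str = ""
    text: str = ""
    last_action_error: bool = False
    error: str = ""


def build_prompt(obs, max_chars: int = 4000) -> str:
    """Condense an observation into a text prompt, truncating long pages."""
    lines = [f"Goal: {obs.goal}", f"URL: {obs.url}"]
    if obs.last_action_error:
        # Surface the failure so the agent can pick a different action.
        lines.append(f"Previous action failed: {obs.error}")
    lines.append("Page:")
    lines.append(obs.text[:max_chars])
    return "\n".join(lines)


obs = ObservationStub(goal="Click the button", url="http://localhost/task", text="[13] button 'OK'")
print(build_prompt(obs))
```

Because `build_prompt` only reads attributes, it should work unchanged on any object that exposes the same field names.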

Advanced: Accessing Raw BrowserGym Data

For VisualWebArena or custom training, you may need additional data like timestamps or browser state. The full BrowserGym observation and info dicts are preserved in metadata:

result = env.step(action)

# Access timestamps (if available)
info = result.observation.metadata["browsergym_info"]
if "timestamp" in info:
    print(f"Action timestamp: {info['timestamp']}")

# Access additional observation fields
obs_dict = result.observation.metadata["browsergym_obs"]
if "dom_object" in obs_dict:
    dom = obs_dict["dom_object"]
    # Work with raw DOM object

# Access page performance data
if "performance" in info:
    print(f"Page load time: {info['performance']}")

State

The environment state tracks progress:

state = env.state()

print(f"Benchmark: {state.benchmark}")     # 'miniwob', 'webarena', etc.
print(f"Task: {state.task_name}")          # Task name/ID
print(f"Episode: {state.episode_id}")      # Unique episode ID
print(f"Steps: {state.step_count}")        # Number of steps taken
print(f"Total Reward: {state.cum_reward}") # Cumulative reward
print(f"Goal: {state.goal}")               # Task instruction
print(f"URL: {state.current_url}")         # Current page URL

Configuration

Environment variables:

Common Settings

BROWSERGYM_BENCHMARK: Benchmark to use (miniwob, webarena, visualwebarena, workarena)
BROWSERGYM_TASK_NAME: Specific task name (optional, will use first available if not set)
BROWSERGYM_HEADLESS: Run browser in headless mode (default: true)
BROWSERGYM_VIEWPORT_WIDTH: Browser viewport width (default: 1280)
BROWSERGYM_VIEWPORT_HEIGHT: Browser viewport height (default: 720)
BROWSERGYM_TIMEOUT: Action timeout in milliseconds (default: 10000)
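
The defaults above can be mirrored on the client side to validate a configuration before starting a container. This is a sketch, not the server's actual parsing code: `BrowserGymSettings` and `load_settings` are hypothetical, the accepted boolean spellings are an assumption, and the `miniwob` fallback for the benchmark is an assumption since no default is documented.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BrowserGymSettings:
    benchmark: str
    task_name: Optional[str]
    headless: bool
    viewport_width: int
    viewport_height: int
    timeout_ms: int


def load_settings(environ: Mapping[str, str] = os.environ) -> BrowserGymSettings:
    """Read the common settings, falling back to the documented defaults."""
    headless_raw = environ.get("BROWSERGYM_HEADLESS", "true")
    return BrowserGymSettings(
        # Assumption: no default benchmark is documented; "miniwob" is used here.
        benchmark=environ.get("BROWSERGYM_BENCHMARK", "miniwob"),
        task_name=environ.get("BROWSERGYM_TASK_NAME"),
        # Assumption: these spellings count as true; the server may differ.
        headless=headless_raw.strip().lower() in {"1", "true", "yes"},
        viewport_width=int(environ.get("BROWSERGYM_VIEWPORT_WIDTH", "1280")),
        viewport_height=int(environ.get("BROWSERGYM_VIEWPORT_HEIGHT", "720")),
        timeout_ms=int(environ.get("BROWSERGYM_TIMEOUT", "10000")),
    )


print(load_settings({"BROWSERGYM_TASK_NAME": "click-test"}))
```

A non-numeric viewport or timeout value raises `ValueError` from `int(...)`, which catches typos before the browser starts.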

WebArena-Specific (only needed for WebArena benchmark)

SHOPPING: Shopping website URL
SHOPPING_ADMIN: Shopping admin panel URL
REDDIT: Reddit-like forum URL
GITLAB: GitLab instance URL
MAP: Map service URL
WIKIPEDIA: Wikipedia instance URL
HOMEPAGE: Homepage URL

Supported Benchmarks

1. MiniWoB++ (Training) ✅ Recommended for Training

100+ tasks ranging from simple (click buttons) to complex (form filling, navigation)
Fast: Instant resets, quick episodes
Randomized: Task variations for generalization
No setup: Works out-of-the-box
Dense rewards: Immediate feedback for learning

Use Case: Train agents on fundamental web navigation skills

2. WebArena (Evaluation) 📊 Benchmark

812 realistic tasks across 6 websites
Complex: Multi-step reasoning, real web interfaces
Requires setup: Need to run 7 backend services
Sparse rewards: Binary success/failure
Evaluation-focused: Test real-world performance

Use Case: Evaluate agents on realistic web tasks

3. VisualWebArena (Evaluation) 👁️ Visual Benchmark

910 tasks requiring visual understanding
Multimodal: Both text and visual observations
Requires setup: Similar to WebArena
Challenging: Requires visual reasoning

Use Case: Test visual web navigation capabilities

4. WorkArena (Evaluation) 💼 Enterprise Benchmark

Enterprise tasks: CRM, project management, etc.
Realistic workflows: Real enterprise software
Requires setup: Enterprise software instances

Use Case: Evaluate on business automation tasks

Typical Training Pipeline

from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Stage 1: Train on MiniWoB (simple tasks, fast)
train_env = BrowserGymEnv.from_docker_image(
    "browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "miniwob",
        "BROWSERGYM_TASK_NAME": "click-button",
    }
)

# Train your agent (RL, imitation learning, etc.)
agent.train(train_env, num_episodes=10000)
train_env.close()

# Stage 2: Evaluate on WebArena (complex tasks, realistic)
eval_env = BrowserGymEnv.from_docker_image(
    "browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "0",
        # ... WebArena URLs
    }
)

# Test performance (this env is pinned to task "0"; covering all 812 tasks means one env per task ID)
success_rate = agent.evaluate(eval_env, num_tasks=812)
print(f"WebArena Success Rate: {success_rate:.2%}")
eval_env.close()
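
The `agent.evaluate(...)` call above is left to the reader. One way to compute a success rate over several task IDs is sketched below: create one environment per task, run the episode, and count it as a success when the final reward is positive (WebArena rewards are described as binary). `StepResult` and `StubEnv` are toy stand-ins so the example runs without a server, and `evaluate` is a hypothetical helper; a real `make_env` would call `BrowserGymEnv.from_docker_image(...)` and wrap action strings in `BrowserGymAction`.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class StubEnv:
    """Toy stand-in: the episode succeeds when the agent emits the target action."""

    def __init__(self, target: str):
        self.target = target

    def reset(self) -> StepResult:
        return StepResult("start", 0.0, False)

    def step(self, action: str) -> StepResult:
        return StepResult("end", 1.0 if action == self.target else 0.0, True)

    def close(self) -> None:
        pass


def evaluate(agent_fn, make_env, task_ids, max_steps: int = 30) -> float:
    """Return the fraction of tasks that end with a positive reward."""
    task_ids = list(task_ids)
    successes = 0
    for task_id in task_ids:
        env = make_env(task_id)  # one environment per task ID
        try:
            result = env.reset()
            for _ in range(max_steps):
                if result.done:
                    break
                result = env.step(agent_fn(result.observation))
            if result.done and (result.reward or 0) > 0:
                successes += 1
        finally:
            env.close()  # release the environment even if the episode raises
    return successes / len(task_ids) if task_ids else 0.0


rate = evaluate(
    agent_fn=lambda observation: "click('1')",
    make_env=lambda task_id: StubEnv("click('1')" if task_id % 2 == 0 else "click('2')"),
    task_ids=range(4),
)
print(f"Success rate: {rate:.2%}")  # Success rate: 50.00%
```

The `max_steps` cap keeps a stuck agent from looping forever on a task that never reports `done`; an episode that hits the cap counts as a failure.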

Development & Testing

Running Tests

# From the OpenEnv repository root
pytest tests/envs/test_browsergym_env.py

Local Development

# Install in development mode
cd /path/to/OpenEnv
pip install -e .

# Install BrowserGym
pip install browsergym browsergym-miniwob browsergym-webarena

# Run the server locally
cd src/envs/browsergym_env/server
export BROWSERGYM_BENCHMARK=miniwob
export BROWSERGYM_TASK_NAME=click-test
python app.py

Project Structure

browsergym_env/
├── __init__.py              # Module exports
├── models.py                # Action, Observation, State dataclasses
├── client.py                # HTTPEnvClient implementation
├── README.md                # This file
└── server/
    ├── __init__.py
    ├── app.py               # FastAPI application
    ├── browsergym_environment.py  # Environment implementation
    ├── Dockerfile           # Container specification
    └── requirements.txt     # Python dependencies
