---
title: Browsergym_env Environment Server
emoji: 🐏
colorFrom: gray
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Browsergym_env Environment Server

FastAPI server for the browsergym_env environment, powered by Meta's OpenEnv.

## About

This Space provides a containerized environment for browsergym_env interactions.
It is built with FastAPI and the OpenEnv framework.

## Web Interface

This deployment includes an interactive web interface for exploring the environment:
- **HumanAgent Interface**: Interact with the environment using a web form
- **State Observer**: Real-time view of environment state and action history
- **Live Updates**: WebSocket-based real-time updates

Access the web interface at: `/web`

## API Documentation

Visit `/docs` for interactive API documentation.

## Health Check

The environment provides a health check endpoint at `/health`.
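
Before sending actions, it can help to wait until the container answers on `/health`. The sketch below uses only the standard library. The helper name `wait_until_healthy`, the default base URL `http://localhost:8000`, and the assumption that a healthy server replies with HTTP 200 are illustrative and not taken from the OpenEnv code.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url="http://localhost:8000", attempts=30, delay=1.0,
                       opener=urllib.request.urlopen):
    """Poll `<base_url>/health` until it answers with HTTP 200.

    Returns True once the endpoint responds, False if every attempt fails.
    `opener` is injectable so the helper can be exercised without a server.
    """
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable yet; retry after a short pause.
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `docker run` avoids racing the browser startup inside the container.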


# BrowserGym Environment

BrowserGym is a unified framework for web-based agent tasks that provides access to multiple benchmarks under a single Gymnasium-compatible API. This integration brings the complete training-to-evaluation pipeline for web agents into OpenEnv.

## Why BrowserGym?

BrowserGym provides a complete pipeline for developing web agents: train on simple tasks, then evaluate on realistic websites.

**What are these benchmarks?**

- **MiniWoB++ (Training)**: 100+ synthetic web tasks like "click this button", "fill out this form", "select from dropdown". Each task is a simple webpage with a clear objective. Fast resets, randomized variations, dense rewards. Perfect for learning basic web navigation skills. **No external setup needed** - tasks run in isolated browser sessions.

- **WebArena (Evaluation)**: 812 tasks on real websites (e-commerce, forums, GitLab, Wikipedia). Tasks like "find the cheapest laptop and add to cart" or "create a merge request for bug #123". Multi-step, requires reasoning, sparse rewards. Tests whether your agent can handle actual websites. **Requires running 7 backend services** (shopping site, GitLab instance, etc.).

- **VisualWebArena**: Similar to WebArena but requires visual understanding - agents need to interpret images, identify UI elements visually, handle multimodal content.

- **WorkArena**: Enterprise software tasks (CRM, project management, business workflows). Tests automation on corporate-style applications.

**The training → evaluation pipeline:**
1. Train on MiniWoB (simple, controlled, fast iterations)
2. Evaluate on WebArena (complex, realistic, measures real-world capability)

**Key advantage**: You can start training immediately with MiniWoB. No need to set up infrastructure just to test if your code works.

## Quick Start - Training (MiniWoB)

### No Setup Required! 🎉

```python
from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Create environment for MiniWoB training task
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "miniwob",
        "BROWSERGYM_TASK_NAME": "click-test",  # or "click-button", "click-dialog", etc.
    }
)

# Train your agent!
for episode in range(1000):
    result = env.reset()
    print(f"Goal: {result.observation.goal}")

    done = False
    while not done:
        # Your agent decides what to do (`agent` is a placeholder for your own policy object)
        action_str = agent.get_action(result.observation.text)
        action = BrowserGymAction(action_str=action_str)

        result = env.step(action)
        done = result.done

        print(f"Reward: {result.reward}")

env.close()
```
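
The `agent` object in the loop above is not provided by this package. For smoke-testing the loop, a random-click baseline can be sketched as follows. It assumes the text observation is a BrowserGym-style accessibility tree in which each element line starts with its id in square brackets (for example `[12] button 'Submit'`) and that `click` accepts such an id. `RandomClickAgent` and `BID_PATTERN` are hypothetical names.

```python
import random
import re

# Matches a leading "[<id>]" marker at the start of an accessibility-tree line.
BID_PATTERN = re.compile(r"^\s*\[([^\]\s]+)\]", re.MULTILINE)


class RandomClickAgent:
    """Baseline agent that clicks a random element id found in the observation."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def get_action(self, observation_text):
        bids = BID_PATTERN.findall(observation_text or "")
        if not bids:
            # No element ids were found; fall back to a harmless key press.
            return "press('Tab')"
        return f"click('{self._rng.choice(bids)}')"
```

With `agent = RandomClickAgent(seed=0)` the training loop runs end to end, which is enough to check the plumbing before plugging in a learned policy.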

### Available Tasks by Benchmark

#### MiniWoB++ Tasks (Training - 100+ tasks)

MiniWoB tasks are organized by difficulty and type. Here are the main categories:

**Click Tasks** (Basic interaction)
| Task Name | Description | Difficulty |
|-----------|-------------|------------|
| `click-test` | Click a single button | ⭐ Easy |
| `click-button` | Click button with specific text | ⭐ Easy |
| `click-button-sequence` | Click buttons in order | ⭐⭐ Medium |
| `click-checkboxes` | Select specific checkboxes | ⭐⭐ Medium |
| `click-checkboxes-soft` | Select checkboxes (multiple valid) | ⭐⭐ Medium |
| `click-checkboxes-large` | Many checkboxes to select from | ⭐⭐ Medium |
| `click-checkboxes-transfer` | Transfer learning variation | ⭐⭐ Medium |
| `click-dialog` | Click correct button in dialog | ⭐ Easy |
| `click-dialog-2` | More complex dialog | ⭐⭐ Medium |
| `click-link` | Click on a link | ⭐ Easy |
| `click-option` | Select from dropdown | ⭐⭐ Medium |
| `click-pie` | Click on pie chart slice | ⭐⭐ Medium |
| `click-scroll-list` | Click item in scrollable list | ⭐⭐⭐ Hard |
| `click-shades` | Click on specific color shade | ⭐⭐ Medium |
| `click-shape` | Click on specific shape | ⭐⭐ Medium |
| `click-tab` | Switch between tabs | ⭐⭐ Medium |
| `click-tab-2` | More complex tab switching | ⭐⭐⭐ Hard |
| `click-widget` | Click on UI widget | ⭐⭐ Medium |

**Text Entry Tasks** (Typing and forms)
| Task Name | Description | Difficulty |
|-----------|-------------|------------|
| `enter-text` | Type text into input field | ⭐ Easy |
| `enter-text-dynamic` | Dynamic text entry | ⭐⭐ Medium |
| `enter-text-2` | Multiple text fields | ⭐⭐ Medium |
| `enter-password` | Fill password field | ⭐ Easy |
| `enter-date` | Enter a date | ⭐⭐ Medium |
| `enter-time` | Enter a time | ⭐⭐ Medium |
| `login-user` | Complete login form | ⭐⭐ Medium |
| `login-user-popup` | Login via popup | ⭐⭐⭐ Hard |

**Navigation Tasks** (Multi-step interaction)
| Task Name | Description | Difficulty |
|-----------|-------------|------------|
| `navigate-tree` | Navigate through tree structure | ⭐⭐⭐ Hard |
| `search-engine` | Use search interface | ⭐⭐ Medium |
| `use-autocomplete` | Interact with autocomplete | ⭐⭐⭐ Hard |
| `book-flight` | Book a flight (complex form) | ⭐⭐⭐⭐ Very Hard |
| `choose-date` | Pick date from calendar | ⭐⭐⭐ Hard |
| `choose-date-easy` | Simplified date picker | ⭐⭐ Medium |
| `choose-date-medium` | Medium difficulty date picker | ⭐⭐⭐ Hard |
| `choose-list` | Select from long list | ⭐⭐ Medium |

**Visual/Spatial Tasks** (Requires visual understanding)
| Task Name | Description | Difficulty |
|-----------|-------------|------------|
| `count-sides` | Count sides of shape | ⭐⭐ Medium |
| `count-shape` | Count specific shapes | ⭐⭐ Medium |
| `find-word` | Find word in text | ⭐⭐ Medium |
| `focus-text` | Focus on text element | ⭐ Easy |
| `focus-text-2` | More complex focus task | ⭐⭐ Medium |
| `grid-coordinate` | Click grid coordinate | ⭐⭐ Medium |
| `guess-number` | Guess a number game | ⭐⭐⭐ Hard |
| `identify-shape` | Identify shape type | ⭐⭐ Medium |
| `read-table` | Extract info from table | ⭐⭐⭐ Hard |
| `read-table-2` | More complex table reading | ⭐⭐⭐ Hard |

**Email/Social Tasks** (Realistic scenarios)
| Task Name | Description | Difficulty |
|-----------|-------------|------------|
| `email-inbox` | Manage email inbox | ⭐⭐⭐⭐ Very Hard |
| `email-inbox-forward` | Forward emails | ⭐⭐⭐⭐ Very Hard |
| `email-inbox-nl` | Natural language email task | ⭐⭐⭐⭐ Very Hard |
| `email-inbox-star-reply` | Star and reply to emails | ⭐⭐⭐⭐ Very Hard |
| `social-media` | Social media interaction | ⭐⭐⭐⭐ Very Hard |
| `social-media-some` | Partial social media task | ⭐⭐⭐ Hard |

**Total:** 100+ tasks across all categories

**Usage:**
```python
IMAGE = "ghcr.io/openenv/browsergym-env:latest"

# Easy task for quick testing
env = BrowserGymEnv.from_docker_image(IMAGE, environment={"BROWSERGYM_TASK_NAME": "click-test"})

# Medium difficulty for training
env = BrowserGymEnv.from_docker_image(IMAGE, environment={"BROWSERGYM_TASK_NAME": "click-checkboxes"})

# Very hard task for evaluation
env = BrowserGymEnv.from_docker_image(IMAGE, environment={"BROWSERGYM_TASK_NAME": "email-inbox"})
```
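
The difficulty ratings in the tables above lend themselves to a simple curriculum. The sketch below groups a few of the listed task names by their star rating and yields the `environment` dictionaries used throughout this README. `CURRICULUM` and `curriculum_configs` are illustrative names, and the selection of tasks is arbitrary.

```python
# A few task names from the tables above, keyed by their listed difficulty.
CURRICULUM = {
    "easy": ["click-test", "click-button", "enter-text"],
    "medium": ["click-checkboxes", "login-user", "search-engine"],
    "hard": ["click-scroll-list", "navigate-tree", "read-table"],
    "very_hard": ["book-flight", "email-inbox"],
}


def curriculum_configs(levels=("easy", "medium", "hard", "very_hard")):
    """Yield (level, environment dict) pairs in increasing difficulty."""
    for level in levels:
        for task in CURRICULUM[level]:
            yield level, {
                "BROWSERGYM_BENCHMARK": "miniwob",
                "BROWSERGYM_TASK_NAME": task,
            }
```

Each yielded dictionary can be passed as `environment=` when creating the environment, so training can move to the next level once the success rate on the current one is high enough.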

#### WebArena Tasks (Evaluation - 812 tasks)

WebArena tasks are organized by website and difficulty. Tasks are numbered 0-811.

**By Website:**
| Website | Task Count | Description | Example Tasks |
|---------|------------|-------------|---------------|
| Shopping | ~200 | E-commerce site | Search products, add to cart, checkout |
| Shopping Admin | ~150 | Admin panel | Manage products, orders, customers |
| Reddit | ~150 | Forum/social | Post, comment, search discussions |
| GitLab | ~200 | Code repository | Create issues, merge requests, review code |
| Wikipedia | ~100 | Knowledge base | Search, read, extract information |
| Map | ~12 | Location service | Find places, get directions |

**By Difficulty:**
| Difficulty | Task Count | Steps Required | Example |
|------------|------------|----------------|---------|
| Easy | ~200 | 1-5 steps | "Find the price of product X" |
| Medium | ~400 | 5-15 steps | "Add cheapest laptop to cart" |
| Hard | ~212 | 15+ steps | "Create merge request for bug fix" |

**Usage:**
```python
# Task 0 (usually easy)
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "0",
        "SHOPPING": "http://your-server:7770",
        # ... other URLs
    },
)

# Task 156 (GitLab merge request)
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "156",
        # ... URLs
    },
)
```

**Note:** WebArena tasks require the full backend infrastructure. See [WebArena setup guide](https://github.com/web-arena-x/webarena/tree/main/environment_docker).

#### VisualWebArena Tasks (910 tasks)

Similar to WebArena but requires visual understanding. Tasks involve:
- Image-based reasoning
- Visual element identification
- Multimodal interaction (text + images)

#### WorkArena Tasks

Enterprise software automation tasks:
- CRM operations
- Project management
- Business workflows

**Full task lists:**
- [MiniWoB++ tasks](https://github.com/Farama-Foundation/miniwob-plusplus/tree/master/miniwob/environment)
- [WebArena tasks](https://github.com/web-arena-x/webarena/blob/main/config_files/)
- [BrowserGym documentation](https://github.com/ServiceNow/BrowserGym)

## Evaluation (WebArena)

### Prerequisites

WebArena requires setting up backend infrastructure. See the [WebArena documentation](https://github.com/web-arena-x/webarena/tree/main/environment_docker).

### Usage

```python
from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Create environment for WebArena evaluation
env = BrowserGymEnv.from_docker_image(
    "ghcr.io/openenv/browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "0",  # Task ID
        # WebArena backend URLs (required)
        "SHOPPING": "http://your-server:7770",
        "SHOPPING_ADMIN": "http://your-server:7780/admin",
        "REDDIT": "http://your-server:9999",
        "GITLAB": "http://your-server:8023",
        "MAP": "http://your-server:3000",
        "WIKIPEDIA": "http://your-server:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing",
        "HOMEPAGE": "http://your-server:4399",
    }
)

# Evaluate your trained agent (`agent` is a placeholder for your own policy object)
result = env.reset()
while not result.done:
    action_str = agent.get_action(result.observation.text)
    action = BrowserGymAction(action_str=action_str)
    result = env.step(action)

print(f"Success: {result.reward}")
env.close()
```
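
When all seven WebArena services run on one host, the `environment` dictionary can be generated instead of typed out. This is a minimal sketch that reuses the ports and paths from the example above. `WEBARENA_ENDPOINTS` and `webarena_environment` are hypothetical helper names, and the single-host layout is an assumption.

```python
# Port and path of each WebArena service, copied from the example above.
WEBARENA_ENDPOINTS = {
    "SHOPPING": ":7770",
    "SHOPPING_ADMIN": ":7780/admin",
    "REDDIT": ":9999",
    "GITLAB": ":8023",
    "MAP": ":3000",
    "WIKIPEDIA": ":8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing",
    "HOMEPAGE": ":4399",
}


def webarena_environment(host, task_id):
    """Build the `environment` dict for one WebArena task on a single host."""
    host = host.rstrip("/")
    env = {
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": str(task_id),
    }
    for name, suffix in WEBARENA_ENDPOINTS.items():
        env[name] = f"{host}{suffix}"
    return env
```

For example, `webarena_environment("http://your-server", 0)` reproduces the dictionary shown above for task `0`.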

## Building the Docker Image

### Prerequisites

1. **Base Image**: Build the OpenEnv base image first:

```bash
# From the OpenEnv repository root
docker build -t openenv-base:latest -f src/core/containers/images/Dockerfile .
```

### Build the BrowserGym Environment

```bash
# From the OpenEnv repository root
docker build -t browsergym-env:latest -f src/envs/browsergym_env/server/Dockerfile .
```

### Run the Server

#### For MiniWoB (Training):

```bash
docker run -p 8000:8000 \
  -e BROWSERGYM_BENCHMARK="miniwob" \
  -e BROWSERGYM_TASK_NAME="click-test" \
  browsergym-env:latest
```

#### For WebArena (Evaluation):

```bash
docker run -p 8000:8000 \
  -e BROWSERGYM_BENCHMARK="webarena" \
  -e BROWSERGYM_TASK_NAME="0" \
  -e SHOPPING="http://your-server:7770" \
  -e SHOPPING_ADMIN="http://your-server:7780/admin" \
  -e REDDIT="http://your-server:9999" \
  -e GITLAB="http://your-server:8023" \
  -e MAP="http://your-server:3000" \
  -e WIKIPEDIA="http://your-server:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing" \
  -e HOMEPAGE="http://your-server:4399" \
  browsergym-env:latest
```
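
Long `docker run` lines like the one above are easy to mistype. As a convenience, the command can be assembled from a dictionary of variables. This sketch only builds the command string and does not run it. `docker_run_command` is a hypothetical helper, and the image name and port defaults mirror the examples above.

```python
import shlex


def docker_run_command(env_vars, image="browsergym-env:latest", port=8000):
    """Return a `docker run` command line with one `-e` flag per variable."""
    args = ["docker", "run", "-p", f"{port}:{port}"]
    for name, value in env_vars.items():
        args += ["-e", f"{name}={value}"]
    args.append(image)
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)
```

The returned string can be pasted into a shell or split again with `shlex.split` and handed to `subprocess.run`.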

## Environment Details

### Action

Actions in BrowserGym are strings written as function calls, each describing one browser operation:

```python
from envs.browsergym_env import BrowserGymAction

# Click actions
action = BrowserGymAction(action_str="click('Submit button')")
action = BrowserGymAction(action_str="click('element_id_123')")

# Type actions
action = BrowserGymAction(action_str="fill('username', 'john@example.com')")
action = BrowserGymAction(action_str="fill('password', 'secret123')")

# Navigate actions
action = BrowserGymAction(action_str="goto('https://example.com')")

# Keyboard actions
action = BrowserGymAction(action_str="press('Enter')")
action = BrowserGymAction(action_str="press('Tab')")

# Scroll actions
action = BrowserGymAction(action_str="scroll('down')")
```
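
Because actions are plain strings, arguments that contain quotes can silently produce a malformed call. One way to avoid this is to build the string with a small helper. The sketch below assumes the server accepts Python-style string literals, which matches the examples above but is not confirmed here. `format_action` is a hypothetical name.

```python
def format_action(name, *args):
    """Render an action string such as fill('username', 'john@example.com').

    `repr` is used for quoting so that quotes or backslashes inside string
    arguments are escaped instead of breaking the call syntax.
    """
    return f"{name}({', '.join(repr(arg) for arg in args)})"
```

The result can be passed directly as `action_str`, for example `BrowserGymAction(action_str=format_action("press", "Enter"))`.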

### Observation

Observations contain multiple modalities:

```python
result = env.step(action)
obs = result.observation

# Text observations
print(obs.text)          # Primary text representation (AXTree or DOM)
print(obs.axtree_txt)    # Accessibility tree
print(obs.pruned_html)   # Pruned HTML (interactive elements only)

# Page metadata
print(obs.url)           # Current URL
print(obs.goal)          # Task goal/instruction

# Visual (if enabled)
if obs.screenshot is not None:
    print(obs.screenshot.shape)  # [height, width, channels]

# Error handling
if obs.last_action_error:
    print(f"Action failed: {obs.error}")

# Episode status
print(obs.done)          # True if episode ended
print(obs.reward)        # Reward for the step

# Access full BrowserGym data (includes timestamps, etc.)
print(obs.metadata["browsergym_obs"])  # Full observation dict from BrowserGym
print(obs.metadata["browsergym_info"]) # Full info dict (timestamps, page state, etc.)
```
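
Agents usually consume a single text field, so a small fallback helper can be handy when one representation is empty. The sketch below relies only on the attribute names shown above. `observation_text` and the `max_chars` limit are illustrative choices, and `SimpleNamespace` stands in for a real observation.

```python
from types import SimpleNamespace


def observation_text(obs, max_chars=4000):
    """Pick the first non-empty text field and truncate it for a prompt."""
    for field in ("text", "axtree_txt", "pruned_html"):
        value = getattr(obs, field, None)
        if value:
            return value[:max_chars]
    return ""


# Stand-in object with the same attribute names as the observation above.
example = SimpleNamespace(text="", axtree_txt="[12] button 'Submit'", pruned_html="")
print(observation_text(example))
```

Truncating keeps prompts bounded on large pages, at the cost of possibly cutting off elements near the end of the tree.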

#### Advanced: Accessing Raw BrowserGym Data

For VisualWebArena or custom training, you may need additional data like timestamps or browser state. The full BrowserGym observation and info dicts are preserved in `metadata`:

```python
result = env.step(action)

# Access timestamps (if available)
info = result.observation.metadata["browsergym_info"]
if "timestamp" in info:
    print(f"Action timestamp: {info['timestamp']}")

# Access additional observation fields
obs_dict = result.observation.metadata["browsergym_obs"]
if "dom_object" in obs_dict:
    dom = obs_dict["dom_object"]
    # Work with raw DOM object

# Access page performance data
if "performance" in info:
    print(f"Page load time: {info['performance']}")
```

### State

The environment state tracks progress:

```python
state = env.state()

print(f"Benchmark: {state.benchmark}")     # 'miniwob', 'webarena', etc.
print(f"Task: {state.task_name}")          # Task name/ID
print(f"Episode: {state.episode_id}")      # Unique episode ID
print(f"Steps: {state.step_count}")        # Number of steps taken
print(f"Total Reward: {state.cum_reward}") # Cumulative reward
print(f"Goal: {state.goal}")               # Task instruction
print(f"URL: {state.current_url}")         # Current page URL
```

## Configuration

Environment variables:

### Common Settings
- `BROWSERGYM_BENCHMARK`: Benchmark to use (`miniwob`, `webarena`, `visualwebarena`, `workarena`)
- `BROWSERGYM_TASK_NAME`: Specific task name (optional, will use first available if not set)
- `BROWSERGYM_HEADLESS`: Run browser in headless mode (default: `true`)
- `BROWSERGYM_VIEWPORT_WIDTH`: Browser viewport width (default: `1280`)
- `BROWSERGYM_VIEWPORT_HEIGHT`: Browser viewport height (default: `720`)
- `BROWSERGYM_TIMEOUT`: Action timeout in milliseconds (default: `10000`)
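
To illustrate how these variables and their defaults fit together, the following sketch parses them into a typed settings object. It is not the server's actual loading code. `BrowserSettings` and `load_settings` are hypothetical names, and the `miniwob` fallback for the benchmark is an assumption, since no default is listed for it.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BrowserSettings:
    benchmark: str
    task_name: Optional[str]
    headless: bool
    viewport_width: int
    viewport_height: int
    timeout_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> BrowserSettings:
    """Read the common settings, applying the defaults listed above."""
    return BrowserSettings(
        # No default benchmark is documented; "miniwob" is assumed here.
        benchmark=env.get("BROWSERGYM_BENCHMARK", "miniwob"),
        task_name=env.get("BROWSERGYM_TASK_NAME"),
        headless=env.get("BROWSERGYM_HEADLESS", "true").strip().lower() == "true",
        viewport_width=int(env.get("BROWSERGYM_VIEWPORT_WIDTH", "1280")),
        viewport_height=int(env.get("BROWSERGYM_VIEWPORT_HEIGHT", "720")),
        timeout_ms=int(env.get("BROWSERGYM_TIMEOUT", "10000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a configuration before starting a container.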

### WebArena-Specific (only needed for WebArena benchmark)
- `SHOPPING`: Shopping website URL
- `SHOPPING_ADMIN`: Shopping admin panel URL
- `REDDIT`: Reddit-like forum URL
- `GITLAB`: GitLab instance URL
- `MAP`: Map service URL
- `WIKIPEDIA`: Wikipedia instance URL
- `HOMEPAGE`: Homepage URL
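
A missing URL typically surfaces only once a task tries to open that site, so checking the variables up front can save a failed run. This is a minimal sketch. `REQUIRED_WEBARENA_VARS` and `missing_webarena_vars` are hypothetical names, and treating all seven variables as required follows the list above.

```python
REQUIRED_WEBARENA_VARS = (
    "SHOPPING", "SHOPPING_ADMIN", "REDDIT", "GITLAB", "MAP", "WIKIPEDIA", "HOMEPAGE",
)


def missing_webarena_vars(env):
    """Return the names of WebArena variables that are unset or empty."""
    return [name for name in REQUIRED_WEBARENA_VARS if not env.get(name)]
```

Calling `missing_webarena_vars(os.environ)` before launching an evaluation returns an empty list when the configuration is complete.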

## Supported Benchmarks

### 1. MiniWoB++ (Training) ✅ Recommended for Training

- **100+ tasks** ranging from simple (click buttons) to complex (form filling, navigation)
- **Fast**: Instant resets, quick episodes
- **Randomized**: Task variations for generalization
- **No setup**: Works out-of-the-box
- **Dense rewards**: Immediate feedback for learning

**Use Case**: Train agents on fundamental web navigation skills

### 2. WebArena (Evaluation) 📊 Benchmark

- **812 realistic tasks** across 6 websites
- **Complex**: Multi-step reasoning, real web interfaces
- **Requires setup**: Need to run 7 backend services
- **Sparse rewards**: Binary success/failure
- **Evaluation-focused**: Test real-world performance

**Use Case**: Evaluate agents on realistic web tasks

### 3. VisualWebArena (Evaluation) 👁️ Visual Benchmark

- **910 tasks** requiring visual understanding
- **Multimodal**: Both text and visual observations
- **Requires setup**: Similar to WebArena
- **Challenging**: Requires visual reasoning

**Use Case**: Test visual web navigation capabilities

### 4. WorkArena (Evaluation) 💼 Enterprise Benchmark

- **Enterprise tasks**: CRM, project management, etc.
- **Realistic workflows**: Real enterprise software
- **Requires setup**: Enterprise software instances

**Use Case**: Evaluate on business automation tasks

## Typical Training Pipeline

```python
from envs.browsergym_env import BrowserGymEnv, BrowserGymAction

# Stage 1: Train on MiniWoB (simple tasks, fast)
train_env = BrowserGymEnv.from_docker_image(
    "browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "miniwob",
        "BROWSERGYM_TASK_NAME": "click-button",
    }
)

# Train your agent (RL, imitation learning, etc.); `agent.train` is a placeholder for your own training code
agent.train(train_env, num_episodes=10000)
train_env.close()

# Stage 2: Evaluate on WebArena (complex tasks, realistic)
eval_env = BrowserGymEnv.from_docker_image(
    "browsergym-env:latest",
    environment={
        "BROWSERGYM_BENCHMARK": "webarena",
        "BROWSERGYM_TASK_NAME": "0",
        # ... WebArena URLs
    }
)

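# Test performance (`agent.evaluate` is a placeholder; this environment is bound to task "0",
# so sweeping all 812 tasks needs one environment per task ID)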
success_rate = agent.evaluate(eval_env, num_tasks=812)
print(f"WebArena Success Rate: {success_rate:.2%}")
eval_env.close()
```
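
Since `BROWSERGYM_TASK_NAME` fixes the task when the environment is created, a success rate over many tasks needs one environment per task ID. The loop below sketches that under stated assumptions: `make_env(task_id)` is a user-supplied factory, `make_action` wraps an action string (for instance `lambda s: BrowserGymAction(action_str=s)`), and a positive final reward is taken to mean success. None of these names come from OpenEnv.

```python
def evaluate(make_env, make_action, agent, task_ids, max_steps=30):
    """Run one episode per task id and return the fraction that succeeded."""
    successes = 0
    for task_id in task_ids:
        env = make_env(task_id)
        try:
            result = env.reset()
            steps = 0
            while not result.done and steps < max_steps:
                action_str = agent.get_action(result.observation.text)
                result = env.step(make_action(action_str))
                steps += 1
            # Assumption: a positive final reward marks a solved task.
            if result.done and (result.reward or 0) > 0:
                successes += 1
        finally:
            env.close()  # Always release the environment, even on errors.
    return successes / len(task_ids) if task_ids else 0.0
```

Starting a container per task is slow, so in practice it may be worth evaluating a subset of task IDs first.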

## Development & Testing

### Running Tests

```bash
# From the OpenEnv repository root
pytest tests/envs/test_browsergym_env.py
```

### Local Development

```bash
# Install in development mode
cd /path/to/OpenEnv
pip install -e .

# Install BrowserGym
pip install browsergym browsergym-miniwob browsergym-webarena

# Run the server locally
cd src/envs/browsergym_env/server
export BROWSERGYM_BENCHMARK=miniwob
export BROWSERGYM_TASK_NAME=click-test
python app.py
```

## Project Structure

```
browsergym_env/
├── __init__.py              # Module exports
├── models.py                # Action, Observation, State dataclasses
├── client.py                # HTTPEnvClient implementation
├── README.md                # This file
└── server/
    ├── __init__.py
    ├── app.py               # FastAPI application
    ├── browsergym_environment.py  # Environment implementation
    ├── Dockerfile           # Container specification
    └── requirements.txt     # Python dependencies
```

## References

- [BrowserGym GitHub](https://github.com/ServiceNow/BrowserGym)
- [MiniWoB++ Paper](https://arxiv.org/abs/1802.08802)
- [WebArena Paper](https://arxiv.org/abs/2307.13854)
- [WebArena Website](https://webarena.dev/)
- [VisualWebArena Project Page](https://jykoh.com/vwa)
- [OpenEnv Documentation](https://github.com/openenv/openenv)