Spaces: ServiceNow/browsergym-leaderboard

Aman-J committed 15 days ago

Commit f6deee3 (1 parent: 24757bf)

add new results

Files changed (30)

GenericAgent-Claude-3.7-Sonnet/README.md +44 -0
GenericAgent-Claude-3.7-Sonnet/webarena.json +16 -0
GenericAgent-Claude-4-Sonnet/README.md +44 -0
GenericAgent-Claude-4-Sonnet/miniwob.json +17 -0
GenericAgent-Claude-4-Sonnet/workarena-l1.json +16 -0
GenericAgent-Claude-4-Sonnet/workarena-l2.json +16 -0
GenericAgent-GPT-4_1-Mini/README.md +44 -0
GenericAgent-GPT-4_1-Mini/webarena.json +16 -0
GenericAgent-GPT-5-mini/README.md +44 -0
GenericAgent-GPT-5-mini/miniwob.json +16 -0
GenericAgent-GPT-5-mini/workarena-l1.json +16 -0
GenericAgent-GPT-5-mini/workarena-l2.json +16 -0
GenericAgent-GPT-5-nano/README.md +44 -0
GenericAgent-GPT-5-nano/miniwob.json +16 -0
GenericAgent-GPT-5-nano/workarena-l1.json +16 -0
GenericAgent-GPT-5-nano/workarena-l2.json +16 -0
GenericAgent-GPT-5/README.md +44 -0
GenericAgent-GPT-5/miniwob.json +16 -0
GenericAgent-GPT-5/workarena-l1.json +31 -0
GenericAgent-GPT-5/workarena-l2.json +16 -0
GenericAgent-GPT-5/workarena-l3.json +16 -0
GenericAgent-GPT-oss-120b/README.md +44 -0
GenericAgent-GPT-oss-120b/miniwob.json +16 -0
GenericAgent-GPT-oss-120b/workarena-l1.json +16 -0
GenericAgent-GPT-oss-120b/workarena-l2.json +16 -0
GenericAgent-GPT-oss-20b/README.md +44 -0
GenericAgent-GPT-oss-20b/miniwob.json +16 -0
GenericAgent-GPT-oss-20b/workarena-l1.json +16 -0
GenericAgent-GPT-oss-20b/workarena-l2.json +16 -0
results/OrbyAgent-Claude-3.5-Sonnet/README.md +1 -0

GenericAgent-Claude-3.7-Sonnet/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-Claude-3.7-Sonnet
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses Claude-3.7-Sonnet (claude-3-7-sonnet-20250219) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-Claude-3.7-Sonnet/webarena.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+    "agent_name": "GenericAgent-claude-3-7-sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Webarena",
+    "score": 0.446,
+    "std_err": 0.025,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
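
Each result file reports a `score` together with a `std_err`. As a minimal sketch of how a reader might turn those two fields into an interval, the snippet below applies a normal approximation (`z = 1.96`), which is an assumption and not something this commit states. The helper name `confidence_interval` is hypothetical, and the entry is trimmed to the fields the calculation needs.

```python
import json

# Entry trimmed from the webarena.json diff above.
entry = json.loads("""
{
  "agent_name": "GenericAgent-claude-3-7-sonnet",
  "benchmark": "Webarena",
  "score": 0.446,
  "std_err": 0.025
}
""")


def confidence_interval(score, std_err, z=1.96):
    """Return (low, high) under a normal approximation (an assumption)."""
    half_width = z * std_err
    return score - half_width, score + half_width


low, high = confidence_interval(entry["score"], entry["std_err"])
print(f"{entry['benchmark']}: {entry['score']:.3f} (95% CI {low:.3f} to {high:.3f})")
# prints: Webarena: 0.446 (95% CI 0.397 to 0.495)
```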

GenericAgent-Claude-4-Sonnet/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-Claude-4-Sonnet
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses claude-4-sonnet (claude-sonnet-4-20250514) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-Claude-4-Sonnet/miniwob.json ADDED

	@@ -0,0 +1,17 @@

+[
+    {
+    "agent_name": "GenericAgent-claude-sonnet-4",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Miniwob",
+    "score": 0.707,
+    "std_err": 0.018,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-Claude-4-Sonnet/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+    "agent_name": "GenericAgent-claude-sonnet-4-20250514",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.633,
+    "std_err": 0.027,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-Claude-4-Sonnet/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+    "agent_name": "GenericAgent-claude-sonnet-4-20250514",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.404,
+    "std_err": 0.032,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-4_1-Mini/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-GPT_4_1_mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-4.1-mini (gpt-4.1-mini-2025-04-14) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-4_1-Mini/webarena.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-4.1-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "webarena",
+    "score": 0.307,
+    "std_err": 0.024,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
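
The `benchmark` field is not spelled consistently across the files in this commit ("Webarena" and "webarena", "Miniwob" and "MiniWoB"). Whether the leaderboard app normalizes these labels is not shown here. The sketch below is one way a consumer of the files could group them case-insensitively; `normalize_benchmark` is a hypothetical helper, not part of the repository.

```python
def normalize_benchmark(name: str) -> str:
    """Trim and lower-case a benchmark label so spelling variants compare equal."""
    return name.strip().lower()


# Spellings that appear in this commit's JSON files.
labels = ["Webarena", "webarena", "Miniwob", "MiniWoB", "Workarena-L1"]

# Map each normalized label to the original spellings seen for it.
groups = {}
for label in labels:
    groups.setdefault(normalize_benchmark(label), []).append(label)

for key, variants in groups.items():
    print(key, variants)
```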

GenericAgent-GPT-5-mini/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5-Mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-5-mini (gpt-5-mini-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-5-mini/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 0.71,
+    "std_err": 0.018,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
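
Two MiniWoB scores in this commit are very close: 0.71 for GPT-5-mini here and 0.707 for Claude-4-Sonnet, each with `std_err` 0.018. A rough way to judge whether such a gap is meaningful is a two-sample z-score. The sketch below assumes the two runs are independent and that `std_err` is a standard error of the mean; neither assumption is confirmed by the diff, and `z_score` is a hypothetical helper.

```python
import math


def z_score(score_a, se_a, score_b, se_b):
    """Difference of two scores in units of their combined standard error."""
    return (score_a - score_b) / math.sqrt(se_a**2 + se_b**2)


# Values taken from the two miniwob.json files in this commit.
z = z_score(0.71, 0.018, 0.707, 0.018)
print(f"z = {z:.3f}")

# |z| well below 1.96 means the gap is within noise at the usual 95% level.
print("distinguishable" if abs(z) > 1.96 else "not distinguishable")
```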

GenericAgent-GPT-5-mini/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.606,
+    "std_err": 0.027,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5-mini/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.477,
+    "std_err": 0.033,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5-nano/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5-Nano
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-5-nano (gpt-5-nano-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-5-nano/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+ {
+    "agent_name": "GenericAgent-gpt-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 0.648,
+    "std_err": 0.019,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5-nano/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.406,
+    "std_err": 0.027,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5-nano/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.034,
+    "std_err": 0.012,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-5 (gpt-5-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-5/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-2025-08-07",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 0.715,
+    "std_err": 0.018,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5/workarena-l1.json ADDED

	@@ -0,0 +1,31 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-2025-08-07",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.661,
+    "std_err": 0.026,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  },
+   {
+    "agent_name": "GenericAgent-gpt-5-2025-08-07",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.791,
+    "std_err": 0.022,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "No",
+    "reproducible": "Yes",
+    "comments": "Increased max_steps from 15 to 30",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
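
This file holds two entries for the same agent and benchmark: one that followed the evaluation protocol and one that did not (`max_steps` raised from 15 to 30). A minimal sketch of separating the two when reading the file is shown below; the entries are trimmed to the relevant fields, and how the leaderboard itself treats non-compliant entries is not shown in this diff.

```python
import json

# Both entries from workarena-l1.json above, trimmed to the relevant fields.
entries = json.loads("""
[
  {
    "agent_name": "GenericAgent-gpt-5-2025-08-07",
    "benchmark": "Workarena-L1",
    "score": 0.661,
    "followed_evaluation_protocol": "Yes",
    "comments": "NA"
  },
  {
    "agent_name": "GenericAgent-gpt-5-2025-08-07",
    "benchmark": "Workarena-L1",
    "score": 0.791,
    "followed_evaluation_protocol": "No",
    "comments": "Increased max_steps from 15 to 30"
  }
]
""")

# The flag is stored as the string "Yes"/"No", not as a JSON boolean.
compliant = [e for e in entries if e["followed_evaluation_protocol"] == "Yes"]
deviating = [e for e in entries if e["followed_evaluation_protocol"] != "Yes"]

for e in compliant:
    print("protocol-compliant:", e["score"])
for e in deviating:
    print("deviation:", e["score"], "-", e["comments"])
```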

GenericAgent-GPT-5/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-5-2025-08-07",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.694,
+    "std_err": 0.03,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-5/workarena-l3.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+    "agent_name": "GenericAgent-gpt-5-2025-08-07",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L3",
+    "score": 0.115,
+    "std_err": 0.021,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "No",
+    "reproducible": "Yes",
+    "comments": "Increased max_steps from 50 to 100",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-120b/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-OSS-120B
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-oss-120b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-oss-120b/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-openai_gpt-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 0.664,
+    "std_err": 0.019,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-120b/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-openai_gpt-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.509,
+    "std_err": 0.028,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-120b/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-openai_gpt-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.115,
+    "std_err": 0.021,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-20b/README.md ADDED

	@@ -0,0 +1,44 @@

+### GenericAgent-OSS-20b
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses gpt-oss-20b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

GenericAgent-GPT-oss-20b/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-openai_gpt-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 0.64,
+    "std_err": 0.019,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-20b/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L1",
+    "score": 0.385,
+    "std_err": 0.027,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

GenericAgent-GPT-oss-20b/workarena-l2.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-gpt-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "Workarena-L2",
+    "score": 0.026,
+    "std_err": 0.01,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
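
Every result entry added in this commit carries the same twelve keys. The sketch below checks an entry against that key set and parses its `date_time`. The key list is inferred from the files in this diff, not from a published schema, and `missing_keys` is a hypothetical helper.

```python
import json
from datetime import datetime

# Keys observed in every entry of this commit (inferred, not an official schema).
REQUIRED_KEYS = [
    "agent_name", "study_id", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced", "date_time",
]


def missing_keys(entry):
    """Return the required keys that are absent from one result entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# Entry copied from the workarena-l2.json diff above.
entry = json.loads("""
{
  "agent_name": "GenericAgent-gpt-oss-20b",
  "study_id": "2025-08-07_21-09-16",
  "benchmark": "Workarena-L2",
  "score": 0.026,
  "std_err": 0.01,
  "benchmark_specific": "No",
  "benchmark_tuned": "No",
  "followed_evaluation_protocol": "Yes",
  "reproducible": "Yes",
  "comments": "NA",
  "original_or_reproduced": "Original",
  "date_time": "2025-08-07 21:09:16"
}
""")

print("missing:", missing_keys(entry))

# date_time uses a space-separated "YYYY-MM-DD HH:MM:SS" layout in these files.
parsed = datetime.strptime(entry["date_time"], "%Y-%m-%d %H:%M:%S")
print("parsed:", parsed.isoformat())
```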

results/OrbyAgent-Claude-3.5-Sonnet/README.md CHANGED

@@ -5,3 +5,4 @@ This agent is developed by [Orby AI](https://www.orby.ai/).
 The agent does not use any benchmark-specific information in the prompts. For WebArena benchmark, we use the original evaluator and task definitions for fair comparison.
 
 It uses Claude-3.5-sonnet-20241022 as a backend, with both screenshot and HTML as inputs. More details can be found in our [research blog](https://www.orby.ai/resources/elevating-automation-orby-ais-generic-agent-framework-and-self-adaptive-interface-learning-technique).
+