new-results #10, opened by jaiswala

- results/GenericAgent-Claude-3.7-Sonnet/README.md +44 -0
- results/GenericAgent-Claude-3.7-Sonnet/webarena.json +16 -0
- results/GenericAgent-Claude-4-Sonnet/README.md +44 -0
- results/GenericAgent-Claude-4-Sonnet/miniwob.json +17 -0
- results/GenericAgent-Claude-4-Sonnet/workarena-l1.json +16 -0
- results/GenericAgent-Claude-4-Sonnet/workarena-l2.json +16 -0
- results/GenericAgent-GPT-4.1-Mini/README.md +44 -0
- results/GenericAgent-GPT-4.1-Mini/webarena.json +16 -0
- results/GenericAgent-GPT-5-mini/README.md +44 -0
- results/GenericAgent-GPT-5-mini/miniwob.json +16 -0
- results/GenericAgent-GPT-5-mini/workarena-l1.json +16 -0
- results/GenericAgent-GPT-5-mini/workarena-l2.json +16 -0
- results/GenericAgent-GPT-5-nano/README.md +44 -0
- results/GenericAgent-GPT-5-nano/miniwob.json +16 -0
- results/GenericAgent-GPT-5-nano/workarena-l1.json +16 -0
- results/GenericAgent-GPT-5-nano/workarena-l2.json +16 -0
- results/GenericAgent-GPT-5/README.md +44 -0
- results/GenericAgent-GPT-5/miniwob.json +16 -0
- results/GenericAgent-GPT-5/workarena-l1.json +16 -0
- results/GenericAgent-GPT-5/workarena-l2.json +16 -0
- results/GenericAgent-GPT-5/workarena-l3.json +16 -0
- results/GenericAgent-GPT-oss-120b/README.md +44 -0
- results/GenericAgent-GPT-oss-120b/miniwob.json +16 -0
- results/GenericAgent-GPT-oss-120b/workarena-l1.json +16 -0
- results/GenericAgent-GPT-oss-120b/workarena-l2.json +16 -0
- results/GenericAgent-GPT-oss-20b/README.md +44 -0
- results/GenericAgent-GPT-oss-20b/miniwob.json +16 -0
- results/GenericAgent-GPT-oss-20b/workarena-l1.json +16 -0
- results/GenericAgent-GPT-oss-20b/workarena-l2.json +16 -0
- results/OrbyAgent-Claude-3.5-Sonnet/README.md +1 -0
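
The file list above follows a `results/<agent>/<benchmark>.json` layout, and each JSON file in this PR holds a list of result objects. A minimal sketch of how such files could be collected into one table is shown here; the function name `load_results` is hypothetical and not part of the repository, and the sketch assumes every entry carries the `agent_name`, `benchmark`, `score`, and `std_err` keys seen in the files of this PR.

```python
import json
from pathlib import Path


def load_results(root):
    """Collect (agent, benchmark, score, std_err) from <root>/<agent>/<benchmark>.json.

    Hypothetical helper: assumes each JSON file is a list of result objects
    with the keys used in this PR.
    """
    rows = []
    # Only *.json files are matched, so the per-agent README.md is skipped.
    for path in sorted(Path(root).glob("*/*.json")):
        entries = json.loads(path.read_text())
        for entry in entries:
            rows.append(
                (
                    entry["agent_name"],
                    entry["benchmark"],
                    entry["score"],
                    entry["std_err"],
                )
            )
    return rows
```

Calling `load_results("results")` from the repository root would return one tuple per result entry, which can then be sorted or grouped by benchmark.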
    	
results/GenericAgent-Claude-3.7-Sonnet/README.md ADDED
@@ -0,0 +1,44 @@
+### GenericAgent-Claude-3.7-Sonnet
+
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+
+It uses Claude-3.7-Sonnet (claude-3-7-sonnet-20250219) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
    	
results/GenericAgent-Claude-3.7-Sonnet/webarena.json ADDED
@@ -0,0 +1,16 @@
+[
+    {
+    "agent_name": "GenericAgent-Claude-3.7-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WebArena",
+    "score": 44.6,
+    "std_err": 2.5,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
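
Each result entry reports a `score` together with a `std_err`. As a rough reading aid, the two can be combined into an approximate 95% interval. The sketch below assumes that `std_err` is a standard error expressed in the same percentage points as `score` and that a normal approximation is acceptable; neither assumption is stated in the PR, and the helper name `confidence_interval` is hypothetical.

```python
def confidence_interval(score, std_err, z=1.96):
    """Return (low, high) for score +/- z * std_err, clipped to [0, 100].

    Assumes score and std_err are both in percentage points and that a
    normal approximation is reasonable (an assumption, not from the PR).
    """
    half_width = z * std_err
    return max(0.0, score - half_width), min(100.0, score + half_width)


# WebArena entry for GenericAgent-Claude-3.7-Sonnet: score 44.6, std_err 2.5
low, high = confidence_interval(44.6, 2.5)
print(f"approx. 95% interval: {low:.1f} to {high:.1f}")
```

Under these assumptions the WebArena score of 44.6 corresponds to an interval of roughly 39.7 to 49.5.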
    	
results/GenericAgent-Claude-4-Sonnet/README.md ADDED
@@ -0,0 +1,44 @@
+### GenericAgent-Claude-4-Sonnet
+
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+
+It uses claude-4-sonnet (claude-sonnet-4-20250514) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
    	
results/GenericAgent-Claude-4-Sonnet/miniwob.json ADDED
@@ -0,0 +1,17 @@
+[
+    {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 70.7,
+    "std_err": 1.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+
+]
    	
results/GenericAgent-Claude-4-Sonnet/workarena-l1.json ADDED
@@ -0,0 +1,16 @@
+[
+    {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 63.3,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
    	
results/GenericAgent-Claude-4-Sonnet/workarena-l2.json ADDED
@@ -0,0 +1,16 @@
+[
+    {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 40.4,
+    "std_err": 3.2,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
    	
results/GenericAgent-GPT-4.1-Mini/README.md ADDED
@@ -0,0 +1,44 @@
+### GenericAgent-GPT_4_1_mini
+
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+
+It uses gpt-4.1-mini (gpt-4.1-mini-2025-04-14) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
    	
results/GenericAgent-GPT-4.1-Mini/webarena.json ADDED
@@ -0,0 +1,16 @@
+[
+  {
+    "agent_name": "GenericAgent-GPT-4.1-Mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WebArena",
+    "score": 30.7,
+    "std_err": 2.4,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
    	
results/GenericAgent-GPT-5-mini/README.md ADDED
@@ -0,0 +1,44 @@
+### GenericAgent-GPT-5-Mini
+
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+
+It uses gpt-5-mini (gpt-5-mini-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
    	
results/GenericAgent-GPT-5-mini/miniwob.json ADDED
@@ -0,0 +1,16 @@
+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 71,
+    "std_err": 1.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
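
When two agents report close scores on the same benchmark, the reported standard errors give a rough sense of whether the gap is meaningful. The sketch below compares the two MiniWoB entries in this PR (GPT-5-mini at 71 and Claude-4-Sonnet at 70.7, both with `std_err` 1.8) using a simple z-score. It assumes the two runs are independent and that `std_err` is a standard error in percentage points; these are assumptions for illustration, not something the PR states, and `z_score` is a hypothetical helper name.

```python
import math


def z_score(score_a, se_a, score_b, se_b):
    """Difference of two scores divided by the combined standard error.

    Assumes independent runs, so the standard error of the difference is
    sqrt(se_a**2 + se_b**2) (an assumption, not from the PR).
    """
    return (score_a - score_b) / math.hypot(se_a, se_b)


# MiniWoB: GenericAgent-GPT-5-mini vs GenericAgent-Claude-4-Sonnet
z = z_score(71, 1.8, 70.7, 1.8)
print(f"z = {z:.2f}")
```

A z-score this close to zero (about 0.12) suggests that, under these assumptions, the two MiniWoB results are not distinguishable.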
    	
results/GenericAgent-GPT-5-mini/workarena-l1.json ADDED
@@ -0,0 +1,16 @@
+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 60.6,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
    	
results/GenericAgent-GPT-5-mini/workarena-l2.json ADDED
@@ -0,0 +1,16 @@
+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 47.7,
+    "std_err": 3.3,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
    	
results/GenericAgent-GPT-5-nano/README.md ADDED
@@ -0,0 +1,44 @@
+### GenericAgent-GPT-5-Nano
+
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+
+It uses gpt-5-nano (gpt-5-nano-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
    	
        results/GenericAgent-GPT-5-nano/miniwob.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | 
|  | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5-nano",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "MiniWoB",
         | 
| 6 | 
            +
                "score": 64.8,
         | 
| 7 | 
            +
                "std_err": 1.9,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
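Each result file stores a `score` together with its `std_err`. As a rough reading aid, the sketch below turns the entry above into an approximate interval. It assumes a normal approximation with `z=1.96` (about 95%), which is an assumption of this example and not something the result file states; `confidence_interval` is a hypothetical helper name.

```python
import json

# Fields copied from the miniwob.json entry above (other keys omitted for brevity).
RAW = """
[
  {
    "agent_name": "GenericAgent-GPT-5-nano",
    "benchmark": "MiniWoB",
    "score": 64.8,
    "std_err": 1.9
  }
]
"""


def confidence_interval(score, std_err, z=1.96):
    """Return (low, high) under a normal approximation.

    z=1.96 is an assumed ~95% level, not a value taken from the leaderboard.
    """
    half_width = z * std_err
    return score - half_width, score + half_width


entries = json.loads(RAW)
entry = entries[0]
low, high = confidence_interval(entry["score"], entry["std_err"])
print(f'{entry["agent_name"]} on {entry["benchmark"]}: {low:.1f} to {high:.1f}')
```

Under these assumptions the MiniWoB score of 64.8 with a standard error of 1.9 corresponds to an interval of roughly 61.1 to 68.5.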
    	
        results/GenericAgent-GPT-5-nano/workarena-l1.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5-nano",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L1",
         | 
| 6 | 
            +
                "score": 40.6,
         | 
| 7 | 
            +
                "std_err": 2.7,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-5-nano/workarena-l2.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5-nano",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L2",
         | 
| 6 | 
            +
                "score": 3.4,
         | 
| 7 | 
            +
                "std_err": 1.2,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-5/README.md
    ADDED
    
    | @@ -0,0 +1,44 @@ | |
| 1 | 
            +
            ### GenericAgent-GPT-5
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
         | 
| 4 | 
            +
             | 
| 5 | 
            +
            It uses gpt-5 (gpt-5-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
         | 
| 6 | 
            +
            ```python
         | 
| 7 | 
            +
            BASE_FLAGS = GenericPromptFlags(
         | 
| 8 | 
            +
                obs=dp.ObsFlags(
         | 
| 9 | 
            +
                    use_html=False,
         | 
| 10 | 
            +
                    use_ax_tree=True,
         | 
| 11 | 
            +
                    use_focused_element=True,
         | 
| 12 | 
            +
                    use_error_logs=True,
         | 
| 13 | 
            +
                    use_history=True,
         | 
| 14 | 
            +
                    use_past_error_logs=False,
         | 
| 15 | 
            +
                    use_action_history=True,
         | 
| 16 | 
            +
                    use_think_history=True,
         | 
| 17 | 
            +
                    use_diff=False,
         | 
| 18 | 
            +
                    html_type="pruned_html",
         | 
| 19 | 
            +
                    use_screenshot=False,
         | 
| 20 | 
            +
                    use_som=False,
         | 
| 21 | 
            +
                    extract_visible_tag=True,
         | 
| 22 | 
            +
                    extract_clickable_tag=True,
         | 
| 23 | 
            +
                    extract_coords="False",
         | 
| 24 | 
            +
                    filter_visible_elements_only=False,
         | 
| 25 | 
            +
                ),
         | 
| 26 | 
            +
                action=dp.ActionFlags(
         | 
| 27 | 
            +
                    multi_actions=False,
         | 
| 28 | 
            +
                    action_set="bid",
         | 
| 29 | 
            +
                    long_description=False,
         | 
| 30 | 
            +
                    individual_examples=False,
         | 
| 31 | 
            +
                ),
         | 
| 32 | 
            +
                use_plan=False,
         | 
| 33 | 
            +
                use_criticise=False,
         | 
| 34 | 
            +
                use_thinking=True,
         | 
| 35 | 
            +
                use_memory=False,
         | 
| 36 | 
            +
                use_concrete_example=True,
         | 
| 37 | 
            +
                use_abstract_example=True,
         | 
| 38 | 
            +
                use_hints=True,
         | 
| 39 | 
            +
                enable_chat=False,
         | 
| 40 | 
            +
                max_prompt_tokens=40_000,
         | 
| 41 | 
            +
                be_cautious=True,
         | 
| 42 | 
            +
                extra_instructions=None,
         | 
| 43 | 
            +
            )
         | 
| 44 | 
            +
            ```
         | 
    	
        results/GenericAgent-GPT-5/miniwob.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "MiniWoB",
         | 
| 6 | 
            +
                "score": 71.5,
         | 
| 7 | 
            +
                "std_err": 1.8,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-5/workarena-l1.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L1",
         | 
| 6 | 
            +
                "score": 79.1,
         | 
| 7 | 
            +
                "std_err": 2.2,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "No",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "Increased max_steps from 15 to 30",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-5/workarena-l2.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L2",
         | 
| 6 | 
            +
                "score": 69.4,
         | 
| 7 | 
            +
                "std_err": 3.0,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-5/workarena-l3.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-5",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L3",
         | 
| 6 | 
            +
                "score": 11.5,
         | 
| 7 | 
            +
                "std_err": 2.1,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "No",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "Increased max_steps from 50 to 100",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
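The two GPT-5 entries that set `"followed_evaluation_protocol": "No"` (WorkArena-L1 and WorkArena-L3) both explain the deviation in `"comments"`. The sketch below shows one way such a convention could be checked before submitting. It is a hypothetical consistency check written for this example, not a rule the leaderboard is known to enforce, and `undocumented_deviations` is an assumed helper name.

```python
import json

# Fields copied from the workarena-l3.json entry above (other keys omitted for brevity).
RAW = """
[
  {
    "agent_name": "GenericAgent-GPT-5",
    "benchmark": "WorkArena-L3",
    "followed_evaluation_protocol": "No",
    "comments": "Increased max_steps from 50 to 100"
  }
]
"""


def undocumented_deviations(entries):
    """Return entries that deviate from the protocol without an explanatory comment.

    An entry is flagged when "followed_evaluation_protocol" is "No" and
    "comments" is missing, empty, or the placeholder "NA".
    """
    flagged = []
    for entry in entries:
        deviates = entry.get("followed_evaluation_protocol") == "No"
        comment = entry.get("comments", "NA").strip()
        if deviates and comment in ("", "NA"):
            flagged.append(entry)
    return flagged


entries = json.loads(RAW)
flagged = undocumented_deviations(entries)
print(f"{len(flagged)} undocumented deviation(s)")
```

For the entry above the check flags nothing, because the comment states that `max_steps` was increased from 50 to 100.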
    	
        results/GenericAgent-GPT-oss-120b/README.md
    ADDED
    
    | @@ -0,0 +1,44 @@ | |
| 1 | 
            +
            ### GenericAgent-GPT-oss-120b
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
         | 
| 4 | 
            +
             | 
| 5 | 
            +
            It uses gpt-oss-120b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
         | 
| 6 | 
            +
            ```python
         | 
| 7 | 
            +
            BASE_FLAGS = GenericPromptFlags(
         | 
| 8 | 
            +
                obs=dp.ObsFlags(
         | 
| 9 | 
            +
                    use_html=False,
         | 
| 10 | 
            +
                    use_ax_tree=True,
         | 
| 11 | 
            +
                    use_focused_element=True,
         | 
| 12 | 
            +
                    use_error_logs=True,
         | 
| 13 | 
            +
                    use_history=True,
         | 
| 14 | 
            +
                    use_past_error_logs=False,
         | 
| 15 | 
            +
                    use_action_history=True,
         | 
| 16 | 
            +
                    use_think_history=True,
         | 
| 17 | 
            +
                    use_diff=False,
         | 
| 18 | 
            +
                    html_type="pruned_html",
         | 
| 19 | 
            +
                    use_screenshot=False,
         | 
| 20 | 
            +
                    use_som=False,
         | 
| 21 | 
            +
                    extract_visible_tag=True,
         | 
| 22 | 
            +
                    extract_clickable_tag=True,
         | 
| 23 | 
            +
                    extract_coords="False",
         | 
| 24 | 
            +
                    filter_visible_elements_only=False,
         | 
| 25 | 
            +
                ),
         | 
| 26 | 
            +
                action=dp.ActionFlags(
         | 
| 27 | 
            +
                    multi_actions=False,
         | 
| 28 | 
            +
                    action_set="bid",
         | 
| 29 | 
            +
                    long_description=False,
         | 
| 30 | 
            +
                    individual_examples=False,
         | 
| 31 | 
            +
                ),
         | 
| 32 | 
            +
                use_plan=False,
         | 
| 33 | 
            +
                use_criticise=False,
         | 
| 34 | 
            +
                use_thinking=True,
         | 
| 35 | 
            +
                use_memory=False,
         | 
| 36 | 
            +
                use_concrete_example=True,
         | 
| 37 | 
            +
                use_abstract_example=True,
         | 
| 38 | 
            +
                use_hints=True,
         | 
| 39 | 
            +
                enable_chat=False,
         | 
| 40 | 
            +
                max_prompt_tokens=40_000,
         | 
| 41 | 
            +
                be_cautious=True,
         | 
| 42 | 
            +
                extra_instructions=None,
         | 
| 43 | 
            +
            )
         | 
| 44 | 
            +
            ```
         | 
    	
        results/GenericAgent-GPT-oss-120b/miniwob.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-120b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "MiniWoB",
         | 
| 6 | 
            +
                "score": 66.4,
         | 
| 7 | 
            +
                "std_err": 1.9,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-oss-120b/workarena-l1.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-120b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L1",
         | 
| 6 | 
            +
                "score": 50.9,
         | 
| 7 | 
            +
                "std_err": 2.8,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-oss-120b/workarena-l2.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-120b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L2",
         | 
| 6 | 
            +
                "score": 11.5,
         | 
| 7 | 
            +
                "std_err": 2.1,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-oss-20b/README.md
    ADDED
    
    | @@ -0,0 +1,44 @@ | |
| 1 | 
            +
            ### GenericAgent-GPT-oss-20b
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
         | 
| 4 | 
            +
             | 
| 5 | 
            +
            It uses gpt-oss-20b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
         | 
| 6 | 
            +
            ```python
         | 
| 7 | 
            +
            BASE_FLAGS = GenericPromptFlags(
         | 
| 8 | 
            +
                obs=dp.ObsFlags(
         | 
| 9 | 
            +
                    use_html=False,
         | 
| 10 | 
            +
                    use_ax_tree=True,
         | 
| 11 | 
            +
                    use_focused_element=True,
         | 
| 12 | 
            +
                    use_error_logs=True,
         | 
| 13 | 
            +
                    use_history=True,
         | 
| 14 | 
            +
                    use_past_error_logs=False,
         | 
| 15 | 
            +
                    use_action_history=True,
         | 
| 16 | 
            +
                    use_think_history=True,
         | 
| 17 | 
            +
                    use_diff=False,
         | 
| 18 | 
            +
                    html_type="pruned_html",
         | 
| 19 | 
            +
                    use_screenshot=False,
         | 
| 20 | 
            +
                    use_som=False,
         | 
| 21 | 
            +
                    extract_visible_tag=True,
         | 
| 22 | 
            +
                    extract_clickable_tag=True,
         | 
| 23 | 
            +
                    extract_coords="False",
         | 
| 24 | 
            +
                    filter_visible_elements_only=False,
         | 
| 25 | 
            +
                ),
         | 
| 26 | 
            +
                action=dp.ActionFlags(
         | 
| 27 | 
            +
                    multi_actions=False,
         | 
| 28 | 
            +
                    action_set="bid",
         | 
| 29 | 
            +
                    long_description=False,
         | 
| 30 | 
            +
                    individual_examples=False,
         | 
| 31 | 
            +
                ),
         | 
| 32 | 
            +
                use_plan=False,
         | 
| 33 | 
            +
                use_criticise=False,
         | 
| 34 | 
            +
                use_thinking=True,
         | 
| 35 | 
            +
                use_memory=False,
         | 
| 36 | 
            +
                use_concrete_example=True,
         | 
| 37 | 
            +
                use_abstract_example=True,
         | 
| 38 | 
            +
                use_hints=True,
         | 
| 39 | 
            +
                enable_chat=False,
         | 
| 40 | 
            +
                max_prompt_tokens=40_000,
         | 
| 41 | 
            +
                be_cautious=True,
         | 
| 42 | 
            +
                extra_instructions=None,
         | 
| 43 | 
            +
            )
         | 
| 44 | 
            +
            ```
         | 
    	
        results/GenericAgent-GPT-oss-20b/miniwob.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-20b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "MiniWoB",
         | 
| 6 | 
            +
                "score": 64,
         | 
| 7 | 
            +
                "std_err": 1.9,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-oss-20b/workarena-l1.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-20b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L1",
         | 
| 6 | 
            +
                "score": 38.5,
         | 
| 7 | 
            +
                "std_err": 2.7,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/GenericAgent-GPT-oss-20b/workarena-l2.json
    ADDED
    
    | @@ -0,0 +1,16 @@ | |
| 1 | 
            +
            [
         | 
| 2 | 
            +
              {
         | 
| 3 | 
            +
                "agent_name": "GenericAgent-GPT-oss-20b",
         | 
| 4 | 
            +
                "study_id": "2025-08-07_21-09-16",
         | 
| 5 | 
            +
                "benchmark": "WorkArena-L2",
         | 
| 6 | 
            +
                "score": 2.6,
         | 
| 7 | 
            +
                "std_err": 1.0,
         | 
| 8 | 
            +
                "benchmark_specific": "No",
         | 
| 9 | 
            +
                "benchmark_tuned": "No",
         | 
| 10 | 
            +
                "followed_evaluation_protocol": "Yes",
         | 
| 11 | 
            +
                "reproducible": "Yes",
         | 
| 12 | 
            +
                "comments": "NA",
         | 
| 13 | 
            +
                "original_or_reproduced": "Original",
         | 
| 14 | 
            +
                "date_time": "2025-08-07 21:09:16"
         | 
| 15 | 
            +
              }
         | 
| 16 | 
            +
            ]
         | 
    	
        results/OrbyAgent-Claude-3.5-Sonnet/README.md
    CHANGED
    
    | @@ -5,3 +5,4 @@ This agent is developed by [Orby AI](https://www.orby.ai/). | |
| 5 | 
             
            The agent does not use any benchmark-specific information in the prompts. For WebArena benchmark, we use the original evaluator and task definitions for fair comparison.
         | 
| 6 |  | 
| 7 | 
             
            It uses Claude-3.5-sonnet-20241022 as a backend, with both screenshot and HTML as inputs. More details can be found in our [research blog](https://www.orby.ai/resources/elevating-automation-orby-ais-generic-agent-framework-and-self-adaptive-interface-learning-technique).
         | 
| 8 | 
            +
             | 
