	Update src/md.py
src/md.py CHANGED
@@ -6,22 +6,27 @@ A win is when the score for the chosen response is higher than the score for the rejected response.
 
 | Subset                 | Num. Samples (Pre-filtering, post-filtering) | Description                                                       |
 | :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
-| alpacaeval-easy        |                      805                     | Great model vs poor model                                         |
-| alpacaeval-length      |                      805                     | Good model vs low model, equal length                             |
-| alpacaeval-hard        |                      805                     | Great model vs baseline model                                     |
+| alpacaeval-easy        |                    805, 100                  | Great model vs poor model                                         |
+| alpacaeval-length      |                    805, 95                   | Good model vs low model, equal length                             |
+| alpacaeval-hard        |                    805, 95                   | Great model vs baseline model                                     |
 | mt-bench-easy          |                    28, 28                    | MT Bench 10s vs 1s                                                |
 | mt-bench-medium        |                    45, 40                    | MT Bench 9s vs 2-5s                                               |
 | mt-bench-hard          |                    45, 37                    | MT Bench 7-8 vs 5-6                                               |
-| refusals-dangerous     |                      505                     | Dangerous response vs no response                                 |
-| refusals-offensive     |                      704                     | Offensive response vs no response                                 |
+| refusals-dangerous     |                    505, 100                  | Dangerous response vs no response                                 |
+| refusals-offensive     |                    704, 100                  | Offensive response vs no response                                 |
 | llmbar-natural         |                      100                     | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
 | llmbar-adver-neighbor  |                      134                     | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
 | llmbar-adver-GPTInst   |                      92                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
 | llmbar-adver-GPTOut    |                      47                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
 | llmbar-adver-manual    |                      46                      | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
-| XSTest |
-|  |
-| (
+| XSTest                 |                    450, 404                  | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
+| do not answer          |                    939, 136                  | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
+| hep-cpp                |                      164                     | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
+| hep-go                 |                      164                     | Go code                                                           |
+| hep-java               |                      164                     | Java code                                                         |
+| hep-js                 |                      164                     | Javascript code                                                   |
+| hep-python             |                      164                     | Python code                                                       |
+| hep-rust               |                      164                     | Rust code                                                         |
 
 For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
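
To double-check the post-filtering counts in the table above, a small script along the following lines could tally samples per subset with the `datasets` library. This is a minimal sketch: the `filtered` split name and the `subset` column are assumptions rather than details confirmed by the dataset card, so adjust them to the schema the dataset actually exposes.

```python
from collections import Counter

from datasets import load_dataset

# Minimal sketch: count samples per subset of the benchmark.
# NOTE: the split name "filtered" and the "subset" column are assumptions;
# check the dataset card for the actual split names and schema.
ds = load_dataset("ai2-rlhf-collab/rm-benchmark-dev", split="filtered")
counts = Counter(ds["subset"])

for subset, n in sorted(counts.items()):
    print(f"{subset:<24} {n}")
```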

