Update src/md.py
src/md.py CHANGED
@@ -25,7 +25,10 @@ We include multiple types of reward models in this evaluation:
4. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.

All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
-
+ *Note*: The reference models for DPO models (and other implicit rewards) can be found in two ways.
+ * Click on a specific model in results and you'll see a key `ref_model`, e.g. [Qwen](https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set/Qwen/Qwen1.5-72B-Chat.json).
+ * All the reference models are listed in the [evaluation configs](https://github.com/allenai/reward-bench/blob/main/scripts/configs/eval_configs.yaml).
+
|
### Subset Details
|
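For readers who prefer to script the `ref_model` lookup described in the note above rather than clicking through the results viewer, a minimal sketch along these lines should work. It assumes `huggingface_hub` is installed and that other result files follow the same `eval-set/<org>/<model>.json` layout as the linked Qwen example; only DPO-style (implicit-reward) entries are expected to carry the `ref_model` key.

```python
# Minimal sketch: read the `ref_model` key from a reward-bench results file.
# Assumes the eval-set/<org>/<model>.json layout shown in the linked example.
import json

from huggingface_hub import hf_hub_download

result_path = hf_hub_download(
    repo_id="allenai/reward-bench-results",
    repo_type="dataset",
    filename="eval-set/Qwen/Qwen1.5-72B-Chat.json",
)

with open(result_path) as f:
    result = json.load(f)

# Only DPO / implicit-reward models are expected to record a reference model.
print(result.get("ref_model"))
```

For a full listing instead of a single model, parsing `scripts/configs/eval_configs.yaml` from the repository (the second option above) avoids downloading individual result files.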