Update src/md.py
src/md.py CHANGED
@@ -20,22 +20,13 @@ Once all subsets weighted averages are achieved, the final RewardBench score is
 We include multiple types of reward models in this evaluation:
 1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
 2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
-3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
+3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed. *Note*: This also includes other models trained with implicit rewards, such as those trained with [KTO](https://arxiv.org/abs/2402.01306).
 4. **Random**: Random choice baseline.
 5. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.

 All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
 Others, such as **Generative Judge**, are coming soon.

-### Model Types
-
-Currently, we evaluate the following model types:
-1. **Sequence Classifiers**: A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
-2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
-3. **DPO**: Models trained with Direct Preference Optimization (DPO) with a reference model being either the base or supervised fine-tuning checkpoint.
-
-Support of DPO models without a reference model is coming soon.
-
 ### Subset Details

 The total number of prompts is 2985, filtered from 5123.
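For readers skimming the diff, a minimal sketch of how the **Sequence Classifier** category works: a HuggingFace `AutoModelForSequenceClassification` reward model is given a prompt-response pair and returns a scalar score. The checkpoint name and the single-logit head below are illustrative assumptions, not the leaderboard's evaluation code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative reward-model checkpoint; the leaderboard evaluates many such models,
# typically in fp16 (fp32 for Starling-7B).
MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for one prompt-response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0, 0].item()  # assumes a single-logit (regression-style) head

# A reward model should rank the chosen answer above the rejected one.
chosen = score("What is the capital of France?", "The capital of France is Paris.")
rejected = score("What is the capital of France?", "France has no capital city.")
print(chosen > rejected)
```

On RewardBench, each prompt comes with a chosen and a rejected response, and a comparison counts as correct when the chosen response receives the higher score.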
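The **DPO** category scores responses with the implicit reward from DPO training, i.e. the scaled log-probability ratio between the policy and its reference model; `-ref-free` drops the reference term, and `-norm` is taken here to mean length normalization (an assumption). The sketch below is illustrative only: the checkpoint pairing and the boundary handling are simplifications, not the leaderboard's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative DPO policy and its SFT reference; zephyr-7b-beta was DPO-trained
# from mistral-7b-sft-beta, which is why they are paired here.
policy = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
reference = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/mistral-7b-sft-beta")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of log-probs the model assigns to the response tokens given the prompt.
    The prompt/response boundary is located by re-tokenizing the prompt, which is
    an approximation that is good enough for a sketch."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..n
    targets = full_ids[:, 1:]                              # the tokens actually observed
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1 :].sum().item()      # keep only response tokens

def dpo_reward(prompt: str, response: str, ref_free: bool = False,
               length_norm: bool = False, beta: float = 1.0) -> float:
    """Implicit DPO reward: beta * (log pi_policy - log pi_ref); beta cancels in rankings."""
    reward = response_logprob(policy, prompt, response)
    if not ref_free:                 # `-ref-free` drops the reference term
        reward -= response_logprob(reference, prompt, response)
    if length_norm:                  # assumed meaning of `-norm`
        reward /= max(1, len(tokenizer(response).input_ids))
    return beta * reward
```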
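The **Generative** category does not produce a scalar at all: an instruction-tuned model is prompted with the question and both candidate answers and asked to pick one, as in MT Bench and AlpacaEval. The template and parsing rule below are hypothetical illustrations of that flow, not the benchmark's actual judge prompt.

```python
# Hypothetical pairwise-judge template in the MT Bench / AlpacaEval style.
JUDGE_TEMPLATE = """You are an impartial judge. Given a user question and two
candidate answers, decide which answer is better.

[Question]
{prompt}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Reply with exactly "A" or "B"."""

def build_judge_prompt(prompt: str, answer_a: str, answer_b: str) -> str:
    """Fill the judge template; chosen/rejected answers should be shuffled
    between the A and B slots to control for position bias."""
    return JUDGE_TEMPLATE.format(prompt=prompt, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(generation: str) -> str:
    """Map the judge's free-form output to a preference ('A' or 'B')."""
    text = generation.strip().upper()
    return "B" if text.startswith("B") else "A"
```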
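Finally, the hunk context mentions that the final RewardBench score is built from per-subset weighted averages. Below is a minimal sketch of one plausible aggregation; weighting subsets by prompt count within a section and taking an unweighted mean over sections are both assumptions inferred from that sentence, not confirmed details, and the subset names are only example keys.

```python
def section_score(subset_accuracy: dict[str, float], subset_size: dict[str, int]) -> float:
    """Weighted average of per-subset accuracies, weighted by subset prompt count (assumed)."""
    total = sum(subset_size.values())
    return sum(subset_accuracy[name] * subset_size[name] for name in subset_accuracy) / total

def final_score(section_scores: dict[str, float]) -> float:
    """Unweighted mean over section scores (assumed final aggregation)."""
    return sum(section_scores.values()) / len(section_scores)

# Example: a section with two subsets of different sizes.
print(section_score({"alpacaeval-easy": 0.95, "mt-bench-hard": 0.70},
                    {"alpacaeval-easy": 100, "mt-bench-hard": 45}))
```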