## Download the C-Eval dataset

```bash
mkdir ceval-data
cd ceval-data
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip 
unzip ceval-exam.zip -d ceval-exam
wget -P ceval-exam https://raw.githubusercontent.com/hkust-nlp/ceval/main/subject_mapping.json
```

In [1]:
! ls ceval-exam

dev
subject_mapping.json
test
val


In [2]:
import os, re
import ujson
import torch
import pandas as pd
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation.utils import LogitsProcessorList, InfNanRemoveLogitsProcessor

In [3]:
ceval_dir = './ceval-exam'
result_save_dir = './result'
model_dir = '../model_save/dpo'  # model files live one directory up; use the DPO-tuned model

os.makedirs(result_save_dir, exist_ok=True)

In [4]:
subject_files = os.listdir(f"{ceval_dir}/val")
subjects = [subject.replace('_val.csv', '') for subject in subject_files]

with open(f"{ceval_dir}/subject_mapping.json", 'r', encoding='utf-8') as f:
    subject_mapping = ujson.load(f)

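During the SFT stage, much of the training data with an `input` field was removed from this project's model, and the model was never fine-tuned for question answering. Given a question directly, it explains the keywords in the question instead of answering. The C-Eval test therefore predicts the 'A', 'B', 'C', 'D' tokens.
> However, sometimes, especially in zero-shot testing and with models that have not been instruction-tuned, the model may not understand the instruction well and may not even answer the question. In that case we recommend directly computing the probability that the next predicted token equals "A", "B", "C", or "D", and taking the option with the highest probability as the answer.
> This is a constrained-decoding method, and the official MMLU test code uses it. Note that this probability-based method does not apply to chain-of-thought testing.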

See: [How to test on C-Eval](https://github.com/hkust-nlp/ceval/blob/main/README_zh.md#如何在C-Eval上测试)
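
The restricted-choice scoring described in the quote can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the evaluation code used in this notebook: `pick_choice`, the toy logits, and the token ids 2 to 5 are all hypothetical.

```python
import math

def pick_choice(next_token_logits, choice_ids, choices=("A", "B", "C", "D")):
    """Return the option whose token has the highest next-token probability.

    next_token_logits: one logit per vocabulary entry for the next token.
    choice_ids: vocabulary ids of the tokens "A", "B", "C", "D" (hypothetical here).
    """
    selected = [next_token_logits[i] for i in choice_ids]
    # Softmax over the four candidates only; subtract the max for numerical stability.
    peak = max(selected)
    exps = [math.exp(x - peak) for x in selected]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(choices)), key=lambda k: probs[k])
    return choices[best], probs

# Toy vocabulary of 8 tokens; ids 2..5 stand in for "A".."D".
logits = [0.1, 3.0, 0.5, 1.7, 0.2, -0.4, 2.5, 0.0]
answer, probs = pick_choice(logits, choice_ids=[2, 3, 4, 5])
print(answer)  # B
```

Tokens outside the four candidates are ignored even when their logits are larger, as with ids 1 and 6 in the toy example. Renormalizing over the four candidates gives the same argmax as a full-vocabulary softmax; it only makes the four probabilities sum to 1.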

Evaluation mode: zero-shot (chatbot mode)  
The dev set is intended for few-shot prompting and is not used for now.

In [5]:
def format_prompt(df: pd.Series) -> str:
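    '''
    Format the 'question', 'A', 'B', 'C', 'D' fields of one row into a prompt.
    The Chinese instruction reads: "Please answer the single-choice question;
    answering with the letter A, B, C, or D is enough."
    '''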
    prompt = f"请回答单选题，回答字母A、B、C、D即可。问题：\n{df['question']}\n答案选项：\n"
    for col in ['A', 'B', 'C', 'D']:
        prompt += f"{col}：{df[col]}\n"
    
    return prompt
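
To make the prompt layout concrete, here is a small self-contained sketch that repeats the same formatting logic and applies it to a made-up row. The sample question is invented for illustration and is not taken from C-Eval; a plain `dict` stands in for the `pd.Series` row because only item access is used.

```python
def format_prompt(row) -> str:
    # Same formatting as the notebook cell above; works for any mapping
    # that supports row['question'], row['A'], ..., row['D'].
    prompt = f"请回答单选题，回答字母A、B、C、D即可。问题：\n{row['question']}\n答案选项：\n"
    for col in ['A', 'B', 'C', 'D']:
        prompt += f"{col}：{row[col]}\n"
    return prompt

# Made-up sample row, not taken from the dataset.
sample = {'question': '1 + 1 = ____。', 'A': '1', 'B': '2', 'C': '3', 'D': '4'}
prompt = format_prompt(sample)
print(prompt)
```

The result has seven lines: the instruction, the question, the "答案选项：" (answer options) header, and one line per option.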

In [6]:
subject_mapping['accountant']

['Accountant', '注册会计师', 'Other']

In [7]:
do_test = False
all_eval_items = []
for i, subject_name in tqdm(enumerate(subjects), total=len(subjects)):
    val_file = f"{ceval_dir}/val/{subject_name}_val.csv"
    test_file = f"{ceval_dir}/test/{subject_name}_test.csv"

    val_df = pd.read_csv(test_file) if do_test else pd.read_csv(val_file)
    
    for idx, row in val_df.iterrows():
        question = format_prompt(row)
        answer = row['answer'] if 'answer' in val_df.columns else ''

        item = {
            'subject_en': subject_mapping[subject_name][0],
            'subject_zh': subject_mapping[subject_name][1],
            'category': subject_mapping[subject_name][2],  # one of STEM, Social Science, Humanities, Other
            'question': question,
            'answer': answer,
        }
    
        all_eval_items.append(item)

100%|██████████| 52/52 [00:00<00:00, 617.74it/s]


In [8]:
eval_df = pd.DataFrame(all_eval_items)
eval_df.head(5)

| | subject_en | subject_zh | category | question | answer |
|---|---|---|---|---|---|
| 0 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n下列关于税法基本原则的表述中，不正确的是... | D |
| 1 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n甲公司是国内一家领先的新媒体、通信及移动... | C |
| 2 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n根据我国《印花税暂行条例》的规定，下列各... | D |
| 3 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n税务行政复议的申请人可以在得知税务机关作... | A |
| 4 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n关于战略管理表述错误的是____。\n答... | C |


In [9]:
# Load the model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

generation_config = GenerationConfig()
generation_config.remove_invalid_values = True  # automatically adds InfNanRemoveLogitsProcessor
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.pad_token_id = tokenizer.pad_token_id
# for t5, set decoder_start_token_id = pad_token_id
generation_config.decoder_start_token_id = tokenizer.pad_token_id  
generation_config.max_new_tokens = 16
generation_config.num_beams = 1
generation_config.do_sample = False   # greedy search

choices = ['A', 'B', 'C', 'D']
choices_ids = [tokenizer.convert_tokens_to_ids(c) for c in choices]
choices_ids

[872, 873, 884, 886]
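
These ids are only meaningful if each letter is a single, distinct token in the vocabulary. A hedged sanity check might look like the sketch below. `check_choice_ids` is a hypothetical helper, and it assumes the tokenizer exposes `convert_tokens_to_ids` and `unk_token_id`, as Hugging Face tokenizers do. `ToyTokenizer` is a stand-in so the example runs without a model.

```python
def check_choice_ids(tokenizer, choices=("A", "B", "C", "D")):
    """Map each choice letter to its token id and fail loudly on problems."""
    ids = [tokenizer.convert_tokens_to_ids(c) for c in choices]
    for letter, token_id in zip(choices, ids):
        # Assumption: out-of-vocabulary tokens come back as None or the unknown-token id.
        if token_id is None or token_id == tokenizer.unk_token_id:
            raise ValueError(f"choice {letter!r} is not a token in the vocabulary")
    if len(set(ids)) != len(ids):
        raise ValueError(f"choice ids are not distinct: {ids}")
    return ids

class ToyTokenizer:
    """Stand-in for a real tokenizer, used only to make this sketch runnable."""
    unk_token_id = 0
    vocab = {"A": 872, "B": 873, "C": 884, "D": 886}

    def convert_tokens_to_ids(self, token):
        return self.vocab.get(token, self.unk_token_id)

print(check_choice_ids(ToyTokenizer()))  # [872, 873, 884, 886]
```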

In [10]:
batch_size = 32
batch_data, batch_answers = [], []
n = len(eval_df)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()

for idx, row in tqdm(eval_df.iterrows(), total=n):
    batch_data.append(row['question'])
    
    if len(batch_data) == batch_size or idx == n - 1:
        torch.cuda.empty_cache()
        
        encode_ids = tokenizer(batch_data, padding=True)
        input_ids, attention_mask = torch.LongTensor(encode_ids['input_ids']), torch.LongTensor(encode_ids['attention_mask'])
        
        outputs = model.generate(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
        )

        scores = torch.stack(outputs['scores'], dim=1)
        scores = torch.softmax(scores, dim=2)
        scores = scores[..., 0, choices_ids]  # probabilities of A, B, C, D at the first generated token
        choices_index = torch.argmax(scores, dim=1)
        
        for i in choices_index.tolist():
            batch_answers.append(choices[i])

        batch_data = []

100%|██████████| 1346/1346 [00:20<00:00, 64.11it/s]
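
The flush condition in the loop above (`len(batch_data) == batch_size or idx == n - 1`) relies on `eval_df` having the default 0 to n-1 index. As an alternative, batching can be done by slicing, which needs no index assumption. This is a minimal sketch: `iter_batches` is a hypothetical helper and the prompts are placeholders.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

questions = [f"q{i}" for i in range(1346)]  # placeholder prompts, same count as the run above
batches = list(iter_batches(questions, 32))
print(len(batches), len(batches[-1]))  # 43 2
```

With 1346 questions and a batch size of 32, this gives 42 full batches plus a final batch of 2.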


In [11]:
eval_df.insert(loc=5, column='model_predict', value=batch_answers)
val_df = eval_df.copy(deep=True)

In [12]:
val_df['is_correct'] = val_df['model_predict'] == val_df['answer']
val_df['is_correct'] = val_df['is_correct'].astype(pd.Int16Dtype())

In [13]:
val_df.head(3)

| | subject_en | subject_zh | category | question | answer | model_predict | is_correct |
|---|---|---|---|---|---|---|---|
| 0 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n下列关于税法基本原则的表述中，不正确的是... | D | A | 0 |
| 1 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n甲公司是国内一家领先的新媒体、通信及移动... | C | A | 0 |
| 2 | Accountant | 注册会计师 | Other | 请回答单选题，回答字母A、B、C、D即可。问题：\n根据我国《印花税暂行条例》的规定，下列各... | D | A | 0 |


In [14]:
final_df = val_df.groupby('category')[['is_correct']].sum()
final_df

| category | is_correct |
|---|---|
| Humanities | 63 |
| Other | 89 |
| STEM | 89 |
| Social Science | 72 |


In [15]:
final_df['question_count'] = val_df.groupby('category')['question'].count()
final_df['accuracy'] = final_df['is_correct'] / final_df['question_count']
final_df['accuracy'] = final_df['accuracy'].apply(lambda x: format(x, '.2%'))
final_df

| category | is_correct | question_count | accuracy |
|---|---|---|---|
| Humanities | 63 | 257 | 24.51% |
| Other | 89 | 384 | 23.18% |
| STEM | 89 | 430 | 20.70% |
| Social Science | 72 | 275 | 26.18% |
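
The per-category counts can be combined into an overall score. The sketch below copies the numbers from the table above and computes a micro average (every question weighted equally) and a macro average (every category weighted equally). The variable names are illustrative only. For reference, uniform random guessing over four options would score 25%.

```python
# (correct, total) per category, copied from the table above.
results = {
    "Humanities": (63, 257),
    "Other": (89, 384),
    "STEM": (89, 430),
    "Social Science": (72, 275),
}

total_correct = sum(c for c, _ in results.values())
total_questions = sum(n for _, n in results.values())

# Micro average: every question counts equally.
micro = total_correct / total_questions
# Macro average: every category counts equally.
macro = sum(c / n for c, n in results.values()) / len(results)

print(total_correct, total_questions)  # 313 1346
print(format(micro, ".2%"))  # 23.25%
print(format(macro, ".2%"))  # 23.64%
```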
