Commit acd2be7 (verified) by SII-Enigma
Parent: 7900989

Update README.md

Files changed (1): README.md (+68 −32)

README.md CHANGED
@@ -1,32 +1,68 @@
- ---
- tags:
- - qwen2.5
- - rl
- - fine-tuned
- language:
- - zh
- - en
- license: apache-2.0
- base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
- ---
-
- # Qwen2.5-7B-Ins-SFT-AMPO
-
- Teacher Models: 'Qwen3-8B_thinking', 'DeepSeek-R1-Distill-Qwen-7B', 'Qwen3-8B', 'Qwen2.5-Math-7B-Instruct'
- Training Method: AMPO
- Base model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
-
- ## Inference Example
-
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- model_name = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)
-
- inputs = tokenizer("Hello", return_tensors="pt")
- outputs = model.generate(**inputs)
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
- ```
+ ---
+ tags:
+ - qwen2.5
+ - RL
+ - reasoning
+ library_name: transformers
+ pipeline_tag: text-generation
+ license: apache-2.0
+ base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
+ ---
+
+ # Introduction
+
+ **AMPO** is a novel framework that intelligently leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Our two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
+
+ [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.02227) [![Github](https://img.shields.io/badge/AMPO-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/SII-Enigma/AMPO)
+
+ ### Key Highlights
+ - **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency (see the sketch after this list).
+ - **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
+ - **Superior Performance**: Achieves better performance and efficiency than RL or SFT alone.
+
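+ The sketch below illustrates how these two mechanisms could fit together during training. It is a simplified illustration, not the authors' implementation (see the GitHub repository for the real code): the callables `sample_policy`, `policy_perplexity`, `teacher_solvers`, and `is_correct`, and the perplexity-based comprehensibility proxy, are all hypothetical stand-ins.
+
+ ```python
+ from typing import Callable, List
+
+ def build_training_group(
+     sample_policy: Callable[[str], str],             # draws one on-policy rollout
+     policy_perplexity: Callable[[str, str], float],  # perplexity of a solution under the policy
+     teacher_solvers: List[Callable[[str], str]],     # one solver per teacher model
+     is_correct: Callable[[str, str], bool],          # outcome verifier (e.g. a math checker)
+     prompt: str,
+     n_rollouts: int = 8,
+ ) -> List[str]:
+     """Return the rollout group used for the policy update on one prompt."""
+     rollouts = [sample_policy(prompt) for _ in range(n_rollouts)]
+
+     # Adaptive Multi-Guidance Replacement: external guidance is injected only
+     # when every on-policy attempt fails, preserving self-discovery.
+     if any(is_correct(prompt, r) for r in rollouts):
+         return rollouts
+
+     # Comprehension-based Guidance Selection: among correct teacher solutions,
+     # keep the one the policy "comprehends" best, sketched here as the lowest
+     # perplexity under the current policy (our proxy, not the paper's exact score).
+     candidates = [solve(prompt) for solve in teacher_solvers]
+     correct = [s for s in candidates if is_correct(prompt, s)]
+     if not correct:
+         return rollouts  # no usable guidance; train on the failed group as-is
+     best = min(correct, key=lambda s: policy_perplexity(prompt, s))
+
+     # Swap one failed rollout for the selected teacher solution.
+     return rollouts[:-1] + [best]
+ ```
+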
+ ### Multi-Guidance Pool
+
+ Teacher models: `Qwen3-8B_thinking`, `DeepSeek-R1-Distill-Qwen-7B`, `Qwen3-8B`, `Qwen2.5-Math-7B-Instruct`
+
+ ## Inference Example
+
+ Here’s an example of running the AMPO-trained model for inference with vLLM:
+
+ ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_path = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
+
+ question = "Which number is larger, 9.11 or 9.9?"
+
+ # Format the question with the model's chat template.
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ messages = [{"role": "user", "content": question}]
+ chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ # Generate with vLLM; max_tokens is generous to leave room for long reasoning traces.
+ llm = LLM(model=model_path)
+ params = SamplingParams(temperature=0.6, max_tokens=8192)
+ outputs = llm.generate([chat], params)
+ print(outputs[0].outputs[0].text)
+ ```
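+
+ If vLLM is unavailable, the model can also be loaded with standard `transformers` `AutoModelForCausalLM` generation, as in the example this update replaces; vLLM simply provides faster inference.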
+
+ # Acknowledgement
+
+ AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference and [Math-Verify](https://github.com/huggingface/Math-Verify) for math-reasoning evaluation. We thank the open-source community for its code, datasets, and backbones.
+
+ # Citation
+
+ If you find our model, data, or evaluation code useful, please cite our paper:
+
+ ```bibtex
+ @misc{yuan2025teacheradaptivemultiguidancepolicy,
+   title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
+   author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
+   year={2025},
+   eprint={2510.02227},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.02227},
+ }
+ ```