Commit acd2be7 (verified) by SII-Enigma
Parent: 7900989

Update README.md

Files changed (1): README.md (+68 −32)

README.md CHANGED
@@ -1,32 +1,68 @@
- ---
- tags:
- - qwen2.5
- - rl
- - fine-tuned
- language:
- - zh
- - en
- license: apache-2.0
- base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
- ---
-
- # Qwen2.5-7B-Ins-SFT-AMPO
-
- Teacher Models: 'Qwen3-8B_thinking', 'DeepSeek-R1-Distill-Qwen-7B', 'Qwen3-8B', 'Qwen2.5-Math-7B-Instruct'
- Training Method: AMPO
- Base model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
-
- ## Inference Example
-
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- model_name = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)
-
- inputs = tokenizer("Hello", return_tensors="pt")
- outputs = model.generate(**inputs)
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
- ```
+ ---
+ tags:
+ - qwen2.5
+ - RL
+ - reasoning
+ library_name: transformers
+ pipeline_tag: text-generation
+ license: apache-2.0
+ base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
+ ---
+
+ # Introduction
+
+ **AMPO** is a novel framework that intelligently leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Our two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
+
+ [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.02227) [![Github](https://img.shields.io/badge/AMPO-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/SII-Enigma/AMPO)
+
+ ### Key Highlights
+ - **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency (see the sketch after this list).
+ - **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
+ - **Superior Performance**: Achieves better performance and efficiency than RL or SFT alone.
+
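+ The sketch below illustrates how these two mechanisms could fit together during training. It is a simplified illustration, not the authors' implementation (see the GitHub repository for the real code): the callables `sample_policy`, `policy_perplexity`, `teacher_solvers`, and `is_correct`, and the perplexity-based comprehensibility proxy, are all hypothetical stand-ins.
+
+ ```python
+ from typing import Callable, List
+
+ def build_training_group(
+     sample_policy: Callable[[str], str],             # draws one on-policy rollout
+     policy_perplexity: Callable[[str, str], float],  # perplexity of a solution under the policy
+     teacher_solvers: List[Callable[[str], str]],     # one solver per teacher model
+     is_correct: Callable[[str, str], bool],          # outcome verifier (e.g. a math checker)
+     prompt: str,
+     n_rollouts: int = 8,
+ ) -> List[str]:
+     """Return the rollout group used for the policy update on one prompt."""
+     rollouts = [sample_policy(prompt) for _ in range(n_rollouts)]
+
+     # Adaptive Multi-Guidance Replacement: external guidance is injected only
+     # when every on-policy attempt fails, preserving self-discovery.
+     if any(is_correct(prompt, r) for r in rollouts):
+         return rollouts
+
+     # Comprehension-based Guidance Selection: among correct teacher solutions,
+     # keep the one the policy "comprehends" best, sketched here as the lowest
+     # perplexity under the current policy (our proxy, not the paper's exact score).
+     candidates = [solve(prompt) for solve in teacher_solvers]
+     correct = [s for s in candidates if is_correct(prompt, s)]
+     if not correct:
+         return rollouts  # no usable guidance; train on the failed group as-is
+     best = min(correct, key=lambda s: policy_perplexity(prompt, s))
+
+     # Swap one failed rollout for the selected teacher solution.
+     return rollouts[:-1] + [best]
+ ```
+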
+ ### Multi-Guidance Pool
+
+ Teacher models: `Qwen3-8B_thinking`, `DeepSeek-R1-Distill-Qwen-7B`, `Qwen3-8B`, `Qwen2.5-Math-7B-Instruct`
+
+ ## Inference Example
+
+ Here’s an example of running the AMPO-trained model for inference with vLLM:
+
+ ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_path = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
+
+ question = "Which number is larger, 9.11 or 9.9?"
+
+ # Format the question with the model's chat template.
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ messages = [{"role": "user", "content": question}]
+ chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ # Generate with vLLM; max_tokens is generous to leave room for long reasoning traces.
+ llm = LLM(model=model_path)
+ params = SamplingParams(temperature=0.6, max_tokens=8192)
+ outputs = llm.generate([chat], params)
+ print(outputs[0].outputs[0].text)
+ ```
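+
+ If vLLM is unavailable, the model can also be loaded with standard `transformers` `AutoModelForCausalLM` generation, as in the example this update replaces; vLLM simply provides faster inference.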
+
+ # Acknowledgement
+
+ AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference and [Math-Verify](https://github.com/huggingface/Math-Verify) for math-reasoning evaluation. We thank the open-source community for its code, datasets, and backbones.
+
+ # Citation
+
+ If you find our model, data, or evaluation code useful, please cite our paper:
+
+ ```bibtex
+ @misc{yuan2025teacheradaptivemultiguidancepolicy,
+   title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
+   author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
+   year={2025},
+   eprint={2510.02227},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.02227},
+ }
+ ```