Update README.md
    	
        README.md
    CHANGED
    
    | @@ -1,32 +1,68 @@ | |
| 1 | 
            -
            ---
         | 
| 2 | 
            -
            tags:
         | 
| 3 | 
            -
            - qwen2.5
         | 
| 4 | 
            -
            -  | 
| 5 | 
            -
            -  | 
| 6 | 
            -
             | 
| 7 | 
            -
            - | 
| 8 | 
            -
            - | 
| 9 | 
            -
             | 
| 10 | 
            -
             | 
| 11 | 
            -
             | 
| 12 | 
            -
             | 
| 13 | 
            -
             | 
| 14 | 
            -
             | 
| 15 | 
            -
             | 
| 16 | 
            -
             | 
| 17 | 
            -
             | 
| 18 | 
            -
             | 
| 19 | 
            -
             | 
| 20 | 
            -
             | 
| 21 | 
            -
             | 
| 22 | 
            -
             | 
| 23 | 
            -
             | 
| 24 | 
            -
             | 
| 25 | 
            -
             | 
| 26 | 
            -
             | 
| 27 | 
            -
             | 
| 28 | 
            -
             | 
| 29 | 
            -
             | 
| 30 | 
            -
             | 
| 31 | 
            -
             | 
| 32 | 
            -
            ```
         | 
| 1 | 
            +
            ---
         | 
| 2 | 
            +
            tags:
         | 
| 3 | 
            +
            - qwen2.5
         | 
| 4 | 
            +
            - RL
         | 
| 5 | 
            +
            - reasoning
         | 
| 6 | 
            +
            library_name: transformers
         | 
| 7 | 
            +
            pipeline_tag: text-generation
         | 
| 8 | 
            +
            license: apache-2.0
         | 
| 9 | 
            +
            base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
         | 
| 10 | 
            +
            ---
         | 
| 11 | 
            +
             | 
| 12 | 
            +
            # Introduction
         | 
| 13 | 
            +
             | 
| 14 | 
            +
            **AMPO** is a novel framework that leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Our two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
         | 
| 15 | 
            +
             | 
| 16 | 
            +
            [Paper](https://arxiv.org/abs/2510.02227) [Code](https://github.com/SII-Enigma/AMPO)
         | 
| 17 | 
            +
             | 
| 18 | 
            +
            ### Key Highlights:
         | 
| 19 | 
            +
            - **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency.
         | 
| 20 | 
            +
            - **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
         | 
| 21 | 
            +
            - **Superior Performance**: Achieves better performance and efficiency than using RL or SFT alone.
         | 
| 22 | 
            +
             | 
| 23 | 
            +
             | 
| 24 | 
            +
            ### Multi-Guidance Pool
         | 
| 25 | 
            +
             | 
| 26 | 
            +
            Teacher Models: `Qwen3-8B_thinking`, `DeepSeek-R1-Distill-Qwen-7B`, `Qwen3-8B`, `Qwen2.5-Math-7B-Instruct`
         | 
| 27 | 
            +
             | 
| 28 | 
            +
            ## Inference Example
         | 
| 29 | 
            +
             | 
| 30 | 
            +
            Here’s an example of using AMPO for inference:
         | 
| 31 | 
            +
             | 
| 32 | 
            +
            ```python
         | 
| 33 | 
            +
            from transformers import AutoTokenizer
         | 
| 34 | 
            +
            from vllm import LLM, SamplingParams
         | 
| 35 | 
            +
             | 
| 36 | 
            +
            model_path = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
         | 
| 37 | 
            +
             | 
| 38 | 
            +
            question = "which number is larger? 9.11 or 9.9?"
         | 
| 39 | 
            +
             | 
| 40 | 
            +
            tokenizer = AutoTokenizer.from_pretrained(model_path)
         | 
| 41 | 
            +
            messages = [{"role": "user", "content": question}]
         | 
| 42 | 
            +
            chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
         | 
| 43 | 
            +
             | 
| 44 | 
            +
            llm = LLM(model=model_path)
         | 
| 45 | 
            +
            params = SamplingParams(temperature=0.6, max_tokens=8192)
         | 
| 46 | 
            +
            outputs = llm.generate([chat], params)
         | 
| 47 | 
            +
            print(outputs[0].outputs[0].text)
         | 
| 48 | 
            +
            ```
         | 
| 49 | 
            +
             | 
| 50 | 
            +
            # Acknowledgement
         | 
| 51 | 
            +
             | 
| 52 | 
            +
            AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) for math reasoning evaluation. We thank the open-source community for code, datasets, and backbones.
         | 
| 53 | 
            +
             | 
| 54 | 
            +
             | 
| 55 | 
            +
             | 
| 56 | 
            +
            # Citation
         | 
| 57 | 
            +
            If you find our model, data, or evaluation code useful, please cite our paper:
         | 
| 58 | 
            +
            ```bib
         | 
| 59 | 
            +
            @misc{yuan2025teacheradaptivemultiguidancepolicy,
         | 
| 60 | 
            +
                  title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration}, 
         | 
| 61 | 
            +
                  author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
         | 
| 62 | 
            +
                  year={2025},
         | 
| 63 | 
            +
                  eprint={2510.02227},
         | 
| 64 | 
            +
                  archivePrefix={arXiv},
         | 
| 65 | 
            +
                  primaryClass={cs.CL},
         | 
| 66 | 
            +
                  url={https://arxiv.org/abs/2510.02227}, 
         | 
| 67 | 
            +
            }
         | 
| 68 | 
            +
            ```
         |
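
The two highlights in the README (guidance only upon complete on-policy failure, and preferring the most comprehensible external solution) can be illustrated with a small, self-contained sketch. This is not the AMPO implementation. The `Rollout` class, the `replace_on_failure` function, the binary reward, the `k` parameter, and the externally supplied comprehension scores are all hypothetical names and assumptions introduced here for illustration. How AMPO actually scores comprehension and how many responses it replaces is defined in the paper and repository, not here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Rollout:
    text: str
    reward: float             # assumed binary: 1.0 if verified correct, else 0.0
    off_policy: bool = False  # True when the response comes from a teacher model


def replace_on_failure(
    on_policy: Sequence[Rollout],
    guidance: Sequence[Rollout],
    scores: Sequence[float],
    k: int = 1,
) -> List[Rollout]:
    """Return the rollout group to train on (illustrative only).

    If any on-policy rollout earned a positive reward, the group is returned
    unchanged. Only when every rollout failed are the last k rollouts swapped
    for the k teacher responses with the highest (hypothetical) score.
    """
    if any(r.reward > 0 for r in on_policy):
        return list(on_policy)  # the policy succeeded on its own: no intervention

    # Never replace more rollouts than exist, or use more guidance than is available.
    k = max(0, min(k, len(guidance), len(on_policy)))
    if k == 0:
        return list(on_policy)

    # Rank teacher responses by score, highest first.
    ranked = sorted(zip(scores, guidance), key=lambda pair: pair[0], reverse=True)
    chosen = [Rollout(g.text, g.reward, off_policy=True) for _, g in ranked[:k]]
    return list(on_policy[: len(on_policy) - k]) + chosen


# Usage: every on-policy rollout failed, so one teacher response is swapped in.
group = [Rollout("a", 0.0), Rollout("b", 0.0), Rollout("c", 0.0)]
teachers = [Rollout("t1", 1.0), Rollout("t2", 1.0)]
mixed = replace_on_failure(group, teachers, scores=[0.2, 0.9], k=1)
print([r.text for r in mixed])
```

Keeping the group size fixed and marking replaced entries with `off_policy` is a design choice of this sketch, made so that a downstream loss could treat teacher responses differently from the policy's own samples.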