Update usage to be specific to ORT+DML #2
by pavignol2 - opened

README.md CHANGED
````diff
@@ -49,96 +49,29 @@ The ONNX model above was processed with the [Olive](https://github.com/microsoft
 [EleutherAI’s](https://www.eleuther.ai/) [Pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b) and fine-tuned
 on a [~15K record instruction corpus](https://github.com/databrickslabs/dolly/tree/master/data) generated by Databricks employees and released under a permissive license (CC-BY-SA)

-
-
-To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` and `accelerate` libraries installed.
-In a Databricks notebook you could run:
-
-```python
-%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
-```
+`dolly-v2-7b-olive-optimized` is an optimized ONNX model of `dolly-v2-7b` generated by [Olive](https://github.com/microsoft/Olive) that is meant to be used with ONNX Runtime and DirectML.

-
-found in the model repo [here](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py), which is why `trust_remote_code=True` is required.
-Including `torch_dtype=torch.bfloat16` is generally recommended if this type is supported in order to reduce memory usage.  It does not appear to impact output quality.
-It is also fine to remove it if there is sufficient memory.
-
-```python
-import torch
-from transformers import pipeline
-
-generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
-```
+## Usage

-
+To use the model with the `transformers` library on a machine with ONNX Runtime and DirectML, first make sure you have the `transformers`, `accelerate`, `optimum`, `onnxruntime-directml` and `onnx` libraries installed:

 ```python
-
-print(res[0]["generated_text"])
+pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2" "optimum>=1.8.8,<2" "onnxruntime-directml>=1.15.1,<2" "onnx>=1.14.0<2"
 ```

-
-store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
+You can then download [instruct_pipeline.py](https://huggingface.co/microsoft/dolly-v2-7b-olive-optimized/raw/main/instruct_pipeline.py) and construct the pipeline from the loaded model and tokenizer:

 ```python
-import torch
+from transformers import AutoTokenizer, TextStreamer
+from optimum.onnxruntime import ORTModelForCausalLM
 from instruct_pipeline import InstructionTextGenerationPipeline
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
-model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16)
-
-generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
-```
-
-### LangChain Usage
-
-To use the pipeline with LangChain, you must set `return_full_text=True`, as LangChain expects the full text to be returned
-and the default for the pipeline is to only return the new text.
-
-```python
-import torch
-from transformers import pipeline

-
-
-```
-
-You can create a prompt that either has only an instruction or has an instruction with context:
-
-```python
-from langchain import PromptTemplate, LLMChain
-from langchain.llms import HuggingFacePipeline
-
-# template for an instrution with no input
-prompt = PromptTemplate(
-    input_variables=["instruction"],
-    template="{instruction}")
-
-# template for an instruction with input
-prompt_with_context = PromptTemplate(
-    input_variables=["instruction", "context"],
-    template="{instruction}\n\nInput:\n{context}")
-
-hf_pipeline = HuggingFacePipeline(pipeline=generate_text)
-
-llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
-llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
-```
-
-Example predicting using a simple instruction:
-
-```python
-print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
-```
-
-Example predicting using an instruction with context:
-
-```python
-context = """George Washington (February 22, 1732[b] – December 14, 1799) was an American military officer, statesman,
-and Founding Father who served as the first president of the United States from 1789 to 1797."""
+tokenizer = AutoTokenizer.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", padding_side="left")
+model = ORTModelForCausalLM.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", provider="DmlExecutionProvider", use_cache=True, use_merged=True, use_io_binding=False)

-
+streamer = TextStreamer(tokenizer, skip_prompt=True)
+generate_text = InstructionTextGenerationPipeline(model=model, streamer=streamer, tokenizer=tokenizer, max_new_tokens=128)
+generate_text("Explain to me the difference between nuclear fission and fusion.")
 ```

````
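Before loading the model, it can be worth confirming that ONNX Runtime actually exposes the DirectML execution provider the new instructions select. This is a minimal check, not part of the diff, assuming `onnxruntime-directml` (rather than the stock `onnxruntime` package) is installed:

```python
# Sketch (not part of this change): verify the DirectML execution provider is available
# before calling ORTModelForCausalLM.from_pretrained(..., provider="DmlExecutionProvider").
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)

if "DmlExecutionProvider" not in providers:
    raise RuntimeError("DmlExecutionProvider not found; install the onnxruntime-directml package")
```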

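The updated example streams tokens to stdout through `TextStreamer`. The pipeline's return value can also be captured as a string, the way the previous README showed; this is a sketch and assumes `InstructionTextGenerationPipeline` still returns a list of records with a `generated_text` field:

```python
# Sketch (not part of this change): capture the generated text in addition to streaming it.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```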