Update usage to be specific to ORT+DML #2
by pavignol2 - opened

README.md CHANGED
````diff
@@ -49,96 +49,29 @@ The ONNX model above was processed with the [Olive](https://github.com/microsoft
 [EleutherAI’s](https://www.eleuther.ai/) [Pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b) and fine-tuned
 on a [~15K record instruction corpus](https://github.com/databrickslabs/dolly/tree/master/data) generated by Databricks employees and released under a permissive license (CC-BY-SA)

-
-
-To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` and `accelerate` libraries installed.
-In a Databricks notebook you could run:
-
-```python
-%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
-```
+`dolly-v2-7b-olive-optimized` is an optimized ONNX model of `dolly-v2-7b` generated by [Olive](https://github.com/microsoft/Olive) that is meant to be used with ONNX Runtime and DirectML.

-
-found in the model repo [here](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py), which is why `trust_remote_code=True` is required.
-Including `torch_dtype=torch.bfloat16` is generally recommended if this type is supported in order to reduce memory usage.  It does not appear to impact output quality.
-It is also fine to remove it if there is sufficient memory.
-
-```python
-import torch
-from transformers import pipeline
-
-generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
-```
+## Usage

-
+To use the model with the `transformers` library on a machine with ONNX Runtime and DirectML, first make sure you have the `transformers`, `accelerate`, `optimum`, `onnxruntime-directml` and `onnx` libraries installed:

 ```python
-
-print(res[0]["generated_text"])
+pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2" "optimum>=1.8.8,<2" "onnxruntime-directml>=1.15.1,<2" "onnx>=1.14.0<2"
 ```

-
-store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
+You can then download [instruct_pipeline.py](https://huggingface.co/microsoft/dolly-v2-7b-olive-optimized/raw/main/instruct_pipeline.py) and construct the pipeline from the loaded model and tokenizer:

 ```python
-import torch
+from transformers import AutoTokenizer, TextStreamer
+from optimum.onnxruntime import ORTModelForCausalLM
 from instruct_pipeline import InstructionTextGenerationPipeline
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
-model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16)
-
-generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
-```
-
-### LangChain Usage
-
-To use the pipeline with LangChain, you must set `return_full_text=True`, as LangChain expects the full text to be returned
-and the default for the pipeline is to only return the new text.
-
-```python
-import torch
-from transformers import pipeline

-
-
-```
-
-You can create a prompt that either has only an instruction or has an instruction with context:
-
-```python
-from langchain import PromptTemplate, LLMChain
-from langchain.llms import HuggingFacePipeline
-
-# template for an instrution with no input
-prompt = PromptTemplate(
-    input_variables=["instruction"],
-    template="{instruction}")
-
-# template for an instruction with input
-prompt_with_context = PromptTemplate(
-    input_variables=["instruction", "context"],
-    template="{instruction}\n\nInput:\n{context}")
-
-hf_pipeline = HuggingFacePipeline(pipeline=generate_text)
-
-llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
-llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
-```
-
-Example predicting using a simple instruction:
-
-```python
-print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
-```
-
-Example predicting using an instruction with context:
-
-```python
-context = """George Washington (February 22, 1732[b] – December 14, 1799) was an American military officer, statesman,
-and Founding Father who served as the first president of the United States from 1789 to 1797."""
+tokenizer = AutoTokenizer.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", padding_side="left")
+model = ORTModelForCausalLM.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", provider="DmlExecutionProvider", use_cache=True, use_merged=True, use_io_binding=False)

-
+streamer = TextStreamer(tokenizer, skip_prompt=True)
+generate_text = InstructionTextGenerationPipeline(model=model, streamer=streamer, tokenizer=tokenizer, max_new_tokens=128)
+generate_text("Explain to me the difference between nuclear fission and fusion.")
 ```

````
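Before loading the model, it can be worth confirming that ONNX Runtime actually exposes the DirectML execution provider the new instructions select. This is a minimal check, not part of the diff, assuming `onnxruntime-directml` (rather than the stock `onnxruntime` package) is installed:

```python
# Sketch (not part of this change): verify the DirectML execution provider is available
# before calling ORTModelForCausalLM.from_pretrained(..., provider="DmlExecutionProvider").
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)

if "DmlExecutionProvider" not in providers:
    raise RuntimeError("DmlExecutionProvider not found; install the onnxruntime-directml package")
```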

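The updated example streams tokens to stdout through `TextStreamer`. The pipeline's return value can also be captured as a string, the way the previous README showed; this is a sketch and assumes `InstructionTextGenerationPipeline` still returns a list of records with a `generated_text` field:

```python
# Sketch (not part of this change): capture the generated text in addition to streaming it.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```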