## <a id="models"></a> Usage

To use these models, we highly recommend installing the OpenChat package by following the [installation guide](https://github.com/imoneoi/openchat/#installation) and using the OpenChat OpenAI-compatible API server by running the serving command from the table below. The server is optimized for high-throughput deployment using [vLLM](https://github.com/vllm-project/vllm) and can run on a GPU with at least 48GB RAM or two consumer GPUs with tensor parallelism. To enable tensor parallelism, append `--tensor-parallel-size 2` to the serving command.
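
For example, combining the OpenChat 3.2 serving command from the table below with tensor parallelism across two GPUs might look like the following sketch (adjust the model flags to the model you are serving):

```shell
# Serve OpenChat 3.2 on two consumer GPUs: this is the serving command
# from the table below with --tensor-parallel-size 2 appended.
python -m ochat.serving.openai_api_server \
    --model-type openchat_v3.2 \
    --model openchat/openchat_v3.2 \
    --engine-use-ray --worker-use-ray \
    --max-num-batched-tokens 5120 \
    --tensor-parallel-size 2
```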
When started, the server listens at `localhost:18888` for requests and is compatible with the [OpenAI ChatCompletion API specifications](https://platform.openai.com/docs/api-reference/chat). See the example request below for reference. Additionally, you can access the [OpenChat Web UI](#web-ui) for a user-friendly experience.
To deploy the server as an online service, use `--api-keys sk-KEY1 sk-KEY2 ...` to specify allowed API keys and `--disable-log-requests --disable-log-stats --log-file openchat.log` for logging only to a file. We recommend using a [HTTPS gateway](https://fastapi.tiangolo.com/es/deployment/concepts/#security-https) in front of the server for security purposes.
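
Putting these flags together, a hardened online deployment of OpenChat 3.2 could be sketched as follows (the API keys and log path are placeholders to replace with your own):

```shell
# Example online deployment: only the two listed API keys are accepted,
# and request/stat logs go to openchat.log instead of stdout.
python -m ochat.serving.openai_api_server \
    --model-type openchat_v3.2 \
    --model openchat/openchat_v3.2 \
    --engine-use-ray --worker-use-ray \
    --max-num-batched-tokens 5120 \
    --api-keys sk-KEY1 sk-KEY2 \
    --disable-log-requests --disable-log-stats \
    --log-file openchat.log
```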
<details>
<summary>Example request (click to expand)</summary>

```
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openchat_v3.2",
    "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
  }'
```

</details>
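
Because responses follow the OpenAI ChatCompletion schema, the assistant's reply sits in `choices[0].message.content`; assuming the server is running locally and `jq` is installed, it can be extracted like this:

```shell
# Send a request and print only the assistant's reply text
# (response fields follow the OpenAI ChatCompletion specification).
curl -s http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openchat_v3.2", "messages": [{"role": "user", "content": "Say hello"}]}' \
  | jq -r '.choices[0].message.content'
```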
| Model | Size | Context | Weights | Serving |
|---------------|------|---------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| OpenChat 3.1 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.1) | `python -m ochat.serving.openai_api_server --model-type openchat_v3.1_llama2 --model openchat/openchat_v3.1 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
| OpenChat 3.2 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.2) | `python -m ochat.serving.openai_api_server --model-type openchat_v3.2 --model openchat/openchat_v3.2 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
For inference with Huggingface Transformers (slow and not recommended), follow the conversation template provided below:
<details>
<summary>Conversation templates (click to expand)</summary>