# Linux

These instructions are for Ubuntu x86_64 (other Linux distributions are similar, with a different package manager in place of apt-get).

## Install:
* First, one needs a Python 3.10 environment. We recommend using Miniconda.
Download [MiniConda for Linux](https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh). After downloading, run:
```bash
bash ./Miniconda3-py310_23.1.0-1-Linux-x86_64.sh
# follow license agreement and add to bash if required
```
Enter a new shell; you should see `(base)` in the prompt. Then create a new environment:
```bash
conda create -n h2ogpt -y
conda activate h2ogpt
conda install python=3.10 -c conda-forge -y
```
You should see `(h2ogpt)` in the shell prompt.
Alternatively, on newer Ubuntu systems you can set up a Python 3.10 environment by doing:
```bash
sudo apt-get install -y build-essential gcc python3.10-dev virtualenv
virtualenv -p python3 h2ogpt
source h2ogpt/bin/activate
```
* Test your Python:
```bash
python --version
```
which should say 3.10.xx, and:
```bash
python -c "import os, sys ; print('hello world')"
```
should print `hello world`. Then clone:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
```
On some systems, `pip` still refers to the system installation; in that case use `python -m pip` or `pip3` instead of `pip`, or try `python3` instead of `python`.
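To confirm which interpreter and `pip` are actually being used, a quick check (both should point inside your conda or virtualenv prefix, not `/usr`):
```bash
which python
python -m pip --version   # shows which Python this pip is bound to
```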
* For GPU: Install the CUDA Toolkit with the ability to compile using `nvcc`, needed for some packages like llama-cpp-python, AutoGPTQ, exllama, and flash attention:
```bash
conda install cudatoolkit-dev -c conda-forge -y
export CUDA_HOME=$CONDA_PREFIX
```
which gives CUDA 11.7. If you prefer, follow [CUDA Toolkit](INSTALL.md#installing-cuda-toolkit) instead, then do:
```bash
export CUDA_HOME=/usr/local/cuda-11.7
```
If you do not plan to use any of those packages, you can just use the non-dev version:
```bash
conda install cudatoolkit=11.7 -c conda-forge -y
export CUDA_HOME=$CONDA_PREFIX
```
* Install dependencies:
```bash
# fix any bad env
pip uninstall -y pandoc pypandoc pypandoc-binary
# broad support, but no training-time or data creation dependencies
# CPU only:
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
# GPU only:
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
```
* Install document question-answer dependencies:
```bash
# May be required for jq package:
sudo apt-get -y install autoconf libtool
# Required for Doc Q/A: LangChain:
pip install -r reqs_optional/requirements_optional_langchain.txt
# Required for CPU: LLaMa/GPT4All:
pip install -r reqs_optional/requirements_optional_gpt4all.txt
# Optional: PyMuPDF/ArXiv:
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
# Optional: Selenium/PlayWright:
pip install -r reqs_optional/requirements_optional_langchain.urls.txt
# Optional: support docx, pptx, ArXiv, etc. required by some python packages
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libtesseract-dev libreoffice
# Improved OCR with DocTR:
conda install -y -c conda-forge pygobject
pip install -r reqs_optional/requirements_optional_doctr.txt
# go back to older onnx so Tesseract OCR still works
pip install onnxruntime==1.15.0 onnxruntime-gpu==1.15.0
# Optional: for supporting unstructured package
python -m nltk.downloader all
# Optional but required for PlayWright
playwright install --with-deps
```
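As a rough smoke test of the document Q/A stack (a sketch; it assumes `langchain` is pulled in by the optional requirements files above, so adjust the import if your pins differ):
```bash
python -c "import langchain; print(langchain.__version__)"
```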
* GPU Optional: For AutoGPTQ support on x86_64 Linux:
```bash
pip uninstall -y auto-gptq ; pip install https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+cu118-cp310-cp310-linux_x86_64.whl
# in-transformers support of AutoGPTQ
pip install git+https://github.com/huggingface/optimum.git
```
This avoids issues with missing CUDA extensions. If this wheel does not apply to your system, run:
```bash
pip uninstall -y auto-gptq ; GITHUB_ACTIONS=true pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --no-cache-dir
```
If you see `CUDA extension not installed` in the output after loading a model, you need to compile AutoGPTQ yourself, else it will use double the memory and be slower on GPU.
See [AutoGPTQ](README_GPU.md#autogptq) about running AutoGPTQ models.
* GPU Optional: For exllama support on x86_64 Linux:
```bash
pip uninstall -y exllama ; pip install https://github.com/jllllll/exllama/releases/download/0.0.13/exllama-0.0.13+cu118-cp310-cp310-linux_x86_64.whl --no-cache-dir
```
See [exllama](README_GPU.md#exllama) about running exllama models.
* GPU Optional: Support LLaMa.cpp with CUDA:
* Download/Install [CUDA llama-cpp-python wheel](https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels), e.g.:
```bash
pip uninstall -y llama-cpp-python llama-cpp-python-cuda
# GGMLv3 ONLY:
pip install https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.73+cu117-cp310-cp310-linux_x86_64.whl
# GGUF ONLY:
pip install https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.1.83+cu117-cp310-cp310-linux_x86_64.whl
```
* If there are any issues, then you must compile llama-cpp-python with CUDA support:
```bash
pip uninstall -y llama-cpp-python llama-cpp-python-cuda
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.73 --no-cache-dir --verbose
```
* By default, we set `n_gpu_layers` to a large value, so llama.cpp offloads all layers for maximum GPU performance. You can control this by passing `--llamacpp_dict="{'n_gpu_layers':20}"` for a value of 20, or by setting it in the UI. For highest performance, offload *all* layers.
That is, you get maximum performance if the h2oGPT startup log shows all layers offloaded:
```text
llama_model_load_internal: offloaded 35/35 layers to GPU
```
but this requires sufficient GPU memory. Reduce the value, say to 15, if you have a low-memory GPU (see the example command after this list).
* Pass `--max_seq_len=2048` (or some other number) to `generate.py` if you want the model to have a controlled, smaller context; otherwise a relatively large default value is used, which will be slower on CPU.
* For LLaMa2, you can set `max_tokens` to a larger value for longer output.
* If `/usr/bin/nvcc` is mentioned in errors, that file needs to be removed, as it would likely conflict with the version installed for conda.
* Note that once `llama-cpp-python` is compiled to support CUDA, it no longer works for CPU mode, so you would have to reinstall it without the above options to recover CPU mode, or keep a separate h2oGPT env for CPU mode.
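* For example, a low-GPU-memory variant of the LLaMa.cpp run command from the Run section below, offloading only 20 layers (a sketch; adjust the layer count to fit your GPU):
```bash
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --model_path_llama=https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin --llamacpp_dict="{'n_gpu_layers':20}" --max_seq_len=2048
```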
* Control core count for Chroma < 0.4, which uses the chromamigdb package:
* Chroma < 0.4 uses DuckDB 0.8.1, which has no control over the number of threads per database: `import duckdb` leads to all virtual cores being used as threads, and each database consumes another number of threads equal to the virtual core count. To prevent this, rebuild DuckDB using [this modification](https://github.com/h2oai/duckdb/commit/dcd8c1ffc53dd020623630efb99ba6a3a4cbc5ad), or try the prebuilt wheel for x86_64 built on Ubuntu 20:
```bash
pip install https://h2o-release.s3.amazonaws.com/h2ogpt/duckdb-0.8.2.dev4025%2Bg9698e9e6a8.d20230907-cp310-cp310-linux_x86_64.whl --no-cache-dir --force-reinstall --no-deps
```
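A quick check that the patched wheel is the one actually in use (the exact dev version string may differ):
```bash
python -c "import duckdb; print(duckdb.__version__)"
# should report a 0.8.2.dev... build rather than plain 0.8.1
```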
### Compile Install Issues
* `/usr/local/cuda/include/crt/host_config.h:132:2: error: #error -- unsupported GNU version! gcc versions later than 11 are not supported!`
* gcc > 11 is not currently supported by nvcc. Install GCC with a maximum version:
```bash
MAX_GCC_VERSION=11
sudo apt install gcc-$MAX_GCC_VERSION g++-$MAX_GCC_VERSION
sudo update-alternatives --config gcc
# pick version 11
sudo update-alternatives --config g++
# pick version 11
```
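After switching the alternatives, confirm the default compilers really are the pinned version:
```bash
gcc --version   # should report 11.x
g++ --version   # should report 11.x
```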
---
## Run
* Check that Torch can see CUDA:
```python
import torch
print(torch.cuda.is_available())
```
should print `True`.
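The same check as a shell one-liner, which also prints the detected device name (a sketch; assumes Torch was installed from the GPU index above):
```bash
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
```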
* Place all documents in `user_path` or upload them in the UI ([Help with UI](README_ui.md)).
UI using a GPU with at least 24GB of memory, with streaming:
```bash
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat --load_8bit=True --score_model=None --langchain_mode='UserData' --user_path=user_path
```
Same with a smaller model without quantization:
```bash
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --score_model=None --langchain_mode='UserData' --user_path=user_path
```
UI using a LLaMa.cpp LLaMa2 model:
```bash
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --model_path_llama=https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len=4096
```
which works on CPU or GPU (assuming the llama-cpp-python package was compiled with CUDA or Metal support).
If using OpenAI for the LLM is acceptable, but you want documents to be parsed and embedded locally, then do:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server=openai_chat --base_model=gpt-3.5-turbo --score_model=None
```
where `<key>` should be replaced by your OpenAI key, which probably starts with `sk-`. OpenAI is **not** recommended for private document question-answer, but it can be a good reference for testing purposes or when privacy is not required.
Perhaps you want better image-caption performance and to focus the local GPU on that; then do:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server=openai_chat --base_model=gpt-3.5-turbo --score_model=None --captions_model=Salesforce/blip2-flan-t5-xl
```
For Azure OpenAI:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server="openai_azure_chat:<deployment_name>:<base_url>:<api_version>" --base_model=gpt-3.5-turbo --h2ocolors=False --langchain_mode=UserData
```
where `<deployment_name>` is required for Azure; the other entries are optional and can be filled with the string `None` or left empty between the `:` separators. Azure OpenAI is a bit safer for private access to Azure-based docs.
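For example, with only a deployment name set and the optional fields left empty between the `:` separators (the deployment name here is just a hypothetical placeholder; use your own):
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server="openai_azure_chat:my-gpt35-deployment::" --base_model=gpt-3.5-turbo --h2ocolors=False --langchain_mode=UserData
```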
Add `--share=True` to make the Gradio server visible via a shareable URL.
If you see an error about protobuf, try:
```bash
pip install protobuf==3.20.0
```
See [CPU](README_CPU.md) and [GPU](README_GPU.md) for some other general aspects of using h2oGPT on CPU or GPU, such as which models to try.
#### Google Colab
* A Google Colab version of a 3B GPU model is at:
[h2oGPT GPU](https://colab.research.google.com/drive/143-KFHs2iCqXTQLI2pFCDiR69z0dR8iE?usp=sharing)
A local copy of that GPU Google Colab is [h2oGPT_GPU.ipynb](h2oGPT_GPU.ipynb).
* A Google Colab version of a 7B LLaMa CPU model is at:
[h2oGPT CPU](https://colab.research.google.com/drive/13RiBdAFZ6xqDwDKfW6BG_-tXfXiqPNQe?usp=sharing)
A local copy of that CPU Google Colab is [h2oGPT_CPU.ipynb](h2oGPT_CPU.ipynb).