ruiheesi committed · bfd0858
Parent(s): 0e9571b

Add application file

Files changed:
- Dockerfile +51 -0
- LICENSE +21 -0
- README.md +47 -11
- app.py +128 -0
- caNano_embedding_pack_5_14.pickle +3 -0
- environment.yml +12 -0
- src/embedding_qa.py +194 -0
- src/gpt_local_config.cfg +11 -0
- src/readme.md +35 -0
- static/caNanoLablogo.jpg +0 -0
- templates/index.html +149 -0
- templates/login.html +60 -0
- templates/result.html +0 -0
Dockerfile
ADDED
@@ -0,0 +1,51 @@
+# Use the official Ubuntu 20.04 image as the base
+FROM ubuntu:20.04
+
+# Set environment variables
+ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    wget \
+    bzip2 \
+    ca-certificates \
+    curl \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Miniconda
+ENV CONDA_DIR=/opt/conda
+RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && \
+    /bin/bash /tmp/miniconda.sh -b -p $CONDA_DIR && \
+    rm /tmp/miniconda.sh
+
+# Add Miniconda to the PATH
+ENV PATH=$CONDA_DIR/bin:$PATH
+
+# Install Mamba
+# RUN conda install -y -c conda-forge mamba && \
+#     mamba --version
+
+# Create a Conda environment using Mamba
+COPY environment.yml /tmp/environment.yml
+RUN conda env create -n caNanoWikiAI -f /tmp/environment.yml && \
+    rm /tmp/environment.yml
+
+# Activate the Conda environment by default
+ENV PATH=$CONDA_DIR/envs/caNanoWikiAI/bin:$PATH
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Copy your application files
+COPY . /app
+
+# Expose the container port
+EXPOSE 5000
+
+# Set environment variables (optional)
+ENV FLASK_APP=app.py
+ENV FLASK_RUN_HOST=0.0.0.0
+
+# Define the command to run your application
+CMD [ "python", "app.py" ]
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Rui He
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,11 +1,47 @@
+# caNanoWiki_AI Web Application
+This hobby project aims to provide a ChatGPT-powered digital assistant that helps users find answers in a Wiki. The project is inspired by the OpenAI Cookbook, though we found there are a lot of concepts to understand and infrastructure to build to make it work.
+
+The authors are Rui He and Weina Ke.
+
+This is the backend of the caNanoLibrarian app, an LLM-based natural-language search experience for a structured database.
+
+The application is hosted at https://cananowikipda.azurewebsites.net/login; if you want to play with it, the passcode is caNanoWikiAI_DEMO_wkrh_51423923*. We have a limited budget for this project, so please let us know if you want to continue exploring it when GPT complains about the usage limit. Thanks.
+
+The documentation in this repo was mostly generated by ChatGPT, enjoy!
+
+This repository contains a Flask-based web application designed to interact with OpenAI's GPT-3.5 Turbo model. The application is primarily used for answering queries with context, leveraging the capabilities of the GPT-3.5 Turbo model.
+
+## Key Features
+
+1. **Authentication**: The application has a simple authentication system. A user must enter a passcode to access the main page of the application. If the passcode is correct, the user is authenticated and can access the application. The application also includes a timeout feature, where the user is automatically logged out after a certain period of inactivity or after a maximum session duration.
+
+2. **Query Processing**: The application allows users to input queries, which are then processed by the `embedding_qa.answer_query_with_context` function. This function uses a document dataframe and document embeddings (loaded from a pickle file) to provide context for the query.
+
+3. **Interaction with GPT-3.5 Turbo**: The application uses OpenAI's GPT-3.5 Turbo model to generate responses to user queries. The parameters for the model, such as temperature and max tokens, are defined in the application.
+
+4. **Web Interface**: The application provides a web interface for users to interact with. This includes a login page, a logout function, and an index page where users can input queries and view responses.
+
+5. **Configuration**: The application uses a configuration file (`gpt_local_config.cfg`) to set the OpenAI API key and other parameters.
+
+The application is designed to be run on a local server, with the host set to '0.0.0.0' and the port set to 5000.
+
+## Usage
+
+To run the application, navigate to the directory containing the application and run the command `python app.py`. This will start the application on your local server.
+
+## Dependencies
+
+The application requires the following Python packages:
+
+- Flask
+- OpenAI
+- Tiktoken
+- Numpy
+- Pandas
+- Configparser (standard library)
+- Pickle (standard library)
+
+The third-party packages can be installed using pip:
+
+```bash
+pip install flask openai tiktoken numpy pandas
+```
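The authentication timeout described in the README's Key Features boils down to two time checks: an inactivity window and an absolute session limit. A minimal sketch of that logic as a pure function, independent of Flask (the function name and constants mirror the app's globals but are illustrative):

```python
TIMEOUT_DURATION = 5 * 60    # log out after 5 minutes of inactivity
SESSION_DURATION = 30 * 60   # hard logout 30 minutes after login

def session_expired(now, last_activity_time, login_time):
    """Return True if the user should be logged out."""
    if now - last_activity_time > TIMEOUT_DURATION:
        return True  # inactivity timeout
    if now - login_time > SESSION_DURATION:
        return True  # absolute session limit
    return False

# Active within both windows: still logged in.
print(session_expired(60.0, 30.0, 0.0))             # False
# 6 minutes idle: the inactivity timeout fires.
print(session_expired(6 * 60.0, 0.0, 0.0))          # True
# 31 minutes since login, even if recently active.
print(session_expired(31 * 60.0, 30 * 60.0, 0.0))   # True
```

Note that the app tracks these values in module-level globals, so in its current form the state is shared by all visitors rather than per-session.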
app.py
ADDED
@@ -0,0 +1,128 @@
+import os
+import sys
+import time
+import pickle
+import openai
+import configparser
+from flask import Flask, render_template, request, redirect, url_for
+dir_path = os.path.abspath(os.getcwd())
+
+src_path = dir_path + "/src"
+sys.path.append(src_path)
+
+COMPLETIONS_MODEL = "gpt-3.5-turbo"
+EMBEDDING_MODEL = "text-embedding-ada-002"
+config_dir = dir_path + "/src/utils"
+config = configparser.ConfigParser()
+config.read(os.path.join(config_dir, 'gpt_local_config.cfg'))
+openai.api_key = config.get('token', 'GPT_TOKEN')
+
+import embedding_qa as emq
+
+# Specify the path to your pickle file
+pickle_file_path = 'caNano_embedding_pack_5_14.pickle'
+
+# Load the pickle file
+with open(pickle_file_path, 'rb') as file:
+    loaded_data = pickle.load(file)
+
+document_df = loaded_data['df']
+document_embedding = loaded_data['embedding']
+
+COMPLETIONS_API_PARAMS = {
+    # We use temperature of 0.0 because it gives the
+    # most predictable, factual answer.
+    "temperature": 0.0,
+    "max_tokens": 4000,
+    "model": "gpt-3.5-turbo"
+}
+
+app = Flask("caNanoWiki_AI")
+
+# Set the passcode for authentication
+PASSCODE_auth = ""
+
+# Define a variable to track if the user is authenticated
+authenticated = False
+last_activity_time = 0
+
+# Timeout duration in seconds
+timeout_duration = 5 * 60
+
+# Session Length
+session_duration = 30 * 60
+
+
+@app.template_filter('nl2br')
+def nl2br_filter(s):
+    return s.replace('\n', '<br>')
+
+
+@app.route('/', methods=['GET', 'POST'])
+def index():
+    global authenticated, last_activity_time, login_time
+
+    if not authenticated:
+        return redirect(url_for('login'))
+
+    # Check for timeout
+    current_time = time.time()
+    if current_time - last_activity_time > timeout_duration:
+        authenticated = False
+        return redirect(url_for('login'))
+
+    # Check for session timeout
+    if current_time - login_time > session_duration:
+        authenticated = False
+        return redirect(url_for('login'))
+
+    # Update last activity time
+    last_activity_time = current_time
+
+    user_input = ""
+    processed_input = None
+    if request.method == 'POST':
+        user_input = request.form['user_input']
+
+        processed_input, chosen_sec_idxes = emq.answer_query_with_context(
+            user_input,
+            document_df,
+            document_embedding
+        )
+
+        return render_template(
+            'index.html',
+            processed_input=processed_input,
+            source_sections=chosen_sec_idxes,
+            user_input=user_input,
+            authenticated=authenticated)
+
+    return render_template('index.html', authenticated=authenticated)
+
+
+@app.route('/login', methods=['GET', 'POST'])
+def login():
+    global authenticated, last_activity_time, login_time
+
+    if request.method == 'POST':
+        password = request.form['passcode']
+        if password == PASSCODE_auth:
+            authenticated = True
+            last_activity_time = time.time()
+            login_time = time.time()
+            return redirect(url_for('index'))
+        else:
+            return render_template('login.html', message='Incorrect password')
+
+    return render_template('login.html')
+
+
+@app.route('/logout')
+def logout():
+    global authenticated
+    authenticated = False
+    return redirect(url_for('login'))
+
+
+if __name__ == '__main__':
+    app.run(host='0.0.0.0', port=7860)
caNano_embedding_pack_5_14.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d845a0157f30e461987c065554b35b79b7c3db868b5841ea8b0202ca3ea221f8
+size 4816021
environment.yml
ADDED
@@ -0,0 +1,12 @@
+name: caNanoWikiAI
+channels:
+  - conda-forge
+  - defaults
+dependencies:
+  - python=3.10.9
+  - openai=0.27.5
+  - numpy=1.24.3
+  - pandas=2.0.1
+  - tiktoken=0.4.0
+  - configparser=5.3.0
+  - flask=2.3.2
src/embedding_qa.py
ADDED
@@ -0,0 +1,194 @@
+import os
+import openai
+import tiktoken
+import warnings
+import numpy as np
+import pandas as pd
+import configparser
+
+# Mute the PerformanceWarning
+warnings.filterwarnings("ignore", category=Warning)
+dir_path = os.path.abspath(os.getcwd())
+config_dir = dir_path + "/src"
+COMPLETIONS_MODEL = "gpt-3.5-turbo"
+EMBEDDING_MODEL = "text-embedding-ada-002"
+config = configparser.ConfigParser()
+config.read(os.path.join(config_dir, 'gpt_local_config.cfg'))
+# openai.api_key = config.get('token', 'GPT_TOKEN')
+openai.api_key = os.environ.get("GPT_TOKEN")
+SEPARATOR = "\n* "
+ENCODING = "gpt2"  # encoding for text-davinci-003
+MAX_SECTION_LEN = 4000
+encoding = tiktoken.get_encoding(ENCODING)
+separator_len = len(encoding.encode(SEPARATOR))
+
+# The embedding functions were inspired by the example
+# "Question answering using embeddings-based search"
+# in the OpenAI Cookbook repo (https://github.com/openai/openai-cookbook),
+# which hosts a great number of example applications
+# using OpenAI APIs. The content is fast evolving and the
+# current example is far different than what I saw before.
+# It is a great resource to learn from and get inspired!
+
+
+def get_embedding(
+    text: str,
+    model: str = EMBEDDING_MODEL
+) -> list[float]:
+
+    result = openai.Embedding.create(
+        model=model,
+        input=text
+    )
+    return result["data"][0]["embedding"]
+
+
+def compute_doc_embeddings(
+    df: pd.DataFrame
+) -> dict[tuple[str, str], list[float]]:
+    """
+    Create an embedding for each row in the dataframe
+    using the OpenAI Embeddings API.
+
+    Return a dictionary that maps between each embedding
+    vector and the index of the row that it corresponds to.
+    """
+    return {
+        idx: get_embedding(r.content) for idx, r in df.iterrows()
+    }
+
+
+def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
+    """
+    Read the document embeddings and their keys from a CSV.
+
+    fname is the path to a CSV with exactly these named columns:
+        "title", "heading", "0", "1", ...
+    up to the length of the embedding vectors.
+    """
+
+    df = pd.read_csv(fname, header=0)
+    max_dim = max([
+        int(c) for c in df.columns if c != "title" and c != "heading"
+    ])
+    return {
+        (r.title, r.heading): [
+            r[str(i)] for i in range(max_dim + 1)
+        ] for _, r in df.iterrows()
+    }
+
+
+def vector_similarity(x: list[float], y: list[float]) -> float:
+    """
+    Returns the similarity between two vectors.
+    Because OpenAI Embeddings are normalized to length 1,
+    the cosine similarity is the same as the dot product.
+    """
+    return np.dot(np.array(x), np.array(y))
+
+
+def order_document_sections_by_query_similarity(
+    query: str,
+    contexts: dict[(str, str), np.array]
+) -> list[(float, (str, str))]:
+    """
+    Find the query embedding for the supplied query,
+    and compare it against all of the pre-calculated document embeddings
+    to find the most relevant sections.
+
+    Return the list of document sections,
+    sorted by relevance in descending order.
+    """
+    query_embedding = get_embedding(query)
+
+    document_similarities = sorted([
+        (vector_similarity(
+            query_embedding,
+            doc_embedding
+        ), doc_index) for doc_index, doc_embedding in contexts.items()
+    ], reverse=True)
+
+    return document_similarities
+
+
+def construct_prompt(
+    question: str,
+    context_embeddings: dict,
+    df: pd.DataFrame,
+    show_section=False
+) -> str:
+    """
+    Fetch the most relevant document sections and assemble the prompt.
+    """
+    most_relevant_doc_secs = order_document_sections_by_query_similarity(
+        question,
+        context_embeddings
+    )
+
+    chosen_sections = []
+    chosen_sections_len = 0
+    chosen_sections_indexes = []
+
+    for _, section_index in most_relevant_doc_secs:
+        # Add contexts until we run out of space.
+        document_section = df.loc[section_index]
+        chosen_sections_len += document_section.tokens.values[0] + \
+            separator_len
+        if chosen_sections_len > MAX_SECTION_LEN:
+            break
+
+        chosen_sections.append(
+            SEPARATOR +
+            document_section.content.values[0].replace("\n", " ")
+        )
+        chosen_sections_indexes.append(str(section_index))
+
+    # Useful diagnostic information
+    if show_section:
+        print(f"Selected {len(chosen_sections)} document sections:")
+        print("\n".join(chosen_sections_indexes))
+
+    string_list = [str(item) for item in chosen_sections]
+    chosen_sections_str = ''.join(string_list)
+    header = "Answer the question strictly using the provided context," + \
+        " and if the answer is not contained within the text below," + \
+        " say 'Sorry, your inquiry is not in the Wiki. For further" + \
+        " assistance, please contact caNanoLab-Support@ISB-CGC.org' " + \
+        "\n\nContext:\n"
+    prompt = header + chosen_sections_str + "\n\n Q: " + question + "\n A:"
+
+    return prompt, chosen_sections_indexes
+
+
+def answer_query_with_context(
+    query: str,
+    df: pd.DataFrame,
+    document_embeddings: dict[(str, str), np.array],
+    show_prompt: bool = False,
+    show_source: bool = False
+) -> str:
+    prompt, chosen_sections_indexes = construct_prompt(
+        query,
+        document_embeddings,
+        df
+    )
+
+    if show_prompt:
+        print(prompt)
+
+    response = openai.ChatCompletion.create(
+        model="gpt-3.5-turbo",
+        messages=[{
+            "role": "user",
+            "content": prompt
+        }],
+        temperature=0,
+        max_tokens=500
+        # top_p=1,
+        # frequency_penalty=0,
+        # presence_penalty=0
+    )
+    msg = response.choices[0]['message']['content']
+    chosen_sections_indexes = "<br>".join(chosen_sections_indexes)
+
+    return msg, chosen_sections_indexes
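The construct_prompt function in src/embedding_qa.py accumulates the highest-ranked sections until a token budget (MAX_SECTION_LEN) is exhausted. The core of that loop can be sketched with a fixed stand-in separator cost so it runs without tiktoken (the function name and the section data are illustrative, not from the repo):

```python
MAX_SECTION_LEN = 4000
SEPARATOR = "\n* "
SEP_TOKENS = 3  # stand-in for len(encoding.encode(SEPARATOR))

def select_sections(ranked_sections, max_len=MAX_SECTION_LEN):
    """ranked_sections: list of (token_count, content), most relevant first.

    Greedily keep sections until the running token total would
    exceed the budget, mirroring the loop in construct_prompt."""
    chosen, used = [], 0
    for tokens, content in ranked_sections:
        used += tokens + SEP_TOKENS
        if used > max_len:
            break
        chosen.append(SEPARATOR + content.replace("\n", " "))
    return chosen

sections = [(3000, "intro"), (900, "details\nmore"), (500, "extra")]
chosen = select_sections(sections)
# The third section would push the total past 4000 tokens, so only two are kept.
print(len(chosen))  # 2
```

As in the original, the budget check happens after adding a section's count, so the section that overflows the budget is dropped and iteration stops.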
src/gpt_local_config.cfg
ADDED
@@ -0,0 +1,11 @@
+[token]
+GPT_TOKEN =
+[model]
+model_for_fine_tune = davinci
+fine_tune_model_id =
+model_for_chat = gpt-3.5-turbo
+model_for_img_recog =
+[tools]
+data_praperation_script = prepare_data.sh
+[data]
+test = test.csv
src/readme.md
ADDED
@@ -0,0 +1,35 @@
+# OpenAI Embedding and Query Processing
+
+This Python script is designed to interact with OpenAI's GPT-3.5 Turbo model and the OpenAI Embedding API. It provides functionality for creating and loading embeddings, calculating vector similarity, ordering document sections by query similarity, and constructing prompts for the GPT-3.5 Turbo model.
+
+## Key Features
+
+1. **Embedding Creation and Loading**: The script includes functions for creating embeddings for each row in a dataframe using the OpenAI Embedding API (`compute_doc_embeddings`) and for loading embeddings from a CSV file (`load_embeddings`).
+
+2. **Vector Similarity**: The `vector_similarity` function calculates the similarity between two vectors. This is used to compare the embedding of a user's query with the embeddings of document sections.
+
+3. **Document Section Ordering**: The `order_document_sections_by_query_similarity` function compares the embedding of a user's query with the embeddings of document sections and returns a list of document sections sorted by relevance in descending order.
+
+4. **Prompt Construction**: The `construct_prompt` function constructs a prompt for the GPT-3.5 Turbo model based on a user's query and the most relevant document sections.
+
+5. **Query Answering**: The `answer_query_with_context` function uses the GPT-3.5 Turbo model to generate a response to a user's query. It constructs a prompt based on the user's query and the most relevant document sections, sends this prompt to the GPT-3.5 Turbo model, and returns the model's response.
+
+## Usage
+
+To use this script, import it into your Python project and call the functions as needed. For example, you might use the `compute_doc_embeddings` function to create embeddings for your document sections, the `order_document_sections_by_query_similarity` function to order the sections by relevance to a user's query, and the `answer_query_with_context` function to generate a response to the query.
+
+## Dependencies
+
+This script requires the following Python packages (`configparser` and `warnings` are part of the Python standard library):
+
+- OpenAI
+- Tiktoken
+- Numpy
+- Pandas
+
+The third-party packages can be installed using pip:
+
+```bash
+pip install openai tiktoken numpy pandas
+```
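Because ada-002 embeddings are normalized to unit length, ranking sections by cosine similarity reduces to a dot product followed by a sort, which is what `vector_similarity` and `order_document_sections_by_query_similarity` do together. A toy illustration with hand-made unit vectors (real embeddings have 1536 dimensions; the section keys here are made up):

```python
import numpy as np

def vector_similarity(x, y):
    # Embeddings are normalized, so the dot product equals cosine similarity.
    return np.dot(np.array(x), np.array(y))

query = [1.0, 0.0]
contexts = {
    ("Wiki", "Intro"):   [0.6, 0.8],
    ("Wiki", "Samples"): [1.0, 0.0],
    ("Wiki", "FAQ"):     [0.0, 1.0],
}

# Score every section against the query, most similar first.
ranked = sorted(
    ((vector_similarity(query, emb), idx) for idx, emb in contexts.items()),
    reverse=True,
)
print(ranked[0][1])  # ('Wiki', 'Samples') — the exact match ranks first
```

The descending sort on `(score, key)` tuples is exactly the ordering the real function returns; the downstream prompt builder then consumes sections from the front of this list.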
static/caNanoLablogo.jpg
ADDED
templates/index.html
ADDED
@@ -0,0 +1,149 @@
+<!DOCTYPE html>
+<html>
+<head>
+    <title>caNanoWiki AI</title>
+    <style>
+        body {
+            font-family: Arial, sans-serif;
+            background-color: #001f3f; /* Dark blue */
+            color: #fff;
+            margin: 0;
+            padding: 20px;
+        }
+
+        h1 {
+            font-size: 32px;
+            text-align: center;
+            margin-bottom: 40px;
+        }
+
+        .logo {
+            display: block;
+            margin: 0 auto;
+            margin-bottom: 40px;
+            width: 200px;
+        }
+
+        form {
+            text-align: center;
+            margin-bottom: 40px;
+        }
+
+        label {
+            display: block;
+            font-size: 20px;
+            margin-bottom: 10px;
+        }
+
+        input[type="text"] {
+            font-size: 18px;
+            padding: 10px;
+            width: 500px;
+            border-radius: 10px;
+        }
+
+        button {
+            font-size: 18px;
+            padding: 10px 20px;
+            background-color: #00bfff;
+            color: #fff;
+            border: none;
+            border-radius: 10px;
+            cursor: pointer;
+            margin-top: 10px;
+        }
+
+        button:hover {
+            background-color: #0088cc;
+        }
+
+        .result {
+            background-color: #fff;
+            color: #000;
+            padding: 20px;
+            border-radius: 10px;
+            text-align: center;
+            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);
+            margin: 0 auto;
+            max-width: 600px;
+        }
+
+        .loading {
+            font-size: 18px;
+            text-align: center;
+            margin-bottom: 20px;
+        }
+        .logout-button {
+            position: absolute;
+            top: 10px;
+            right: 10px;
+            color: #fff;
+            background-color: #333;
+            padding: 10px;
+            text-decoration: none;
+        }
+
+        .another-box.folded {
+            display: flex;
+            justify-content: center;
+            align-items: center;
+            background-color: #999;
+            color: #fff;
+            border-radius: 10px;
+            padding: 20px;
+            cursor: pointer;
+            /* Additional styling for the folded state */
+        }
+
+        .another-box {
+            background-color: #fff;
+            color: #000;
+            border-radius: 10px;
+            padding: 20px;
+            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);
+            margin-top: 20px;
+            /* Additional styling for the unfolded state */
+        }
+    </style>
+</head>
+<body>
+    <h1>Welcome to the caNanoWiki Personal Digital Assistant</h1>
+
+    <img src="{{ url_for('static', filename='caNanoLablogo.jpg') }}" alt="caNanoWiki AI Logo" class="logo">
+
+    {% if authenticated %}
+    <a class="logout-button" href="{{ url_for('logout') }}">Logout</a>
+    {% endif %}
+
+    <form method="POST" action="/">
+        <label for="user_input">How can I help?</label>
+        <input type="text" id="user_input" name="user_input" value="{{ user_input }}" autofocus>
+        <br>
+        <button type="submit">Search</button>
+    </form>
+
+    {% if processing %}
+    <p class="loading">Working on it...</p>
+    {% endif %}
+
+    {% if processed_input %}
+    <div class="result">
+        <p>{{ processed_input }}</p>
+    </div>
+    <div class="Source-Info" onclick="toggleFoldedState(this)">
+        <h2>Selected Wiki Sources</h2>
+        <p>{{ source_sections | nl2br | safe }}</p>
+    </div>
+    {% else %}
+    <div class="result" style="display: none;"></div>
+    {% endif %}
+
+    <script>
+        function toggleFoldedState(element) {
+            element.classList.toggle('folded');
+        }
+    </script>
+</body>
+</html>
templates/login.html
ADDED
@@ -0,0 +1,60 @@
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Login</title>
+    <style>
+        body {
+            background-color: #11182B;
+            font-family: Arial, sans-serif;
+            color: #FFFFFF;
+            display: flex;
+            align-items: center;
+            justify-content: center;
+            height: 100vh;
+        }
+
+        .login-container {
+            width: 350px;
+            padding: 20px;
+            background-color: #223E6D;
+            border-radius: 10px;
+        }
+
+        .login-container label, .login-container input {
+            display: block;
+            width: 100%;
+            margin-bottom: 10px;
+        }
+
+        .login-container input {
+            padding: 10px;
+            border-radius: 5px;
+        }
+
+        .login-container button {
+            display: block;
+            width: 100%;
+            padding: 10px;
+            border: none;
+            border-radius: 5px;
+            background-color: #2D82B7;
+            color: #FFFFFF;
+            cursor: pointer;
+        }
+
+        .login-container button:hover {
+            background-color: #1B577B;
+        }
+    </style>
+</head>
+<body>
+    <div class="login-container">
+        <h1>Login</h1>
+        <form method="POST" action="/login">
+            <label for="passcode">Please enter the passcode:</label>
+            <input type="password" id="passcode" name="passcode">
+            <button type="submit">Submit</button>
+        </form>
+    </div>
+</body>
+</html>
templates/result.html
ADDED
File without changes