ruiheesi committed · bfd0858
Parent(s): 0e9571b

Add application file

Files changed:
- Dockerfile +51 -0
- LICENSE +21 -0
- README.md +47 -11
- app.py +128 -0
- caNano_embedding_pack_5_14.pickle +3 -0
- environment.yml +12 -0
- src/embedding_qa.py +194 -0
- src/gpt_local_config.cfg +11 -0
- src/readme.md +35 -0
- static/caNanoLablogo.jpg +0 -0
- templates/index.html +149 -0
- templates/login.html +60 -0
- templates/result.html +0 -0
Dockerfile
ADDED
@@ -0,0 +1,51 @@
+# Use the official Ubuntu 20.04 image as the base
+FROM ubuntu:20.04
+
+# Set environment variables
+ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    wget \
+    bzip2 \
+    ca-certificates \
+    curl \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Miniconda
+ENV CONDA_DIR=/opt/conda
+RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && \
+    /bin/bash /tmp/miniconda.sh -b -p $CONDA_DIR && \
+    rm /tmp/miniconda.sh
+
+# Add Miniconda to the PATH
+ENV PATH=$CONDA_DIR/bin:$PATH
+
+# Install Mamba
+# RUN conda install -y -c conda-forge mamba && \
+#     mamba --version
+
+# Create a Conda environment using Mamba
+COPY environment.yml /tmp/environment.yml
+RUN conda env create -n caNanoWikiAI -f /tmp/environment.yml && \
+    rm /tmp/environment.yml
+
+# Activate the Conda environment by default
+ENV PATH=$CONDA_DIR/envs/caNanoWikiAI/bin:$PATH
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Copy your application files
+COPY . /app
+
+# Expose the container port
+EXPOSE 5000
+
+# Set environment variables (optional)
+ENV FLASK_APP=app.py
+ENV FLASK_RUN_HOST=0.0.0.0
+
+# Define the command to run your application
+CMD [ "python", "app.py" ]
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Rui He
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,11 +1,47 @@
+# caNanoWiki_AI Web Application
+This hobby project aims to provide a ChatGPT-powered digital assistant that helps users find answers in a Wiki. The project is inspired by the OpenAI Cookbook, though we found there are a lot of concepts to understand and infrastructure to build to make it work.
+
+The authors are Rui He and Weina Ke.
+
+This is the backend of the caNanoLibrarian app, an LLM-based natural-language search experience for a structured database.
+
+The application is hosted at https://cananowikipda.azurewebsites.net/login; if you want to play with it, the passcode is caNanoWikiAI_DEMO_wkrh_51423923*. We have a limited budget for this project, so please let us know if you want to continue exploring it when GPT complains about the usage limit. Thanks.
+
+The documentation in this repo was mostly generated by ChatGPT, enjoy!
+
+This repository contains a Flask-based web application designed to interact with OpenAI's GPT-3.5 Turbo model. The application is primarily used for answering queries with context, leveraging the capabilities of the GPT-3.5 Turbo model.
+
+## Key Features
+
+1. **Authentication**: The application has a simple authentication system. A user must enter a passcode to access the main page of the application. If the passcode is correct, the user is authenticated and can access the application. The application also includes a timeout feature, where the user is automatically logged out after a certain period of inactivity or after a maximum session duration.
+
+2. **Query Processing**: The application allows users to input queries, which are then processed by the `embedding_qa.answer_query_with_context` function. This function uses a document dataframe and document embeddings (loaded from a pickle file) to provide context for the query.
+
+3. **Interaction with GPT-3.5 Turbo**: The application uses OpenAI's GPT-3.5 Turbo model to generate responses to user queries. The parameters for the model, such as temperature and max tokens, are defined in the application.
+
+4. **Web Interface**: The application provides a web interface for users to interact with. This includes a login page, a logout function, and an index page where users can input queries and view responses.
+
+5. **Configuration**: The application uses a configuration file (`gpt_local_config.cfg`) to set the OpenAI API key and other parameters.
+
+The application is designed to be run on a local server, with the host set to '0.0.0.0' and the port set to 5000.
+
+## Usage
+
+To run the application, navigate to the directory containing the application and run the command `python app.py`. This will start the application on your local server.
+
+## Dependencies
+
+The application requires the following Python packages:
+
+- Flask
+- OpenAI
+- Tiktoken
+- Numpy
+- Pandas
+- Configparser (standard library)
+- Pickle (standard library)
+
+The third-party packages can be installed using pip:
+
+```bash
+pip install flask openai tiktoken numpy pandas
+```
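The authentication timeout described in the README's Key Features boils down to two time checks: an inactivity window and an absolute session limit. A minimal sketch of that logic as a pure function, independent of Flask (the function name and constants mirror the app's globals but are illustrative):

```python
TIMEOUT_DURATION = 5 * 60    # log out after 5 minutes of inactivity
SESSION_DURATION = 30 * 60   # hard logout 30 minutes after login

def session_expired(now, last_activity_time, login_time):
    """Return True if the user should be logged out."""
    if now - last_activity_time > TIMEOUT_DURATION:
        return True  # inactivity timeout
    if now - login_time > SESSION_DURATION:
        return True  # absolute session limit
    return False

# Active within both windows: still logged in.
print(session_expired(60.0, 30.0, 0.0))             # False
# 6 minutes idle: the inactivity timeout fires.
print(session_expired(6 * 60.0, 0.0, 0.0))          # True
# 31 minutes since login, even if recently active.
print(session_expired(31 * 60.0, 30 * 60.0, 0.0))   # True
```

Note that the app tracks these values in module-level globals, so in its current form the state is shared by all visitors rather than per-session.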
app.py
ADDED
@@ -0,0 +1,128 @@
+import os
+import sys
+import time
+import pickle
+import openai
+import configparser
+from flask import Flask, render_template, request, redirect, url_for
+dir_path = os.path.abspath(os.getcwd())
+
+src_path = dir_path + "/src"
+sys.path.append(src_path)
+
+COMPLETIONS_MODEL = "gpt-3.5-turbo"
+EMBEDDING_MODEL = "text-embedding-ada-002"
+config_dir = dir_path + "/src/utils"
+config = configparser.ConfigParser()
+config.read(os.path.join(config_dir, 'gpt_local_config.cfg'))
+openai.api_key = config.get('token', 'GPT_TOKEN')
+
+import embedding_qa as emq
+
+# Specify the path to your pickle file
+pickle_file_path = 'caNano_embedding_pack_5_14.pickle'
+
+# Load the pickle file
+with open(pickle_file_path, 'rb') as file:
+    loaded_data = pickle.load(file)
+
+document_df = loaded_data['df']
+document_embedding = loaded_data['embedding']
+
+COMPLETIONS_API_PARAMS = {
+    # We use temperature of 0.0 because it gives the
+    # most predictable, factual answer.
+    "temperature": 0.0,
+    "max_tokens": 4000,
+    "model": "gpt-3.5-turbo"
+}
+
+app = Flask("caNanoWiki_AI")
+
+# Set the passcode for authentication
+PASSCODE_auth = ""
+
+# Define a variable to track if the user is authenticated
+authenticated = False
+last_activity_time = 0
+
+# Timeout duration in seconds
+timeout_duration = 5 * 60
+
+# Session Length
+session_duration = 30 * 60
+
+
+@app.template_filter('nl2br')
+def nl2br_filter(s):
+    return s.replace('\n', '<br>')
+
+
+@app.route('/', methods=['GET', 'POST'])
+def index():
+    global authenticated, last_activity_time, login_time
+
+    if not authenticated:
+        return redirect(url_for('login'))
+
+    # Check for timeout
+    current_time = time.time()
+    if current_time - last_activity_time > timeout_duration:
+        authenticated = False
+        return redirect(url_for('login'))
+
+    # Check for session timeout
+    if current_time - login_time > session_duration:
+        authenticated = False
+        return redirect(url_for('login'))
+
+    # Update last activity time
+    last_activity_time = current_time
+
+    user_input = ""
+    processed_input = None
+    if request.method == 'POST':
+        user_input = request.form['user_input']
+
+        processed_input, chosen_sec_idxes = emq.answer_query_with_context(
+            user_input,
+            document_df,
+            document_embedding
+        )
+
+        return render_template(
+            'index.html',
+            processed_input=processed_input,
+            source_sections=chosen_sec_idxes,
+            user_input=user_input,
+            authenticated=authenticated)
+
+    return render_template('index.html', authenticated=authenticated)
+
+
+@app.route('/login', methods=['GET', 'POST'])
+def login():
+    global authenticated, last_activity_time, login_time
+
+    if request.method == 'POST':
+        password = request.form['passcode']
+        if password == PASSCODE_auth:
+            authenticated = True
+            last_activity_time = time.time()
+            login_time = time.time()
+            return redirect(url_for('index'))
+        else:
+            return render_template('login.html', message='Incorrect password')
+
+    return render_template('login.html')
+
+
+@app.route('/logout')
+def logout():
+    global authenticated
+    authenticated = False
+    return redirect(url_for('login'))
+
+
+if __name__ == '__main__':
+    app.run(host='0.0.0.0', port=7860)
caNano_embedding_pack_5_14.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d845a0157f30e461987c065554b35b79b7c3db868b5841ea8b0202ca3ea221f8
+size 4816021
environment.yml
ADDED
@@ -0,0 +1,12 @@
+name: caNanoWikiAI
+channels:
+  - conda-forge
+  - defaults
+dependencies:
+  - python=3.10.9
+  - openai=0.27.5
+  - numpy=1.24.3
+  - pandas=2.0.1
+  - tiktoken=0.4.0
+  - configparser=5.3.0
+  - flask=2.3.2
src/embedding_qa.py
ADDED
@@ -0,0 +1,194 @@
+import os
+import openai
+import tiktoken
+import warnings
+import numpy as np
+import pandas as pd
+import configparser
+
+# Mute the PerformanceWarning
+warnings.filterwarnings("ignore", category=Warning)
+dir_path = os.path.abspath(os.getcwd())
+config_dir = dir_path + "/src"
+COMPLETIONS_MODEL = "gpt-3.5-turbo"
+EMBEDDING_MODEL = "text-embedding-ada-002"
+config = configparser.ConfigParser()
+config.read(os.path.join(config_dir, 'gpt_local_config.cfg'))
+# openai.api_key = config.get('token', 'GPT_TOKEN')
+openai.api_key = os.environ.get("GPT_TOKEN")
+SEPARATOR = "\n* "
+ENCODING = "gpt2"  # encoding for text-davinci-003
+MAX_SECTION_LEN = 4000
+encoding = tiktoken.get_encoding(ENCODING)
+separator_len = len(encoding.encode(SEPARATOR))
+
+# The embedding functions were inspired by the example
+# "Question answering using embeddings-based search"
+# in the OpenAI Cookbook repo (https://github.com/openai/openai-cookbook),
+# which hosts a great number of example applications
+# using OpenAI APIs. The content is fast evolving and the
+# current example is far different than what I saw before.
+# It is a great resource to learn from and get inspired!
+
+
+def get_embedding(
+    text: str,
+    model: str = EMBEDDING_MODEL
+) -> list[float]:
+
+    result = openai.Embedding.create(
+        model=model,
+        input=text
+    )
+    return result["data"][0]["embedding"]
+
+
+def compute_doc_embeddings(
+    df: pd.DataFrame
+) -> dict[tuple[str, str], list[float]]:
+    """
+    Create an embedding for each row in the dataframe
+    using the OpenAI Embeddings API.
+
+    Return a dictionary that maps between each embedding
+    vector and the index of the row that it corresponds to.
+    """
+    return {
+        idx: get_embedding(r.content) for idx, r in df.iterrows()
+    }
+
+
+def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
+    """
+    Read the document embeddings and their keys from a CSV.
+
+    fname is the path to a CSV with exactly these named columns:
+        "title", "heading", "0", "1", ...
+    up to the length of the embedding vectors.
+    """
+
+    df = pd.read_csv(fname, header=0)
+    max_dim = max([
+        int(c) for c in df.columns if c != "title" and c != "heading"
+    ])
+    return {
+        (r.title, r.heading): [
+            r[str(i)] for i in range(max_dim + 1)
+        ] for _, r in df.iterrows()
+    }
+
+
+def vector_similarity(x: list[float], y: list[float]) -> float:
+    """
+    Returns the similarity between two vectors.
+    Because OpenAI Embeddings are normalized to length 1,
+    the cosine similarity is the same as the dot product.
+    """
+    return np.dot(np.array(x), np.array(y))
+
+
+def order_document_sections_by_query_similarity(
+    query: str,
+    contexts: dict[(str, str), np.array]
+) -> list[(float, (str, str))]:
+    """
+    Find the query embedding for the supplied query,
+    and compare it against all of the pre-calculated document embeddings
+    to find the most relevant sections.
+
+    Return the list of document sections,
+    sorted by relevance in descending order.
+    """
+    query_embedding = get_embedding(query)
+
+    document_similarities = sorted([
+        (vector_similarity(
+            query_embedding,
+            doc_embedding
+        ), doc_index) for doc_index, doc_embedding in contexts.items()
+    ], reverse=True)
+
+    return document_similarities
+
+
+def construct_prompt(
+    question: str,
+    context_embeddings: dict,
+    df: pd.DataFrame,
+    show_section=False
+) -> str:
+    """
+    Fetch the most relevant document sections and assemble the prompt.
+    """
+    most_relevant_doc_secs = order_document_sections_by_query_similarity(
+        question,
+        context_embeddings
+    )
+
+    chosen_sections = []
+    chosen_sections_len = 0
+    chosen_sections_indexes = []
+
+    for _, section_index in most_relevant_doc_secs:
+        # Add contexts until we run out of space.
+        document_section = df.loc[section_index]
+        chosen_sections_len += document_section.tokens.values[0] + \
+            separator_len
+        if chosen_sections_len > MAX_SECTION_LEN:
+            break
+
+        chosen_sections.append(
+            SEPARATOR +
+            document_section.content.values[0].replace("\n", " ")
+        )
+        chosen_sections_indexes.append(str(section_index))
+
+    # Useful diagnostic information
+    if show_section:
+        print(f"Selected {len(chosen_sections)} document sections:")
+        print("\n".join(chosen_sections_indexes))
+
+    string_list = [str(item) for item in chosen_sections]
+    chosen_sections_str = ''.join(string_list)
+    header = "Answer the question strictly using the provided context," + \
+        " and if the answer is not contained within the text below," + \
+        " say 'Sorry, your inquiry is not in the Wiki. For further" + \
+        " assistance, please contact caNanoLab-Support@ISB-CGC.org' " + \
+        "\n\nContext:\n"
+    prompt = header + chosen_sections_str + "\n\n Q: " + question + "\n A:"
+
+    return prompt, chosen_sections_indexes
+
+
+def answer_query_with_context(
+    query: str,
+    df: pd.DataFrame,
+    document_embeddings: dict[(str, str), np.array],
+    show_prompt: bool = False,
+    show_source: bool = False
+) -> str:
+    prompt, chosen_sections_indexes = construct_prompt(
+        query,
+        document_embeddings,
+        df
+    )
+
+    if show_prompt:
+        print(prompt)
+
+    response = openai.ChatCompletion.create(
+        model="gpt-3.5-turbo",
+        messages=[{
+            "role": "user",
+            "content": prompt
+        }],
+        temperature=0,
+        max_tokens=500
+        # top_p=1,
+        # frequency_penalty=0,
+        # presence_penalty=0
+    )
+    msg = response.choices[0]['message']['content']
+    chosen_sections_indexes = "<br>".join(chosen_sections_indexes)
+
+    return msg, chosen_sections_indexes
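The construct_prompt function in src/embedding_qa.py accumulates the highest-ranked sections until a token budget (MAX_SECTION_LEN) is exhausted. The core of that loop can be sketched with a fixed stand-in separator cost so it runs without tiktoken (the function name and the section data are illustrative, not from the repo):

```python
MAX_SECTION_LEN = 4000
SEPARATOR = "\n* "
SEP_TOKENS = 3  # stand-in for len(encoding.encode(SEPARATOR))

def select_sections(ranked_sections, max_len=MAX_SECTION_LEN):
    """ranked_sections: list of (token_count, content), most relevant first.

    Greedily keep sections until the running token total would
    exceed the budget, mirroring the loop in construct_prompt."""
    chosen, used = [], 0
    for tokens, content in ranked_sections:
        used += tokens + SEP_TOKENS
        if used > max_len:
            break
        chosen.append(SEPARATOR + content.replace("\n", " "))
    return chosen

sections = [(3000, "intro"), (900, "details\nmore"), (500, "extra")]
chosen = select_sections(sections)
# The third section would push the total past 4000 tokens, so only two are kept.
print(len(chosen))  # 2
```

As in the original, the budget check happens after adding a section's count, so the section that overflows the budget is dropped and iteration stops.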
src/gpt_local_config.cfg
ADDED
@@ -0,0 +1,11 @@
+[token]
+GPT_TOKEN =
+[model]
+model_for_fine_tune = davinci
+fine_tune_model_id =
+model_for_chat = gpt-3.5-turbo
+model_for_img_recog =
+[tools]
+data_praperation_script = prepare_data.sh
+[data]
+test = test.csv
src/readme.md
ADDED
@@ -0,0 +1,35 @@
+# OpenAI Embedding and Query Processing
+
+This Python script is designed to interact with OpenAI's GPT-3.5 Turbo model and the OpenAI Embedding API. It provides functionality for creating and loading embeddings, calculating vector similarity, ordering document sections by query similarity, and constructing prompts for the GPT-3.5 Turbo model.
+
+## Key Features
+
+1. **Embedding Creation and Loading**: The script includes functions for creating embeddings for each row in a dataframe using the OpenAI Embedding API (`compute_doc_embeddings`) and for loading embeddings from a CSV file (`load_embeddings`).
+
+2. **Vector Similarity**: The `vector_similarity` function calculates the similarity between two vectors. This is used to compare the embedding of a user's query with the embeddings of document sections.
+
+3. **Document Section Ordering**: The `order_document_sections_by_query_similarity` function compares the embedding of a user's query with the embeddings of document sections and returns a list of document sections sorted by relevance in descending order.
+
+4. **Prompt Construction**: The `construct_prompt` function constructs a prompt for the GPT-3.5 Turbo model based on a user's query and the most relevant document sections.
+
+5. **Query Answering**: The `answer_query_with_context` function uses the GPT-3.5 Turbo model to generate a response to a user's query. It constructs a prompt based on the user's query and the most relevant document sections, sends this prompt to the GPT-3.5 Turbo model, and returns the model's response.
+
+## Usage
+
+To use this script, import it into your Python project and call the functions as needed. For example, you might use the `compute_doc_embeddings` function to create embeddings for your document sections, the `order_document_sections_by_query_similarity` function to order the sections by relevance to a user's query, and the `answer_query_with_context` function to generate a response to the query.
+
+## Dependencies
+
+This script requires the following Python packages (`configparser` and `warnings` are part of the Python standard library):
+
+- OpenAI
+- Tiktoken
+- Numpy
+- Pandas
+
+The third-party packages can be installed using pip:
+
+```bash
+pip install openai tiktoken numpy pandas
+```
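Because ada-002 embeddings are normalized to unit length, ranking sections by cosine similarity reduces to a dot product followed by a sort, which is what `vector_similarity` and `order_document_sections_by_query_similarity` do together. A toy illustration with hand-made unit vectors (real embeddings have 1536 dimensions; the section keys here are made up):

```python
import numpy as np

def vector_similarity(x, y):
    # Embeddings are normalized, so the dot product equals cosine similarity.
    return np.dot(np.array(x), np.array(y))

query = [1.0, 0.0]
contexts = {
    ("Wiki", "Intro"):   [0.6, 0.8],
    ("Wiki", "Samples"): [1.0, 0.0],
    ("Wiki", "FAQ"):     [0.0, 1.0],
}

# Score every section against the query, most similar first.
ranked = sorted(
    ((vector_similarity(query, emb), idx) for idx, emb in contexts.items()),
    reverse=True,
)
print(ranked[0][1])  # ('Wiki', 'Samples') — the exact match ranks first
```

The descending sort on `(score, key)` tuples is exactly the ordering the real function returns; the downstream prompt builder then consumes sections from the front of this list.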
static/caNanoLablogo.jpg
ADDED
templates/index.html
ADDED
@@ -0,0 +1,149 @@
+<!DOCTYPE html>
+<html>
+<head>
+    <title>caNanoWiki AI</title>
+    <style>
+        body {
+            font-family: Arial, sans-serif;
+            background-color: #001f3f; /* Dark blue */
+            color: #fff;
+            margin: 0;
+            padding: 20px;
+        }
+
+        h1 {
+            font-size: 32px;
+            text-align: center;
+            margin-bottom: 40px;
+        }
+
+        .logo {
+            display: block;
+            margin: 0 auto;
+            margin-bottom: 40px;
+            width: 200px;
+        }
+
+        form {
+            text-align: center;
+            margin-bottom: 40px;
+        }
+
+        label {
+            display: block;
+            font-size: 20px;
+            margin-bottom: 10px;
+        }
+
+        input[type="text"] {
+            font-size: 18px;
+            padding: 10px;
+            width: 500px;
+            border-radius: 10px;
+        }
+
+        button {
+            font-size: 18px;
+            padding: 10px 20px;
+            background-color: #00bfff;
+            color: #fff;
+            border: none;
+            border-radius: 10px;
+            cursor: pointer;
+            margin-top: 10px;
+        }
+
+        button:hover {
+            background-color: #0088cc;
+        }
+
+        .result {
+            background-color: #fff;
+            color: #000;
+            padding: 20px;
+            border-radius: 10px;
+            text-align: center;
+            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);
+            margin: 0 auto;
+            max-width: 600px;
+        }
+
+        .loading {
+            font-size: 18px;
+            text-align: center;
+            margin-bottom: 20px;
+        }
+        .logout-button {
+            position: absolute;
+            top: 10px;
+            right: 10px;
+            color: #fff;
+            background-color: #333;
+            padding: 10px;
+            text-decoration: none;
+        }
+
+        .another-box.folded {
+            display: flex;
+            justify-content: center;
+            align-items: center;
+            background-color: #999;
+            color: #fff;
+            border-radius: 10px;
+            padding: 20px;
+            cursor: pointer;
+            /* Additional styling for the folded state */
+        }
+
+        .another-box {
+            background-color: #fff;
+            color: #000;
+            border-radius: 10px;
+            padding: 20px;
+            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);
+            margin-top: 20px;
+            /* Additional styling for the unfolded state */
+        }
+    </style>
+</head>
+<body>
+    <h1>Welcome to the caNanoWiki Personal Digital Assistant</h1>
+
+    <img src="{{ url_for('static', filename='caNanoLablogo.jpg') }}" alt="caNanoWiki AI Logo" class="logo">
+
+    {% if authenticated %}
+    <a class="logout-button" href="{{ url_for('logout') }}">Logout</a>
+    {% endif %}
+
+    <form method="POST" action="/">
+        <label for="user_input">How can I help?</label>
+        <input type="text" id="user_input" name="user_input" value="{{ user_input }}" autofocus>
+        <br>
+        <button type="submit">Search</button>
+    </form>
+
+    {% if processing %}
+    <p class="loading">Working on it...</p>
+    {% endif %}
+
+    {% if processed_input %}
+    <div class="result">
+        <p>{{ processed_input }}</p>
+    </div>
+    <div class="Source-Info" onclick="toggleFoldedState(this)">
+        <h2>Selected Wiki Sources</h2>
+        <p>{{ source_sections | nl2br | safe }}</p>
+    </div>
+    {% else %}
+    <div class="result" style="display: none;"></div>
+    {% endif %}
+
+    <script>
+        function toggleFoldedState(element) {
+            element.classList.toggle('folded');
+        }
+    </script>
+</body>
+</html>
templates/login.html
ADDED
@@ -0,0 +1,60 @@
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Login</title>
+    <style>
+        body {
+            background-color: #11182B;
+            font-family: Arial, sans-serif;
+            color: #FFFFFF;
+            display: flex;
+            align-items: center;
+            justify-content: center;
+            height: 100vh;
+        }
+
+        .login-container {
+            width: 350px;
+            padding: 20px;
+            background-color: #223E6D;
+            border-radius: 10px;
+        }
+
+        .login-container label, .login-container input {
+            display: block;
+            width: 100%;
+            margin-bottom: 10px;
+        }
+
+        .login-container input {
+            padding: 10px;
+            border-radius: 5px;
+        }
+
+        .login-container button {
+            display: block;
+            width: 100%;
+            padding: 10px;
+            border: none;
+            border-radius: 5px;
+            background-color: #2D82B7;
+            color: #FFFFFF;
+            cursor: pointer;
+        }
+
+        .login-container button:hover {
+            background-color: #1B577B;
+        }
+    </style>
+</head>
+<body>
+    <div class="login-container">
+        <h1>Login</h1>
+        <form method="POST" action="/login">
+            <label for="passcode">Please enter the passcode:</label>
+            <input type="password" id="passcode" name="passcode">
+            <button type="submit">Submit</button>
+        </form>
+    </div>
+</body>
+</html>
templates/result.html
ADDED
File without changes