harshinde committed on
Commit
7cc73a9
·
verified ·
1 Parent(s): 4307791

Upload 2 files

Files changed (2)
  1. README.md +125 -14
  2. requirements.txt +9 -0
README.md CHANGED
@@ -1,14 +1,125 @@
- ---
- title: PDF Chatbot With LangChain And Streamlit
- emoji: 🦀
- colorFrom: purple
- colorTo: red
- sdk: streamlit
- sdk_version: 1.40.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: PDF Chatbot with LangChain, Hugging Face, and Streamlit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # PDF Chatbot with LangChain, Hugging Face, and Streamlit
+
+ This project is a chatbot application that enables users to upload multiple PDF files and interact with their content through natural language queries. Using the LangChain library with Hugging Face embeddings and language models, the application extracts and vectorizes PDF content, allowing users to ask questions based on the uploaded documents. The project is deployed using Streamlit and Docker.
+
+ ## Table of Contents
+
+ - [Features](#features)
+ - [Getting Started](#getting-started)
+ - [Project Structure](#project-structure)
+ - [File Descriptions](#file-descriptions)
+ - [Usage](#usage)
+ - [Configuration Options](#configuration-options)
+ - [Troubleshooting](#troubleshooting)
+ - [Dependencies](#dependencies)
+ - [License](#license)
+
+ ## Features
+
+ - **Upload Multiple PDFs**: Easily upload multiple PDF files for processing.
+ - **Chunked Text Splitting**: Text is split into manageable chunks for efficient vectorization and retrieval.
+ - **Document Embedding**: PDF content is embedded with Hugging Face sentence transformers and indexed in FAISS for vector-based similarity search.
+ - **Question Answering**: Using a Hugging Face model, the app generates responses based on the content of the uploaded PDFs.
+ - **User-Friendly Interface**: Built with Streamlit, providing a simple, interactive web interface.
+ - **Dockerized Deployment**: Easily deployable with Docker for consistent environment configuration.
+
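The "Chunked Text Splitting" feature can be illustrated with a short sketch in plain Python. This is not the code from app.py, which this commit does not include. The helper name `split_into_chunks` and the `chunk_size` and `overlap` defaults are assumptions chosen for illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    Hypothetical helper; chunk_size and overlap are assumed values,
    not settings taken from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


pieces = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks, which tends to help retrieval.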
+ ## Getting Started
+
+ ### Prerequisites
+
+ - **Docker**: Install [Docker](https://docs.docker.com/get-docker/).
+ - **Python 3.8 or Higher**: Required to run the application locally or configure the environment.
+
+ ### Installation
+
+ 1. **Clone the Repository**:
+    ```bash
+    git clone https://github.com/your-username/pdf-chatbot.git
+    cd pdf-chatbot
+    ```
+
+ 2. **Set Up Environment Variables**:
+
+    Create a .env file in the root directory with your Hugging Face API token:
+    ```plaintext
+    HUGGINGFACEHUB_API_TOKEN=your_hugging_face_token
+    ```
+
+ 3. **Build and Run the Docker Container**:
+
+    Build and run the application in a Docker container:
+    ```bash
+    docker-compose up --build
+    ```
+
+ 4. **Access the Application**:
+
+    Open your web browser and go to http://localhost:8501 to start interacting with the app.
+
+ ## Project Structure
+
+ - **app.py**: Main Streamlit application that uses Hugging Face API for embeddings and text generation.
+ - **main.py**: Alternative app configuration, using a local PyTorch-compatible pipeline for text generation.
+ - **Dockerfile**: Docker configuration to create a containerized environment for the application.
+ - **docker-compose.yml**: Docker Compose setup to run the application, exposing the Streamlit port.
+ - **requirements.txt**: Lists all required Python libraries.
+
+ ## File Descriptions
+
+ - **app.py**: The primary Streamlit app file, which includes:
+   - PDF file upload handling
+   - Text extraction from PDFs
+   - Document chunking for efficient vectorization
+   - Similarity search and question answering using Hugging Face models
+ - **main.py**: Contains an alternative setup using HuggingFacePipeline for text generation, which may be more suitable if using a GPU locally.
+ - **Dockerfile**: Specifies the Docker environment, installs required dependencies, and sets up the application.
+ - **docker-compose.yml**: Defines Docker services for running the Streamlit app, configures environment variables, and exposes port 8501.
+ - **requirements.txt**: Contains all Python dependencies necessary for the application.
+
+ ## Usage
+
+ 1. **Upload PDF Files**:
+
+    - Click on the "Upload PDF Files" section to upload one or more PDFs.
+    - The uploaded PDFs will be loaded and preprocessed for interaction.
+
+ 2. **Ask a Question**:
+
+    - After uploading, type your question in the input box (e.g., "What is the main topic of this document?").
+    - The app performs a similarity search within the PDF content and generates a response based on your question.
+
+ 3. **Receive Responses**:
+
+    - The application retrieves relevant chunks from the PDFs, generates a response using a language model, and displays the answer.
+
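The similarity-search step described under Usage can be sketched with the standard library alone. The README says the app uses FAISS and sentence-transformers embeddings, so the brute-force cosine ranking below is a stand-in for illustration, not the project's implementation. The helper names and the toy three-dimensional vectors are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Rank chunks by similarity to the query and keep the k best (hypothetical helper)."""
    scored = [(cosine_similarity(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" made up for the example; real sentence embeddings
# have hundreds of dimensions and come from an embedding model.
chunks = ["refund policy", "shipping times", "contact details"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
best = top_k_chunks([1.0, 0.1, 0.0], chunk_vecs, chunks, k=2)
```

The retrieved chunks would then be passed to the language model as context for the answer. A vector index such as FAISS serves the same purpose without comparing the query against every chunk in Python.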
+ ## Configuration Options
+
+ - **Embedding Model**: The default embedding model is all-MiniLM-L6-v2. To use a different Hugging Face embedding model, change the model name in app.py.
+ - **Text Generation Model**: The google/flan-t5-small model is used for text generation because it balances response quality against resource usage. For larger documents or more complex questions, consider a larger model, at the cost of higher memory usage.
+ - **Device Configuration**:
+   - **GPU Support**: If CUDA is available on your device, the application uses it; otherwise, it defaults to CPU. Adjust device settings in main.py as needed.
+   - **Memory Optimization**: To avoid memory issues on limited-resource machines, upload fewer documents at once or use a smaller language model.
+
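The CUDA-or-CPU fallback described under Device Configuration could look roughly like the sketch below. main.py is not part of this commit, so the function name `pick_device` and the `prefer_gpu` flag are assumptions. The only library call used is PyTorch's `torch.cuda.is_available()`.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper; main.py may select the device differently.
    """
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```

Calling `pick_device(prefer_gpu=False)` forces CPU execution, which matches the troubleshooting advice for CUDA out-of-memory errors.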
+ ## Troubleshooting
+
+ - **CUDA Out of Memory**: If you encounter a CUDA OutOfMemoryError, consider:
+   - Using a smaller model (e.g., google/flan-t5-small).
+   - Reducing the number of uploaded PDFs or the chunk size.
+   - Running on CPU by setting device = "cpu" explicitly in main.py.
+ - **Connection Issues**: Ensure your Hugging Face API token in .env is valid and accessible.
+ - **Docker Errors**: If Docker fails to build, make sure all dependencies in requirements.txt are compatible with your environment.
+
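For the connection issues mentioned above, a small startup check can confirm that the token is present before any request is made. This sketch assumes the .env file has already been loaded into the process environment (for example by python-dotenv's `load_dotenv()`). The function name `require_hf_token` is hypothetical, and the check only verifies that a value exists, not that Hugging Face accepts it.

```python
import os
from typing import Mapping, Optional


def require_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token or fail early with a clear message.

    Hypothetical helper; it reads the variable name used in the README's .env example.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGINGFACEHUB_API_TOKEN is not set; add it to .env "
            "or export it before starting the app."
        )
    return token


# Example with an explicit mapping instead of the real environment.
token = require_hf_token({"HUGGINGFACEHUB_API_TOKEN": " example-token "})
```

Failing at startup gives a clearer error than an authentication failure surfacing later from inside an API call.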
+ ## Dependencies
+
+ The project relies on the following libraries:
+
+ - **streamlit**: Provides the web interface for the application.
+ - **langchain**: Integrates language models and document handling.
+ - **faiss-cpu**: Enables fast similarity search and clustering.
+ - **pymupdf**: Extracts text from PDFs.
+ - **requests**: Handles API requests to Hugging Face.
+ - **transformers**: Provides models and tokenizers from Hugging Face.
+ - **sentence-transformers**: Facilitates sentence embedding for similarity search.
+ - **python-dotenv**: Manages environment variables from a .env file.
+ - **langchain-community**: Extends LangChain's functionality for specific integrations.
+
+ Install them with `pip install -r requirements.txt`.
+
+ ## License
+
+ This project is licensed under the MIT License. See the LICENSE file for more information.
+
+ Feel free to contribute to the project by submitting pull requests or reporting issues. Happy chatting with your PDFs!
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ streamlit
+ langchain
+ faiss-cpu
+ pymupdf
+ requests
+ transformers
+ sentence-transformers
+ python-dotenv
+ langchain-community