Enhance project structure and documentation
- Update Dockerfile to include additional application files.
- Revise README.md for improved clarity and detail on application usage and technology stack.
- Expand chainlit.md with usage instructions and examples.
- Implement main.py as the command-line entry point for running the application and updating the vector database.
- Create .env.example for environment variable configuration.
- Add comprehensive bug report and feature request templates.
- Establish Python CI workflow for automated testing and linting.
- Develop CONTRIBUTING.md to guide new contributors.
- Include LICENSE and SECURITY.md for legal and security guidelines.
- .env.example +25 -0
- .github/ISSUE_TEMPLATE/bug_report.md +37 -0
- .github/ISSUE_TEMPLATE/feature_request.md +25 -0
- .github/workflows/python-ci.yml +45 -0
- CONTRIBUTING.md +130 -0
- Dockerfile +3 -1
- LICENSE +21 -0
- README.md +111 -3
- SECURITY.md +35 -0
- chainlit.md +31 -3
- main.py +54 -3
.env.example
ADDED
@@ -0,0 +1,25 @@
+# TheDataGuy Chat Configuration
+# Copy this file to .env and fill in your values
+
+# OpenAI API Key - Required for LLM and embeddings
+OPENAI_API_KEY=your_openai_api_key_here
+
+# Vector Store Configuration
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+QDRANT_COLLECTION=thedataguy_documents
+
+# Model Configuration
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+LLM_MODEL=gpt-4o-mini
+LLM_TEMPERATURE=0
+
+# For evaluation and synthetic data generation (optional)
+SDG_LLM_MODEL=gpt-4.1
+EVAL_LLM_MODEL=gpt-4.1
+
+# Blog Configuration
+DATA_DIR=data/
+BLOG_BASE_URL=https://thedataguy.pro/blog/
+
+# Search Configuration
+MAX_SEARCH_RESULTS=5
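The variables in `.env.example` could be read into a typed settings object along the following lines. This is only a sketch: the `Settings` class and `load_settings` function are hypothetical names introduced here, and the repository's own `lets_talk/config.py` (not shown in this commit) may handle configuration differently. The fallback values are the ones listed in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of a subset of the variables in .env.example (hypothetical helper)."""
    openai_api_key: str
    vector_storage_path: str
    qdrant_collection: str
    embedding_model: str
    llm_model: str
    llm_temperature: float
    max_search_results: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the .env.example values."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # The example file marks this key as required for the LLM and embeddings.
        raise ValueError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        vector_storage_path=env.get("VECTOR_STORAGE_PATH", "./db/vector_store_tdg"),
        qdrant_collection=env.get("QDRANT_COLLECTION", "thedataguy_documents"),
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        # Environment values are strings, so numeric settings are converted here.
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0")),
        max_search_results=int(env.get("MAX_SEARCH_RESULTS", "5")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.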
.github/ISSUE_TEMPLATE/bug_report.md
ADDED
@@ -0,0 +1,37 @@
+---
+name: Bug Report
+about: Create a report to help us improve
+title: '[BUG] '
+labels: bug
+assignees: ''
+---
+
+## Bug Description
+A clear and concise description of what the bug is.
+
+## Steps to Reproduce
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+## Expected Behavior
+A clear and concise description of what you expected to happen.
+
+## Actual Behavior
+What actually happened instead.
+
+## Screenshots
+If applicable, add screenshots to help explain your problem.
+
+## Environment
+- OS: [e.g. Windows, macOS, Linux]
+- Browser: [e.g. Chrome, Safari, Firefox]
+- Version: [e.g. 1.0.0]
+- Python Version: [e.g. 3.13.0]
+
+## Additional Context
+Add any other context about the problem here, such as:
+- Error messages or logs
+- Relevant configuration details
+- Any recent changes that might have caused the issue
.github/ISSUE_TEMPLATE/feature_request.md
ADDED
@@ -0,0 +1,25 @@
+---
+name: Feature Request
+about: Suggest an idea for this project
+title: '[FEATURE] '
+labels: enhancement
+assignees: ''
+---
+
+## Feature Description
+A clear and concise description of the feature you'd like to see implemented.
+
+## Use Case
+Describe the context and use case for this feature. How would it benefit the project and its users?
+
+## Proposed Solution
+If you have ideas about how to implement this feature, describe them here.
+
+## Alternatives Considered
+Have you considered any alternative solutions or features? If so, please describe them.
+
+## Additional Context
+Add any other context, screenshots, or mockups about the feature request here.
+
+## Impact
+How would this feature impact the current functionality? Would it require any changes to existing features?
.github/workflows/python-ci.yml
ADDED
@@ -0,0 +1,45 @@
+name: Python CI
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.13']
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install uv
+        uv init
+        uv sync
+
+    - name: Lint with flake8
+      run: |
+        uv pip install flake8
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+        # exit-zero treats all errors as warnings
+        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+
+    - name: Check if vector store can be built
+      run: |
+        python py-src/pipeline.py --ci --output-dir ./artifacts
+      env:
+        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+        VECTOR_STORAGE_PATH: ./db/vector_store_ci
+        EMBEDDING_MODEL: Snowflake/snowflake-arctic-embed-l
CONTRIBUTING.md
ADDED
@@ -0,0 +1,130 @@
+# Contributing to TheDataGuy Chat
+
+Thank you for your interest in contributing to the TheDataGuy Chat project! This document provides guidelines and instructions for contributing to this repository.
+
+## Project Overview
+
+TheDataGuy Chat is a Q&A chatbot powered by the content from [TheDataGuy blog](https://thedataguy.pro/blog/). It uses RAG (Retrieval Augmented Generation) to provide informative answers about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.
+
+## Development Environment Setup
+
+### Prerequisites
+
+- Python 3.13 or higher
+- [uv](https://github.com/astral-sh/uv) for Python package management
+- Docker (optional, for containerized development)
+- OpenAI API key
+
+### Local Setup
+
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/mafzaal/lets-talk.git
+   cd lets-talk
+   ```
+
+2. Create a `.env` file with the necessary environment variables:
+   ```
+   OPENAI_API_KEY=your_openai_api_key
+   VECTOR_STORAGE_PATH=./db/vector_store_tdg
+   LLM_MODEL=gpt-4o-mini
+   EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+   ```
+
+3. Install dependencies:
+   ```bash
+   uv init && uv sync
+   ```
+
+4. Build the vector store:
+   ```bash
+   ./scripts/build-vector-store.sh
+   ```
+
+5. Run the application:
+   ```bash
+   chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+   ```
+
+### Using Docker
+
+1. Build the Docker image:
+   ```bash
+   docker build -t lets-talk .
+   ```
+
+2. Run the container:
+   ```bash
+   docker run -p 7860:7860 --env-file ./.env lets-talk
+   ```
+
+## Project Structure
+
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   ├── tools.py       # Tool implementations
+│   │   └── utils/         # Utility functions
+│   ├── app.py             # Main application entry point
+│   ├── pipeline.py        # Data processing pipeline
+│   └── notebooks/         # Jupyter notebooks for analysis
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── scripts/               # Utility scripts
+```
+
+## Adding New Blog Posts
+
+When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:
+
+1. Add the markdown content to the `data/` directory in a new folder named after the post slug
+2. Run the vector store update script:
+   ```bash
+   python py-src/pipeline.py --force-recreate
+   ```
+
+## Workflow
+
+1. **Fork** the repository on GitHub
+2. **Clone** your fork to your local machine
+3. Create a new **branch** for your feature or bug fix
+4. Make your changes
+5. Run the tests to ensure everything works
+6. **Commit** your changes with clear, descriptive commit messages
+7. **Push** your branch to your fork on GitHub
+8. Submit a **Pull Request** to the main repository
+
+## Code Style
+
+- Follow PEP 8 style guidelines for Python code
+- Use meaningful variable and function names
+- Add docstrings to all functions and classes
+- Include type hints where appropriate
+
+## Testing
+
+- Write tests for new features and bug fixes
+- Ensure all tests pass before submitting a Pull Request
+- Use the Ragas evaluation framework to test RAG performance
+
+## Documentation
+
+- Update relevant documentation when making changes
+- Add docstrings to all functions, classes, and modules
+- Keep the README and other documentation up to date
+
+## License
+
+By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
+
+## Contact
+
+If you have any questions or need further clarification, please reach out to the project maintainer at [contact form](https://thedataguy.pro/contact/).
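CONTRIBUTING.md says each new post goes into `data/` in a folder named after the post slug. A minimal sketch of how those folders might be enumerated and paired with public URLs follows. The helper name `post_urls` is hypothetical, and the URL pattern (base URL plus slug plus trailing slash) is an assumption rather than something the commit states; the actual `pipeline.py` may derive URLs differently.

```python
from pathlib import Path


def post_urls(data_dir: str, base_url: str) -> dict[str, str]:
    """Map each post slug (a sub-folder of data_dir) to its assumed public URL."""
    base = base_url.rstrip("/")
    # Only directories count as posts; loose files in data_dir are ignored.
    slugs = sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())
    return {slug: f"{base}/{slug}/" for slug in slugs}
```

Called as `post_urls("data/", "https://thedataguy.pro/blog/")`, it returns one entry per post folder, which can serve as a quick check that a newly added post is where the pipeline would look for it.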
Dockerfile
CHANGED
@@ -26,11 +26,13 @@ RUN uv sync
 
 # Copy the app to the container
 COPY --chown=user ./py-src/ $HOME/app
-
+COPY --chown=user ./.chainlit/ $HOME/app
+COPY --chown=user ./chainlit.md $HOME/app
 
 #TODO: Fix this to download
 #copy posts to container
 COPY --chown=user ./data/ $HOME/app/data
+
 # Expose the port
 EXPOSE 7860
 
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Muhammad Afzaal
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -9,7 +9,7 @@ pinned: false
 
 # Welcome to TheDataGuy Chat! 👋
 
-This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topics covered in the blog, such as:
+This is a Q&A chatbot powered by [TheDataGuy blog](https://thedataguy.pro/blog/) blog posts. Ask questions about topics covered in the blog, such as:
 
 - RAGAS and RAG evaluation
 - Building research agents
@@ -21,15 +21,80 @@ This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topi
 Under the hood, this application uses:
 
 1. **Snowflake Arctic Embeddings**: To convert text into vector representations
+   - Base model: `Snowflake/snowflake-arctic-embed-l`
+   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
+
 2. **Qdrant Vector Database**: To store and search for similar content
+   - Efficiently indexes blog post content for fast semantic search
+   - Supports real-time updates when new blog posts are published
+
 3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
+   - Primary model: OpenAI `gpt-4o-mini` for production inference
+   - Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
+
 4. **LangChain**: For building the RAG workflow
+   - Orchestrates the retrieval and generation components
+   - Provides flexible components for LLM application development
+   - Structured for easy maintenance and future enhancements
+
 5. **Chainlit**: For the chat interface
+   - Offers an interactive UI with message threading
+   - Supports file uploads and custom components
+
+## Technology Stack
+
+### Core Components
+- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
+- **Embedding Model**: Snowflake Arctic Embeddings
+- **LLM**: OpenAI GPT-4o-mini
+- **Framework**: LangChain + Chainlit
+- **Development Language**: Python 3.13
+
+### Advanced Features
+- **Evaluation**: Ragas metrics for evaluating RAG performance:
+  - Faithfulness
+  - Context Relevancy
+  - Answer Relevancy
+  - Topic Adherence
+- **Synthetic Data Generation**: For training and testing
+- **Vector Store Updates**: Automated pipeline to update when new blog content is published
+- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
+
+## Project Structure
 
-
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   └── tools.py       # Tool implementations
+│   ├── app.py             # Main application entry point
+│   └── pipeline.py        # Data processing pipeline
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── notebooks/             # Jupyter notebooks for analysis
+```
+
+## Environment Setup
+
+The application requires the following environment variables:
 
-
+```
+OPENAI_API_KEY=your_openai_api_key
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+LLM_MODEL=gpt-4o-mini
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+```
 
+## Running Locally
+
+### Using Docker
 
 ```bash
 docker build -t lets-talk .
@@ -38,4 +103,47 @@ docker run -p 7860:7860 \
 lets-talk
 ```
 
+### Using Python
+
+```bash
+# Install dependencies
+uv init && uv sync
+
+# Run the application
+chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+```
+
+## Deployment
+
+The application is designed to be deployed on:
+
+- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
+- **Production**: Azure Container Apps (planned)
+
+## Evaluation
+
+This project includes extensive evaluation capabilities using the Ragas framework:
+
+- **Synthetic Data Generation**: For creating test datasets
+- **Metric Evaluation**: Measuring faithfulness, relevance, and more
+- **Fine-tuning Analysis**: Comparing different embedding models
+
+## Future Enhancements
+
+- **Agentic Reasoning**: Adding more sophisticated agent capabilities
+- **Web UI Integration**: Custom Svelte component for the blog
+- **CI/CD**: GitHub Actions workflow for automated deployment
+- **Monitoring**: LangSmith integration for observability
+
+## License
+
+This project is available under the MIT License.
+
+## Acknowledgements
+
+- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
+- [Ragas](https://docs.ragas.io/) for evaluation framework
+- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
+- [Chainlit](https://docs.chainlit.io/) for the chat interface
+
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
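The README describes the retrieval step as converting text into vectors and searching the vector store for similar content. The toy sketch below illustrates that idea with plain Python and hand-written vectors. It is a stand-in for illustration only: in the application the vectors would come from the embedding model and the search would be done by Qdrant, and the names `cosine` and `top_k` are hypothetical. The default `k=5` mirrors `MAX_SEARCH_RESULTS=5` from `.env.example`.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], docs: Dict[str, Sequence[float]], k: int = 5) -> List[str]:
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
    return ranked[:k]
```

The retrieved document names (or, in the real system, text chunks) would then be passed to the LLM as context for generating the answer.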
SECURITY.md
ADDED
@@ -0,0 +1,35 @@
+# Security Policy
+
+## Supported Versions
+
+Use this section to tell people about which versions of your project are currently being supported with security updates.
+
+| Version | Supported          |
+| ------- | ------------------ |
+| 0.1.x   | :white_check_mark: |
+
+## Reporting a Vulnerability
+
+We take the security of TheDataGuy Chat seriously. If you believe you've found a security vulnerability, please follow these steps:
+
+1. **Do not** disclose the vulnerability publicly
+2. **Do not** create a public GitHub issue for the vulnerability
+3. Email your findings to [contact form](https://thedataguy.pro/contact/)
+
+Please include the following in your report:
+
+- A description of the vulnerability
+- Steps to reproduce the issue
+- Potential impact of the vulnerability
+- Any potential solutions you've identified
+
+## What to Expect
+
+When you report a vulnerability:
+
+- You'll receive acknowledgment of your report within 48 hours
+- We'll investigate and provide an estimated timeline for a fix
+- We'll keep you updated as we work on resolving the issue
+- Once fixed, we'll publicly acknowledge your responsible disclosure (unless you prefer to remain anonymous)
+
+Thank you for helping to keep TheDataGuy Chat and its users safe!
chainlit.md
CHANGED
@@ -1,6 +1,34 @@
-#
+# Welcome to TheDataGuy Chat! 👋
 
-
+## About
 
-
+This chat application allows you to ask questions about topics covered in [TheDataGuy](https://thedataguy.pro)'s blog, including:
+
+- **RAGAS**: Evaluation frameworks for LLM applications
+- **Research Agents**: Building and evaluating AI agents
+- **Metric-Driven Development**: Data-centric approaches to development
+- **RAG Systems**: Retrieval Augmented Generation techniques
+- **Data Science Best Practices**: Strategies for effective data work
+
+## How To Use
+
+1. **Ask a question** related to any topic covered in the blog
+2. The system will **search for relevant content** from the blog posts
+3. You'll receive an **informative response** with links to the original articles
+
+## Examples
+
+Try asking questions like:
+- "What is RAGAS and how does it help evaluate LLM applications?"
+- "How can I build a research agent with RSS feed support?"
+- "What are the key principles of metric-driven development?"
+- "How do I evaluate RAG systems effectively?"
+
+## Under The Hood
+
+This application uses Snowflake Arctic Embeddings, Qdrant Vector Database, LangChain, and GPT-4o-mini to provide accurate and helpful responses based on blog content.
+
+For more details, check out the [GitHub repository](https://github.com/mafzaal/lets-talk).
+
+Happy chatting! 💬
 
main.py
CHANGED
@@ -1,9 +1,60 @@
 
 
 
+#!/usr/bin/env python3
+"""
+TheDataGuy Chat - Main Entry Point
+
+This script serves as the main entry point for the TheDataGuy Chat application.
+It provides a command-line interface to run the app and update the vector database.
+"""
+
+import os
+import sys
+import argparse
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv()
+
 def main():
-    """Main function to update blog data"""
-
+    """Main function to run the application or update blog data"""
+    parser = argparse.ArgumentParser(description="TheDataGuy Chat - RAG-powered blog assistant")
+
+    # Define commands
+    subparsers = parser.add_subparsers(dest="command", help="Command to run")
+
+    # Run app command
+    run_parser = subparsers.add_parser("run", help="Run the chat application")
+    run_parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
+    run_parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
+
+    # Update vector store command
+    update_parser = subparsers.add_parser("update", help="Update the vector database")
+    update_parser.add_argument("--force", action="store_true", help="Force recreation of the vector store")
+
+    # Parse arguments
+    args = parser.parse_args()
+
+    # Handle commands
+    if args.command == "run":
+        # Import here to avoid circular imports
+        import chainlit as cl
+        os.system(f"chainlit run py-src/app.py --host {args.host} --port {args.port}")
+
+    elif args.command == "update":
+        # Import here to avoid loading heavy dependencies if not needed
+        from py_src.pipeline import create_vector_database
+        force_flag = "--force-recreate" if args.force else ""
+        print(f"Updating vector database (force={args.force})")
+        create_vector_database(force_recreate=args.force)
+
+    else:
+        # Show help if no command provided
+        parser.print_help()
+        return 1
+
+    return 0
 
 if __name__ == "__main__":
-    main()
+    sys.exit(main())
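As a possible alternative to the `os.system` call and the `py_src` import in the `main.py` diff above, the two subcommands could be translated into argument lists and launched with `subprocess.run`. This is a sketch under stated assumptions, not the committed implementation: `build_parser` and `build_command` are hypothetical helper names, and the `update` branch assumes that running `python py-src/pipeline.py --force-recreate` (the command shown in CONTRIBUTING.md) is an acceptable substitute for importing `create_vector_database` directly.

```python
import argparse
import subprocess
import sys
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Create the same run/update subcommands as the main.py in this commit."""
    parser = argparse.ArgumentParser(description="TheDataGuy Chat - RAG-powered blog assistant")
    subparsers = parser.add_subparsers(dest="command", help="Command to run")
    run_parser = subparsers.add_parser("run", help="Run the chat application")
    run_parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    run_parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
    update_parser = subparsers.add_parser("update", help="Update the vector database")
    update_parser.add_argument("--force", action="store_true", help="Force recreation of the vector store")
    return parser


def build_command(args: argparse.Namespace) -> List[str]:
    """Translate parsed arguments into an argv list; empty if no command was given."""
    if args.command == "run":
        return ["chainlit", "run", "py-src/app.py", "--host", args.host, "--port", str(args.port)]
    if args.command == "update":
        # Assumption: invoke the pipeline script rather than importing from py-src.
        cmd = [sys.executable, "py-src/pipeline.py"]
        if args.force:
            cmd.append("--force-recreate")
        return cmd
    return []


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments, run the matching command, and return its exit code."""
    parser = build_parser()
    args = parser.parse_args(argv)
    cmd = build_command(args)
    if not cmd:
        parser.print_help()
        return 1
    # A list argument avoids shell interpolation of --host and --port values.
    return subprocess.run(cmd).returncode
```

Keeping command construction in `build_command` separates it from process launching, so the mapping from flags to commands can be checked without starting Chainlit. A script using this sketch would end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard.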