# Contributing to TheDataGuy Chat
Thank you for your interest in contributing to TheDataGuy Chat! This document provides guidelines and instructions for contributing to this repository.
## Project Overview
TheDataGuy Chat is a Q&A chatbot powered by content from [TheDataGuy blog](https://thedataguy.pro/blog/). It uses RAG (Retrieval-Augmented Generation) to answer questions about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.
## Development Environment Setup
### Prerequisites
- Python 3.13 or higher
- [uv](https://github.com/astral-sh/uv) for Python package management
- Docker (optional, for containerized development)
- OpenAI API key
### Local Setup
1. Clone the repository:
```bash
git clone https://github.com/mafzaal/lets-talk.git
cd lets-talk
```
2. Create a `.env` file with the necessary environment variables:
```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
# Vector Database Creation Configuration (optional)
# Comments sit on their own lines because `docker run --env-file` does not strip inline comments.
# Whether to force recreation of the vector store
FORCE_RECREATE=False
# Directory to save stats and artifacts
OUTPUT_DIR=./stats
# Whether to split documents into chunks
USE_CHUNKING=True
# Whether to save statistics about the documents
SHOULD_SAVE_STATS=True
```
3. Install dependencies:
```bash
uv sync
```
4. Build the vector store:
```bash
./scripts/build-vector-store.sh
```
5. Run the application:
```bash
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
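The variables from step 2 reach the application as plain strings, so the boolean flags need explicit parsing. The following is a minimal sketch of how they might be read, using the sample values from the `.env` file as fallbacks. `Settings`, `parse_bool`, and `load_settings` are hypothetical names for illustration and are not taken from the project's `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'True', 'true', '1', or 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in the .env file."""

    vector_storage_path: str
    llm_model: str
    embedding_model: str
    force_recreate: bool
    use_chunking: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to the sample values."""
    return Settings(
        vector_storage_path=env.get("VECTOR_STORAGE_PATH", "./db/vector_store_tdg"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        force_recreate=parse_bool(env.get("FORCE_RECREATE", "False")),
        use_chunking=parse_bool(env.get("USE_CHUNKING", "True")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.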
### Using Docker
1. Build the Docker image:
```bash
docker build -t lets-talk .
```
2. Run the container:
```bash
docker run -p 7860:7860 --env-file ./.env lets-talk
```
## Project Structure
```
lets-talk/
├── data/                  # Raw blog post content
├── py-src/                # Python source code
│   ├── lets_talk/         # Core application modules
│   │   ├── agent.py       # Agent implementation
│   │   ├── config.py      # Configuration settings
│   │   ├── models.py      # Data models
│   │   ├── prompts.py     # LLM prompt templates
│   │   ├── rag.py         # RAG implementation
│   │   ├── rss_tool.py    # RSS feed integration
│   │   ├── tools.py       # Tool implementations
│   │   └── utils/         # Utility functions
│   ├── app.py             # Main application entry point
│   ├── pipeline.py        # Data processing pipeline
│   └── notebooks/         # Jupyter notebooks for analysis
├── db/                    # Vector database storage
├── evals/                 # Evaluation datasets and results
└── scripts/               # Utility scripts
```
## Adding New Blog Posts
When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:
1. Add the markdown content to the `data/` directory in a new folder named after the post slug
2. Rebuild the vector store so that it includes the new post:
```bash
python py-src/pipeline.py --force-recreate
```
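Step 1 can be scripted. The sketch below creates the per-post folder under `data/` and writes the markdown into it. The `slugify` and `add_post` helpers and the `index.md` file name are assumptions made for illustration; check an existing post folder for the layout the pipeline actually expects.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def add_post(data_dir: Path, title: str, markdown: str) -> Path:
    """Write the post to data_dir/<slug>/index.md and return the file path."""
    post_dir = data_dir / slugify(title)
    post_dir.mkdir(parents=True, exist_ok=True)
    post_file = post_dir / "index.md"  # assumed file name, not confirmed by the project
    post_file.write_text(markdown, encoding="utf-8")
    return post_file
```

For example, `add_post(Path("data"), "Metric-Driven Development", content)` would place the file in `data/metric-driven-development/`.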
## Workflow
1. **Fork** the repository on GitHub
2. **Clone** your fork to your local machine
3. Create a new **branch** for your feature or bug fix
4. Make your changes
5. Run the tests to ensure everything works
6. **Commit** your changes with clear, descriptive commit messages
7. **Push** your branch to your fork on GitHub
8. Submit a **Pull Request** to the main repository
## Code Style
- Follow PEP 8 style guidelines for Python code
- Use meaningful variable and function names
- Add docstrings to all functions and classes
- Include type hints where appropriate
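
As an illustration of these guidelines, the following hypothetical helper (not part of the codebase) combines a descriptive name, type hints, and a docstring:

```python
def split_into_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks.

    Args:
        text: The text to split.
        chunk_size: Maximum number of characters per chunk; must be positive.
        overlap: Number of characters shared by consecutive chunks; must be
            non-negative and smaller than chunk_size.

    Returns:
        The chunks in order. An empty string yields an empty list.

    Raises:
        ValueError: If chunk_size or overlap is out of range.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```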
## Testing
- Write tests for new features and bug fixes
- Ensure all tests pass before submitting a Pull Request
- Use the Ragas evaluation framework to test RAG performance
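
A minimal sketch of what a unit test accompanying a bug fix might look like. This document does not name a test runner, so the tests are written as plain `assert`-based functions, a form that runners such as pytest can collect. `normalize_whitespace` is a hypothetical helper defined inline for illustration, not a function from `lets_talk`.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def test_collapses_internal_runs() -> None:
    """Tabs and newlines inside the text become a single space."""
    assert normalize_whitespace("RAG \n\t evaluation") == "RAG evaluation"


def test_trims_leading_and_trailing_whitespace() -> None:
    """Whitespace at either end is removed."""
    assert normalize_whitespace("  hello  ") == "hello"


def test_empty_input_stays_empty() -> None:
    """An empty string is returned unchanged."""
    assert normalize_whitespace("") == ""
```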
## Documentation
- Update relevant documentation when making changes
- Add docstrings to all functions, classes, and modules
- Keep the README and other documentation up to date
## License
By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
## Contact
If you have any questions or need further clarification, please reach out to the project maintainer through the [contact form](https://thedataguy.pro/contact/).