# Contributing to TheDataGuy Chat

Thank you for your interest in contributing to TheDataGuy Chat! This document provides guidelines and instructions for contributing to this repository.

## Project Overview

TheDataGuy Chat is a Q&A chatbot powered by the content from [TheDataGuy blog](https://thedataguy.pro/blog/). It uses RAG (Retrieval Augmented Generation) to provide informative answers about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.

## Development Environment Setup

### Prerequisites

- Python 3.13 or higher
- [uv](https://github.com/astral-sh/uv) for Python package management
- Docker (optional, for containerized development)
- OpenAI API key

### Local Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/mafzaal/lets-talk.git
   cd lets-talk
   ```

2. Create a `.env` file with the necessary environment variables:
   ```
   OPENAI_API_KEY=your_openai_api_key
   VECTOR_STORAGE_PATH=./db/vector_store_tdg
   LLM_MODEL=gpt-4o-mini
   EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
   
   # Vector Database Creation Configuration (optional)
   FORCE_RECREATE=False      # Whether to force recreation of the vector store
   OUTPUT_DIR=./stats        # Directory to save stats and artifacts
   USE_CHUNKING=True         # Whether to split documents into chunks
   SHOULD_SAVE_STATS=True    # Whether to save statistics about the documents
   ```

3. Install dependencies:
   ```bash
   uv sync
   ```

4. Build the vector store:
   ```bash
   ./scripts/build-vector-store.sh
   ```

5. Run the application:
   ```bash
   chainlit run py-src/app.py --host 0.0.0.0 --port 7860
   ```
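The boolean flags in the `.env` example above (`FORCE_RECREATE`, `USE_CHUNKING`, `SHOULD_SAVE_STATS`) arrive as plain strings and must be parsed. The sketch below is a hypothetical illustration of that parsing; the project's actual logic lives in `py-src/lets_talk/config.py` and the real variable handling may differ:

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable such as FORCE_RECREATE."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# Hypothetical usage mirroring the .env example; the real names and
# defaults are defined in py-src/lets_talk/config.py.
FORCE_RECREATE = env_bool("FORCE_RECREATE", default=False)
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./stats")
```

Values like `True`/`False` in `.env` files are case-insensitive under this scheme, which avoids surprises when the file is edited by hand.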

### Using Docker

1. Build the Docker image:
   ```bash
   docker build -t lets-talk .
   ```

2. Run the container:
   ```bash
   docker run -p 7860:7860 --env-file ./.env lets-talk
   ```

## Project Structure

```
lets-talk/
β”œβ”€β”€ data/                  # Raw blog post content
β”œβ”€β”€ py-src/                # Python source code
β”‚   β”œβ”€β”€ lets_talk/         # Core application modules
β”‚   β”‚   β”œβ”€β”€ agent.py       # Agent implementation
β”‚   β”‚   β”œβ”€β”€ config.py      # Configuration settings
β”‚   β”‚   β”œβ”€β”€ models.py      # Data models
β”‚   β”‚   β”œβ”€β”€ prompts.py     # LLM prompt templates
β”‚   β”‚   β”œβ”€β”€ rag.py         # RAG implementation
β”‚   β”‚   β”œβ”€β”€ rss_tool.py    # RSS feed integration
β”‚   β”‚   β”œβ”€β”€ tools.py       # Tool implementations
β”‚   β”‚   └── utils/         # Utility functions
β”‚   β”œβ”€β”€ app.py             # Main application entry point
β”‚   β”œβ”€β”€ pipeline.py        # Data processing pipeline
β”‚   └── notebooks/         # Jupyter notebooks for analysis
β”œβ”€β”€ db/                    # Vector database storage
β”œβ”€β”€ evals/                 # Evaluation datasets and results
└── scripts/               # Utility scripts
```

## Adding New Blog Posts

When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:

1. Add the markdown content to the `data/` directory in a new folder named after the post slug
2. Run the vector store update script:
   ```bash
   python py-src/pipeline.py --force-recreate
   ```

## Workflow

1. **Fork** the repository on GitHub
2. **Clone** your fork to your local machine
3. Create a new **branch** for your feature or bug fix
4. Make your changes
5. Run the tests to ensure everything works
6. **Commit** your changes with clear, descriptive commit messages
7. **Push** your branch to your fork on GitHub
8. Submit a **Pull Request** to the main repository

## Code Style

- Follow PEP 8 style guidelines for Python code
- Use meaningful variable and function names
- Add docstrings to all functions and classes
- Include type hints where appropriate
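As a point of reference, a function following these conventions might look like the following. `format_context` is a made-up example for illustration, not code from the project:

```python
def format_context(documents: list[str], separator: str = "\n\n") -> str:
    """Join retrieved documents into a single prompt context string.

    Args:
        documents: Retrieved document snippets, most relevant first.
        separator: Text placed between consecutive snippets.

    Returns:
        The concatenated context, ready to interpolate into a prompt.
    """
    # Strip stray whitespace from each snippet before joining.
    return separator.join(doc.strip() for doc in documents)
```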

## Testing

- Write tests for new features and bug fixes
- Ensure all tests pass before submitting a Pull Request
- Use the Ragas evaluation framework to test RAG performance
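A minimal pytest-style test might look like the sketch below. The function under test is a stand-in; in a real test you would import the utilities you changed from `lets_talk` instead of defining them inline:

```python
# test_chunking.py: a minimal pytest-style example.

def chunk_text(text: str, size: int) -> list[str]:
    """Stand-in utility: split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def test_chunk_text_covers_all_characters() -> None:
    text = "retrieval augmented generation"
    chunks = chunk_text(text, size=7)
    # No characters lost or reordered, and no chunk exceeds the limit.
    assert "".join(chunks) == text
    assert all(len(chunk) <= 7 for chunk in chunks)
```

Run the suite with `pytest` from the repository root; pytest discovers files named `test_*.py` automatically.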

## Documentation

- Update relevant documentation when making changes
- Add docstrings to all functions, classes, and modules
- Keep the README and other documentation up to date

## License

By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).

## Contact

If you have any questions or need further clarification, please reach out to the project maintainer via the [contact form](https://thedataguy.pro/contact/).