---
title: Lets Talk
emoji: 🐨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat! πŸ‘‹

This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered in the blog, such as:

- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses:

1. **Snowflake Arctic Embeddings**: To convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)

2. **Qdrant Vector Database**: To store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published

3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation

4. **LangChain**: For building the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements

5. **Chainlit**: For the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components
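The retrieve-then-generate flow above can be sketched end to end. The toy below is purely illustrative and is not the app's code: it swaps the Arctic embeddings for a bag-of-words vectorizer and stops at prompt assembly instead of calling an LLM, but the shape of the pipeline (embed query → top-k similarity search → prompt with retrieved context) is the same.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt the generator model would receive."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "RAGAS provides metrics for evaluating RAG pipelines.",
    "Chainlit builds chat interfaces for LLM apps.",
    "Qdrant is a vector database for semantic search.",
]
prompt = build_prompt("How do I evaluate a RAG pipeline?",
                      retrieve("evaluate RAG pipeline metrics", docs))
print(prompt)
```

In the real application, `embed` is the Arctic model, `retrieve` is a Qdrant search, and the assembled prompt goes to `gpt-4o-mini`, all orchestrated by LangChain.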

## Technology Stack

### Core Components
- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features
- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
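To make the metrics concrete, faithfulness roughly measures the fraction of claims in an answer that are supported by the retrieved context. The sketch below is a deliberately simplified stand-in (substring matching instead of Ragas's LLM-based claim extraction and verification), shown only to illustrate the shape of the score:

```python
def toy_faithfulness(answer_claims: list[str], context: str) -> float:
    """Fraction of answer claims found in the retrieved context.

    Ragas uses an LLM to extract and verify claims; this toy version
    uses exact substring matching purely for illustration.
    """
    if not answer_claims:
        return 1.0  # an empty answer makes no unsupported claims
    supported = sum(1 for claim in answer_claims
                    if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "Qdrant stores embeddings. GPT-4o-mini generates the answers."
claims = [
    "Qdrant stores embeddings",
    "GPT-4o-mini generates the answers",
    "Redis caches results",  # unsupported claim -> lowers the score
]
print(toy_faithfulness(claims, context))  # 2 of 3 claims supported
```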

## Project Structure

```
lets-talk/
β”œβ”€β”€ data/                  # Raw blog post content
β”œβ”€β”€ py-src/                # Python source code
β”‚   β”œβ”€β”€ lets_talk/         # Core application modules
β”‚   β”‚   β”œβ”€β”€ agent.py       # Agent implementation
β”‚   β”‚   β”œβ”€β”€ config.py      # Configuration settings
β”‚   β”‚   β”œβ”€β”€ models.py      # Data models
β”‚   β”‚   β”œβ”€β”€ prompts.py     # LLM prompt templates
β”‚   β”‚   β”œβ”€β”€ rag.py         # RAG implementation
β”‚   β”‚   β”œβ”€β”€ rss_tool.py    # RSS feed integration
β”‚   β”‚   └── tools.py       # Tool implementations
β”‚   β”œβ”€β”€ app.py             # Main application entry point
β”‚   └── pipeline.py        # Data processing pipeline
β”œβ”€β”€ db/                    # Vector database storage
β”œβ”€β”€ evals/                 # Evaluation datasets and results
└── notebooks/             # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
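`CHUNK_SIZE` and `CHUNK_OVERLAP` control how blog posts are split before embedding. A minimal sliding-window chunker (a simplification of the text splitters LangChain provides, shown here just to illustrate what the two parameters mean):

```python
def chunk_text(text: str, chunk_size: int = 1000,
               chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    The overlap keeps sentences that straddle a boundary visible in
    both neighbouring chunks, which helps retrieval.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

chunks = chunk_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```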

Additional configuration options for vector database creation:

```
# Vector Database Creation Configuration
FORCE_RECREATE=False      # Whether to force recreation of the vector store
OUTPUT_DIR=./stats        # Directory to save stats and artifacts
USE_CHUNKING=True         # Whether to split documents into chunks
SHOULD_SAVE_STATS=True    # Whether to save statistics about the documents
```
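These flags arrive from the environment as strings, and `bool("False")` is truthy in Python, so they need explicit parsing. A minimal sketch of one way to do it (the helper name here is illustrative, not taken from `config.py`):

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse an environment flag; 'false'/'0'/'no' (any case) are False."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

os.environ["FORCE_RECREATE"] = "False"
print(env_bool("FORCE_RECREATE"))      # False, despite being a non-empty string
print(env_bool("MISSING_FLAG", True))  # True: unset, falls back to the default
```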

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
    --env-file ./.env \
    lets-talk
```

### Using Python

```bash
# Install dependencies
uv sync

# Run the application
uv run chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities using the Ragas framework:

- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models

## Future Enhancements

- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface

Check out the [Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for the frontmatter options used above.