---
title: Lets Talk
emoji: 🎨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---
# Welcome to TheDataGuy Chat! 👋
This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered in the blog, such as:
- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices
## How it works
Under the hood, this application uses the following components (a code sketch follows the list):
1. **Snowflake Arctic Embeddings**: To convert text into vector representations
- Base model: `Snowflake/snowflake-arctic-embed-l`
- Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
2. **Qdrant Vector Database**: To store and search for similar content
- Efficiently indexes blog post content for fast semantic search
- Supports real-time updates when new blog posts are published
3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
- Primary model: OpenAI `gpt-4o-mini` for production inference
- Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
4. **LangChain**: For building the RAG workflow
- Orchestrates the retrieval and generation components
- Provides flexible components for LLM application development
- Structured for easy maintenance and future enhancements
5. **Chainlit**: For the chat interface
- Offers an interactive UI with message threading
- Supports file uploads and custom components
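Stitched together, the retrieval and generation path looks roughly like the LCEL sketch below. The collection name, store path, and prompt wording are illustrative assumptions, not the project's actual code (which lives in `rag.py`):
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore

# Fine-tuned Arctic embeddings for queries and documents
embeddings = HuggingFaceEmbeddings(model_name="mafzaal/thedataguy_arctic_ft")

vector_store = QdrantVectorStore.from_existing_collection(
    collection_name="blog_posts",      # assumed collection name
    embedding=embeddings,
    path="./db/vector_store_tdg",      # local on-disk Qdrant store
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    """Join retrieved chunks into one context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")

# Retrieval feeds the prompt's context; the question passes through unchanged
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("What is metric-driven development?"))
```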
## Technology Stack
### Core Components
- **Vector Database**: Qdrant (populated with embeddings by `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13
### Advanced Features
- **Evaluation**: Ragas metrics for evaluating RAG performance (see the evaluation sketch after this list):
- Faithfulness
- Context Relevancy
- Answer Relevancy
- Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
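As a rough illustration of the evaluation step, a minimal run with the classic Ragas v0.1-style API might look like this. The sample row is made up; the real datasets live under `evals/`, and `OPENAI_API_KEY` must be set for the judge model:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy single-row dataset with the columns Ragas expects
eval_data = Dataset.from_dict({
    "question": ["What is metric-driven development?"],
    "answer": ["Letting quantitative metrics guide each iteration."],
    "contexts": [[
        "Metric-driven development means choosing metrics up front "
        "and letting them steer design decisions."
    ]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores, e.g. faithfulness and answer relevancy
```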
## Project Structure
```
lets-talk/
├── data/                  # Raw blog post content
├── py-src/                # Python source code
│   ├── lets_talk/         # Core application modules
│   │   ├── agent.py       # Agent implementation
│   │   ├── config.py      # Configuration settings
│   │   ├── models.py      # Data models
│   │   ├── prompts.py     # LLM prompt templates
│   │   ├── rag.py         # RAG implementation
│   │   ├── rss_tool.py    # RSS feed integration
│   │   └── tools.py       # Tool implementations
│   ├── app.py             # Main application entry point
│   └── pipeline.py        # Data processing pipeline
├── db/                    # Vector database storage
├── evals/                 # Evaluation datasets and results
└── notebooks/             # Jupyter notebooks for analysis
```
## Environment Setup
The application requires the following environment variables:
```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
Additional configuration options for vector database creation:
```
# Vector Database Creation Configuration
FORCE_RECREATE=False # Whether to force recreation of the vector store
OUTPUT_DIR=./stats # Directory to save stats and artifacts
USE_CHUNKING=True # Whether to split documents into chunks
SHOULD_SAVE_STATS=True # Whether to save statistics about the documents
```
## Running Locally
### Using Docker
```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
--env-file ./.env \
lets-talk
```
### Using Python
```bash
# Install dependencies (uv creates and manages the virtual environment)
uv sync
# Run the application
uv run chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
## Deployment
The application is designed to be deployed on:
- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)
## Evaluation
This project includes extensive evaluation capabilities using the Ragas framework:
- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models
## Future Enhancements
- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability
## License
This project is available under the MIT License.
## Acknowledgements
- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for the evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for the RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference