---
title: Lets Talk
emoji: 🐨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---
# Welcome to TheDataGuy Chat! πŸ‘‹
This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered in the blog, such as:
- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices
## How it works
Under the hood, this application uses:
1. **Snowflake Arctic Embeddings**: To convert text into vector representations
- Base model: `Snowflake/snowflake-arctic-embed-l`
- Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
2. **Qdrant Vector Database**: To store and search for similar content
- Efficiently indexes blog post content for fast semantic search
- Supports real-time updates when new blog posts are published
3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
- Primary model: OpenAI `gpt-4o-mini` for production inference
- Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
4. **LangChain**: For building the RAG workflow
- Orchestrates the retrieval and generation components
- Provides flexible components for LLM application development
- Structured for easy maintenance and future enhancements
5. **Chainlit**: For the chat interface
- Offers an interactive UI with message threading
- Supports file uploads and custom components
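The embed β†’ search β†’ generate flow above can be sketched with a toy in-memory retriever. The vectors and chunk ids below are made up for illustration; in the real app the Snowflake Arctic model produces the embeddings and Qdrant performs the similarity search:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float],
             store: dict[str, list[float]],
             k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda cid: cosine_similarity(query_vec, store[cid]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" for three hypothetical blog chunks.
store = {
    "ragas-intro": [0.9, 0.1, 0.0],
    "research-agents": [0.1, 0.9, 0.1],
    "metric-driven-dev": [0.0, 0.2, 0.9],
}
print(retrieve([0.8, 0.2, 0.1], store, k=2))  # β†’ ['ragas-intro', 'research-agents']
```

The retrieved chunks are then stuffed into the LLM prompt, which is the generation half of the RAG loop that LangChain orchestrates.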
## Technology Stack
### Core Components
- **Vector Database**: Qdrant (populated with embeddings by `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13
### Advanced Features
- **Evaluation**: Ragas metrics for evaluating RAG performance:
- Faithfulness
- Context Relevancy
- Answer Relevancy
- Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
## Project Structure
```
lets-talk/
β”œβ”€β”€ data/                  # Raw blog post content
β”œβ”€β”€ py-src/                # Python source code
β”‚   β”œβ”€β”€ lets_talk/         # Core application modules
β”‚   β”‚   β”œβ”€β”€ agent.py       # Agent implementation
β”‚   β”‚   β”œβ”€β”€ config.py      # Configuration settings
β”‚   β”‚   β”œβ”€β”€ models.py      # Data models
β”‚   β”‚   β”œβ”€β”€ prompts.py     # LLM prompt templates
β”‚   β”‚   β”œβ”€β”€ rag.py         # RAG implementation
β”‚   β”‚   β”œβ”€β”€ rss_tool.py    # RSS feed integration
β”‚   β”‚   └── tools.py       # Tool implementations
β”‚   β”œβ”€β”€ app.py             # Main application entry point
β”‚   └── pipeline.py        # Data processing pipeline
β”œβ”€β”€ db/                    # Vector database storage
β”œβ”€β”€ evals/                 # Evaluation datasets and results
└── notebooks/             # Jupyter notebooks for analysis
```
## Environment Setup
The application requires the following environment variables:
```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
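`CHUNK_SIZE` and `CHUNK_OVERLAP` control how blog posts are split before embedding: each chunk is at most `CHUNK_SIZE` characters, and consecutive chunks share `CHUNK_OVERLAP` characters so that context spanning a boundary is not lost. The real pipeline presumably uses a LangChain text splitter; this is a minimal character-window sketch of the same idea:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size chars that overlap by chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window reached the end
            break
    return chunks

# Small numbers to make the overlap visible:
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# β†’ ['abcd', 'cdef', 'efgh', 'ghij']
```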
Additional configuration options for vector database creation:
```
# Vector Database Creation Configuration
FORCE_RECREATE=False # Whether to force recreation of the vector store
OUTPUT_DIR=./stats # Directory to save stats and artifacts
USE_CHUNKING=True # Whether to split documents into chunks
SHOULD_SAVE_STATS=True # Whether to save statistics about the documents
```
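One common way to consume these variables is to read them once at import time with fallbacks matching the defaults above. This is a hypothetical sketch in the spirit of `config.py` (the names come from this README, not from the actual module); note that boolean flags arrive as strings and need explicit parsing:

```python
import os

def env_bool(name: str, default: bool) -> bool:
    """Parse a boolean flag like FORCE_RECREATE=False from the environment."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.environ.get("CHUNK_OVERLAP", "200"))
FORCE_RECREATE = env_bool("FORCE_RECREATE", False)
USE_CHUNKING = env_bool("USE_CHUNKING", True)
SHOULD_SAVE_STATS = env_bool("SHOULD_SAVE_STATS", True)
```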
## Running Locally
### Using Docker
```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
    --env-file ./.env \
    lets-talk
```
### Using Python
```bash
# Install dependencies (the project already defines them, so a sync is enough)
uv sync
# Run the application
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
## Deployment
The application is designed to be deployed on:
- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)
## Evaluation
This project includes extensive evaluation capabilities using the Ragas framework:
- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models
## Future Enhancements
- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability
## License
This project is available under the MIT License.
## Acknowledgements
- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference