---
title: Lets Talk
emoji: 🎨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---
# Welcome to TheDataGuy Chat! 👋
This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topics covered in the blog, such as:
- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices
## How it works
Under the hood, this application uses:
- **Snowflake Arctic Embeddings**: To convert text into vector representations
  - Base model: `Snowflake/snowflake-arctic-embed-l`
  - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
- **Qdrant Vector Database**: To store and search for similar content
  - Efficiently indexes blog post content for fast semantic search
  - Supports real-time updates when new blog posts are published
- **GPT-4o-mini**: To generate helpful responses based on retrieved content
  - Primary model: OpenAI `gpt-4o-mini` for production inference
  - Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
- **LangChain**: For building the RAG workflow
  - Orchestrates the retrieval and generation components
  - Provides flexible components for LLM application development
  - Structured for easy maintenance and future enhancements
- **Chainlit**: For the chat interface
  - Offers an interactive UI with message threading
  - Supports file uploads and custom components
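The retrieve-then-generate loop these components implement can be sketched in plain Python. This is a hypothetical illustration with toy three-dimensional embeddings, not the code in `rag.py`; in the real app, vectors come from Snowflake Arctic and similarity search runs in Qdrant:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunks whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:top_k]]

def build_prompt(question, contexts):
    """Assemble the grounded prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

# Toy stand-in for the vector store.
index = [
    {"text": "RAGAS measures faithfulness of RAG answers.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Chainlit renders the chat UI.",               "vec": [0.0, 0.2, 0.9]},
    {"text": "Metric-driven development guides iteration.", "vec": [0.7, 0.6, 0.1]},
]

contexts = retrieve([1.0, 0.2, 0.0], index, top_k=2)
prompt = build_prompt("What does RAGAS measure?", contexts)
```

The key design point is that the LLM only ever sees retrieved context plus the question, which is what makes metrics like faithfulness measurable.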
## Technology Stack

### Core Components

- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13
### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
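The chunking step behind the vector store pipeline can be illustrated as a sliding window in pure Python. This is a conceptual sketch only; the actual pipeline presumably uses a LangChain text splitter driven by the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings below:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous one, so that context
    spanning a boundary is not lost."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 2500)  # 2500 chars -> windows of 1000 with 200 overlap
```

Overlap trades a little index size for better retrieval quality at chunk boundaries, which is why both knobs are exposed as environment variables.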
## Project Structure

```
lets-talk/
├── data/                    # Raw blog post content
├── py-src/                  # Python source code
│   ├── lets_talk/           # Core application modules
│   │   ├── agent.py         # Agent implementation
│   │   ├── config.py        # Configuration settings
│   │   ├── models.py        # Data models
│   │   ├── prompts.py       # LLM prompt templates
│   │   ├── rag.py           # RAG implementation
│   │   ├── rss_tool.py      # RSS feed integration
│   │   └── tools.py         # Tool implementations
│   ├── app.py               # Main application entry point
│   └── pipeline.py          # Data processing pipeline
├── db/                      # Vector database storage
├── evals/                   # Evaluation datasets and results
└── notebooks/               # Jupyter notebooks for analysis
```
## Environment Setup

The application requires the following environment variables:

```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
Additional configuration options for vector database creation:

```
# Vector Database Creation Configuration
FORCE_RECREATE=False     # Whether to force recreation of the vector store
OUTPUT_DIR=./stats       # Directory to save stats and artifacts
USE_CHUNKING=True        # Whether to split documents into chunks
SHOULD_SAVE_STATS=True   # Whether to save statistics about the documents
```
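A settings loader along these lines (a hypothetical mirror of what `config.py` might do, using the defaults documented above) reads those variables with sensible fallbacks:

```python
import os

def env_bool(name, default):
    """Parse a boolean flag like FORCE_RECREATE=True (case-insensitive)."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

# Illustrative settings dict; the real config.py may structure this differently.
settings = {
    "llm_model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
    "embedding_model": os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
    "vector_storage_path": os.environ.get("VECTOR_STORAGE_PATH", "./db/vector_store_tdg"),
    "chunk_size": int(os.environ.get("CHUNK_SIZE", "1000")),
    "chunk_overlap": int(os.environ.get("CHUNK_OVERLAP", "200")),
    "force_recreate": env_bool("FORCE_RECREATE", False),
    "use_chunking": env_bool("USE_CHUNKING", True),
}
```

Keeping every tunable in the environment means the same Docker image runs unchanged in local, Spaces, and (planned) Azure deployments.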
## Running Locally

### Using Docker

```shell
docker build -t lets-talk .
docker run -p 7860:7860 \
  --env-file ./.env \
  lets-talk
```
### Using Python

```shell
# Install dependencies (uv sync resolves and installs from the existing pyproject.toml)
uv sync

# Run the application
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces (Live Demo)
- **Production**: Azure Container Apps (planned)
## Evaluation

This project includes extensive evaluation capabilities using the Ragas framework:

- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models
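Ragas computes these metrics with LLM judges, but the intuition behind faithfulness — what share of the answer's claims are grounded in the retrieved context — can be shown with a deliberately crude word-overlap toy (not the Ragas algorithm):

```python
def toy_faithfulness(claims, context):
    """Toy faithfulness: fraction of claims where at least half of the
    claim's words also appear in the retrieved context. Ragas instead
    asks an LLM to verify each claim; this is only an illustration."""
    if not claims:
        return 0.0
    context_words = set(context.lower().split())

    def supported(claim):
        words = claim.lower().split()
        return sum(w in context_words for w in words) / len(words) >= 0.5

    return sum(supported(c) for c in claims) / len(claims)

score = toy_faithfulness(
    ["qdrant stores the embeddings", "the moon is made of cheese"],
    "embeddings are stored in qdrant for fast search",
)  # one of the two claims is grounded in the context
```

A low score flags answers that drift beyond the retrieved evidence, which is exactly the failure mode RAG evaluation is meant to catch.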
## Future Enhancements

- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability
## License

This project is available under the MIT License.
## Acknowledgements

- TheDataGuy blog for the content
- Ragas for the evaluation framework
- LangChain for the RAG components
- Chainlit for the chat interface
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference