|
|
--- |
|
|
title: Lets Talk |
|
|
emoji: 🎨
|
|
colorFrom: green |
|
|
colorTo: blue |
|
|
sdk: docker |
|
|
pinned: false |
|
|
--- |
|
|
|
|
|
# Welcome to TheDataGuy Chat! 👋
|
|
|
|
|
This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered on the blog, such as:
|
|
|
|
|
- RAGAS and RAG evaluation |
|
|
- Building research agents |
|
|
- Metric-driven development |
|
|
- Data science best practices |
|
|
|
|
|
## How it works |
|
|
|
|
|
Under the hood, this application uses: |
|
|
|
|
|
1. **Snowflake Arctic Embeddings**: To convert text into vector representations |
|
|
- Base model: `Snowflake/snowflake-arctic-embed-l` |
|
|
- Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs) |
|
|
|
|
|
2. **Qdrant Vector Database**: To store and search for similar content |
|
|
- Efficiently indexes blog post content for fast semantic search |
|
|
- Supports real-time updates when new blog posts are published |
|
|
|
|
|
3. **GPT-4o-mini**: To generate helpful responses based on retrieved content |
|
|
- Primary model: OpenAI `gpt-4o-mini` for production inference |
|
|
- Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation |
|
|
|
|
|
4. **LangChain**: For building the RAG workflow |
|
|
- Orchestrates the retrieval and generation components |
|
|
- Provides flexible components for LLM application development |
|
|
- Structured for easy maintenance and future enhancements |
|
|
|
|
|
5. **Chainlit**: For the chat interface |
|
|
- Offers an interactive UI with message threading |
|
|
- Supports file uploads and custom components |
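The retrieve-then-generate flow described above can be sketched end to end. This toy version replaces the Arctic embedding model, Qdrant, and GPT-4o-mini with simple stand-ins to show only the data flow; none of the names or logic below come from the project's actual code:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for Snowflake Arctic embeddings: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for the Qdrant store: a list of (text, vector) pairs.
docs = [
    "RAGAS provides metrics for evaluating RAG pipelines",
    "Chainlit builds interactive chat interfaces in Python",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def answer(query: str) -> str:
    # Stand-in for the GPT-4o-mini call: stitch retrieved context
    # into a prompt-shaped response.
    context = "\n".join(retrieve(query))
    return f"Based on the blog: {context}"

print(answer("How do I evaluate a RAG pipeline?"))
```

In the real application, LangChain orchestrates these steps and Chainlit streams the result to the chat UI.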
|
|
|
|
|
## Technology Stack |
|
|
|
|
|
### Core Components |
|
|
- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`) |
|
|
- **Embedding Model**: Snowflake Arctic Embeddings |
|
|
- **LLM**: OpenAI GPT-4o-mini |
|
|
- **Framework**: LangChain + Chainlit |
|
|
- **Development Language**: Python 3.13 |
|
|
|
|
|
### Advanced Features |
|
|
- **Evaluation**: Ragas metrics for evaluating RAG performance: |
|
|
- Faithfulness |
|
|
- Context Relevancy |
|
|
- Answer Relevancy |
|
|
- Topic Adherence |
|
|
- **Synthetic Data Generation**: For training and testing |
|
|
- **Vector Store Updates**: Automated pipeline to update when new blog content is published |
|
|
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content |
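The automated vector-store update implies some form of change detection. One common approach (an assumption for illustration, not necessarily what `pipeline.py` does) is to hash each post's content and re-embed only posts that are new or changed:

```python
import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of a post's body.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def posts_to_reindex(posts: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return slugs of posts that are new or changed since the last run.

    `posts` maps slug -> current body; `seen` maps slug -> hash from
    the previous pipeline run. Both names are hypothetical.
    """
    return [
        slug for slug, body in posts.items()
        if seen.get(slug) != content_hash(body)
    ]
```

Unchanged posts keep their existing vectors, so incremental updates stay cheap even as the blog grows.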
|
|
|
|
|
## Project Structure |
|
|
|
|
|
```
lets-talk/
├── data/               # Raw blog post content
├── py-src/             # Python source code
│   ├── lets_talk/      # Core application modules
│   │   ├── agent.py    # Agent implementation
│   │   ├── config.py   # Configuration settings
│   │   ├── models.py   # Data models
│   │   ├── prompts.py  # LLM prompt templates
│   │   ├── rag.py      # RAG implementation
│   │   ├── rss_tool.py # RSS feed integration
│   │   └── tools.py    # Tool implementations
│   ├── app.py          # Main application entry point
│   └── pipeline.py     # Data processing pipeline
├── db/                 # Vector database storage
├── evals/              # Evaluation datasets and results
└── notebooks/          # Jupyter notebooks for analysis
```
|
|
|
|
|
## Environment Setup |
|
|
|
|
|
The application requires the following environment variables: |
|
|
|
|
|
``` |
|
|
OPENAI_API_KEY=your_openai_api_key |
|
|
VECTOR_STORAGE_PATH=./db/vector_store_tdg |
|
|
LLM_MODEL=gpt-4o-mini |
|
|
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l |
|
|
CHUNK_SIZE=1000 |
|
|
CHUNK_OVERLAP=200 |
|
|
``` |
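`CHUNK_SIZE` and `CHUNK_OVERLAP` control how blog posts are split before embedding. A minimal character-based splitter illustrating those two parameters (a stand-in only, not the project's actual splitter):

```python
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Sliding window: consecutive chunks share `overlap` characters so
    # context isn't lost at chunk boundaries. Defaults mirror the
    # CHUNK_SIZE and CHUNK_OVERLAP values shown above.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

A 2,500-character post, for example, yields three chunks, with the last 200 characters of each chunk repeated at the start of the next.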
|
|
|
|
|
Additional configuration options for vector database creation: |
|
|
|
|
|
``` |
|
|
# Vector Database Creation Configuration |
|
|
FORCE_RECREATE=False # Whether to force recreation of the vector store |
|
|
OUTPUT_DIR=./stats # Directory to save stats and artifacts |
|
|
USE_CHUNKING=True # Whether to split documents into chunks |
|
|
SHOULD_SAVE_STATS=True # Whether to save statistics about the documents |
|
|
``` |
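Since environment variables arrive as strings, flags like `FORCE_RECREATE=False` need explicit parsing (`bool("False")` is `True` in Python). A small helper of the kind `config.py` might use (the helper name and exact parsing rules are assumptions):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Parse "True"/"False"-style env vars explicitly; bool() on a
    # non-empty string would always return True.
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

FORCE_RECREATE = env_flag("FORCE_RECREATE", default=False)
USE_CHUNKING = env_flag("USE_CHUNKING", default=True)
```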
|
|
|
|
|
## Running Locally |
|
|
|
|
|
### Using Docker |
|
|
|
|
|
```bash |
|
|
docker build -t lets-talk . |
|
|
docker run -p 7860:7860 \ |
|
|
--env-file ./.env \ |
|
|
lets-talk |
|
|
``` |
|
|
|
|
|
### Using Python |
|
|
|
|
|
```bash |
|
|
# Install dependencies |
|
|
uv sync
|
|
|
|
|
# Run the application |
|
|
chainlit run py-src/app.py --host 0.0.0.0 --port 7860 |
|
|
``` |
|
|
|
|
|
## Deployment |
|
|
|
|
|
The application is designed to be deployed on: |
|
|
|
|
|
- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk)) |
|
|
- **Production**: Azure Container Apps (planned) |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
This project includes extensive evaluation capabilities using the Ragas framework: |
|
|
|
|
|
- **Synthetic Data Generation**: For creating test datasets |
|
|
- **Metric Evaluation**: Measuring faithfulness, relevance, and more |
|
|
- **Fine-tuning Analysis**: Comparing different embedding models |
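As a rough intuition for what a metric like faithfulness measures, here is a toy token-overlap version. This is illustrative only: the real Ragas metric extracts individual claims from the answer with an LLM judge and verifies each against the retrieved context.

```python
def toy_faithfulness(answer: str, context: str) -> float:
    # Fraction of answer tokens that also appear in the context.
    # Ragas does this with LLM-extracted claims, not raw tokens.
    ans = set(answer.lower().split())
    ctx = set(context.lower().split())
    return len(ans & ctx) / len(ans) if ans else 0.0

score = toy_faithfulness(
    "ragas evaluates rag pipelines",
    "ragas is a framework that evaluates rag pipelines with llm judges",
)
```

A score near 1.0 suggests the answer stays grounded in the retrieved content; a low score flags potential hallucination.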
|
|
|
|
|
## Future Enhancements |
|
|
|
|
|
- **Agentic Reasoning**: Adding more sophisticated agent capabilities |
|
|
- **Web UI Integration**: Custom Svelte component for the blog |
|
|
- **CI/CD**: GitHub Actions workflow for automated deployment |
|
|
- **Monitoring**: LangSmith integration for observability |
|
|
|
|
|
## License |
|
|
|
|
|
This project is available under the MIT License. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content |
|
|
- [Ragas](https://docs.ragas.io/) for evaluation framework |
|
|
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components |
|
|
- [Chainlit](https://docs.chainlit.io/) for the chat interface |
|
|
|
|
|
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference |
|
|
|