---
title: Lets Talk
emoji: 🐨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat! 👋

This is a Q&A chatbot powered by posts from the TheDataGuy blog. Ask questions about topics covered on the blog, such as:

- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses the following components; a minimal wiring sketch follows the list:

1. **Snowflake Arctic Embeddings**: to convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
2. **Qdrant Vector Database**: to store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published
3. **GPT-4o-mini**: to generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks, including synthetic data generation and evaluation
4. **LangChain**: to build the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements
5. **Chainlit**: to provide the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components
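
To make the wiring concrete, here is a minimal sketch of how these pieces fit together using LangChain's expression language. It is illustrative rather than the project's actual code: the `blog_posts` collection name, the `k=4` retrieval setting, and the prompt text are assumptions; the real implementation lives in `py-src/lets_talk/rag.py` and `agent.py`.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore

# Embed queries with the base Arctic model (swap in
# mafzaal/thedataguy_arctic_ft for the fine-tuned variant).
embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")

# Open the local Qdrant store; the collection name is an assumption.
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embeddings,
    collection_name="blog_posts",
    path="./db/vector_store_tdg",
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    """Join retrieved chunks into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")

# LCEL pipeline: retrieve -> format -> prompt -> generate -> parse.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How do I evaluate a RAG pipeline with RAGAS?"))
```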

## Technology Stack

### Core Components

- **Vector Database**: Qdrant (embeddings stored via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: for training and testing
- **Vector Store Updates**: an automated pipeline refreshes the index when new blog content is published (see the indexing sketch after this list)
- **Fine-tuned Embeddings**: custom embeddings tuned for technical content
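
The vector store update feature is implemented in `py-src/pipeline.py`. As a rough sketch of the general load-chunk-embed-index approach (the loader choice and `blog_posts` collection name are assumptions, not the project's actual code):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load raw blog posts from data/ (markdown files assumed).
docs = DirectoryLoader("data/", glob="**/*.md", loader_cls=TextLoader).load()

# Chunk sizes mirror the CHUNK_SIZE/CHUNK_OVERLAP defaults shown below.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

# (Re)build the local Qdrant collection from the chunked documents.
QdrantVectorStore.from_documents(
    chunks,
    embedding=HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l"),
    collection_name="blog_posts",
    path="./db/vector_store_tdg",
)
```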

## Project Structure

```
lets-talk/
├── data/                  # Raw blog post content
├── py-src/                # Python source code
│   ├── lets_talk/         # Core application modules
│   │   ├── agent.py       # Agent implementation
│   │   ├── config.py      # Configuration settings
│   │   ├── models.py      # Data models
│   │   ├── prompts.py     # LLM prompt templates
│   │   ├── rag.py         # RAG implementation
│   │   ├── rss_tool.py    # RSS feed integration
│   │   └── tools.py       # Tool implementations
│   ├── app.py             # Main application entry point
│   └── pipeline.py        # Data processing pipeline
├── db/                    # Vector database storage
├── evals/                 # Evaluation datasets and results
└── notebooks/             # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```bash
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

Additional configuration options for vector database creation:

```bash
# Vector Database Creation Configuration
FORCE_RECREATE=False      # Whether to force recreation of the vector store
OUTPUT_DIR=./stats        # Directory to save stats and artifacts
USE_CHUNKING=True         # Whether to split documents into chunks
SHOULD_SAVE_STATS=True    # Whether to save statistics about the documents
```
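
For reference, here is a minimal sketch of how settings like these can be read with sensible defaults, in the spirit of `py-src/lets_talk/config.py` (the actual parsing logic there may differ):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    # Accept the True/False strings shown above, plus 1/0 and yes/no.
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
VECTOR_STORAGE_PATH = os.getenv("VECTOR_STORAGE_PATH", "./db/vector_store_tdg")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
FORCE_RECREATE = env_bool("FORCE_RECREATE", False)
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./stats")
USE_CHUNKING = env_bool("USE_CHUNKING", True)
SHOULD_SAVE_STATS = env_bool("SHOULD_SAVE_STATS", True)
```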

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
    --env-file ./.env \
    lets-talk
```

### Using Python

```bash
# Install dependencies (the project is already initialized, so uv sync is enough)
uv sync

# Run the application
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
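
The entry point `py-src/app.py` registers the RAG chain with Chainlit. As a rough illustration of the Chainlit side only (`rag_answer` is a hypothetical stand-in for the real chain, not the project's function):

```python
import chainlit as cl

def rag_answer(question: str) -> str:
    # Placeholder for the project's RAG chain (see "How it works" above).
    return f"You asked: {question}"

@cl.on_message
async def on_message(message: cl.Message) -> None:
    # Chainlit invokes this handler for every user message in the chat UI.
    await cl.Message(content=rag_answer(message.content)).send()
```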

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces (Live Demo)
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities built on the Ragas framework; a minimal usage sketch follows this list:

- **Synthetic Data Generation**: for creating test datasets
- **Metric Evaluation**: measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: comparing different embedding models
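
As a minimal sketch of what a Ragas run can look like (the exact API varies across Ragas versions, and the sample row below is a placeholder rather than project data):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One hand-written example in the classic Ragas dataset schema.
eval_ds = Dataset.from_dict({
    "question": ["What does RAGAS measure?"],
    "answer": ["RAGAS scores RAG pipelines on metrics such as faithfulness."],
    "contexts": [["RAGAS is a framework for evaluating RAG pipelines."]],
    "ground_truth": ["RAGAS evaluates RAG pipelines end to end."],
})

# Requires OPENAI_API_KEY, since these metrics are LLM-judged.
result = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(result)
```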

## Future Enhancements

- **Agentic Reasoning**: adding more sophisticated agent capabilities
- **Web UI Integration**: a custom Svelte component for the blog
- **CI/CD**: a GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference