---
title: Lets Talk
emoji: 🐨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat! 👋

This is a Q&A chatbot powered by posts from the TheDataGuy blog. Ask questions about topics covered on the blog, such as:

- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses the following components; a minimal wiring sketch follows the list:

1. **Snowflake Arctic Embeddings**: to convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
2. **Qdrant Vector Database**: to store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published
3. **GPT-4o-mini**: to generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks, including synthetic data generation and evaluation
4. **LangChain**: to build the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements
5. **Chainlit**: to provide the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components
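
To make the wiring concrete, here is a minimal sketch of how these pieces fit together using LangChain's expression language. It is illustrative rather than the project's actual code: the `blog_posts` collection name, the `k=4` retrieval setting, and the prompt text are assumptions; the real implementation lives in `py-src/lets_talk/rag.py` and `agent.py`.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore

# Embed queries with the base Arctic model (swap in
# mafzaal/thedataguy_arctic_ft for the fine-tuned variant).
embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")

# Open the local Qdrant store; the collection name is an assumption.
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embeddings,
    collection_name="blog_posts",
    path="./db/vector_store_tdg",
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    """Join retrieved chunks into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")

# LCEL pipeline: retrieve -> format -> prompt -> generate -> parse.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How do I evaluate a RAG pipeline with RAGAS?"))
```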

## Technology Stack

### Core Components

- **Vector Database**: Qdrant (embeddings stored via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: for training and testing
- **Vector Store Updates**: an automated pipeline refreshes the index when new blog content is published (see the indexing sketch after this list)
- **Fine-tuned Embeddings**: custom embeddings tuned for technical content
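
The vector store update feature is implemented in `py-src/pipeline.py`. As a rough sketch of the general load-chunk-embed-index approach (the loader choice and `blog_posts` collection name are assumptions, not the project's actual code):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load raw blog posts from data/ (markdown files assumed).
docs = DirectoryLoader("data/", glob="**/*.md", loader_cls=TextLoader).load()

# Chunk sizes mirror the CHUNK_SIZE/CHUNK_OVERLAP defaults shown below.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

# (Re)build the local Qdrant collection from the chunked documents.
QdrantVectorStore.from_documents(
    chunks,
    embedding=HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l"),
    collection_name="blog_posts",
    path="./db/vector_store_tdg",
)
```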

## Project Structure

```
lets-talk/
├── data/                  # Raw blog post content
├── py-src/                # Python source code
│   ├── lets_talk/         # Core application modules
│   │   ├── agent.py       # Agent implementation
│   │   ├── config.py      # Configuration settings
│   │   ├── models.py      # Data models
│   │   ├── prompts.py     # LLM prompt templates
│   │   ├── rag.py         # RAG implementation
│   │   ├── rss_tool.py    # RSS feed integration
│   │   └── tools.py       # Tool implementations
│   ├── app.py             # Main application entry point
│   └── pipeline.py        # Data processing pipeline
├── db/                    # Vector database storage
├── evals/                 # Evaluation datasets and results
└── notebooks/             # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```bash
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

Additional configuration options for vector database creation:

```bash
# Vector Database Creation Configuration
FORCE_RECREATE=False      # Whether to force recreation of the vector store
OUTPUT_DIR=./stats        # Directory to save stats and artifacts
USE_CHUNKING=True         # Whether to split documents into chunks
SHOULD_SAVE_STATS=True    # Whether to save statistics about the documents
```
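
For reference, here is a minimal sketch of how settings like these can be read with sensible defaults, in the spirit of `py-src/lets_talk/config.py` (the actual parsing logic there may differ):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    # Accept the True/False strings shown above, plus 1/0 and yes/no.
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
VECTOR_STORAGE_PATH = os.getenv("VECTOR_STORAGE_PATH", "./db/vector_store_tdg")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
FORCE_RECREATE = env_bool("FORCE_RECREATE", False)
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./stats")
USE_CHUNKING = env_bool("USE_CHUNKING", True)
SHOULD_SAVE_STATS = env_bool("SHOULD_SAVE_STATS", True)
```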

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
    --env-file ./.env \
    lets-talk
```

### Using Python

```bash
# Install dependencies (the project is already initialized, so uv sync is enough)
uv sync

# Run the application
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```
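
The entry point `py-src/app.py` registers the RAG chain with Chainlit. As a rough illustration of the Chainlit side only (`rag_answer` is a hypothetical stand-in for the real chain, not the project's function):

```python
import chainlit as cl

def rag_answer(question: str) -> str:
    # Placeholder for the project's RAG chain (see "How it works" above).
    return f"You asked: {question}"

@cl.on_message
async def on_message(message: cl.Message) -> None:
    # Chainlit invokes this handler for every user message in the chat UI.
    await cl.Message(content=rag_answer(message.content)).send()
```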

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces (Live Demo)
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities built on the Ragas framework; a minimal usage sketch follows this list:

- **Synthetic Data Generation**: for creating test datasets
- **Metric Evaluation**: measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: comparing different embedding models
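
As a minimal sketch of what a Ragas run can look like (the exact API varies across Ragas versions, and the sample row below is a placeholder rather than project data):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One hand-written example in the classic Ragas dataset schema.
eval_ds = Dataset.from_dict({
    "question": ["What does RAGAS measure?"],
    "answer": ["RAGAS scores RAG pipelines on metrics such as faithfulness."],
    "contexts": [["RAGAS is a framework for evaluating RAG pipelines."]],
    "ground_truth": ["RAGAS evaluates RAG pipelines end to end."],
})

# Requires OPENAI_API_KEY, since these metrics are LLM-judged.
result = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(result)
```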

## Future Enhancements

- **Agentic Reasoning**: adding more sophisticated agent capabilities
- **Web UI Integration**: a custom Svelte component for the blog
- **CI/CD**: a GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference