# gprMax RAG Database System
## Overview
This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides efficient vector search over the documentation, enabling intelligent context retrieval for the chatbot.
## Architecture
### Components
1. **Document Processor**: Extracts and chunks documentation from gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities
### Key Features
- Automatic documentation extraction from gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility
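The score thresholding mentioned above amounts to a post-filter on search hits. A minimal sketch (the result shape and `score` key here are illustrative assumptions, not the retriever's actual API):

```python
# Hypothetical post-filter: drop hits whose similarity score falls below a cutoff.
# The dict shape and "score" key are assumptions for illustration only.
def filter_by_score(results, threshold=0.5):
    return [r for r in results if r["score"] >= threshold]
```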
## Installation
The database is **automatically generated** on first startup of the application. No manual installation required!
## Automatic Generation
When the app starts:
1. Checks if database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones gprMax repository and processes documentation
4. Creates ChromaDB with default embeddings (all-MiniLM-L6-v2)
5. Ready to use - this only happens once!
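The first-run check above can be sketched roughly as follows (the function name and generation command are assumptions based on the listed steps, not the app's actual code):

```python
import subprocess
import sys
from pathlib import Path

def ensure_database(db_path="rag-db/chroma_db"):
    """Return True if the database already exists; otherwise build it once."""
    if Path(db_path).exists():
        return True
    # First run: clone gprMax, process the docs, and build the ChromaDB store.
    subprocess.run([sys.executable, "rag-db/generate_db.py"], check=True)
    return False
```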
## Manual Generation (Optional)
If you need to manually regenerate the database:
```bash
cd rag-db
python generate_db.py --recreate
```
Custom settings:
```bash
python generate_db.py \
    --db-path ./custom_db \
    --temp-dir ./temp \
    --device cuda \
    --recreate
```
## Using the Retriever in an Application
```python
from rag_db.retriever import create_retriever

# Initialize retriever
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```
## Testing the Retriever
```bash
# Test with default query
python retriever.py
# Test with custom query
python retriever.py "How to model soil layers?"
```
## Database Schema
### Document Structure
```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```
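The `id` field is described only as a unique hash. One plausible construction (purely illustrative; the real generator may derive it differently) hashes the source path, chunk index, and chunk text together:

```python
import hashlib

def chunk_id(source, chunk_index, text):
    """Stable, collision-resistant ID for a chunk (hypothetical scheme)."""
    key = f"{source}:{chunk_index}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```

Deriving the ID from the chunk's content and position makes regeneration deterministic: rebuilding an unchanged document yields the same IDs.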
### Metadata File
Generated `metadata.json` contains:
```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```
## Configuration
### Chunking Parameters
- `CHUNK_SIZE`: 1000 characters (optimal for context windows)
- `CHUNK_OVERLAP`: 200 characters (ensures continuity)
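A minimal sliding-window chunker matching these parameters might look like this (a sketch; the actual implementation in `generate_db.py` may split on sentence or section boundaries instead):

```python
CHUNK_SIZE = 1000    # characters per chunk
CHUNK_OVERLAP = 200  # characters shared between consecutive chunks

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping character windows with start/end offsets."""
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end],
                       "char_start": start, "char_end": end})
        start += step
    return chunks
```

The 200-character overlap means each window starts 800 characters after the previous one, so a sentence cut at a chunk boundary still appears whole in the next chunk.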
### Embedding Model
- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings); note that the automatic first-run build uses ChromaDB's default embeddings (`all-MiniLM-L6-v2`)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)
### Database Settings
- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance Metric: Cosine similarity
## Maintenance
### Regular Updates
Run monthly, or whenever the gprMax documentation is updated:
```bash
# This will pull latest docs and update database
python generate_db.py
```
### Database Backup
```bash
# Backup database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```
### Performance Tuning
- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`
## Integration with Main App
The RAG system integrates with the main Gradio app:
1. Import retriever in `app.py`
2. Use retriever to augment prompts with context
3. Display source references in UI
Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```
## Troubleshooting
### Common Issues
1. **Database not found**
- Run `python generate_db.py` first
- Check `--db-path` parameter
2. **Out of memory**
- Use smaller batch sizes
- Use CPU instead of GPU
- Reduce chunk size
3. **Slow generation**
- Use GPU with `--device cuda`
- Reduce repository depth with shallow clone
- Use pre-generated database
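The shallow-clone tip above translates to the standard git flag (shown against the upstream gprMax repository; the generator may already do this internally):

```shell
# Fetch only the latest commit instead of the full history
git clone --depth 1 https://github.com/gprMax/gprMax.git
```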
### Logs
Check generation logs for detailed information:
```bash
python generate_db.py 2>&1 | tee generation.log
```
## Future Enhancements
1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax
## License
Same as the parent project.